Platform-based Design: RISC Instruction Set Implementation Alternatives
== using MIPS as example ==
TU/e 5kk70
Henk Corporaal, Bart Mesman
H. Corporaal, ACA

Topics
- MIPS ISA: Instruction Set Architecture
- MIPS single-cycle implementation
- MIPS multi-cycle implementation
- MIPS pipelined implementation
- Pipeline hazards
- Recap of RISC principles
- Other architectures
- Based on the book
- Many slides; I'll go quickly and skip some
H. Corporaal, ACA 5MD00

Main Types of Instructions
- Arithmetic
  - Integer
  - Floating Point
- Memory access instructions
  - Load & Store
- Control flow
  - Jump
  - Conditional Branch
  - Call & Return

MIPS arithmetic
- Most instructions have 3 operands
- Operand order is fixed (destination first)
Example:
    C code:    A = B + C
    MIPS code: add $s0, $s1, $s2
($s0, $s1 and $s2 are associated with variables by the compiler)

MIPS arithmetic
    C code:    A = B + C + D;
               E = F - A;
    MIPS code: add $t0, $s1, $s2
               add $s0, $t0, $s3
               sub $s4, $s5, $s0
- Operands must be registers; only 32 registers are provided
- Design Principle: smaller is faster. Why?

Registers vs. Memory
- Arithmetic instruction operands must be registers; only 32 registers are provided
- Compiler associates variables with registers
- What about programs with lots of variables?
[Figure: CPU with register file, connected to Memory and IO]

Register allocation
- Compiler tries to keep as many variables in registers as possible
- Some variables cannot be allocated to registers:
  - large arrays (too few registers)
  - aliased variables (variables accessible through pointers in C)
  - dynamically allocated variables
    - heap
    - stack
- Compiler may run out of registers => spilling

Memory Organization
- Viewed as a large, single-dimension array, with an address
- A memory address is an index into the array
- "Byte addressing" means that successive addresses are one byte apart
[Figure: memory as an array of bytes; addresses 0, 1, 2, ..., 6, ... each hold 8 bits of data]

Memory Organization
- Bytes are nice, but most data items use larger "words"
- For MIPS, a word is 32 bits or 4 bytes
- Registers hold 32 bits of data
- 2^32 bytes with byte addresses from 0 to 2^32 - 1
- 2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
[Figure: memory as an array of words; byte addresses 0, 4, 8, 12, ... each hold 32 bits of data]

Memory layout: Alignment
- Words are aligned
- What are the 2 least significant bits of a word address?
[Figure: memory rows at byte addresses 0, 4, 8, ..., 24 with bit positions 31 down to 0; one word is aligned, the others are not]
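The alignment question can be checked with a few lines of Python. This is only an illustrative sketch; the helper names `is_word_aligned` and `word_index` are made up for this example and do not come from the slides.

```python
WORD_BYTES = 4  # a MIPS word is 32 bits = 4 bytes

def is_word_aligned(byte_address: int) -> bool:
    """A byte address is word-aligned when its two least significant bits are 0."""
    return byte_address & 0b11 == 0

def word_index(byte_address: int) -> int:
    """Index of the word that contains this byte (byte address divided by 4)."""
    return byte_address >> 2

# Aligned byte addresses in the range 0..12
aligned = [a for a in range(0, 13) if is_word_aligned(a)]
```

With 4-byte words, element 8 of a word array sits at byte offset 8 * 4 = 32 from the array base, which is why offsets such as 32 appear in load and store instructions.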

Instructions: load and store
Example:
    C code:    A[8] = h + A[8];
    MIPS code: lw  $t0, 32($s3)
               add $t0, $s2, $t0
               sw  $t0, 32($s3)
- Store word operation has no destination (reg) operand
- Remember: arithmetic operands are registers, not memory!

Let's translate some C-code
- Can we figure out the code?
      swap(int v[], int k)
      { int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
      }
      swap: muli $2, $5, 4
            add  $2, $4, $2
            lw   $15, 0($2)
            lw   $16, 4($2)
            sw   $16, 0($2)
            sw   $15, 4($2)
            jr   $31
Explanation:
    index k: $5
    base address of v: $4
    address of v[k] is $4 + 4 * $5

Machine Language
- Instructions, like registers and words of data, are also 32 bits long
  - Example: add $t0, $s1, $s2
  - Registers have numbers: $t0=8, $s1=17, $s2=18
- Instruction Format:
      op      rs      rt      rd      shamt   funct
      000000  10001   10010   01000   00000   100000
      6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
- Can you guess what the field names stand for?
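As a rough illustration of the R-type layout, the six fields can be packed into one 32-bit word with shifts. The function name `encode_r_type` is hypothetical; the field widths and the rd value 8 (binary 01000, as in the format row) are taken from the slide.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int, op: int = 0) -> int:
    """Pack the R-type fields (6 + 5 + 5 + 5 + 5 + 6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2: rs = $s1 = 17, rt = $s2 = 18, rd = $t0 = 8, funct = 100000 (add)
word = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=0b100000)
bits = format(word, "032b")  # the same word as a 32-character bit string
```

Note that the destination register, written first in assembly, ends up in the third register field of the machine word.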

Machine Language
- Consider the load-word and store-word instructions
  - What would the regularity principle have us do?
  - New principle: Good design demands a compromise
- Introduce a new type of instruction format
  - I-type for data transfer instructions
  - other format was R-type for register
- Example: lw $t0, 32($s2)
      35    18    8     32
      op    rs    rt    16 bit number
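A minimal sketch of I-type encoding, assuming the standard MIPS register numbering in which $t0 is register 8 and $s2 is register 18. The function name `encode_i_type` is made up for illustration.

```python
def encode_i_type(op: int, rs: int, rt: int, imm: int) -> int:
    """Pack the I-type fields (6 + 5 + 5 + 16 bits); imm is kept as 16-bit two's complement."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# lw $t0, 32($s2): op = 35, rs = $s2 = 18 (base), rt = $t0 = 8 (destination), offset = 32
word = encode_i_type(op=35, rs=18, rt=8, imm=32)

# A negative offset occupies the same 16 bits in two's complement form
negative = encode_i_type(op=35, rs=18, rt=8, imm=-4)
```

The compromise is visible in the code: the first three fields keep the same positions as in the R-type format, and only the lower 16 bits are reinterpreted.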

Stored Program Concept
[Figure: CPU connected to a memory that holds the OS, Program 1 (code, global data, stack, heap, unused) and Program 2 (unused)]

Control
- Decision making instructions
  - alter the control flow,
  - i.e., change the "next" instruction to be executed
- MIPS conditional branch instructions:
      bne $t0, $t1, Label
      beq $t0, $t1, Label
- Example: if (i==j) h = i + j;
             bne $s0, $s1, Label
             add $s3, $s0, $s1
      Label: ...

Control
- MIPS unconditional branch instructions:
      j label
- Example:
      if (i!=j)              beq $s4, $s5, Lab1
          h=i+j;             add $s3, $s4, $s5
      else                   j   Lab2
          h=i-j;       Lab1: sub $s3, $s4, $s5
                       Lab2: ...

So far:
- Instruction            Meaning
  add $s1, $s2, $s3      $s1 = $s2 + $s3
  sub $s1, $s2, $s3      $s1 = $s2 - $s3
  lw  $s1, 100($s2)      $s1 = Memory[$s2+100]
  sw  $s1, 100($s2)      Memory[$s2+100] = $s1
  bne $s4, $s5, L        Next instr. is at Label if $s4 ≠ $s5
  beq $s4, $s5, L        Next instr. is at Label if $s4 = $s5
  j   Label              Next instr. is at Label
- Formats:
  R:  op | rs | rt | rd | shamt | funct
  I:  op | rs | rt | 16 bit address
  J:  op | 26 bit address

Control Flow
- We have: beq, bne. What about Branch-if-less-than?
- New instruction:
      slt $t0, $s1, $s2
  meaning:
      if $s1 < $s2 then $t0 = 1 else $t0 = 0
- Can use this instruction to build "blt $s1, $s2, Label"; can now build general control structures
- Note that the assembler needs a register to do this; use conventions for registers

MIPS compiler/assembler Conventions

Constants
- Small constants are used quite frequently (50% of operands)
  e.g., A = A + 5; B = B + 1; C = C - 18;
- Solutions? Why not?
  - put 'typical constants' in memory and load them
  - create hard-wired registers (like $zero) for constants like one
  - or …….
- MIPS Instructions:
      addi $29, $29, 4
      slti $8, $18, 10
      andi $29, $29, 6
      ori  $29, $29, 4

How about larger constants?
- We'd like to be able to load a 32 bit constant into a register
- Must use two instructions; new "load upper immediate" instruction:
      lui $t0, 1010101010101010
      $t0 = 1010101010101010 0000000000000000    (lower half filled with zeros)
- Then must get the lower order bits right, i.e.,
      ori $t0, $t0, 1010101010101010

            1010101010101010 0000000000000000
      ori   0000000000000000 1010101010101010
            ---------------------------------
            1010101010101010 1010101010101010
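The two-instruction sequence can be mimicked on plain integers. This is a sketch of the arithmetic only, not of a simulator; `lui`, `ori` and `load_constant` are illustrative Python helpers that model the register as a 32-bit value.

```python
MASK32 = 0xFFFFFFFF

def lui(imm16: int) -> int:
    """Load upper immediate: place imm16 in the upper half, zeros in the lower half."""
    return ((imm16 & 0xFFFF) << 16) & MASK32

def ori(reg: int, imm16: int) -> int:
    """OR immediate: the 16-bit immediate is zero-extended before the OR."""
    return (reg | (imm16 & 0xFFFF)) & MASK32

def load_constant(value: int) -> int:
    """Build a 32-bit constant the way the lui/ori pair does."""
    upper, lower = (value >> 16) & 0xFFFF, value & 0xFFFF
    return ori(lui(upper), lower)

result = load_constant(0b1010101010101010_1010101010101010)
```

Because `ori` zero-extends its immediate, the upper half set by `lui` is left untouched, which is what makes the pair work.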

Assembly Language vs. Machine Language
- Assembly provides convenient symbolic representation
  - much easier than writing down numbers
  - e.g., destination first
- Machine language is the underlying reality
  - e.g., destination is no longer first
- Assembly can provide 'pseudoinstructions'
  - e.g., "move $t0, $t1" exists only in Assembly
  - would be implemented using "add $t0, $t1, $zero"
- When considering performance you should count real instructions

Addresses in Branches and Jumps
- Instructions:
      bne $t4, $t5, Label    Next instruction is at Label if $t4 ≠ $t5
      beq $t4, $t5, Label    Next instruction is at Label if $t4 = $t5
      j   Label              Next instruction is at Label
- Formats:
      I:  op | rs | rt | 16 bit address
      J:  op | 26 bit address
- Addresses are not 32 bits. How do we handle this with load and store instructions?

Addresses in Branches
- Instructions:
      bne $t4, $t5, Label    Next instruction is at Label if $t4 ≠ $t5
      beq $t4, $t5, Label    Next instruction is at Label if $t4 = $t5
- Formats:
      I:  op | rs | rt | 16 bit address
- Could specify a register (like lw and sw) and add it to address
  - use Instruction Address Register (PC = program counter)
  - most branches are local (principle of locality)
- Jump instructions just use high order bits of PC
  - address boundaries of 256 MB
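A small sketch of how the 16-bit and 26-bit fields turn into 32-bit addresses. It assumes the usual MIPS convention that both fields count words and are taken relative to the already incremented PC (PC + 4); the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's complement number."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_target(pc: int, imm16: int) -> int:
    """PC-relative: (PC + 4) plus the sign-extended word offset times 4."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32

def jump_target(pc: int, addr26: int) -> int:
    """Pseudo-direct: upper 4 bits of PC + 4, then the 26-bit word address times 4."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)

# A beq at address 40 with offset field 7 lands on address 72
target = branch_target(40, 7)
```

The jump keeps the top 4 bits of the PC and supplies the remaining 28, so a jump cannot leave its 2^28-byte = 256 MB region.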

To summarize:

MIPS (3+2) addressing modes overview

MIPS Datapath
- Building a datapath
  - support a subset of the MIPS-I instruction-set
- A single cycle processor datapath
  - all instruction actions in one (long) cycle
- A multi-cycle processor datapath
  - each instruction takes multiple (shorter) cycles
- For details see book (ch 5)

Datapath and Control
[Figure: Control (FSM or microprogramming) steering the Datapath (registers & memories, multiplexors, buses, ALUs)]

The Processor: Datapath & Control
- Simplified MIPS implementation to contain only:
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control flow instructions: beq, j
- Generic Implementation:
  - use the program counter (PC) to supply instruction address
  - get the instruction from memory
  - read registers
  - use the instruction to decide exactly what to do
- All instructions use the ALU after reading the registers. Why?
  - memory-reference?

More Implementation Details
- Abstract / Simplified View:
  [Figure: the PC addresses the instruction memory; the instruction supplies register numbers to the register file; register data feed the ALU; the ALU result addresses the data memory; data returns to the registers]
- Two types of functional units:
  - elements that operate on data values (combinational)
  - elements that contain state (sequential)

State Elements
- Unclocked vs. Clocked
- Clocks used in synchronous logic
  - when should an element that contains state be updated?
[Figure: clock waveform showing cycle time, rising edge and falling edge]

An unclocked state element
- The set-reset (SR) latch
  - output depends on present inputs and also on past inputs
[Figure: SR latch with inputs R and S and outputs Q and its complement]
Truth table:
    R  S  Q
    0  0  Q
    0  1  1    (state change)
    1  0  0    (state change)
    1  1  ?

Latches and Flip-flops
- Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
- Change of state (value) is based on the clock
  - Latches: whenever the inputs change, and the clock is asserted
  - Flip-flop: state changes only on a clock edge (edge-triggered methodology)
- A clocking methodology defines when signals can be read and written: we wouldn't want to read a signal at the same time it was being written

D-latch
- Two inputs:
  - the data value to be stored (D)
  - the clock signal (C) indicating when to read & store D
- Two outputs:
  - the value of the internal state (Q) and its complement

D flip-flop
- Output changes only on the clock edge
[Figure: two D latches in series driven by clock C, with input D and outputs Q and its complement]

Our Implementation
- An edge-triggered methodology
- Typical execution:
  - read contents of some state elements,
  - send values through some combinational logic,
  - write results to one or more state elements
[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]

Register File
- 3-ported: one write, two read ports
[Figure: register file with inputs Read reg. #1, Read reg. #2, Write reg. #, Write data and Write; outputs Read data 1 and Read data 2]

Register file: read ports
- Register file built using D flip-flops
[Figure: implementation of the read ports; Register 0 ... Register n feed two multiplexors, selected by Read register number 1 and Read register number 2, producing Read data 1 and Read data 2]

Register file: write port
- Note: we still use the real clock to determine when to write
[Figure: the Register number passes through an n-to-1 decoder; each decoder output, combined with Write, drives the clock input C of one register (Register 0 ... Register n); Register data drives every D input]

Building the Datapath
- Use multiplexors to stitch them together
[Figure: single-cycle datapath with PC, instruction memory, register file, sign extend (16 to 32 bits), shift left 2, ALU, adders and data memory; control signals PCSrc, RegWrite, ALUSrc, ALU operation (3 bits), MemWrite, MemRead and MemtoReg]

Our Simple Control Structure
- All of the logic is combinational
- We wait for everything to settle down, and the right thing to be done
  - ALU might not produce "right answer" right away
  - we use write signals along with clock to determine when to write
- Cycle time determined by length of the longest path
[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]
We are ignoring some details like setup and hold times!

Control
- Selecting the operations to perform (ALU, read/write, etc.)
- Controlling the flow of data (multiplexor inputs)
- Information comes from the 32 bits of the instruction
- Example: add $8, $17, $18
  Instruction Format:
      000000  10001  10010  01000  00000  100000
      op      rs     rt     rd     shamt  funct
- ALU's operation based on instruction type and function code

Control: 2 level implementation
[Figure: instruction register bits 31-26 (opcode, 6 bits) feed Control 2, which produces the 2-bit ALUop; ALUop and bits 5-0 (funct, 6 bits) feed Control 1, which produces the 3-bit ALUcontrol for the ALU]
    ALUop                          ALUcontrol
    00: lw, sw                     000: and
    01: beq                        001: or
    10: add, sub, and, or, slt     010: add
                                   110: sub
                                   111: set on less than

Datapath with Control
[Figure: single-cycle datapath with the main Control unit decoding Instruction [31-26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; ALU control decodes Instruction [5-0]; Instruction [25-21], [20-16] and [15-11] select registers; Instruction [15-0] is sign-extended from 16 to 32 bits]

ALU Control 1
- What should the ALU do with this instruction?
  example: lw $1, 100($2)
      35    2     1     100
      op    rs    rt    16 bit offset
- ALU control input:
      000  AND
      001  OR
      010  add
      110  subtract
      111  set-on-less-than
- Why is the code for subtract 110 and not 011?

ALU Control 1
- Must describe hardware to compute 3-bit ALU control input
  - given instruction type (ALU Operation class, computed from instruction type):
        00 = lw, sw
        01 = beq
        10 = arithmetic
  - function code for arithmetic
- Describe it using a truth table (can turn into gates)
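The truth table can be sketched as a small lookup. The ALUOp classes and 3-bit control codes are the ones listed on the slides; the funct values are the standard MIPS function codes for add, sub, and, or and slt, and the names `FUNCT_TO_ALU` and `alu_control` are made up for this illustration.

```python
# Standard MIPS funct codes mapped to the 3-bit ALU control input
FUNCT_TO_ALU = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt (set on less than)
}

def alu_control(alu_op: int, funct: int = 0) -> int:
    """Second-level control: ALUOp class plus funct field give the ALU control input."""
    if alu_op == 0b00:   # lw, sw: add base register and offset
        return 0b010
    if alu_op == 0b01:   # beq: subtract and test the Zero output
        return 0b110
    if alu_op == 0b10:   # arithmetic: the funct field decides
        return FUNCT_TO_ALU[funct]
    raise ValueError("unused ALUOp encoding")

lw_control = alu_control(0b00)
slt_control = alu_control(0b10, 0b101010)
```

Only the arithmetic class looks at the funct field; for loads, stores and branches the opcode alone fixes the ALU operation.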

ALU Control 1
- Simple combinational logic (truth tables)

Deriving Control 2 signals
- Input: 6 bits
- Output: 9 control signals
- Determine these control signals directly from the opcodes:
      R-format: 0
      lw:       35
      sw:       43
      beq:      4

Control 2
- PLA example implementation

Single Cycle Implementation
- Calculate cycle time assuming negligible delays except:
  - memory (2 ns), ALU and adders (2 ns), register file access (1 ns)
[Figure: complete single-cycle datapath with control signals PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemWrite, MemRead and MemtoReg]

Single Cycle Implementation
- Memory (2 ns), ALU & adders (2 ns), reg. file access (1 ns)
- Fixed length clock: longest instruction is the 'lw', which requires 8 ns
- Variable clock length (not realistic, just as exercise):
      R-instr: 6 ns
      Load:    8 ns
      Store:   7 ns
      Branch:  5 ns
      Jump:    2 ns
- Average depends on instruction mix (see pg 374)
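The per-class times follow from adding up the units each instruction class passes through. The sketch below reproduces that arithmetic; the unit sequences per class are the usual ones for this datapath, and the instruction mix `MIX` is a made-up example, not a figure from the slides.

```python
MEM, ALU, REG = 2, 2, 1  # delays in ns, as given on the slide

# Units on the critical path of each instruction class
PATHS = {
    "R-type": [MEM, REG, ALU, REG],       # fetch, read regs, ALU, write reg
    "load":   [MEM, REG, ALU, MEM, REG],  # fetch, read regs, address, data access, write reg
    "store":  [MEM, REG, ALU, MEM],       # fetch, read regs, address, data access
    "branch": [MEM, REG, ALU],            # fetch, read regs, compare
    "jump":   [MEM],                      # fetch only
}
latency = {name: sum(units) for name, units in PATHS.items()}

fixed_clock = max(latency.values())  # a fixed clock must fit the slowest class

# Hypothetical instruction mix (fractions sum to 1), for illustration only
MIX = {"R-type": 0.44, "load": 0.24, "store": 0.12, "branch": 0.18, "jump": 0.02}
average = sum(MIX[name] * latency[name] for name in MIX)
```

With this assumed mix the variable-clock average comes out well below the 8 ns fixed clock, which is the gap a multicycle design tries to recover.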

Where we are headed
- Single Cycle Problems:
  - what if we had a more complicated instruction like floating point?
  - wasteful of area: NO sharing of hardware resources
- One Solution:
  - use a "smaller" cycle time
  - have different instructions take different numbers of cycles
  - a "multicycle" datapath:
[Figure: multicycle datapath with PC, a single Memory for instructions and data, Instruction register (IR), Memory data register (MDR), Registers, A and B registers, ALU and ALUOut]

Multicycle Approach
- We will be reusing functional units
  - ALU used to compute address and to increment PC
  - Memory used for instruction and data
- Add registers after every major functional unit
- Our control signals will not be determined solely by instruction
  - e.g., what should the ALU do for a "subtract" instruction?
- We'll use a finite state machine (FSM) or microcode for control

Review: finite state machines
- Finite state machines:
  - a set of states and
  - next state function (determined by current state and the input)
  - output function (determined by current state and possibly input)
[Figure: current state and inputs feed the next-state function, which is clocked into the state register; the output function produces the outputs]
- We'll use a Moore machine (output based only on current state)

Multicycle Approach
- Break up the instructions into steps, each step takes a cycle
  - balance the amount of work to be done
  - restrict each cycle to use only one major functional unit
- At the end of a cycle
  - store values for use in later cycles (easiest thing to do)
  - introduce additional "internal" registers
- Notice: we distinguish
  - processor state: programmer visible registers
  - internal state: programmer invisible registers (like IR, MDR, A, B, and ALUout)

Multicycle Approach
[Figure: multicycle datapath; PC and ALUOut are multiplexed onto the address of a single Memory; the Instruction register supplies Instruction [25-21], [20-16], [15-11] and [15-0]; the Memory data register, A, B and ALUOut hold values between cycles; multiplexors select the ALU inputs among PC, A, B, the constant 4, and the sign-extended (and shifted) immediate]

Multicycle Approach
- Note that previous picture does not include:
  - branch support
  - jump support
  - control lines and logic
- Tclock > max (ALU delay, Memory access, Regfile access)
- See book for complete picture

Five Execution Steps
- Instruction Fetch
- Instruction Decode and Register Fetch
- Execution, Memory Address Computation, or Branch Completion
- Memory Access or R-type instruction completion
- Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Step 1: Instruction Fetch
- Use PC to get instruction and put it in the Instruction Register
- Increment the PC by 4 and put the result back in the PC
- Can be described succinctly using RTL ("Register-Transfer Language"):
      IR = Memory[PC];
      PC = PC + 4;
- Can we figure out the values of the control signals?
- What is the advantage of updating the PC now?

Step 2: Instruction Decode and Register Fetch
- Read registers rs and rt in case we need them
- Compute the branch address in case the instruction is a branch
- Previous two actions are done optimistically!!
- RTL:
      A = Reg[IR[25-21]];
      B = Reg[IR[20-16]];
      ALUOut = PC + (sign-extend(IR[15-0]) << 2);
- We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Step 3 (instruction dependent)
- ALU is performing one of four functions, based on instruction type
- Memory Reference: ALUOut = A + sign-extend(IR[15-0]);
- R-type: ALUOut = A op B;
- Branch: if (A==B) PC = ALUOut;
- Jump: PC = PC[31-28] || (IR[25-0]<<2)

Step 4 (R-type or Memory-access)
- Loads and stores access memory
      MDR = Memory[ALUOut];
  or
      Memory[ALUOut] = B;
- R-type instructions finish
      Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle, on the clock edge

Write-back step
- Memory read completion step
      Reg[IR[20-16]] = MDR;
What about all the other instructions?

Summary execution steps
Steps taken to execute any instruction class

Simple Questions
- How many cycles will it take to execute this code?
          lw  $t2, 0($t3)
          lw  $t3, 4($t3)
          beq $t2, $t3, L1    # assume not taken
          add $t5, $t2, $t3
          sw  $t5, 8($t3)
      L1: ...
- What is going on during the 8th cycle of execution?
- In what cycle does the actual addition of $t2 and $t3 take place?
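These questions can be worked out mechanically once each instruction class has a cycle count. The sketch below assumes the step counts implied by the five execution steps (load 5, store 4, R-type 4, branch 3, jump 3); `CYCLES`, `program` and `locate` are illustrative names.

```python
# Cycles per instruction in the multicycle implementation (assumed from the five steps)
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3, "j": 3}

program = ["lw", "lw", "beq", "add", "sw"]  # the code sequence from the question
total = sum(CYCLES[op] for op in program)

def locate(cycle: int) -> tuple:
    """Return (instruction index, step number) active in the given 1-based cycle."""
    start = 1
    for index, op in enumerate(program):
        if cycle < start + CYCLES[op]:
            return index, cycle - start + 1
        start += CYCLES[op]
    raise ValueError("cycle is past the end of the program")

eighth = locate(8)      # which instruction and step run in cycle 8
add_step3 = 5 + 5 + 3 + 3  # the add reaches its execution step (step 3) in this cycle
```

Cycle 8 falls in step 3 of the second lw (memory address computation), and the addition itself happens in step 3 of the add.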

Implementing the Control
- Value of control signals is dependent upon:
  - what instruction is being executed
  - which step is being performed
- Use the information we have accumulated to specify a finite state machine (FSM)
  - specify the finite state machine graphically, or
  - use microprogramming
- Implementation can be derived from specification

Graphical Specification of FSM
- How many state bits will we need?
    State 0 (Start), Instruction fetch:
        MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
    State 1, Instruction decode / register fetch:
        ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
    State 2, Memory address computation (Op = 'LW' or Op = 'SW'):
        ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
    State 3, Memory access (Op = 'LW'):
        MemRead, IorD = 1
    State 4, Write-back step:
        RegDst = 0, RegWrite, MemtoReg = 1
    State 5, Memory access (Op = 'SW'):
        MemWrite, IorD = 1
    State 6, Execution (Op = R-type):
        ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
    State 7, R-type completion:
        RegDst = 1, RegWrite, MemtoReg = 0
    State 8, Branch completion (Op = 'BEQ'):
        ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
    State 9, Jump completion (Op = 'J'):
        PCWrite, PCSource = 10
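The next-state function of this control FSM can be written down directly. The sketch assumes the usual textbook state numbering (0 fetch, 1 decode, 2 address computation, 3 load access, 4 write-back, 5 store access, 6 R-type execution, 7 R-type completion, 8 branch completion, 9 jump completion); `DISPATCH`, `next_state` and `states_for` are illustrative names.

```python
# Dispatch out of the decode state, keyed by instruction class
DISPATCH = {"lw": 2, "sw": 2, "R-type": 6, "beq": 8, "j": 9}

def next_state(state: int, op: str) -> int:
    """Next-state function: depends on the current state and the opcode class."""
    if state == 0:                 # fetch is always followed by decode
        return 1
    if state == 1:                 # decode dispatches on the opcode
        return DISPATCH[op]
    if state == 2:                 # address computed: load or store access
        return 3 if op == "lw" else 5
    if state in (3, 6):            # load access -> write-back, execute -> completion
        return state + 1
    return 0                       # states 4, 5, 7, 8, 9 return to fetch

def states_for(op: str) -> list:
    """Sequence of states visited while executing one instruction."""
    path, state = [0], next_state(0, op)
    while state != 0:
        path.append(state)
        state = next_state(state, op)
    return path

STATE_BITS = (10 - 1).bit_length()  # bits needed to encode 10 states
```

The length of each path equals the cycle count of that instruction class, and ten states need four state bits.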

Finite State Machine for Control
Implementation:

PLA Implementation (see book)
[Figure: PLA with inputs opcode and current state, and outputs datapath control and next state]
- If I picked a horizontal or vertical line, could you explain it?
- What type of FSM is used? Mealy or Moore?

Pipelined implementation
- Pipelining
- Pipelined datapath
- Pipelined control
- Hazards:
  - Structural
  - Data
  - Control
- Exceptions
- Scheduling
- For details see the book (chapter 6)

Pipelining
Improve performance by increasing instruction throughput
[Figure: program execution order versus time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg. Top (not pipelined): a new instruction starts every 8 ns. Bottom (pipelined): a new instruction starts every 2 ns]

Pipelining
- Ideal speedup = number of stages
- Do we achieve this?
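A quick calculation shows why the ideal speedup is not reached. The sketch uses the numbers from the lw example (8 ns per instruction without pipelining, five stages of 2 ns with pipelining); the function names are made up for illustration.

```python
def nonpipelined_time(n_instructions: int, instr_ns: int = 8) -> int:
    """Each instruction runs to completion before the next one starts."""
    return n_instructions * instr_ns

def pipelined_time(n_instructions: int, stages: int = 5, stage_ns: int = 2) -> int:
    """The first instruction fills the pipeline; each later one finishes one stage time later."""
    return (stages + n_instructions - 1) * stage_ns

three_loads = (nonpipelined_time(3), pipelined_time(3))
speedup_1000 = nonpipelined_time(1000) / pipelined_time(1000)
```

For long instruction sequences the speedup approaches 8 / 2 = 4 rather than 5: the stage time is set by the slowest stage (2 ns), so the 1 ns register stages are padded out.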

Pipelining
- What makes it easy
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- What makes it hard?
  - structural hazards: suppose we had only one memory
  - control hazards: need to worry about branch instructions
  - data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
- We'll talk about modern processors and what really makes it hard:
  - exception handling
  - trying to improve performance with out-of-order execution, etc.

Basic idea: start from single cycle impl.
What do we need to add to actually split the datapath into stages?
    IF:  Instruction fetch
    ID:  Instruction decode / register file read
    EX:  Execute / address calculation
    MEM: Memory access
    WB:  Write back
[Figure: single-cycle datapath divided into the five stages]

Pipelined Datapath
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
[Figure: datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages]

Corrected Datapath
[Figure: corrected pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]

Graphically Representing Pipelines
[Figure: pipeline diagram over clock cycles CC 1 to CC 6 for lw $10, 20($1) followed by sub $11, $2, $3; each instruction occupies IM, Reg, ALU, DM, Reg in successive cycles]
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
  - use this representation to help understand datapaths

Pipeline Control
[Figure: pipelined datapath annotated with the control signals PCSrc, RegWrite, Branch, MemWrite, MemRead, ALUSrc, ALUOp, RegDst and MemtoReg]

Pipeline control
- We have 5 stages. What needs to be controlled in each stage?
  - Instruction Fetch and PC Increment
  - Instruction Decode / Register Fetch
  - Execution
  - Memory Stage
  - Write Back
- How would control be handled in an automobile plant?
  - a fancy control center telling everyone what to do?
  - should we use a finite state machine?

Pipeline Control
Pass control signals along just like the data (compare single cycle control!)

Datapath with Control
[Figure: pipelined datapath in which the Control unit generates the EX, M and WB control fields in ID; the fields travel through the ID/EX, EX/MEM and MEM/WB pipeline registers and are consumed in their stage]

Hazards: problems due to pipelining
Hazard types:
- Structural
  - same resource is needed multiple times in the same cycle
- Data
  - data dependencies limit pipelining
- Control
  - next executed instruction may not be the next specified instruction

Structural hazards
Examples:
- Two accesses to a single ported memory
- Two operations need the same function unit at the same time
- Two operations need the same function unit in successive cycles, but the unit is not pipelined
Solutions:
- stalling
- add more hardware

Structural hazards on MIPS
Q: Do we have structural hazards on our simple MIPS pipeline?
[Figure: instruction stream versus time; successive instructions each pass through IF, ID, EX, MEM, WB]

Data hazards
- Data dependencies:
  - RaW (read-after-write)
  - WaW (write-after-write)
  - WaR (write-after-read)
- Hardware solution:
  - Forwarding / Bypassing
  - Detection logic
  - Stalling
- Software solution: Scheduling

Data dependences
Three types: RaW, WaR and WaW
    add r1, r2, 5     ; r1 := r2+5
    sub r4, r1, r3    ; RaW of r1

    add r1, r2, 5
    sub r2, r4, 1     ; WaR of r2

    add r1, r2, 5
    sub r1, r1, 1     ; WaW of r1

    st  r1, 5(r2)     ; M[r2+5] := r1
    ld  r5, 0(r4)     ; RaW if 5+r2 = 0+r4
WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom!
Problems for your compiler and Pentium! Use register renaming to solve this!
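Register dependences between two instructions can be classified by comparing destination and source registers. In this sketch each instruction is modelled as a hypothetical `(dest, sources)` pair; memory dependences such as the st/ld case are not covered.

```python
def dependences(first, second):
    """Classify register dependences of `second` on the earlier instruction `first`.

    Each instruction is a (dest, sources) pair, e.g. ("r1", ["r2"]).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = []
    if dest1 in srcs2:
        kinds.append("RaW")  # second reads what first writes
    if dest2 in srcs1:
        kinds.append("WaR")  # second overwrites what first reads
    if dest1 == dest2:
        kinds.append("WaW")  # both write the same register
    return kinds

add = ("r1", ["r2"])                          # add r1, r2, 5
raw = dependences(add, ("r4", ["r1", "r3"]))  # sub r4, r1, r3
war = dependences(add, ("r2", ["r4"]))        # sub r2, r4, 1
waw = dependences(add, ("r1", ["r4"]))        # a later write to r1 that does not read it
```

Only RaW is a true data flow; WaR and WaW arise from register reuse, which is why renaming removes them.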

RaW on MIPS pipeline
[Figure: pipeline diagram over CC 1 to CC 9; each instruction occupies IM, Reg, ALU, DM, Reg. Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 from CC 6 to CC 9]
Program execution order (in instructions):
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Forwarding
Use temporary results, don't wait for them to be written
- register file forwarding to handle read/write to same register
- ALU forwarding
[Figure: pipeline diagram over CC 1 to CC 9. Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 afterwards. Value of EX/MEM: -20 in CC 4, X otherwise. Value of MEM/WB: -20 in CC 5, X otherwise]
Program execution order (in instructions):
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
What if this $2 was $13?

Forwarding hardware
[Figure: ALU forwarding circuitry principle; operands come from the register file through buffers into the ALU, and the ALU result goes through a buffer back to the register file]
Note: there are two options
- buf - ALU - bypass - mux - buf
- buf - bypass - mux - ALU - buf

Forwarding
[Figure: pipelined datapath with a Forwarding unit; inputs IF/ID.RegisterRs, IF/ID.RegisterRt and IF/ID.RegisterRd (carried as Rs, Rt, Rd through ID/EX), EX/MEM.RegisterRd and MEM/WB.RegisterRd; outputs ForwardA and ForwardB select the ALU operand multiplexors]

Forwarding check
- Check for matching register-ids:
- For each source-id of operation in the EX-stage, check if there is a matching pending dest-id
Example:
    if (EX/MEM.RegWrite) and
       (EX/MEM.RegisterRd ≠ 0) and
       (EX/MEM.RegisterRd = ID/EX.RegisterRs)
    then ForwardA = 10
Q. How many comparators do we need?
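The check for one ALU operand can be sketched as a function. The EX/MEM condition and the value 10 come from the slide; the MEM/WB case and the encodings 01 and 00 are assumptions that follow the common textbook scheme, and `forward_select` is an illustrative name.

```python
def forward_select(id_ex_rs: int,
                   ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int) -> int:
    """Return the ForwardA mux select for the operand read from register id_ex_rs."""
    # EX hazard: the instruction one ahead is about to write the register we read
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 0b10
    # MEM hazard (assumed encoding 01): the instruction two ahead writes it
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 0b01
    return 0b00  # no forwarding: use the value read from the register file

# sub $2, $1, $3 followed by and $12, $2, $5: the and reads $2 while sub is in EX/MEM
select = forward_select(id_ex_rs=2, ex_mem_regwrite=True, ex_mem_rd=2,
                        mem_wb_regwrite=False, mem_wb_rd=0)
```

Checking the EX/MEM case first matters: if both stages hold a pending write to the same register, the more recent result must win. The test against register 0 avoids forwarding a value "written" to the hard-wired zero register.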

Can't always forward
- Load word can still cause a hazard:
  - an instruction tries to read register r following a load to the same r
- Need a hazard detection unit to "stall" the load instruction
[Figure: pipeline diagram over CC 1 to CC 9; each instruction occupies IM, Reg, ALU, DM, Reg]
Program execution order (in instructions):
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

Stalling
We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram over CC 1 to CC 10 with a bubble inserted after the lw]
Program execution order (in instructions):
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7
In CC 4 the ALU is not used; Reg and IM are redone

Hazard Detection Unit
[Figure: pipelined datapath with a Hazard detection unit next to the Forwarding unit; inputs ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt; outputs PCWrite, IF/IDWrite and a multiplexor that zeroes the control signals entering ID/EX]
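The load-use condition can be sketched from the signal names in the figure. The exact rule below (stall when the instruction in EX is a load whose destination matches a source of the instruction in ID) is the common textbook formulation and is stated here as an assumption; `must_stall` is an illustrative name.

```python
def must_stall(id_ex_memread: bool, id_ex_rt: int, if_id_rs: int, if_id_rt: int) -> bool:
    """Load-use hazard: the instruction in EX is a load and the one in ID reads its result."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID: one bubble is needed
stall_and = must_stall(id_ex_memread=True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)

# An R-type instruction in EX never triggers this stall; forwarding handles it
no_stall = must_stall(id_ex_memread=False, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
```

When the condition holds, the unit would hold PC and IF/ID (PCWrite and IF/IDWrite deasserted) and send zeroed control signals down the pipeline, which creates the bubble.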

Software only solution?
- Have compiler guarantee that no hazards occur
- Example: where do we insert the "NOPs"?
      sub $2, $1, $3
      and $12, $2, $5
      or  $13, $6, $2
      add $14, $2, $2
      sw  $13, 100($2)
  with NOPs inserted:
      sub $2, $1, $3
      nop
      and $12, $2, $5
      or  $13, $6, $2
      add $14, $2, $2
      nop
      sw  $13, 100($2)
- Problem: this really slows us down!

Control hazards
- Control operations may change the sequential flow of instructions
  - branch
  - jump
  - call (jump and link)
  - return
  - (exception/interrupt and rti / return from interrupt)

Control hazard: Branch
Branch actions:
- Compute new address
- Determine condition
- Perform the actual branch (if taken): PC := new address

Branch example
[Figure: pipeline diagram over CC 1 to CC 9; each instruction occupies IM, Reg, ALU, DM, Reg]
Program execution order (in instructions):
    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or  $13, $6, $2
    52 add $14, $2, $2
    72 lw  $4, 50($7)

Branching
Squash pipeline:
- When we decide to branch, other instructions are in the pipeline!
- We are predicting "branch not taken"
  - need to add hardware for flushing instructions if we are wrong

Branch with predict not taken
[Pipeline diagram over clock cycles: Branch L, the instruction fetched under predict not taken, and the target L:, each passing through IF ID EX MEM WB.]

Branch speedup
- Earlier address computation
- Earlier condition calculation
- Put both in the ID pipeline stage
  - adder
  - comparator
[Pipeline diagram over clock cycles: Branch L, the instruction fetched under predict not taken, and the target L:, stages IF ID EX MEM WB.]

Improved branching / flushing
[Figure: pipelined datapath with the branch address adder (Shift left 2, Sign extend) and the register comparator (=) placed in the ID stage, an IF.Flush control signal on the IF/ID register, the hazard detection unit and the forwarding unit.]

Exception support
Types of exceptions:
- Overflow
- I/O device request
- Operating system call
- Undefined instruction
- Hardware malfunction
- Page fault
Precise exception:
- finish previous instructions (which are still in the pipeline)
- flush excepting and following instructions, redo them after handling the exception(s)

Exceptions
Changes needed for handling an overflow exception of an operation in the EX stage (see book for details):
- Extend PC input mux with extra entry with fixed address
- Add EPC register recording the ID/EX stage PC
  - this is the address of the next instruction!
- Cause register recording exception type
E.g., in case of overflow exception insert 3 bubbles; flush the following stages:
- IF/ID stage
- ID/EX stage
- EX/MEM stage

Scheduling, why?
Let's look at the execution time:
Texecution = Ncycles x Tcycle = Ninstructions x CPI x Tcycle
Scheduling may reduce Texecution
- Reduce CPI (cycles per instruction)
  - early scheduling of long latency operations
  - avoid pipeline stalls due to structural, data and control hazards
  - allow Nissue > 1 and therefore CPI < 1
- Reduce Ninstructions
  - compact many operations into each instruction (VLIW)

Scheduling data hazards: example 1
Try to avoid RaW stalls (in this case load interlocks)!
E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Reordered:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)

Scheduling data hazards example 2
Avoiding RaW stalls: reordering instructions for the following program (by you or the compiler)
Code:
a = b + c
d = e - f
Unscheduled code:
Lw  R1, b
Lw  R2, c
Add R3, R1, R2    interlock
Sw  a, R3
Lw  R1, e
Lw  R2, f
Sub R4, R1, R2    interlock
Sw  d, R4
Scheduled code:
Lw  R1, b
Lw  R2, c
Lw  R5, e         extra reg. needed!
Add R3, R1, R2
Lw  R2, f
Sw  a, R3
Sub R4, R5, R2
Sw  d, R4
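
The effect of the reordering can be checked by counting load interlocks. This is a minimal sketch that assumes a pipeline with forwarding and a one-cycle load delay: an instruction stalls once if it reads a register loaded by the instruction directly before it. The tuple layout and function name are illustrative.

```python
def count_load_interlocks(program):
    """program: list of (op, dest_register_or_None, source_registers)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        # Load-use hazard: the loaded value is needed one cycle too early.
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


UNSCHEDULED = [
    ("lw", "R1", ()), ("lw", "R2", ()),
    ("add", "R3", ("R1", "R2")), ("sw", None, ("R3",)),
    ("lw", "R1", ()), ("lw", "R2", ()),
    ("sub", "R4", ("R1", "R2")), ("sw", None, ("R4",)),
]

SCHEDULED = [
    ("lw", "R1", ()), ("lw", "R2", ()), ("lw", "R5", ()),
    ("add", "R3", ("R1", "R2")), ("lw", "R2", ()),
    ("sw", None, ("R3",)), ("sub", "R4", ("R5", "R2")),
    ("sw", None, ("R4",)),
]

print(count_load_interlocks(UNSCHEDULED), count_load_interlocks(SCHEDULED))
```

Both versions contain eight instructions; only the number of interlock cycles differs, at the cost of the extra register R5.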

Scheduling control hazards
Texecution = Ninstructions x CPI x Tcycle
CPI = CPIideal + fbranch x Pbranch
Pbranch = Ndelayslots x miss_rate
- Modern processors tend to have a large branch penalty, Pbranch, due to:
  - many pipeline stages
  - multi-issue
- Note that penalties have a larger effect when CPIideal is low
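
A small worked example of these formulas. All numbers are hypothetical and chosen only to show why the same branch penalty hurts more when CPIideal is low; the function names are illustrative.

```python
def branch_penalty(n_delay_slots: int, miss_rate: float) -> float:
    """Pbranch = Ndelayslots x miss_rate"""
    return n_delay_slots * miss_rate


def cpi(cpi_ideal: float, f_branch: float, p_branch: float) -> float:
    """CPI = CPIideal + fbranch x Pbranch"""
    return cpi_ideal + f_branch * p_branch


def execution_time(n_instructions: int, cpi_value: float, t_cycle: float) -> float:
    """Texecution = Ninstructions x CPI x Tcycle"""
    return n_instructions * cpi_value * t_cycle


# Hypothetical machine: 3 delay slots, half of them wasted, 20% branches.
p_branch = branch_penalty(n_delay_slots=3, miss_rate=0.5)

single_issue = cpi(cpi_ideal=1.0, f_branch=0.2, p_branch=p_branch)
dual_issue = cpi(cpi_ideal=0.5, f_branch=0.2, p_branch=p_branch)

# Relative slowdown caused by branches for each machine.
print(single_issue / 1.0, dual_issue / 0.5)
```

With these assumed values the single-issue machine loses roughly 30% to branches, while the dual-issue machine loses roughly 60%.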

Scheduling control hazards
What can we do about control hazards and CPI penalty?
- Keep penalty Pbranch low:
  - Early computation of new PC
  - Early determination of condition
  - Visible branch delay slots filled by compiler (MIPS)
- Branch prediction
- Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro'95]
- Remove branches: if-conversion
  - Conditional instructions: CMOVE, cond skip next
  - Guarding all instructions: TriMedia

Branch delay slot
- Add a branch delay slot:
  - the next instruction after a branch is always executed
  - rely on compiler to "fill" the slot with something useful
- Is this a good idea?
  - let's look at how it works

Branch delay slot scheduling
Q. What to put in the delay slot?
      op 1
      beq r1, r2, L
      ...
      op 2            'fall-through'
      ...
L:    op 3            branch target
      ...

Summary
- Modern processors are (deeply) pipelined, to reduce Tcycle and aim at CPI = 1
- Hazards increase CPI
- Several software and hardware measures are taken to avoid or reduce hazards
Not discussed, but important developments:
- Multi-issue further reduces CPI
- Branch prediction to avoid high branch penalties
- Dynamic scheduling
- In all cases: a scheduling compiler is needed

Recap of MIPS
- RISC architecture
- Register space
- Addressing
- Instruction format
- Pipelining

Why RISC? Keep it simple
RISC characteristics:
- Reduced number of instructions
- Limited addressing modes
  - load-store architecture
  - enables pipelining
- Large register set
  - uniform (no distinction between e.g. address and data registers)
- Limited number of instruction sizes (preferably one)
  - know directly where the following instruction starts
- Limited number of instruction formats
- Memory alignment restrictions
- ...
- Based on quantitative analysis
  - "the famous MIPS one percent rule": don't even think about it when it's not used more than one percent

Register space
32 integer (and 32 floating point) registers of 32 bits each

Addressing

Instruction format
R:  op | rs | rt | rd | shamt | funct
I:  op | rs | rt | 16 bit address
J:  op | 26 bit address
Example instructions
Instruction            Meaning
add $s1, $s2, $s3      $s1 = $s2 + $s3
addi $s2, $s3, 4       $s2 = $s3 + 4
lw $s1, 100($s2)       $s1 = Memory[$s2+100]
bne $s4, $s5, L        if $s4<>$s5 goto L
j Label                goto Label
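
The R and I formats can be made concrete by packing the fields into a 32-bit word. This is a minimal sketch using the standard MIPS32 field widths (6/5/5/5/5/6 for R, 6/5/5/16 for I) and the standard opcode, funct and register numbers for the example instructions. The helper names are illustrative.

```python
REG = {"$s1": 17, "$s2": 18, "$s3": 19}  # standard MIPS register numbers


def encode_r(op, rs, rt, rd, shamt, funct):
    """R format: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """I format: op(6) rs(5) rt(5) immediate(16, two's complement)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# add $s1, $s2, $s3: op = 0, funct = 0x20, rd = $s1, rs = $s2, rt = $s3
add_word = encode_r(0, REG["$s2"], REG["$s3"], REG["$s1"], 0, 0x20)
# addi $s2, $s3, 4: op = 8, rt = $s2, rs = $s3
addi_word = encode_i(8, REG["$s3"], REG["$s2"], 4)
# lw $s1, 100($s2): op = 0x23, rt = $s1, rs = $s2
lw_word = encode_i(0x23, REG["$s2"], REG["$s1"], 100)

for word in (add_word, addi_word, lw_word):
    print(f"{word:#010x}")
```

Note that the destination is the first operand in assembly but sits in the rd field (R format) or rt field (I format) of the encoded word.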

Pipelining
All integer instructions fit into the following pipeline:
[Diagram: instruction stream against time; successive instructions each pass through IF ID EX MEM WB, offset by one cycle.]

Other architecture styles
- Accumulator architecture
  - one operand (in register or memory), accumulator almost always implicitly used
- Stack
  - zero operand: all operands implicit (on TOS)
- Register (load store)
  - three operands, all in registers
  - loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode)
- Register-Memory
  - two operands, one in memory
- Memory-Memory
  - three operands, may be all in memory
(there are more varieties / combinations)

Accumulator architecture
[Figure: accumulator and latch feeding the ALU; registers provide the address to memory.]
Example code: a = b+c;
load b;     // accumulator is implicit operand
add c;
store a;

Stack architecture
[Figure: top-of-stack latches feeding the ALU; the stack itself is held in memory and addressed by the stack pointer.]
Example code: a = b+c;
push b;     stack: b
push c;     stack: c b
add;        stack: b+c
pop a;      stack: (empty)

Other architecture styles
Let's look at the code for C = A + B
Stack       Accumulator   Register-Memory   Memory-Memory   Register (load-store)
Push A      Load A        Load r1, A        Add C, B, A     Load r1, A
Push B      Add B         Add r1, B                         Load r2, B
Add         Store C       Store C, r1                       Add r3, r1, r2
Pop C                                                       Store C, r3
Q: What are the advantages / disadvantages of load-store (RISC) architecture?
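
To compare the styles, the three code sequences for C = A + B can be run on toy interpreters. This is a minimal sketch, not a model of any real machine: the tuple encoding of instructions and the function names are assumptions made for illustration.

```python
def run_stack(program, memory):
    """Zero-operand machine: operands are implicit on the top of the stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[args[0]] = stack.pop()
    return memory


def run_accumulator(program, memory):
    """One-operand machine: the accumulator is the implicit second operand."""
    acc = 0
    for op, name in program:
        if op == "load":
            acc = memory[name]
        elif op == "add":
            acc += memory[name]
        elif op == "store":
            memory[name] = acc
    return memory


def run_load_store(program, memory):
    """Three-operand machine: only load and store touch memory."""
    regs = {}
    for op, *args in program:
        if op == "load":       # load rd, var
            regs[args[0]] = memory[args[1]]
        elif op == "add":      # add rd, rs, rt
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":    # store var, rs
            memory[args[0]] = regs[args[1]]
    return memory


STACK_CODE = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
ACC_CODE = [("load", "A"), ("add", "B"), ("store", "C")]
LOAD_STORE_CODE = [("load", "r1", "A"), ("load", "r2", "B"),
                   ("add", "r3", "r1", "r2"), ("store", "C", "r3")]

for run, code in ((run_stack, STACK_CODE), (run_accumulator, ACC_CODE),
                  (run_load_store, LOAD_STORE_CODE)):
    print(run.__name__, len(code), run(code, {"A": 2, "B": 3})["C"])
```

The load-store version needs the most instructions, but each one is simple and only two of its instruction kinds access memory, which is the trade-off the question on the slide asks about.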