CPE 631 Review: Pipelining
Electrical and Computer Engineering
University of Alabama in Huntsville
Aleksandar Milenkovic, milenka@ece.uah.edu
http://www.ece.uah.edu/~milenka
UAH-CPE 631

Outline
• Pipelined Execution
• 5 Steps in MIPS Datapath
• Pipeline Hazards
  - Structural
  - Data
  - Control
AM LaCASA

Laundry Example (by David Patterson)
• Four loads of clothes: A, B, C, D
• Task: each one to wash, dry, and fold
• Resources
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - "Folder" takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes, one after another]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry
• Pipelined laundry takes 3.5 hours for 4 loads
[Figure: timeline starting at 6 PM; loads A, B, C, D in task order, overlapped in the washer (30 min), dryer (40 min), and folder (20 min)]
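The 6-hour and 3.5-hour figures can be checked with a short sketch. It assumes (this is not stated on the slides) that a load may wait between machines and that each machine starts the next load as soon as both the load and the machine are free; the function names are illustrative only.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Finish time when a stage starts once both the load and the machine are free."""
    stage_free = [0] * len(stage_minutes)  # time at which each machine becomes free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[s])
            ready = start + minutes
            stage_free[s] = ready
        finish = ready
    return finish


STAGES = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(STAGES, 4) / 60)  # 6.0 hours
print(pipelined_minutes(STAGES, 4) / 60)   # 3.5 hours
```

In this model the 40-minute dryer sets the rate: after the first load, one load finishes every 40 minutes.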

Pipelining Lessons
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
[Figure: pipelined laundry timeline, 6 PM to 9 PM, loads A, B, C, D in task order]

Computer Pipelines
• Execute billions of instructions, so throughput is what matters
• What is desirable in instruction sets for pipelining?
  - Variable-length instructions vs. all instructions the same length?
  - Memory operands part of any operation vs. memory operands only in loads or stores?
  - Register operands in many places in the instruction format vs. registers located in the same place?

A "Typical" RISC
• 32-bit fixed-format instruction (3 formats)
• Memory access only via load/store instructions
• 32 32-bit GPRs (R0 contains zero)
• 3-address, reg-reg arithmetic instructions; registers in the same place
• Single address mode for load/store: base + displacement
  - no indirection
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS
Instruction formats (bit 31 is the most significant bit):

Format              Bits 31-26  Bits 25-21  Bits 20-16  Bits 15-11  Bits 10-0
Register-Register   Op          Rs1         Rs2         Rd          Opx
Register-Immediate  Op          Rs1         Rd          immediate (bits 15-0)
Branch              Op          Rs1         Rs2/Opx     immediate (bits 15-0)
Jump / Call         Op          target (bits 25-0)
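The field layout can be exercised with a small decoder. This is a sketch that assumes the bit ranges read off the slide's format diagram (Op 31-26, Rs1 25-21, Rs2 or Rd 20-16, Rd 15-11, Opx 10-0); the function names and the sample encodings are hypothetical.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive, bit 0 = least significant) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_reg_reg(word):
    """Split a register-register instruction into its fields."""
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rs2": bits(word, 20, 16),
        "rd": bits(word, 15, 11),
        "opx": bits(word, 10, 0),
    }


def decode_reg_imm(word):
    """Split a register-immediate instruction into its fields."""
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rd": bits(word, 20, 16),
        "imm": bits(word, 15, 0),
    }


# Hypothetical encoding, used only to exercise the decoder.
sample = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11) | 32
print(decode_reg_reg(sample))
```

Because the register fields sit in the same bit positions in every format, the register file can be read before the opcode is fully decoded, which is the property the "registers in same place" bullet asks for.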

5 Steps of MIPS Datapath
[Figure: datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled components: Next PC, MUX, Adder (+4), Next SEQ PC, Inst Memory (Address), Reg File (RS1, RS2, RD), Sign Extend (Imm), ALU with input MUXes, Zero?, Data Memory, LMD, WB Data]

5 Steps of MIPS Datapath (cont'd)
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the five steps; RD is carried along through the pipeline registers to Write Back]
• Data stationary control
  - local decode for each instruction phase / pipeline stage

Visualizing Pipeline
[Figure: instructions in program order (vertical axis) against time in clock cycles CC1 to CC7 (horizontal axis); each instruction passes through IM, Reg, ALU, DM, Reg]

Instruction Flow through Pipeline
[Figure: contents of the pipeline stages (IM, Reg, ALU, DM, Reg) in clock cycles CC1 to CC4 for the sequence below; stages not yet reached by an instruction hold Nop]
Add R1, R2, R3
Lw R4, 0(R2)
Sub R6, R5, R7
Xor R9, R8, R1

DLX Pipeline Definition: IF, ID
• Stage IF
  - IF/ID.IR ← Mem[PC];
  - if EX/MEM.cond {IF/ID.NPC, PC ← EX/MEM.ALUOUT} else {IF/ID.NPC, PC ← PC + 4};
• Stage ID
  - ID/EX.A ← Regs[IF/ID.IR[6…10]]; ID/EX.B ← Regs[IF/ID.IR[11…15]];
  - ID/EX.Imm ← (IF/ID.IR[16])^16 ## IF/ID.IR[16…31];
  - ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
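The immediate computed in stage ID is a sign extension: the top bit of the 16-bit field is replicated 16 times and concatenated with the field itself. A minimal sketch, with a hypothetical function name and with bits numbered from the least significant end (so the slide's IR bit 16 is bit 15 of the 16-bit field here):

```python
def sign_extend_16(field):
    """Replicate bit 15 of a 16-bit field into the upper half of a 32-bit word."""
    field &= 0xFFFF
    if field & 0x8000:           # top bit of the immediate is set: negative value
        return field | 0xFFFF0000
    return field


print(hex(sign_extend_16(0xFFFC)))  # 0xfffffffc, i.e. -4 as a 32-bit word
print(hex(sign_extend_16(0x0004)))  # 0x4
```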

DLX Pipeline Definition: EX
• ALU
  - EX/MEM.IR ← ID/EX.IR;
  - EX/MEM.ALUOUT ← ID/EX.A func ID/EX.B; or EX/MEM.ALUOUT ← ID/EX.A func ID/EX.Imm;
  - EX/MEM.cond ← 0;
• load/store
  - EX/MEM.IR ← ID/EX.IR; EX/MEM.B ← ID/EX.B;
  - EX/MEM.ALUOUT ← ID/EX.A + ID/EX.Imm;
  - EX/MEM.cond ← 0;
• branch
  - EX/MEM.ALUOUT ← ID/EX.NPC + (ID/EX.Imm << 2);
  - EX/MEM.cond ← (ID/EX.A func 0);

DLX Pipeline Definition: MEM, WB
• Stage MEM
  - ALU
    - MEM/WB.IR ← EX/MEM.IR;
    - MEM/WB.ALUOUT ← EX/MEM.ALUOUT;
  - load/store
    - MEM/WB.IR ← EX/MEM.IR;
    - MEM/WB.LMD ← Mem[EX/MEM.ALUOUT]; or Mem[EX/MEM.ALUOUT] ← EX/MEM.B;
• Stage WB
  - ALU
    - Regs[MEM/WB.IR[16…20]] ← MEM/WB.ALUOUT; or Regs[MEM/WB.IR[11…15]] ← MEM/WB.ALUOUT;
  - load
    - Regs[MEM/WB.IR[11…15]] ← MEM/WB.LMD;

It's Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions
  - Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port/Structural Hazards
[Figure: Load followed by Instr 1 to Instr 4, in instruction order, over Cycle 1 to Cycle 7; each instruction passes through Ifetch, Reg, ALU, DMem, Reg. With one memory port, the Load's DMem access and the Ifetch of Instr 3 need the memory in the same cycle]

One Memory Port/Structural Hazards (cont'd)
[Figure: Load, Instr 1, Instr 2, Stall, Instr 3 over Cycle 1 to Cycle 7; the stall inserts a Bubble that moves down the pipeline stages, and Instr 3 is fetched one cycle later]

Data Hazard on R1
[Figure: the sequence below, in instruction order, against time in clock cycles; stages IF, ID/RF, EX, MEM, WB drawn as Ifetch, Reg, ALU, DMem, Reg]
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11

Three Generic Data Hazards
• Read After Write (RAW): Instr. J tries to read operand before Instr. I writes it
    I: add r1, r2, r3
    J: sub r4, r1, r3
• Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): Instr. J writes operand before Instr. I reads it
    I: sub r4, r1, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7
• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
• Can't happen in MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): Instr. J writes operand before Instr. I writes it
    I: sub r1, r4, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7
• Called an "output dependence" by compiler writers
• This also results from the reuse of the name "r1"
• Can't happen in MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
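The three definitions reduce to comparisons between the destination and source registers of an earlier instruction I and a later instruction J. The sketch below is illustrative only: it uses a hypothetical (dest, sources) tuple per instruction and reports the register relationships, not whether a given pipeline actually turns them into stalls.

```python
def data_hazards(first, second):
    """Return the potential hazards when `second` follows `first`.

    Each instruction is a (dest, sources) pair, e.g. ("r1", ("r2", "r3")).
    """
    dest_i, srcs_i = first
    dest_j, srcs_j = second
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")  # J reads what I writes
    if dest_j in srcs_i:
        found.add("WAR")  # J writes what I reads
    if dest_i == dest_j:
        found.add("WAW")  # both write the same register
    return found


# The I/J pairs from the three slides above.
print(data_hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r3"))))  # {'RAW'}
print(data_hazards(("r4", ("r1", "r3")), ("r1", ("r2", "r3"))))  # {'WAR'}
print(data_hazards(("r1", ("r4", "r3")), ("r1", ("r2", "r3"))))  # {'WAW'}
```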

Forwarding to Avoid Data Hazard
[Figure: the sequence below, in instruction order, against time in clock cycles (Ifetch, Reg, ALU, DMem, Reg), with forwarding paths carrying the result of add to the later instructions]
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11

HW Change for Forwarding
[Figure: datapath around the ALU with forwarding hardware; labeled components: NextPC, Registers, Immediate, ID/EX, mux (at each ALU input), ALU, EX/MEM, Data Memory, mux, MEM/WR]

Forwarding to DM input
- Forward R1 from EX/MEM.ALUOUT to ALU input (lw)
- Forward R1 from MEM/WB.ALUOUT to ALU input (sw)
- Forward R4 from MEM/WB.LMD to memory input (memory output to memory input)
[Figure: the sequence below, in instruction order, over clock cycles CC1 to CC7 (IM, Reg, ALU, DM, Reg)]
add R1, R2, R3
lw R4, 0(R1)
sw 12(R1), R4

Forwarding to DM input (cont'd)
Forward R1 from MEM/WB.ALUOUT to DM input
[Figure: the sequence below, in instruction order, over clock cycles CC1 to CC6 (IM, Reg, ALU, DM, Reg)]
add R1, R2, R3
sw 0(R4), R1

Forwarding to Zero
Forward R1 from EX/MEM.ALUOUT to Zero:
add R1, R2, R3
beqz R1, 50
Forward R1 from MEM/WB.ALUOUT to Zero:
add R1, R2, R3
sub R4, R5, R6
bneq R1, 50
[Figure: both sequences, in instruction order, over clock cycles CC1 to CC6 (IM, Reg, ALU, DM, Reg), with Z marking the zero test of the branch]

Data Hazard Even with Forwarding
[Figure: the sequence below, in instruction order, against time in clock cycles (Ifetch, Reg, ALU, DMem, Reg)]
lw r1, 0(r2)
sub r4, r1, r6
and r6, r1, r7
or r8, r1, r9

Data Hazard Even with Forwarding
[Figure: the same sequence with a Bubble inserted in sub, and, and or, so that each is delayed by one cycle behind the lw]
lw r1, 0(r2)
sub r4, r1, r6
and r6, r1, r7
or r8, r1, r9

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e – f;
assuming a, b, c, d, e, and f are in memory.

Slow code:            Fast code:
LW  Rb, b             LW  Rb, b
LW  Rc, c             LW  Rc, c
ADD Ra, Rb, Rc        LW  Re, e
SW  a, Ra             ADD Ra, Rb, Rc
LW  Re, e             LW  Rf, f
LW  Rf, f             SW  a, Ra
SUB Rd, Re, Rf        SUB Rd, Re, Rf
SW  d, Rd             SW  d, Rd
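The difference between the two schedules can be counted mechanically. The sketch below assumes a simplified rule that is not spelled out on this slide: with forwarding, an instruction stalls one cycle only when it reads a register loaded by the LW immediately before it. The (op, dest, sources) tuples are a hypothetical encoding of the two listings.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the LW right before them."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "LW" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(load_use_stalls(SLOW))  # 2
print(load_use_stalls(FAST))  # 0
```

Under this rule the slow code stalls before ADD and before SUB, while the reordered code separates every load from its first use by at least one instruction.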

Control Hazard on Branches: Three Stage Stall
[Figure: the sequence below, in instruction order, against time (Ifetch, Reg, ALU, DMem, Reg)]
10: beq r1, r3, 36
14: and r2, r3, r5
18: or r6, r1, r7
22: add r8, r1, r9
36: xor r10, r1, r11

Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall is significant
• Two-part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3

Pipelined MIPS Datapath
[Figure: pipelined datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; the Zero? test and an Adder for the branch target appear alongside the Reg File (RS1, RS2) and Sign Extend (Imm)]
• Data stationary control
  - local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives
• #1: Stall until branch direction is clear
• #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - "Squash" instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% of MIPS branches not taken on average
  - PC+4 already calculated, so use it to get next instruction

Branch not Taken
Branch is untaken (determined during ID): we have fetched the fall-through and just continue; no wasted cycles.

branch (not taken)  IF  ID  Ex  Mem WB
Ii+1                    IF  ID  Ex  Mem WB
Ii+2                        IF  ID  Ex  Mem WB

Branch is taken (determined during ID): restart the fetch at the branch target; one cycle wasted.

branch (taken)      IF  ID  Ex  Mem WB
Ii+1                    IF  idle
branch target               IF  ID  Ex  Mem WB
branch target+1                 IF  ID  Ex  Mem WB

(Horizontal axis: time in clocks; vertical axis: instructions.)

Four Branch Hazard Alternatives
• #3: Predict Branch Taken
  - Treat every branch as taken
  - 53% of MIPS branches taken on average
  - But haven't calculated branch target address in MIPS
    - MIPS still incurs 1 cycle branch penalty
  - Makes sense only when branch target is known before branch outcome

Four Branch Hazard Alternatives
• #4: Delayed Branch
  - Define branch to take place AFTER a following instruction

        branch instruction
        sequential successor 1
        sequential successor 2
        ........
        sequential successor n      (branch delay of length n)
        branch target if taken

  - 1 slot delay allows proper decision and branch target address in 5-stage pipeline
  - MIPS uses this

Delayed Branch
• Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken

Scheduling the branch delay slot: From Before

    ADD R1, R2, R3
    if (R2 = 0) then
        <Delay Slot>

Becomes

    if (R2 = 0) then
        <ADD R1, R2, R3>

• Delay slot is scheduled with an independent instruction from before the branch
• Best choice, always improves performance

Scheduling the branch delay slot: From Target

    SUB R4, R5, R6
    ...
    ADD R1, R2, R3
    if (R1 = 0) then
        <Delay Slot>

Becomes

    ...
    ADD R1, R2, R3
    if (R1 = 0) then
        <SUB R4, R5, R6>

• Delay slot is scheduled from the target of the branch
• Must be OK to execute that instruction if branch is not taken
• Usually the target instruction will need to be copied because it can be reached by another path, so programs are enlarged
• Preferred when the branch is taken with high probability

Scheduling the branch delay slot: From Fall Through

    ADD R1, R2, R3
    if (R2 = 0) then
        <Delay Slot>
    SUB R4, R5, R6

Becomes

    ADD R1, R2, R3
    if (R2 = 0) then
        <SUB R4, R5, R6>

• Delay slot is scheduled from the not-taken fall through
• Must be OK to execute that instruction if branch is taken
• Improves performance when branch is not taken

Delayed Branch Effectiveness
• Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Example: Branch Stall Impact
• Assume CPI = 1.0 ignoring branches
• Assume solution was stalling for 3 cycles
• If 30% branch, stall 3 cycles

Op      Freq  Cycles  CPI(i)  (% Time)
Other   70%   1       0.7     (37%)
Branch  30%   4       1.2     (63%)

• => new CPI = 1.9, or almost 2 times slower
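The table is a weighted average of cycles per instruction class. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def average_cpi(mix):
    """mix: list of (frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(freq * cycles for freq, cycles in mix)


mix = [(0.70, 1), (0.30, 4)]  # other instructions; branches (1 cycle + 3 stall cycles)
cpi = average_cpi(mix)
print(round(cpi, 2))  # 1.9

# Share of execution time spent in each class.
for freq, cycles in mix:
    print(round(100 * freq * cycles / cpi))  # 37, then 63
```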

Example 2: Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:
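The equations on this slide did not survive extraction. A widely used textbook form of the pipelining speedup equation is given here as an assumption about what the slide showed, not as a recovery of its exact content:

```latex
\[
\text{Speedup} =
  \frac{\text{Ideal CPI} \times \text{Pipeline depth}}
       {\text{Ideal CPI} + \text{Pipeline stall CPI}}
  \times
  \frac{\text{Cycle time}_{\text{unpipelined}}}
       {\text{Cycle time}_{\text{pipelined}}}
\]
\[
\text{Speedup} =
  \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}}
  \times
  \frac{\text{Cycle time}_{\text{unpipelined}}}
       {\text{Cycle time}_{\text{pipelined}}}
  \qquad (\text{Ideal CPI} = 1)
\]
```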

Example 3: Evaluating Branch Alternatives (for 1 program)

Scheduling scheme   Branch penalty  CPI   Speedup v. stall
Stall pipeline      3               1.42  1.0
Predict taken       1               1.14  1.26
Predict not taken   1               1.09  1.29
Delayed branch      0.5             1.07  1.31

• Conditional & Unconditional = 14%, 65% change PC
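The CPI column can be reproduced from the two percentages on the slide. The sketch assumes a base CPI of 1 and that, under predict-not-taken, only the 65% of branches that change the PC pay the penalty; the names are illustrative.

```python
BRANCH_FREQ = 0.14     # conditional and unconditional branches
TAKEN_FRACTION = 0.65  # fraction of branches that change the PC


def branch_cpi(penalty, fraction_paying=1.0):
    """Base CPI of 1 plus the average branch penalty per instruction."""
    return 1.0 + BRANCH_FREQ * fraction_paying * penalty


schemes = {
    "Stall pipeline": branch_cpi(3),
    "Predict taken": branch_cpi(1),
    "Predict not taken": branch_cpi(1, TAKEN_FRACTION),
    "Delayed branch": branch_cpi(0.5),
}
for name, cpi in schemes.items():
    print(f"{name}: {cpi:.2f}")
```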

Example 4: Dual-port vs. Single-port
• Machine A: dual-ported memory ("Harvard Architecture")
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads & stores are 40% of instructions executed
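The comparison itself is not on the slide as extracted. One way to work it, assuming each load or store on Machine B costs one stall cycle because of the single memory port (an assumption, as are the variable names):

```python
LOAD_STORE_FREQ = 0.40
CLOCK_SPEEDUP_B = 1.05  # Machine B's clock rate relative to Machine A's

# Average time per instruction, in units of Machine A's clock cycle.
time_a = 1.0 * 1.0                                    # CPI 1, cycle time 1
time_b = (1.0 + LOAD_STORE_FREQ * 1) / CLOCK_SPEEDUP_B  # CPI 1.4, shorter cycle

print(round(time_b / time_a, 2))  # 1.33
```

Under that assumption Machine A comes out about 1.33 times faster despite its slower clock.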

Extended MIPS Pipeline
DLX pipe with three unpipelined FP functional units
[Figure: IF and ID feed four parallel execution units (EX Int, EX FP/I Mult, EX FP Add, EX FP/I Div), which feed Mem and WB]
In reality, the intermediate results are probably not cycled around the EX unit; instead, the EX stage has some number of clock delays larger than 1.

Extended MIPS Pipeline (cont'd)
• Initiation or repeat interval: number of clock cycles that must elapse between issuing two operations
• Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result

Functional unit       Latency  Initiation interval
Integer ALU           0        1
Data Memory           1        1
FP Add                3        1
FP/Integer Multiply   6        1
FP/Integer Divide     24       25
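Read against the two definitions, the table gives stall counts directly. The sketch below assumes full forwarding and a consumer issued immediately after its producer, so the consumer waits for exactly the producer's latency; the helper names are hypothetical.

```python
# unit name -> (latency, initiation interval), copied from the table above
UNITS = {
    "Integer ALU": (0, 1),
    "Data Memory": (1, 1),
    "FP Add": (3, 1),
    "FP/Integer Multiply": (6, 1),
    "FP/Integer Divide": (24, 25),
}


def raw_stall_cycles(producer_unit):
    """Stall cycles for a dependent instruction issued right behind its producer."""
    latency, _ = UNITS[producer_unit]
    return latency


def structural_stall_cycles(unit):
    """Extra cycles a second back-to-back operation waits for the same unit."""
    _, interval = UNITS[unit]
    return interval - 1


print(raw_stall_cycles("FP/Integer Multiply"))       # 6
print(structural_stall_cycles("FP/Integer Divide"))  # 24
```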

Extended MIPS Pipeline (cont'd)
[Figure: IF and ID feed parallel execution paths (Ex; M1 to M7; A1 to A4; ...), which feed M and WB]

Extended MIPS Pipeline (cont'd)
• Multiple outstanding FP operations
  - FP/I Adder and Multiplier are fully pipelined
  - FP/I Divider is not pipelined
• Pipeline timing for independent operations

MUL.D  IF  ID  M1  M2  M3  M4  M5  M6  M7  Mem WB
ADD.D      IF  ID  A1  A2  A3  A4  Mem WB
L.D            IF  ID  Ex  Mem WB
S.D                IF  ID  Ex  Mem WB

Hazards and Forwarding in Longer Pipes
• Structural hazard: divide unit is not fully pipelined
  - detect it and stall the instruction
• Structural hazard: number of register writes can be larger than one due to varying running times
• WAW hazards are possible
• Exceptions!
  - instructions can complete in a different order than they were issued
• RAW hazards will be more frequent

Examples
• Stalls arising from RAW hazards

L.D   F4, 0(R2)   IF ID EX Mem WB
MUL.D F0, F4, F6     IF ID stall M1 M2 M3 M4 M5 M6 M7 Mem WB
ADD.D F2, F0, F8        IF stall ID stall stall ... A1 A2 A3 A4 Mem WB
S.D   0(R2), F2            IF stall stall ... ID EX ... stall Mem

• Three instructions that want to perform a write back to the FP register file simultaneously

MUL.D F0, F4, F6  IF  ID  M1  M2  M3  M4  M5  M6  M7  Mem WB
...                   IF  ID  EX  Mem WB
...                       IF  ID  EX  Mem WB
ADD.D F2, F4, F6              IF  ID  A1  A2  A3  A4  Mem WB
...                               IF  ID  EX  Mem WB
...                                   IF  ID  EX  Mem WB
L.D   F2, 0(R2)                           IF  ID  EX  Mem WB

Solving Register Write Conflicts
• First approach: track the use of the write port in the ID stage and stall an instruction before it issues
  - use a shift register that indicates when already issued instructions will use the register file
  - if there is a conflict with an already issued instruction, stall the instruction for one clock cycle
  - on each clock cycle the reservation register is shifted one bit
• Alternative approach: stall a conflicting instruction when it tries to enter MEM or WB stage
  - we can stall either instruction
  - e.g., give priority to the unit with the longest latency
  - Pros: does not require detecting the conflict until the entrance of MEM or WB stage
  - Cons: complicates pipeline control; stalls now can arise from two different places
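The first approach can be sketched with an integer used as the shift register. The class name and the cycle counts from ID to WB (9 for MUL.D, 6 for ADD.D, 3 for L.D, read off the stage sequences shown earlier) are assumptions made for illustration.

```python
class WritePortReservation:
    """Shift register: bit k set means the write port is taken k cycles from now."""

    def __init__(self):
        self.bits = 0

    def try_issue(self, cycles_until_wb):
        """Reserve the write port for an instruction in ID, or report a conflict."""
        mask = 1 << cycles_until_wb
        if self.bits & mask:
            return False  # conflict: stall this instruction for one cycle
        self.bits |= mask
        return True

    def tick(self):
        """Advance one clock cycle: every reservation moves one bit closer."""
        self.bits >>= 1


port = WritePortReservation()
print(port.try_issue(9))  # MUL.D issues: True
for _ in range(3):
    port.tick()
print(port.try_issue(6))  # ADD.D three cycles later would write back in the same cycle: False
port.tick()
print(port.try_issue(6))  # after a one-cycle stall it can issue: True
```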

WAW Hazards

...               IF  ID  EX  Mem WB
ADD.D F2, F4, F6      IF  ID  A1  A2  A3  A4  Mem WB
...                       IF  ID  EX  Mem WB
L.D   F2, 0(R2)               IF  ID  EX  Mem WB

• Result of ADD.D is overwritten without any instruction ever using it
  - WAWs occur when a useless instruction is executed
  - still, we must detect them and provide correct execution. Why?

          BNEZ  R1, foo
          DIV.D F0, F2, F4   ; delay slot from fall-through
          ...
    foo:  L.D   F0, qrs

Solving WAW Hazards
• First approach: delay the issue of the load instruction until ADD.D enters MEM
• Second approach: stamp out the result of the ADD.D by detecting the hazard and changing the control so that ADD.D does not write; L.D issues right away
• Detect hazard in ID when L.D is issuing
  - stall L.D, or
  - make ADD.D a no-op
• Luckily this hazard is rare

Hazard Detection in ID Stage
• Possible hazards
  - hazards among FP instructions
  - hazards between an FP instruction and an integer instruction
    - FP and integer registers are distinct, except for FP load-stores and FP-integer moves
• Assume that the pipeline does all hazard detection in ID stage

Hazard Detection in ID Stage (cont'd)
• Check for structural hazards
  - wait until the required functional unit is not busy and make sure that the register write port is available
• Check for RAW data hazards
  - wait until source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result
• Check for WAW data hazards
  - determine if any instruction in A1, ..., A4, M1, ..., M7, D has the same register destination as this instruction; if so, stall the issue of the instruction in ID
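The three checks can be sketched as one function evaluated for the instruction in ID. The dictionary fields (`unit`, `dest`, `srcs`, `forwardable`) and the function name are hypothetical; `forwardable` stands in for "the result will be available when this instruction needs it".

```python
def stall_reason(instr, in_flight, busy_units, write_port_free=True):
    """Return why the instruction in ID must stall, or None if it may issue."""
    # Structural: functional unit busy or register write port unavailable.
    if instr["unit"] in busy_units or not write_port_free:
        return "structural"
    # RAW: a source is a pending destination that cannot be forwarded in time.
    for older in in_flight:
        if older["dest"] in instr["srcs"] and not older["forwardable"]:
            return "RAW"
    # WAW: an instruction still in A1..A4, M1..M7, or D writes the same register.
    for older in in_flight:
        if older["dest"] is not None and older["dest"] == instr["dest"]:
            return "WAW"
    return None


add_d = {"unit": "FP Add", "dest": "F2", "srcs": ("F4", "F6"), "forwardable": False}
l_d = {"unit": "Integer ALU", "dest": "F2", "srcs": ("R2",)}
print(stall_reason(l_d, [add_d], busy_units=set()))  # WAW
```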

Forwarding Logic
• Check if the destination register in any of the EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of an FP instruction
• If so, the appropriate input multiplexer will have to be enabled so as to choose the forwarded data