CPE 631 Review Pipelining Electrical and Computer Engineering
- Slides: 59

CPE 631 Review: Pipelining
Electrical and Computer Engineering, University of Alabama in Huntsville
Aleksandar Milenkovic, milenka@ece.uah.edu
http://www.ece.uah.edu/~milenka
UAH-CPE 631

Outline
- Pipelined Execution
- 5 Steps in MIPS Datapath
- Pipeline Hazards
  - Structural
  - Data
  - Control
AM, LaCASA

Laundry Example (by David Patterson)
- Four loads of clothes: A, B, C, D
- Task: each one to wash, dry, and fold
- Resources
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - "Folder" takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30, 40, and 20 minutes in turn.]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?

Pipelined Laundry
- Pipelined laundry takes 3.5 hours for 4 loads
[Figure: timeline starting at 6 PM; loads A, B, C, D in task order, overlapped so that a new load starts while earlier loads are still drying or being folded.]
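
The 6-hour and 3.5-hour figures can be checked with a short calculation. The sketch below is illustrative only: the function names are made up here, and it assumes a finished load can wait between machines without blocking the machine it just left.

```python
# Stage times in minutes from the laundry example: wash, dry, fold.
STAGE_MINUTES = [30, 40, 20]
LOADS = 4


def sequential_minutes(stages, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Finish time when a load enters a stage as soon as both the load
    and that stage's machine are free (assumed: loads may wait between
    machines)."""
    free_at = [0] * len(stages)  # time at which each machine becomes free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for i, duration in enumerate(stages):
            start = max(ready, free_at[i])
            ready = start + duration
            free_at[i] = ready
        finish = ready
    return finish


print(sequential_minutes(STAGE_MINUTES, LOADS) / 60)  # hours, sequential
print(pipelined_minutes(STAGE_MINUTES, LOADS) / 60)   # hours, pipelined
```

The pipelined total works out to 30 + 4 x 40 + 20 minutes: the 40-minute dryer, the slowest stage, sets the rate once the pipeline is full.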

Pipelining Lessons
[Figure: pipelined laundry timeline for loads A, B, C, D, starting at 6 PM.]
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to "fill" the pipeline and time to "drain" it reduce speedup

Computer Pipelines
- Execute billions of instructions, so throughput is what matters
- What is desirable in instruction sets for pipelining?
  - Variable length instructions vs. all instructions same length?
  - Memory operands part of any operation vs. memory operands only in loads or stores?
  - Register operand many places in instruction format vs. registers located in same place?

A "Typical" RISC
- 32-bit fixed format instruction (3 formats)
- Memory access only via load/store instructions
- 32 32-bit GPRs (R0 contains zero)
- 3-address, reg-reg arithmetic instruction; registers in same place
- Single address mode for load/store: base + displacement
  - no indirection
- Simple branch conditions
- Delayed branch
see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS
Instruction formats (bit 31 on the left, bit 0 on the right):
- Register-Register: Op [31:26], Rs1 [25:21], Rs2 [20:16], Rd [15:11], Opx [10:0]
- Register-Immediate: Op [31:26], Rs1 [25:21], Rd [20:16], immediate [15:0]
- Branch: Op [31:26], Rs1 [25:21], Rs2/Opx [20:16], immediate [15:0]
- Jump / Call: Op [31:26], target [25:0]

5 Steps of MIPS Datapath
[Figure: MIPS datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled elements: Next PC, Next SEQ PC, Adder (+4), Inst Memory, Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU, Zero?, Data Memory, LMD, WB Data.]

5 Steps of MIPS Datapath (cont'd)
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back.]
- Data stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipeline
[Figure: instructions in program order (vertical axis) against time in clock cycles CC1 to CC7 (horizontal axis); each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the previous instruction.]

Instruction Flow through Pipeline
[Figure: pipeline contents in clock cycles CC1 to CC4 as Add R1, R2, R3; Lw R4, 0(R2); Sub R6, R5, R7; and Xor R9, R8, R1 enter in turn; stages not yet reached by an instruction hold Nop.]

DLX Pipeline Definition: IF, ID
- Stage IF
  - IF/ID.IR ← Mem[PC];
  - if EX/MEM.cond {IF/ID.NPC, PC ← EX/MEM.ALUOUT} else {IF/ID.NPC, PC ← PC + 4};
- Stage ID
  - ID/EX.A ← Regs[IF/ID.IR6…10]; ID/EX.B ← Regs[IF/ID.IR11…15];
  - ID/EX.Imm ← (IF/ID.IR16)^16 ## IF/ID.IR16…31;
  - ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
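
The ID-stage immediate, (IF/ID.IR16)^16 ## IF/ID.IR16…31, is a sign extension: bit 16 of the instruction (the top bit of the 16-bit field, since DLX numbers bit 0 as the most significant) is replicated 16 times and concatenated with the field itself. A minimal sketch, with function names invented here for illustration and values held as unsigned 32-bit integers:

```python
def sign_extend_16(imm16: int) -> int:
    """Replicate the top bit of a 16-bit field into the upper 16 bits."""
    imm16 &= 0xFFFF
    if imm16 & 0x8000:               # sign bit set: fill upper half with ones
        return imm16 | 0xFFFF0000
    return imm16                     # sign bit clear: upper half stays zero


def id_stage_imm(ir: int) -> int:
    """ID/EX.Imm for a 32-bit instruction word: extend its low 16 bits."""
    return sign_extend_16(ir & 0xFFFF)


print(hex(id_stage_imm(0x8C44FFFC)))  # immediate field 0xFFFC, i.e. -4
```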

DLX Pipeline Definition: EX
- ALU
  - EX/MEM.IR ← ID/EX.IR;
  - EX/MEM.ALUOUT ← ID/EX.A func ID/EX.B; or EX/MEM.ALUOUT ← ID/EX.A func ID/EX.Imm;
  - EX/MEM.cond ← 0;
- load/store
  - EX/MEM.IR ← ID/EX.IR; EX/MEM.B ← ID/EX.B;
  - EX/MEM.ALUOUT ← ID/EX.A + ID/EX.Imm;
  - EX/MEM.cond ← 0;
- branch
  - EX/MEM.ALUOUT ← ID/EX.NPC + (ID/EX.Imm << 2);
  - EX/MEM.cond ← (ID/EX.A func 0);

DLX Pipeline Definition: MEM, WB
- Stage MEM
  - ALU
    - MEM/WB.IR ← EX/MEM.IR; MEM/WB.ALUOUT ← EX/MEM.ALUOUT;
  - load/store
    - MEM/WB.IR ← EX/MEM.IR;
    - MEM/WB.LMD ← Mem[EX/MEM.ALUOUT]; or Mem[EX/MEM.ALUOUT] ← EX/MEM.B;
- Stage WB
  - ALU
    - Regs[MEM/WB.IR16…20] ← MEM/WB.ALUOUT; or Regs[MEM/WB.IR11…15] ← MEM/WB.ALUOUT;
  - load
    - Regs[MEM/WB.IR11…15] ← MEM/WB.LMD;

It's Not That Easy for Computers
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions
  - Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port/Structural Hazards
[Figure: pipeline diagram over Cycle 1 to Cycle 7 for Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order; each passes through Ifetch, Reg, ALU, DMem, Reg. With a single memory port, the Load's DMem access and the Ifetch of Instr 3 fall in the same cycle.]

One Memory Port/Structural Hazards (cont'd)
[Figure: the same sequence over Cycle 1 to Cycle 7 with a Stall inserted after Instr 2; the stall appears as a Bubble moving through the pipeline stages, and Instr 3 is fetched one cycle later.]

Data Hazard on R1
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB for the following sequence, in which every instruction after the add reads r1:]
    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

Three Generic Data Hazards
- Read After Write (RAW): Instr. J tries to read operand before Instr. I writes it
    I: add r1, r2, r3
    J: sub r4, r1, r3
- Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
- Write After Read (WAR): Instr. J writes operand before Instr. I reads it
    I: sub r4, r1, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7
- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards
- Write After Write (WAW): Instr. J writes operand before Instr. I writes it
    I: sub r1, r4, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7
- Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
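
The three definitions reduce to simple register-name comparisons between an earlier instruction I and a later instruction J. The sketch below is illustrative: `Instr` and `dependences` are names invented here, and the function reports name dependences only. Whether a dependence becomes an actual hazard depends on pipeline timing; in the 5-stage MIPS pipeline only RAW does.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str                # register written
    srcs: Tuple[str, ...]    # registers read


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Dependences of later instruction j on earlier instruction i."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")     # j reads what i writes
    if j.dest in i.srcs:
        found.add("WAR")     # j writes what i reads
    if i.dest == j.dest:
        found.add("WAW")     # both write the same register
    return found


# The I/J pairs from the three slides.
print(dependences(Instr("r1", ("r2", "r3")), Instr("r4", ("r1", "r3"))))
print(dependences(Instr("r4", ("r1", "r3")), Instr("r1", ("r2", "r3"))))
print(dependences(Instr("r1", ("r4", "r3")), Instr("r1", ("r2", "r3"))))
```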

Forwarding to Avoid Data Hazard
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg against time in clock cycles) for the sequence below, with the result of the add forwarded to the later instructions that read r1:]
    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

HW Change for Forwarding
[Figure: datapath with forwarding paths. Labeled elements: NextPC, Registers, Immediate, ID/EX, muxes on the ALU inputs, ALU, EX/MEM, Data Memory, MEM/WB.]

Forwarding to DM input
- Forward R1 from EX/MEM.ALUOUT to ALU input (lw)
- Forward R1 from MEM/WB.ALUOUT to ALU input (sw)
- Forward R4 from MEM/WB.LMD to memory input (memory output to memory input)
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) over CC1 to CC7 for:]
    add R1, R2, R3
    lw  R4, 0(R1)
    sw  12(R1), R4

Forwarding to DM input (cont'd)
- Forward R1 from MEM/WB.ALUOUT to DM input
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) over CC1 to CC6 for:]
    add R1, R2, R3
    sw  0(R4), R1

Forwarding to Zero
- Forward R1 from EX/MEM.ALUOUT to Zero
    add  R1, R2, R3
    beqz R1, 50
- Forward R1 from MEM/WB.ALUOUT to Zero
    add  R1, R2, R3
    sub  R4, R5, R6
    bneq R1, 50
[Figure: pipeline diagrams (IM, Reg, ALU, DM, Reg) over CC1 to CC6 for both sequences, with the zero test marked Z.]

Data Hazard Even with Forwarding
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg against time in clock cycles) for the sequence below; the loaded value of r1 is available only after DMem, too late for the ALU stage of the sub that follows immediately:]
    lw  r1, 0(r2)
    sub r4, r1, r6
    and r6, r1, r7
    or  r8, r1, r9

Data Hazard Even with Forwarding
[Figure: the same sequence with a one-cycle Bubble inserted, so that sub, and, and or each proceed one cycle later:]
    lw  r1, 0(r2)
    sub r4, r1, r6
    and r6, r1, r7
    or  r8, r1, r9

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.
Slow code:
    LW  Rb, b
    LW  Rc, c
    ADD Ra, Rb, Rc
    SW  a, Ra
    LW  Re, e
    LW  Rf, f
    SUB Rd, Re, Rf
    SW  d, Rd
Fast code:
    LW  Rb, b
    LW  Rc, c
    LW  Re, e
    ADD Ra, Rb, Rc
    LW  Rf, f
    SW  a, Ra
    SUB Rd, Re, Rf
    SW  d, Rd
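
Why the second ordering is faster can be checked by counting load-use stalls: with forwarding, an instruction that reads a register loaded by the immediately preceding LW costs one bubble. The sketch below is a simplified model, not a pipeline simulator; the tuple encoding and the function name are invented here, and it treats every source operand alike (a conservative simplification, since store data could be forwarded to the memory input).

```python
def load_use_stalls(program):
    """Count one-cycle stalls: an instruction directly after an LW reads
    the register that LW loads. Each instruction is (op, dest, sources)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(load_use_stalls(SLOW), load_use_stalls(FAST))
```

Under this model the slow code stalls twice (ADD after LW Rc, SUB after LW Rf), while the fast code places an independent instruction after each critical load and does not stall.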

Control Hazard on Branches: Three Stage Stall
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) for the sequence below; the three instructions after the branch are fetched before the branch outcome is known:]
    10: beq r1, r3, 36
    14: and r2, r3, r5
    18: or  r6, r1, r7
    22: add r8, r1, r9
    36: xor r10, r1, r11

Example: Branch Stall Impact
- If 30% of instructions are branches, a 3-cycle stall is significant
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or ≠ 0
- MIPS Solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3

Pipelined MIPS Datapath
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB across the stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, with the Zero? test and the branch-target Adder moved into the decode stage.]
- Data stationary control: local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives
- #1: Stall until branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - "Squash" instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% MIPS branches not taken on average
  - PC+4 already calculated, so use it to get next instruction

Branch not Taken
Branch is untaken (determined during ID): we have fetched the fall-through and just continue; no wasted cycles.
    branch (not taken)  IF  ID  Ex  Mem WB
    Ii+1                    IF  ID  Ex  Mem WB
    Ii+2                        IF  ID  Ex  Mem WB
Branch is taken (determined during ID): restart the fetch at the branch target; one cycle wasted.
    branch (taken)      IF  ID  Ex  Mem WB
    Ii+1                    IF  idle
    branch target               IF  ID  Ex  Mem WB
    branch target+1                 IF  ID  Ex  Mem WB
Time [clocks] runs left to right; instructions run top to bottom.

Four Branch Hazard Alternatives
- #3: Predict Branch Taken
  - Treat every branch as taken
  - 53% MIPS branches taken on average
  - But the branch target address hasn't been calculated yet in MIPS
    - MIPS still incurs 1 cycle branch penalty
    - Makes sense only when the branch target is known before the branch outcome

Four Branch Hazard Alternatives
- #4: Delayed Branch
  - Define branch to take place AFTER a following instruction
        branch instruction
        sequential successor 1
        sequential successor 2
        ........
        sequential successor n
        branch target if taken
    (sequential successors 1 to n: branch delay of length n)
  - 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  - MIPS uses this

Delayed Branch
- Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken

Scheduling the branch delay slot: From Before
    ADD R1, R2, R3
    if (R2 = 0) then
        <Delay Slot>
Becomes
    if (R2 = 0) then
        <ADD R1, R2, R3>
- Delay slot is scheduled with an independent instruction from before the branch
- Best choice, always improves performance

Scheduling the branch delay slot: From Target
    SUB R4, R5, R6
    ...
    ADD R1, R2, R3
    if (R1 = 0) then
        <Delay Slot>
Becomes
    ...
    ADD R1, R2, R3
    if (R1 = 0) then
        <SUB R4, R5, R6>
- Delay slot is scheduled from the target of the branch
- Must be OK to execute that instruction if branch is not taken
- Usually the target instruction will need to be copied because it can be reached by another path, so programs are enlarged
- Preferred when the branch is taken with high probability

Scheduling the branch delay slot: From Fall Through
    ADD R1, R2, R3
    if (R2 = 0) then
        <Delay Slot>
    SUB R4, R5, R6
Becomes
    ADD R1, R2, R3
    if (R2 = 0) then
        <SUB R4, R5, R6>
- Delay slot is scheduled from the fall-through (not-taken) path
- Must be OK to execute that instruction if branch is taken
- Improves performance when branch is not taken

Delayed Branch Effectiveness
- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Example: Branch Stall Impact
- Assume CPI = 1.0 ignoring branches
- Assume solution was stalling for 3 cycles
- If 30% branch, stall 3 cycles

    Op      Freq  Cycles  CPI(i)  (% Time)
    Other   70%   1       0.7     (37%)
    Branch  30%   4       1.2     (63%)

- => new CPI = 1.9, or almost 2 times slower
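
The table is a weighted average: each class contributes frequency x cycles to the CPI, and its share of execution time is that contribution divided by the total. A small sketch of the arithmetic (the dictionary layout and function name are invented here):

```python
def average_cpi(mix):
    """mix maps an instruction class to (frequency, cycles per instruction)."""
    return sum(freq * cycles for freq, cycles in mix.values())


# Branches take 1 base cycle plus the 3-cycle stall.
MIX = {"Other": (0.70, 1), "Branch": (0.30, 1 + 3)}

cpi = average_cpi(MIX)
time_share = {op: freq * cycles / cpi for op, (freq, cycles) in MIX.items()}

print(round(cpi, 2))
print({op: round(100 * share) for op, share in time_share.items()})
```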

Example 2: Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:
[Equations not preserved in the extracted text.]
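
The equations on this slide did not survive extraction. What follows is the standard textbook form of the pipelining speedup equation, given here as an assumption about what the slide showed rather than a transcription of it:

```latex
\begin{align*}
\text{CPI}_{\text{pipelined}} &= \text{Ideal CPI} + \text{Pipeline stall cycles per instruction} \\[4pt]
\text{Speedup} &= \frac{\text{Ideal CPI} \times \text{Pipeline depth}}
                       {\text{Ideal CPI} + \text{Pipeline stall CPI}}
                  \times \frac{\text{Cycle time}_{\text{unpipelined}}}
                              {\text{Cycle time}_{\text{pipelined}}} \\[4pt]
\text{Speedup}\,\big|_{\text{Ideal CPI} = 1} &= \frac{\text{Pipeline depth}}
                       {1 + \text{Pipeline stall CPI}}
                  \times \frac{\text{Cycle time}_{\text{unpipelined}}}
                              {\text{Cycle time}_{\text{pipelined}}}
\end{align*}
```

With no stalls and equal cycle times, the last line reduces to the pipeline depth, matching the "potential speedup = number of pipe stages" lesson from the laundry example.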

Example 3: Evaluating Branch Alternatives (for 1 program)

    Scheduling scheme   Branch penalty  CPI   Speedup v. stall
    Stall pipeline      3               1.42  1.0
    Predict taken       1               1.14  1.26
    Predict not taken   1               1.09  1.29
    Delayed branch      0.5             1.07  1.31

- Conditional & Unconditional = 14%, 65% change PC
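
The CPI column can be reproduced as CPI = 1 + branch frequency x penalty. The sketch below reads the slide's note as "14% of instructions are branches and 65% of them change the PC", and assumes that predict-not-taken pays its penalty only on the branches that change the PC; both readings are assumptions, chosen because they reproduce the table. The names are invented here.

```python
BRANCH_FREQ = 0.14      # conditional and unconditional branches
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC


def branch_cpi(penalty, fraction_penalized=1.0):
    """CPI with an ideal CPI of 1 plus the average branch penalty."""
    return 1.0 + BRANCH_FREQ * fraction_penalized * penalty


SCHEMES = {
    "Stall pipeline": branch_cpi(3),
    "Predict taken": branch_cpi(1),
    "Predict not taken": branch_cpi(1, TAKEN_FRACTION),
    "Delayed branch": branch_cpi(0.5),
}

for name, cpi in SCHEMES.items():
    print(f"{name:18s} CPI = {cpi:.2f}")
```

The speedup column is not recomputed here: the slide's values appear to have been derived from already-rounded intermediate figures, so dividing these CPIs directly gives slightly different numbers.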

Example 4: Dual-port vs. Single-port
- Machine A: Dual ported memory ("Harvard Architecture")
- Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads & stores are 40% of instructions executed
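
The comparison itself is not in the extracted text. One way to work it, assuming that on Machine B every load or store costs one structural-hazard stall cycle (an assumption, not stated on the slide), is to compare average instruction times, CPI x cycle time:

```python
def avg_instruction_time(ideal_cpi, stall_cpi, cycle_time):
    """Average time per instruction = (ideal CPI + stall CPI) x cycle time."""
    return (ideal_cpi + stall_cpi) * cycle_time


LOAD_STORE_FRACTION = 0.40
CYCLE_A = 1.0               # Machine A's cycle time as the unit
CYCLE_B = CYCLE_A / 1.05    # Machine B's clock is 1.05 times faster

time_a = avg_instruction_time(1.0, 0.0, CYCLE_A)
# Assumed: one stall cycle per load/store on the single memory port.
time_b = avg_instruction_time(1.0, LOAD_STORE_FRACTION * 1, CYCLE_B)

print(round(time_b / time_a, 2))  # how many times faster Machine A is
```

Under that assumption Machine A comes out about 1.33 times faster: the 5% clock advantage of Machine B does not make up for a stall on 40% of its instructions.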

Extended MIPS Pipeline
DLX pipe with three unpipelined FP functional units
[Figure: IF and ID feed four parallel execution units (EX Int, EX FP/I Mult, EX FP Add, EX FP/I Div), which all feed Mem and WB.]
In reality, the intermediate results are probably not cycled around the EX unit; instead, the EX stage has some number of clock delays larger than 1.

Extended MIPS Pipeline (cont'd)
- Initiation or repeat interval: number of clock cycles that must elapse between issuing two operations
- Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result

    Functional unit       Latency  Initiation interval
    Integer ALU           0        1
    Data Memory           1        1
    FP Add                3        1
    FP/Integer Multiply   6        1
    FP/Integer Divide     24       25

Extended MIPS Pipeline (cont'd)
[Figure: IF and ID feed parallel execution paths: Ex (integer), M1 to M7 (multiply), A1 to A4 (FP add), and the divide unit; all paths rejoin at M (memory) and WB.]

Extended MIPS Pipeline (cont'd)
- Multiple outstanding FP operations
  - FP/I Adder and Multiplier are fully pipelined
  - FP/I Divider is not pipelined
- Pipeline timing for independent operations

    MUL.D  IF  ID  M1  M2  M3  M4  M5  M6  M7  Mem WB
    ADD.D      IF  ID  A1  A2  A3  A4  Mem WB
    L.D            IF  ID  Ex  Mem WB
    S.D                IF  ID  Ex  Mem WB

Hazards and Forwarding in Longer Pipes
- Structural hazard: divide unit is not fully pipelined
  - detect it and stall the instruction
- Structural hazard: number of register writes can be larger than one due to varying running times
- WAW hazards are possible
- Exceptions!
  - instructions can complete in different order than they were issued
- RAW hazards will be more frequent

Examples
- Stalls arising from RAW hazards

    L.D   F4, 0(R2)    IF ID EX Mem WB
    MUL.D F0, F4, F6   IF ID stall M1 M2 M3 M4 M5 M6 M7 Mem WB
    ADD.D F2, F0, F8   IF stall ID stall stall ... A1 A2 A3 A4 Mem WB
    S.D   0(R2), F2    IF stall stall ... ID EX stall ... Mem

- Three instructions that want to perform a write back to the FP register file simultaneously

    MUL.D F0, F4, F6   IF  ID  M1  M2  M3  M4  M5  M6  M7  Mem WB
    ...                    IF  ID  EX  Mem WB
    ...                        IF  ID  EX  Mem WB
    ADD.D F2, F4, F6               IF  ID  A1  A2  A3  A4  Mem WB
    ...                                IF  ID  EX  Mem WB
    ...                                    IF  ID  EX  Mem WB
    L.D   F2, 0(R2)                            IF  ID  EX  Mem WB

Solving Register Write Conflicts
- First approach: track the use of the write port in the ID stage and stall an instruction before it issues
  - use a shift register that indicates when already issued instructions will use the register file
  - if there is a conflict with an already issued instruction, stall the instruction for one clock cycle
  - on each clock cycle the reservation register is shifted one bit
- Alternative approach: stall a conflicting instruction when it tries to enter MEM or WB stage
  - we can stall either instruction
  - e.g. give priority to the unit with the longest latency
  - Pros: the conflict does not need to be detected until the entrance of the MEM or WB stage
  - Cons: complicates pipeline control; stalls now can arise from two different places
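
The first approach can be sketched as a small reservation shift register. The class and helper names below are invented for illustration, and the example distances (9 cycles from ID to WB for MUL.D, 6 for ADD.D) are read off the pipeline timings shown earlier; treat them as assumptions of this sketch rather than hardware parameters.

```python
class WritePortReservation:
    """bits[k] is True when the register write port is already reserved
    k cycles from now by an instruction that has issued."""

    def __init__(self, depth=32):
        self.bits = [False] * depth

    def try_issue(self, cycles_until_wb):
        """Reserve the port for an instruction in ID; False means conflict."""
        if self.bits[cycles_until_wb]:
            return False
        self.bits[cycles_until_wb] = True
        return True

    def tick(self):
        """Advance one clock cycle: shift the register by one bit."""
        self.bits = self.bits[1:] + [False]


def issue_with_stalls(port, cycles_until_wb):
    """Stall one cycle at a time until the write port slot is free."""
    stalls = 0
    while not port.try_issue(cycles_until_wb):
        port.tick()
        stalls += 1
    return stalls


port = WritePortReservation()
mul_stalls = issue_with_stalls(port, 9)   # MUL.D in ID: WB is 9 cycles away
for _ in range(3):                        # three cycles later ADD.D is in ID
    port.tick()
add_stalls = issue_with_stalls(port, 6)   # ADD.D would write in the same cycle
print(mul_stalls, add_stalls)
```

In this run ADD.D finds its write-back slot already taken by MUL.D and is held in ID for one cycle, which is the stall-before-issue behavior the slide describes.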

WAW Hazards

    ...                IF  ID  EX  Mem WB
    ADD.D F2, F4, F6       IF  ID  A1  A2  A3  A4  Mem WB
    ...                        IF  ID  EX  Mem WB
    L.D   F2, 0(R2)                IF  ID  EX  Mem WB

- Result of ADD.D is overwritten without any instruction ever using it
  - WAWs occur when a useless instruction is executed
  - still, we must detect them and provide correct execution. Why?

          BNEZ  R1, foo
          DIV.D F0, F2, F4   ; delay slot from fall-through
          ...
    foo:  L.D   F0, qrs

Solving WAW Hazards
- First approach: delay the issue of the load instruction until ADD.D enters MEM
- Second approach: stamp out the result of the ADD.D by detecting the hazard and changing the control so that ADD.D does not write; L.D issues right away
- Detect hazard in ID when L.D is issuing
  - stall L.D, or
  - make ADD.D a no-op
- Luckily this hazard is rare

Hazard Detection in ID Stage
- Possible hazards
  - hazards among FP instructions
  - hazards between an FP instruction and an integer instruction
    - FP and integer registers are distinct, except for FP load-stores and FP-integer moves
- Assume that pipeline does all hazard detection in ID stage

Hazard Detection in ID Stage (cont'd)
- Check for structural hazards
  - wait until the required functional unit is not busy and make sure that the register write port is available
- Check for RAW data hazards
  - wait until source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result
- Check for WAW data hazards
  - determine if any instruction in A1, ..., A4, M1, ..., M7, D has the same register destination as this instruction; if so, stall the issue of the instruction in ID

Forwarding Logic
- Check if the destination register in any of the EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of an FP instruction
- If so, the appropriate input multiplexer will have to be enabled so as to choose the forwarded data