Lecture 4: Pipelining EENG-633 1

Pipelining: It's Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave (loads A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

Sequential Laundry
[Timeline figure: 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes back to back]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Timeline figure: 6 PM to midnight; loads A, B, C, D overlapped in task order: one 30-minute wash, four 40-minute dryer slots, one 20-minute fold]
• Pipelined laundry takes 3.5 hours for 4 loads
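The 6-hour and 3.5-hour figures can be checked with a short simulation. This is a minimal sketch, not part of the original slides: the function names are hypothetical, and it assumes a finished load can wait between stages for as long as needed (for example, wet clothes wait for the dryer).

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage starts a load as soon as the load and the stage are both free."""
    # finish[s] is the time at which stage s finished its most recent load.
    finish = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load left the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, finish[s])
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


stages = [30, 40, 20]  # washer, dryer, "folder" (minutes)
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: the 40-minute dryer, the slowest stage, sets the rate once the pipeline is full.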

Pipelining Lessons
[Timeline figure: 6 PM to 9 PM; loads A, B, C, D overlapped in task order, with stage times 30, 40, 40, 20]
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = Number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” pipeline and time to “drain” it reduce speedup

Computer Pipelines
• Execute billions of instructions, so throughput is what matters
• MIPS/DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores

5 Steps of MIPS Datapath (Figure 3.1, Page 130, CA:AQA 2e)
[Datapath figure. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled blocks: Next PC, Next SEQ PC, Adder (+4), Inst Memory, Reg File (RS1, RS2, RD), Sign Extend, Imm, MUXes, ALU, Zero?, Data Memory, LMD, WB Data]

5 Steps of MIPS Datapath (Figure 3.4, Page 134, CA:AQA 2e)
[Datapath figure. Same five stages, now separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; RD is carried along through the pipeline registers]
• Data stationary control
– local decode for each instruction phase / pipeline stage

It's Not That Easy for Computers
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port/Structural Hazards (Figure 3.6, Page 142, CA:AQA 2e)
[Pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Vertical axis: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4). Each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

One Memory Port/Structural Hazards (Figure 3.7, Page 143, CA:AQA 2e)
[Pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Vertical axis: instruction order (Load, Instr 1, Instr 2, Stall, Instr 3). The stall appears as a row of bubbles, and Instr 3 is fetched one cycle later]

Three Generic Data Hazards
• Read After Write (RAW): Instr. J tries to read operand before Instr. I writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): Instr. J writes operand before Instr. I reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): Instr. J writes operand before Instr. I writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more complicated pipes
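The three definitions above reduce to comparing the destination and source registers of an earlier instruction I and a later instruction J. The following is an illustrative sketch only: the `Instr` tuple and the `dependences` function are hypothetical names, and the function reports register dependences, which become hazards only if the pipeline can actually reorder the accesses.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def dependences(i, j):
    """Return the dependence kinds between earlier instruction i and later j."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes (true dependence)
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")  # j overwrites what i reads (anti-dependence)
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found


# The I/J pairs from the three slides above.
print(dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3"))))  # {'RAW'}
print(dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3"))))  # {'WAR'}
print(dependences(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3"))))  # {'WAW'}
```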

Forwarding to Avoid Data Hazard (Figure 3.10, Page 149, CA:AQA 2e)
[Pipeline diagram. Horizontal axis: time in clock cycles. Vertical axis: instruction order. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10, r11

HW Change for Forwarding (Figure 3.20, Page 161, CA:AQA 2e)
[Datapath figure. Labeled blocks: Next PC, Registers, Immediate, ID/EX, muxes on the ALU inputs, ALU, EX/MEM, Data Memory, MEM/WR]

Data Hazard Even with Forwarding (Figure 3.12, Page 153, CA:AQA 2e)
[Pipeline diagram. Horizontal axis: time in clock cycles. Vertical axis: instruction order. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
    lw  r1,0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

Data Hazard Even with Forwarding (Figure 3.13, Page 154, CA:AQA 2e)
[Pipeline diagram. Same instruction sequence, with a bubble inserted in sub, and, and or so that each is delayed by one cycle after the load]
    lw  r1,0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e – f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
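The benefit of the reordering can be counted directly. This is a rough sketch under a stated assumption: with forwarding, the only stall is one cycle when an instruction uses a register loaded by the LW immediately before it. The tuple encoding and the `load_use_stalls` name are hypothetical, introduced only for illustration.

```python
def load_use_stalls(program):
    """Count one-cycle stalls caused by using a register right after loading it.

    Each instruction is (opcode, destination register or None, source registers).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(load_use_stalls(slow))  # 2: ADD follows LW Rc, SUB follows LW Rf
print(load_use_stalls(fast))  # 0: every load is separated from its first use
```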

Control Hazard on Branches: Three Stage Stall
[Pipeline diagram. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg; the leading numbers are instruction addresses]
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10, r11

Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall is significant
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
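To see why a 3-cycle stall on 30% of instructions is significant, the effect on CPI can be worked out. This is a minimal sketch assuming an ideal CPI of 1 and that every branch pays the full penalty; the function name `cpi_with_branch_stalls` is hypothetical.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles, ideal_cpi=1.0):
    """CPI when every branch stalls the pipeline for penalty_cycles."""
    return ideal_cpi + branch_fraction * penalty_cycles


# 30% branches with the original 3-cycle stall, then with the 1-cycle MIPS penalty.
print(f"{cpi_with_branch_stalls(0.30, 3):.2f}")  # 1.90
print(f"{cpi_with_branch_stalls(0.30, 1):.2f}")  # 1.30
```

Under these assumptions the 3-cycle stall nearly doubles CPI, while resolving the branch in ID/RF holds the increase to 30%.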

Pipelined MIPS Datapath (Figure 3.22, page 163, CA:AQA 2/e)
[Datapath figure. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Labeled blocks: Next PC, Next SEQ PC, Adder (+4), Reg File (RS1, RS2, RD), Sign Extend, Imm, MUXes, Zero?, ALU, Data Memory, WB Data]
• Data stationary control
– local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
» MIPS still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    . . .
    sequential successor n
    branch target if taken
  (sequential successors 1 through n form a branch delay of length n)
– 1 slot delay allows proper decision and branch target address in 5 stage pipeline
– MIPS uses this

Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Canceling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
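The "about 50%" figure is the product of the two compiler statistics above. As a hedged sketch, the same numbers can be turned into an effective CPI, assuming (this is not stated on the slide) that a delay slot that is unfilled or not useful wastes exactly one cycle, an ideal CPI of 1, and a 30% branch frequency borrowed from the branch stall example. The function names are hypothetical.

```python
def useful_slot_fraction(filled_fraction, useful_fraction):
    """Fraction of branch delay slots that do useful work."""
    return filled_fraction * useful_fraction


def cpi_with_delay_slot(branch_fraction, filled_fraction, useful_fraction, ideal_cpi=1.0):
    """CPI when each delay slot that is not usefully filled wastes one cycle (assumption)."""
    wasted_per_branch = 1.0 - useful_slot_fraction(filled_fraction, useful_fraction)
    return ideal_cpi + branch_fraction * wasted_per_branch


print(f"{useful_slot_fraction(0.60, 0.80):.2f}")       # 0.48, i.e. about 50%
print(f"{cpi_with_delay_slot(0.30, 0.60, 0.80):.3f}")  # 1.156
```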

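Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

The last form assumes Ideal CPI = 1; Ideal CPI is almost always 1.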
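The speedup equation for an ideal CPI of 1 can be evaluated for the 5-stage pipeline. This is an illustrative sketch: the function name `pipeline_speedup` is hypothetical, and the stall CPI values reuse the 30% branch frequency with 3-cycle and 1-cycle penalties, ignoring all other stalls.

```python
def pipeline_speedup(depth, stall_cpi, cycle_unpipelined=1.0, cycle_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) x (unpipelined cycle / pipelined cycle).

    Assumes an ideal CPI of 1.
    """
    return depth / (1.0 + stall_cpi) * (cycle_unpipelined / cycle_pipelined)


print(f"{pipeline_speedup(5, 0.0):.2f}")  # 5.00: no stalls, speedup equals pipeline depth
print(f"{pipeline_speedup(5, 0.9):.2f}")  # 2.63: 30% branches x 3-cycle stall
print(f"{pipeline_speedup(5, 0.3):.2f}")  # 3.85: 30% branches x 1-cycle penalty
```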

Pipelining Introduction Summary
• Just overlap tasks, and easy if tasks are independent
• Speed Up vs Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW, WAR, WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction