EEL 5764: Graduate Computer Architecture
Appendix A: Pipelining Review
Ann Gordon-Ross
Electrical and Computer Engineering, University of Florida
http://www.ann.ece.ufl.edu/
These slides are provided by David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley. Modifications/additions have been made from the originals.

What is Pipelining?
• Overlapping execution to produce faster results
  – Washing and drying dishes
  – Washing and drying laundry
  – Automobile assembly line
  – Chipotle, Quiznos, etc.
• Pipelining in computer architecture
  – Multiple instructions are overlapped in execution
  – Exploits parallelism
  – Not visible to programmer
• Each stage is a pipeline “cycle”
  – Each stage happens simultaneously, so results are produced only as fast as the longest pipeline cycle
  – Determines clock cycle time
12/3/2020
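The point that the longest stage sets the clock can be illustrated with a minimal Python sketch. The stage latencies below are hypothetical values chosen for illustration; they are not taken from the slides.

```python
# Hypothetical stage latencies in nanoseconds (illustrative values only).
stage_latency_ns = {"IF": 2.0, "ID": 1.0, "EX": 2.0, "MEM": 2.5, "WB": 1.0}

def pipelined_cycle_time(latencies):
    """The clock period must cover the slowest stage."""
    return max(latencies.values())

def unpipelined_time_per_instruction(latencies):
    """Without overlap, one instruction occupies the hardware for every stage."""
    return sum(latencies.values())

def total_time_pipelined(latencies, n_instructions):
    """Ideal pipeline: fill it once, then complete one instruction per cycle."""
    depth = len(latencies)
    return (depth + n_instructions - 1) * pipelined_cycle_time(latencies)

print(pipelined_cycle_time(stage_latency_ns))                    # clock period
print(total_time_pipelined(stage_latency_ns, 100))               # 100 instructions, pipelined
print(100 * unpipelined_time_per_instruction(stage_latency_ns))  # 100 instructions, unpipelined
```

With these assumed numbers the speedup is well below the pipeline depth of 5, because the stages are not perfectly balanced.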

Outline
• MIPS – An ISA for Pipelining
• 5-stage pipelining
• Structural and Data Hazards
• Forwarding
• Branch Schemes
• Exceptions and Interrupts

A "Typical" RISC ISA (Load/Store)
• 32-bit fixed format instruction (3 formats)
• 32 32-bit GPRs (R0 contains zero)
• ALU instructions
  – 3-address, reg-reg arithmetic instruction
  – 2-address, reg-imm arithmetic instruction
• Single address mode for load/store: base + displacement
  – no indirection
• Simple branch conditions
• Delayed branch

Example: MIPS

Register-Register
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-0: Opx (a boundary is also marked between bits 6 and 5)

Register-Immediate
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate

Branch
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rs2/Opx | 15-0: immediate

Jump / Call
  bits 31-26: Op | 25-0: target

Datapath vs Control (FSM+D)
[Figure: Datapath and Controller blocks; the Controller drives Control Points on the Datapath, and the Datapath returns signals to the Controller.]
• Datapath: Storage, FU, interconnect sufficient to perform the desired functions
  – Inputs are Control Points
  – Outputs are signals
• Controller: State machine to orchestrate operation on the data path
  – Based on desired function and signals

Approaching an ISA
• Instruction Set Architecture
  – Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
• Meaning of each instruction is described by RTL on architected registers and memory
• Given technology constraints, assemble adequate datapath
  – Architected storage mapped to actual storage
  – Function units to do all the required operations
  – Possible additional storage (e.g. MAR, MBR, …)
  – Interconnect to move information among regs and FUs
• Implement controller (Finite State Machine (FSM))

Outline
• MIPS – An ISA for Pipelining
• 5-stage pipelining
• Structural and Data Hazards
• Forwarding
• Branch Schemes
• Exceptions and Interrupts

5 Steps of MIPS Datapath (Figure A.2, Page A-8)
[Figure: unpipelined datapath divided into five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled components: Next PC, Next SEQ PC, Adder (+4), Inst Memory (Address), Reg File (RS1, RS2, RD), Sign Extend, Imm, MUXes, Zero?, ALU, Data Memory, LMD, WB Data.]

5 Steps of MIPS Datapath (Figure A.3, Page A-9)
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the five steps; RD is carried along through the pipeline registers.]
RTL for each step:
  Instruction Fetch:            IR <= mem[PC]; PC <= PC + 4
  Instr. Decode / Reg. Fetch:   A <= Reg[IRrs]; B <= Reg[IRrt]
  Execute / Addr. Calc:         rslt <= A op(IRop) B
  Memory Access:                WB <= rslt
  Write Back:                   Reg[IRrd] <= WB
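The five RTL steps for a register-register instruction can be traced with a short Python sketch. The representation is an assumption made for illustration: instruction memory is a dictionary keyed by address, instructions are already decoded into dictionaries, and only two hypothetical ALU operations are modeled.

```python
def run_rr_instruction(regs, mem, pc):
    """Step one register-register instruction through the five RTL steps."""
    alu_ops = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

    # Instruction Fetch: IR <= mem[PC]; PC <= PC + 4
    ir = mem[pc]
    next_pc = pc + 4

    # Instr. Decode / Reg. Fetch: A <= Reg[IRrs]; B <= Reg[IRrt]
    a = regs[ir["rs"]]
    b = regs[ir["rt"]]

    # Execute: rslt <= A op(IRop) B
    rslt = alu_ops[ir["op"]](a, b)

    # Memory Access: nothing to do for an ALU instruction, WB <= rslt
    wb = rslt

    # Write Back: Reg[IRrd] <= WB
    regs[ir["rd"]] = wb
    return next_pc

regs = {1: 0, 2: 5, 3: 7}
mem = {0: {"op": "add", "rd": 1, "rs": 2, "rt": 3}}  # add r1, r2, r3
pc = run_rr_instruction(regs, mem, 0)
print(pc, regs[1])
```

In the pipelined datapath each of these steps runs in its own clock cycle, with the intermediate values held in the pipeline registers rather than in local variables.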

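Inst. Set Processor Controller
[Figure: controller state diagram. States and their RTL, as far as recoverable:]
  Ifetch:        IR <= mem[PC]; PC <= PC + 4
  opFetch-DCD:   A <= Reg[IRrs]; B <= Reg[IRrt]
  Instruction classes dispatched after decode: JSR, JR, jmp, br, RR, RI, LD, ST
  jmp:           PC <= IRjaddr
  br:            if bop(A, b) PC <= PC + IRim
  RR:            r <= A op(IRop) B;    WB <= r;       Reg[IRrd] <= WB
  RI:            r <= A op(IRop) IRim; WB <= r;       Reg[IRrd] <= WB
  LD:            r <= A + IRim;        WB <= Mem[r];  Reg[IRrd] <= WB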

Visualizing Pipelining (Figure A.2, Page A-8)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Vertical axis: instructions in program order. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the instruction before it.]
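The staggered layout of the diagram can be reproduced with a small Python sketch. The stage names follow the figure labels; the function name and the choice of three instructions are assumptions for illustration.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]

def pipeline_diagram(n_instructions):
    """Return one row per instruction; row i enters the pipeline in cycle i + 1."""
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # stage s of instruction i occupies cycle i + s + 1
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_diagram(3)):
    print(f"instr {i}: " + " ".join(f"{cell:<6}" for cell in row))
```

Three instructions finish in 7 cycles instead of 15, since a new instruction starts every cycle once the pipeline is filling.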

Pipelining is not quite that easy!
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  – Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
  – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

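One Memory Port/Structural Hazards (Figure A.4, Page A-14)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Vertical axis: instructions in program order: Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]
With a single memory port, the Load's DMem access in cycle 4 conflicts with the Ifetch of Instr 3 in the same cycle.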

One Memory Port/Structural Hazards (Similar to Figure A.5, Page A-15)
[Figure: the same pipeline diagram with rows Load, Instr 1, Instr 2, Stall, Instr 3. The Stall row is a bubble that moves through every stage, and Instr 3 is fetched one cycle later.]
How do you “bubble” the pipe?

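Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:
\[
\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}}
\]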

Example: Dual port vs. Single port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
\[
\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}
\]
\[
\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth}
\]
\[
\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33
\]
• Machine A is 1.33 times faster
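The comparison can be checked numerically with a minimal Python sketch. The pipeline depth of 5 is an assumption (the slide leaves it symbolic, and the ratio does not depend on it); the function and variable names are illustrative.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over the unpipelined machine, assuming ideal CPI = 1.

    clock_ratio is (unpipelined cycle time) / (pipelined cycle time),
    not counting the factor already captured by the pipeline depth.
    """
    return depth / (1.0 + stall_cpi) * clock_ratio

depth = 5  # assumed; any depth gives the same A/B ratio

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: 40% loads, each costing 1 stall cycle, with a 1.05x faster clock.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(round(speedup_a, 2), round(speedup_b, 2), round(speedup_a / speedup_b, 2))
```

The faster clock of Machine B recovers only 5% of the 40% CPI increase, which is why Machine A still comes out ahead.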

Data Hazard on R1 (Figure A.6, Page A-17)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles, stages IF, ID/RF, EX, MEM, WB. Vertical axis: instructions in program order:]
  add r1, r2, r3
  sub r4, r1, r3
  and r6, r1, r7
  or  r8, r1, r9
  xor r10, r1, r11

Three Generic Data Hazards
• Read After Write (RAW): Instr. J tries to read operand before Instr. I writes it
    I: add r1, r2, r3
    J: sub r4, r1, r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): Instr. J writes operand before Instr. I reads it
    I: sub r4, r1, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): Instr. J writes operand before Instr. I writes it
    I: sub r1, r4, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
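The three categories can be told apart purely from register names, as in the following Python sketch. The `Instr` record and `name_dependences` function are hypothetical helpers, not part of the slides. The sketch reports dependences between an earlier instruction I and a later instruction J; whether a dependence becomes an actual hazard depends on the pipeline, as the bullets above note.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]

def name_dependences(i: Instr, j: Instr) -> Set[str]:
    """Classify the dependences of later instruction j on earlier instruction i."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes (true dependence)
    if j.dest in i.srcs:
        found.add("WAR")  # j overwrites what i reads (anti-dependence)
    if i.dest == j.dest:
        found.add("WAW")  # both write the same name (output dependence)
    return found

# The I/J pairs from the three slides above.
print(name_dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3"))))
print(name_dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3"))))
print(name_dependences(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3"))))
```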

Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)
[Figure: pipeline diagram, time in clock cycles across, instructions in program order down, with the result of the add forwarded to the ALU inputs of the instructions that follow:]
  add r1, r2, r3
  sub r4, r1, r3
  and r6, r1, r7
  or  r8, r1, r9
  xor r10, r1, r11

HW Change for Forwarding (Figure A.23, Page A-37)
[Figure: forwarding paths. Labeled components: NextPC, Registers, ID/EX, Immediate, muxes on the ALU inputs, ALU, EX/MEM, Data Memory, MEM/WR, output mux.]
What circuit detects and resolves this hazard?
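One common textbook answer to the question above is a forwarding unit that compares destination register numbers held in the pipeline registers against the source registers of the instruction entering the ALU. The Python sketch below is an illustration under assumptions, not the circuit from the figure: the field names `reg_write` and `rd` are hypothetical, and the EX/MEM value is assumed to be an ALU result rather than a pending load.

```python
def forward_select(src_reg, ex_mem, mem_wr):
    """Choose where one ALU input should come from.

    src_reg: register number the instruction in EX wants to read.
    ex_mem, mem_wr: dicts describing the two older instructions, with
    hypothetical fields 'reg_write' (bool) and 'rd' (destination number).
    """
    # The most recent producer wins, so check EX/MEM first.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wr["reg_write"] and mem_wr["rd"] != 0 and mem_wr["rd"] == src_reg:
        return "MEM/WR"
    return "REGFILE"  # no hazard: use the value read in ID

# sub r4, r1, r3 in EX, directly behind add r1, r2, r3 (now in EX/MEM).
older = {"reg_write": True, "rd": 1}
oldest = {"reg_write": False, "rd": 0}
print(forward_select(1, older, oldest))  # r1 operand
print(forward_select(3, older, oldest))  # r3 operand
```

The `rd != 0` check reflects that R0 always contains zero, so a write to it must never be forwarded.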

Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20)
[Figure: pipeline diagram, time in clock cycles across, instructions in program order down:]
  add r1, r2, r3
  lw  r4, 0(r1)
  sw  r4, 12(r1)
  or  r8, r6, r9
  xor r10, r9, r11

Data Hazard Even with Forwarding (Figure A.9, Page A-21)
[Figure: pipeline diagram, time in clock cycles across, instructions in program order down:]
  lw  r1, 0(r2)
  sub r4, r1, r6
  and r6, r1, r7
  or  r8, r1, r9

Data Hazard Even with Forwarding (Similar to Figure A.10, Page A-21)
[Figure: the same sequence with a bubble inserted after the load, delaying sub, and, and or by one cycle:]
  lw  r1, 0(r2)
  sub r4, r1, r6
  and r6, r1, r7
  or  r8, r1, r9
How is this detected?
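A common way to detect this case is a load-use check in the decode stage: stall when the instruction in EX is a load whose destination matches a source of the instruction being decoded. The Python sketch below assumes hypothetical pipeline-register fields (`mem_read`, `rt`, `rs`) and is one possible formulation, not necessarily the logic the slide has in mind.

```python
def must_stall(id_ex, if_id):
    """Return True when a one-cycle bubble is needed after a load.

    id_ex: the instruction currently in EX; 'mem_read' is True for a load
           and 'rt' is the register the load will write.
    if_id: the instruction being decoded; 'rs' and 'rt' are its sources.
    """
    return bool(
        id_ex["mem_read"]
        and id_ex["rt"] != 0
        and id_ex["rt"] in (if_id["rs"], if_id["rt"])
    )

lw_r1 = {"mem_read": True, "rt": 1}   # lw  r1, 0(r2)
sub_r4 = {"rs": 1, "rt": 6}           # sub r4, r1, r6
print(must_stall(lw_r1, sub_r4))
```

After the bubble, the loaded value can be forwarded from the end of the memory stage, so the later `and` and `or` need no further stalls.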

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e – f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW   Rb, b
    LW   Rc, c
    ADD  Ra, Rb, Rc
    SW   a, Ra
    LW   Re, e
    LW   Rf, f
    SUB  Rd, Re, Rf
    SW   d, Rd

Fast code:
    LW   Rb, b
    LW   Rc, c
    LW   Re, e
    ADD  Ra, Rb, Rc
    LW   Rf, f
    SW   a, Ra
    SUB  Rd, Re, Rf
    SW   d, Rd

Compiler optimizes for performance. Hardware checks for safety.
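The difference between the two schedules can be counted with a short Python sketch. It assumes the 5-stage pipeline with forwarding described earlier, so the only stall is one cycle when an instruction uses a register loaded by the instruction immediately before it. The tuple encoding `(op, dest, sources)` is an assumption for illustration, and store-data forwarding directly after a load is not modeled because it does not arise in either sequence.

```python
slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

def load_use_stalls(code):
    """Count instructions that read a register loaded by their predecessor."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls

print(load_use_stalls(slow), load_use_stalls(fast))
```

Both sequences contain the same eight instructions; the fast version only reorders them so that no consumer sits directly behind its load.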

Outline
• MIPS – An ISA for Pipelining
• 5-stage pipelining
• Structural and Data Hazards
• Forwarding
• Branch Schemes
• Exceptions and Interrupts
• Conclusion

Control Hazard on Branches: Three Stage Stall
[Figure: pipeline diagram, time in clock cycles across, instructions down, labeled with their addresses:]
  10: beq r1, r3, 36
  14: and r2, r3, r5
  18: or  r6, r1, r7
  22: add r8, r1, r9
  36: xor r10, r1, r11
What do you do with the 3 instructions in between? How do you do it? Where is the “commit”?

Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
• Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  – Move Zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3
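The CPI figure follows directly from the branch frequency and the stall length. Applying the same 30% branch frequency to the 1-cycle penalty (a figure derived here for comparison, not stated on the slide) shows what the MIPS solution buys:

\[
\text{CPI}_{3\text{-cycle}} = 1 + 0.30 \times 3 = 1.9, \qquad
\text{CPI}_{1\text{-cycle}} = 1 + 0.30 \times 1 = 1.3
\]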

Pipelined MIPS Datapath (Figure A.24, page A-38)
[Figure: pipelined datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Labeled components: Next PC, Next SEQ PC, Adder (+4), second Adder, Memory (Address), Reg File (RS1, RS2, RD), Sign Extend, Imm, MUXes, Zero?, ALU, Data Memory, WB Data.]
• Interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% MIPS branches not taken on average
  – PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  – 53% MIPS branches taken on average
  – But haven’t calculated branch target address in MIPS
    » MIPS still incurs 1 cycle branch penalty
    » Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
  – Define branch to take place AFTER a following instruction
        branch instruction
        sequential successor 1
        sequential successor 2
        ...
        sequential successor n
        branch target if taken
    (sequential successors 1 through n: branch delay of length n)
  – 1 slot delay allows proper decision and branch target address in 5-stage pipeline
  – MIPS uses this

Scheduling Branch Delay Slots (Fig A.14)

A. From before branch
      add $1, $2, $3
      if $2=0 then
        [delay slot]
   becomes
      if $2=0 then
        add $1, $2, $3

B. From branch target
      sub $4, $5, $6
      ...
      add $1, $2, $3
      if $1=0 then
        [delay slot]
   becomes
      add $1, $2, $3
      if $1=0 then
        sub $4, $5, $6

C. From fall through
      add $1, $2, $3
      if $1=0 then
        [delay slot]
      sub $4, $5, $6
   becomes
      add $1, $2, $3
      if $1=0 then
        sub $4, $5, $6

• A is the best choice, fills delay slot & reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails

Delayed Branch
• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
  – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  – Growth in available transistors has made dynamic approaches relatively cheaper

Evaluating Branch Alternatives
Assume 4% unconditional branch, 6% conditional branch untaken, 10% conditional branch taken

Scheduling scheme     Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline        3                1.60   3.1                      1.0
Predict taken         1                1.20   4.2                      1.33
Predict not taken     1                1.14   4.4                      1.40
Delayed branch        0.5              1.10   4.5                      1.45
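The table entries can be reproduced with a short Python sketch. Two assumptions are made explicit here because the slide does not state them: the ideal CPI is 1, and the "speedup v. unpipelined" column uses a pipeline depth of 5. Predict-not-taken is charged only for unconditional and taken conditional branches, which is what yields the 1.14 figure.

```python
UNCOND, COND_UNTAKEN, COND_TAKEN = 0.04, 0.06, 0.10
ALL_BRANCHES = UNCOND + COND_UNTAKEN + COND_TAKEN
PIPELINE_DEPTH = 5  # assumed 5-stage pipeline

def branch_cpi(penalized_fraction, penalty):
    """CPI with an ideal CPI of 1 plus the average branch stall cycles."""
    return 1.0 + penalized_fraction * penalty

cpi = {
    "Stall pipeline": branch_cpi(ALL_BRANCHES, 3),
    "Predict taken": branch_cpi(ALL_BRANCHES, 1),
    "Predict not taken": branch_cpi(UNCOND + COND_TAKEN, 1),
    "Delayed branch": branch_cpi(ALL_BRANCHES, 0.5),
}

for scheme, value in cpi.items():
    vs_unpipelined = PIPELINE_DEPTH / value
    vs_stall = cpi["Stall pipeline"] / value
    print(f"{scheme:<18} {value:.2f} {vs_unpipelined:.1f} {vs_stall:.2f}")
```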