Computer Architecture: A Quantitative Approach, Fifth Edition
Appendix C. Pipelining: Basic and Intermediate Concepts
Advanced Computer Architecture 2019 @ Utsunomiya University

Outline
  • A “Typical” RISC ISA (RISC-V)
  • 5-stage pipelining
  • Structural Hazards
  • Data Hazards & Forwarding
  • Branch Hazards
  • Handling Exceptions
  • Floating Point Operations

A "Typical" RISC ISA
  • 32-bit fixed-format instructions
  • 32 32-bit GPRs (x0 contains zero)
  • 3-address, reg-reg arithmetic instructions
  • Single addressing mode for load/store: base + displacement
  • Simple branch conditions (delayed branch)
see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Summary of RISC-V ISA (RV32I)
[Figure: register file x0, x1, ..., x31 and the PC]
  • Programmable storage
    – 2^32 x bytes of memory
    – 31 x 32-bit GPRs (x0 always zero)
    – PC
  • 32-bit instructions on word boundary
  • Arithmetic/logical
    – add, sub, and, or, xor, sltu, addi, sltiu, andi, ori, xori, lui, auipc
    – sll, sra, slli, srai
  • Memory access
    – lb, lbu, lhu, lw
    – sb, sh, sw
  • Control
    – jal, jalr
    – beq, bne, blt, bge, bltu, bgeu

Instruction Format (RV32I)
  – Location of each register operand field is fixed
  – MSB indicates sign of immediate value
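The two properties above can be illustrated with a short decoding sketch. It assumes the standard RV32I I-type layout (opcode in bits 6:0, rd in 11:7, funct3 in 14:12, rs1 in 19:15, immediate in 31:20); the helper names `sign_extend` and `decode_i_type` are made up for this example and are not part of the slides.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def decode_i_type(instr: int) -> dict:
    """Split a 32-bit I-type instruction word into fields (illustrative helper)."""
    return {
        "opcode": instr & 0x7F,               # bits 6:0
        "rd": (instr >> 7) & 0x1F,            # bits 11:7, same place in every format that has rd
        "funct3": (instr >> 12) & 0x7,        # bits 14:12
        "rs1": (instr >> 15) & 0x1F,          # bits 19:15, same place in every format that has rs1
        "imm": sign_extend(instr >> 20, 12),  # bits 31:20, bit 31 carries the sign
    }


fields = decode_i_type(0xFFF10093)  # encoding of: addi x1, x2, -1
print(fields)
```

Because rs1 always sits in the same bits, the register file can be read before the instruction type is known, which is what lets decode and register fetch share one stage.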

Datapath vs Control
[Figure: the Controller drives the Datapath through control points; the Datapath returns signals to the Controller]
  • Datapath: storage, FUs, and interconnect sufficient to perform the desired functions
    – Inputs are control points
    – Outputs are signals
  • Controller: state machine to orchestrate operation on the datapath
    – Based on desired function and signals

5 Steps of RISC-V Datapath
[Figure: datapath divided into five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components shown: Next PC mux, adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with zero output, data memory, WB data mux]

Inst. Set Processor Controller
[Figure: controller state diagram; register transfers per state:]
  Ifetch:          IR <= mem[PC]; PC <= PC + 4
  opFetch:         A <= Reg[IRrs1]; B <= Reg[IRrs2]
  Branch:          if bop(A, B) PC <= PC + IRimm
  Jump (and link): r <= PC; PC <= PC + IRimm; WB <= r; Reg[IRrd] <= WB
  RR:              r <= A op_IRop B; WB <= r; Reg[IRrd] <= WB
  RI:              r <= A op_IRop IRimm; WB <= r; Reg[IRrd] <= WB
  LD:              r <= A + IRimm; WB <= Mem[r]; Reg[IRrd] <= WB

5 Steps of RISC-V Datapath & Stage Registers
[Figure: the five-step datapath with stage registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the steps; the RD field is carried along through the stage registers to Write Back]

Visualizing Pipelining
[Figure: instructions in program order (vertical axis) against time in clock cycles 1 to 7 (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
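Since the diagram itself did not survive, the following sketch regenerates a text version of it. It assumes an ideal pipeline with no hazards, where instruction i enters Ifetch in cycle i + 1; the function names are illustrative only.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]  # stage names as used on the slide


def stage_in_cycle(instr_index: int, cycle: int):
    """Stage occupied by instruction `instr_index` (0-based) in clock
    cycle `cycle` (1-based), or None if it is not in the pipeline."""
    offset = cycle - 1 - instr_index
    return STAGES[offset] if 0 <= offset < len(STAGES) else None


def diagram(num_instrs: int) -> str:
    """Render one row per instruction, one column per clock cycle."""
    total_cycles = num_instrs + len(STAGES) - 1
    rows = []
    for i in range(num_instrs):
        cells = [stage_in_cycle(i, c) or "" for c in range(1, total_cycles + 1)]
        rows.append(f"I{i + 1}: " + " ".join(f"{cell:<6}" for cell in cells).rstrip())
    return "\n".join(rows)


print(diagram(3))
```

Three instructions finish in 7 cycles rather than 15, which is the overlap the figure is meant to show.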

Pipelining is not quite that easy!
  • Limits to pipelining: Hazards prevent the next instruction from executing during its designated clock cycle
    – Structural hazards: A required resource is busy
    – Data hazards: An instruction depends on the result of a prior instruction still in the pipeline
    – Control hazards: Caused by the delay between the fetching of instructions and decisions about changes in control flow

One Memory Port/Structural Hazards
[Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order over clock cycles 1 to 7; with a single memory port, the Load's DMem access and a later instruction's Ifetch need the memory in the same cycle]

One Memory Port/Structural Hazards
[Figure: Load, Instr 1, Instr 2, Stall, Instr 3 in program order over clock cycles 1 to 7; the stall appears as a row of bubbles, and Instr 3 is fetched one cycle later]
How do you “bubble” the pipe?

Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:
SpeedUp = Pipeline Depth/(1 + pipeline stall cycles per instruction) x (clock_unpipe/clock_pipe)
where clock_unpipe and clock_pipe are the unpipelined and pipelined clock cycle times.

Example: Dual-port vs. Single-port
  • Machine A: Dual ported memory (“Harvard Architecture”)
  • Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of instructions executed
SpeedUp_A = Pipeline Depth/(1 + 0) x (clock_unpipe/clock_pipe)
          = Pipeline Depth
SpeedUp_B = Pipeline Depth/(1 + 0.4 x 1) x (clock_unpipe/(clock_unpipe/1.05))
          = (Pipeline Depth/1.4) x 1.05
          = 0.75 x Pipeline Depth
SpeedUp_A / SpeedUp_B = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
  • Machine A is 1.33 times faster
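The arithmetic in this example can be checked with a few lines of code. This is a sketch only: the function name `pipeline_speedup` and the pipeline depth of 5 are assumptions for illustration (the depth cancels in the ratio), and the clock ratio is expressed relative to Machine A's pipelined clock, as on the slide.

```python
def pipeline_speedup(depth: float, stall_cpi: float, clock_ratio: float = 1.0) -> float:
    """depth / (1 + stall cycles per instruction) * (unpipelined / pipelined cycle time)."""
    return depth / (1.0 + stall_cpi) * clock_ratio


DEPTH = 5  # assumed; any depth gives the same A/B ratio

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)

# Machine B: every load (40% of instructions) stalls 1 cycle, clock is 1.05x faster.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(f"A: {speedup_a:.2f}  B: {speedup_b:.2f}  A/B: {speedup_a / speedup_b:.2f}")
```

The faster clock on Machine B recovers only part of what the structural stalls cost, so A still wins by about a third.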

Data Hazard on x1
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB; instructions in program order:]
    add x1, x2, x3
    sub x4, x1, x3
    and x6, x1, x7
    or  x8, x1, x9
    xor x10, x1, x11

Three Generic Data Hazards
  • Read After Write (RAW): Instr. J tries to read operand before Instr. I writes it
      I: add x1, x2, x3
      J: sub x4, x1, x3
  • Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
  • Write After Read (WAR): Instr. J writes operand before Instr. I reads it
      I: sub x4, x1, x3
      J: add x1, x2, x3
      K: add x6, x1, x7
  • Called an “anti-dependence” by compiler writers. This results from reuse of the name “x1”.
  • Can’t happen in RISC-V 5 stage pipeline because:
    – All instructions take 5 stages, and
    – Reads are always in stage 2, and
    – Writes are always in stage 5

Three Generic Data Hazards
  • Write After Write (WAW): Instr. J writes operand before Instr. I writes it
      I: sub x1, x4, x3
      J: add x1, x2, x3
      K: add x6, x1, x7
  • Called an “output dependence” by compiler writers. This also results from the reuse of name “x1”.
  • Can’t happen in RISC-V 5 stage pipeline because:
    – All instructions take 5 stages, and
    – Writes are always in stage 5
  • Will see WAR and WAW in more complicated pipes
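The three definitions reduce to comparisons between the destination and source registers of an earlier instruction I and a later instruction J. The sketch below classifies the example pairs from these slides; the `Instr` type and `dependences` function are hypothetical names introduced for illustration, and the special case of x0 (whose writes are discarded) is deliberately ignored.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Dependences from earlier instruction i to later instruction j.
    Whether a dependence becomes a hazard depends on the pipeline."""
    kinds = set()
    if i.dest in j.srcs:
        kinds.add("RAW")  # j reads what i writes (true dependence)
    if j.dest in i.srcs:
        kinds.add("WAR")  # j overwrites what i still reads (anti-dependence)
    if i.dest == j.dest:
        kinds.add("WAW")  # both write the same name (output dependence)
    return kinds


raw = dependences(Instr("add", "x1", ("x2", "x3")), Instr("sub", "x4", ("x1", "x3")))
war = dependences(Instr("sub", "x4", ("x1", "x3")), Instr("add", "x1", ("x2", "x3")))
waw = dependences(Instr("sub", "x1", ("x4", "x3")), Instr("add", "x1", ("x2", "x3")))
print(raw, war, waw)
```

Only the RAW case can stall the 5-stage pipeline; the other two are reported as dependences but never become hazards there, for the reasons listed above.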

Forwarding to Avoid Data Hazard
[Figure: pipeline diagram over time (clock cycles); the result of add is forwarded from the pipeline registers to the ALU inputs of the following instructions:]
    add x1, x2, x3
    sub x4, x1, x3
    and x6, x1, x7
    or  x8, x1, x9
    xor x10, x1, x11

HW Change for Forwarding
[Figure: datapath with NextPC, Registers, Immediate, ID/EX, muxes on both ALU inputs, ALU, EX/MEM, Data Memory, mux, MEM/WB; the ALU input muxes also select values fed back from EX/MEM and MEM/WB]
What circuit detects and resolves this hazard?
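One common answer to the question above is a forwarding unit that compares the destination register numbers held in EX/MEM and MEM/WB against each ALU source register. The sketch below is a textbook-style formulation and not necessarily the circuit on the slide; the function name, argument names, and the returned labels are all assumptions.

```python
def forward_select(src_reg: int,
                   ex_mem_rd: int, ex_mem_writes: bool,
                   mem_wb_rd: int, mem_wb_writes: bool) -> str:
    """Choose the source for one ALU operand (illustrative mux control).

    Register numbers are integers 0..31; x0 is never forwarded because
    it always reads as zero.
    """
    if ex_mem_writes and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"    # newest value: produced by the previous instruction
    if mem_wb_writes and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"    # value produced two instructions earlier
    return "REGFILE"       # no hazard: use the value read in ID


# sub x4, x1, x3 in EX right after add x1, x2, x3 (now in MEM)
print(forward_select(1, ex_mem_rd=1, ex_mem_writes=True, mem_wb_rd=0, mem_wb_writes=False))
```

Checking EX/MEM first matters when both stage registers hold a write to the same register: the younger result is the one the program order requires.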

Forwarding to Avoid LW-SW Data Hazard
[Figure: pipeline diagram over time (clock cycles); the value loaded by lw is forwarded to the following sw:]
    add x1, x2, x3
    lw  x4, 0(x1)
    sw  x4, 12(x1)
    or  x8, x6, x9
    xor x10, x9, x11

Data Hazard Even with Forwarding
[Figure: pipeline diagram over time (clock cycles); the loaded value is not available until after DMem, too late for the ALU stage of the next instruction:]
    lw  x1, 0(x2)
    sub x4, x1, x6
    and x6, x1, x7
    or  x8, x1, x9

Data Hazard Even with Forwarding
[Figure: the same sequence with a one-cycle bubble inserted after the load, delaying sub, and, and or by one cycle:]
    lw  x1, 0(x2)
    sub x4, x1, x6
    and x6, x1, x7
    or  x8, x1, x9
How is this detected?
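A typical way to detect this case is an interlock in the decode stage: stall when the instruction currently in EX is a load whose destination matches a source of the instruction being decoded. The following is a minimal sketch under that assumption; `must_stall` and its argument names are illustrative and do not come from the slides.

```python
def must_stall(id_ex_is_load: bool, id_ex_rd: int,
               if_id_rs1: int, if_id_rs2: int) -> bool:
    """Load-use interlock check performed while the consumer is in ID.

    Returns True when one bubble must be inserted because forwarding
    cannot deliver the loaded value in time for the consumer's EX stage.
    """
    return bool(id_ex_is_load
                and id_ex_rd != 0
                and id_ex_rd in (if_id_rs1, if_id_rs2))


# lw x1, 0(x2) in EX while sub x4, x1, x6 is in ID: stall one cycle
print(must_stall(True, id_ex_rd=1, if_id_rs1=1, if_id_rs2=6))
```

After the one-cycle bubble the loaded value sits in MEM/WB, so ordinary forwarding supplies it to the ALU.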

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW  Rb, b
    LW  Rc, c
    ADD Ra, Rb, Rc
    SW  a, Ra
    LW  Re, e
    LW  Rf, f
    SUB Rd, Re, Rf
    SW  d, Rd

Fast code:
    LW  Rb, b
    LW  Rc, c
    LW  Re, e
    ADD Ra, Rb, Rc
    LW  Rf, f
    SW  a, Ra
    SUB Rd, Re, Rf
    SW  d, Rd

Compiler optimizes for performance. Hardware checks for safety.
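The benefit of the reordering can be counted mechanically. The sketch below assumes the 5-stage pipeline with forwarding described earlier, where an instruction stalls one cycle only if it reads the destination of the load immediately before it. It is a simplified model (for example, it would also charge a stall to a store that uses the loaded value as store data, which does not occur in these sequences), and the tuple encoding is invented for this example.

```python
def count_load_use_stalls(program) -> int:
    """program: list of (op, dest_reg_or_None, source_regs) tuples.
    Counts one stall for each instruction that reads the destination
    of the load directly ahead of it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(count_load_use_stalls(slow), count_load_use_stalls(fast))
```

Under this model the slow sequence pays two load-use stalls (before ADD and before SUB) and the fast sequence pays none, with the same eight instructions.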

Control Hazard on Branches: Three Stage Stall
[Figure: pipeline diagram; the branch outcome is known only after three younger instructions have been fetched:]
    10: beq x1, x3, 36
    14: and x2, x3, x5
    18: or  x6, x1, x7
    22: add x8, x1, x9
    36: xor x10, x1, x11
What do you do with the 3 instructions in between? How do you do it? Where is the “commit”?

Branch Stall Impact
  • If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
  • Two part solution:
    – Determine branch taken or not sooner, AND
    – Compute taken branch address earlier
  • RISC-V Solution:
    – Move Zero-test to ID stage
    – Adder to calculate new PC in ID stage
    – 1 clock cycle penalty for branch versus 3
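The CPI figure above follows from adding the average stall cycles per instruction to the base CPI. A minimal sketch, with a made-up function name, that also shows the effect of cutting the penalty from 3 cycles to 1:

```python
def cpi_with_branch_stalls(base_cpi: float, branch_fraction: float,
                           penalty_cycles: float) -> float:
    """Effective CPI = base CPI + (fraction of branches x stall cycles per branch)."""
    return base_cpi + branch_fraction * penalty_cycles


cpi_3_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)  # branch resolved late
cpi_1_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)  # branch resolved in ID
print(f"{cpi_3_cycle:.1f} -> {cpi_1_cycle:.1f}")
```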

Pipelined RISC-V Datapath
[Figure: five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with stage registers IF/ID, ID/EX, EX/MEM, MEM/WB; the zero test and the branch-target adder are placed in the decode stage and feed the Next PC mux]

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • “Squash” instructions in pipeline if branch actually taken
  • Advantage of late pipeline state update
  • 47% branches not taken on average
  • PC+4 already calculated, so use it to get next instruction
  • ARM and RISC-V use this
#3: Predict Branch Taken
  • 53% branches taken on average
  • But the branch target address has not been calculated yet, so in RISC-V this still incurs a 1-cycle branch penalty

Four Branch Hazard Alternatives
#4: Delayed Branch
  • Define branch to take place AFTER a following instruction
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n
      target of taken branch
    (successors 1 to n form the branch delay of length n)
  • 1 slot delay allows proper decision and branch target address in 5 stage pipeline

Scheduling Branch Delay Slots

A. From before branch
    add x1, x2, x3
    if x2=0 then
        [delay slot]
  becomes
    if x2=0 then
        add x1, x2, x3

B. From branch target
    sub x4, x5, x6
    ...
    add x1, x2, x3
    if x1=0 then
        [delay slot]
  becomes
    add x1, x2, x3
    if x1=0 then
        sub x4, x5, x6

C. From fall through
    add x1, x2, x3
    if x1=0 then
        [delay slot]
    or x7, x8, x9
    sub x4, x5, x6
  becomes
    add x1, x2, x3
    if x1=0 then
        or x7, x8, x9
    sub x4, x5, x6

  • A is the best choice: it fills the delay slot and reduces instruction count (IC)
  • In B, the sub instruction may need to be copied, increasing IC
  • In B, it must be okay to execute sub when the branch is not taken; in C, it must be okay to execute or when the branch is taken

Delayed Branch
  • Compiler effectiveness for single branch delay slot:
    – Fills about 60% of branch delay slots
    – About 80% of instructions executed in branch delay slots are useful in computation
    – About 50% (60% x 80%) of slots usefully filled
  • Delayed branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
    – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
    – Growth in available transistors has made dynamic approaches relatively cheaper

Evaluating Branch Alternatives
Assume 4% unconditional branch, 6% conditional branch-untaken, 10% conditional branch-taken

Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.60   3.1                      1.0
Predict taken        1                1.20   4.2                      1.33
Predict not taken    1                1.14   4.4                      1.40
Delayed branch       0.5              1.10   4.5                      1.45
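The table can be reproduced from the stated branch mix. The sketch below rests on two assumptions that are not spelled out on the slide but match its numbers: a pipeline depth of 5 for the speedup over the unpipelined machine, and predict-not-taken paying its penalty only on unconditional and taken branches (14% of instructions), while the other schemes pay on all 20%.

```python
UNCOND, UNTAKEN, TAKEN = 0.04, 0.06, 0.10   # branch mix from the slide
ALL_BRANCHES = UNCOND + UNTAKEN + TAKEN
DEPTH = 5                                   # assumed pipeline depth

# Average stall cycles per instruction for each scheme (assumed accounting).
stall_cycles = {
    "Stall pipeline": 3 * ALL_BRANCHES,
    "Predict taken": 1 * ALL_BRANCHES,
    "Predict not taken": 1 * (UNCOND + TAKEN),
    "Delayed branch": 0.5 * ALL_BRANCHES,
}

cpi = {name: 1.0 + stalls for name, stalls in stall_cycles.items()}
speedup_vs_unpipelined = {name: DEPTH / value for name, value in cpi.items()}
speedup_vs_stall = {name: cpi["Stall pipeline"] / value for name, value in cpi.items()}

for name in cpi:
    print(f"{name:<18} CPI {cpi[name]:.2f}  "
          f"v. unpipelined {speedup_vs_unpipelined[name]:.2f}  "
          f"v. stall {speedup_vs_stall[name]:.2f}")
```

The unrounded speedups come out slightly different from the table in the last digit (for example 4.55 against the slide's 4.5), which is consistent with the slide truncating to one decimal.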

  • Exception: An unusual event happens to an instruction during its execution
    – Examples: divide by zero, undefined opcode
  • Interrupt: Hardware signal to switch the processor to a new instruction stream
    – Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting)
  • Problem: The exception or interrupt must appear to occur between 2 instructions (I_i and I_i+1)
    – The effect of all instructions up to and including I_i is totally completed
    – No effect of any instruction after I_i can take place
  • The interrupt (exception) handler either aborts the program or restarts at instruction I_i+1

Precise Exceptions in Static Pipelines
Key observation: architected state changes only in the memory and register write stages.

Floating Point Operations
  • FP pipeline will allow for a longer latency
    – It is impractical to require that all FP operations complete in 1 clock cycle; that would mean either:
      – Slow clock rate
      – Enormous amounts of logic circuit
  • Multiple FUs (functional units)
    – Integer unit, FP/Int multiplier, FP adder, FP/Int divider
  • Two alternatives
    – Multicycle (unpipelined) FP unit
    – Pipelined FP unit
  • Latency and initiation interval (or repeat interval)

Multicycle (unpipelined) FP units

Pipelined FP units
WAR and WAW hazards may appear because instructions pass through pipelines of different lengths

Latency and initiation interval

Functional unit       Latency   Initiation interval
Integer ALU           0         1
Data memory (load)    1         1
FP add                3         1
FP/Int multiply       6         1
FP/Int divide         24        25

  • A new FP add can start every cycle, but each result is available only after a latency of 3 cycles
  • The FP divider is not pipelined and requires 24 cycles to complete
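The two columns answer different questions, which the sketch below makes concrete. It assumes the usual textbook definitions: latency is the number of intervening cycles between a producing instruction and a dependent one, and the initiation interval is the number of cycles between issuing two operations on the same unit. The function names are invented for this example.

```python
# (latency, initiation interval) per functional unit, copied from the table
UNITS = {
    "Integer ALU": (0, 1),
    "Data memory (load)": (1, 1),
    "FP add": (3, 1),
    "FP/Int multiply": (6, 1),
    "FP/Int divide": (24, 25),
}


def issue_cycles(unit: str, count: int, start: int = 0) -> list:
    """Cycles at which `count` independent operations can issue to one unit."""
    _, interval = UNITS[unit]
    return [start + k * interval for k in range(count)]


def earliest_dependent_issue(unit: str, producer_issue: int) -> int:
    """Earliest issue cycle for an instruction that uses the producer's result."""
    latency, _ = UNITS[unit]
    return producer_issue + 1 + latency


print(issue_cycles("FP add", 3))            # pipelined: back-to-back issue
print(issue_cycles("FP/Int divide", 3))     # unpipelined: one at a time
print(earliest_dependent_issue("FP add", 0))
```

With latency 0 an ALU result can be used by the very next instruction, and with latency 1 a load forces the single bubble seen in the load-use slides.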