Computer Architecture: A Quantitative Approach, Fifth Edition
Appendix C: Pipelining: Basic and Intermediate Concepts
Advanced Computer Architecture 2019 @ Utsunomiya University

Outline
• A “Typical” RISC ISA (RISC-V)
• 5 stage pipelining
• Structural Hazards
• Data Hazards & Forwarding
• Branch Hazards
• Handling Exceptions
• Floating Point Operations

A "Typical" RISC ISA
• 32-bit fixed format instruction
• 32 32-bit GPR (x0 contains zero)
• 3-address, reg-reg arithmetic instruction
• Single addressing mode for load/store: base + displacement
• Simple branch conditions (Delayed branch)
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Summary of RISC-V ISA (RV32I)
[Figure: registers x0, x1, ..., x31 and PC]
Programmable storage:
• 2^32 x bytes
• 31 x 32-bit GPRs (x0 always zero)
• PC
32-bit instructions on word boundary
Arithmetic logical: add, sub, and, or, xor, sltu, addi, sltiu, andi, ori, xori, lui, auipc, sll, sra, slli, srai
Memory Access: lb, lbu, lhu, lw, sb, sh, sw
Control: jal, jalr, beq, bne, blt, bge, bltu, bgeu

Instruction Format (RV32I)
- Location of each register operand field is fixed
- MSB indicates sign of immediate value

Datapath vs Control
[Figure: Datapath and Controller blocks, connected by signals and Control Points]
• Datapath: Storage, FU, interconnect sufficient to perform the desired functions
  – Inputs are Control Points
  – Outputs are signals
• Controller: State machine to orchestrate operation on the data path
  – Based on desired function and signals

5 Steps of RISC-V Datapath
[Figure: datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled blocks: Next PC, Next SEQ PC, Adder (+4), Inst Memory (Address), Reg File (RS1, RS2, RD), Sign Extend (Imm), ALU, z, Data Memory, MUXes, WB Data]

Inst. Set Processor Controller
Ifetch:  IR <= mem[PC]; PC <= PC + 4
opFetch: A <= Reg[IRrs1]; B <= Reg[IRrs2]
Branch:          if bop(A, B) PC <= PC + IRimm
Jump (and link): r <= PC; PC <= PC + IRimm; WB <= r; Reg[IRrd] <= WB
RR:              r <= A op(IRop) B; WB <= r; Reg[IRrd] <= WB
RI:              r <= A op(IRop) IRimm; WB <= r; Reg[IRrd] <= WB
LD:              r <= A + IRimm; WB <= Mem[r]; Reg[IRrd] <= WB

5 Steps of RISC-V Datapath & Stage Registers
[Figure: the five-step datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with stage registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the steps; RD is carried along through the stage registers]

Visualizing Pipelining
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles), Cycle 1 to Cycle 7. Vertical axis: Instr. Order. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
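The pipeline diagram on this slide can be approximated in text. The following is a minimal sketch, assuming an ideal pipeline with no stalls in which each instruction enters one cycle after its predecessor; the function name `pipeline_diagram` and the row layout are illustrative choices, not part of the slides.

```python
# Stage names as they appear in the slide's diagram.
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction.

    row[c] holds the stage occupied in cycle c + 1, or "" when the
    instruction is not in the pipeline. Assumes no hazards or stalls.
    """
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i starts in cycle i + 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(pipeline_diagram(3)):
        print(f"Instr {i + 1}: " + " ".join(f"{cell:>6}" for cell in row))
```

With three instructions and five stages the diagram spans 5 + 3 - 1 = 7 cycles, which matches the Cycle 1 to Cycle 7 axis on the slide.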

Pipelining is not quite that easy!
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  – Structural hazards: A required resource is busy
  – Data hazards: Instruction depends on the result of prior instruction still in the pipeline
  – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow

One Memory Port/Structural Hazards
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles), Cycle 1 to Cycle 7. Vertical axis: Instr. Order, with rows Load, Instr 1, Instr 2, Instr 3, Instr 4. Stages: Ifetch, Reg, ALU, DMem, Reg]

One Memory Port/Structural Hazards
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles), Cycle 1 to Cycle 7. Vertical axis: Instr. Order, with rows Load, Instr 1, Instr 2, Stall, Instr 3. The Stall row is filled with Bubbles]
How do you “bubble” the pipe?

Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}}$$

Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

$$\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}$$

$$\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth}$$

$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33$$

• Machine A is 1.33 times faster
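The arithmetic in this example can be checked numerically. This is a small sketch, not part of the slides: the function name `pipeline_speedup` and the choice of a depth of 5 are assumptions (the depth cancels in the ratio), and the clock term is expressed as the ratio of unpipelined to pipelined cycle time.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = depth / (1 + stall CPI) * (clock_unpipe / clock_pipe).

    clock_ratio is clock_unpipe / clock_pipe; 1.05 means the pipelined
    machine has a 1.05 times faster clock.
    """
    return depth / (1.0 + stall_cpi) * clock_ratio


depth = 5  # assumed for illustration; it cancels in the A/B ratio

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: 40% loads, each costing 1 stall cycle, 1.05x faster clock.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

ratio = speedup_a / speedup_b

if __name__ == "__main__":
    print(f"SpeedUpA = {speedup_a:.2f}")
    print(f"SpeedUpB = {speedup_b:.2f}")
    print(f"A / B    = {ratio:.2f}")
```

The ratio equals 1.4 / 1.05, which rounds to the 1.33 stated on the slide.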

Data Hazard on R1
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles), stages IF, ID/RF, EX, MEM, WB. Vertical axis: Instr. Order]
  add x1, x2, x3
  sub x4, x1, x3
  and x6, x1, x7
  or  x8, x1, x9
  xor x10, x11

Three Generic Data Hazards
• Read After Write (RAW): Instr. J tries to read operand before Instr. I writes it
    I: add x1, x2, x3
    J: sub x4, x1, x3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): Instr. J writes operand before Instr. I reads it
    I: sub x4, x1, x3
    J: add x1, x2, x3
    K: add x6, x1, x7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “x1”.
• Can’t happen in RISC-V 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): Instr. J writes operand before Instr. I writes it.
    I: sub x1, x4, x3
    J: add x1, x2, x3
    K: add x6, x1, x7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “x1”.
• Can’t happen in RISC-V 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
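The three hazard classes can be told apart by comparing the register sets of an earlier instruction I and a later instruction J. The sketch below is illustrative only: the dictionary layout and the name `classify_hazards` are assumptions, and it reports potential hazards (dependences on register names), not whether a given pipeline actually stalls on them.

```python
def classify_hazards(first, second):
    """Classify register-name conflicts between two instructions.

    first precedes second in program order. Each is a dict with
    'writes' and 'reads', both sets of register names.
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")  # second reads what first writes
    if first["reads"] & second["writes"]:
        hazards.add("WAR")  # second overwrites what first still reads
    if first["writes"] & second["writes"]:
        hazards.add("WAW")  # both write the same register
    return hazards


# The I/J pairs from the three slides.
raw_pair = ({"writes": {"x1"}, "reads": {"x2", "x3"}},   # add x1, x2, x3
            {"writes": {"x4"}, "reads": {"x1", "x3"}})   # sub x4, x1, x3
war_pair = ({"writes": {"x4"}, "reads": {"x1", "x3"}},   # sub x4, x1, x3
            {"writes": {"x1"}, "reads": {"x2", "x3"}})   # add x1, x2, x3
waw_pair = ({"writes": {"x1"}, "reads": {"x4", "x3"}},   # sub x1, x4, x3
            {"writes": {"x1"}, "reads": {"x2", "x3"}})   # add x1, x2, x3

if __name__ == "__main__":
    for name, pair in [("RAW", raw_pair), ("WAR", war_pair), ("WAW", waw_pair)]:
        print(name, "slide example ->", classify_hazards(*pair))
```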

Forwarding to Avoid Data Hazard
[Figure: pipeline diagram with forwarding paths. Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order]
  add x1, x2, x3
  sub x4, x1, x3
  and x6, x1, x7
  or  x8, x1, x9
  xor x10, x11

HW Change for Forwarding
[Figure: datapath with forwarding. Labeled blocks: NextPC, Registers, ID/EX, muxes in front of the ALU, ALU, EX/MEM, Data Memory, MEM/WR, Immediate]
What circuit detects and resolves this hazard?
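One common answer to the question on this slide is a forwarding unit that compares the source register of the instruction in EX against the destination registers held in the later pipeline latches. The sketch below models that comparison for a single ALU operand; it is a textbook-style illustration under stated assumptions, not the circuit from the slides. The field names `reg_write` and `rd` and the return labels are hypothetical, and the latch the slide labels MEM/WR is called MEM/WB here.

```python
def forward_select(id_ex_rs, ex_mem, mem_wb):
    """Choose where one ALU operand should come from.

    id_ex_rs: source register number of the instruction in EX.
    ex_mem, mem_wb: dicts with 'reg_write' (bool) and 'rd' (int) for
    the instructions one and two stages ahead.
    The most recent producer (EX/MEM) takes priority. Register x0 is
    never forwarded because it always reads as zero.
    """
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == id_ex_rs:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == id_ex_rs:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    add_x1 = {"reg_write": True, "rd": 1}   # add x1, x2, x3
    idle = {"reg_write": False, "rd": 0}
    # sub x4, x1, x3 in EX while add is one stage ahead.
    print(forward_select(1, ex_mem=add_x1, mem_wb=idle))
    # and x6, x1, x7 in EX while add is two stages ahead.
    print(forward_select(1, ex_mem=idle, mem_wb=add_x1))
```

This sketch does not cover a load as the producer in EX/MEM, where the loaded value is not yet available and a stall is needed.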

Forwarding to Avoid LW-SW Data Hazard
[Figure: pipeline diagram with forwarding paths. Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order]
  add x1, x2, x3
  lw  x4, 0(x1)
  sw  x4, 12(x1)
  or  x8, x6, x9
  xor x10, x9, x11

Data Hazard Even with Forwarding
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order]
  lw  x1, 0(x2)
  sub x4, x1, x6
  and x6, x1, x7
  or  x8, x1, x9

Data Hazard Even with Forwarding
[Figure: pipeline diagram with one Bubble inserted after the load. Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order]
  lw  x1, 0(x2)
  sub x4, x1, x6
  and x6, x1, x7
  or  x8, x1, x9
How is this detected?
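A typical way to detect the load-use case is to compare, in the decode stage, the destination of a load sitting in EX against the source registers of the instruction being decoded. The following is a hedged sketch of that check; the field names `mem_read`, `rd`, `rs1`, and `rs2` are hypothetical labels for pipeline-latch contents and do not come from the slides.

```python
def must_stall(id_ex, if_id):
    """Return True when a one-cycle bubble is needed.

    id_ex: dict with 'mem_read' (True for a load) and 'rd'.
    if_id: dict with 'rs1' and 'rs2' of the instruction being decoded.
    A load's data is only available after the memory stage, so an
    instruction that immediately uses it cannot be fed by forwarding.
    """
    return (
        id_ex["mem_read"]
        and id_ex["rd"] != 0
        and id_ex["rd"] in (if_id["rs1"], if_id["rs2"])
    )


if __name__ == "__main__":
    lw_x1 = {"mem_read": True, "rd": 1}      # lw  x1, 0(x2)
    sub = {"rs1": 1, "rs2": 6}               # sub x4, x1, x6
    print(must_stall(lw_x1, sub))            # the slide's stalling pair
```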

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
  LW  Rb, b
  LW  Rc, c
  ADD Ra, Rb, Rc
  SW  a, Ra
  LW  Re, e
  LW  Rf, f
  SUB Rd, Re, Rf
  SW  d, Rd

Fast code:
  LW  Rb, b
  LW  Rc, c
  LW  Re, e
  ADD Ra, Rb, Rc
  LW  Rf, f
  SW  a, Ra
  SUB Rd, Re, Rf
  SW  d, Rd

Compiler optimizes for performance. Hardware checks for safety.
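The benefit of the reordering can be counted with a simple model. This sketch assumes full forwarding, so the only stall is one cycle when an instruction reads a register loaded by the immediately preceding LW; the tuple encoding and the name `load_use_stalls` are illustrative assumptions.

```python
def load_use_stalls(program):
    """Count load-use stalls in a straight-line program.

    program: list of (opcode, dest, sources) tuples.
    Assumes a 5-stage pipeline with forwarding, where the only stall
    is one cycle for an instruction that uses the result of the LW
    directly before it.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


slow_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

if __name__ == "__main__":
    print("slow code stalls:", load_use_stalls(slow_code))
    print("fast code stalls:", load_use_stalls(fast_code))
```

Under this model the slow sequence stalls twice (ADD after LW Rc, SUB after LW Rf) and the fast sequence does not stall.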

Control Hazard on Branches: Three Stage Stall
[Figure: pipeline diagram. Vertical axis: Instr. Order. Stages: Ifetch, Reg, ALU, DMem, Reg]
  10: beq x1, x3, 36
  14: and x2, x3, x5
  18: or  x6, x1, x7
  22: add x8, x1, x9
  36: xor x10, x11
What do you do with the 3 instructions in between?
How do you do it?
Where is the “commit”?

Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
• Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
• RISC-V Solution:
  – Move Zero-test to ID stage
  – Adder to calculate new PC in ID stage
  – 1 clock cycle penalty for branch versus 3
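The CPI figure on this slide follows from adding the average branch stall cycles per instruction to the base CPI. A minimal sketch of that calculation, with the function name `cpi_with_branch_stalls` chosen for illustration:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, penalty_cycles):
    """CPI = base CPI + (fraction of branches) * (stall cycles per branch)."""
    return base_cpi + branch_fraction * penalty_cycles


if __name__ == "__main__":
    # The slide's case: 30% branches, 3-cycle stall.
    print(f"{cpi_with_branch_stalls(1.0, 0.30, 3):.2f}")
    # Same branch mix with the 1-cycle penalty of the RISC-V solution.
    print(f"{cpi_with_branch_stalls(1.0, 0.30, 1):.2f}")
```

With the penalty reduced from 3 cycles to 1, the same 30% branch mix gives a CPI of 1.3 instead of 1.9.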

Pipelined RISC-V Datapath
[Figure: five-step datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with stage registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Labeled blocks: Next PC, Next SEQ PC, Adder, 4, MUX, Z, Reg File (RS1, RS2, RD), Sign Extend (Imm), ALU, Data Memory, WB Data]

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • “Squash” instructions in pipeline if branch actually taken
  • Advantage of late pipeline state update
  • 47% branches not taken on average
  • PC+4 already calculated, so use it to get next instruction
  • ARM and RISC-V use this
#3: Predict Branch Taken
  • 53% branches taken on average
  • But the branch target address has not been calculated yet in RISC-V, so it still incurs 1 cycle branch penalty

Four Branch Hazard Alternatives
#4: Delayed Branch
  • Define branch to take place AFTER a following instruction
      branch instruction
      sequential successor 1   \
      sequential successor 2    |  Branch delay of length n
      ........                  |
      sequential successor n   /
      target of taken branch
  • 1 slot delay allows proper decision and branch target address in 5 stage pipeline

Scheduling Branch Delay Slots

A. From before branch
    add x1, x2, x3
    if x2=0 then
      delay slot
  becomes
    if x2=0 then
      add x1, x2, x3

B. From branch target
    sub x4, x5, x6
    add x1, x2, x3
    if x1=0 then
      delay slot
  becomes
    add x1, x2, x3
    if x1=0 then
      sub x4, x5, x6

C. From fall through
    add x1, x2, x3
    if x1=0 then
      delay slot
    or x7, x8, x9
    sub x4, x5, x6
  becomes
    add x1, x2, x3
    if x1=0 then
      or x7, x8, x9
    sub x4, x5, x6

• A is the best choice, fills delay slot & reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails

Delayed Branch
• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
  – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  – Growth in available transistors has made dynamic approaches relatively cheaper

Evaluating Branch Alternatives
Assume 4% unconditional branch, 6% conditional branch-untaken, 10% conditional branch-taken

Scheduling scheme    Branch penalty    CPI     Speedup v. unpipelined    Speedup v. stall
Stall pipeline       3                 1.60    3.1                       1.0
Predict taken        1                 1.20    4.2                       1.33
Predict not taken    1                 1.14    4.4                       1.40
Delayed branch       0.5               1.10    4.5                       1.45
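The table can be reproduced from the stated branch mix. The sketch below rests on two assumptions that the slide does not spell out: a pipeline depth of 5, with speedup over the unpipelined machine taken as depth / CPI, and a predict-not-taken penalty that applies only to unconditional and taken conditional branches. All names are illustrative.

```python
PIPELINE_DEPTH = 5  # assumed; matches the 5 stage pipeline in these slides

# Branch mix from the slide, as fractions of all instructions.
UNCOND, COND_UNTAKEN, COND_TAKEN = 0.04, 0.06, 0.10
ALL_BRANCHES = UNCOND + COND_UNTAKEN + COND_TAKEN

# scheme -> (fraction of instructions that pay the penalty, penalty cycles)
SCHEMES = {
    "Stall pipeline": (ALL_BRANCHES, 3),
    "Predict taken": (ALL_BRANCHES, 1),
    "Predict not taken": (UNCOND + COND_TAKEN, 1),
    "Delayed branch": (ALL_BRANCHES, 0.5),
}


def evaluate(schemes=SCHEMES, depth=PIPELINE_DEPTH):
    """Return {scheme: (cpi, speedup_vs_unpipelined, speedup_vs_stall)}."""
    cpis = {name: 1.0 + frac * penalty for name, (frac, penalty) in schemes.items()}
    stall_cpi = cpis["Stall pipeline"]
    return {
        name: (cpi, depth / cpi, stall_cpi / cpi)
        for name, cpi in cpis.items()
    }


if __name__ == "__main__":
    for name, (cpi, vs_unpiped, vs_stall) in evaluate().items():
        print(f"{name:<18} CPI={cpi:.2f}  v.unpipelined={vs_unpiped:.1f}  v.stall={vs_stall:.2f}")
```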

• Exception: An unusual event happens to an instruction during its execution
  – Examples: divide by zero, undefined opcode
• Interrupt: Hardware signal to switch the processor to a new instruction stream
  – Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting)
• Problem: It must appear as if the exception or interrupt occurs between 2 instructions (Ii and Ii+1)
  – The effect of all instructions up to and including Ii is totally completed
  – No effect of any instruction after Ii can take place
• The interrupt (exception) handler either aborts program or restarts at instruction Ii+1

Precise Exceptions in Static Pipelines
Key observation: architected state changes only in the memory and register write stages.

Floating Point Operations
• FP pipeline will allow for a longer latency
  – It is impractical to require all FP operations complete in 1 clock cycle
    - Slow clock rate
    - Enormous amounts of logic circuit
• Multiple FUs (functional units)
  – Integer unit, FP/Int multiplier, FP adder, FP/Int divider
• Two alternatives
  – Multicycle (unpipelined) FP unit
  – Pipelined FP unit
    - Latency and initiation interval (or repeat interval)

Multicycle (unpipelined) FP units

Pipelined FP units
WAR and WAW hazards may appear because instructions have different lengths

Latency and initiation interval

Functional unit       Latency    Initiation interval
Integer ALU           0          1
Data memory (load)    1          1
FP add                3          1
FP/Int multiply       6          1
FP/Int divide         24         25

• A new FP-add operation can start every cycle, but each result is obtained after 3 cycles
• FP divider is not pipelined and requires 24 cycles to complete
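The two columns of the table answer different questions, which a short sketch can make concrete. It assumes the usual textbook definitions, which the slide does not state: latency is the number of intervening cycles between an instruction that produces a result and one that uses it, and initiation interval is the number of cycles between issuing two operations to the same unit. The function names are illustrative.

```python
# (latency, initiation interval) per functional unit, from the slide's table.
UNITS = {
    "Integer ALU": (0, 1),
    "Data memory (load)": (1, 1),
    "FP add": (3, 1),
    "FP/Int multiply": (6, 1),
    "FP/Int divide": (24, 25),
}


def issue_cycles(unit, count, start=0):
    """Earliest issue cycles for `count` independent operations on one unit."""
    _, interval = UNITS[unit]
    return [start + k * interval for k in range(count)]


def earliest_dependent_issue(unit, issue_cycle):
    """Earliest issue cycle for an instruction that uses the unit's result.

    Assumes latency counts the intervening cycles between producer
    and consumer, so a latency of 0 allows back-to-back issue.
    """
    latency, _ = UNITS[unit]
    return issue_cycle + latency + 1


if __name__ == "__main__":
    print("FP add issue cycles:   ", issue_cycles("FP add", 3))
    print("FP divide issue cycles:", issue_cycles("FP/Int divide", 3))
    print("Consumer of FP add issued at cycle 0 can issue at cycle",
          earliest_dependent_issue("FP add", 0))
```

The pipelined FP adder accepts a new operation every cycle, while back-to-back divides are spaced 25 cycles apart because the divider is not pipelined.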