CENG 450 Computer Systems and Architecture Lecture 6

Amirali Baniasadi
amirali@ece.uvic.ca

Overview of Today’s Lecture
- MIPS
- Pipelining

CPU Pipelining Example
- Theoretically:
  - Speedup should be equal to the number of stages (n tasks, k stages, p latency)
  - Speedup = (n * p) / (p + (p / k) * (n - 1)) ≈ k (for large n)
- Practically:
  - Stages are imperfectly balanced
  - Pipelining needs overhead
  - Speedup is less than the number of stages
- If we have 3 consecutive instructions:
  - Non-pipelined needs 8 x 3 = 24 ns
  - Pipelined needs 14 ns
  => Speedup = 24 / 14 = 1.7
- If we have 1003 consecutive instructions:
  - Add the time for 1000 more instructions to the previous example
    - Non-pipelined total time = 1000 x 8 + 24 = 8024 ns
    - Pipelined total time = 1000 x 2 + 14 = 2014 ns
  => Speedup ~ 3.98 ~ (8 ns / 2 ns), near perfect speedup
  => Performance increases for a larger number of instructions (throughput)
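The arithmetic in this example can be checked with a short sketch. It assumes a 5-stage pipeline with a 2 ns stage time and 8 ns per non-pipelined instruction; the stage count is inferred from the 14 ns figure (5 x 2 + 2 x 2) rather than stated on the slide, and the function name is illustrative only.

```python
def pipeline_speedup(n, unpipelined_ns, stage_ns, stages):
    """Return (non-pipelined total, pipelined total, speedup) for n instructions."""
    unpipelined_total = n * unpipelined_ns
    # The first instruction needs `stages` cycles to pass through the pipeline;
    # every later instruction completes one stage time after its predecessor.
    pipelined_total = stages * stage_ns + (n - 1) * stage_ns
    return unpipelined_total, pipelined_total, unpipelined_total / pipelined_total


for n in (3, 1003):
    seq_ns, pipe_ns, speedup = pipeline_speedup(n, unpipelined_ns=8, stage_ns=2, stages=5)
    print(f"n={n}: non-pipelined {seq_ns} ns, pipelined {pipe_ns} ns, speedup {speedup:.2f}")
```

As n grows, the fixed fill time of the pipeline is amortized and the speedup approaches the ratio 8 ns / 2 ns = 4.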

MIPS: Software conventions for Registers

  Reg    Name     Use
  0      zero     constant 0
  1      at       reserved for assembler
  2-3    v0-v1    expression evaluation & function results
  4-7    a0-a3    arguments
  8-15   t0-t7    temporary: caller saves (callee can clobber)
  16-23  s0-s7    callee saves (caller can clobber)
  24-25  t8-t9    temporary (cont’d)
  26-27  k0-k1    reserved for OS kernel
  28     gp       pointer to global area
  29     sp       stack pointer
  30     fp       frame pointer
  31     ra       return address (HW)

Plus a 3-deep stack of mode bits.

Example in C: swap

    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

- Assume swap is called as a procedure
- Assume temp is register $15; arguments in $a1, $a2; $16 is scratch reg
- Write MIPS code

swap: MIPS

    swap: addiu $sp, $sp, -4    ; create space on stack
          sw    $16, 4($sp)     ; callee saved register put onto stack
          sll   $t2, $a2, 2     ; multiply k by 4
          addu  $t2, $a1, $t2   ; address of v[k]
          lw    $15, 0($t2)     ; load v[k]
          lw    $16, 4($t2)     ; load v[k+1]
          sw    $16, 0($t2)     ; store v[k+1] into v[k]
          sw    $15, 4($t2)     ; store old value of v[k] into v[k+1]
          lw    $16, 4($sp)     ; callee saved register restored from stack
          addiu $sp, $sp, 4     ; restore top of stack
          jr    $31             ; return to place that called swap

5 Steps of MIPS Datapath
[Figure: datapath and control path across five stages (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back), separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

5 Steps of MIPS Datapath
[Figure: the same datapath with instructions Inst 1, Inst 2, and Inst 3 shown in flight through the stages.]

Review: Visualizing Pipelining
[Figure: instructions in program order (vertical axis) against time in clock cycles 1 to 7 (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]

Limits to pipelining
- Hazards: circumstances that would cause incorrect execution if the next instruction were launched
  - Structural hazards: attempting to use the same hardware to do two different things at the same time
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Example: One Memory Port/Structural Hazard
[Figure: pipeline diagram over clock cycles 1 to 7 for Load, Instr 1, Instr 2, Instr 3, and Instr 4 in program order, with a structural hazard marked.]

Resolving structural hazards
- Definition: an attempt to use the same hardware for two different things at the same time
- Solution 1: Wait
  => must detect the hazard
  => must have a mechanism to stall
- Solution 2: Throw more hardware at the problem

Detecting and Resolving Structural Hazard
[Figure: pipeline diagram over clock cycles 1 to 7 for Load, Instr 1, Instr 2, a Stall row of bubbles, and Instr 3 in program order.]
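The stall shown here can be reproduced with a small scheduling sketch. It assumes a 5-stage pipeline (IF, ID, EX, MEM, WB) with a single memory port shared by instruction fetch and data access, and it models only that one conflict; the function name and the boolean-list input format are illustrative, not part of the lecture.

```python
def schedule_fetches(uses_data_memory):
    """Return the fetch cycle of each instruction on a single-ported memory.

    uses_data_memory[i] is True if instruction i accesses data memory
    (a load or store) in its MEM stage.
    """
    MEM_OFFSET = 3            # IF = +0, ID = +1, EX = +2, MEM = +3, WB = +4
    port_busy = set()         # cycles in which a MEM-stage access owns the port
    fetch_cycles = []
    cycle = 1
    for is_mem in uses_data_memory:
        while cycle in port_busy:   # port taken by an earlier load/store: bubble
            cycle += 1
        fetch_cycles.append(cycle)
        if is_mem:
            port_busy.add(cycle + MEM_OFFSET)
        cycle += 1
    return fetch_cycles


# Load followed by four instructions that do not touch data memory.
print(schedule_fetches([True, False, False, False, False]))
```

With the load fetched in cycle 1, its data access occupies the port in cycle 4, so the third instruction after the load is fetched in cycle 5 instead of cycle 4.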

Eliminating Structural Hazards at Design Time
[Figure: five-stage datapath with a separate Instr Cache and Data Cache, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

Role of Instruction Set Design in Structural Hazard Resolution
- Simple to determine the sequence of resources used by an instruction
  - opcode tells it all
- Uniformity in the resource usage
- Compare MIPS to IA32?
- MIPS approach => all instructions flow through the same 5-stage pipeline

Data Hazards
Instruction order, with stages IF, ID/RF, EX, MEM, WB over time (clock cycles):

    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

[Figure: pipeline diagram of the five instructions above.]

Three Generic Data Hazards
- Read After Write (RAW): Instr. J tries to read an operand before Instr. I writes it

      I: add r1, r2, r3
      J: sub r4, r1, r3

- Caused by a “data dependence”. This hazard results from an actual need for communication.

Three Generic Data Hazards
- Write After Read (WAR): Instr. J writes an operand before Instr. I reads it

      I: sub r4, r1, r3
      J: add r1, r2, r3
      K: mul r6, r1, r7

- Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
- Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards
- Write After Write (WAW): Instr. J writes an operand before Instr. I writes it

      I: sub r1, r4, r3
      J: add r1, r2, r3
      K: mul r6, r1, r7

- Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
- Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Will see WAR and WAW in later, more complicated pipes
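The three definitions can be stated compactly as register comparisons between an earlier instruction I and a later instruction J. The sketch below is illustrative only: the `(dest, sources)` tuple encoding and the function name are assumptions, and it reports dependences, which become actual hazards only if the pipeline can reorder the accesses.

```python
def classify_hazards(instr_i, instr_j):
    """Classify dependences from earlier instruction I to later instruction J.

    Each instruction is a (dest, sources) tuple, e.g. ("r1", ("r2", "r3")).
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")   # J reads what I writes
    if dest_j in srcs_i:
        found.add("WAR")   # J writes what I reads
    if dest_i == dest_j:
        found.add("WAW")   # J writes what I writes
    return found


# The I/J pairs from the three slides above.
print(classify_hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r3"))))  # add then sub
print(classify_hazards(("r4", ("r1", "r3")), ("r1", ("r2", "r3"))))  # sub then add
print(classify_hazards(("r1", ("r4", "r3")), ("r1", ("r2", "r3"))))  # sub then add
```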

Forwarding to Avoid Data Hazard
Instruction order over time (clock cycles):

    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

[Figure: pipeline diagram of the five instructions above.]

HW Change for Forwarding
[Figure: forwarding hardware; labeled elements are NextPC, Registers, Immediate, the ID/EX, EX/MEM, and MEM/WR pipeline registers, three muxes, the ALU, and Data Memory.]
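The selection a forwarding mux makes for one ALU operand can be sketched as follows. This is a simplified model, not the slide's circuit: it assumes register numbers as integers, `None` for an instruction that writes no register, and it ignores the load case, where the value is not yet available in EX/MEM. The slide labels the last pipeline register MEM/WR; the sketch uses the MEM/WB name from the other slides.

```python
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Pick where the ALU operand for src_reg should come from.

    ex_mem_dest / mem_wb_dest: destination register of the instruction whose
    result sits in that pipeline register, or None if it writes no register.
    """
    if src_reg == 0:
        return "REGFILE"      # register 0 is the constant zero, never forwarded
    if ex_mem_dest == src_reg:
        return "EX/MEM"       # newest result wins
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REGFILE"


# sub r4, r1, r3 in EX while add r1, r2, r3 is one stage ahead:
print(forward_source(1, ex_mem_dest=1, mem_wb_dest=None))
# and r6, r1, r7 in EX: sub (writes r4) is one stage ahead, add (writes r1) two stages ahead:
print(forward_source(1, ex_mem_dest=4, mem_wb_dest=1))
```

Checking EX/MEM before MEM/WB matters when both hold a result for the same register: the more recent instruction's value is the one program order requires.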

Data Hazard Even with Forwarding
Instruction order over time (clock cycles):

    lw  r1, 0(r2)
    sub r4, r1, r6
    and r6, r1, r7
    or  r8, r1, r9

[Figure: pipeline diagram of the four instructions above.]

Resolving this load hazard
- Adding hardware? ... not
- Detection?
- Compilation techniques?
- What is the cost of load delays?

Resolving the Load Data Hazard
Instruction order over time (clock cycles):

    lw  r1, 0(r2)
    sub r4, r1, r6
    and r6, r1, r7
    or  r8, r1, r9

[Figure: pipeline diagram of the four instructions above, with a bubble inserted in sub, and, and or after the load.]

How is this different from the instruction issue stall?

Software Scheduling to Avoid Load Hazards
Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:

    LW  Rb, b
    LW  Rc, c
    ADD Ra, Rb, Rc
    SW  a, Ra
    LW  Re, e
    LW  Rf, f
    SUB Rd, Re, Rf
    SW  d, Rd

Fast code:

    LW  Rb, b
    LW  Rc, c
    LW  Re, e
    ADD Ra, Rb, Rc
    LW  Rf, f
    SW  a, Ra
    SUB Rd, Re, Rf
    SW  d, Rd
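The difference between the two sequences can be counted with a small sketch. It assumes a pipeline with full forwarding, so the only stall is a one-cycle load-use stall when an instruction reads a register loaded by the instruction immediately before it. The tuple encoding `(op, dest, sources)` and the function name are illustrative.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls where an instruction uses the result of the
    load immediately preceding it (full forwarding assumed otherwise)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print("slow:", count_load_use_stalls(slow), "stalls")
print("fast:", count_load_use_stalls(fast), "stalls")
```

Under these assumptions the slow order stalls twice (ADD after LW Rc, SUB after LW Rf), while the fast order separates each load from its first use by at least one instruction.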

Instruction Set Connection
- What is exposed about this organizational hazard in the instruction set?
- k cycle delay?
  - bad, CPI is not part of ISA
- k instruction slot delay
  - load should not be followed by use of the value in the next k instructions
- Nothing, but code can reduce run-time delays
- MIPS did the transformation in the assembler

Eliminating Control Hazards at Design Time
[Figure: five-stage datapath with Instr Cache and Data Cache, showing the Next SEQ PC adder, the Zero? test, and the Next PC mux.]

Example: Branch Stall Impact
- If 30% of instructions are branches, a 3-cycle stall is significant
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or != 0
- MIPS Solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3
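To see why a 3-cycle stall is significant at a 30% branch frequency, the effective CPI can be estimated as the base CPI plus branch frequency times penalty. The sketch below assumes a base CPI of 1 and that every branch pays the full penalty; the function name is illustrative.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles, base_cpi=1.0):
    """Effective CPI when every branch costs `penalty_cycles` extra cycles."""
    return base_cpi + branch_fraction * penalty_cycles


three_cycle = cpi_with_branch_stalls(0.30, 3)   # branch resolved late in the pipeline
one_cycle = cpi_with_branch_stalls(0.30, 1)     # zero test and adder moved to ID/RF
print(f"3-cycle penalty: CPI = {three_cycle:.2f}")
print(f"1-cycle penalty: CPI = {one_cycle:.2f}")
print(f"improvement: {three_cycle / one_cycle:.2f}x")
```

Under these assumptions the CPI drops from about 1.9 to about 1.3 when the branch penalty is cut from 3 cycles to 1.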

Pipelined MIPS Datapath
[Figure: pipelined datapath across Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, with a region marked EXTRA HARDWARE.]
- Data stationary control: local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - “Squash” instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% of MIPS branches not taken on average
  - PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  - 53% of MIPS branches taken on average
  - But haven’t calculated branch target address in MIPS
    - MIPS still incurs 1 cycle branch penalty
    - Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
  - Define branch to take place AFTER a following instruction

        branch instruction
        sequential successor 1
        sequential successor 2
        ........
        sequential successor n
        ........
        branch target if taken

    (sequential successors 1 to n: branch delay of length n)
  - 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  - MIPS uses this

Delayed Branch
- Where to get instructions to fill the branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
  - Canceling branches allow more slots to be filled
- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Recall: Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:
[Equation image not preserved.]

Example: Evaluating Branch Alternatives
Assume: conditional & unconditional branches = 14% of instructions, 65% of them change the PC

  Scheduling scheme    Branch penalty   CPI     Speedup v. stall
  Stall pipeline       3                1.42    1.0
  Predict taken        1                1.14    1.26
  Predict not taken    1                1.09    1.29
  Delayed branch       0.5              1.07    1.31
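The CPI column is consistent with CPI = 1 + branch frequency x effective penalty, which the sketch below assumes. For predict-not-taken it further assumes the 1-cycle penalty is paid only by the 65% of branches that change the PC. The function and scheme names are illustrative.

```python
def evaluate_branch_schemes(branch_fraction=0.14, taken_fraction=0.65):
    """Return {scheme: (cpi, speedup_vs_stall)} under the stated assumptions."""
    effective_penalty = {
        "Stall pipeline": 3,
        "Predict taken": 1,
        "Predict not taken": 1 * taken_fraction,  # penalty only when the PC changes
        "Delayed branch": 0.5,
    }
    cpi = {name: 1.0 + branch_fraction * p for name, p in effective_penalty.items()}
    stall_cpi = cpi["Stall pipeline"]
    return {name: (value, stall_cpi / value) for name, value in cpi.items()}


for name, (cpi, speedup) in evaluate_branch_schemes().items():
    print(f"{name:<18} CPI = {cpi:.2f}   speedup vs. stall = {speedup:.2f}")
```

Dividing the CPIs directly gives about 1.25, 1.30, and 1.33 for the last three schemes, slightly different from the 1.26, 1.29, and 1.31 in the table; the table's figures may have been derived from rounded intermediate values.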

Summary
- Hazards
- Data Hazards & Control Hazards
- How to remove hazards?
- Data Hazards:
  - Forwarding
  - Change program order
- Control Hazards:
  - Speculate branch outcome
  - Delay slots
  - Use extra hardware