CENG 450 Computer Systems and Architecture Lecture 6

Amirali Baniasadi
amirali@ece.uvic.ca

Overview of Today’s Lecture
• MIPS
• Pipelining

CPU Pipelining Example
• Theoretically:
  - Speedup should equal the number of stages (n tasks, k stages, p latency)
  - Speedup = (n * p) / ((p / k) * (n - 1) + p) ≈ k (for large n)
• Practically:
  - Stages are imperfectly balanced
  - Pipelining adds overhead
  - Speedup is less than the number of stages
• If we have 3 consecutive instructions:
  - Non-pipelined needs 8 x 3 = 24 ns
  - Pipelined needs 14 ns
  => Speedup = 24 / 14 ≈ 1.7
• If we have 1003 consecutive instructions:
  - Add the time for 1000 more instructions (1003 in total) to the previous example
    - Non-pipelined total time = 1000 x 8 + 24 = 8024 ns
    - Pipelined total time = 1000 x 2 + 14 = 2014 ns
  => Speedup = 8024 / 2014 ≈ 3.98, close to 8 ns / 2 ns = 4 (near-perfect speedup)
  => Performance increases with the number of instructions (throughput)
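The arithmetic on this slide can be checked with a short sketch. The slide does not state the stage count or the clock period; 5 stages and a 2 ns cycle are assumptions inferred from the 14 ns and 2014 ns totals, and the function names are illustrative only.

```python
def unpipelined_time(n, instr_ns=8):
    """Each instruction runs start to finish before the next one begins."""
    return n * instr_ns


def pipelined_time(n, stages=5, cycle_ns=2):
    """Fill the pipeline once (stages cycles), then finish one instruction per cycle."""
    return (stages + n - 1) * cycle_ns


def speedup(n, stages=5, cycle_ns=2, instr_ns=8):
    return unpipelined_time(n, instr_ns) / pipelined_time(n, stages, cycle_ns)


for n in (3, 1003, 1_000_000):
    print(n, unpipelined_time(n), pipelined_time(n), round(speedup(n), 2))
```

Under these assumptions the speedup approaches 8 ns / 2 ns = 4 rather than the stage count of 5, because the pipelined clock (2 ns) is longer than one fifth of the unpipelined instruction time.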

MIPS: Software Conventions for Registers

    0     zero   constant 0
    1     at     reserved for assembler
    2     v0     expression evaluation &
    3     v1     function results
    4     a0     arguments
    5     a1
    6     a2
    7     a3
    8     t0     temporary: caller saves
    ...          (callee can clobber)
    15    t7
    16    s0     callee saves
    ...          (caller can clobber)
    23    s7
    24    t8     temporary (cont’d)
    25    t9
    26    k0     reserved for OS kernel
    27    k1
    28    gp     pointer to global area
    29    sp     stack pointer
    30    fp     frame pointer
    31    ra     return address (HW)

Plus a 3-deep stack of mode bits.

Example in C:

    /* Exchange the adjacent elements v[k] and v[k+1]. */
    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

• Assume swap is called as a procedure
• Assume temp is register $15; arguments in $a1, $a2; $16 is scratch reg
• Write MIPS code

swap: MIPS

    swap: addiu $sp, $sp, -4     ; create space on stack
          sw    $16, 4($sp)      ; callee saved register put onto stack
          sll   $t2, $a2, 2      ; multiply k by 4
          addu  $t2, $a1, $t2    ; address of v[k]
          lw    $15, 0($t2)      ; load v[k]
          lw    $16, 4($t2)      ; load v[k+1]
          sw    $16, 0($t2)      ; store v[k+1] into v[k]
          sw    $15, 4($t2)      ; store old value of v[k] into v[k+1]
          lw    $16, 4($sp)      ; callee saved register restored from stack
          addiu $sp, $sp, 4      ; restore top of stack
          jr    $31              ; return to place that called swap

5 Steps of MIPS Datapath
[Figure: datapath and control path of the five-stage MIPS pipeline. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labeled elements: Next PC, Next SEQ PC, Adder (+4), Memory, Reg File (RS1, RS2, RD), Imm, Sign Extend, MUXes, ALU, Zero?, WB Data.]

5 Steps of MIPS Datapath
[Figure: the five-stage MIPS datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) annotated with Inst 1, Inst 2 and Inst 3 occupying successive stages.]

Review: Visualizing Pipelining
[Figure: pipeline diagram. Horizontal axis: time (clock cycles), Cycle 1 to Cycle 7. Vertical axis: instruction order. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the instruction before it.]
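Since the diagram itself does not survive in text form, a minimal sketch can regenerate it. The stage names are the ones on the slide; the function names and the column width are illustrative assumptions, and the model covers only the ideal case with no stalls.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """One row per instruction; row[c] is the stage occupied in cycle c + 1 ('' if idle)."""
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters the pipeline in cycle i + 1
        rows.append(row)
    return rows


def render(rows, width=9):
    header = "".join(f"Cycle {c + 1}".ljust(width) for c in range(len(rows[0])))
    body = ["".join(cell.ljust(width) for cell in row).rstrip() for row in rows]
    return "\n".join([header.rstrip()] + body)


print(render(pipeline_diagram(3)))
```

Three instructions through five stages occupy seven cycles, which matches the Cycle 1 to Cycle 7 axis on the slide.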

Limits to Pipelining
• Hazards: circumstances that would cause incorrect execution if the next instruction were launched
  - Structural hazards: attempting to use the same hardware to do two different things at the same time
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)

Example: One Memory Port / Structural Hazard
[Figure: pipeline diagram, Cycle 1 to Cycle 7, for Load, Instr 1, Instr 2, Instr 3, Instr 4. With one memory port, the DMem access of the Load and the Ifetch of Instr 3 fall in the same cycle: a structural hazard.]

Resolving Structural Hazards
• Defn: attempt to use the same hardware for two different things at the same time
• Solution 1: Wait
  => must detect the hazard
  => must have a mechanism to stall
• Solution 2: Throw more hardware at the problem

Detecting and Resolving Structural Hazard
[Figure: pipeline diagram, Cycle 1 to Cycle 7, for Load, Instr 1, Instr 2, Stall, Instr 3. A bubble takes the place of the conflicting fetch, so Instr 3 is fetched one cycle later.]
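The stall in this figure can be reproduced with a small model. This is a sketch under stated assumptions, not the lecture's hardware: a single memory port shared by instruction fetch and data access, a data access three cycles after fetch (Ifetch, Reg, ALU, DMem), and priority given to the older instruction. The names `fetch_cycles` and `MEM_STAGE_OFFSET` are hypothetical.

```python
MEM_STAGE_OFFSET = 3  # fetched in cycle f -> data memory access in cycle f + 3


def fetch_cycles(program):
    """program: list of booleans, True where the instruction accesses data memory.
    Returns the 1-based cycle in which each instruction is fetched."""
    busy = set()   # cycles in which the single port is used for a data access
    cycles = []
    next_free = 1
    for uses_memory in program:
        cycle = next_free
        while cycle in busy:   # port taken: insert a bubble and retry
            cycle += 1
        cycles.append(cycle)
        if uses_memory:
            busy.add(cycle + MEM_STAGE_OFFSET)
        next_free = cycle + 1
    return cycles


# Load, Instr 1, Instr 2, Instr 3, Instr 4
print(fetch_cycles([True, False, False, False, False]))  # [1, 2, 3, 5, 6]
```

The fourth instruction is pushed from cycle 4 to cycle 5, which is the single bubble drawn on the slide. With separate instruction and data memories the `busy` set would never block a fetch.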

Eliminating Structural Hazards at Design Time
[Figure: the five-stage datapath with a separate Instr Cache and Data Cache in place of a single memory. Other labeled elements: Next PC, Next SEQ PC, Adder (+4), Reg File (RS1, RS2, RD), Imm, Sign Extend, MUXes, ALU, Zero?, WB Data, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

Role of Instruction Set Design in Structural Hazard Resolution
• Simple to determine the sequence of resources used by an instruction
  - opcode tells it all
• Uniformity in the resource usage
• Compare MIPS to IA-32?
• MIPS approach => all instructions flow through the same 5-stage pipeline

Data Hazards
[Figure: pipeline diagram over time (clock cycles), stages IF, ID/RF, EX, MEM, WB, for the sequence below. Every instruction after the add reads r1.]

    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

Three Generic Data Hazards
• Read After Write (RAW): Instr. J tries to read an operand before Instr. I writes it

    I: add r1, r2, r3
    J: sub r4, r1, r3

• Caused by a “data dependence”. This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): Instr. J writes an operand before Instr. I reads it

    I: sub r4, r1, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): Instr. J writes an operand before Instr. I writes it

    I: sub r1, r4, r3
    J: add r1, r2, r3
    K: mul r6, r1, r7

• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
• Will see WAR and WAW in later, more complicated pipes
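The three definitions reduce to comparing register names between an earlier instruction I and a later instruction J. The sketch below is illustrative only: the `(dest, sources)` tuple encoding and the name `classify` are assumptions, and it reports potential hazards from the dependences alone, without modeling whether a given pipeline can actually reorder the accesses.

```python
def classify(instr_i, instr_j):
    """instr_i comes before instr_j in program order.
    Each instruction is a (dest, [sources]) tuple of register names."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    kinds = set()
    if dest_i in srcs_j:
        kinds.add("RAW")   # J reads what I writes: true data dependence
    if dest_j in srcs_i:
        kinds.add("WAR")   # J writes what I reads: anti-dependence
    if dest_i == dest_j:
        kinds.add("WAW")   # both write the same name: output dependence
    return kinds


add = ("r1", ["r2", "r3"])                 # add r1, r2, r3
print(classify(add, ("r4", ["r1", "r3"])))  # I = add, J = sub r4, r1, r3
print(classify(("r4", ["r1", "r3"]), add))  # I = sub r4, r1, r3, J = add
print(classify(("r1", ["r4", "r3"]), add))  # I = sub r1, r4, r3, J = add
```

The three calls use the instruction pairs from the RAW, WAR and WAW slides, in that order.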

Forwarding to Avoid Data Hazard
[Figure: pipeline diagram over time (clock cycles) for the sequence below, with the result of the add forwarded to the ALU inputs of the instructions that follow.]

    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

HW Change for Forwarding
[Figure: datapath with forwarding. Labeled elements: NextPC, Registers, Immediate, ID/EX, muxes on the ALU inputs, ALU, EX/MEM, Data Memory, MEM/WR.]
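The select logic for the ALU-input muxes can be sketched as below. This is a common textbook formulation, not necessarily the control used in the lecture's figure; the function name, its arguments, and the treatment of `r0` as a constant-zero register are assumptions made for illustration.

```python
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where an ALU operand comes from for the instruction now in EX.

    src_reg:     register the instruction wants to read
    ex_mem_dest: destination of the instruction one ahead (None if it writes nothing)
    mem_wb_dest: destination of the instruction two ahead (None if it writes nothing)
    """
    if src_reg == "r0":
        return "REG"       # assumed constant-zero register: never forwarded
    if ex_mem_dest == src_reg:
        return "EX/MEM"    # the most recent producer takes priority
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REG"           # no producer in flight: use the register file value


# add r1, r2, r3 ; sub r4, r1, r3 ; and r6, r1, r7
print(forward_source("r1", ex_mem_dest="r1", mem_wb_dest=None))  # sub reads r1
print(forward_source("r1", ex_mem_dest="r4", mem_wb_dest="r1"))  # and reads r1
```

For the sub, r1 is taken from the EX/MEM register; for the and, one instruction later, it is taken from MEM/WB.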

Data Hazard Even with Forwarding
[Figure: pipeline diagram over time (clock cycles) for the sequence below. The sub needs r1 at its ALU stage in the same cycle in which the lw is still reading it from DMem.]

    lw  r1, 0(r2)
    sub r4, r1, r6
    and r6, r1, r7
    or  r8, r1, r9

Resolving This Load Hazard
• Adding hardware? ... No: forwarding cannot deliver a value before the load has read it from memory
• Detection?
• Compilation techniques?
• What is the cost of load delays?

Resolving the Load Data Hazard
[Figure: pipeline diagram over time (clock cycles) for the sequence below, with a one-cycle bubble inserted after the lw so that the sub and every later instruction start one cycle later.]

    lw  r1, 0(r2)
    sub r4, r1, r6
    and r6, r1, r7
    or  r8, r1, r9

How is this different from the instruction issue stall?

Software Scheduling to Avoid Load Hazards
Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f are in memory.

    Slow code:                Fast code:
      LW   Rb, b                LW   Rb, b
      LW   Rc, c                LW   Rc, c
      ADD  Ra, Rb, Rc           LW   Re, e
      SW   a, Ra                ADD  Ra, Rb, Rc
      LW   Re, e                LW   Rf, f
      LW   Rf, f                SW   a, Ra
      SUB  Rd, Re, Rf           SUB  Rd, Re, Rf
      SW   d, Rd                SW   d, Rd
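The difference between the two schedules can be counted directly. The sketch assumes the pipeline described earlier in the lecture: full forwarding, and one stall cycle whenever an instruction uses a register loaded by the instruction immediately before it. The tuple encoding and the name `load_use_stalls` are illustrative.

```python
def load_use_stalls(program):
    """program: list of (opcode, dest, sources) tuples in program order.
    Counts one stall for each instruction that reads the register written by
    the LW directly in front of it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []), ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", None, ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```

In the slow code both ADD and SUB follow the load of one of their operands; in the fast code an independent instruction sits between each load and its first use.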

Instruction Set Connection
• What is exposed about this organizational hazard in the instruction set?
• k cycle delay?
  - bad, CPI is not part of ISA
• k instruction slot delay
  - load should not be followed by use of the value in the next k instructions
• Nothing, but code can reduce run-time delays
• MIPS did the transformation in the assembler

Eliminating Control Hazards at Design Time
[Figure: the five-stage datapath with Instr Cache and Data Cache. Labeled elements: Next PC, Next SEQ PC, Adder (+4), Zero?, Reg File (RS1, RS2, RD), Imm, Sign Extend, MUXes, ALU, WB Data, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall is significant
• Two-part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3

Pipelined MIPS Datapath
[Figure: the five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with a block marked EXTRA HARDWARE: an Adder and the Zero? test, feeding the Next PC MUX. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB.]
• Data stationary control: local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - “Squash” instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% of MIPS branches not taken on average
  - PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  - 53% of MIPS branches taken on average
  - But haven’t calculated branch target address in MIPS
    - MIPS still incurs 1 cycle branch penalty
    - Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
  - Define branch to take place AFTER a following instruction

        branch instruction
        sequential successor 1
        sequential successor 2
        ........
        sequential successor n      (branch delay of length n)
        ........
        branch target if taken

  - 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  - MIPS uses this

Delayed Branch
• Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
  - Canceling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Recall: Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:
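The equations on this slide were images and are not preserved in this transcript. The block below gives the form commonly found in textbook treatments of pipelining, offered as an assumption about what the slide showed rather than a recovery of it; the symbol names are likewise assumed.

```latex
\begin{align*}
\text{CPI}_{\text{pipelined}} &= \text{Ideal CPI} + \text{Average stall cycles per instruction} \\[4pt]
\text{Speedup} &= \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}}
  \times \frac{\text{Cycle time}_{\text{unpipelined}}}{\text{Cycle time}_{\text{pipelined}}}
  \qquad (\text{Ideal CPI} = 1)
\end{align*}
```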

Example: Evaluating Branch Alternatives
Assume: Conditional & Unconditional = 14%, 65% change PC

    Scheduling scheme     Branch penalty   CPI     Speedup v. stall
    Stall pipeline        3                1.42    1.0
    Predict taken         1                1.14    1.26
    Predict not taken     1                1.09    1.29
    Delayed branch        0.5              1.07    1.31
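The CPI column follows from CPI = 1 + branch frequency x penalty, as the sketch below shows. One assumption is needed to reproduce the 1.09 figure: under predict-not-taken, the penalty is paid only by the 65% of branches that change the PC. Only the CPI column is recomputed here; the constant and function names are illustrative.

```python
BRANCH_FREQ = 0.14      # conditional + unconditional branches, from the slide
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC, from the slide


def cpi(penalty, fraction_penalized=1.0):
    """Ideal CPI of 1 plus the average branch stall cycles per instruction."""
    return 1 + BRANCH_FREQ * fraction_penalized * penalty


schemes = {
    "Stall pipeline": cpi(3),
    "Predict taken": cpi(1),
    "Predict not taken": cpi(1, TAKEN_FRACTION),  # assumed: only taken branches pay
    "Delayed branch": cpi(0.5),
}
for name, value in schemes.items():
    print(f"{name:<18} CPI = {value:.2f}")
```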

Summary
• Hazards
• Data Hazards & Control Hazards
• How to remove hazards?
• Data hazards:
  - Forwarding
  - Change program order
• Control hazards:
  - Speculate branch outcome
  - Delay slots
  - Use extra hardware