
CPU Pipelining: an assembly line
Today’s CPUs are pipelined (4.5–4.8, 4th ed.)
10/1/16 Pipelining. 1

Pipelining Overview
• Basic idea: assembly line (visit El Abd downtown)
• Pipelined datapath
• Data hazards: pipelining problems
• Solutions to pipelining problems
• Controlling the pipeline
• Advanced pipelining – dynamic branch prediction
• Advanced pipelining – superscalar
• Benz 2015 C class: https://www.youtube.com/watch?v=tb_1TrpUrmQ

1966 Ford Mustang assembly line, Michigan

MIPS Pipeline: 5 steps, 5 stages
RTL notation: register transfer level
• IF (instruction fetch): IR <= mem[PC]; PC <= PC + 4
• ID (inst. decode and register fetch): A <= Reg[IR_rs]; B <= Reg[IR_rt]
• EX (execute / effective addr. calculation): rslt <= A op_IRop B
• MEM (memory access): WB <= rslt; or WB <= mem(rslt)
• WB (write back to register): Reg[IR_rd] <= WB
Stage order: IF, ID, EX, MEM, WB

Single Cycle vs. Pipeline
[Figure: clock timing diagram. Single-cycle implementation: Load, then Store, with wasted time ("Waste") in the cycle. Pipeline implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, overlapped in time.]
• Why pipeline?

Pipelined Representation: timing diagram
[Figure: program flow (your code) versus time; each instruction passes through IFetch, Dcd, Exec, Mem, WB, starting one cycle after the previous instruction.]

Pipelining: Performance
• Pipeline: multiple inst. are overlapped in execution
– Improves inst. throughput rather than individual inst. execution time
– Speedup = time between instructions (unpipelined) / time between instructions (pipelined); ideally equal to the number of pipe stages
– Pipe stages: balance the stages so each has equal length; the number of pipe stages is limited
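
The throughput argument above can be checked with a small calculation. This is a minimal sketch under idealized assumptions that are not stated on the slide: all stages take the same time, there are no hazards or stalls, and the function names are hypothetical.

```python
def unpipelined_time(n_instructions: int, n_stages: int, stage_time: float = 1.0) -> float:
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * n_stages * stage_time


def pipelined_time(n_instructions: int, n_stages: int, stage_time: float = 1.0) -> float:
    """The first instruction needs n_stages cycles to fill the pipeline;
    after that, one instruction completes per cycle."""
    return (n_stages + n_instructions - 1) * stage_time


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined execution time (ideal case, no stalls)."""
    return unpipelined_time(n_instructions, n_stages) / pipelined_time(n_instructions, n_stages)


if __name__ == "__main__":
    for n in (1, 10, 1000, 1_000_000):
        print(n, round(speedup(n, 5), 3))
```

For a single instruction the ratio is 1 (latency is not improved); as the instruction count grows, the ratio approaches the number of stages.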

Visualizing Pipelining – clearer view
[Figure: instruction order versus time (clock cycles 1–7); each instruction uses Ifetch, Reg, ALU, DMem, Reg in successive cycles.]

Pipelined Datapath
[Figure: pipelined datapath with PC, instruction memory, register file, sign extend, shift left 2, adders, ALU, and data memory, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]
• Must add buffers between stages to maintain instruction information and results

Pipeline Rules
• Each functional unit can be used only once per instruction
• And at the same stage for all instructions:
– Load uses the register file’s write port during its 5th stage: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
– R-type uses the register file’s write port during its 4th stage: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Wr
• Introduce a bypass stage for R-type so WB is also its 5th stage

Pipeline problems, a.k.a. hazards
• Hazards: cause incorrect execution if the next instruction is launched
1. Structural hazards: use of the same hardware to do two different things at the same time
2. Data hazards: instruction depends on the result of a prior instruction; data dependency
3. Control hazards: due to the delay between instruction fetching and decisions about changes in control flow (branches and jumps)

1. Structural Hazard Example I: Same Register File
• Cycle 5: Load instruction and Instr 3 use the same reg. file
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) versus time (clock cycles 1–7); each instruction uses Ifetch, Reg, ALU, DMem, Reg. The structural hazard is marked on the register file.]
• Solution: write in 1st half of clock, read in 2nd half

Resolving structural hazards
• Problem: simultaneous use of the same hardware by two different stages
• Solution 1: Wait. Detect the hazard, then stall. Serious penalty; a poor choice
• Solution 2: Redesign the pipeline, add hardware

Eliminating Structural Hazards
• Separate reg. file read / write ports
[Figure: datapath and control path with instr. cache, reg. file, sign extend, ALU, data cache, next-PC adder and muxes, and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

2. Data Hazards: Data Dependency, most common
• Read After Write (RAW): Instr. J tries to read an operand before Instr. I writes it
I: add r1, r2, r3
J: sub r4, r1, r3
• Caused by a “data dependence” (in compiler nomenclature). Common in every modern CPU
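
A RAW dependence can be found mechanically by comparing each instruction's destination with the sources of the instructions that follow it. The sketch below is illustrative only: the `Instr` type and `raw_hazards` function are hypothetical names, and the default window of 2 assumes a 5-stage pipeline without forwarding whose register file is written in the first half of a cycle and read in the second.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]       # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read


def raw_hazards(program: List[Instr], window: int = 2) -> List[Tuple[int, int, str]]:
    """Return (producer index, consumer index, register) for each RAW
    dependence whose consumer follows within `window` instructions."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                break  # a later write redefines the register
    return hazards


if __name__ == "__main__":
    program = [
        Instr("add", "r1", ("r2", "r3")),   # I
        Instr("sub", "r4", ("r1", "r3")),   # J reads r1 before I has written it
    ]
    print(raw_hazards(program))
```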

Data Hazard Example
Program execution order:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9
• Wrong (old) data fetched from registers!
[Figure: pipeline diagram over CC 1 to CC 9; stages shown as IM, Reg, DM, Reg.]

S/W: Compiler Can Eliminate / Minimize Data Hazards: Code Scheduling
• Move independent inst. to eliminate data hazards and fill in bubbles
I1 and $18, $9, $10
I2 sub $2, $1, $3
I3 and $12, $2, $5
I4 or $13, $6, $2
I5 add $14, $2, $2
I6 sw $15, 100($2)
I7 sub $16, $7, $8
I8 add $17, $8, $9
• RAW dependence between I2 and I3, I4, I5
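
The effect of code scheduling can be estimated by counting stall cycles before and after moving independent instructions. This is a rough sketch, not a compiler algorithm: `count_stalls` is a hypothetical helper, the exact operand lists of I1 to I8 are assumptions, and the model assumes no forwarding and a register file that is written in the first half of a cycle and read in the second (so a consumer's decode must come at least 3 cycles after its producer's decode).

```python
def count_stalls(program, gap=3):
    """program: list of (dest, srcs) tuples in issue order.
    gap: minimum number of cycles between the ID stage of a producer
    and the ID stage of an instruction that reads its result."""
    last_write = {}   # register -> ID cycle of its most recent producer
    cycle = -1        # ID cycle of the previous instruction
    for dest, srcs in program:
        cycle += 1    # earliest slot: directly behind the previous instruction
        for reg in srcs:
            if reg in last_write:
                cycle = max(cycle, last_write[reg] + gap)
        if dest is not None:
            last_write[dest] = cycle
    return cycle - (len(program) - 1) if program else 0


# Assumed operands for the eight instructions of the example.
I1 = ("$18", ("$9", "$10"))
I2 = ("$2", ("$1", "$3"))
I3 = ("$12", ("$2", "$5"))
I4 = ("$13", ("$6", "$2"))
I5 = ("$14", ("$2", "$2"))
I6 = (None, ("$15", "$2"))      # sw writes no register
I7 = ("$16", ("$7", "$8"))
I8 = ("$17", ("$8", "$9"))

original = [I1, I2, I3, I4, I5, I6, I7, I8]
scheduled = [I2, I1, I7, I3, I4, I5, I6, I8]   # independent I1 and I7 fill the bubbles

if __name__ == "__main__":
    print("stalls, original order: ", count_stalls(original))
    print("stalls, scheduled order:", count_stalls(scheduled))
```

In this model the original order stalls because I3 directly follows its producer I2, while the reordered sequence places two independent instructions between them.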

Data Hazard Architecture Solution: Forwarding
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9
• Value of EX/MEM: -20 in CC 4, X otherwise
• Value of MEM/WB: -20 in CC 5, X otherwise
[Figure: pipeline diagram over CC 1 to CC 9; stages shown as IM, Reg, DM, Reg.]

Hardware Before Forwarding and With Forwarding Additions
[Figure a, no forwarding: ID/EX, EX/MEM, and MEM/WB registers around the register file, ALU, and data memory.]
[Figure b, with forwarding: muxes on the ALU inputs controlled by ForwardA and ForwardB; the forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
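
The selection made by the forwarding unit can be written as a small function. This is a sketch of the commonly taught textbook conditions, not a description of a specific implementation; `forward_select` and its string return values are hypothetical, and register 0 is assumed to be hard-wired to zero and therefore never forwarded.

```python
def forward_select(id_ex_reg: int,
                   ex_mem_rd: int, ex_mem_reg_write: bool,
                   mem_wb_rd: int, mem_wb_reg_write: bool) -> str:
    """Choose the source of one ALU operand.

    id_ex_reg is the Rs (for ForwardA) or Rt (for ForwardB) register number
    of the instruction now in EX. The EX/MEM result is checked first because
    it is the most recent value of the register."""
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "EX/MEM"
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "MEM/WB"
    return "REG"


if __name__ == "__main__":
    # sub $2,$1,$3 is in MEM while and $12,$2,$5 is in EX: Rs = $2 comes from EX/MEM.
    print(forward_select(2, ex_mem_rd=2, ex_mem_reg_write=True,
                         mem_wb_rd=0, mem_wb_reg_write=False))
    # One cycle later, or $13,$6,$2 is in EX: Rt = $2 comes from MEM/WB.
    print(forward_select(2, ex_mem_rd=12, ex_mem_reg_write=True,
                         mem_wb_rd=2, mem_wb_reg_write=True))
```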

Modified Datapath with Forwarding
[Figure: pipelined datapath with control (WB, M, EX fields carried through ID/EX, EX/MEM, MEM/WB); IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd pass through ID/EX as Rs, Rt, Rd; the forwarding unit also receives EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ALU input muxes.]

Data Hazard Even with Forwarding (Load – immediate Use)
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: pipeline diagram over CC 1 to CC 9; stages shown as IM, Reg, DM, Reg.]

Load – immediate use stall
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: pipeline diagram over CC 1 to CC 10 with a bubble inserted after lw; stages shown as IM, Reg, DM, Reg.]
• and $4, $2, $5 stalled one cycle
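
The stall condition for a load followed immediately by a use of its result can be sketched as a single comparison. The function and parameter names below are hypothetical; the condition itself is the usual textbook one (the instruction in EX is a load, and its destination Rt matches a source register of the instruction being decoded).

```python
def load_use_stall(id_ex_mem_read: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID must wait one cycle for a load in EX."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


if __name__ == "__main__":
    # lw $2, 20($1) in EX; and $4, $2, $5 in ID reads $2: stall.
    print(load_use_stall(True, 2, 2, 5))
    # lw $2, 20($1) in EX; slt $1, $6, $7 in ID does not read $2: no stall.
    print(load_use_stall(True, 2, 6, 7))
```

After the one-cycle bubble, the loaded value can be forwarded from MEM/WB, which is why a single stall is enough in this pipeline.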

Forwarding to Avoid LW-SW Data Hazard
Instruction order:
add r1, r2, r3
lw r4, 0(r1)
sw r4, 12(r1)
or r8, r6, r9
xor r10, r9, r11
[Figure: instruction order versus time (clock cycles); each instruction uses Ifetch, Reg, ALU, DMem, Reg.]

Resolving Load Hazards: Summary
• Adding hardware: forwarding memory data to the pipeline minimizes stalls
• Compilation / code scheduling techniques

Control Hazard: Branches => Three-Stage MIPS Stall (branch penalty)
Program execution order (in instructions):
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
[Figure: pipeline diagram over CC 1 to CC 9; stages shown as IM, Reg, DM, Reg.]

Example: Branch Stall Impact
• If 20% of instructions are branches, a 3-cycle stall is significant
• Two-part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS solution:
– Move zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
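
The impact of the stall can be put in numbers. This is a back-of-the-envelope sketch that assumes an ideal CPI of 1 and that every branch pays the full penalty; the function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_fraction: float, penalty_cycles: int,
                           base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when each branch adds penalty_cycles stalls."""
    return base_cpi + branch_fraction * penalty_cycles


if __name__ == "__main__":
    slow = cpi_with_branch_stalls(0.20, 3)   # branch resolved late: 3-cycle stall
    fast = cpi_with_branch_stalls(0.20, 1)   # branch resolved in ID/RF: 1-cycle penalty
    print(f"CPI with 3-cycle stall:   {slow:.2f}")
    print(f"CPI with 1-cycle penalty: {fast:.2f}")
    print(f"relative speedup:         {slow / fast:.2f}")
```

Under these assumptions, moving the branch decision into the ID/RF stage lowers the average CPI from about 1.6 to about 1.2.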

Datapath with Branch Flush Hardware
• Hazard detection unit: detects load – immediate use; needs Rt from LW, and Rs, Rt from the dependent inst.
[Figure: pipelined datapath with IF.Flush, hazard detection unit, control, register comparison (=) in the decode stage, forwarding unit, and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

Control Hazard Solution #1: Stall
[Figure: instruction order (Add, Beq, Load) versus time (clock cycles); the cycles after Beq are marked "Lost potential".]
• Stall: wait until decision is clear
• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
• Move decision to end of decode: saves 1 cycle per branch

Static Branch Hazard Alternatives
#1: Stall (wait) until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction

#2 Predict not taken, branch untaken, taken

MIPS with Predict Not Taken
[Figure: two cases, prediction correct and prediction incorrect.]

Control Hazard Solution #3: Delayed Branch
[Figure: instruction order (Add, Beq, Misc, Load) versus time (clock cycles).]
• Delayed branch: redefine branch behavior (takes place after next instruction)
• Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (about 50% of the time)
• As more instructions are launched per clock cycle, less useful

Delayed branch behavior is the same whether or not the branch is taken

Delayed Branch
• Where to get instructions to fill the branch delay slot?
– Before the branch instruction
– From the target address: only valuable when branch taken
– From fall-through: only valuable when branch not taken
– Canceling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots are useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: 7–8 stage pipelines and multiple instructions issued per clock (superscalar)

Scheduling the Branch Delay Slot
a. From before
add $s1, $s2, $s3
if $s2 = 0 then
  [delay slot]
Becomes
if $s2 = 0 then
  add $s1, $s2, $s3
b. From target
sub $t4, $t5, $t6
…
add $s1, $s2, $s3
if $s1 = 0 then
  [delay slot]
Becomes
add $s1, $s2, $s3
if $s1 = 0 then
  sub $t4, $t5, $t6
c. From fall-through
add $s1, $s2, $s3
if $s1 = 0 then
  [delay slot]
sub $t4, $t5, $t6
Becomes
add $s1, $s2, $s3
if $s1 = 0 then
  sub $t4, $t5, $t6

More Realistic Branch Prediction
• Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
» Predict backward branches taken
» Predict forward branches not taken
• Dynamic branch prediction
– Hardware measures actual branch behavior
» e.g., record recent history of each branch
– Assume future behavior will continue the trend
» When wrong, stall while re-fetching, and update history
– Used in all modern CPUs; e.g., core …

Dynamic Branch Prediction Problem
[Figure: incoming branches { Address } enter the branch predictor, which keeps history information and emits predictions { Address, Value }; corrections { Address, Value } are fed back to it.]
• Incoming stream of addresses
• Fast outgoing stream of predictions
• Correction information returned from pipeline

Dynamic Branch Prediction
• Branch penalty is huge in deep superscalar pipelines
• Dynamic prediction
– Branch prediction buffer (aka branch history table)
– Indexed by branch address
– Stores outcome (taken/not taken)
– To execute a branch
» Check table
» Start fetching from fall-through or target
» If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
• Inner loop branches mispredicted twice!
outer: …
       …
inner: …
       …
       beq …, …, inner
       …
       beq …, …, outer
• Mispredict as taken on last iteration of inner loop
• Then mispredict as not taken on first iteration of inner loop

2-Bit Predictor
• Only change prediction on two successive mispredictions
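
The difference between the two schemes can be seen by simulating the inner-loop branch of a nested loop. This is an illustrative sketch: the function names are hypothetical, and the 2-bit predictor is modelled as a saturating counter, which is one common realization and may differ in detail from the state machine drawn on the slide.

```python
def mispredictions_1bit(outcomes, state=False):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses


def mispredictions_2bit(outcomes, counter=0):
    """2-bit saturating counter (0..3): predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


def inner_loop_outcomes(outer_iters, inner_iters):
    """Inner-loop branch: taken inner_iters - 1 times, then not taken once,
    repeated for every iteration of the outer loop."""
    return ([True] * (inner_iters - 1) + [False]) * outer_iters


if __name__ == "__main__":
    outcomes = inner_loop_outcomes(outer_iters=10, inner_iters=10)
    print("1-bit mispredictions:", mispredictions_1bit(outcomes))
    print("2-bit mispredictions:", mispredictions_2bit(outcomes))
```

In steady state the 1-bit scheme misses twice per pass through the inner loop (on exit and on re-entry), while the 2-bit counter misses only on exit.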

BTB: Branch Address at Same Time as Prediction
• Branch Target Buffer (BTB): address of branch used as index to get prediction AND branch address (if taken)
[Figure: the PC of the instruction in FETCH is compared (=?) with the branch PCs in the BTB; each entry holds a branch PC, a predicted PC, and prediction state bits (2 bits).]
– Yes: instruction is a branch; use predicted PC as next PC
– No: branch not predicted, proceed normally (Next PC = PC+4)
• Only predicted-taken branches and jumps are held in the BTB
• Next PC is determined before the branch is fetched and decoded
• Later: check prediction; if wrong, kill the instruction and update the BPB
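
The lookup described above can be sketched with a dictionary keyed by branch PC. This is a deliberately simplified model: the class and method names are hypothetical, the table is unbounded rather than indexed by a few address bits, and the 2-bit prediction state is omitted, so a branch is simply removed when it turns out not to be taken.

```python
class BranchTargetBuffer:
    """Idealized BTB: maps the PC of a predicted-taken branch to its target."""

    def __init__(self):
        self.entries = {}   # branch PC -> predicted target PC

    def next_pc(self, pc: int) -> int:
        """Hit: fetch from the predicted target. Miss: fall through to PC + 4."""
        return self.entries.get(pc, pc + 4)

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Called once the branch outcome is known."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(btb.next_pc(40))                    # miss: 44
    btb.update(40, taken=True, target=72)
    print(btb.next_pc(40))                    # hit: 72
```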

Pipeline Control: RTL description
• We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment: IR <= mem[PC]; PC <= PC + 4
– Instruction Decode / Register Fetch: A <= Reg[IR_rs]; B <= Reg[IR_rt]
– Execution: rslt <= A op_IRop B
– Memory Stage: WB <= rslt; or WB <= mem(rslt)
– Write Back: Reg[IR_rd] <= WB
• Generate all control signals in the ID stage and pipeline them

Pipelined datapath with control signals
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) annotated with the control signals PCSrc, Branch, RegWrite, MemWrite, MemRead, ALUSrc, ALUOp, RegDst, and MemtoReg; instruction fields [15-0], [20-16], and [15-11] are carried along the pipeline.]

Pipeline Control
• Control signals passed along with data

Deeper Pipeline Example: MIPS R4000

Load immediate use = 2 cycle penalty

Branch delay = 3 cycles

R4000 taken/untaken branch penalty

Exceptions and Interrupts Review
• “Unexpected” events change control flow
• Exception
– Arises within the CPU
» e.g., undefined opcode, overflow, syscall, …
• Interrupt
– From an external I/O controller
• Dealing with them without sacrificing performance is hard

Vectored Interrupts: Alternate Mechanism
• Handler address determined by the cause
• Example:
– Undefined opcode: C000 0000
– Overflow: C000 0020
– …: C000 0040
• Instructions jump to the real handler

Summary: Control and Pipelining
• Control via state machines and microprogramming
• Speedup ≤ pipeline depth; if ideal CPI is 1, then:
  Speedup = pipeline depth / (1 + pipeline stall CPI) x (clock cycle unpipelined / clock cycle pipelined)
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW, WAR, WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction
• Exceptions, interrupts add complexity

Instruction Level Parallelism (ILP)
• Pipelining: executing multiple instructions in parallel
• To increase ILP
– Deeper pipeline
» Less work per stage => shorter clock cycle
– Multiple issue
» Replicate pipeline stages => multiple pipelines
» Start multiple instructions per clock cycle
» CPI < 1, so use Instructions Per Cycle (IPC)
» E.g., 4 GHz 4-way multiple issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
» But dependencies reduce this in practice
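
The peak figures in the multiple-issue example follow directly from the clock rate and the issue width. A minimal sketch, with a hypothetical function name, that assumes every issue slot is filled on every cycle (which real dependencies prevent):

```python
def peak_rates(clock_ghz: float, issue_width: int) -> dict:
    """Peak throughput of a multiple-issue processor with all slots filled."""
    return {
        "bips": clock_ghz * issue_width,   # billions of instructions per second
        "ipc": float(issue_width),         # instructions per cycle
        "cpi": 1.0 / issue_width,          # cycles per instruction
    }


if __name__ == "__main__":
    print(peak_rates(clock_ghz=4.0, issue_width=4))
```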

Simple Static Dual Issue with MIPS
• Two-issue packets
– One ALU/branch instruction
– One load/store instruction
– 64-bit aligned
» ALU/branch, then load/store
» Pad an unused instruction with nop
Address   Instruction type   Pipeline stages
n         ALU/branch         IF ID EX MEM WB
n + 4     Load/store         IF ID EX MEM WB
n + 8     ALU/branch            IF ID EX MEM WB
n + 12    Load/store            IF ID EX MEM WB
n + 16    ALU/branch               IF ID EX MEM WB
n + 20    Load/store               IF ID EX MEM WB

Multiple Issue
• Static multiple issue
– Compiler groups instructions to be issued
– Packages them into “issue slots”
– Compiler detects and avoids hazards
• Dynamic multiple issue (current technology)
– CPU examines instruction stream
– Chooses instructions to issue each cycle
– CPU resolves hazards using advanced techniques at runtime
– Compiler can help by reordering instructions

Speculative Execution Techniques
• “Guess” what to do with an instruction
– Start operation as soon as possible
– Check whether guess was right
» If so, complete operation
» If not, roll back
• Speculation examples
– Branch outcome
» Roll back if path taken is different
– Load
» Roll back if location is updated (dependency)

Dynamic Pipeline Scheduling
• Allow the CPU to execute instructions out of order to avoid stalls
– But commit result to registers in order
• Example
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
– Start sub while addu is waiting for lw

Dynamically Scheduled CPU
[Figure: dynamically scheduled CPU, annotated as follows.]
• Preserves dependencies
• Hold pending operands
• Results also sent to any waiting reservation stations
• Reorder buffer for register writes
• Can supply operands for issued instructions

PPC 604 Pipeline
• IBM PPC 604: 7 units, 4-way superscalar, IBM POWER architecture

Modern Superscalar

PPC 604e

Example 1: Instruction Timing for Cache Hit
[Table: stage (Fet, DQ, DS, EX, C, C/WB) occupied by each instruction in clocks 0–11. Instructions in program order: 0 AND, 1 OR, 2 FADD, 3 FSUB, 4 ADDC, 5 SUBFC, 6 FMADD, 7 FMSUB, 8 XOR, 9 NEG, 10 FADDS, 11 FSUBS, 12 ADD, 13 SUB.]

Example 2: Branch Taken with BTAC Hit
• No branch penalty; 4 OR is from target stream
[Table: stage (Fet, DQ, DS, EX, C, C/WB) occupied by each instruction in clocks 0–10. Instructions: 0 AND, 1 LD, 2 ADD, 3 BC (taken), 4 OR, 5 CMP, 6 LD, 7 MULLI. ADD and BC wait for LD.]
• Cycle 1: instructions 4–7 fetched from target based on address from BTAC hit
• Cycle 5: inst. 2–3 wait for LD to retire (WB) and retire with it

Class Example – real dependencies
Instructions: ADD, OR, SUB, FMUL, FSUB, AND
Operands: R1, R2, R3; R2, R1, R4; R3, R2, R3; F7, F5, F6; F8, F10, F7; R4, R1, R3
[Table: stage (Fet, DQ, DS, EX, C, C/WB) occupied by each instruction in clocks 0–10. Rows: 1 ADD, 2 ADD, 3 OR, 4 SUB, 5 FMUL, 6 FSUB, 7 AND.]

Reservation Stations & Result Buses

Nehalem – Intel Multicore

Nehalem Chip Plan

Nehalem Core: Nehalem Microarchitecture