Pipelining Hakim Weatherspoon CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer.

Announcements: C Programming Practice Assignment due next Tuesday. It is short; do not wait until the end. Project 2 design doc: critical to do this, else Project 2 will be hard.

Single Cycle vs Pipelined Processor

Review: Single Cycle Processor [single-cycle datapath diagram: PC, +4, instruction memory, register file, immediate extend, compare, ALU, branch/jump target, data memory]

Review: Single Cycle Processor
Advantages
• A single cycle per instruction makes the logic and clock simple
Disadvantages
• Since instructions take different amounts of time to finish, memory and the functional units are not efficiently utilized
• Cycle time is the longest delay – the load instruction
• Best possible CPI is 1 (actually < 1 with parallelism)
  – However, lower MIPS and a longer clock period (lower clock frequency); hence, lower performance
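
As a rough worked example (the stage delays here are assumed for illustration, not taken from the slides): suppose fetch, decode, execute, memory, and writeback take 200 ps, 100 ps, 200 ps, 200 ps, and 100 ps. A single-cycle design must fit the slowest instruction, the 800 ps load, so every instruction takes 800 ps. A pipelined design clocked at its slowest stage runs a 200 ps cycle: each instruction still needs 5 × 200 ps = 1000 ps of latency, but one instruction can complete every 200 ps, roughly a 4× throughput improvement.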

Alice, Bob, and The Kids. They don’t always get along…

The Bicycle

The Materials: Drill, Saw, Glue, Paint

The Instructions: N pieces, each built following the same sequence: Saw, Drill, Glue, Paint

Design 1: Sequential Schedule
Alice owns the room; Bob can enter when Alice is finished.
Repeat for the remaining tasks.
No possibility for conflicts.

Sequential Performance [timeline chart, time 1–8]
Elapsed time for Alice: 4. Elapsed time for Bob: 4. Total elapsed time: 4*N.
Latency? Throughput? Concurrency? CPI? Can we do better?
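
Filling in the blanks under the slide's assumption that each task takes one hour: latency is 4 hours per bike, throughput is one bike every 4 hours (0.25 bikes/hour), concurrency is 1 (only one person works at a time), and N bikes take 4*N hours in total.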

Design 2: Pipelined Design
Partition the room into stages of a pipeline (Alice, Bob, Carol, Dave).
One person owns a stage at a time: 4 stages, 4 people working simultaneously.
Everyone moves right in lockstep.
It still takes all four stages for one job to complete.

Pipelined Performance [timeline chart, time 1–7…]
Latency? Throughput? Concurrency? CPI?
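
With the same one-hour-per-stage assumption, once the pipeline is full a bike finishes every hour: latency is still 4 hours per bike, throughput is 1 bike/hour (4× the sequential schedule), concurrency is 4, and N bikes take roughly N + 3 hours.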

Pipelined Performance [timeline chart, time 1–10]
What if drilling takes twice as long, but gluing and painting take ½ as long?
Latency? Throughput? CPI?
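
Because everyone moves in lockstep, the slowest stage sets the step time. With drilling at 2 hours, every stage must hold its work for 2 hours per step, so latency becomes 4 × 2 = 8 hours and throughput drops to one bike every 2 hours, even though gluing and painting each finish in half an hour.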

Lessons
Principle: throughput is increased by parallel execution.
A balanced pipeline is very important; else the slowest stage dominates performance.
Pipelining:
• Identify pipeline stages
• Isolate the stages from each other
• Resolve pipeline hazards (next lecture)

Single Cycle vs Pipelined Processor

Single Cycle vs. Pipelining [timing diagram]
Single-cycle: insn 0 (fetch, dec, exec), then insn 1 (fetch, dec, exec).
Pipelined: insn 0's decode overlaps insn 1's fetch, and insn 0's execute overlaps insn 1's decode.

Agenda
5-stage Pipeline
• Implementation
• Working Example
Hazards
• Structural
• Data Hazards
• Control Hazards

A Processor. Review: single-cycle processor [single-cycle datapath diagram: PC, +4, instruction memory, register file, immediate extend, compare, ALU, branch/jump target, data memory]

A Processor [single-cycle datapath annotated with the five stages: Instruction Fetch, Instruction Decode, Execute (compute jump/branch targets), Memory, WriteBack]

Pipelined Processor [pipelined datapath divided into Fetch, Decode, Execute (compute jump/branch targets), Memory, and WB stages]

A Pipelined Processor [pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the Instruction Fetch, Instruction Decode, Execute, Memory, and WriteBack stages]

Time Graphs [pipeline diagram, cycles 1–9]
add:  IF ID EX MEM WB
nand:    IF ID EX MEM WB
lw:         IF ID EX MEM WB
add:           IF ID EX MEM WB
sw:               IF ID EX MEM WB
Latency? Throughput? Concurrency? CPI?
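
Reading the chart off under ideal conditions: each instruction has a 5-cycle latency; once the pipeline is full, one instruction completes per cycle, so up to 5 instructions are in flight concurrently. Here 5 instructions finish by cycle 9, a measured CPI of 9/5 = 1.8, which approaches the ideal CPI of 1 as the instruction stream grows longer.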

Principles of Pipelined Implementation
• Break the datapath into multiple cycles (here 5)
• Parallel execution increases throughput
• A balanced pipeline is very important
  – Slowest stage determines the clock rate
  – Imbalance kills performance
• Add pipeline registers (flip-flops) for isolation
  – Each stage begins by reading values from its latch
  – Each stage ends by writing values to its latch
• Resolve hazards

A Pipelined Processor [pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the Instruction Fetch, Instruction Decode, Execute, Memory, and WriteBack stages]

Pipeline Stages
Fetch. Functionality: use the PC to index Program Memory; increment the PC. Latched values of interest: instruction bits (to be decoded), PC + 4 (to compute branch targets).
Decode. Functionality: decode the instruction, generate control signals, read the register file. Latched values of interest: control information, Rd index, immediates, offsets, register values (Ra, Rb), PC + 4 (to compute branch targets).
Execute. Functionality: perform the ALU operation; compute targets (PC+4+offset, etc.) in case this is a branch, and decide if the branch is taken. Latched values of interest: control information, Rd index, etc.; result of the ALU operation; register value in case this is a store instruction.
Memory. Functionality: perform the load/store if needed; the address is the ALU result. Latched values of interest: control information, Rd index, etc.; result of the load; pass the result from Execute.
Writeback. Functionality: select the value and write it to the register file.
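
As a minimal sketch of how these latches isolate the stages, the pipeline registers can be modeled as plain records that every stage reads at the start of a cycle and writes at the end. This is illustrative Python, not course-provided code, and the field names are assumptions that simply mirror the latched values listed above:

    from dataclasses import dataclass, field

    # Hypothetical latch layouts mirroring the "latched values of interest" above.
    @dataclass
    class IF_ID:
        inst: int = 0        # fetched instruction bits (to be decoded)
        pc_plus_4: int = 0   # for computing branch targets later

    @dataclass
    class ID_EX:
        ctrl: dict = field(default_factory=dict)  # control signals
        rd: int = 0          # destination register index
        val_a: int = 0       # register value Ra
        val_b: int = 0       # register value Rb
        imm: int = 0         # sign-extended immediate / offset
        pc_plus_4: int = 0

    @dataclass
    class EX_MEM:
        ctrl: dict = field(default_factory=dict)
        rd: int = 0
        alu_result: int = 0  # memory address or final ALU value
        store_val: int = 0   # value of B, in case this is a store

    @dataclass
    class MEM_WB:
        ctrl: dict = field(default_factory=dict)
        rd: int = 0
        result: int = 0      # load result, or the ALU result passed through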

Instruction Fetch (IF) [datapath fragment: the PC indexes instruction memory; a +4 adder computes the new PC]

Stages 1–2: Instruction Fetch and Decode [datapath fragment: IF/ID and ID/EX latches; the register file is read at Ra/Rb (written via WE/Rd), and inst, PC+4, imm, A, B, and ctrl are passed to the rest of the pipeline]

Stages 2–3: Instruction Decode and Execute (EX) [datapath fragment: ID/EX and EX/MEM latches; the ALU computes D and the branch target from A, B, imm, and PC+4, with ctrl passed to the rest of the pipeline]

Stages 3–4: Execute and Memory (MEM) [datapath fragment: EX/MEM and MEM/WB latches; the ALU result D is used as the memory address, B as din, and the memory dout M is latched]

Stages 4–5: Memory and WriteBack (WB) [datapath fragment: the MEM/WB latch carries ctrl, M, and D to the writeback stage]

Putting it all together! [full 5-stage pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]

Takeaway: Pipelining is a powerful technique to mask latencies and increase throughput.
• Logically, instructions execute one at a time
• Physically, instructions execute in parallel – instruction-level parallelism
Abstraction promotes decoupling:
• Interface (ISA) vs. implementation (pipeline)

MIPS is designed for pipelining
• Instructions are the same length
  – 32 bits, easy to fetch and then decode
• 3 types of instruction formats
  – Easy to route bits between stages
  – Can read a register source before even knowing what the instruction is
• Memory access through lw and sw only
  – Access memory after the ALU

Agenda
5-stage Pipeline
• Implementation
• Working Example
Hazards
• Structural
• Data Hazards
• Control Hazards

Example: Sample Code (Simple)
  add  r3, r1, r2
  nand r6, r4, r5
  lw   r4, 20(r2)
  add  r5, r2, r5
  sw   r7, 12(r3)
Assume an 8-register machine.

[detailed pipelined datapath: PC and instruction memory feed IF/ID; the register file (R0–R7) is indexed by instruction bits 11–15 and 16–20 (bits 26–31 are the opcode); ID/EX carries PC+4, regA/regB values, the sign-extended immediate, and dest; EX/MEM carries the ALU result and valB; MEM/WB carries the data-memory output back to the register-file write port]

Example: Start State @ Cycle 0; at time 1, fetch add r3, r1, r2 [pipeline diagram: all pipeline latches hold nops; initial register file R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22; Time: 0]

Agenda
5-stage Pipeline
• Implementation
• Working Example
Hazards
• Structural
• Data Hazards
• Control Hazards

Hazards: correctness problems associated with processor design
1. Structural hazards
   – The same resource is needed for different purposes at the same time (possible: ALU, Register File, Memory)
2. Data hazards
   – An instruction's output is needed before it is available
3. Control hazards
   – The next instruction's PC is unknown at the time of Fetch

Dependences and Hazards
Dependence: relationship between two insns
• Data: two insns use the same storage location
• Control: one insn affects whether another executes at all
• Not a bad thing; programs would be boring otherwise
• Enforced by making the older insn go before the younger one
  – Happens naturally in single-/multi-cycle designs
  – But not in a pipeline
Hazard: dependence & possibility of wrong insn order
• Effects of wrong insn order cannot be externally visible
• Hazards are a bad thing: most solutions either complicate the hardware or reduce performance

Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• the next instructions may read values that are about to be written
  – i.e., an instruction may need values that are still being computed further down the pipeline
  – in fact, this is quite common

Where are the Data Hazards? [pipeline chart, clock cycles 1–9]
  add r3, r1, r2
  sub r5, r3, r4
  lw  r6, 4(r3)
  or  r5, r3, r5
  sw  r6, 12(r3)

Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written, e.g.:
    add r3, r1, r2
    sub r5, r3, r4
How to detect?

Detecting Data Hazards [pipelined datapath with add r3, r1, r2 in EX and sub r5, r3, r4 in ID; detection condition: IF/ID.Ra ≠ 0 && (IF/ID.Ra == ID/Ex.Rd || IF/ID.Ra == Ex/M.Rd || IF/ID.Ra == M/W.Rd), and repeat for Rb]

Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written
How to detect? Logic in the ID stage:
  stall = (IF/ID.Ra != 0 &&
           (IF/ID.Ra == ID/EX.Rd ||
            IF/ID.Ra == EX/M.Rd ||
            IF/ID.Ra == M/WB.Rd))
          || (same for Rb)
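
A small executable sketch of this stall check (illustrative Python; the latch records and their field names ra, rb, and rd are assumptions, not part of the course materials):

    from collections import namedtuple

    # Hypothetical latch records carrying only what the check needs.
    IFID = namedtuple("IFID", "ra rb")     # source registers read in ID
    Latch = namedtuple("Latch", "rd")      # destination register of an older insn

    def detect_stall(if_id, id_ex, ex_mem, mem_wb):
        """Stall while any older in-flight instruction will still write a
        register that the ID-stage instruction wants to read (r0 never hazards)."""
        def hazard_on(src):
            return src != 0 and src in (id_ex.rd, ex_mem.rd, mem_wb.rd)
        return hazard_on(if_id.ra) or hazard_on(if_id.rb)

    # add r3, r1, r2 executing ahead of sub r5, r3, r4 in decode -> must stall.
    print(detect_stall(IFID(ra=3, rb=4), Latch(rd=3), Latch(rd=0), Latch(rd=0)))  # True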

Detecting Data Hazards [pipelined datapath with a detect-hazard unit in ID; instruction sequence in flight: add r3, r1, r2; sub r5, r3, r5; or r6, r3, r4; add r6, r3, r8]

Takeaway: Data hazards occur when an operand (register) depends on the result of a previous instruction that may not have been computed yet. A pipelined processor needs to detect data hazards.

Next Goal: What to do if a data hazard is detected?

Possible Responses to Data Hazards
1. Do Nothing
   • Change the ISA to match the implementation
   • “Hey compiler: don’t create code with data hazards!” (We can do better than this)
2. Stall
   • Pause the current and subsequent instructions until it is safe
3. Forward/bypass
   • Forward the data value to where it is needed (only works if the value actually exists already)

Stalling: how to stall an instruction in the ID stage
• prevent the IF/ID pipeline register update
  – stalls the ID-stage instruction
• convert the ID-stage instruction into a nop for later stages
  – an innocuous “bubble” passes through the pipeline
• prevent the PC update
  – stalls the next (IF-stage) instruction
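
The three actions above can be sketched as one pipeline clock step (illustrative Python; the state dictionary, the instruction encoding as small dicts, and the helper names are all assumptions, and a real datapath does this with write-enable and mux control signals rather than assignments):

    NOP = {"op": "nop", "rd": 0}

    def step(state, program, stall):
        """One clock edge of the pipeline front end.
        On a stall: ID emits a bubble while the PC and IF/ID are held."""
        state["mem_wb"] = state["ex_mem"]   # older instructions keep advancing
        state["ex_mem"] = state["id_ex"]
        if stall:
            state["id_ex"] = NOP            # innocuous bubble for later stages
            # state["if_id"] and state["pc"] are left unchanged (write-enable = 0)
        else:
            state["id_ex"] = state["if_id"]
            idx = state["pc"] // 4
            state["if_id"] = program[idx] if idx < len(program) else NOP
            state["pc"] += 4
        return state

    # Example use with a made-up two-instruction program:
    state = {"pc": 0, "if_id": NOP, "id_ex": NOP, "ex_mem": NOP, "mem_wb": NOP}
    program = [{"op": "add", "rd": 3}, {"op": "sub", "rd": 5}]
    state = step(state, program, stall=False)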

Detecting Data Hazards [datapath: when the detect-hazard unit fires, it sets WE=0 on the PC and the IF/ID register and forces MemWr=0, RegWr=0 (a nop) into ID/EX; in-flight sequence: add r3, r1, r2; sub r5, r3, r5; or r6, r3, r4; add r6, r3, r8]

Stalling [pipeline chart, clock cycles 1–8]
  add r3, r1, r2
  sub r5, r3, r5
  or  r6, r3, r4
  add r6, r3, r8

Stalling [datapath snapshot, first stall cycle, STALL CONDITION MET: add r3, r1, r2 moves into EX while sub r5, r3, r5 and or r6, r3, r4 are held in ID and IF (WE=0 on the PC and IF/ID) and a nop (MemWr=0, RegWr=0) is injected into ID/EX; stall condition: IF/ID.rA ≠ 0 && (IF/ID.rA == ID/Ex.Rd || IF/ID.rA == Ex/M.Rd || IF/ID.rA == M/W.Rd)]

Stalling [datapath snapshot, second stall cycle: add r3, r1, r2 is now in MEM with a nop bubble behind it; the stall condition is still met, so sub r5, r3, r5 and or r6, r3, r4 remain held and another nop is injected]

Stalling [datapath snapshot, third stall cycle: add r3, r1, r2 reaches WB with two nop bubbles (MemWr=0, RegWr=0) behind it; the stall condition is met one last time, after which sub r5, r3, r5 and or r6, r3, r4 can proceed]

Stalling [pipeline chart, clock cycles 1–8: r3 = 10 before add r3, r1, r2 writes back, r3 = 20 afterward]
  add r3, r1, r2
  sub r5, r3, r5
  or  r6, r3, r4
  add r6, r3, r8

Stalling: how to stall an instruction in the ID stage
• prevent the IF/ID pipeline register update
  – stalls the ID-stage instruction
• convert the ID-stage instruction into a nop for later stages
  – an innocuous “bubble” passes through the pipeline
• prevent the PC update
  – stalls the next (IF-stage) instruction

Takeaway: Data hazards occur when an operand (register) depends on the result of a previous instruction that may not have been computed yet. A pipelined processor needs to detect data hazards. Stalling, i.e., preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs (“bubbles”) into the pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID registers from changing, and (3) preventing writes to memory and the register file. Bubbles in the pipeline significantly decrease performance.

Possible Responses to Data Hazards
1. Do Nothing
   • Change the ISA to match the implementation
   • “Compiler: don’t create code with data hazards!” (Nice try, we can do better than this)
2. Stall
   • Pause the current and subsequent instructions until it is safe
3. Forward/bypass
   • Forward the data value to where it is needed (only works if the value actually exists already)

Forwarding bypasses some pipeline stages by forwarding a result to a dependent instruction's operand (register). Three types of forwarding/bypass:
• Forwarding from the Ex/Mem registers to the Ex stage (M → Ex)
• Forwarding from the Mem/WB register to the Ex stage (W → Ex)
• Register File Bypass

Add the Forwarding Datapath [datapath with a forward unit next to the detect-hazard unit, feeding the ALU inputs from the Ex/Mem and Mem/WB latches]

Forwarding Datapath [same diagram]. Three types of forwarding/bypass:
• Forwarding from the Ex/Mem registers to the Ex stage (M → Ex)
• Forwarding from the Mem/WB register to the Ex stage (W → Ex)
• Register File Bypass

Forwarding Datapath 1: Ex/MEM → EX [diagram: add r3, r1, r2 in MEM, sub r5, r3, r1 in EX]
Problem: EX needs an ALU result that is still in the MEM stage
Solution: add a bypass from EX/MEM.D to the start of EX

Forwarding Datapath 1: Ex/MEM → EX [same diagram]
Detection logic in the Ex stage:
  forward = (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd)
            || (same for Rb)

Forwarding Datapath 2: Mem/WB → EX [diagram: add r3, r1, r2 in WB, sub r5, r3, r1 in MEM, or r6, r3, r4 in EX]
Problem: EX needs the value being written by WB
Solution: add a bypass from the WB final value to the start of EX

Forwarding Datapath 2: Mem/WB → EX [same diagram]
Detection logic:
  forward = (M/WB.WE && M/WB.Rd != 0 && ID/Ex.Ra == M/WB.Rd
             && not (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd))
            || (same for Rb)
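
A short sketch of how the two forwarding checks combine (illustrative Python; the Latch record and its we/rd fields are assumptions), giving priority to the younger result in Ex/Mem over the older one in Mem/WB:

    from collections import namedtuple

    Latch = namedtuple("Latch", "we rd")   # write-enable and destination register

    def forward_source(src, ex_mem, mem_wb):
        """Pick the EX-stage operand source: the Ex/Mem bypass has priority over
        the Mem/WB bypass (it holds the younger result), and r0 never forwards."""
        if src != 0 and ex_mem.we and src == ex_mem.rd:
            return "EX/MEM"
        if src != 0 and mem_wb.we and src == mem_wb.rd:
            return "MEM/WB"
        return "REGFILE"                   # no forwarding needed

    # An insn writing r3 sits in MEM while an older one writing r3 sits in WB:
    print(forward_source(3, Latch(we=True, rd=3), Latch(we=True, rd=3)))  # EX/MEM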

Register File Bypass [datapath: add r3, r1, r2 is in WB while add r6, r3, r8 is in ID; in-flight sequence: add r3, r1, r2; sub r5, r3, r1; or r6, r3, r4; add r6, r3, r8]

Register File Bypass [same diagram]
Problem: reading a value that is currently being written
Solution: just negate the register file clock
• writes happen at the end of the first half of each clock cycle
• reads happen during the second half of each clock cycle
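
A toy model of this write-before-read behavior (illustrative Python only; a real register file does this with clocking, not method calls): the write lands in the first half of the cycle, so reads in the same cycle already observe the new value.

    class RegisterFile:
        def __init__(self, n=8):
            self.regs = [0] * n

        def clock_cycle(self, write_idx=None, write_val=0, read_idxs=()):
            """One cycle: the write happens in the first half, so reads in the
            second half of the same cycle already see the new value."""
            if write_idx:                  # r0 (index 0) stays hard-wired to zero
                self.regs[write_idx] = write_val
            return [self.regs[i] for i in read_idxs]

    rf = RegisterFile()
    # add r3, ... writes r3 = 20 in WB while a younger insn reads r3 in ID:
    print(rf.clock_cycle(write_idx=3, write_val=20, read_idxs=(3,)))  # [20]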

Forwarding Example 2 [pipeline chart, clock cycles 1–8]
  add r3, r1, r2
  sub r5, r3, r4
  lw  r6, 4(r3)
  or  r5, r3, r5
  sw  r6, 12(r3)

Forwarding Example 2 [pipeline chart, clock cycles 1–8, partially filled in]
  add r3, r1, r2: IF ID Ex M W
  sub r5, r3, r4:    IF ID Ex M W
  lw  r6, 4(r3)
  or  r5, r3, r5
  sw  r6, 12(r3)

Load-Use Hazard Explained [diagram: lw r4, 20(r8) in MEM, or r6, r3, r4 in EX]
Data dependency after a load instruction:
• the value is not available until after the M stage
• the next instruction cannot proceed if it is dependent
THE KILLER HAZARD

Load-Use Stall [datapath: lw r4, 20(r8) followed by the dependent or r6, r3, r4, which incurs a one-cycle load-use stall]

Load-Use Stall (1) [pipeline chart: lw r4, 20(r8) proceeds IF, ID, Ex, …; the dependent or r6, r3, r4 is fetched next but must stall one cycle in ID (the load-use stall)]

Load-Use Detection [datapath with the detect-hazard and forward units in ID]
  Stall = ID/Ex.MemRead && IF/ID.Ra == ID/Ex.Rd (and likewise for Rb)
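
A sketch of the load-use check (illustrative Python; the field names are assumptions): stall when the instruction in EX is a load whose destination matches either source register of the instruction in ID.

    from collections import namedtuple

    IFID = namedtuple("IFID", "ra rb")          # sources read in ID
    IDEX = namedtuple("IDEX", "mem_read rd")    # is the EX insn a load, and its dest

    def load_use_stall(if_id, id_ex):
        """Stall one cycle: a load's value exists only after its MEM stage, so an
        immediately dependent instruction cannot be forwarded to in time for EX."""
        return id_ex.mem_read and id_ex.rd != 0 and id_ex.rd in (if_id.ra, if_id.rb)

    # lw r4, 20(r8) in EX with or r6, r3, r4 right behind it in ID -> stall.
    print(load_use_stall(IFID(ra=3, rb=4), IDEX(mem_read=True, rd=4)))  # True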

Resolving Load-Use Hazards: two MIPS solutions
• MIPS 2000/3000: delay slot
  – The ISA says results of loads are not available until one cycle later
  – The assembler inserts a nop, or reorders instructions to fill the delay slot
• MIPS 4000 onwards: stall
  – But really, the programmer/compiler reorders code to avoid stalling in the load delay slot

Takeaway: Data hazards occur when an operand (register) depends on the result of a previous instruction that may not have been computed yet. A pipelined processor needs to detect data hazards. Stalling, i.e., preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID registers from changing, and (3) preventing writes to memory and the register file. Bubbles (nops) in the pipeline significantly decrease performance. Forwarding bypasses some pipeline stages by forwarding a result to a dependent instruction's operand (register). It gives better performance than stalling.

Quiz: Find all hazards, and say how they are resolved:
  add  r3, r1, r2
  nand r5, r3, r4
  add  r2, r6, r3
  lw   r6, 24(r3)
  sw   r6, 12(r2)

Data Hazard Recap
Delay Slot(s)
• Modify the ISA to match the implementation
Stall
• Pause the current and all subsequent instructions
Forward/Bypass
• Try to steal the correct value from elsewhere in the pipeline
• Otherwise, fall back to stalling or require a delay slot
Tradeoffs?

Agenda
5-stage Pipeline
• Implementation
• Working Example
Hazards
• Structural
• Data Hazards
• Control Hazards

A Bit of Context
  i = 0;
  do { n += 2; i++; } while (i < max);
  i = 7;
  n--;
Assume: i → r1, n → r2, max → r3
  x10        addi r1, r0, 0    # i = 0
  x14  Loop: addi r2, 2        # n += 2
  x18        addi r1, 1        # i++
  x1C        blt r1, r3, Loop  # i < max?
  x20        addi r1, r0, 7    # i = 7
  x24        subi r2, 1        # n--

Control Hazards
• instructions are fetched in stage 1 (IF)
• branch and jump decisions occur in stage 3 (EX)
  – so the next PC is not known until 2 cycles after the branch/jump
  x1C  blt r1, r3, Loop
  x20  addi r1, r0, 7    Branch not taken?
  x24  subi r2, 1        Branch taken?

Zap & Flush [datapath: blt r1, r3, L (at 1C) decides the branch in EX while addi r1, r0, 7 (at 20) and subi r2, 1 (at 24) are already in the pipeline behind it; if the branch is taken, zap them and set New PC = 14 (L: addi r2, 2)]
• prevent PC update
• clear IF/ID latch
• branch continues
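
A rough sketch of the zap (illustrative Python, reusing the hypothetical pipeline-state dictionary from the stalling sketch earlier): when EX resolves a taken branch, the two wrong-path instructions behind it are squashed and the PC is redirected.

    NOP = {"op": "nop", "rd": 0}

    def flush_on_taken_branch(state, taken, target):
        """Apply at the clock edge after EX resolves a branch (the branch itself
        has already advanced into EX/MEM): squash the two wrong-path instructions
        behind it and redirect the PC."""
        if taken:
            state["id_ex"] = NOP   # zap the instruction that was being decoded
            state["if_id"] = NOP   # zap the instruction that was being fetched
            state["pc"] = target   # e.g. 0x14 to continue at the loop in the example above
        return state               # if not taken, the fall-through path was correct

    state = {"pc": 0x20, "if_id": {"op": "addi"}, "id_ex": {"op": "subi"}}
    print(hex(flush_on_taken_branch(state, taken=True, target=0x14)["pc"]))  # 0x14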

Reducing the cost of control hazards
1. Delay Slot
   • You MUST do this (MIPS ISA: the 1 insn after a ctrl insn is always executed)
   • Whether the branch is taken or not
2. Resolve Branch at Decode
   • Some groups do this for Project 3, your choice
   • Move the branch calc from EX to ID
   • Alternative: just zap the 2nd instruction when the branch is taken
3. Branch Prediction
   • Not in 3410, but every processor worth anything does this (no offense!)

Problem: Zapping 2 insns/branch [datapath: blt r1, r3, Loop (at 1C) decides the branch in EX while addi r1, r0, 7 (at 20) and subi r2, 1 (at 24) are already in the pipeline behind it]

Solution #1: Delay Slot
  i = 0;
  do { n += 2; i++; } while (i < max);
  i = 7;
  n--;
Assume: i → r1, n → r2, max → r3
  x10        addi r1, r0, 0    # i = 0
  x14  Loop: addi r2, 2        # n += 2
  x18        addi r1, 1        # i++
  x1C        blt r1, r3, Loop  # i < max?
  x20        nop               # delay slot
  x24        addi r1, r0, 7    # i = 7
  x28        subi r2, 1        # n--

Delay Slot in Action [datapath: blt r1, r3, Loop (at 1C) decides the branch in EX while the delay-slot nop (at 20) and addi r1, r0, 7 (at 24) follow behind it]

Soln #2: Resolve Branches @ Decode [datapath: the branch calc and decide-branch logic are moved into the ID stage; blt r1, r3, Loop resolves in decode while the delay-slot nop follows it, and the next fetch is Loop: addi r2, 2 at 14]

Optimization: Fill the Delay Slot
Before:
  x10        addi r1, r0, 0    # i = 0
  x14  Loop: addi r2, 2        # n += 2
  x18        addi r1, 1        # i++
  x1C        blt r1, r3, Loop  # i < max?
  x20        nop
The compiler transforms the code:
  x10        addi r1, r0, 0    # i = 0
  x14  Loop: addi r1, 1        # i++
  x18        blt r1, r3, Loop  # i < max?
  x1C        addi r2, 2        # n += 2 (fills the delay slot)

Optimization In Action! [datapath: blt r1, r3, Loop resolves at decode, addi r2, 2 fills the delay slot behind it, and the next instruction fetched is Loop: addi r1, 1]

Branch Prediction: most processors support Speculative Execution
• Guess the direction of the branch
  – Allow instructions to move through the pipeline
  – Zap them later if the guess turns out to be wrong
• A must for long pipelines

Data Hazard Takeaways: Data hazards occur when an operand (register) depends on the result of a previous instruction that may not have been computed yet. Pipelined processors need to detect data hazards. Stalling, i.e., preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID registers from changing, and (3) preventing writes to memory and the register file. Nops significantly decrease performance. Forwarding bypasses some pipeline stages by forwarding a result to a dependent instruction's operand (register). It gives better performance than stalling.

Control Hazard Takeaways: Control hazards occur because the PC following a control instruction is not known until the control instruction has executed. If the branch is taken, we need to zap instructions: a 1-cycle performance penalty. Delay slots can potentially increase performance in the face of control hazards. The instruction in the delay slot will always be executed, and it requires software (the compiler) to make use of the delay slot: put a nop in the delay slot if no useful instruction can fill it. We can reduce the cost of a control hazard by moving the branch decision and calculation from the Ex stage to the ID stage. Combined with a delay slot, this removes the need to flush instructions on taken branches.