Stalling
§ The easiest solution is to stall the pipeline.
§ We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes called a bubble.
[Pipeline diagram, clock cycles 1 to 7: lw $2, 20($3) followed by and $12, $2, $5, with the AND delayed by one bubble]
§ Notice that we’re still using forwarding in cycle 5, to get data from the MEM/WB pipeline register to the ALU.

Stalling and forwarding
§ Without forwarding, we’d have to stall for two cycles to wait for the LW instruction’s writeback stage.
[Pipeline diagram, clock cycles 1 to 8: lw $2, 20($3) followed by and $12, $2, $5, with the AND delayed by two bubbles]
§ In general, you can always stall to avoid hazards, but dependencies are very common in real code, and frequent stalls can reduce performance significantly.

Load-Use Hazard Detection
• Check when the using instruction is decoded in the ID stage
• ALU operand register numbers in the ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
  - ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert a bubble
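The load-use condition above can be written as a small predicate. This is a minimal sketch, not the hardware itself: the class names `IDEX` and `IFID` and the function `load_use_hazard` are hypothetical, chosen only to mirror the pipeline register fields named on the slide.

```python
from dataclasses import dataclass


@dataclass
class IDEX:
    mem_read: bool  # ID/EX.MemRead: the instruction now in EX is a load
    rt: int         # ID/EX.RegisterRt: the register the load will write


@dataclass
class IFID:
    rs: int         # IF/ID.RegisterRs: first source of the instruction in ID
    rt: int         # IF/ID.RegisterRt: second source of the instruction in ID


def load_use_hazard(id_ex: IDEX, if_id: IFID) -> bool:
    """True when the instruction in ID reads the register a load in EX will write."""
    return id_ex.mem_read and (id_ex.rt == if_id.rs or id_ex.rt == if_id.rt)


# A load into $2 is in EX while an instruction reading $2 and $5 is in ID.
print(load_use_hazard(IDEX(mem_read=True, rt=2), IFID(rs=2, rt=5)))
# The same registers, but the instruction in EX is not a load.
print(load_use_hazard(IDEX(mem_read=False, rt=2), IFID(rs=2, rt=5)))
```

The first call reports a hazard and the second does not, because a non-load result can be forwarded from EX without stalling.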

How to Stall the Pipeline
• Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
  - Can subsequently forward to EX stage
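The two stall actions above (zero the ID/EX control values, hold PC and IF/ID) can be sketched as one clock-edge update. This is an illustrative model under simplifying assumptions: the function `next_state`, the `BUBBLE` constant, and the three-signal control word are hypothetical stand-ins for the real control lines.

```python
# All-zero control word: with RegWrite, MemRead and MemWrite at 0 the
# instruction changes no state, so it behaves as a nop (a bubble).
BUBBLE = {"RegWrite": 0, "MemRead": 0, "MemWrite": 0}


def next_state(pc, if_id, fetched, decoded_control, stall):
    """Return (next_pc, next_if_id, next_id_ex_control) for one clock edge.

    pc              -- address of the instruction being fetched this cycle
    if_id           -- instruction currently held in IF/ID (being decoded)
    fetched         -- instruction just read from instruction memory
    decoded_control -- control word produced by decoding if_id
    stall           -- output of the hazard detection unit
    """
    if stall:
        # Hold PC and IF/ID so the same instructions are fetched and decoded
        # again next cycle, and send a bubble down the pipeline.
        return pc, if_id, dict(BUBBLE)
    # Normal operation: advance the PC and latch the new instruction.
    return pc + 4, fetched, decoded_control


control = {"RegWrite": 1, "MemRead": 0, "MemWrite": 0}
print(next_state(8, "and", "or", control, stall=True))
print(next_state(8, "and", "or", control, stall=False))
```

With `stall=True` the PC and IF/ID contents come back unchanged and the control word is all zeros; with `stall=False` the pipeline advances normally.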

Stalling delays the entire pipeline
§ If we delay the second instruction, we’ll have to delay the third one too
  - This is necessary to make forwarding work between AND and OR
  - It also prevents problems such as two instructions trying to write to the same register in the same cycle
[Pipeline diagram, clock cycles 1 to 8: lw $2, 20($3); and $12, $2, $5; or $13, $12, $2, with the AND and the OR each delayed by one cycle]

What about EX, MEM, WB
§ But what about the ALU during cycle 4, the data memory in cycle 5, and the register file write in cycle 6?
[Pipeline diagram, clock cycles 1 to 8: lw $2, 20($3); and $12, $2, $5; or $13, $12, $2, with the stalled stages repeated]
§ Those units aren’t used in those cycles because of the stall, so we can set the EX, MEM and WB control signals to all 0s.

Detecting Stalls, cont.
[Pipeline diagram: lw $2, 20($3) followed by and $12, $2, $5, with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers marked]
§ When should stalls be detected?
  - In the EX stage of the instruction causing the stall (the load), which is the cycle in which the using instruction is in ID
§ What is the stall condition?
  - if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt)) then stall

Adding hazard detection to the CPU
[Datapath figure: the Hazard Unit takes ID/EX.MemRead, ID/EX.RegisterRt and the Rs and Rt fields from IF/ID as inputs, and drives PCWrite, IF/ID Write and a multiplexer that can replace the Control outputs with 0. The Forwarding Unit takes EX/MEM.RegisterRd and MEM/WB.RegisterRd as inputs.]

Stalls and Performance
§ Stalls reduce performance
  - But are required to get correct results
§ Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction
Ex: C code for A = B + E; C = B + F;

Original order (13 cycles):
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
stall   add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        lw   $t4, 8($t0)
stall   add  $t5, $t1, $t4
        sw   $t5, 16($t0)

Reordered (11 cycles):
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
        lw   $t4, 8($t0)
        add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        add  $t5, $t1, $t4
        sw   $t5, 16($t0)
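The 13-cycle and 11-cycle figures can be checked with a small counting model. This is a sketch under stated assumptions: a five-stage pipeline with forwarding, one stall whenever an instruction reads the destination of the load immediately before it, and no other hazards. The function `count_cycles` and the tuple encoding `(opcode, destination, sources)` are hypothetical, and the two instruction lists are my reading of the orderings for A = B + E; C = B + F.

```python
def count_cycles(program):
    """Cycles to run `program` on a 5-stage pipeline with forwarding.

    Each instruction is (opcode, destination, sources). The only stall
    modelled is the one-cycle load-use stall.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    # The first instruction takes 5 cycles; each later one adds 1 cycle.
    return len(program) + 4 + stalls


original = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),   # uses $t2 right after its load
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),   # uses $t4 right after its load
    ("sw", None, ("$t5", "$t0")),
]

reordered = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

print(count_cycles(original), count_cycles(reordered))  # 13 11
```

Moving the third load up puts an independent instruction between each load and its first use, which removes both stalls.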

Branches in the original pipelined datapath
When are they resolved?
[Datapath figure: the original pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB registers, the PCSrc multiplexer, the branch target adder (Shift left 2, Add) and the ALU Zero output]

Branch Hazards
If branch outcome determined in MEM: flush the instructions fetched after the branch (set their control values to 0)

Reducing Branch Delay
Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
Example: branch taken
    36: sub $10, $4, $8
    40: beq $1, $3, 7
    44: and $12, $2, $5
    48: or  $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
    ...
    72: lw  $4, 50($7)
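The target of the beq at address 40 follows from the standard MIPS rule that the 16-bit offset counts words relative to the instruction after the branch. A minimal sketch of the arithmetic the target address adder performs; the function name `branch_target` is hypothetical.

```python
def branch_target(branch_pc: int, offset_words: int) -> int:
    """MIPS branch target: (PC + 4) plus the word offset shifted left by 2."""
    return branch_pc + 4 + (offset_words << 2)


# beq at address 40 with offset 7 lands on the lw at address 72.
print(branch_target(40, 7))  # 72
```

With the comparator and adder in ID, the taken branch is known one cycle after it is fetched, so only the instruction at address 44 has to be flushed.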

Example: Branch Taken

Data Hazards for Branches
If a comparison register is a destination of the 2nd or 3rd preceding ALU instruction
  - Can resolve using forwarding

    add $1, $2, $3        IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    …                             IF  ID  EX  MEM WB
    beq $1, $4, target                IF  ID  EX  MEM WB

Data Hazards for Branches
If a comparison register is a destination of the preceding ALU instruction or the 2nd preceding load instruction
  - Need 1 stall cycle

    lw  $1, addr          IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    beq stalled                   IF  ID
    beq $1, $4, target                ID  EX  MEM WB

Data Hazards for Branches
If a comparison register is a destination of the immediately preceding load instruction
  - Need 2 stall cycles

    lw  $1, addr          IF  ID  EX  MEM WB
    beq stalled               IF  ID
    beq stalled                   ID
    beq $1, $0, target                ID  EX  MEM WB
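The three branch data hazard cases reduce to one timing rule once the branch is resolved in ID: the compared value must already exist when the branch enters ID. A sketch under these assumptions: an ALU result is available after the producer's EX stage (its cycle 3), a loaded value after its MEM stage (its cycle 4), and forwarding delivers it to ID in the following cycle. The function name `branch_stall_cycles` and its parameters are hypothetical.

```python
def branch_stall_cycles(producer: str, distance: int) -> int:
    """Stall cycles for a branch resolved in ID.

    producer -- "alu" or "load": the instruction writing the compared register
    distance -- how many instructions before the branch it sits (1 = immediately before)
    """
    # Cycle, counted from the producer's IF as cycle 1, after which its result exists.
    ready_after = {"alu": 3, "load": 4}[producer]
    # Without stalls the branch is fetched in cycle 1 + distance and decoded one cycle later.
    branch_id_cycle = distance + 2
    return max(0, ready_after + 1 - branch_id_cycle)


for producer, distance in [("alu", 3), ("alu", 2), ("alu", 1), ("load", 2), ("load", 1)]:
    print(producer, distance, branch_stall_cycles(producer, distance))
```

The loop prints 0, 0, 1, 1 and 2 stall cycles, matching the forwarding, one-stall and two-stall cases above.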

Branch Prediction
• Longer pipelines can’t readily determine branch outcome early
  - Stall penalty becomes unacceptable
• Predict (i.e., guess) outcome of branch
  - Only stall if prediction is wrong
• Simplest prediction strategy
  - Predict branches not taken
  - Works well for loops if the loop tests are done at the start
  - Fetch instruction after branch, with no delay

Dynamic Branch Prediction
§ In deeper and superscalar pipelines, branch penalty is more significant
§ Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
      - Check table, expect the same outcome
      - Start fetching from fall-through or target
      - If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
Inner loop branches mispredicted twice!

    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer

§ Mispredict as taken on last iteration of inner loop
§ Then mispredict as not taken on first iteration of inner loop next time around
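The double misprediction is easy to reproduce by replaying the inner branch's outcomes through a one-bit predictor. A minimal sketch: the function name `one_bit_mispredictions`, the four-iteration inner loop, and the initial not-taken state are all assumptions made for illustration.

```python
def one_bit_mispredictions(outcomes, initial=False):
    """Count mispredictions of a 1-bit predictor that repeats the last outcome."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
            prediction = taken  # flip the single history bit
    return misses


# Inner loop of 4 iterations: the backward branch is taken 3 times, then falls through.
inner_pass = [True, True, True, False]
# The outer loop runs the inner loop 3 times.
print(one_bit_mispredictions(inner_pass * 3))  # 6
```

Each pass through the inner loop costs two mispredictions: one on entry, because the bit still remembers the previous exit, and one on exit.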

2-Bit Predictor
Only change prediction on two successive mispredictions
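One common way to get this behaviour is a two-bit saturating counter, sketched below. This is an assumption about the encoding, not necessarily the exact state machine on the slide: counter values 0 and 1 predict not taken, 2 and 3 predict taken, and from a saturated state (0 or 3) it takes two successive mispredictions to change the prediction. The function name `two_bit_mispredictions` is hypothetical.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating counter (0..3, taken if >= 2)."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if taken != predicted_taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Loop branch taken 3 times then not taken, with the loop executed 3 times.
loop_pass = [True, True, True, False]
print(two_bit_mispredictions(loop_pass * 3))  # 3
```

Starting from the strongly-taken state, only the loop exit is mispredicted on each pass, because a single not-taken outcome moves the counter from 3 to 2 without changing the prediction.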

Calculating the Branch Target
§ Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
§ Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
  - If hit and instruction is branch predicted taken, can fetch target immediately
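The lookup described above can be sketched with a dictionary standing in for the cache. This is a simplified model: a real branch target buffer has a fixed number of entries indexed by low-order PC bits and checked with tags, and the class name `BranchTargetBuffer` and its methods are hypothetical.

```python
class BranchTargetBuffer:
    """Toy branch target buffer: maps a branch's PC to its last computed target."""

    def __init__(self):
        self.targets = {}

    def record(self, branch_pc: int, target: int) -> None:
        """Store the target once the branch has actually computed it."""
        self.targets[branch_pc] = target

    def next_fetch(self, pc: int, predict_taken: bool) -> int:
        """Address to fetch after the instruction at `pc`."""
        target = self.targets.get(pc)
        if target is not None and predict_taken:
            return target   # hit and predicted taken: fetch the target immediately
        return pc + 4       # miss or predicted not taken: fall through


btb = BranchTargetBuffer()
print(btb.next_fetch(40, predict_taken=True))   # 44: no entry yet
btb.record(40, 72)
print(btb.next_fetch(40, predict_taken=True))   # 72
print(btb.next_fetch(40, predict_taken=False))  # 44
```

The first lookup misses and falls through to the next instruction; once the target is recorded, a predicted-taken branch at the same PC can redirect fetch without waiting for the target adder.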

Concluding Remarks
§ ISA influences design of datapath and control
§ Datapath and control influence design of ISA
§ Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
§ Hazards: structural, data, control
§ Main additions in hardware:
  - forwarding unit
  - hazard detection and stalling
  - branch predictor
  - branch target table