Stalling
• The easiest solution is to stall the pipeline.
• We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes called a bubble.

    lw  $2, 20($3)
    and $12, $2, $5

[Figure: pipeline diagram of the two instructions over clock cycles 1-7, with a one-cycle bubble delaying the AND]
• Notice that we're still using forwarding in cycle 5, to get data from the MEM/WB pipeline register to the ALU.

Stalling and forwarding
• Without forwarding, we'd have to stall for two cycles to wait for the LW instruction's writeback stage.

    lw  $2, 20($3)
    and $12, $2, $5

[Figure: pipeline diagram of the two instructions over clock cycles 1-8, with a two-cycle stall before the AND]
• In general, you can always stall to avoid hazards, but dependencies are very common in real code, and frequent stalling can reduce performance by a significant amount.

Load-Use Hazard Detection
• Check when the using instruction is decoded in the ID stage
• ALU operand register numbers in the ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
  - ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert a bubble
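The load-use condition above can be written as a small predicate. This is only a sketch: the `IDEX` and `IFID` classes and their field names are hypothetical stand-ins for the pipeline-register fields named on the slide, not part of any real simulator.

```python
from dataclasses import dataclass


@dataclass
class IDEX:
    """Hypothetical model of the ID/EX fields the hazard check reads."""
    mem_read: bool      # ID/EX.MemRead: instruction in EX is a load
    register_rt: int    # ID/EX.RegisterRt: the load's destination register


@dataclass
class IFID:
    """Hypothetical model of the IF/ID fields the hazard check reads."""
    register_rs: int    # IF/ID.RegisterRs
    register_rt: int    # IF/ID.RegisterRt


def load_use_hazard(id_ex: IDEX, if_id: IFID) -> bool:
    """True when the instruction in ID reads the register a load in EX will write."""
    return id_ex.mem_read and (
        id_ex.register_rt == if_id.register_rs
        or id_ex.register_rt == if_id.register_rt
    )


# lw $2, 20($3) is in EX while and $12, $2, $5 is in ID.
lw_in_ex = IDEX(mem_read=True, register_rt=2)
and_in_id = IFID(register_rs=2, register_rt=5)
hazard = load_use_hazard(lw_in_ex, and_in_id)
```

For the `lw`/`and` pair used throughout these slides, `hazard` is true because the load's Rt ($2) matches the AND's Rs.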

How to Stall the Pipeline
• Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
  - Can subsequently forward to EX stage
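The two stall actions above (zero the ID/EX control values, hold PC and IF/ID) can be sketched as the outputs of a hazard unit. The function name, the dictionary keys, and the control-signal names in the example are illustrative assumptions, not a description of a specific implementation.

```python
def hazard_unit_outputs(hazard: bool, decoded_control: dict) -> dict:
    """Return the (hypothetical) hazard-unit outputs for one clock cycle.

    On a stall: PC and IF/ID keep their values, so the using instruction is
    decoded again and the following instruction is fetched again, and every
    control value entering ID/EX is forced to 0, so EX, MEM and WB do a nop.
    """
    if hazard:
        return {
            "pc_write": False,
            "if_id_write": False,
            "id_ex_control": {name: 0 for name in decoded_control},
        }
    return {
        "pc_write": True,
        "if_id_write": True,
        "id_ex_control": dict(decoded_control),
    }


# Control values decoded for an R-type instruction (names are illustrative).
decoded = {"RegWrite": 1, "MemRead": 0, "MemWrite": 0, "MemToReg": 0}
stalled = hazard_unit_outputs(True, decoded)
normal = hazard_unit_outputs(False, decoded)
```

In the stalled case the bubble is nothing more than an instruction whose control signals are all 0, which is why no separate nop opcode is needed.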

Stalling delays the entire pipeline
• If we delay the second instruction, we'll have to delay the third one too
  - This is necessary to make forwarding work between AND and OR
  - It also prevents problems such as two instructions trying to write to the same register in the same cycle

    lw  $2, 20($3)
    and $12, $2, $5
    or  $13, $12, $2

[Figure: pipeline diagram of the three instructions over clock cycles 1-8, with the AND and the OR each delayed by one cycle]

What about EX, MEM, WB
• But what about the ALU during cycle 4, the data memory in cycle 5, and the register file write in cycle 6?

    lw  $2, 20($3)
    and $12, $2, $5
    or  $13, $12, $2

[Figure: pipeline diagram of the three instructions over clock cycles 1-8, showing the stalled cycle]
• Those units aren't used in those cycles because of the stall, so we can set the EX, MEM and WB control signals to all 0s.

Detecting Stalls, cont.
• When should stalls be detected?
  - EX stage (of the instruction causing the stall)
• What is the stall condition?
  - if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt)) then stall

    lw  $2, 20($3)
    and $12, $2, $5

[Figure: pipeline diagram of the two instructions with the if/id, id/ex, ex/mem and mem/wb pipeline registers marked]

Adding hazard detection to the CPU
[Figure: pipelined datapath with a Hazard Unit and a Forwarding Unit. Labels at the Hazard Unit: ID/EX.MemRead, ID/EX.RegisterRt, Rs, Rt, PC Write, IF/ID Write, and a multiplexer that selects between the Control outputs and 0. Labels at the Forwarding Unit: EX/MEM.RegisterRd, MEM/WB.RegisterRd.]

Stalls and Performance
• Stalls reduce performance
  - But are required to get correct results
• Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
• Ex: C code for A = B + E; C = B + F;

Original order (13 cycles):

            lw   $t1, 0($t0)
            lw   $t2, 4($t0)
    stall   add  $t3, $t1, $t2
            sw   $t3, 12($t0)
            lw   $t4, 8($t0)
    stall   add  $t5, $t1, $t4
            sw   $t5, 16($t0)

Reordered (11 cycles):

            lw   $t1, 0($t0)
            lw   $t2, 4($t0)
            lw   $t4, 8($t0)
            add  $t3, $t1, $t2
            sw   $t3, 12($t0)
            add  $t5, $t1, $t4
            sw   $t5, 16($t0)
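The 13-cycle and 11-cycle figures can be checked with a short count. The sketch below assumes a 5-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses a load result in the very next slot; the tuple encoding of instructions and the function name are illustrative choices.

```python
# Each instruction is (opcode, destination register or None, source registers).
ORIGINAL = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw",  None,  ["$t5", "$t0"]),
]

REORDERED = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw",  None,  ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw",  None,  ["$t5", "$t0"]),
]


def count_cycles(program, stages=5):
    """Cycles to run `program`: pipeline fill + one per instruction + load-use stalls."""
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, cur_srcs) in zip(program, program[1:]):
        # One bubble when a load's destination is read by the next instruction.
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return len(program) + (stages - 1) + stalls


original_cycles = count_cycles(ORIGINAL)
reordered_cycles = count_cycles(REORDERED)
```

Both orderings contain 7 instructions and need 4 cycles to fill the pipeline; the original order adds 2 load-use stalls, and moving the third `lw` up removes both.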

Branches in the original pipelined datapath
• When are they resolved?
[Figure: the original pipelined datapath, showing the IF/ID, ID/EX, EX/MEM and MEM/WB registers, the PCSrc multiplexer, the branch-target adder with Shift left 2, the register file, the ALU with its Zero output, and the data memory]

Branch Hazards
• If branch outcome determined in MEM:
  - Flush these instructions (set control values to 0)
[Figure: pipeline diagram marking the instructions to be flushed]

Reducing Branch Delay
• Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
• Example: branch taken

    36: sub $10, $4, $8
    40: beq $1, $3, 7
    44: and $12, $2, $5
    48: or  $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
        ...
    72: lw  $4, 50($7)
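The addresses in this example follow from how MIPS encodes `beq`: the 16-bit offset counts words relative to the instruction after the branch. A minimal check of that arithmetic (the function name is illustrative):

```python
def branch_target(pc: int, offset_words: int) -> int:
    """Target of a MIPS branch at byte address `pc` with a word offset."""
    return pc + 4 + (offset_words << 2)


# beq $1, $3, 7 sits at address 40.
target = branch_target(40, 7)
```

Here `target` is 40 + 4 + 7 × 4 = 72, the address of the `lw $4, 50($7)` instruction, so a taken branch skips the instructions at 44 through 68.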

Example: Branch Taken

Data Hazards for Branches
• If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
  - Can resolve using forwarding

    add $1, $2, $3
    add $4, $5, $6
    …
    beq $1, $4, target

[Figure: pipeline timing diagram (IF ID EX MEM WB) for these instructions]

Data Hazards for Branches
• If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  - Need 1 stall cycle

    lw  $1, addr
    add $4, $5, $6
    beq $1, $4, target    (beq stalled for one cycle in ID)

[Figure: pipeline timing diagram (IF ID EX MEM WB) with the beq repeating its ID stage once]

Data Hazards for Branches
• If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles

    lw  $1, addr
    beq $1, $0, target    (beq stalled for two cycles in ID)

[Figure: pipeline timing diagram (IF ID EX MEM WB) with the beq repeating its ID stage twice]
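The three branch cases (no stall, one stall, two stalls) follow one pattern, sketched below. It assumes the branch compares its registers in ID, that forwarding into ID is available, and that an ALU result is ready at the end of EX while a load result is ready at the end of MEM. The function and its `producer`/`distance` parameters are hypothetical names; `distance` is how many instructions before the branch the producer sits.

```python
def branch_stall_cycles(producer: str, distance: int) -> int:
    """Stall cycles a branch resolved in ID needs for one comparison register.

    producer: "alu" or "load", the kind of instruction writing the register.
    distance: 1 for the immediately preceding instruction, 2 for the one
              before that, and so on.
    """
    # Number of instruction slots that must separate producer and branch so
    # the value is available (by forwarding) when the branch reaches ID.
    ready_after = {"alu": 2, "load": 3}[producer]
    return max(0, ready_after - distance)


cases = {
    ("alu", 3): branch_stall_cycles("alu", 3),
    ("alu", 2): branch_stall_cycles("alu", 2),
    ("alu", 1): branch_stall_cycles("alu", 1),
    ("load", 2): branch_stall_cycles("load", 2),
    ("load", 1): branch_stall_cycles("load", 1),
}
```

Under these assumptions the table reproduces the slides: 0 stalls for the 2nd or 3rd preceding ALU instruction, 1 stall for the preceding ALU instruction or the 2nd preceding load, and 2 stalls for the immediately preceding load.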

Branch Prediction
• Longer pipelines can't readily determine branch outcome early
  - Stall penalty becomes unacceptable
• Predict (i.e., guess) outcome of branch
  - Only stall if prediction is wrong
• Simplest prediction strategy
  - Predict branches not taken
  - Works well for loops if the loop tests are done at the start
  - Fetch instruction after branch, with no delay

Dynamic Branch Prediction
• In deeper and superscalar pipelines, branch penalty is more significant
• Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
• Inner loop branches mispredicted twice!

    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer

• Mispredict as taken on last iteration of inner loop
• Then mispredict as not taken on first iteration of inner loop next time around
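The double misprediction can be reproduced with a few lines of simulation. This is a sketch under stated assumptions: the inner-loop branch is taken on every iteration except the last, the predictor starts out predicting taken, and the iteration counts (4 inner, 3 outer) are arbitrary illustrative values.

```python
def inner_loop_outcomes(inner_iters: int, outer_iters: int) -> list:
    """Outcomes of the inner-loop branch: taken on all but the last iteration."""
    one_pass = [True] * (inner_iters - 1) + [False]
    return one_pass * outer_iters


def count_mispredictions_1bit(outcomes, initial_prediction=True) -> int:
    """Simulate a 1-bit predictor: always predict the most recent outcome."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # the single bit records the latest outcome
    return misses


outcomes = inner_loop_outcomes(inner_iters=4, outer_iters=3)
misses_1bit = count_mispredictions_1bit(outcomes)
```

With these values the first pass through the inner loop mispredicts once (the exit), and each later pass mispredicts twice (the first iteration and the exit), for 5 mispredictions in total.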

2-Bit Predictor
• Only change prediction on two successive mispredictions
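One common way to get this behaviour is a 2-bit saturating counter, sketched below. Note that this is a swapped-in variant chosen for brevity: the slide's own state diagram is not reproduced here and may differ in its transitions out of the two middle states. The state encoding (0-1 predict not taken, 2-3 predict taken), the initial state, and the loop sizes are assumptions for illustration.

```python
def count_mispredictions_2bit(outcomes, initial_state=3) -> int:
    """Simulate a 2-bit saturating counter predictor.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    The counter moves up on a taken branch and down on a not-taken one.
    """
    state = initial_state
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# Inner-loop branch: taken three times, then not taken, repeated for 3 passes.
loop_outcomes = [True, True, True, False] * 3
misses_2bit = count_mispredictions_2bit(loop_outcomes)
```

Starting from the strongly-taken state, the single not-taken outcome at each loop exit only weakens the prediction, so the branch is mispredicted once per pass (3 in total) instead of twice.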

Calculating the Branch Target
• Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
• Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
  - If hit and instruction is branch predicted taken, can fetch target immediately
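A branch target buffer can be pictured as a small lookup table keyed by the fetch PC. The sketch below uses an unbounded Python dictionary, which ignores the limited capacity, tags and replacement policy of a real buffer; the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy branch target buffer: maps a branch's PC to its last known target."""

    def __init__(self):
        self.targets = {}

    def update(self, pc: int, target: int) -> None:
        """Record the target once the branch's address has been calculated."""
        self.targets[pc] = target

    def next_fetch(self, pc: int, predicted_taken: bool) -> int:
        """Address to fetch next, decided in the same cycle `pc` is fetched."""
        target = self.targets.get(pc)
        if target is not None and predicted_taken:
            return target       # hit and predicted taken: fetch the target
        return pc + 4           # miss or predicted not taken: fall through


btb = BranchTargetBuffer()
btb.update(pc=40, target=72)    # the beq at address 40 branches to 72
hit_taken = btb.next_fetch(40, predicted_taken=True)
hit_not_taken = btb.next_fetch(40, predicted_taken=False)
miss = btb.next_fetch(44, predicted_taken=True)
```

Only the hit-and-predicted-taken case redirects fetch to the cached target (72 here); the other two cases continue with PC + 4.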

Concluding Remarks
• ISA influences design of datapath and control
• Datapath and control influence design of ISA
• Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
• Hazards: structural, data, control
• Main additions in hardware:
  - forwarding unit
  - hazard detection and stalling
  - branch predictor
  - branch target table