EECS 322 Computer Architecture
Pipeline Control, Data Hazards and Branch Hazards
Instructor: Francis G. Wolff, wolff@eecs.cwru.edu
Case Western Reserve University
This presentation uses PowerPoint animation: please view show

Models
Single-cycle model (non-overlapping)
• Each instruction executes in a single clock cycle
• The clock cycle, and therefore every instruction, must be stretched to fit the slowest instruction (p. 438)
Multi-cycle model (non-overlapping)
• Each instruction executes over multiple clock cycles
• The clock cycle must be stretched to fit the slowest step
• Functional units can be shared within the execution of a single instruction
Pipeline model (overlapping, p. 522)
• Each instruction executes over multiple clock cycles
• The clock cycle must be stretched to fit the slowest step
• Throughput is, for the most part, one instruction per clock cycle
• Gains efficiency by overlapping the execution of multiple instructions, increasing hardware utilization (p. 377)

Recap: Can pipelining get us into trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two different ways at the same time
  • e.g., multiple memory accesses, multiple register writes
  • solutions: multiple memories (separate instruction & data memory), stretch pipeline
– control hazards: attempt to make a decision before the condition is evaluated
  • e.g., any conditional branch
  • solutions: prediction, delayed branch
– data hazards: attempt to use an item before it is ready
  • e.g., add r1, r2, r3; sub r4, r1, r5; lw r6, 0(r7); or r8, r6, r9
  • solutions: forwarding/bypassing, stall/bubble

Review: Single-Cycle Datapath
• 2 adders: PC+4 adder, Branch/Jump offset adder
• Harvard Architecture: Separate instruction and data memory
[Figure: single-cycle datapath with PC, instruction memory, register file, sign extend, shift left 2, ALU, data memory and muxes; control signals RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemRead, ALUctl, MemtoReg]

Review: Multi vs. Single-cycle Processor Datapath
• Combine adders: add 1½ Mux & 3 temp. registers, A, B, ALUOut
• Combine Memory: add 1 Mux & 2 temp. registers, IR, MDR
• Single-cycle = 1 ALU + 2 Mem + 4 Muxes + 2 adders + OpcodeDecoders
• Multi-cycle = 1 ALU + 1 Mem + 5½ Muxes + 5 Reg (IR, A, B, MDR, ALUOut) + FSM
[Figure: multi-cycle datapath with PC, memory, instruction register, memory data register, register file, A, B, ALU and ALUOut; control signals IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, MemtoReg]

Multi-cycle Processor Datapath
• Single-cycle = 1 ALU + 2 Mem + 4 Muxes + 2 adders + OpcodeDecoders
• Multi-cycle = 1 ALU + 1 Mem + 5½ Muxes + 5 Reg (IR, A, B, MDR, ALUOut) + FSM
• 5 x 32 = 160 additional FFs for multi-cycle processor over single-cycle processor
[Figure: multi-cycle datapath, as on the previous slide]

Figure 6.25: Datapath Registers
• 160 FFs + 213 FFs + 16 FFs
• 213 + 16 = 229 additional FFs for pipeline over multi-cycle processor
[Figure: bit widths of the pipeline registers, including 32-bit PC, IR, A, B, ALUOut and sign-extended immediate fields, 5-bit RT and RD fields, and the W, M and EX control fields]

Overhead (chip area vs. speed)
Single-cycle model
• 8 ns Clock (125 MHz), (non-overlapping)
• 1 ALU + 2 adders
• 0 Muxes
• 0 Datapath Register bits (Flip-Flops)
Multi-cycle model
• 2 ns Clock (500 MHz), (non-overlapping)
• 1 ALU + Controller
• 5 Muxes
• 160 Datapath Register bits (Flip-Flops)
Pipeline model
• 2 ns Clock (500 MHz), (overlapping)
• 2 ALU + Controller
• 4 Muxes
• 373 Datapath + 16 Controlpath Register bits (Flip-Flops)

Pipeline Control: Controlpath Register bits (Figure 6.29)
• ID/EX: 9 control bits
• EX/MEM: 5 control bits
• MEM/WB: 2 control bits

Pipeline Control: Controlpath table

Figure 5.20, Single Cycle
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format       1       0        0         1         0        0        0       1       0
lw             0       1        1         1         1        0        0       0       0
sw             X       1        X         0         0        1        0       0       0
beq            X       0        X         0         0        0        1       0       1

Figure 6.28, Pipeline
             ID/EX control lines             EX/MEM control lines       MEM/WB control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format       1       1       0       0       0        0        0         1         0
lw             0       0       0       1       0        1        0         1         1
sw             X       0       0       1       0        0        1         0         X
beq            X       0       1       0       1        0        0         0         X

Pipeline Hazards
Pipeline hazards
• Solution #1 always works (for non-real-time applications): stall, delay & procrastinate!
Structural Hazards (e.g. fetching from the same memory bank)
• Solution #2: partition architecture
Control Hazards (e.g. branching)
• Solution #1: stall! but decreases throughput
• Solution #2: guess and back-track
• Solution #3: delayed decision: delay branch & fill slot
Data Hazards (e.g. register dependencies)
• Worst case situation
• Solution #2: re-order instructions
• Solution #3: forwarding or bypassing: delayed load

Pipeline Datapath and Controlpath (Figure 6.30)

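load inst. (Figure 6.30), Clock 0 to Clock 3
[Figure: the load instruction lw $10, 20($1) stepping through the pipeline. Control bits carried with the load: WB=11, M=010, EX=0001. Values shown in the pipeline registers: PC=0, PC=4, IR=lw $10, 20($1), A=C$1, B=X, S=20, T=$10, D=0, D=$10, ALU=20+C$1, PC=4+20<<2, MDR=Mem[20+C$1]]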

Pipeline single stepping
Contents of Register 1 = C$1 = 3; C$2=4; C$3=4; C$4=6; C$5=7; C$10=8; … Memory[23]=9
Formats: add $rd, $rs=A, $rt=B; lw $rt=B, @($rs=A)
Pipeline register fields:
  <IF/ID>  = <PC, IR>
  <ID/EX>  = <PC, A, B, S, Rt, Rd>
  <EX/MEM> = <PC, Z, ALU, B, R>
  <MEM/WB> = <MDR, ALU, R>
Clock 0: IF/ID <0, ?>; later registers unknown (?)
Clock 1: IF/ID <4, lw $10, 20($1)>; ID/EX <0, ?, ...>; later registers unknown (?)
Clock 2: IF/ID <8, sub $11, $2, $3>; ID/EX <4, C$1=3, C$10=8, 20, $10, 0>; EX/MEM <0, ?, ...>; MEM/WB unknown (?)
Clock 3: IF/ID <12, and $12, $4, $5>; ID/EX <8, C$2=4, C$3=4, X, $3, $11>; EX/MEM <4+(20<<2)=84, 0, 20+3=23, 8, $10>; MEM/WB unknown (?)
Clock 4: IF/ID <16, or $13, $6, $7>; ID/EX <12, C$4=6, C$5=7, X, $5, $12>; EX/MEM <X, 1, 4-4=0, 4, $11>; MEM/WB <Mem[23]=9, 23, $10>
Clock 5: IF/ID <20, add $14, $8, $9>; ID/EX <16, C$6, C$7, X, $7, $13>; EX/MEM <X, 0, 1, 7, $12>; MEM/WB <X, 0, $11>
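The EX/MEM values in the single-stepping trace can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name ex_stage and its argument names are hypothetical, and it models just the two operations (lw and sub) whose results appear in the trace. Register and memory contents are the ones listed on the slide.

```python
# Register and memory contents as listed on the slide.
REGS = {1: 3, 2: 4, 3: 4, 4: 6, 5: 7, 10: 8}
MEM = {23: 9}

def ex_stage(pc, a, b, s, op):
    """Hypothetical helper: return the <PC, Z, ALU> fields written to EX/MEM.

    s is the sign-extended immediate, or None where the slide shows X.
    """
    # Branch target adder: PC + (immediate shifted left by 2).
    branch_pc = None if s is None else pc + (s << 2)
    if op == "lw":
        alu = a + s          # effective address = base register + offset
    elif op == "sub":
        alu = a - b
    else:
        raise ValueError(f"operation not modelled in this sketch: {op}")
    zero = 1 if alu == 0 else 0
    return branch_pc, zero, alu

# lw $10, 20($1) in EX during clock 3; sub $11, $2, $3 in EX during clock 4.
lw_ex = ex_stage(pc=4, a=REGS[1], b=REGS[10], s=20, op="lw")
sub_ex = ex_stage(pc=8, a=REGS[2], b=REGS[3], s=None, op="sub")
loaded = MEM[lw_ex[2]]       # MDR value read in the MEM stage
print(lw_ex, sub_ex, loaded)
```

The lw result corresponds to the EX/MEM entry at clock 3, and the sub result to the entry at clock 4.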

Clock 1: Figure 6.31a
[Figure: pipeline state. IF/ID: PC=4, IR=lw $10, 20($1); previous PC=0]

Clock 2: Figure 6.31b
[Figure: pipeline state. IF/ID: PC=8, IR=sub $11, $2, $3. ID/EX: PC=4, A=C$1, B=X, S=20, T=$10, D=0, holding lw $10, 20($1)]

Clock 3: Figure 6.32a
[Figure: pipeline state. IF/ID: PC=12, IR=and $12, $4, $5. ID/EX: PC=8, A=C$2, B=C$3, S=X, T=$3, D=$11, holding sub $11, $2, $3. EX/MEM: PC=4+20<<2, ALU=20+C$1, D=$10, holding lw $10, 20($1)]

Clock 4: Figure 6.32b
[Figure: pipeline state. IF/ID: PC=16, IR=or $13, $6, $7. ID/EX: PC=12, A=C$4, B=C$5, S=X, D=$12, holding and $12, $4, $5. EX/MEM: PC=X, ALU=C$2-C$3, D=$11. MEM/WB: MDR=Mem[20+C$1], ALU=20+C$1, D=$10]

Data Dependencies: that can be resolved by forwarding (Figure 6.36)
[Figure: pipelined data dependencies. Data hazards: resolved by forwarding. At same time: not a hazard. Forward in time: not a hazard]

Data Hazards: arithmetic (Figure 6.37)
[Figure: forwarding paths. Forwards in time: can be resolved. At same time: not a hazard]

Data Dependencies: no forwarding
Clock             1   2   3   4   5   6   7   8
sub $2, $1, $3    IF  ID  EX  M   WB
and $12, $2, $5       IF  ID* ID* ID  EX  M   WB
(* = stall)
Clock 5: sub writes $2 in the 1st half of the cycle (WB); and reads $2 in the 2nd half (ID).
Suppose every instruction is dependent: 1 + 2 stalls = 3 clocks
MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS

Data Dependencies: no forwarding
A dependent instruction will take 1 + 2 stalls = 3 clocks
An independent instruction will take 1 + 0 stalls = 1 clock
Suppose the instructions are dependent 10% of the time?
Average instruction time = 10%*3 + 90%*1 = 0.10*3 + 0.90*1 = 1.2 clocks
MIPS = Clock / CPI = 500 MHz / 1.2 = 417 MIPS (10% dependency)
MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS (100% dependency)
MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS (0% dependency)
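The three MIPS figures above follow from one formula. As a minimal sketch, assuming every dependent instruction costs exactly 2 stall cycles (the no-forwarding case on the slide), the following reproduces them; the function names average_cpi and mips are illustrative and not from the course material.

```python
def average_cpi(dependent_fraction, stalls=2):
    """Average clocks per instruction when a fraction of instructions stall."""
    dependent_time = 1 + stalls      # 1 clock plus the stall cycles
    independent_time = 1
    return (dependent_fraction * dependent_time
            + (1 - dependent_fraction) * independent_time)

def mips(clock_mhz, cpi):
    """MIPS = clock rate (MHz) / CPI."""
    return clock_mhz / cpi

for fraction in (0.0, 0.10, 1.0):
    cpi = average_cpi(fraction)
    print(f"{fraction:.0%} dependency: CPI = {cpi:.2f}, "
          f"MIPS = {mips(500, cpi):.0f}")
```

Setting stalls=0 models the forwarding case, where the CPI stays at 1 regardless of the dependency fraction.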

Data Dependencies: with forwarding
Clock             1   2   3   4   5   6
sub $2, $1, $3    IF  ID  EX  M   WB
and $12, $2, $5       IF  ID  EX  M   WB
Detected Data Hazard 1a: ID/EX.$rs = EX/MEM.$rd
Suppose every instruction is dependent: 1 + 0 stalls = 1 clock
MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS
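The timing diagrams with and without forwarding differ only in how many times the dependent instruction repeats its ID stage. The sketch below, with hypothetical helper names schedule and render, builds the clock-to-stage mapping for one instruction under that assumption.

```python
STAGES = ["IF", "ID", "EX", "M", "WB"]

def schedule(issue_clock, stalls):
    """Map each clock number to the stage an instruction occupies.

    Stalls are modelled as extra clocks spent in ID, as in the slides.
    """
    timeline = {}
    clock = issue_clock
    for stage in STAGES:
        repeats = 1 + stalls if stage == "ID" else 1
        for _ in range(repeats):
            timeline[clock] = stage
            clock += 1
    return timeline

def render(timeline, last_clock=8):
    """One text row per instruction; '.' marks clocks outside the pipeline."""
    return " ".join(f"{timeline.get(c, '.'):>2}" for c in range(1, last_clock + 1))

sub_row = schedule(issue_clock=1, stalls=0)
and_no_forwarding = schedule(issue_clock=2, stalls=2)
and_forwarding = schedule(issue_clock=2, stalls=0)

for row in (sub_row, and_no_forwarding, and_forwarding):
    print(render(row))
```

Without forwarding the dependent instruction reaches WB in clock 8; with forwarding it reaches WB in clock 6, matching the two diagrams.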

Data Dependencies: Hazard Conditions
A data hazard condition occurs whenever an instruction's source register needs a result that an earlier instruction has not yet written to its destination register.
Example:
  sub $2, $1, $3      (sub $rd, $rs, $rt)
  and $12, $2, $5     (and $rd, $rs, $rt)
Data hazard detection always compares a destination with a source.
Hazard Type   Destination         Source
1a.           EX/MEM.$rdest   =   ID/EX.$rs
1b.           EX/MEM.$rdest   =   ID/EX.$rt
2a.           MEM/WB.$rdest   =   ID/EX.$rs
2b.           MEM/WB.$rdest   =   ID/EX.$rt

Data Dependencies: Hazard Conditions
(All instructions use the format op $rd, $rs, $rt.)
1a Data Hazard: EX/MEM.$rd = ID/EX.$rs
  sub $2, $1, $3
  and $12, $2, $5
1b Data Hazard: EX/MEM.$rd = ID/EX.$rt
  sub $2, $1, $3
  and $12, $1, $2
2a Data Hazard: MEM/WB.$rd = ID/EX.$rs
  sub $2, $1, $3
  and $12, $1, $5
  or $13, $2, $1
2b Data Hazard: MEM/WB.$rd = ID/EX.$rt
  sub $2, $1, $3
  and $12, $1, $5
  or $13, $6, $2
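The four comparisons can be written out directly. This is a minimal sketch using only the equality tests shown on the slide; the Instr tuple and the function forwarding_hazards are hypothetical names introduced for illustration, and a real forwarding unit would apply further qualifications that are not modelled here.

```python
from collections import namedtuple

# Hypothetical record for an R-format instruction: op $rd, $rs, $rt
Instr = namedtuple("Instr", "op rd rs rt")

def forwarding_hazards(ex_mem, mem_wb, id_ex):
    """Return the hazard labels (1a, 1b, 2a, 2b) for the instruction in ID/EX.

    ex_mem and mem_wb are the instructions one and two ahead in the
    pipeline, or None if that stage holds nothing of interest.
    """
    hazards = []
    if ex_mem is not None:
        if ex_mem.rd == id_ex.rs:
            hazards.append("1a")
        if ex_mem.rd == id_ex.rt:
            hazards.append("1b")
    if mem_wb is not None:
        if mem_wb.rd == id_ex.rs:
            hazards.append("2a")
        if mem_wb.rd == id_ex.rt:
            hazards.append("2b")
    return hazards

sub = Instr("sub", rd=2, rs=1, rt=3)
and_independent = Instr("and", rd=12, rs=1, rt=5)

case_1a = forwarding_hazards(sub, None, Instr("and", 12, 2, 5))
case_1b = forwarding_hazards(sub, None, Instr("and", 12, 1, 2))
case_2a = forwarding_hazards(and_independent, sub, Instr("or", 13, 2, 1))
case_2b = forwarding_hazards(and_independent, sub, Instr("or", 13, 6, 2))
print(case_1a, case_1b, case_2a, case_2b)
```

Each call mirrors one of the four instruction sequences above, with the last instruction of the sequence sitting in ID/EX.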

Data Dependencies: Worst case
Data Hazard:
  sub $2, $1, $3      (sub $rd, $rs, $rt)
  and $12, $2, $2     (and $rd, $rs, $rt)
  or $13, $2, $2      (or $rd, $rs, $rt)
Data Hazard 1a: EX/MEM.$rd = ID/EX.$rs
Data Hazard 1b: EX/MEM.$rd = ID/EX.$rt
Data Hazard 2a: MEM/WB.$rd = ID/EX.$rs
Data Hazard 2b: MEM/WB.$rd = ID/EX.$rt

Data Dependencies: Hazard Conditions
Hazard Type   Source          Destination
1a.           ID/EX.$rs   =   EX/MEM.$rdest
1b.           ID/EX.$rt   =   EX/MEM.$rdest
2a.           ID/EX.$rs   =   MEM/WB.$rdest
2b.           ID/EX.$rt   =   MEM/WB.$rdest
Pipeline registers compared: ID/EX ($rs, $rt), EX/MEM ($rd), MEM/WB ($rd)

Figure 6.38

Data Hazards: Loads (Figure 6.44)
• Backwards in time: Cannot be resolved
• Forwards in time: Cannot be resolved
• At same time: Not a hazard

Data Hazards: load stalling (Figure 6.45)
[Figure: pipeline diagram with a stall inserted after the load]

Data Hazards: Hazard detection unit (page 490)
Stall Condition
  Source          Destination
  IF/ID.$rs   =   ID/EX.$rt
  IF/ID.$rt   =   ID/EX.$rt
  and ID/EX.MemRead = 1
Stall Example
  lw $2, 20($1)       (lw $rt, addr($rs))
  and $4, $2, $5      (and $rd, $rs, $rt)
No Stall Example: (only need to look at next instruction)
  lw $2, 20($1)       (lw $rt, addr($rs))
  and $4, $1, $5      (and $rd, $rs, $rt)
  or $8, $2, $6       (or $rd, $rs, $rt)
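The stall condition is a single boolean expression. The sketch below encodes it as stated on the slide; the function name must_stall and its parameter names are illustrative, and register numbers are passed as plain integers.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Stall when the instruction in ID/EX is a load (MemRead = 1) whose
    destination ($rt) matches either source of the instruction in IF/ID."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)

# Stall example: lw $2, 20($1) followed by and $4, $2, $5
stall_case = must_stall(id_ex_mem_read=1, id_ex_rt=2, if_id_rs=2, if_id_rt=5)

# No-stall example: lw $2, 20($1) followed by and $4, $1, $5
no_stall_next = must_stall(id_ex_mem_read=1, id_ex_rt=2, if_id_rs=1, if_id_rt=5)

# One clock later: and $4, $1, $5 is in ID/EX (MemRead = 0, $rt = 5)
# and or $8, $2, $6 is in IF/ID, so the load is no longer compared.
no_stall_later = must_stall(id_ex_mem_read=0, id_ex_rt=5, if_id_rs=2, if_id_rt=6)

print(stall_case, no_stall_next, no_stall_later)
```

The third call illustrates why only the next instruction needs checking: by the time the or reaches IF/ID, the load has already left ID/EX.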

Data Hazards: Hazard detection unit (page 490)
No Stall Example: (only need to look at next instruction)
  lw $2, 20($1)       (lw $rt, addr($rs))
  and $4, $1, $5      (and $rd, $rs, $rt)
  or $8, $2, $6       (or $rd, $rs, $rt)
Example
load: assume half of the loads are immediately followed by an instruction that uses the loaded value. What is the average number of clocks for the load?
load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5

Hazard Detection Unit: when to stall (Figure 6.46)

Data Dependency Units
Forwarding Condition
  Source                      Destination
  ID/EX.$rs or ID/EX.$rt  =   EX/MEM.$rd
  ID/EX.$rs or ID/EX.$rt  =   MEM/WB.$rd
Stall Condition
  Source                      Destination
  IF/ID.$rs or IF/ID.$rt  =   ID/EX.$rt
  and ID/EX.MemRead = 1

Data Dependency Units
Pipeline Registers: IF/ID, ID/EX ($rs, $rt, $rd), EX/MEM ($rd), MEM/WB ($rd)
Stalling Comparisons (IF/ID against ID/EX)
  Stall Condition
  Source                      Destination
  IF/ID.$rs or IF/ID.$rt  =   ID/EX.$rt
  and ID/EX.MemRead = 1
Forwarding Comparisons (ID/EX against EX/MEM.$rd and MEM/WB.$rd)

Branch Hazards: Soln #1, Stall until Decision made (fig. 6.4)
@3C:  add $4, $5, $6
@40:  beq $1, $3, 7
      Stall               Soln #1: Stall until Decision is made
@44:  and $12, $2, $5
@48:  or $13, $6, $2
@4C:  add $14, $2, $2
@50:  lw $4, 50($7)       Decision made in ID stage: do load

Branch Hazards: Soln #2, Predict until Decision made
Clock             1   2   3   4   5   6   7   8
beq $1, $3, 7     IF  ID  EX  M   WB
and $12, $2, $5       IF  ID  EX  M   WB      Predict false branch; discard "and $12, $2, $5" instruction
lw $4, 50($7)             IF  ID  EX  M   WB
Decision made in ID stage: discard & branch

Branch Hazards: Soln #3, Delayed Decision
Clock             1   2   3   4   5   6   7   8
beq $1, $3, 7     IF  ID  EX  M   WB
add $4, $5, $6        IF  ID  EX  M   WB      Instruction from before the branch, moved into the delay slot; no need to discard it
lw $4, 50($7)             IF  ID  EX  M   WB
Decision made in ID stage: branch

Branch Hazards: Soln #3, Delayed Decision
Clock             1   2   3   4   5   6   7   8
beq $1, $3, 7     IF  ID  EX  M   WB
and $12, $2, $5       IF  ID  EX  M   WB
lw $4, 50($7)             IF  ID  EX  M   WB
Decision made in ID stage: do branch

Branch Hazards: Decision made in the ID stage (figure 6.4)
Clock             1   2   3   4   5   6   7   8
beq $1, $3, 7     IF  ID  EX  M   WB
nop                   IF  ID  EX  M   WB      No decision yet: insert a nop
lw $4, 50($7)             IF  ID  EX  M   WB  Decision: do load

Branch Hazards: Soln #2, Predict until Decision made (Figure 6.50)
• Branch decision made in MEM stage: discard values when the prediction is wrong
• Predict false branch
• Same effect as 3 stalls
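The cost of a wrong prediction depends on the stage in which the branch is decided. As a rough sketch, assuming one instruction is fetched per clock and nothing is squashed until the decision is known, the number of wrong-path instructions equals the number of stages the branch has already passed. The function name wrong_path_instructions is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def wrong_path_instructions(decision_stage):
    """Instructions fetched behind a branch before its outcome is known.

    While the branch sits in stage k (counting IF as 0), k younger
    instructions have entered the pipeline and must be discarded on a
    misprediction.
    """
    return STAGES.index(decision_stage)

penalty_mem = wrong_path_instructions("MEM")
penalty_id = wrong_path_instructions("ID")
print(penalty_mem, penalty_id)
```

Deciding in MEM costs as much as 3 stalls, as stated above, while moving the decision to ID leaves a single instruction to discard.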

Figure 6.51: Early branch comparison
Flush: if the prediction is wrong, add nops

Performance
load: assume half of the loads are immediately followed by an instruction that uses the loaded value (i.e. data dependency)
  load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5
Jump: assume that jumps always pay 1 full clock cycle of delay (stall)
  jump instruction time = 2
Branch: assume the misprediction delay is 1 clock cycle and that 25% of the branches are mispredicted
  branch time = 75%*(1 clock) + 25%*(2 clocks) = 1.25

Performance, page 504
Instruction    Single-Cycle   Multi-Cycle Clocks   Pipeline Cycles   Instruction Mix
loads          1              5                    1.5               23%    (50% dependency)
stores         1              4                    1                 13%
arithmetic     1              4                    1                 43%
branches       1              3                    1.25              19%    (25% misprediction)
jumps          1              3                    2                 2%
Clock speed    125 MHz, 8 ns  500 MHz, 2 ns        500 MHz, 2 ns
CPI            1              4.02                 1.18                     = Cycles*Mix
MIPS           125 MIPS       124 MIPS             424 MIPS                 = Clock/CPI
Pipeline Cycles describes pipeline throughput; the per-instruction cycle count is also known as the instruction latency within a pipeline.
load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5
branch time = 75%*(1 clock) + 25%*(2 clocks) = 1.25
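The CPI row is the mix-weighted sum of the per-class cycle counts. The sketch below recomputes it from the table. One assumption is flagged in the code: the arithmetic row is taken as 4 multi-cycle clocks and 1 pipeline cycle, the values that reproduce the stated CPIs of 4.02 and 1.18. The function name weighted_cpi is illustrative.

```python
MIX = {"loads": 0.23, "stores": 0.13, "arithmetic": 0.43,
       "branches": 0.19, "jumps": 0.02}

# Cycles per instruction class. The arithmetic entries (4 and 1) are an
# assumption chosen so that the totals match the slide's CPI values.
MULTI_CYCLE = {"loads": 5, "stores": 4, "arithmetic": 4,
               "branches": 3, "jumps": 3}
PIPELINE = {"loads": 1.5, "stores": 1, "arithmetic": 1,
            "branches": 1.25, "jumps": 2}

def weighted_cpi(cycles, mix):
    """CPI = sum over instruction classes of cycles * mix fraction."""
    return sum(cycles[name] * mix[name] for name in mix)

cpi_multi = weighted_cpi(MULTI_CYCLE, MIX)
cpi_pipe = weighted_cpi(PIPELINE, MIX)

print(f"multi-cycle CPI = {cpi_multi:.2f}")
print(f"pipeline CPI    = {cpi_pipe:.4f}")
# The slide divides by the CPI rounded to 1.18, which gives 424 MIPS.
print(f"pipeline MIPS   = {500 / 1.18:.0f}")
```

Using the unrounded pipeline CPI gives a slightly lower MIPS figure (about 423), so the 424 MIPS in the table reflects rounding the CPI to two decimals first.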