Abstraction Question
• General-purpose processors have an abstraction layer fixed at the ISA, and their designers have little control over the compilers or code run on the machine.
• Embedded processor designers tend to have complete control over the code run, the compiler (if any), the ISA, and the hardware, so breaking the abstraction layer makes sense.
• Bottom line: it comes down to who controls which elements of the design.

Intel side note
We’ve been working with MIPS. How did Intel do this for a CISC x86 ISA? Micro-ops, from the Pentium Pro on.

Read 5.1, 5.2

Branch Hazards or “Which way did he go?”

Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Control Dependence
• Just as an instruction will be dependent on other instructions to provide its operands (data dependence), it will also be dependent on other instructions to determine whether it gets executed or not (branch dependence or control dependence).
• Control dependences are particularly critical with conditional branches.

    add $5, $3, $2
    sub $6, $5, $2
    beq $6, $7, somewhere
    and $9, $6, $1
    ...
somewhere:
    or  $10, $5, $2
    add $12, $11, $9
    ...

Dealing With Branch Hazards
• Hardware
  – stall until you know which direction
  – reduce hazard through earlier computation of branch direction
  – guess which direction
    § assume not taken (easiest)
    § more educated guess based on history (requires that you know it is a branch before it is even decoded!)
• Hardware/Software
  – no-ops, or instructions that get executed either way (delayed branch)

Stalling the pipeline
Given our current pipeline, let’s assume we stall until we know the branch outcome. How many cycles will you lose per branch?

Selection   Cycles
A           0
B           1
C           2
D           3
E           4

Stalling for Branch Hazards
[Pipeline diagram, CC 1 through CC 8: beq $4, $0, there is followed by and $12, $5, then or, add, and sw. Bubble entries mark the stalled slots behind the beq.]

Stalling for Branch Hazards
• Seems wasteful, particularly when the branch isn’t taken.
• Makes all branches cost 4 cycles. Why?
• What if we just assume branches aren’t taken?

Assume Branch Not Taken
• works pretty well when you’re right
[Pipeline diagram, CC 1 through CC 8: beq $4, $0, there, then and $12, $5, or, add, and sw flow through IM, Reg, DM, Reg without bubbles.]

Assume Branch Not Taken
• same performance as stalling when you’re wrong
[Pipeline diagram, CC 1 through CC 8: beq $4, $0, there is followed by and $12, $5, or, and add. These wrong-path instructions are flushed, and fetch continues at there: sub $12, $4, $2.]

Stalling the pipeline
Let’s improve the pipeline so we move branch resolution to Decode. How many cycles would we lose then on a taken branch?

Selection   Cycles
A           0
B           1
C           2
D           3
E           4

(Note: add drawing of resolving in decode.)

The Pipeline with flushing for taken branches
• Notice the IF/ID flush line added.

Branch Hazards – Assume Not Taken
• Great if most of your branches aren’t taken.
• What about loops which are taken 95% of the time?
  – we would like the option of assuming not taken for some branches, and taken for others, depending on ???

Branch Hazards – Predicting Taken
[Pipeline diagram, CC 1 through CC 8: beq $2, $1, here followed by here: lw.]

Reading quiz
Required information to predict Taken:
1. An instruction is a branch before decode
2. The target of the branch
3. The outcome of the branch

Selection   Required knowledge
A           2, 3
B           1, 2, 3
C           1, 2
D           2
E           None of the above

Branch Target Buffer
• Keeps track of the PCs of recently seen branches and their targets.
• Consult during Fetch (in parallel with Instruction Memory read) to determine:
  – Is this a branch?
  – If so, what is the target?
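The lookup described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any real processor: the class name, the direct-mapped organization, the 16-entry size, and the example addresses are all hypothetical choices made for illustration.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: maps a branch PC to its last seen target."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        # Each entry is either None (empty) or a (tag_pc, target_pc) pair.
        self.entries = [None] * num_entries

    def _index(self, pc):
        # Assumes 4-byte aligned instructions, so the low two bits are dropped.
        return (pc >> 2) % self.num_entries

    def lookup(self, pc):
        """Return the stored target if pc is a known branch, else None."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target):
        """Record a branch and its target once they are known."""
        self.entries[self._index(pc)] = (pc, target)


btb = BranchTargetBuffer()
first_lookup = btb.lookup(0x400010)    # branch not seen yet
btb.update(0x400010, 0x400000)
second_lookup = btb.lookup(0x400010)   # now answers both questions at fetch time
```

A miss (`None`) means "not known to be a branch", so fetch simply continues at PC + 4. Because the table is small, two branches can map to the same entry and evict each other, which is why the full PC is kept as a tag.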

Branch Hazards – Predict Taken
• Static policy:
  – Forward branches (if statements) predict not taken
  – Backward branches (loops) predict taken
• Dynamic prediction (coming soon)
• First: Branch Delay Slots
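The static policy above amounts to comparing the branch target with the branch's own address. A minimal sketch, assuming both addresses are available as integers (the function name and example addresses are hypothetical):

```python
def predict_taken_static(branch_pc, target_pc):
    """Static policy: backward branches predict taken, forward predict not taken."""
    # A target at or before the branch itself means the branch jumps backward,
    # which is the typical shape of a loop.
    return target_pc <= branch_pc


loop_prediction = predict_taken_static(branch_pc=0x40, target_pc=0x20)   # backward
if_prediction = predict_taken_static(branch_pc=0x40, target_pc=0x80)     # forward
```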

Eliminating the Branch Stall
• There’s no rule that says we have to see the effect of the branch immediately. Why not wait an extra instruction before branching?
• The original SPARC and MIPS processors each used a single branch delay slot to eliminate single-cycle stalls after branches.
• The instruction after a conditional branch is always executed in those machines, regardless of whether the branch is taken or not!

Branch Delay Slot
[Pipeline diagram, CC 1 through CC 8: beq $4, $0, there, then and $12, $5 in the delay slot, then there: or, add, sw.]
The branch delay slot instruction (the next instruction after a branch) is executed even if the branch is taken.

Filling the branch delay slot

#   Instruction                   Safe for the delay slot?
1   add $5, $3, $7                No ($7, WAR)
2   add $9, $1, $3                Safe, $1 and $3 are fine
3   sub $6, $1, $4                No ($6)
4   and $7, $8, $2                No ($7)
5   beq $6, $7, there
    nop  /* branch delay slot */
6   add $9, $1, $2                Not safe ($9 on not-taken path)
7   sub $2, $9, $5                Not safe (needs $9 not yet produced)
    ...
there:
8   mult $2, $10, $11             Not safe ($2 is used before overwritten)
    …

Selection   Safe instructions
A           1, 2
B           2, 6
C           6, 8
D           1, 2, 7, 8
E           None of the above

E is the correct answer.
* It is not safe to assume anything about the … code

Filling the branch delay slot
• The branch delay slot is only useful if you can find something to put there.
• If you can’t find anything, you must put a no-op to ensure correctness.

Branch Delay Slots
• This works great for this implementation of the architecture, but becomes a permanent part of the ISA.
• What about the MIPS R10000, which has a 5-cycle branch penalty and executes 4 instructions per cycle?
• Bottom line: it exposed a detail of the hardware implementation to the ISA.

Dynamic Branch Prediction
• Can we guess the outcome of branches?
• What should we base that guess on?

Branch Prediction
[Figure: the program counter indexes a table of 1-bit entries (1, 0, 1).]

for (i=0; i<10; i++) { ... }
    ...
    add $i, #1
    beq $i, #10, loop

Accuracy? Too easily swayed.

Two-bit predictors give better loop prediction
[Figure: the branch address indexes the PHT, a table of 2-bit counters.]

  11  Strongly Taken
  10  Weakly Taken
  01  Weakly Not Taken
  00  Strongly Not Taken

• Increment when taken
• Decrement when not taken

for (i=0; i<10; i++) { ... }
    ...
    add $i, #1
    beq $i, #10, loop

Better (less sway), but slower learning.

Suppose we have the following branch patterns for 3 branches (A, B, C). What is the accuracy of a 1-bit and a 2-bit Branch History Table? Assume initial values of 1 (1-bit) and 10 (2-bit).

2-bit:  A. T T N    B. T T N T N    C. N T N
1-bit:  A. T T N    B. T T N

  11  Strongly Taken
  10  Weakly Taken
  01  Weakly Not Taken
  00  Strongly Not Taken

• Increment when taken
• Decrement when not taken
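To see why the 2-bit counter is less easily swayed, the two predictors can be simulated side by side. The sketch below is illustrative only: the function names are made up, and the loop pattern (taken 9 times, then not taken once, repeated 3 times) is a hypothetical example rather than one of the quiz patterns above. The initial states follow the stated assumptions: 1 for the 1-bit predictor and 10 (weakly taken) for the 2-bit predictor.

```python
def one_bit_accuracy(outcomes, state=1):
    """Fraction predicted correctly by a 1-bit predictor (1 means predict taken)."""
    correct = 0
    for taken in outcomes:
        correct += (state == 1) == taken
        state = 1 if taken else 0          # remember only the last outcome
    return correct / len(outcomes)


def two_bit_accuracy(outcomes, state=2):
    """Fraction predicted correctly by a 2-bit saturating counter.

    States 0..3 stand for 00..11; states 2 and 3 predict taken.
    """
    correct = 0
    for taken in outcomes:
        correct += (state >= 2) == taken
        # Increment when taken, decrement when not taken, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


# Hypothetical loop branch: taken 9 times, not taken at loop exit, run 3 times.
loop_pattern = ([True] * 9 + [False]) * 3
acc_1bit = one_bit_accuracy(loop_pattern)   # 25 of 30 correct
acc_2bit = two_bit_accuracy(loop_pattern)   # 27 of 30 correct
```

On this pattern the 1-bit predictor mispredicts twice per loop execution after the first (once at the exit and once on re-entry), while the 2-bit counter mispredicts only at the exit.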

Modern Branch Prediction - Pentium 4
• Performance dependent on accurate branch prediction
• 20-stage pipeline
  – 3-way issue
  – 60 instructions in flight (12 branches)
  – 17th stage is branch resolution
  – ~17*3 = 51 instructions lost on mispredict
[Figure: pipeline stages 1 through 20, with an X marker.]

Branch Prediction
• Latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and even AI techniques.
• Use patterns of branches (local history) and recent other branch history (global history) to make predictions.

Putting it all together.
For a given program on our 5-stage MIPS pipeline processor:
• 20% of instructions are loads, and 50% of the instructions following a load are arithmetic instructions depending on the load.
• 20% of instructions are branches.
• Using dynamic branch prediction, we achieve 80% prediction accuracy.
What is the CPI of your program?

Selection   CPI
A           0.76
B           0.9
C           1.0
D           1.1
E           1.14
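One way to organize this kind of calculation is base CPI plus data stalls plus control stalls. The sketch below is a worked example under assumptions the slide does not state explicitly: a base CPI of 1, a 1-cycle load-use stall (forwarding in place), and a 1-cycle misprediction penalty (branches resolved in decode). The function and parameter names are hypothetical.

```python
def pipeline_cpi(load_frac, load_use_frac, branch_frac, accuracy,
                 load_use_stall=1, mispredict_penalty=1, base_cpi=1.0):
    """CPI = base + load-use stall cycles + branch misprediction cycles."""
    data_stalls = load_frac * load_use_frac * load_use_stall
    control_stalls = branch_frac * (1 - accuracy) * mispredict_penalty
    return base_cpi + data_stalls + control_stalls


# 20% loads, 50% of them followed by a dependent instruction,
# 20% branches, 80% prediction accuracy.
cpi = pipeline_cpi(0.20, 0.50, 0.20, 0.80)
```

Under these assumptions the terms are 1 + 0.10 + 0.04, so `cpi` comes out at about 1.14. A pipeline that resolves branches later would use a larger `mispredict_penalty` and give a higher CPI.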

Control Hazards -- Key Points
• Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching.
• Control hazards are detected in hardware.
• We can reduce the impact of control hazards through:
  – early detection of branch address and condition
  – branch prediction
  – branch delay slots

Given our 5-stage MIPS pipeline, what is the steady-state CPI for the following code? Assume the branch is taken thousands of times. Recall: a processor is in steady state when all stages are active.

Steady-State CPI = (#insts + #stalls + #flushed_insts) / #insts

Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop

Selection   CPI
A           1
B           1.25
C           1.5
D           1.75
E           None of the above
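The steady-state formula is a per-iteration ratio: cycles spent on useful instructions, stalls, and flushed instructions, divided by the useful instructions. A minimal sketch follows; the function name is made up, and the counts in the example are hypothetical and deliberately not taken from the loop above.

```python
def steady_state_cpi(num_insts, num_stalls, num_flushed):
    """Steady-state CPI = (#insts + #stalls + #flushed_insts) / #insts."""
    return (num_insts + num_stalls + num_flushed) / num_insts


# Hypothetical loop body: 5 instructions, 1 stall cycle and
# 1 flushed instruction per iteration.
example_cpi = steady_state_cpi(num_insts=5, num_stalls=1, num_flushed=1)
```

For the quiz, the work is in counting the stalls and flushed instructions per iteration from a pipeline diagram; the formula itself is just this division.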

(Isomorphic question)
Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor:

  IF = 200 ps
  ID = 100 ps
  EX = 100 ps
  M  = 200 ps
  WB = 100 ps

Consider splitting IF and M into 2 stages each (so IF1 IF2 and M1 M2). The most important code run by the company is (assume the branch is taken most of the time):

Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop

What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.

Selection   CPI        CT
A           Increase   Increase
B           Increase   Decrease
C           Decrease   Increase
D           Decrease   Decrease
E           Increase   No Change

Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor:

  IF = 200 ps
  ID = 100 ps
  EX = 200 ps
  M  = 200 ps
  WB = 100 ps

Consider splitting IF and M into 2 stages each (so IF1 IF2 and M1 M2). The most important code run by the company is (assume the branch is taken most of the time):

Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop

What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.

Selection   CPI        CT
A           Increase   Increase
B           Increase   Decrease
C           Decrease   Increase
D           Decrease   Decrease
E           Increase   No Change
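The cycle-time half of these two questions can be checked mechanically, since the clock period is set by the slowest stage. The sketch below assumes, beyond what the slides state, that each split divides a stage's latency evenly and that pipeline register overhead is ignored; the helper names are hypothetical. It says nothing about CPI, which still has to be worked out from the hazards.

```python
def cycle_time(stages):
    """Cycle time is the latency of the slowest pipeline stage."""
    return max(stages.values())


def split_stage(stages, name, parts=2):
    """Return a new pipeline with one stage split evenly into `parts` stages."""
    result = {}
    for stage, latency in stages.items():
        if stage == name:
            for i in range(1, parts + 1):
                result[f"{stage}{i}"] = latency / parts
        else:
            result[stage] = latency
    return result


# Stage latencies in picoseconds for the two variants of the question.
first = {"IF": 200, "ID": 100, "EX": 100, "M": 200, "WB": 100}
second = {"IF": 200, "ID": 100, "EX": 200, "M": 200, "WB": 100}

first_7 = split_stage(split_stage(first, "IF"), "M")
second_7 = split_stage(split_stage(second, "IF"), "M")

ct_first = (cycle_time(first), cycle_time(first_7))      # 200 ps before, 100 ps after
ct_second = (cycle_time(second), cycle_time(second_7))   # 200 ps before, 200 ps after
```

In the second variant the unsplit 200 ps EX stage still sets the clock, so deepening the pipeline buys no cycle-time improvement there.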

7-stage Pipeline

Loop: lw  r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop

For diagramming the code in the last slide. Drive home: more stages = more hazards.

Pipelining -- Key Points
• Pipelining focuses on improving instruction throughput, not individual instruction latency.
• Data hazards can be handled by hardware or software, but most modern processors have hardware support for stalling and forwarding.
• Control hazards can be handled by hardware or software, but most modern processors use Branch Target Buffers and advanced dynamic branch prediction to reduce the hazard.
• ET = IC*CPI*CT
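The last bullet ties the section together: a deeper pipeline may lower CT while raising CPI, and only the product matters. A minimal sketch of ET = IC*CPI*CT, with made-up numbers chosen purely for illustration (the function name and all three inputs are hypothetical):

```python
def execution_time_ns(instruction_count, cpi, cycle_time_ps):
    """ET = IC * CPI * CT, with CT given in picoseconds and ET in nanoseconds."""
    return instruction_count * cpi * cycle_time_ps / 1000.0


# Hypothetical program: one million instructions, CPI of 1.14, 200 ps clock.
et = execution_time_ns(1_000_000, 1.14, 200)   # about 228,000 ns
```

Comparing two designs then means evaluating this product for each, since IC is the same when the ISA and the program do not change.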