Branch Prediction Static, Dynamic Branch prediction techniques 10/14 branch. 1

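Control Flow Penalty: Why Branch Prediction?
• Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution
• Work lost if the pipeline makes a wrong prediction ~ loop length x pipeline width (loop length = number of stages between next-PC calculation and branch resolution)
[Figure: pipeline PC, I-cache, Fetch Buffer, Decode, Issue Buffer, Func. Units, Result Buffer, Commit, Arch. State, grouped into Fetch, Decode, Execute and Commit; the next fetch is started at the PC stage and the branch is executed at the functional units]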
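The "loop length x pipeline width" estimate can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names, the 12-stage loop, the 4-wide pipeline, and the CPI parameters are all assumed values, not figures from the slides, and the CPI formula is the usual first-order approximation.

```python
def wasted_slots(loop_length: int, pipeline_width: int) -> int:
    """Instruction slots discarded by one misprediction (first-order estimate)."""
    return loop_length * pipeline_width


def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    """Base CPI plus the average misprediction stall cycles per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical machine: 12 stages between next-PC calculation and branch
# resolution, 4 instructions fetched per cycle.
print(wasted_slots(12, 4))                    # 48 slots lost per misprediction
# Hypothetical workload: 20% branches, 5% of them mispredicted.
print(effective_cpi(0.25, 0.20, 0.05, 12))    # about 0.37
```

Under these assumed numbers, a 5% misprediction rate raises CPI from 0.25 to about 0.37, which is why even small accuracy gains matter in deep, wide pipelines.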

Branch Penalties in a Superscalar Are Extensive

Reducing Control Flow Penalty
Software solutions
• Minimize branches - loop unrolling: increases the run length
Hardware solutions
• Find something else to do - delay slots
• Speculate – dynamic branch prediction: speculative execution of instructions beyond the branch

Branch Prediction
Motivation:
• Branch penalties limit performance of deeply pipelined processors
• Much worse for superscalar processors
• Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
Dynamic prediction HW:
• Branch history tables, branch target buffers, etc.
Mispredict recovery mechanisms:
• Keep computation result separate from commit
• Kill instructions following branch
• Restore state to state following branch

Static Branch Prediction (review)
• Overall probability a branch is taken is ~60-70%, but it depends on direction: backward branches (JZ) are taken ~90% of the time, forward branches (JZ) ~50%
• ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
• ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate
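The backward/forward asymmetry suggests a simple static rule: predict backward branches taken and forward branches not taken. That heuristic is not named on the slide, so treat the sketch below as an assumed illustration; `predict_static`, the addresses, and the loop trace are all hypothetical.

```python
def predict_static(branch_pc: int, target_pc: int) -> bool:
    """Backward-taken / forward-not-taken: predict taken iff the target is earlier."""
    return target_pc < branch_pc


# Hypothetical loop-closing branch at 0x120 jumping back to 0x100:
# taken 9 times, then not taken once on loop exit.
outcomes = [True] * 9 + [False]
prediction = predict_static(0x120, 0x100)
correct = sum(1 for taken in outcomes if taken == prediction)
print(prediction, correct / len(outcomes))    # True 0.9
```

The prediction never changes at run time, so the loop exit is always mispredicted; that fixed cost is what the dynamic schemes try to reduce.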

Branch Prediction Needs
• Target address generation
– Get register: PC, link reg, GP reg.
– Calculate: +/- offset, auto inc/dec
– Target speculation
• Condition resolution
– Get register: condition code reg, count reg., other reg.
– Compare registers
– Condition speculation

Target address generation takes time

Condition resolution takes time

Solution: Branch speculation

Branch Prediction Schemes
1. 2-bit Branch-Prediction Buffer
2. Branch Target Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. Integrated Instruction Fetch Units
6. Return Address Predictors (for subroutines, Pentium, Core Duo)
7. Predicated Execution (Itanium)

Dynamic Branch Prediction: learning based on past behavior
• Incoming stream of addresses
• Fast outgoing stream of predictions
• Correction information returned from pipeline
[Figure: Branch Predictor with History Information; inputs are Incoming Branches { Address } and Corrections { Address, Value }; output is Prediction { Address, Value }]

Branch History Table (BHT): Table of predictors
• Each branch given its own predictor
• BHT is table of "Predictors"
– Could be 1-bit or more
– Indexed by PC address of branch
• Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
– End of loop case: when it exits loop
– First iteration the next time the loop is entered: it predicts exit instead of looping
• Most schemes use at least 2-bit predictors
• Performance = ƒ(accuracy, cost of misprediction)
– Misprediction: flush reorder buffer
• In fetch stage of branch:
– Use predictor to make prediction
• When branch completes:
– Update corresponding predictor
[Figure: Branch PC indexes a table of entries Predictor 0, Predictor 1, ..., Predictor 7]
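The two-mispredictions-per-loop problem is easy to reproduce. The following is a minimal sketch of a single 1-bit predictor entry, assuming the loop branch is taken 9 times and then falls through, and that the entry starts out predicting taken; `simulate_one_bit` and the trace are illustrative names, not part of the slides.

```python
def simulate_one_bit(outcomes, initial_prediction=True):
    """Count mispredictions of a 1-bit predictor that remembers the last outcome."""
    prediction = initial_prediction
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken          # 1 bit of state: the most recent outcome
    return mispredictions


one_pass = [True] * 9 + [False]     # 9 taken iterations, then loop exit
trace = one_pass * 3                # the loop is entered three times
print(simulate_one_bit(trace))      # 5
```

The first pass costs one misprediction (the exit); every later pass costs two (the re-entry and the exit), which gives 1 + 2 + 2 = 5 here.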

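Branch History Table Organization
• Target PC calculation takes time
• 4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
[Figure: k bits of the Fetch PC (above the two low-order 00 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, accessed in parallel with the I-cache; the fetched instruction's opcode answers "Branch?", its offset is added to the PC to give the Target PC, and the BHT entry supplies Taken/¬Taken]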

2-bit Dynamic Branch Prediction: more accurate than 1-bit
• Better solution: 2-bit scheme where the prediction changes only after two mispredictions in a row
• Adds hysteresis to decision making process
[Figure: four-state diagram with two Predict Taken states (green: go, taken) and two Predict Not Taken states (red: stop, not taken); transitions labeled T (taken) and NT (not taken)]
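One common way to realize the 2-bit scheme is a saturating counter per BHT entry. The sketch below assumes that encoding (the slide's state diagram may use slightly different transitions), a 4K-entry table indexed by the PC bits above the two low-order zeros, and counters initialized to "weakly taken"; the class and function names are hypothetical.

```python
class TwoBitBHT:
    """BHT of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, index_bits=12):              # 2**12 = 4K entries (assumed)
        self.mask = (1 << index_bits) - 1
        self.counters = [2] * (1 << index_bits)     # start weakly taken (assumed)

    def _index(self, pc):
        return (pc >> 2) & self.mask                # drop the two low-order 00 bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, pc, outcomes):
    """Predict at fetch, update when the branch completes, count the misses."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


trace = ([True] * 9 + [False]) * 3                  # same loop, entered three times
print(count_mispredictions(TwoBitBHT(), 0x400, trace))   # 3
```

On this trace only the loop exit is mispredicted on each pass: a single not-taken outcome moves the counter from 3 to 2, which still predicts taken on re-entry.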

BTB: Branch Address at Same Time as Prediction
• Branch Target Buffer (BTB): address of branch is used as index to get prediction AND branch target address (if taken)
• The PC of the instruction in FETCH is compared (=?) with the branch PCs stored in the BTB; each entry holds a branch PC, a predicted PC, and prediction state bits
– Yes: instruction is a branch; use predicted PC as next PC
– No: branch not predicted; proceed normally (Next PC = PC+4)
• Only predicted-taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded
• Later: check prediction; if wrong, kill instruction and update the branch prediction bits (BPb)
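The lookup-then-verify flow can be sketched in a few lines. This is a simplified model under stated assumptions: a Python dict stands in for the tag-matched hardware table (so there is no capacity limit or aliasing), instructions are 4 bytes, and entries are kept only for branches that were last taken; `BranchTargetBuffer`, `next_pc`, and `resolve` are hypothetical names.

```python
class BranchTargetBuffer:
    """Maps branch PC -> predicted target; a miss means fall through to PC+4."""

    def __init__(self):
        self.entries = {}                       # stand-in for the tagged table

    def next_pc(self, pc):
        # Hit: use the predicted PC. Miss: proceed normally (PC + 4).
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Called when the branch resolves; returns True if fetch went the right way."""
        predicted = self.next_pc(pc)
        actual = target if taken else pc + 4
        if taken:
            self.entries[pc] = target           # hold only predicted-taken branches
        else:
            self.entries.pop(pc, None)
        return predicted == actual


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))                  # 0x104 (miss, fall through)
print(btb.resolve(0x100, True, 0x80))           # False (first encounter mispredicts)
print(hex(btb.next_pc(0x100)))                  # 0x80 (hit on the next fetch)
```

Note that `next_pc` needs only the PC, which mirrors the point that the next PC is chosen before the instruction is decoded.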

BTB contains only Branch & Jump Instructions
• BTB contains information for branch and jump instructions only; it is not updated for other instructions
• For all other instructions the next PC is PC+4
• Achieved without decoding the instruction

Combining BTB and BHT
• BTB entries considerably more expensive than BHT, but fetch is redirected earlier in the pipeline and indirect branches (JR) can be accelerated
• BHT can hold many more entries, so it is more accurate
• BHT in later pipeline stage corrects when BTB misses a predicted-taken branch
• BTB/BHT only updated after branch resolves in E stage
Pipeline stages:
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional Units
R  Register File Read
E  Integer Execute

Subroutine Return Stack
• Small stack to accelerate subroutine returns
• More accurate than BTBs
• Push return address when function call executed
• Pop return address when subroutine return decoded
• k entries (typically k=8-16)
[Figure: stack holding &nextc, &nextb, &nexta]
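The push-on-call, pop-on-return behavior can be sketched with a bounded stack. The model below assumes 4-byte instructions (so the return address is the call PC + 4) and that the oldest entry is silently dropped when more than k calls are outstanding; real designs differ in how they handle overflow, and the names are hypothetical.

```python
from collections import deque


class ReturnAddressStack:
    """k-entry stack of predicted return addresses."""

    def __init__(self, k=8):
        self.stack = deque(maxlen=k)            # when full, append drops the oldest

    def on_call(self, call_pc):
        self.stack.append(call_pc + 4)          # push the address after the call

    def on_return(self):
        # Pop the predicted return address; None means no prediction available.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(k=8)
for call_pc in (0x100, 0x200, 0x300):           # nested calls a -> b -> c
    ras.on_call(call_pc)
print([hex(ras.on_return()) for _ in range(3)]) # ['0x304', '0x204', '0x104']
```

Because returns unwind in the reverse order of calls, the stack predicts each return target exactly as long as the nesting depth stays within k, which a BTB indexed by the return instruction's PC cannot do when a function is called from several sites.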

Mispredict Recovery
In-order execution machines:
– Instructions issued after branch cannot write back before branch resolves
– All instructions in pipeline behind mispredicted branch are killed

Predicated Execution
• Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP
– If false, then neither store result nor cause exception
– Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
– IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction
– This transformation is called "if-conversion"
• Drawbacks to conditional instructions
– Still takes a clock even if "annulled"
– Stall if condition evaluated late
– Complex conditions reduce effectiveness; condition becomes known late in pipeline
[Figure: predicate x guarding A = B op C]
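If-conversion can be illustrated at the source level, with the caveat that real predication happens in the ISA and hardware, not in a high-level language. The sketch below assumes `op` is integer addition and mimics a conditional move: the result is always computed, and the predicate only selects whether it is kept. Both function names are hypothetical.

```python
def with_branch(x, a, b, c):
    """Original form: a control-flow branch guards the assignment."""
    if x:
        a = b + c
    return a


def if_converted(x, a, b, c):
    """Branch-free form: always compute, then select with the predicate."""
    result = b + c                  # executes (and takes a cycle) even if x is false
    p = 1 if x else 0               # 1-bit predicate
    return p * result + (1 - p) * a


for x in (True, False):
    print(with_branch(x, 10, 2, 3), if_converted(x, 10, 2, 3))
```

The second form has no branch to predict, but it also shows the first drawback in the list: the operation is performed whether or not its result is used.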

Accuracy v. Size (SPEC89)

Dynamic Branch Prediction Summary
• Prediction becoming important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches correlated with next branch
• Tournament Predictor: give more resources to competing predictors and pick between them
• Branch Target Buffer: include branch address & prediction
• Predicated Execution can reduce number of branches, number of mispredicted branches
• Return address stack for prediction of indirect jump