CSL 718: Pipelined Processors
Improving Branch Performance (contd.)
21st Jan, 2006
Anshul Kumar, CSE IITD
Improving Branch Performance
• Branch Elimination – replace branch with other instructions
• Branch Speed Up – reduce time for computing CC and TIF
• Branch Prediction – guess the outcome and proceed, undo if necessary
• Branch Target Capture – make use of history
Branch Elimination
Use conditional/guarded instructions (predicated execution)
[Figure: flowchart in which condition C (T/F) decides whether statement S is executed; guarded form written as C : S]
With a branch:
    OP1
    BC CC = Z, +2
    ADD R3, R2, R1
    OP2
With a guarded instruction:
    OP1
    ADD R3, R2, R1, NZ
    OP2
Examples:
• HP PA (all integer arithmetic/logical instructions)
• DEC Alpha, SPARC V9 (conditional move)
Branch Elimination - contd.
[Figure: pipeline timing diagram comparing the sequence OP1, BC, ADD/OP2 with the sequence OP1, ADD (cond); stages shown are IF, D, AG, DF, EX and TIF, with the point where CC is set marked]
Branch Speed Up: early target address generation
• Assume each instruction is a branch
• Generate target address while decoding
• If target is in the same page, omit translation
• After decoding, discard target address if not a branch
[Figure: pipeline timing for BC with stages IF, D, AG and TIF]
Branch Speed Up: increase CC - branch gap
Increase the gap between the instruction which sets CC and branching
• Early CC setting
• Delayed branch
Summary - Branch Speed Up

        early CC setting                 delayed branch
n       uncond   cond (T)   cond (I)     uncond   cond (T)   cond (I)
n=0     4        6          5            4        6          5
n=1     4        5          4            3        5          4
n=2     4        4          3            2        4          3
n=3     4        4          2            1        3          2
n=4     4        4          1            0        2          1
n=5     4        4          0            0        1          0
Delayed Branch with Nullification
(Also called annulment)
• Delay slot is used optionally
• Branch instruction specifies the option
• Option may be exercised based on correctness of branch prediction
• Helps in better utilization of delay slots
Branch Prediction
• Treat conditional branches as unconditional branches / NOP
• Undo if necessary
Strategies:
– Fixed (always guess inline)
– Static (guess on the basis of instruction type / displacement)
– Dynamic (guess based on recent history)
Static Branch Prediction
[Table: per-category breakdown not recovered; Total 68.2%]
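Threshold for Static prediction
[Figure: pipeline timing for instructions I-1 and I with stages IF, D, AG, DF, EX and TIF, marking where CC is set]
Delay in cycles (T: target, I: inline):
            actual T   actual I
guess T     4          5
guess I     6          0
With p the probability that the branch actually goes to target, guess target if
4p + 5(1 - p) < 6p + 0(1 - p), i.e. p > 5/7 ≈ 0.71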
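The threshold above can be checked numerically. The sketch below is only an illustration: the names `DELAY`, `expected_delay` and `threshold` are made up here, and the delay values are the four entries of the slide's guess/actual table.

```python
from fractions import Fraction

# Delay in cycles for each (guess, actual) pair, taken from the slide's table.
# "T" = target, "I" = inline. The dictionary name is an assumption.
DELAY = {
    ("T", "T"): 4, ("T", "I"): 5,
    ("I", "T"): 6, ("I", "I"): 0,
}

def expected_delay(guess, p):
    """Expected delay when always guessing `guess`, with P(actual = T) = p."""
    return p * DELAY[(guess, "T")] + (1 - p) * DELAY[(guess, "I")]

def threshold():
    """Value of p at which guessing target and guessing inline cost the same."""
    a, b = DELAY[("T", "T")], DELAY[("T", "I")]
    c, d = DELAY[("I", "T")], DELAY[("I", "I")]
    # Solve a*p + b*(1-p) = c*p + d*(1-p) for p.
    return Fraction(b - d, (b - d) + (c - a))

if __name__ == "__main__":
    print("break-even p =", threshold())
    for p in (0.6, 0.8):
        print(p, expected_delay("T", p), expected_delay("I", p))
```

Using exact fractions keeps the break-even point as 5/7 rather than a rounded decimal.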
Dynamic Branch Prediction: basic idea
Predict based on the history of the previous branch
loop: xxx
      xxx
      BC loop
2 mispredictions for every execution of the loop (one at loop exit, one on the next entry)
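The two mispredictions per loop can be reproduced with a small simulation. This is a sketch under the assumption that the predictor simply repeats the last outcome of the branch; the function names and the `initial` parameter are hypothetical.

```python
def loop_branch(iterations, executions):
    """Outcomes of the closing BC: taken on every iteration except the last."""
    return ([True] * (iterations - 1) + [False]) * executions

def mispredictions_last_outcome(outcomes, initial=True):
    """Count mispredictions when each guess repeats the previous outcome."""
    last = initial          # assumed initial guess: taken
    misses = 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken        # remember only the most recent outcome
    return misses

if __name__ == "__main__":
    outcomes = loop_branch(iterations=10, executions=5)
    print(mispredictions_last_outcome(outcomes))
```

In steady state each execution of the loop costs one miss at the exit and one more when the loop is entered again; only the very first entry depends on the assumed initial guess.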
Dynamic Branch Prediction: 2 bit prediction scheme
[Figure: state diagram with four states; states 0 and 1 predict taken, states 2 and 3 predict not taken; transitions are labelled T (taken) and N (not taken)]
Dynamic Branch Prediction: second scheme
Predict based on the history of the previous n branches, e.g., if n = 3 then
3 branches taken: predict taken
2 branches taken: predict taken
1 branch taken: predict not taken
0 branches taken: predict not taken
Dynamic Branch Prediction: Bimodal predictor
Maintain saturating counters
[Figure: state diagram of a saturating counter with states 0 to 3 and transitions on T (taken) and N (not taken)]
• One counter per branch, or
• One counter per cache line (merge results if multiple branches)
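A saturating-counter predictor can be sketched as follows. The counting direction is an assumption, since the slide's figure does not survive: here a taken branch increments the counter, a not-taken branch decrements it, and values 2 and 3 predict taken. The class and function names are hypothetical, and one counter per branch address is used.

```python
class BimodalPredictor:
    """One 2-bit saturating counter per branch address (assumed organisation)."""

    def __init__(self, initial=2):
        self.initial = initial      # assumed starting value for unseen branches
        self.counters = {}          # branch address -> counter in 0..3

    def predict(self, pc):
        # Assumption: counter values 2 and 3 mean "predict taken".
        return self.counters.get(pc, self.initial) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, self.initial)
        # Saturate at 0 and 3 instead of wrapping around.
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses

if __name__ == "__main__":
    loop = ([True] * 9 + [False]) * 5   # 10-iteration loop executed 5 times
    print(count_mispredictions(BimodalPredictor(), 0x100, loop))
```

For the loop branch this leaves one misprediction per loop execution (at the exit), because a single not-taken outcome is not enough to flip the prediction.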
Dynamic Branch Prediction: History of last n occurrences
Each entry holds the outcome of the last three occurrences of this branch (0: not taken, 1: taken)
current entry: 1 1 0
actual outcome: ‘taken’
updated entry: 1 1 1
Prediction using majority decision
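The history-with-majority scheme can be sketched as a per-branch shift register. The details below are assumptions for illustration: the newest outcome is shifted in at the left (which matches the 1 1 0 to 1 1 1 example), unseen branches start with an all-not-taken history, and a tie for even n predicts not taken. The class name is hypothetical.

```python
class HistoryPredictor:
    """Keeps the last n outcomes per branch and predicts by majority."""

    def __init__(self, n=3):
        self.n = n
        self.history = {}   # branch address -> tuple of n bits, newest first

    def _entry(self, pc):
        # Assumption: an unseen branch starts with an all-not-taken history.
        return self.history.get(pc, (0,) * self.n)

    def predict(self, pc):
        # Taken only if strictly more than half of the recorded outcomes are taken.
        return 2 * sum(self._entry(pc)) > self.n

    def update(self, pc, taken):
        # Shift the new outcome in at the left and drop the oldest bit.
        self.history[pc] = (int(taken),) + self._entry(pc)[:-1]

if __name__ == "__main__":
    p = HistoryPredictor(n=3)
    p.history[0x40] = (1, 1, 0)          # current entry from the slide
    print(p.predict(0x40))               # majority of 1 1 0
    p.update(0x40, True)                 # actual outcome: taken
    print(p.history[0x40])
```

With n = 3 this reproduces the rule that two or three taken outcomes predict taken, and zero or one predict not taken.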
Dynamic Branch Prediction: storing prediction counters
• Store in separate buffer, or
• Store in cache directory
[Figure: cache with directory and storage; each cache line has a counter]
Correct guesses vs. history length
[Figure: plot of correct guesses against history length]
Two-Level Prediction
• Uses two levels of information to make a direction prediction
– Branch History Table (BHT) - last n occurrences
– Pattern History Table (PHT) - saturating 2 bit counters
• Captures patterned behavior of branches
– Groups of branches are correlated
– Particular branches have particular behavior
Correlation between branches
B1: if (x) ...
B2: if (y) ...
    z = x && y
B3: if (z) ...
• B3 can be predicted with 100% accuracy based on the outcomes of B1 and B2
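The correlation can be illustrated with a small simulation. This is a sketch, not a model of any particular processor: it assumes that "taken" means the branch condition is true, that x and y are independent coin flips, and that B3 is predicted by 2-bit counters indexed by the outcomes of B1 and B2 (a 2-bit global history). A single history-free counter is run alongside for comparison. All names are hypothetical.

```python
import random

def bump(counter, taken):
    """2-bit saturating update: up on taken, down on not taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

def simulate(trials=2000, seed=0):
    """Return (hits with B1/B2 history, hits with a single counter) for B3."""
    rng = random.Random(seed)
    pht = [2] * 4       # counters for B3, indexed by the outcomes of B1 and B2
    solo = 2            # one counter for B3 that ignores B1 and B2
    hits_corr = hits_solo = 0
    for _ in range(trials):
        x = rng.random() < 0.5      # outcome of B1
        y = rng.random() < 0.5      # outcome of B2
        z = x and y                 # outcome of B3
        idx = (int(x) << 1) | int(y)
        hits_corr += (pht[idx] >= 2) == z
        hits_solo += (solo >= 2) == z
        pht[idx] = bump(pht[idx], z)
        solo = bump(solo, z)
    return hits_corr, hits_solo

if __name__ == "__main__":
    print(simulate())
```

Because the outcome of B3 is fixed for each combination of B1 and B2, the indexed counters can miss at most once per combination while warming up; after that every prediction of B3 is correct, whereas the single counter keeps missing.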
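Some Two-level Predictors
[Figure: Global Predictor, in which the GBHR indexes the PHT to give T/NT; Local Predictor, in which the PC selects a BHT entry whose history bits index the PHT to give T/NT; example history patterns such as 10110 and 11010 are shown]
Bits from PC and BHT can be combined to index PHT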
Two-level Predictor Classification
• Yeh and Patt 3-letter naming scheme
– Type of history collected: G (global), P (per branch), S (per set)
– PHT type: A (adaptive), S (static)
– PHT organization: g (global), p (per branch), s (per set)
• Examples: GAs, PAp etc.
Branch Target Capture
• Branch Target Buffer (BTB)
• Target Instruction Buffer (TIB)
Entry fields: instr addr | pred stats | target addr (BTB) or target instr (TIB)
prob of target change < 5%
BTB Performance
decision                      result    prob   delay
BTB miss (.4): go inline      inline    .8     0
                              target    .2     5
BTB hit (.6): go to target    target    .8     0
                              inline    .2     4
Expected delay = .4*.8*0 + .4*.2*5 + .6*.2*4 + .6*.8*0 = 0.88
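The expected delay is a probability-weighted sum over the four paths of the decision tree. The sketch below only re-does that arithmetic; the `PATHS` layout and the function name are assumptions, and the numbers are the ones on the slide.

```python
# Each path: (BTB event, actual outcome, P(event), P(outcome | event), delay in cycles).
PATHS = [
    ("miss", "inline", 0.4, 0.8, 0),   # guessed inline, was inline
    ("miss", "target", 0.4, 0.2, 5),   # guessed inline, went to target
    ("hit",  "inline", 0.6, 0.2, 4),   # guessed target, was inline
    ("hit",  "target", 0.6, 0.8, 0),   # guessed target, went to target
]

def expected_branch_delay(paths=PATHS):
    """Sum of P(event) * P(outcome | event) * delay over all paths."""
    return sum(p_event * p_outcome * delay
               for _, _, p_event, p_outcome, delay in paths)

if __name__ == "__main__":
    print(round(expected_branch_delay(), 2))
```

Only the two mispredicted paths contribute, 0.4 * 0.2 * 5 = 0.40 and 0.6 * 0.2 * 4 = 0.48, which add up to the 0.88 cycles per branch quoted above.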
Dynamic information about branch
• Previous branch decisions
– Explicit prediction
– Stored in cache directory
– Branch History Table, BHT
• Previous target address / instruction
– Implicit prediction
– Stored in separate buffer
– Branch Target Buffer, BTB; Br Target Addr Cache, BTAC; Target Instr Buffer, TIB; Br Target Instr Cache, BTIC
These two can be combined
Storing prediction info
• In cache: directory | storage (cache line) | counter
• In separate buffer: instr addr | pred stats | target
Combined prediction mechanism
• Explicit: use history bits
• Implicit: use BTB hit/miss
– hit: go to target, miss: go inline
• Combined: BTB hit/miss followed by explicit prediction using history bits. One of the following is commonly used
– hit: go to target, miss: explicit prediction
– miss: go inline, hit: explicit prediction
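The two combined schemes listed above can be written as a small decision function. This is a sketch only: the function name, the scheme labels `"hit-target"` and `"miss-inline"`, and the boolean `history_says_taken` (standing in for whatever the history bits predict) are all hypothetical.

```python
def combined_predict(btb_hit, history_says_taken, scheme):
    """Return "T" (go to target) or "I" (go inline) for one branch."""
    explicit = "T" if history_says_taken else "I"
    if scheme == "hit-target":
        # Hit: go to target. Miss: fall back on the explicit prediction.
        return "T" if btb_hit else explicit
    if scheme == "miss-inline":
        # Miss: go inline. Hit: let the explicit prediction decide.
        return explicit if btb_hit else "I"
    raise ValueError(f"unknown scheme: {scheme!r}")

if __name__ == "__main__":
    for scheme in ("hit-target", "miss-inline"):
        for hit in (False, True):
            for hist in (False, True):
                print(scheme, hit, hist, combined_predict(hit, hist, scheme))
```

The schemes differ only when the BTB and the history bits disagree: the first trusts a BTB hit unconditionally, the second trusts a BTB miss unconditionally.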
Combined prediction
[Figure: decision trees for the two combined schemes, branching first on BTB miss / BTB hit and then on the explicit prediction. Prediction T: Target, I: Inline. Actual outcome T: Target, I: Inline]
Structure of Tables
Instruction fetch path with
• BHT
• BTAC
• BTIC
Compute/fetch scheme (no dynamic branch prediction)
[Figure: instruction fetch path with the instruction fetch address register (IFAR), I-cache delivering I, I+1, I+2, I+3, an adder for the next sequential address, and a "Compute BTA" block supplying the BTA; target stream BTI+1, BTI+2, BTI+3]
BHT (Branch History Table)
[Figure: the instruction fetch address accesses the I-cache (16K, 4-way set assoc, 128 x 4 lines, 8 instr/line) and the BHT (128 x 4 entries); 4 instr/cycle go to the decode queue and issue queue; history bits feed the prediction logic, which gives taken / not taken and the BTA for a taken guess]
BTAC scheme
[Figure: instruction fetch path with the IFAR, I-cache delivering I, I+1, I+2, I+3, an adder for the next sequential address, and a BTAC that maps BA to BTA; target stream BTI+1, BTI+2, BTI+3]
BTIC scheme - 1
[Figure: instruction fetch path with the IFAR, I-cache delivering I, an adder for the next sequential address, and a BTIC whose entries hold BA, BTI and BTA+; output goes to the decoder]
BTIC scheme - 2
[Figure: instruction fetch path with the IFAR, I-cache delivering I and I+1, an adder for the next sequential address, a BTIC whose entries hold BA and BTI+1, and a computed BTA+; output goes to the decoder]
Successor index in I-cache
[Figure: instruction fetch path with the IFAR and an I-cache delivering I, I+1, I+2, I+3; each line carries a successor index that supplies the next address; target stream BTI+1, BTI+2, BTI+3]