COM 506 Computer Design Lecture 3 Branch Prediction
Prof. Taeweon Suh
Computer Science Education, Korea University
Predict What?
• Direction (1-bit)
  § Single direction for unconditional jumps and calls/returns
  § Binary for conditional branches
• Target (32-bit or 64-bit addresses)
  § Some are easy
    • One: Uni-directional jumps
    • Two: Fall through (Not Taken) vs. Taken
  § Many: Function Pointer or Indirect Jump (e.g. jr r31)
Prof. Sean Lee’s Slide, Korea Univ
Categorizing Branches
Source: H&P using Alpha
Branch Misprediction
[Figure: 20-cycle, single-issue pipeline. Stages in order: PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve.]
• Mispredict: flush entailed instructions and refetch
• Fetch the correct path
• Worst case: 8-issue superscalar processor
Why Are Branches Predictable?
  for (i=0; i<100; i++) {
    ….
  }

        addi r10, r0, 100
        addi r1, r0, 0
  L1:   …
        addi r1, r1, 1
        bne  r1, r10, L1

  if (aa==2) aa = 0;
  if (bb==2) bb = 0;
  if (aa!=bb) ….

          addi r2, r0, 2         # r10 holds aa, r11 holds bb
          bne  r10, r2, L_bb
          xor  r10, r10, r10
          j    L_exit
  L_bb:   bne  r11, r2, L_xx
          xor  r11, r11, r11
          j    L_exit
  L_xx:   …
  L_exit: beq  r10, r11, L_exit
Control Speculation
• Execute instructions beyond a branch before the branch is resolved → Performance
• Speculative execution
• What if mis-speculated? Need
  § Recovery mechanism
  § Squash instructions on the incorrect path
• Branch prediction: Dynamic vs. Static
• What to predict?
Static Branch Prediction
• Uni-directional, always predict taken
• Backward taken, Forward not taken
  § Need offset information
• Compiler hints with branch annotation
• Static prediction is used as a fall-back technique in some processors with dynamic branch prediction, when there is not any information for the dynamic predictors to use
• Example
  § Pentium 4 introduced static hints to branches
  § Pentium 4 uses it as a fall-back: instruction prefixes can be added before a branch instruction
    • 0x3E: statically predict a branch as taken
    • 0x2E: statically predict a branch as not taken
Modified from Prof. Sean Lee’s Slide
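The "backward taken, forward not taken" rule can be sketched in a few lines of C. This is only an illustration: the function name `predict_btfnt` is hypothetical, and it assumes 4-byte-aligned instructions and that the branch target is already known from the offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Backward-taken / forward-not-taken (BTFNT), a minimal sketch.
 * A branch whose target lies at a strictly lower address than the branch
 * itself is treated as a loop back-edge and predicted taken; every other
 * branch is predicted not taken. */
static bool predict_btfnt(uint32_t branch_pc, uint32_t target_pc)
{
    /* Assumption: fixed 4-byte instructions, so both addresses are aligned. */
    assert((branch_pc & 3u) == 0 && (target_pc & 3u) == 0);
    return target_pc < branch_pc;
}
```

For example, a loop-closing branch at 0x40010A08 that jumps back to 0x40010108 would be predicted taken, while a forward branch from 0x40010110 to 0x40010210 would be predicted not taken.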
Simplest Dynamic Branch Predictor
• Prediction based on latest outcome
• Index by some bits in the branch PC
  § Aliasing

  for (i=0; i<100; i++) {
    ….
  }

  0x40010100        addi r10, r0, 100
  0x40010104        addi r1, r0, 0
  0x40010108  L1:   …
  …
  0x40010A04        addi r1, r1, 1
  0x40010A08        bne  r1, r10, L1

[Figure: 1-bit Branch History Table indexed by bits of the branch PC; each entry holds T or NT.]
How accurate?
Typical Table Organization
[Figure: a hash of the 32-bit PC yields N bits that index a table of 2^N entries. The selected entry supplies the prediction; once the actual outcome is known, the FSM update logic writes the new state back to the table.]
Simplest Dynamic Branch Predictor
  for (i=0; i<100; i++) {
    if (a[i] == 0) {
      …
    }
  }

  0x40010100        addi r10, r0, 100
  0x40010104        addi r1, r0, 0
  0x40010108  L1:   add  r21, r20, r1
  0x4001010c        lw   r2, (r21)
  0x40010110        beq  r2, r0, L2
                    …
                    j    L3
  0x40010210  L2:   …
  0x40010B0c  L3:   addi r1, r1, 1
  0x40010B10        bne  r1, r10, L1

[Figure: 1-bit Branch History Table; the two branches (beq and bne) index separate entries, each holding T or NT.]
FSM of the Simplest Predictor
• A 2-state machine
• Change mind fast
[Figure: two states, 0 (Predict not taken) and 1 (Predict taken). If the branch is taken, move to state 1; if the branch is not taken, move to state 0.]
Example using 1-bit branch history table
  for (i=0; i<4; i++) {
    ….
  }

        addi r10, r0, 4
        addi r1, r0, 0
  L1:   …
        addi r1, r1, 1
        bne  r1, r10, L1

  Pred:    0  1  1  1  1   0
  Actual:  T  T  T  T  NT  T
60% accuracy
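The 60% figure can be checked with a small simulation. The sketch below is an illustration rather than the slide's own material: `run_one_bit` is a hypothetical helper, and it assumes the outcome pattern T T T T NT from the Actual row repeats each time the loop is re-entered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Replays n branch outcomes through a 1-bit predictor whose state is simply
 * the last outcome seen (true = taken). Returns the number of correct
 * predictions and leaves the final state in *state. */
static size_t run_one_bit(bool *state, const bool *outcomes, size_t n)
{
    assert(state != NULL && outcomes != NULL);
    size_t correct = 0;
    for (size_t i = 0; i < n; i++) {
        if (*state == outcomes[i])
            correct++;
        *state = outcomes[i];   /* change mind immediately */
    }
    return correct;
}
```

Starting from state 0, one pass over T T T T NT gets 3 of 5 predictions right: the first branch and the loop exit are both mispredicted. The exit leaves the state at 0 again, so every later pass repeats the same two misses.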
2-bit Saturating Up/Down Counter Predictor
• MSB: Direction bit
• LSB: Hysteresis bit
• States
  § 11/ST: Strongly Taken (predict taken)
  § 10/WT: Weakly Taken (predict taken)
  § 01/WN: Weakly Not Taken (predict not taken)
  § 00/SN: Strongly Not Taken (predict not taken)
[Figure: state diagram. Taken moves one state up toward 11/ST, Not Taken moves one state down toward 00/SN; the counter saturates at both ends.]
2-bit Counter Predictor (Another Scheme)
• Same four states: 11/ST and 10/WT predict taken; 01/WN and 00/SN predict not taken
[Figure: state diagram with an alternative set of Taken / Not Taken transitions.]
Example using 2-bit up/down counter
  for (i=0; i<4; i++) {
    ….
  }

        addi r10, r0, 4
        addi r1, r0, 0
  L1:   …
        addi r1, r1, 1
        bne  r1, r10, L1

  Pred:    01  10  11  11  10
  Actual:  T   T   T   T   NT  T
80% accuracy
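A short simulation shows where the 80% figure comes from. This is a sketch under assumptions: the helper names are hypothetical, the counter encoding follows the saturating up/down scheme (0 = SN, 1 = WN, 2 = WT, 3 = ST), and the outcome pattern T T T T NT is assumed to repeat on each loop re-entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The MSB of the 2-bit counter is the predicted direction. */
static bool counter_predict(uint8_t c)
{
    return (c & 2u) != 0;
}

/* Saturating update: count up on taken, down on not taken. */
static uint8_t counter_update(uint8_t c, bool taken)
{
    assert(c <= 3);
    if (taken)
        return c < 3 ? (uint8_t)(c + 1) : 3;
    return c > 0 ? (uint8_t)(c - 1) : 0;
}

/* Replays n outcomes through one counter; returns the number of hits. */
static size_t run_two_bit(uint8_t *c, const bool *outcomes, size_t n)
{
    assert(c != NULL && outcomes != NULL);
    size_t correct = 0;
    for (size_t i = 0; i < n; i++) {
        if (counter_predict(*c) == outcomes[i])
            correct++;
        *c = counter_update(*c, outcomes[i]);
    }
    return correct;
}
```

Starting from 01, the first pass scores 3 of 5 (a cold miss plus the loop exit). The exit only drops the counter to 10, which still predicts taken, so each later pass misses only the exit: 4 of 5, or 80%.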
Branch Correlation
Code Snippet
  if (aa==2)      // b1
    aa = 0;
  if (bb==2)      // b2
    bb = 0;
  if (aa!=bb) {   // b3
    …….
  }

[Figure: decision tree over b1 and b2 leading to b3, with 1 (T) and 0 (NT) on each edge.]
  Path A: 1-1   aa=0, bb=0
  Path B: 1-0   aa=0, bb≠2
  Path C: 0-1   aa≠2, bb=0
  Path D: 0-0   aa≠2, bb≠2

• Branch direction
  § Not independent
  § Correlated to the path taken
• Example: on path 1-1, the outcome of b3 is known for certain beforehand
• Track path using a 2-bit register
Correlated Branch Predictor [Pan, So, Rahmeh ’92]
[Figure: two schemes side by side. 2-bit Sat. Counter Scheme: a hash of the branch PC selects one 2-bit counter, which gives the prediction. (2, 2) Correlation Scheme: a hash of the branch PC selects a row of 2-bit counters, and a 2-bit shift register (global branch history) selects which counter in the row predicts the subsequent branch direction.]
• (M, N) correlation scheme
  § M: shift register size (# bits)
  § N: N-bit counter
Two-Level Branch Predictor [Yeh, Patt 91, 92, 93]
[Figure: an N-bit Branch History Register (BHR) holding outcomes Rc-k … Rc-1 (shifted left on update) supplies the branch history pattern that indexes a Pattern History Table (PHT) of 2^N entries (00…00 through 11…11). The selected entry's current state gives the prediction; the PHT update FSM logic uses Rc, the actual branch outcome.]
• Generalized correlated branch predictor
• 1st level keeps branch history in Branch History Register (BHR)
• 2nd level segregates pattern history in Pattern History Table (PHT)
Branch History Register
• An N-bit Shift Register = 2^N patterns in PHT
• Shift-in branch outcomes
  § 1: taken
  § 0: not taken
• First-in First-Out
• BHR can be
  § Global
  § Per-set
  § Local (Per-address)
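The shift-in behaviour of the BHR is compact enough to write out. A minimal sketch, assuming the newest outcome enters at the least significant bit and the oldest falls off the top; the function name `bhr_shift_in` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Shifts one branch outcome (1 = taken, 0 = not taken) into an n_bits-wide
 * history register and returns the new register value. */
static uint32_t bhr_shift_in(uint32_t bhr, unsigned n_bits, bool taken)
{
    assert(n_bits >= 1 && n_bits <= 31);
    uint32_t mask = (1u << n_bits) - 1u;
    return ((bhr << 1) | (taken ? 1u : 0u)) & mask;
}
```

With a 4-bit register holding 0110, a not-taken outcome gives 1100: the oldest bit is dropped and a 0 enters on the right.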
Pattern History Table
• 2^N entries addressed by N-bit BHR
• Each entry keeps a counter (2-bit or more) for prediction
  § Counter update: the same as 2-bit counter
  § Can be initialized in alternate patterns (01, 10, ..)
• Alias (or interference) problem
Source: A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History by Yeh and Patt, 1993
Global History Schemes
• GAg: Global BHR with a Global PHT
• GAs: Global BHR with Per-set PHTs (SPHTs), selected by SetP(B)
• GAp: Global BHR with Per-addr PHTs (PPHTs), selected by Addr(B)
Set can be determined by branch opcode, compiler classification, or branch PC address.
* [Pan, So, Rahmeh ’92] similar to GAp
GAs Two-Level Branch Prediction
• The 2 LSBs are insignificant for 32-bit instructions
[Figure: PC = 0x4001000C supplies the Set bits 0011 and the BHR holds 0110; together they form the PHT index 00110110. That PHT entry holds 10.]
• MSB = 1 → Predict Taken
Predictor Update (Actually, Not Taken)
[Figure: for PC = 0x4001000C, the PHT entry at index 00110110 was a Wrong Prediction, so its counter is decremented from 10 to 01. The BHR shifts in the outcome and changes from 0110 to 1100, so the next lookup uses index 00111100.]
• Update Predictor after branch is resolved
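The lookup and update in the GAs example can be traced in code. The sketch below is an illustration under assumptions, not the slide's implementation: the type `Gas` and its functions are hypothetical names, the set is taken to be PC bits [5:2] (which reproduces the slide's index for PC = 0x4001000C), and the history is 4 bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { HIST_BITS = 4, SET_BITS = 4, PHT_SIZE = 1 << (HIST_BITS + SET_BITS) };

typedef struct {
    uint8_t pht[PHT_SIZE];  /* 2-bit saturating counters */
    uint8_t bhr;            /* global history, HIST_BITS wide */
} Gas;

/* PHT index = {set bits from the PC, global history}. The 2 LSBs of the PC
 * are dropped because instructions are 4 bytes wide. */
static unsigned gas_index(const Gas *g, uint32_t pc)
{
    unsigned set = (pc >> 2) & ((1u << SET_BITS) - 1u);
    return (set << HIST_BITS) | g->bhr;
}

/* Predict taken when the MSB of the selected counter is 1. */
static bool gas_predict(const Gas *g, uint32_t pc)
{
    return (g->pht[gas_index(g, pc)] & 2u) != 0;
}

/* After the branch resolves: adjust the counter, then shift the history. */
static void gas_update(Gas *g, uint32_t pc, bool taken)
{
    uint8_t *c = &g->pht[gas_index(g, pc)];
    assert(*c <= 3);
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    g->bhr = (uint8_t)(((g->bhr << 1) | (taken ? 1u : 0u))
                       & ((1u << HIST_BITS) - 1u));
}
```

With the BHR at 0110 and the counter at index 00110110 set to 10, the lookup for PC = 0x4001000C predicts taken. Updating with a not-taken outcome decrements that counter to 01 and moves the BHR to 1100, so the same PC next indexes 00111100.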
Per-Address History Schemes
• PAg: Per-addr BHT (PBHT), indexed by Addr(B), with a Global PHT
  § Ex: Alpha 21264’s local predictor
• PAs: Per-addr BHT (PBHT), indexed by Addr(B), with Per-set PHTs (SPHTs) selected by SetP(B)
  § Ex: P6, Itanium
• PAp: Per-addr BHT (PBHT), indexed by Addr(B), with Per-addr PHTs (PPHTs) selected by Addr(B)
PAs Two-Level Branch Predictor
[Figure: bits of PC = 1110 0000 1001 0010 1100 1110 1000 select the Set and one entry of the per-address BHT (entries 000 through 111). The resulting PHT index is 11010110, and that PHT entry holds 11.]
• MSB = 1 → Predict Taken
Per-Set History Schemes
• SAg: Per-set BHT (SBHT), indexed by SetH(B), with a Global PHT
• SAs: Per-set BHT (SBHT), indexed by SetH(B), with Per-set PHTs (SPHTs) selected by SetP(B)
• SAp: Per-set BHT (SBHT), indexed by SetH(B), with Per-addr PHTs (PPHTs) selected by Addr(B)
PHT Indexing
  Branch addr   Global history   Gselect 4/4
  00000000      00000001         00000001
  00000000      00000000         00000000
  11111111      00000000         11110000
  11111111      10000000         11110000   (Insufficient History)
• Tradeoff between more history bits and address bits
• Too many bits needed in Gselect → sparse table entries
Gshare Branch Predictor [McFarling 93]
  Branch addr   Global history   Gselect 4/4   Gshare 8/8
  00000000      00000001         00000001      00000001
  00000000      00000000         00000000      00000000
  11111111      00000000         11110000      11111111
  11111111      10000000         11110000      01111111
Gselect 4/4: Index PHT by concatenating the low-order 4 bits of the branch address and of the global history
Gshare 8/8: Index PHT by {Branch address XOR Global history}
• Tradeoff between more history bits and address bits
• Too many bits needed in Gselect → sparse table entries
• Gshare: Not to lose global history bits
• Ex: AMD Athlon, MIPS R12000, Sun MAJC, Broadcom SiByte’s SB-1
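The two indexing functions differ by a single operator, which a short sketch makes concrete. The function names are hypothetical, and both functions assume 8-bit address and history inputs as in the 4/4 and 8/8 configurations named on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Gselect 4/4: low 4 address bits concatenated with low 4 history bits. */
static uint8_t gselect_4_4(uint8_t addr, uint8_t hist)
{
    return (uint8_t)(((addr & 0x0Fu) << 4) | (hist & 0x0Fu));
}

/* Gshare 8/8: all 8 address bits XORed with all 8 history bits. */
static uint8_t gshare_8_8(uint8_t addr, uint8_t hist)
{
    return (uint8_t)(addr ^ hist);
}

/* Address 11111111 with histories 00000000 and 10000000: gselect drops the
 * differing history bit and collides, gshare keeps the two cases apart. */
static void check_history_collision(void)
{
    assert(gselect_4_4(0xFF, 0x00) == gselect_4_4(0xFF, 0x80));
    assert(gshare_8_8(0xFF, 0x00) != gshare_8_8(0xFF, 0x80));
}
```

Both schemes use an 8-bit index here, so the PHT is the same size; gshare simply lets every history bit influence the index instead of discarding the upper four.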
Gshare Branch Predictor
[Figure: the PC address bits are XORed with the Global BHR to form the PHT index; the selected PHT entry gives the prediction.]
• MSB = 0 → Predict Not Taken
Aliasing Example
• GAp (index = PC || BHR, PHT indexed by 10)
  § BHR 1101, PC 0110 → 1001
  § BHR 1001, PC 1010 → 1001 (same entry: aliasing)
• Gshare (index = BHR XOR PC, one PHT with entries 0000 through 1111)
  § BHR 1101, PC 0110 → 1011
  § BHR 1001, PC 1010 → 0011 (different entries)
Hybrid Branch Predictor [McFarling 93]
[Figure: the branch PC feeds component predictors P0, P1, …; a Choice (or Meta) Predictor selects which component's output becomes the Final Prediction.]
• Some branches correlated to global history, some correlated to local history
• Only update the meta-predictor when the 2 predictors disagree
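The update rule for the meta-predictor can be sketched as follows. This is an assumption-laden illustration: the chooser is modelled as a 2-bit saturating counter whose upper half selects P1 and lower half selects P0, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Final prediction: counter values 2..3 pick P1, values 0..1 pick P0. */
static bool hybrid_predict(uint8_t choice, bool p0, bool p1)
{
    assert(choice <= 3);
    return (choice & 2u) ? p1 : p0;
}

/* Chooser update once the branch resolves. When both components agree,
 * the outcome says nothing about which one is better, so nothing changes. */
static uint8_t chooser_update(uint8_t choice, bool p0, bool p1, bool taken)
{
    assert(choice <= 3);
    if (p0 == p1)
        return choice;
    if (p1 == taken)                                   /* P1 was right */
        return choice < 3 ? (uint8_t)(choice + 1) : 3;
    return choice > 0 ? (uint8_t)(choice - 1) : 0;     /* P0 was right */
}
```

The component predictors themselves would still be updated on every branch; only the chooser is gated on disagreement.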
Alpha 21264 (EV6) Hybrid Predictor
• A “tournament branch predictor”
• Multi-predictor scheme w/
  § Local predictor (~PAg)
    • Self-correlation
  § Global predictor
    • Inter-correlation
  § Choice predictor as the decision maker: a 2-bit sat. counter to credit either local or global predictors
• Die size impact
  § History info tables ~2%
  § BTB ~2.7% (associated with I-$ on a per-line basis)
• 2-cycle latency, we will discuss more later
[Figure: the PC indexes the Local History Table (1024 x 10 bits); the 10-bit history indexes the Single Local Predictor (1024 x 3 bits), which gives the local prediction. The 12-bit global history indexes the Global Predictor and the Choice Predictor (4096 x 2 bits), which give the global prediction and the meta prediction. The meta prediction selects the final branch prediction. Separately, the virtual address drives the L1 I-cache (64KB, 2-way) & TLB with next line/set prediction, for single-cycle prediction at 4 instr./cycle.]
Branch Target Prediction
• Try the easy ones first
  § Direct jumps
  § Call/Return
  § Conditional branch (bi-directional)
• Branch Target Buffer (BTB)
• Return Address Stack (RAS)
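Branch Target Buffer (BTB)
[Figure: the branch PC indexes the BTB, whose entries hold a Tag and a Target. The stored tags are compared (=) with the PC. The predicted branch direction drives a multiplexer: 1 selects the Branch Target from the BTB, 0 selects PC + 4.]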
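The next-fetch selection in the BTB figure can be sketched in C. This is a simplified model under stated assumptions: the BTB here is direct-mapped with 256 entries (the figure suggests more than one way), instructions are 4 bytes wide, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { BTB_ENTRIES = 256 };

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t target;
} BtbEntry;

typedef struct {
    BtbEntry e[BTB_ENTRIES];
} Btb;

/* Word address split into index (low bits) and tag (remaining bits). */
static unsigned btb_index(uint32_t pc) { return (pc >> 2) % BTB_ENTRIES; }
static uint32_t btb_tag(uint32_t pc)   { return (pc >> 2) / BTB_ENTRIES; }

/* Records the target of a resolved taken branch. */
static void btb_insert(Btb *b, uint32_t pc, uint32_t target)
{
    assert(b != NULL);
    BtbEntry *e = &b->e[btb_index(pc)];
    e->valid  = true;
    e->tag    = btb_tag(pc);
    e->target = target;
}

/* Next fetch address: the stored target on a tag hit with a taken
 * prediction, otherwise the fall-through address PC + 4. */
static uint32_t next_fetch_pc(const Btb *b, uint32_t pc, bool predict_taken)
{
    assert(b != NULL);
    const BtbEntry *e = &b->e[btb_index(pc)];
    bool hit = e->valid && e->tag == btb_tag(pc);
    return (hit && predict_taken) ? e->target : pc + 4u;
}
```

The tag check matters because two branches can share an index; without it, a branch would fetch from another branch's target.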
Return Address Stack (RAS)
• Different call sites make return address hard to predict
  § printf() being called by many callers
  § The target of the “return” instruction in printf() is a moving target
• A hardware stack (LIFO)
  § Call will push return address on the stack
  § Return uses the prediction off of TOS
Return Address Stack
[Figure: on a call, Call PC + 4 is pushed onto the stack as the return address (Return PC). The BTB indicates whether the fetched instruction is a return.]
• Does it always work?
  § Call depth
  § Setjmp/Longjmp
  § Speculative call?
• May not know it is a return instruction prior to decoding
  § Rely on BTB for speculation
  § Fix once the return is recognized
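A fixed-depth RAS, including the call-depth limitation, can be sketched as a small circular stack. The details are assumptions for illustration: a depth of 4, overwriting the oldest entry on overflow, and the names `Ras`, `ras_push`, and `ras_pop` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { RAS_DEPTH = 4 };

typedef struct {
    uint32_t slot[RAS_DEPTH];
    unsigned top;    /* index of the next free slot, wraps around */
    unsigned count;  /* number of valid entries, saturates at RAS_DEPTH */
} Ras;

/* On a call: push the return address (call PC + 4). When the stack is
 * full, the oldest entry is overwritten. */
static void ras_push(Ras *r, uint32_t call_pc)
{
    assert(r != NULL);
    r->slot[r->top] = call_pc + 4u;
    r->top = (r->top + 1u) % RAS_DEPTH;
    if (r->count < RAS_DEPTH)
        r->count++;
}

/* On a return: pop the top of stack as the predicted target. Returns false
 * when no valid entry is left, in which case *predicted is untouched. */
static bool ras_pop(Ras *r, uint32_t *predicted)
{
    assert(r != NULL && predicted != NULL);
    if (r->count == 0)
        return false;
    r->top = (r->top + RAS_DEPTH - 1u) % RAS_DEPTH;
    r->count--;
    *predicted = r->slot[r->top];
    return true;
}
```

With five nested calls and a depth of 4, the four innermost returns are predicted correctly, and the outermost return finds the stack empty because its entry was overwritten.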
Indirect Jump
• Need Target Prediction
  § Many (potentially 2^30 for a 32-bit machine)
  § In reality, not so many
  § Similar to predicting values
• Tagless Target Prediction
• Tagged Target Prediction
Tagless Target Prediction [Chang, Hao, Patt ’97]
[Figure: a hash of the branch PC and the Branch History Register (BHR) pattern indexes a Target Cache of 2^N entries (00…00 through 11…11), which supplies the predicted target address.]
• Modify the PHT to be a “Target Cache”
  § (indirect jump) ? (from target cache) : (from BTB)
• Alias?
Tagged Target Prediction [Chang, Hao, Patt ’97]
[Figure: a hash of the branch PC and the BHR produces an n-bit index into a Target Cache with 2^n entries per way (00…00 through 11…11); a Tag Array is compared (=?) to select the predicted target address.]
• To reduce aliasing with set-associative target cache
• Use branch PC and/or history for tags
Multiple Branch Prediction
• For a really wide machine
  § Across several basic blocks
  § Need to predict multiple branches per cycle
• How to fetch non-contiguous instructions in one cycle?
• Prediction accuracy extremely critical (will be reduced geometrically)
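The geometric reduction can be made concrete with a little arithmetic. The sketch below assumes each prediction is independently correct with the same probability p, which is a simplification; the function name `all_correct` is hypothetical.

```c
#include <assert.h>

/* Probability that all n branch predictions made in one cycle are correct,
 * assuming each is independently correct with probability p: p^n. */
static double all_correct(double p, unsigned n)
{
    assert(p >= 0.0 && p <= 1.0);
    double prob = 1.0;
    for (unsigned i = 0; i < n; i++)
        prob *= p;
    return prob;
}
```

Under that assumption, a predictor that is 95% accurate per branch gets all three branches of a three-branch fetch right only about 85.7% of the time (0.95 × 0.95 × 0.95 = 0.857375).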
Backup Slides
Alpha EV8 Branch Predictor
[Figure: the branch PC and the global history are hashed by functions F1 to F4 to index four tables: Bimodal, G0, G1, and Meta. A majority vote forms the e-gskew prediction, and Meta selects the final prediction.]
• Real silicon never saw the light of day
• Use a 2Bc-gskew predictor (one form of enhanced gskew)
  § Bimodal predictor used as (1) static biased predictor and (2) part of e-gskew predictor
  § Global predictors G0 and G1 are part of e-gskew predictor
  § Table sizes: 352 Kbits in total (208 Kbits for prediction table; 144 Kbits for hysteresis table)