CS 7810 Lecture 6
The Impact of Delay on the Design of Branch Predictors
D. A. Jimenez, S. W. Keckler, C. Lin
Proceedings of MICRO-33, 2000

Bimodal Predictor
• 14 bits of the branch PC index a table of 16K entries of 2-bit saturating counters
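
The bimodal lookup can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the class name, the use of the low-order PC bits, and the initial counter value (weakly not-taken) are assumptions.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by PC bits (sketch)."""

    def __init__(self, index_bits=14):
        self.mask = (1 << index_bits) - 1
        # Assumption: every counter starts at 1 (weakly not-taken).
        self.counters = [1] * (1 << index_bits)

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
```

With 14 index bits the table has 16K entries, matching the slide. Because the counter saturates, a single anomalous outcome does not flip a strongly biased branch.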

Global Predictor
• A single register that keeps track of recent history for all branches
• Also referred to as a two-level predictor
[Diagram: 8 bits of global history (e.g. 00110101) and 6 bits of the branch PC index a table of 16K entries of 2-bit saturating counters]
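
As a rough sketch of the indexing above, the 8 history bits and 6 PC bits can be combined into one 14-bit index. The slide does not say how the bits are combined, so the concatenation order here is an assumption, as are the class name and the initial counter values.

```python
HISTORY_BITS = 8  # from the slide
PC_BITS = 6       # from the slide


class GlobalPredictor:
    """Global history + PC bits index one table of 2-bit counters (sketch)."""

    def __init__(self):
        self.history = 0  # single shift register shared by all branches
        self.counters = [1] * (1 << (HISTORY_BITS + PC_BITS))  # 16K entries

    def _index(self, pc):
        # Assumption: history forms the high bits, PC the low bits.
        pc_part = pc & ((1 << PC_BITS) - 1)
        return (self.history << PC_BITS) | pc_part

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the global history register.
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

A branch that strictly alternates taken/not-taken defeats a bimodal counter, but here each of the two steady-state history values maps to its own counter, so the pattern becomes predictable after warm-up.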

Local Predictor
• A two-level predictor that only uses local histories at the first level
• Use 6 bits of the branch PC to index into the local history table: a table of 64 entries of 14-bit histories, each for a single branch (the diagram shows the example pattern 1011011001)
• The 14-bit history indexes into the next level: a table of 16K entries of 2-bit saturating counters
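
The two levels can be sketched as follows, using the sizes given on the slide (64 histories of 14 bits, 16K counters). The class name, the choice of low-order PC bits, and the initial values are assumptions made for illustration.

```python
class LocalPredictor:
    """Two-level predictor with per-branch histories (sketch)."""

    def __init__(self, pc_bits=6, history_bits=14):
        self.pc_mask = (1 << pc_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.histories = [0] * (1 << pc_bits)      # level 1: 64 local histories
        self.counters = [1] * (1 << history_bits)  # level 2: 16K 2-bit counters

    def predict(self, pc):
        history = self.histories[pc & self.pc_mask]
        return self.counters[history] >= 2

    def update(self, pc, taken):
        slot = pc & self.pc_mask
        history = self.histories[slot]
        c = self.counters[history]
        self.counters[history] = min(3, c + 1) if taken else max(0, c - 1)
        # Only this branch's own history is shifted.
        self.histories[slot] = ((history << 1) | int(taken)) & self.hist_mask
```

For example, a loop branch that is taken three times and then falls through settles into four recurring history values, each followed by a fixed outcome, so the second-level counters learn the exit iteration.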

Tournament Predictors
• A local predictor might work well for some branches or programs, while a global predictor might work well for others
• Provide one of each and maintain another predictor to identify which predictor is best for each branch
[Diagram: the branch PC indexes the tournament predictor, a table of 2-bit saturating counters, which drives a MUX that selects between the local predictor and the global predictor]
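
A minimal sketch of the selection logic is shown below. The component predictions are passed in as booleans so the example stays self-contained; the table size, the counter encoding (high values favour the global predictor), and the rule of training only on disagreement are assumptions, not details from the slide.

```python
class Chooser:
    """PC-indexed table of 2-bit counters that picks a component (sketch)."""

    def __init__(self, index_bits=12):  # table size is a hypothetical choice
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # start weakly favouring local

    def select(self, pc, local_pred, global_pred):
        use_global = self.counters[pc & self.mask] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, local_pred, global_pred, taken):
        if local_pred == global_pred:
            return  # nothing to learn when the components agree
        i = pc & self.mask
        if global_pred == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Training only on disagreement keeps the chooser from drifting when both components are right or both are wrong.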

Terminology
• GAG: Global history indexes into global array of saturating counters
• PAG: Per-address history indexes into global array of saturating counters
• GAP: Global history indexes into each PC’s private array of counters (gselect)
• PAP: Per-address history indexes into each PC’s private array of counters

Prediction Accuracy Vs. IPC
• Fig. 1: IPC saturates at around 1.28, assuming single-cycle predictions
• A 2 KB predictor takes two cycles to access; multi-cycle predictors can’t yield IPC > 1.0 (reduced fetch bandwidth)
• However, note that a single-cycle predictor is within 10% of optimal IPC (might not be true for more aggressive out-of-order processors)

Long Latency Predictions
• Total branch latency C = d + (r × p)
    d = delay = 1
    r = mispredict rate = 0.04
    p = penalty = 20
• Always better to reduce d than r
• Note that correctly predicted branches are often not on the program critical path
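
To make the cost model concrete, the sketch below evaluates C = d + (r × p) with the slide's values. The function name and the two comparison points (a perfect predictor, and a slower predictor with half the misprediction rate) are hypothetical illustrations, not figures from the lecture.

```python
def branch_cost(d, r, p):
    """Average cycles per branch: prediction delay plus expected penalty."""
    return d + r * p


base = branch_cost(d=1, r=0.04, p=20)     # slide's values: about 1.8 cycles
perfect = branch_cost(d=1, r=0.0, p=20)   # hypothetical perfect accuracy: 1.0
slower = branch_cost(d=2, r=0.02, p=20)   # hypothetical: about 2.4 cycles

# With these numbers, even eliminating every misprediction saves less
# than one cycle of prediction delay would cost.
saving_from_perfect_accuracy = base - perfect  # about 0.8 cycles
```

Under this model, halving the misprediction rate at the price of one extra cycle of delay makes the average branch more expensive, which is the intuition behind the slide's claim about d and r.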

Branch Frequency
• Branches are not as common as we think: on average, they occur every 6 instructions, but 61% of the time, there is at least 1 cycle of separation
• Branches can be treated differently, based on whether they can tolerate latency or not

Branch Predictor Cache
• The cache is a subset of the 3-cycle predictor and requires tags
• ABP provides a prediction if there is a cache miss
[Diagram labels: XOR of address and history; Tags; 1-cycle PHT; ABP; 3-cycle PHT; Hit/Miss; Prediction]
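
The hit/miss selection can be sketched as below. Only the XOR indexing, the tags, and the ABP fallback come from the slide; the split of the index into slot and tag bits, the dictionary used for the cache, and all names are assumptions made for illustration.

```python
CACHE_SLOT_BITS = 8  # hypothetical: 256 cache slots


def fast_lookup(pc, history, cache, abp_prediction):
    """Return (prediction, hit) for the 1-cycle path (sketch).

    cache maps a slot number to a (tag, 2-bit counter) pair.
    """
    index = pc ^ history                         # XOR of address and history
    slot = index & ((1 << CACHE_SLOT_BITS) - 1)  # assumption: low bits pick the slot
    tag = index >> CACHE_SLOT_BITS               # assumption: remaining bits are the tag
    entry = cache.get(slot)
    if entry is not None and entry[0] == tag:
        return entry[1] >= 2, True               # hit: use the cached counter
    return abp_prediction, False                 # miss: fall back to the ABP
```

The tag check is what lets a small cache stand in for the much larger 3-cycle table without silently returning another branch's counter.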

Cascading Lookahead Prediction
• Use the current PC to predict where the next branch will go; initiate the look-up before you see that branch
• Use predictors with different latencies; when you do see the branch, use the prediction available to you
• You can use a good prediction 60% of the time and a poor prediction 40% of the time

Overriding Branch Predictor
• Use a quick-and-dirty prediction
• When you get the slow-and-clean prediction and it disagrees, initiate recovery action
• If prediction rates are 92% and 97%, 5% of all branches see a 2-cycle mispredict penalty and 3% see a 20-cycle penalty
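
The arithmetic in the last bullet can be checked with a short sketch. It follows the slide's simplification that the two predictors disagree on exactly the difference in their accuracies; the function name and the fast-predictor-only comparison are additions for illustration.

```python
def overriding_penalty(fast_acc, slow_acc, override_cycles, mispredict_cycles):
    """Expected penalty cycles per branch for an overriding predictor."""
    overridden = slow_acc - fast_acc  # slide's approximation of the disagreement rate
    mispredicted = 1.0 - slow_acc     # branches the slow predictor still gets wrong
    return overridden * override_cycles + mispredicted * mispredict_cycles


with_override = overriding_penalty(0.92, 0.97, 2, 20)  # 0.05*2 + 0.03*20, about 0.7
fast_only = (1.0 - 0.92) * 20                          # about 1.6 cycles per branch
```

Under these assumptions the overriding scheme pays roughly 0.7 penalty cycles per branch, against roughly 1.6 for the quick predictor alone.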

Combining the Predictors?
• Lookahead into a number of predictors
• When you see a branch (after 3 cycles), use the prediction from your cache (in case of a hit) or the prediction from the regular 3-cycle predictor (in case of a miss)
• When you see the super-duper 5-cycle prediction, let it override any previous incorrect prediction

Latencies
Technology   ABP Delay   ABP Entries   PHTC Entries   PHT Delay   PHT Entries
100 nm       1           1K            256            4           256K
35 nm        1           512           128            2           16K

Results (Fig. 8)
• The cache doesn’t seem to help at all (IPC of 1.1!) (it is very surprising that the ABP and PHT have matching predictions most of the time)
• For the cascading predictor, the slow predictor is used 45% of the time and it gives a better prediction than the 1-cycle predictor 5.5% of the time
• The overriding predictor disagrees 16.5% of the time and yields an IPC of 1.2 – hmmm…

Alpha 21264 Predictor
[Diagram labels: global history; chooser PHT, 512 entries; global history; global predictor PHT; 3200 bits; PC; local history, 128 entries; PHT, 128 entries]

Alpha 21464 (EV8)
• 352 Kb! 2-cycle access time; 4 predictor arrays accessed in parallel; overrides line prediction
• 14-25 cycle mispredict penalty; 8-wide processor; 256 in-flight instructions

Predictor Sizes
                   BIM    G0     G1     Meta
Prediction table   16K    64K    64K    64K
Hysteresis table   16K    32K    64K    32K
History length     4      13     21     15
• All tables are indexed using combinations of history and PC
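
As a quick consistency check, the table sizes can be summed and compared with the 352 Kb figure quoted for the Alpha 21464. The flattened slide text lists only three prediction-table sizes; the 64K value used for Meta below is inferred so that the total matches 352 Kb, and should be read as an assumption. The dictionary layout is likewise only for illustration.

```python
# Sizes in K entries; each entry is assumed to hold one bit.
prediction_kbits = {"BIM": 16, "G0": 64, "G1": 64, "Meta": 64}  # Meta inferred
hysteresis_kbits = {"BIM": 16, "G0": 32, "G1": 64, "Meta": 32}

total_kbits = sum(prediction_kbits.values()) + sum(hysteresis_kbits.values())
```

Splitting each 2-bit counter into a prediction bit and a hysteresis bit is what allows the hysteresis tables to be smaller than the prediction tables for G0 and Meta.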

2Bc-gskew
[Diagram labels: Address; Address+History; BIM; G0; G1; Meta; Vote; Pred]

Rules
• On a correct prediction
    - if all agree, no update
    - if they disagree, strengthen correct preds and chooser
• On a misprediction
    - update chooser and recompute the prediction
        - on a correct prediction, strengthen correct preds
        - on a misprediction, update all preds
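
The update rules above might be sketched as follows for one branch whose four counters have already been looked up. Several details are assumptions rather than statements from the slides: the final prediction is taken to be the majority vote of BIM, G0 and G1 when the Meta counter is 2 or 3 and the BIM prediction otherwise, the chooser is trained only when those two candidates differ, and every table is modelled as a plain 2-bit counter.

```python
def is_taken(counter):
    return counter >= 2


def toward(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


def predict(s):
    bim, g0, g1 = is_taken(s["bim"]), is_taken(s["g0"]), is_taken(s["g1"])
    vote = (bim + g0 + g1) >= 2            # majority of the three tables
    final = vote if s["meta"] >= 2 else bim
    return final, bim, vote


def strengthen_correct(s, taken):
    for name in ("bim", "g0", "g1"):
        if is_taken(s[name]) == taken:
            s[name] = toward(s[name], taken)


def update_meta(s, bim, vote, taken):
    if bim != vote:                        # chooser learns only on disagreement
        s["meta"] = toward(s["meta"], vote == taken)


def update(s, taken):
    final, bim, vote = predict(s)
    all_agree = is_taken(s["bim"]) == is_taken(s["g0"]) == is_taken(s["g1"])
    if final == taken:
        if not all_agree:                  # correct, but the tables disagree
            strengthen_correct(s, taken)
            update_meta(s, bim, vote, taken)
        return                             # correct and unanimous: no update
    update_meta(s, bim, vote, taken)       # mispredicted: fix the chooser first
    new_final, _, _ = predict(s)
    if new_final == taken:
        strengthen_correct(s, taken)       # the chooser change was enough
    else:
        for name in ("bim", "g0", "g1"):
            s[name] = toward(s[name], taken)
```

For instance, starting from `{"bim": 2, "g0": 1, "g1": 1, "meta": 1}` with a not-taken outcome, the chooser moves toward the vote, the recomputed prediction becomes correct, and only G0 and G1 are strengthened while BIM is left alone.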

Design Choices
• Local predictor was avoided because you need up to 16 predictions in a cycle and it is hard to maintain speculative local histories
    - You have no control over local histories, so you will need a 16-ported PHT
    - Since global history is common for all 16 predictions, you can control indexing into PHT
• They advocate the use of larger overriding predictors for future technologies
