CS 7810
• Paper critiques and class participation: 25%
• Final exam: 25%
• Project: SimpleScalar (?) modeling, simulation, and analysis: 50%
• Read and think about the papers before class!

Superscalar Pipeline
(Diagram labels: BPred, I-Cache, PC, BTB, Rename Table, IFQ, checkpoints, in2, in1, out, op, D-Cache, Regfile, LSQ, FU ×4, ROB, Issue queue)

Rename
     Before renaming        After renaming
A    lr1 ← lr2 + lr3        pr7  ← pr2 + pr3
B    lr2 ← lr4 + lr5        pr8  ← pr4 + pr5
C    lr6 ← lr1 + lr3        pr9  ← pr7 + pr3
D    lr6 ← lr1 + lr2        pr10 ← pr7 + pr8

Before renaming: RAR lr3, RAW lr1, WAR lr2, WAW lr6. Schedule: A ; BC ; D
After renaming:  RAR pr3, RAW pr7, WAR x, WAW x. Schedule: AB ; CD
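The A–D example above can be reproduced with a small rename-table sketch. This is an illustration only, not the hardware design from the slides: the function name `rename`, the identity starting map (lr1..lr6 → pr1..pr6), and the free list starting at pr7 are assumptions chosen so that the output matches the slide.

```python
def rename(instrs, num_logical=6, first_free=7):
    """Rename (dest, src1, src2) logical registers to physical registers.

    Assumption: logical register i initially maps to physical register i,
    and free physical registers are handed out in increasing order.
    """
    mapping = {i: i for i in range(1, num_logical + 1)}
    next_free = first_free
    renamed = []
    for dest, src1, src2 in instrs:
        # Sources read the current mapping, so RAW dependences are preserved.
        p1, p2 = mapping[src1], mapping[src2]
        # Every destination gets a fresh physical register, which removes
        # WAR and WAW dependences on the logical name.
        mapping[dest] = next_free
        renamed.append((next_free, p1, p2))
        next_free += 1
    return renamed


# A, B, C, D from the slide, written as (dest, src1, src2) logical registers.
program = [(1, 2, 3), (2, 4, 5), (6, 1, 3), (6, 1, 2)]
renamed = rename(program)
for name, (d, s1, s2) in zip("ABCD", renamed):
    print(f"{name}: pr{d} <- pr{s1} + pr{s2}")
```

After renaming, only the RAW dependence on pr7 (and on pr8 for D) remains, which is why the schedule shrinks from A ; BC ; D to AB ; CD.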

Resolving Branches
A: lr1 ← lr2 + lr3        A: pr6  ← pr2 + pr3
B: lr2 ← lr1 + lr4        B: pr7  ← pr6 + pr4
C: lr1 ← lr4 + lr5        C: pr8  ← pr4 + pr5
E: lr1 ← lr2 + lr3        E: pr10 ← pr7 + pr3
D: lr2 ← lr1 + lr5        D: pr9  ← pr8 + pr5

Resolving Exceptions
     Logical               Renamed               ROB (new preg, previous preg)
A    lr1 ← lr2 + lr3       pr6 ← pr2 + pr3       pr6  pr1
B    lr2 ← lr1 + lr4       pr7 ← pr6 + pr4       pr7  pr2
     br                    br                    br
C    lr1 ← lr2 + lr3       pr8 ← pr7 + pr3       pr8  pr6
D    lr2 ← lr1 + lr5       pr9 ← pr8 + pr5       pr9  pr7

LSQ
Ld/St    Address      Data    Completed
Store    Unknown      1000    --
Load     x40000000    --      --
Store    x50000000    --      --
Load     x30000000    --      --

LSQ
Ld/St    Address      Data    Completed
Store    Unknown      1000    --
Load     x40000000    --      --
Store    x50000000    100     Yes
Load     x50000000    --      --
Load     x30000000    --      --

LSQ
Ld/St    Address      Data    Completed
Store    Unknown      1000    --
Load     x40000000    --      --
Store    x50000000    100     Yes
Load     x30000000    --      --

LSQ
Ld/St    Address      Data    Completed
Store    x45000000    1000    Yes
Load     x40000000    --      --
Store    x50000000    100     Yes
Load     x30000000    --      --
(Annotation on the slide: can commit)

LSQ
Ld/St    Address      Data    Completed
Store    x45000000    1000    Yes
Load     x40000000    10      Yes
Store    x50000000    100     Yes
Load     x30000000    1       Yes

Instruction Wake-Up
p2
in1    in2    out    op
p1     p2     p6     add
p2     p6     p7     add
p1     p2     p8     sub
p7     p8     p9     mul
p1     p7     p10    add
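A minimal sketch of the wake-up step for the issue-queue entries above. It rests on two assumptions that the slide does not state: the lone p2 is the tag being broadcast by a completing instruction, and p1 is already available. The function name `wake_up` and the tuple layout (source, source, destination, op) are illustrative.

```python
def wake_up(queue, ready, broadcast_tag):
    """Mark broadcast_tag as ready and return entries with both sources ready.

    queue: list of (src1, src2, dest, op) tuples.
    ready: set of physical register tags whose values are available.
    """
    ready = ready | {broadcast_tag}
    woken = [entry for entry in queue if entry[0] in ready and entry[1] in ready]
    return ready, woken


queue = [
    ("p1", "p2", "p6", "add"),
    ("p2", "p6", "p7", "add"),
    ("p1", "p2", "p8", "sub"),
    ("p7", "p8", "p9", "mul"),
    ("p1", "p7", "p10", "add"),
]

# Assumption: p1 was produced earlier; p2 is broadcast now.
ready, woken = wake_up(queue, {"p1"}, "p2")
for src1, src2, dest, op in woken:
    print(f"ready to issue: {op} {dest} <- {src1}, {src2}")
```

Under these assumptions the entries producing p6 and p8 wake up; the others wait for later broadcasts of p6, p7, and p8.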

Paper I
Limits of Instruction-Level Parallelism
David W. Wall
WRL Research Report 93/6
Also appears in ASPLOS ’91

Goals of the Study
• Under optimistic assumptions, you can find a very high degree of parallelism (1000!)
  Ø What about parallelism under realistic assumptions?
  Ø What are the bottlenecks? What contributes to parallelism?

Dependencies
For registers and memory:
  True data dependency (RAW)
  Anti dependency (WAR)
  Output dependency (WAW)
Control dependency
Structural dependency

Perfect Scheduling
For a long loop:
  Read a[i] and b[i] from memory and store in registers
  Add the register values
  Store the result in memory c[i]
The whole program should finish in 3 cycles!!
Anti and output dependences: the assembly code keeps using lr1
Control dependences: decision-making after each iteration
Structural dependences: how many registers and cache ports do I have?
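The 3-cycle claim can be checked with a small as-soon-as-possible scheduler. This is a sketch under stated assumptions: unit latency, unlimited resources, no control dependences, and a distinct register name per iteration (that is, perfect renaming). The names `asap_cycles` and `loop_body` are illustrative.

```python
def asap_cycles(instrs):
    """Return the cycle count when each instruction issues as early as possible.

    instrs: list of (dest, sources) in program order. Only RAW dependences
    are honored; latency is one cycle and resources are unlimited.
    """
    ready_at = {}  # value name -> cycle in which it is produced
    last = 0
    for dest, sources in instrs:
        # Values never written in this trace (a[i], b[i]) are ready at cycle 0.
        cycle = 1 + max((ready_at.get(s, 0) for s in sources), default=0)
        ready_at[dest] = cycle
        last = max(last, cycle)
    return last


def loop_body(i):
    """One iteration of c[i] = a[i] + b[i] with per-iteration register names."""
    return [
        (f"ra{i}", [f"a[{i}]"]),           # load a[i]
        (f"rb{i}", [f"b[{i}]"]),           # load b[i]
        (f"rs{i}", [f"ra{i}", f"rb{i}"]),  # add
        (f"c[{i}]", [f"rs{i}"]),           # store c[i]
    ]


program = [ins for i in range(1000) for ins in loop_body(i)]
cycles = asap_cycles(program)
print(cycles)
```

All loads land in cycle 1, all adds in cycle 2, and all stores in cycle 3, regardless of the iteration count. Reusing lr1 across iterations would add the anti and output dependences listed above and break this schedule.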

Impediments to Perfect Scheduling
• Register renaming
• Alias analysis
• Branch prediction
• Branch fanout
• Indirect jump prediction
• Window size and cycle width
• Latency

Register Renaming
(Diagram: lr1 … … lr1 … renamed to pr22 … … pr22 pr24 …)
• If the compiler had infinite registers, you would not have WAR and WAW dependences
• The hardware can renumber every instruction and extract more parallelism
• Implemented models:
  Ø None
  Ø Finite registers
  Ø Perfect (infinite registers – only RAW)

Alias Analysis
• You have to respect RAW dependences for memory as well: a store of a value to addr. A followed by a load from addr. A
• The problem: you do not know the address at compile time or even during instruction dispatch

Alias Analysis
• Policies:
  Ø Perfect: You magically know all addresses and only delay loads that conflict with earlier stores
  Ø None: Until a store address is known, you stall every subsequent load
  Ø Analysis by compiler: (addr) does not conflict with (addr+4)
    – global and stack data are allocated by the compiler, hence conflicts can be detected
    – accesses to the heap can conflict with each other
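The difference between the Perfect and None policies can be sketched as a rule for the earliest cycle at which each load may issue. This is a simplified model, not Wall's simulator: a store stops blocking once its address is known, store data is ignored, and the cycle numbers in the example are hypothetical. The addresses are borrowed from the LSQ slides.

```python
def earliest_load_issue(ops, policy):
    """Return the earliest issue cycle of each load, in program order.

    ops: program-ordered list of (kind, addr, addr_known_cycle), where kind
    is "load" or "store". policy is "perfect" or "none".
    """
    result = []
    for idx, (kind, addr, known) in enumerate(ops):
        if kind != "load":
            continue
        earlier_stores = [op for op in ops[:idx] if op[0] == "store"]
        if policy == "none":
            # Stall until every earlier store address is known.
            blockers = earlier_stores
        elif policy == "perfect":
            # Only earlier stores to the same address delay the load.
            blockers = [s for s in earlier_stores if s[1] == addr]
        else:
            raise ValueError(f"unknown policy: {policy}")
        result.append(max([known] + [s[2] for s in blockers]))
    return result


# Hypothetical timing: the first store's address resolves late, at cycle 10.
ops = [
    ("store", 0x45000000, 10),
    ("load", 0x40000000, 1),
    ("store", 0x50000000, 2),
    ("load", 0x30000000, 1),
]
none_cycles = earliest_load_issue(ops, "none")
perfect_cycles = earliest_load_issue(ops, "perfect")
print(none_cycles, perfect_cycles)
```

With no alias analysis both loads wait for the slow store; with perfect analysis neither load conflicts with an earlier store, so both issue as soon as their own addresses are known.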

Global, Stack, and Heap
main()
  int a, b;                  global data
  call func();
func()
  int c, d;                  stack data
  int *e; int *f;            e, f are stack data
  e = (int *)malloc(8);      e, f point to heap data
  f = (int *)malloc(8);
  …
  *e = c;                    store c into addr stored in e
  d = *f;                    read value in addr stored in f
This is a conflict if you had previously done e=e+8

Branch Prediction
• If you go the wrong way, you are not extracting useful parallelism
• You can predict the branch direction statically or dynamically
• You can execute along both directions and throw away some of the work (need more resources)

Dynamic Branch Prediction
• Tables of 2-bit counters that get biased towards being taken or not-taken
• Can use history (for each branch or global)
• Can have multiple predictors and dynamically pick the more promising one
• Much more in a few weeks…
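A table of 2-bit saturating counters can be sketched as follows. The table size, the PC-based index function, and the weakly-not-taken initial state are assumptions for illustration; the slides do not specify them.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch PC."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        # Assumption: every counter starts at 1 (weakly not-taken).
        self.table = [1] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the low two PC bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly, updating after each branch."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)


# A loop branch that is taken 9 times and then falls through, repeated 10 times.
outcomes = ([True] * 9 + [False]) * 10
acc = accuracy(TwoBitPredictor(), 0x400100, outcomes)
print(f"accuracy: {acc:.2f}")
```

Because the counter saturates, the single not-taken outcome at each loop exit costs one misprediction and does not flip the prediction for the next trip through the loop.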

Static Branch Prediction
• Profile the application and provide hints to the hardware
• Dynamic predictors are much better

Branch Fanout
• Execute both directions of the branch – an exponential growth in resource requirements
• Hence, do this until you encounter four branches, after which you employ dynamic branch prediction
• Better still, execute both directions only if the prediction confidence is low
Not commonly used in today’s processors.

Indirect Jumps
• Indirect jumps do not encode the target in the instruction – the target has to be computed
• The address can be predicted by
  Ø using a table to store the last target
  Ø using a stack to keep track of subroutine calls and returns (the most common indirect jump)
• The combination achieves 95% prediction rates
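The two mechanisms above can be sketched together. This is a simplification: the class and method names are illustrative, and the unbounded dictionary and stack stand in for the finite hardware tables a real design would use.

```python
class IndirectPredictor:
    """Last-target table for indirect jumps plus a return-address stack."""

    def __init__(self):
        self.last_target = {}   # jump PC -> most recently observed target
        self.return_stack = []  # return addresses pushed by calls

    def on_call(self, return_addr):
        """Record where the matching return is expected to go."""
        self.return_stack.append(return_addr)

    def predict_return(self):
        """Predict a return target; None when the stack is empty."""
        return self.return_stack.pop() if self.return_stack else None

    def predict_jump(self, pc):
        """Predict that an indirect jump repeats its last target."""
        return self.last_target.get(pc)

    def update_jump(self, pc, target):
        self.last_target[pc] = target


pred = IndirectPredictor()

# Nested calls: returns are predicted in last-in, first-out order.
pred.on_call(0x1004)
pred.on_call(0x2008)
first_return = pred.predict_return()
second_return = pred.predict_return()

# An indirect jump (for example a switch table) seen once, then predicted.
cold_prediction = pred.predict_jump(0x3000)
pred.update_jump(0x3000, 0x3abc)
warm_prediction = pred.predict_jump(0x3000)
print(hex(first_return), hex(second_return), cold_prediction, hex(warm_prediction))
```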

Latency
• In their study, every instruction has unit latency – highly questionable assumption today!
• They also model other “realistic” latencies
• Parallelism is being defined as cycles for sequential exec / cycles for superscalar, not as instructions / cycles
• Hence, increasing instruction latency can increase parallelism – not true for IPC
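A small worked example of why the two metrics diverge. The model is an assumption made for illustration: n independent instructions, pipelined functional units, `width` issues per cycle, and a sequential machine that pays the full latency of every instruction.

```python
import math


def metrics(n_instrs, latency, width):
    """Return (parallelism, ipc) for n independent instructions.

    parallelism = sequential cycles / superscalar cycles (the study's definition)
    ipc         = instructions / superscalar cycles
    """
    sequential = n_instrs * latency
    # Issue takes ceil(n / width) cycles; the last group then needs
    # latency - 1 further cycles to complete.
    superscalar = math.ceil(n_instrs / width) + latency - 1
    return sequential / superscalar, n_instrs / superscalar


unit = metrics(8, latency=1, width=4)
slow = metrics(8, latency=3, width=4)
print("latency 1: parallelism %.1f, IPC %.1f" % unit)
print("latency 3: parallelism %.1f, IPC %.1f" % slow)
```

In this toy case, tripling the latency raises parallelism from 4 to 6 while IPC falls from 4 to 2, because the numerator of the parallelism ratio grows with latency and the numerator of IPC does not.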

Window Size & Cycle Width
(Diagram labels: 8 available slots in each cycle; window of 2048 instructions)

Window Size & Cycle Width
• Discrete windows: grab 2048 instructions, schedule them, retire all cycles, grab the next window
• Continuous windows: grab 2048 instructions, schedule them, retire the oldest cycle, grab a few more instructions
• Window size and register renaming are not related
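The difference between discrete and continuous windows can be sketched with a toy scheduler. This is a simplification rather than Wall's scheduling algorithm: instructions leave the window when they issue, latency is one cycle, and only RAW dependences (given as indices of earlier instructions) are modeled. The function name `schedule` and the example trace are illustrative.

```python
def schedule(deps, window, width, continuous):
    """Return the number of cycles needed to issue every instruction.

    deps[i] lists the indices of earlier instructions that i depends on.
    continuous=True refills the window every cycle; continuous=False
    refills only after the current window has drained completely.
    """
    n = len(deps)
    done_cycle = [None] * n  # cycle in which each instruction issued
    next_fetch = 0
    in_window = []
    cycle = 0
    while next_fetch < n or in_window:
        if continuous or not in_window:
            while len(in_window) < window and next_fetch < n:
                in_window.append(next_fetch)
                next_fetch += 1
        cycle += 1
        issued = []
        for i in in_window:
            if len(issued) == width:
                break
            # Ready only if every producer issued in an earlier cycle.
            if all(done_cycle[d] is not None and done_cycle[d] < cycle
                   for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
            in_window.remove(i)
    return cycle


# A 4-instruction dependence chain followed by 4 independent instructions.
deps = [[], [0], [1], [2], [], [], [], []]
discrete_cycles = schedule(deps, window=4, width=4, continuous=False)
continuous_cycles = schedule(deps, window=4, width=4, continuous=True)
print(discrete_cycles, continuous_cycles)
```

In this trace the discrete window spends four cycles draining the chain before it can fetch the independent work, while the continuous window overlaps the two.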

Simulated Models
• Seven models: control, register, and memory dependences
• Today’s processors: ?
• However, note optimistic scheduling, 2048-instruction window, cycle width of 64, and 1-cycle latencies
• SPEC ’92 benchmarks, utility programs (grep, sed, yacc), CAD tools

Aggressive Models
• Parallelism steadily increases as we move to aggressive models (Fig. 12, Pg. 16)
• Branch fanout does not buy much
• IPC of Great model: 10; reality: 1.5
• Numeric programs can do much better

Cycle Width and Window Size
• Unlimited cycle width buys very little (much less than 10%) (Figure 15)
• Decreasing the window size seems to have little effect as well (you need only 256?! – are registers the bottleneck?) (Figure 16)
• Unlimited window size and cycle widths don’t help (Figure 18)
Would these results hold true today?

Memory Latencies
• The ability to prefetch has a huge impact on IPC – to hide a 300-cycle latency, you have to spot the instruction very early
• Hence, registers and window size are extremely important today!

Branch Prediction
• Obviously, better prediction helps (Fig. 22)
• Fanout does not help much (Fig. 24-b) – not selecting the right branches?
• Luckily, small tables are good enough for good indirect jump prediction
• Mispredict penalty has a major impact on ILP (Fig. 30)

Alias Analysis
• Has a big impact on performance – compiler analysis results in a two-fold speed-up
• Later, we’ll read a paper that attempts this in hardware (Chrysos ’98)

Instruction Latency
• Parallelism almost unaffected by increased latency (increases marginally in some cases!)
• Note: “unconventional” definition of parallelism
• Today, latency strongly influences IPC

Conclusions of Wall’s Study
• Branch prediction, alias analysis, mispredict penalty are huge bottlenecks
• Instr latency, registers, window size, cycle width are not huge bottlenecks
• Today, they are all huge bottlenecks because they all influence effective memory latency… which is the biggest bottleneck

Questions
• Weaknesses: caches, register model, value pred
• Will most of the available IPC (IPC = 10 for superb model) go away with realistic latencies?
• What stops us from building the following: 400 Kb bpred, cache hierarchies, 512-entry window/regfile, 16 ALUs, memory dependence predictor?
• The following may need a re-evaluation: effect of window size, branch fan-out

Next Week’s Paper
• “Complexity-Effective Superscalar Processors”, Palacharla, Jouppi, Smith, ISCA ’97
• The impact of increased issue width and window size on clock speed
