How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack
Brian Fields, Rastislav Bodík, and Mark D. Hill, University of Wisconsin-Madison
The Problem: Managing constraints
Technological constraints dominate memory design.
Constraint: memory latency | Design: cache hierarchy | Non-uniformity: load latencies | Policy: what to replace?
The Problem: Managing constraints
In the future, technological constraints will also dominate microprocessor design.
Constraint  | Design         | Non-uniformity | Policy
Wires       | Clusters       | Bypasses       | ?
Power       | Fast/slow ALUs | Exe. latencies | ?
Complexity  | Grid, ILDP     | L1 latencies   | ?
Policy goal: minimize the effect of lower-quality resources.
Key Insight: Control policy crucial
With non-uniform machines, the technological constraint problem becomes a control policy problem.
Key Insight: Control policy crucial
The best possible policy: delays are imposed only on instructions that can absorb them, so that execution time is not increased.
Achieved through slack: the number of cycles an instruction can be delayed without increasing execution time.
Contributions/Outline
Understanding (can we measure slack in a simulator?)
• Determining slack: resource constraints important
• Reporting slack: apportion to individual instructions
• Analysis: suggest non-uniform machines to build
Predicting (how to predict slack in hardware?)
• Simple delay-and-observe approach works well
Case study (how to design a control policy?)
• On a power-efficient machine, up to 20% speedup
Determining slack: Why hard?
Microprocessors are complex: sometimes slack is determined by resources (e.g., the ROB).
"Probe the processor" approach: delay and observe
1. Delay a dynamic instruction by n cycles.
2. See if execution time increased.
   a) No: increase n, restart, and go to step 1.
Srinivasan and Lebeck's approximation, for loads (MICRO '98): heuristics to predict execution time increase.
A sketch of this probing loop appears below.
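To make the probe concrete, here is a minimal sketch in Python. It assumes a hypothetical simulator hook run_simulation(delays) that returns total execution time given a map of per-dynamic-instruction extra delays; the hook and the max_slack bound are illustrative, not part of the original work.

```python
# A minimal sketch of the "delay and observe" probe, assuming a
# hypothetical simulator hook run_simulation(delays) that returns total
# execution time given a map {dynamic_inst: extra_cycles}.

def probe_slack(inst, run_simulation, max_slack=64):
    """Largest n such that delaying `inst` by n cycles does not
    increase total execution time."""
    baseline = run_simulation({})            # unperturbed run
    slack = 0
    for n in range(1, max_slack + 1):
        if run_simulation({inst: n}) > baseline:
            break                            # delay hurt: no n cycles of slack
        slack = n                            # delay tolerated: keep probing
    return slack
```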
Determining slack
Alternative approach: dependence-graph analysis
1. Build a resource-sensitive dependence graph.
2. Analyze it to find slack.
But how to build a resource-sensitive graph?
Casmira and Grunwald's solution (Kool Chips Workshop '00): graphs with only the instructions in the issue window.
Data-Dependence Graph
[figure: data-dependence graph with edge latencies; Slack = 0 cycles]
Our Dependence Graph Model (ISCA '01)
[figure: graph with fetch (F), execute (E), and commit (C) nodes per instruction; Slack = 0 cycles]
Our Dependence Graph Model (ISCA '01)
[figure: same F/E/C graph with resource edges and latencies; Slack = 6 cycles]
Modeling resources increases observable slack.
Reporting slack
Global slack: # cycles a dynamic operation can be delayed without increasing execution time.
Apportioned slack: distribute global slack among operations using an apportioning strategy.
[figure: example graph; one operation has GS = 15 and AS = 10, another has GS = 15 and AS = 5]
Slack measurements (Perl)
6-wide out-of-order superscalar, 128-entry issue window, 12-stage pipeline.
Slack measurements (Perl)
[plot: cumulative distributions of global and apportioned slack]
Analysis via apportioning strategy
What non-uniform designs can slack tolerate?
Design: fast/slow ALU | Non-uniformity: exe. latency | Apportioning strategy: double latency
Good news: 80% of dynamic instructions can have their latency doubled.
Contributions/Outline
Understanding (can we measure slack in a simulator?)
• Determining slack: resource constraints important
• Reporting slack: apportion to individual instructions
• Analysis: suggest non-uniform machines to build
Predicting (how to predict slack in hardware?)
• Simple delay-and-observe approach works well
Case study (how to design a control policy?)
• On a power-efficient machine, up to 20% speedup
Measuring slack in hardware: delay and observe
Goal: determine whether a static instruction has n cycles of slack.
1. Delay a dynamic instance by n cycles.
2. Check if it was critical (via the critical-path analyzer, ISCA '01):
   a) No: the instruction has n cycles of slack.
   b) Yes: the instruction does not have n cycles of slack.
Two predictor designs
1. Explicit slack predictor
• Retry delay-and-observe with different values of slack.
• Problem: obtaining unperturbed measurements.
2. Implicit slack predictor
• Delay and observe with the machine's natural non-uniform delays.
• "Bin" instructions to match the non-uniform hardware.
Contributions/Outline
Understanding (can we measure slack in a simulator?)
• Determining slack: resource constraints important
• Reporting slack: apportion to individual instructions
• Analysis: suggest non-uniform machines to build
Predicting (how to predict slack in hardware?)
• Simple delay-and-observe approach works well
Case study (how to design a control policy?)
• On a power-efficient machine, up to 20% speedup
Fast/slow pipeline microarchitecture
[diagram: Fetch + Rename feeds a steering stage; fast and slow 3-wide pipelines (each with window, registers, ALUs) share a data cache via a bypass bus; saves ~37% core power]
The design has three non-uniformities:
• Higher execution latencies
• Increased (cross-domain) bypass latency
• Decreased effective issue bandwidth
Selecting bins for implicit slack predictor
Two decisions:
1. Steer to the fast or slow pipeline, then
2. Schedule with high or low priority within a pipeline.
Use the implicit slack predictor with four (2^2) bins:
           Fast  Slow
  High      1     3
  Low       2     4
(rows: schedule priority; columns: steering target)
A sketch of this mapping follows.
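As an illustration, the 2x2 binning can be expressed as a lookup from predicted bin to (steer, schedule) decisions. The bin numbering follows the table above; whether bin 1 holds the least-slack instructions is an assumption, and all names here are hypothetical.

```python
# Illustrative mapping from the four slack bins to steering and
# scheduling decisions; bin numbering follows the 2x2 table above.

FAST, SLOW = "fast", "slow"
HIGH, LOW = "high", "low"

BIN_DECISIONS = {
    1: (FAST, HIGH),   # assumed least slack: fast pipeline, high priority
    2: (FAST, LOW),
    3: (SLOW, HIGH),
    4: (SLOW, LOW),    # assumed most slack: slow pipeline, low priority
}

def decisions_for_bin(bin_number):
    """Return the (steer, schedule) pair for a predicted slack bin."""
    return BIN_DECISIONS[bin_number]
```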
Putting it all together
Prediction path: the PC indexes a 4 KB slack prediction table, yielding a slack bin # that drives the fast/slow pipeline core.
Training path: a criticality analyzer (~1 KB) feeds a 4-bin slack state machine that updates the table.
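A software model of these two paths might look like the sketch below. The PC-indexed table matches the slide; the table size, direct-mapped indexing, and the exact state-machine update rule are assumptions made for illustration.

```python
# Illustrative model of the prediction and training paths.

class SlackPredictor:
    def __init__(self, entries=2048):
        self.table = [1] * entries              # predicted slack bin per entry

    def _index(self, pc):
        return pc % len(self.table)             # simple direct-mapped index

    def predict(self, pc):
        """Prediction path: PC -> slack bin # -> fast/slow pipeline core."""
        return self.table[self._index(pc)]

    def train(self, pc, was_critical):
        """Training path: criticality-analyzer feedback drives a 4-bin
        state machine (one plausible update rule, not the paper's)."""
        i = self._index(pc)
        if was_critical:                        # no slack observed
            self.table[i] = max(1, self.table[i] - 1)
        else:                                   # delay was tolerated
            self.table[i] = min(4, self.table[i] + 1)
```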
Fast/slow pipeline performance
[plot: speedup of the slack-based policy vs. reg-dep steering, with 2 fast high-power pipelines as an upper bound]
Slack used up
[plot: average global slack per dynamic instruction for 2 fast high-power pipelines, the slack-based policy, and reg-dep steering]
Conclusion: Future processor design flow
Future processors will be non-uniform. A slack-based policy can control them.
1. Measure slack in a simulator
• Decide early on what designs to build.
2. Predict slack in hardware
• Simple implementation.
3. Design a control policy
• Map policy decisions to slack bins.
Backup slides
Define local slack
Local slack: # cycles an edge's latency can be increased without delaying subsequent instructions.
[figure: dependence graph with edge latencies, annotating edges with 1- and 2-cycle local slack]
Compute local slack
Local slack is computed from node arrival times in the dependence graph.
[figure: same graph annotated with arrival times]
In real programs, ~20% of instructions have local slack of at least 5 cycles.
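The computation can be sketched on a topologically ordered dependence graph: first derive arrival times, then per-edge local slack. The edge-triple representation is an illustrative choice, not the paper's.

```python
# Sketch of local-slack computation. Edges are (src, dst, latency)
# triples listed in topological order of src.

def arrival_times(nodes, edges):
    """Arrival time of a node = max over incoming edges of
    (source arrival + edge latency)."""
    arrival = {n: 0 for n in nodes}
    for src, dst, latency in edges:             # topological edge order
        arrival[dst] = max(arrival[dst], arrival[src] + latency)
    return arrival

def local_slack(edges, arrival):
    """Local slack of an edge: cycles its latency can grow without
    delaying the arrival time of its destination node."""
    return {(src, dst): arrival[dst] - (arrival[src] + latency)
            for src, dst, latency in edges}
```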
Define global slack
Global slack: # cycles an edge's latency can be increased without delaying the last instruction in the program.
[figure: dependence graph with 1- and 2-cycle edge annotations]
In real programs, >90% of instructions have global slack of at least 5 cycles.
Compute global slack
Calculate global slack by backward propagation, accumulating local slacks:
GS(e) = min over successor edges e' of GS(e'), plus LS(e).
Example from the figure: GS1 = MIN(GS3, GS5) + LS1 = 2, with LS1 = 1, LS2 = 0, LS3 = 1, GS3 = GS6 + LS3 = 1, GS5 = LS5 = 2, GS6 = LS6 = 0.
In real programs, >90% of instructions have global slack of at least 5 cycles.
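The backward pass follows directly from the recurrence above; the graph bookkeeping (successor lists, edge ordering) is illustrative.

```python
# Backward propagation of global slack:
# GS(e) = min over successor edges e' of GS(e'), plus LS(e).

def global_slack(edges_reverse_topo, successors, ls):
    """edges_reverse_topo: edges ordered so successors come first;
    successors[e]: edges leaving e's destination node;
    ls[e]: local slack of edge e."""
    gs = {}
    for e in edges_reverse_topo:
        if not successors[e]:                   # edge into the last instruction
            gs[e] = ls[e]
        else:
            gs[e] = min(gs[s] for s in successors[e]) + ls[e]
    return gs
```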
Apportioned slack
Goal: distribute slack to the instructions that need it.
Thus, the apportioning strategy depends on the nature of the machine's non-uniformities.
E.g.: non-uniformity: two-speed bypass buses (1 cycle, 2 cycles); strategy: give 1 cycle of slack to as many edges as possible.
Define apportioned slack
Apportioned slack: distribute global slack among edges.
For example: GS1 = 2, AS1 = 1; GS2 = 1, AS2 = 1; GS3 = 1, AS3 = 0; GS5 = 2, AS5 = 1.
In real programs, >75% of instructions can be apportioned slack of at least 5 cycles.
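One conservative greedy pass for the "1 cycle to as many edges as possible" strategy is sketched below. This is my own plausible reading of the constraint (apportioned slack along any path must never exceed global slack), not the paper's exact algorithm.

```python
# Greedy 1-cycle apportioning sketch: walk edges in topological order,
# giving an edge 1 cycle only if its global slack covers all slack
# already handed out on the worst path leading into it.

def apportion_one_cycle(edges_topo, predecessors, gs):
    """edges_topo: edges in topological order; predecessors[e]: edges
    entering e's source node; gs[e]: global slack of edge e."""
    apportioned, consumed = {}, {}
    for e in edges_topo:
        upstream = max((consumed[p] for p in predecessors[e]), default=0)
        give = 1 if gs[e] - upstream >= 1 else 0
        apportioned[e] = give
        consumed[e] = upstream + give           # slack used up through e
    return apportioned
```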
Slack measurements
[plot: cumulative distributions of global, apportioned, and local slack]
Multi-speed ALUs
Can we tolerate ALUs running at half frequency? Yes, but:
1. For all types of operations? (needed for multi-speed clusters)
2. Can we make all integer ops double latency?
Load slack
Can we tolerate a long-latency L1 hit?
Design: wire-constrained machine, e.g., Grid
Non-uniformity: multi-latency L1
Apportioning strategy: apportion ALL slack to load instructions
Apportion all slack to loads
Most loads can tolerate an L2 cache hit.
Multi-speed ALUs
Can we tolerate ALUs running at half frequency?
Design: fast/slow ALUs
Non-uniformity: multi-latency execution, bypass
Apportioning strategy: give slack equal to original latency + 1
Latency+1 apportioning
Most instructions can tolerate doubling their latency.
Breakdown by operation (Latency+1 apportioning)
Validation
Two steps:
1. Increase latencies of instructions by their apportioned slack, for three apportioning strategies: 1) latency+1; 2) 5 cycles to as many instructions as possible; 3) 12 cycles to as many loads as possible.
2. Compare to baseline (no delays inserted).
A sketch of this check appears below.
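Reusing the hypothetical run_simulation hook from the earlier probing sketch, the validation check reduces to one perturbed run per strategy.

```python
# Validation sketch: delay every dynamic instruction by its apportioned
# slack and compare total execution time to the unperturbed baseline.

def validation_error(run_simulation, apportioned):
    """apportioned: map {dynamic_inst: apportioned slack in cycles}.
    Returns the relative slowdown (ideally ~0 if apportioning is safe)."""
    baseline = run_simulation({})
    perturbed = run_simulation(apportioned)
    return (perturbed - baseline) / baseline
```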
Validation
Worst case: inaccuracy of 0.6%.
Predicting slack
Two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction. Need: ability to measure slack of a dynamic instruction.
2. Store it in an array indexed by the PC of the static instruction. Need: locality of slack.
• Can capture 80% of potential exploitable slack.
Locality of slack experiment
For each static instruction:
1. Measure % of slackful dynamic instances.
2. Multiply by # of dynamic instances.
3. Sum across all static instructions.
4. Compare to total slackful dynamic instructions (ideal case).
(slackful = has enough apportioned slack to double latency)
Locality of slack
[plot: slack captured per static instruction vs. the per-dynamic-instance ideal]
A PC-indexed, history-based predictor can capture most of the available slack.
Predicting slack
Two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction. Need: ability to measure slack of a dynamic instruction.
2. Store it in an array indexed by the PC of the static instruction. Need: locality of slack.
• Can capture 80% of potential exploitable slack.
Measuring slack in hardware: delay and observe
Goal: determine whether a static instruction has n cycles of slack.
1. Delay a dynamic instance by n cycles.
2. Check if it was critical (via the critical-path analyzer):
   a) No: the instruction has n cycles of slack.
   b) Yes: the instruction does not have n cycles of slack.
Review: Critical-path analyzer (ISCA '01)
[figure: dependence graph with edge latencies]
Don't need to measure latencies; just observe last-arriving edges.
• Plant a token and propagate it forward.
• If the token survives, the node is critical.
• If the token dies, the node is noncritical.
A sketch of this token-passing check follows.
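The token-passing idea can be modeled in a few lines: propagate the token only along last-arriving edges of the (acyclic) dependence graph. The graph accessors below are illustrative stand-ins for the hardware mechanism.

```python
# Token-passing criticality sketch: plant a token at a node and declare
# the node critical iff the token survives to the end of the graph.

def is_critical(node, successors, last_arriving):
    """successors[n]: nodes with an edge from n; last_arriving(s, d):
    True if edge (s, d) determined d's arrival time."""
    frontier = {node}                           # plant the token
    while True:
        frontier = {d for s in frontier for d in successors[s]
                    if last_arriving(s, d)}     # propagate forward
        if not frontier:
            return False                        # token died: noncritical
        if any(not successors[d] for d in frontier):
            return True                         # token reached graph end: critical
```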
Baseline policies (existing, not based on slack)
1. Simple register-dependence steering (reg-dep)
2. Send to the fast cluster until:
   a) the window is half full (fast-first win)
   b) too many ready instructions (fast-first rdy)
Baseline policies (existing, not based on slack)
[plot: performance of register dependence, fast-first window, and fast-first ready steering, with 2 fast clusters as an upper bound]
Slack-based policies
[plot: token-passing slack and ALOLD slack vs. reg-dep steering, with 2 fast clusters as an upper bound]
10% better performance from hiding non-uniformities.
Extra slow cluster (still saves ~25% core power)
[plot: token-passing slack and ALOLD slack vs. the best existing policy, with 2 fast clusters as an upper bound]