How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack
Brian Fields, Rastislav Bodík, and Mark D. Hill, University of Wisconsin-Madison
The Problem: Managing constraints
Technological constraints dominate memory design.
Constraint: memory latency | Design: cache hierarchy | Non-uniformity: load latencies | Policy: what to replace?
The Problem: Managing constraints
In the future, technological constraints will also dominate microprocessor design.
Constraint  | Design         | Non-uniformity | Policy
Wires       | Clusters       | Bypasses       | ?
Power       | Fast/slow ALUs | Exe. latencies | ?
Complexity  | Grid, ILDP     | L1 latencies   | ?
Policy goal: minimize the effect of lower-quality resources.
Key Insight: Control policy crucial
With non-uniform machines, the technological constraint problem becomes a control policy problem.
Key Insight: Control policy crucial
The best possible policy: delays are imposed only on instructions that can absorb them, so that execution time is not increased.
Achieved through slack: the number of cycles an instruction can be delayed without increasing execution time.
Contributions/Outline
Understanding (can we measure slack in a simulator?)
• Determining slack: resource constraints important
• Reporting slack: apportion to individual instructions
• Analysis: suggest non-uniform machines to build
Predicting (how to predict slack in hardware?)
• Simple delay-and-observe approach works well
Case study (how to design a control policy?)
• On a power-efficient machine, up to 20% speedup
Determining slack: Why hard?
Microprocessors are complex: sometimes slack is determined by resources (e.g., the ROB).
"Probe the processor" approach: delay and observe
1. Delay a dynamic instruction by n cycles.
2. See if execution time increased.
   a) No: increase n, restart, and go to step 1.
Srinivasan and Lebeck's approximation, for loads (MICRO '98): heuristics to predict execution time increase.
A sketch of this probing loop appears below.
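To make the probe concrete, here is a minimal sketch in Python. It assumes a hypothetical simulator hook run_simulation(delays) that returns total execution time given a map of per-dynamic-instruction extra delays; the hook and the max_slack bound are illustrative, not part of the original work.

```python
# A minimal sketch of the "delay and observe" probe, assuming a
# hypothetical simulator hook run_simulation(delays) that returns total
# execution time given a map {dynamic_inst: extra_cycles}.

def probe_slack(inst, run_simulation, max_slack=64):
    """Largest n such that delaying `inst` by n cycles does not
    increase total execution time."""
    baseline = run_simulation({})            # unperturbed run
    slack = 0
    for n in range(1, max_slack + 1):
        if run_simulation({inst: n}) > baseline:
            break                            # delay hurt: no n cycles of slack
        slack = n                            # delay tolerated: keep probing
    return slack
```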
Determining slack
Alternative approach: dependence-graph analysis
1. Build a resource-sensitive dependence graph.
2. Analyze it to find slack.
But how to build a resource-sensitive graph?
Casmira and Grunwald's solution (Kool Chips Workshop '00): graphs with only the instructions in the issue window.
Data-Dependence Graph
[figure: data-dependence graph with edge latencies; Slack = 0 cycles]
Our Dependence Graph Model (ISCA '01)
[figure: graph with fetch (F), execute (E), and commit (C) nodes per instruction; Slack = 0 cycles]
Our Dependence Graph Model (ISCA '01)
[figure: same F/E/C graph with resource edges and latencies; Slack = 6 cycles]
Modeling resources increases observable slack.
Reporting slack
Global slack: # cycles a dynamic operation can be delayed without increasing execution time.
Apportioned slack: distribute global slack among operations using an apportioning strategy.
[figure: example graph; one operation has GS = 15 and AS = 10, another has GS = 15 and AS = 5]
Slack measurements (Perl)
6-wide out-of-order superscalar, 128-entry issue window, 12-stage pipeline.
Slack measurements (Perl)
[plot: cumulative distributions of global and apportioned slack]
Analysis via apportioning strategy
What non-uniform designs can slack tolerate?
Design: fast/slow ALU | Non-uniformity: exe. latency | Apportioning strategy: double latency
Good news: 80% of dynamic instructions can have their latency doubled.
Contributions/Outline
Understanding (can we measure slack in a simulator?)
• Determining slack: resource constraints important
• Reporting slack: apportion to individual instructions
• Analysis: suggest non-uniform machines to build
Predicting (how to predict slack in hardware?)
• Simple delay-and-observe approach works well
Case study (how to design a control policy?)
• On a power-efficient machine, up to 20% speedup
Measuring slack in hardware: delay and observe
Goal: determine whether a static instruction has n cycles of slack.
1. Delay a dynamic instance by n cycles.
2. Check if it was critical (via the critical-path analyzer, ISCA '01):
   a) No: the instruction has n cycles of slack.
   b) Yes: the instruction does not have n cycles of slack.
Two predictor designs
1. Explicit slack predictor
• Retry delay-and-observe with different values of slack.
• Problem: obtaining unperturbed measurements.
2. Implicit slack predictor
• Delay and observe with the machine's natural non-uniform delays.
• "Bin" instructions to match the non-uniform hardware.
Contributions/Outline
Understanding (can we measure slack in a simulator?)
• Determining slack: resource constraints important
• Reporting slack: apportion to individual instructions
• Analysis: suggest non-uniform machines to build
Predicting (how to predict slack in hardware?)
• Simple delay-and-observe approach works well
Case study (how to design a control policy?)
• On a power-efficient machine, up to 20% speedup
Fast/slow pipeline microarchitecture
[diagram: Fetch + Rename feeds a steering stage; fast and slow 3-wide pipelines (each with window, registers, ALUs) share a data cache via a bypass bus; saves ~37% core power]
The design has three non-uniformities:
• Higher execution latencies
• Increased (cross-domain) bypass latency
• Decreased effective issue bandwidth
Selecting bins for implicit slack predictor
Two decisions:
1. Steer to the fast or slow pipeline, then
2. Schedule with high or low priority within a pipeline.
Use the implicit slack predictor with four (2^2) bins:
           Fast  Slow
  High      1     3
  Low       2     4
(rows: schedule priority; columns: steering target)
A sketch of this mapping follows.
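As an illustration, the 2x2 binning can be expressed as a lookup from predicted bin to (steer, schedule) decisions. The bin numbering follows the table above; whether bin 1 holds the least-slack instructions is an assumption, and all names here are hypothetical.

```python
# Illustrative mapping from the four slack bins to steering and
# scheduling decisions; bin numbering follows the 2x2 table above.

FAST, SLOW = "fast", "slow"
HIGH, LOW = "high", "low"

BIN_DECISIONS = {
    1: (FAST, HIGH),   # assumed least slack: fast pipeline, high priority
    2: (FAST, LOW),
    3: (SLOW, HIGH),
    4: (SLOW, LOW),    # assumed most slack: slow pipeline, low priority
}

def decisions_for_bin(bin_number):
    """Return the (steer, schedule) pair for a predicted slack bin."""
    return BIN_DECISIONS[bin_number]
```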
Putting it all together
Prediction path: the PC indexes a 4 KB slack prediction table, yielding a slack bin # that drives the fast/slow pipeline core.
Training path: a criticality analyzer (~1 KB) feeds a 4-bin slack state machine that updates the table.
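A software model of these two paths might look like the sketch below. The PC-indexed table matches the slide; the table size, direct-mapped indexing, and the exact state-machine update rule are assumptions made for illustration.

```python
# Illustrative model of the prediction and training paths.

class SlackPredictor:
    def __init__(self, entries=2048):
        self.table = [1] * entries              # predicted slack bin per entry

    def _index(self, pc):
        return pc % len(self.table)             # simple direct-mapped index

    def predict(self, pc):
        """Prediction path: PC -> slack bin # -> fast/slow pipeline core."""
        return self.table[self._index(pc)]

    def train(self, pc, was_critical):
        """Training path: criticality-analyzer feedback drives a 4-bin
        state machine (one plausible update rule, not the paper's)."""
        i = self._index(pc)
        if was_critical:                        # no slack observed
            self.table[i] = max(1, self.table[i] - 1)
        else:                                   # delay was tolerated
            self.table[i] = min(4, self.table[i] + 1)
```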
Fast/slow pipeline performance
[plot: speedup of the slack-based policy vs. reg-dep steering, with 2 fast high-power pipelines as an upper bound]
Slack used up
[plot: average global slack per dynamic instruction for 2 fast high-power pipelines, the slack-based policy, and reg-dep steering]
Conclusion: Future processor design flow
Future processors will be non-uniform. A slack-based policy can control them.
1. Measure slack in a simulator
• Decide early on what designs to build.
2. Predict slack in hardware
• Simple implementation.
3. Design a control policy
• Map policy decisions to slack bins.
Backup slides
Define local slack
Local slack: # cycles an edge's latency can be increased without delaying subsequent instructions.
[figure: dependence graph with edge latencies, annotating edges with 1- and 2-cycle local slack]
Compute local slack
Local slack is computed from node arrival times in the dependence graph.
[figure: same graph annotated with arrival times]
In real programs, ~20% of instructions have local slack of at least 5 cycles.
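The computation can be sketched on a topologically ordered dependence graph: first derive arrival times, then per-edge local slack. The edge-triple representation is an illustrative choice, not the paper's.

```python
# Sketch of local-slack computation. Edges are (src, dst, latency)
# triples listed in topological order of src.

def arrival_times(nodes, edges):
    """Arrival time of a node = max over incoming edges of
    (source arrival + edge latency)."""
    arrival = {n: 0 for n in nodes}
    for src, dst, latency in edges:             # topological edge order
        arrival[dst] = max(arrival[dst], arrival[src] + latency)
    return arrival

def local_slack(edges, arrival):
    """Local slack of an edge: cycles its latency can grow without
    delaying the arrival time of its destination node."""
    return {(src, dst): arrival[dst] - (arrival[src] + latency)
            for src, dst, latency in edges}
```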
Define global slack
Global slack: # cycles an edge's latency can be increased without delaying the last instruction in the program.
[figure: dependence graph with 1- and 2-cycle edge annotations]
In real programs, >90% of instructions have global slack of at least 5 cycles.
Compute global slack
Calculate global slack by backward propagation, accumulating local slacks:
GS(e) = min over successor edges e' of GS(e'), plus LS(e).
Example from the figure: GS1 = MIN(GS3, GS5) + LS1 = 2, with LS1 = 1, LS2 = 0, LS3 = 1, GS3 = GS6 + LS3 = 1, GS5 = LS5 = 2, GS6 = LS6 = 0.
In real programs, >90% of instructions have global slack of at least 5 cycles.
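The backward pass follows directly from the recurrence above; the graph bookkeeping (successor lists, edge ordering) is illustrative.

```python
# Backward propagation of global slack:
# GS(e) = min over successor edges e' of GS(e'), plus LS(e).

def global_slack(edges_reverse_topo, successors, ls):
    """edges_reverse_topo: edges ordered so successors come first;
    successors[e]: edges leaving e's destination node;
    ls[e]: local slack of edge e."""
    gs = {}
    for e in edges_reverse_topo:
        if not successors[e]:                   # edge into the last instruction
            gs[e] = ls[e]
        else:
            gs[e] = min(gs[s] for s in successors[e]) + ls[e]
    return gs
```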
Apportioned slack
Goal: distribute slack to the instructions that need it.
Thus, the apportioning strategy depends on the nature of the machine's non-uniformities.
E.g.: non-uniformity: two-speed bypass buses (1 cycle, 2 cycles); strategy: give 1 cycle of slack to as many edges as possible.
Define apportioned slack
Apportioned slack: distribute global slack among edges.
For example: GS1 = 2, AS1 = 1; GS2 = 1, AS2 = 1; GS3 = 1, AS3 = 0; GS5 = 2, AS5 = 1.
In real programs, >75% of instructions can be apportioned slack of at least 5 cycles.
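One conservative greedy pass for the "1 cycle to as many edges as possible" strategy is sketched below. This is my own plausible reading of the constraint (apportioned slack along any path must never exceed global slack), not the paper's exact algorithm.

```python
# Greedy 1-cycle apportioning sketch: walk edges in topological order,
# giving an edge 1 cycle only if its global slack covers all slack
# already handed out on the worst path leading into it.

def apportion_one_cycle(edges_topo, predecessors, gs):
    """edges_topo: edges in topological order; predecessors[e]: edges
    entering e's source node; gs[e]: global slack of edge e."""
    apportioned, consumed = {}, {}
    for e in edges_topo:
        upstream = max((consumed[p] for p in predecessors[e]), default=0)
        give = 1 if gs[e] - upstream >= 1 else 0
        apportioned[e] = give
        consumed[e] = upstream + give           # slack used up through e
    return apportioned
```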
Slack measurements
[plot: cumulative distributions of global, apportioned, and local slack]
Multi-speed ALUs
Can we tolerate ALUs running at half frequency? Yes, but:
1. For all types of operations? (needed for multi-speed clusters)
2. Can we make all integer ops double latency?
Load slack
Can we tolerate a long-latency L1 hit?
Design: wire-constrained machine, e.g., Grid
Non-uniformity: multi-latency L1
Apportioning strategy: apportion ALL slack to load instructions
Apportion all slack to loads
Most loads can tolerate an L2 cache hit.
Multi-speed ALUs
Can we tolerate ALUs running at half frequency?
Design: fast/slow ALUs
Non-uniformity: multi-latency execution, bypass
Apportioning strategy: give slack equal to original latency + 1
Latency+1 apportioning
Most instructions can tolerate doubling their latency.
Breakdown by operation (Latency+1 apportioning)
Validation
Two steps:
1. Increase latencies of instructions by their apportioned slack, for three apportioning strategies: 1) latency+1; 2) 5 cycles to as many instructions as possible; 3) 12 cycles to as many loads as possible.
2. Compare to baseline (no delays inserted).
A sketch of this check appears below.
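Reusing the hypothetical run_simulation hook from the earlier probing sketch, the validation check reduces to one perturbed run per strategy.

```python
# Validation sketch: delay every dynamic instruction by its apportioned
# slack and compare total execution time to the unperturbed baseline.

def validation_error(run_simulation, apportioned):
    """apportioned: map {dynamic_inst: apportioned slack in cycles}.
    Returns the relative slowdown (ideally ~0 if apportioning is safe)."""
    baseline = run_simulation({})
    perturbed = run_simulation(apportioned)
    return (perturbed - baseline) / baseline
```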
Validation
Worst case: inaccuracy of 0.6%.
Predicting slack
Two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction. Need: ability to measure slack of a dynamic instruction.
2. Store it in an array indexed by the PC of the static instruction. Need: locality of slack.
• Can capture 80% of potential exploitable slack.
Locality of slack experiment
For each static instruction:
1. Measure % of slackful dynamic instances.
2. Multiply by # of dynamic instances.
3. Sum across all static instructions.
4. Compare to total slackful dynamic instructions (ideal case).
(slackful = has enough apportioned slack to double latency)
Locality of slack
[plot: slack captured per static instruction vs. the per-dynamic-instance ideal]
A PC-indexed, history-based predictor can capture most of the available slack.
Predicting slack
Two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction. Need: ability to measure slack of a dynamic instruction.
2. Store it in an array indexed by the PC of the static instruction. Need: locality of slack.
• Can capture 80% of potential exploitable slack.
Measuring slack in hardware: delay and observe
Goal: determine whether a static instruction has n cycles of slack.
1. Delay a dynamic instance by n cycles.
2. Check if it was critical (via the critical-path analyzer):
   a) No: the instruction has n cycles of slack.
   b) Yes: the instruction does not have n cycles of slack.
Review: Critical-path analyzer (ISCA '01)
[figure: dependence graph with edge latencies]
Don't need to measure latencies; just observe last-arriving edges.
• Plant a token and propagate it forward.
• If the token survives, the node is critical.
• If the token dies, the node is noncritical.
A sketch of this token-passing check follows.
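The token-passing idea can be modeled in a few lines: propagate the token only along last-arriving edges of the (acyclic) dependence graph. The graph accessors below are illustrative stand-ins for the hardware mechanism.

```python
# Token-passing criticality sketch: plant a token at a node and declare
# the node critical iff the token survives to the end of the graph.

def is_critical(node, successors, last_arriving):
    """successors[n]: nodes with an edge from n; last_arriving(s, d):
    True if edge (s, d) determined d's arrival time."""
    frontier = {node}                           # plant the token
    while True:
        frontier = {d for s in frontier for d in successors[s]
                    if last_arriving(s, d)}     # propagate forward
        if not frontier:
            return False                        # token died: noncritical
        if any(not successors[d] for d in frontier):
            return True                         # token reached graph end: critical
```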
Baseline policies (existing, not based on slack)
1. Simple register-dependence steering (reg-dep)
2. Send to the fast cluster until:
   a) the window is half full (fast-first win)
   b) too many ready instructions (fast-first rdy)
Baseline policies (existing, not based on slack)
[plot: performance of register dependence, fast-first window, and fast-first ready steering, with 2 fast clusters as an upper bound]
Slack-based policies
[plot: token-passing slack and ALOLD slack vs. reg-dep steering, with 2 fast clusters as an upper bound]
10% better performance from hiding non-uniformities.
Extra slow cluster (still saves ~25% core power)
[plot: token-passing slack and ALOLD slack vs. the best existing policy, with 2 fast clusters as an upper bound]