From Adaptive to Self-Tuning Systems
Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos
School of Electrical and Computer Engineering

Architectural Challenges
Frequency Wall
• Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays) [4]
Power Wall
• Leakage current increases 7.5X with each generation [3]
ILP
• Negative returns with power
• Increasing inefficiencies due to
  • speculation
  • control flow
[Figure: power versus single-thread performance for in-order, OOO, and aggressive OOO pipelines]
Memory Wall
• Cache area: 80% of transistor budget, 50% of total area [1]
• Defects in cache affect processor yield
• Significant power consumers (e.g. > 40% of total power in StrongARM) [2]
• On-chip-DRAM gap continues to grow
Image source: http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg
Economic Wall
• Costs of developing next generation processors
• Design & manufacturing costs
• Extreme device variability
References
1. P. Ranganathan, S. Adve, N. Jouppi. "Reconfigurable Caches and their Application to Media Processing." ISCA 2000.
2. Michael Zhang, Krste Asanovic. "Fine-Grain CAM-Tag Cache Resizing Using Miss Tags." ISLPED 2002.
3. S. Borkar. "Design Challenges of Technology Scaling." Micro 1999.
4. Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger. "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures." ISCA 2000.
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

System View
1. Capture and adapt to intrinsic application behavior
  • Static, off-line characterizations
  • Dynamic, on-line, evolutionary behaviors
2. Device-level variations reduce architecture yield
[Figure: large-scale, many-core, heterogeneous system built from processor (P) and memory (M) tiles]
Solution: systems are self-tuning

The Space of Solutions
[Figure: spectrum of processor (P) and memory (M) architectures; labels: Ill-structured workloads, State of the practice, Rigid HW/SW boundaries, Evolutionary or self-tuning systems]
• Traditional architectures (fixed)
• Ability to customize architectures before application deployment
• Architectures change at SW-determined points of execution
• Architectures continuously and autonomously evolve and adapt

From Adaptive to Self-Tuning
• Where do we make future investments in transistors and software?
  • Hardware-software co-design for continuous monitoring and/or tuning
  • Expose and (dynamically) eliminate design redundancies
• Two examples
  • Cache memory hierarchy
  • On-chip networks

Generational Behavior of Caches
[Figure: memory lines versus time, marking miss, hit, idle interval, and new generation]
References
1. S. Kaxiras, Z. Hu, M. Martonosi. "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power." ISCA 2001.
2. Jaume Abella, Antonio González, Xavier Vera, Michael F. P. O'Boyle. "IATAC: A Smart Predictor to Turn-off L2 Cache Lines." TACO 2005.

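The generational view on this slide (a miss opens a generation, hits follow, then the line sits idle until it is replaced) is what cache decay exploits: a line that has been idle long enough is assumed dead and is switched off to save leakage. The following is a minimal sketch of that idea, not the mechanism from the cited papers. The class names, the direct-mapped organization, and the single global `decay_interval` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    tag: Optional[int] = None   # None means the line holds no data
    idle_ticks: int = 0         # time since the last access
    powered: bool = False       # False models a line whose supply is gated off


class DecayCache:
    """Hypothetical direct-mapped cache that powers lines off after an idle interval."""

    def __init__(self, num_lines: int, decay_interval: int) -> None:
        self.lines: List[Line] = [Line() for _ in range(num_lines)]
        self.decay_interval = decay_interval

    def access(self, line_addr: int) -> bool:
        """Return True on a hit. A miss starts a new generation for the line."""
        index = line_addr % len(self.lines)
        tag = line_addr // len(self.lines)
        line = self.lines[index]
        hit = line.powered and line.tag == tag
        if not hit:
            line.tag = tag
            line.powered = True
        line.idle_ticks = 0      # any access restarts the idle interval
        return hit

    def tick(self) -> None:
        """Advance time one step and power off lines that have idled too long."""
        for line in self.lines:
            if not line.powered:
                continue
            line.idle_ticks += 1
            if line.idle_ticks >= self.decay_interval:
                line.powered = False
                line.tag = None  # contents are lost when the line is gated

    def powered_lines(self) -> int:
        """Number of lines currently drawing leakage power."""
        return sum(line.powered for line in self.lines)
```

The trade-off sits in `decay_interval`: a short interval saves more leakage but turns some would-be hits into misses, while a long interval keeps dead lines powered.
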
Cache Tuning: Conceptual Model
• Shape the cache: remap memory into the cache
• Resize the cache: match the program footprint

Cache Tuning: System Model & Opportunities
[Figure: processor (P) running Thread 1 and Thread 2 over L1, L2, and memory (M); a loop nest (loop x, loop y, loop z) over Region A containing statements and a remapping directive, Placement(B[][], param)]
• Structured accesses
• Static analysis or programmer supplied: Placement(B[][], param)
• Profile based insertion
• Run-time tuning
• Alternative implementations: AT, LUT, logic

Static Tuning: Scientific Applications
• Targeted to programs with predictable access patterns
• Compiler can both resize and remap
• Advanced compiler optimizations made possible

Dynamic Tuning: Folding Heuristics
• Find and utilize redundancies in the design
• Miss folding: fold misses via re-mapping memory lines into the same cache set
• Comparisons shown for a 256 KB L2 cache
S. Ramaswamy, S. Yalamanchili. "Improving Cache Efficiency via Resizing + Remapping." ICCD 2007.

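One way to picture miss folding is as a small remapping table placed in front of the usual modulo set index. The sketch below is an illustration only: the rule it uses (fold every set that saw misses but no hits onto one shared set) is an assumed stand-in, not the heuristic published in the ICCD 2007 paper, and the names `fold_sets` and `cache_set` are hypothetical.

```python
from typing import List, Set, Tuple


def fold_sets(hits: List[int], misses: List[int]) -> Tuple[List[int], Set[int]]:
    """Build a set-remapping table from per-set hit and miss counters.

    Assumed heuristic (illustrative only): a set that saw misses but no hits
    gains nothing from its own storage, so all such sets are folded onto one
    shared target set. Returns (remap, active_sets).
    """
    remap = list(range(len(hits)))   # start from conventional modulo placement
    target = None
    for s in range(len(hits)):
        if hits[s] == 0 and misses[s] > 0:
            if target is None:
                target = s           # first miss-only set becomes the fold target
            remap[s] = target
    return remap, set(remap)


def cache_set(line_addr: int, remap: List[int]) -> int:
    """Map a memory line to a cache set through the remapping table."""
    return remap[line_addr % len(remap)]


# Hypothetical counters for a 4-set cache over one monitoring interval.
remap, active = fold_sets(hits=[5, 0, 0, 3], misses=[1, 4, 2, 1])
```

In this example sets 1 and 2 share one physical set after folding, so set 2 no longer receives any lines and becomes a candidate for being turned off. A real design would also need wider tags, because lines from different conventional sets can now live in the same physical set.
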
Tuning for Yield: Decreasing Defect Sensitivity*
• Recovering design inefficiencies
• Performance yield: yield at a given performance (e.g. AMAT) for 1000 units
• Up to four times greater than modulo placement
• Exploiting redundancies: application to power management
* S. Ramaswamy, S. Yalamanchili. "Customizable Fault Tolerant Caches for Embedded Processors." ICCD 2006.

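Performance yield, as defined on this slide, is the fraction of manufactured units that still meet a performance target such as AMAT. A minimal sketch of that bookkeeping follows. It uses the standard definition AMAT = hit time + miss rate × miss penalty. The function names and the sample numbers are hypothetical and do not come from the paper.

```python
from typing import Iterable


def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time, in the same time unit as its inputs."""
    return hit_time + miss_rate * miss_penalty


def performance_yield(unit_amats: Iterable[float], target_amat: float) -> float:
    """Fraction of units whose AMAT is no worse than the target."""
    amats = list(unit_amats)
    if not amats:
        raise ValueError("need at least one unit")
    passing = sum(1 for a in amats if a <= target_amat)
    return passing / len(amats)


# Hypothetical batch of four units; defects raise the miss rate of some of them.
batch = [amat(1.0, rate, 100.0) for rate in (0.03, 0.05, 0.07, 0.11)]
fraction_ok = performance_yield(batch, target_amat=8.0)
```

Under this view, a placement scheme that steers memory lines away from defective cache sets lowers the miss rate of damaged units, which moves more of them under the target and raises the yield.
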
Opportunities
• Voltage scaling
  • Combine voltage scaling and remapping for program phase dependent power management
• Compiler-directed hardware optimizations
  • For example, concurrent data layout + cache placement
• Application to multi-threaded and multi-core domains
  • Cache sharing across threads
  • Challenge: coherency traffic

The On-Chip Network
• The network is in the critical path (performance)
  • Operand networks
  • Cache hierarchy
  • System on Chip
• Increasing impact of wire (channel) delays
  • Wires must be actively managed
• On-demand resource management
• Initial studies: link tuning
Reference: research at EPFL & Stanford on robust link design

A System for Tuning and Actively Reconfiguring SoC Links (STARS)
[Figure: timing diagram of Latch 1, Latch 2, and Latch 3 capturing Value 1 and Value 2 over time, for clocks that are too fast, well tuned, and too slow]
• Variable digital delays and cascaded registers measure link delay
• PLL tunes the clock to match the link delay

FPGA Tests
Monitoring
• Find end of link transition
• Find start of link transition
Tuning
• Determine slack in the link
• Adjust clock frequency
Low speed tests to validate the control strategy

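The monitoring and tuning steps listed on this slide can be sketched as a delay sweep followed by a clock adjustment. This is an assumed software model for illustration, not the STARS hardware or its FPGA test bench. The link model `make_link`, the three-valued sample, the sweep order, and the `margin_ps` guard band are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def make_link(start_ps: int, end_ps: int) -> Callable[[int], str]:
    """Hypothetical link model: what a latch sees when it samples at time t."""
    def sample(t_ps: int) -> str:
        if t_ps < start_ps:
            return "old"        # the new value has not begun to arrive
        if t_ps < end_ps:
            return "unstable"   # the link is mid-transition
        return "new"            # the new value has settled
    return sample


def find_transition(
    sample: Callable[[int], str], delays_ps: Iterable[int]
) -> Tuple[Optional[int], Optional[int]]:
    """Monitoring: sweep the sampling delay to bracket the link transition."""
    start = end = None
    for d in delays_ps:
        seen = sample(d)
        if start is None and seen != "old":
            start = d           # first delay at which the old value is gone
        if seen == "new":
            end = d             # first delay at which the new value is stable
            break
    return start, end


def retune(period_ps: int, end_ps: int, margin_ps: int) -> Tuple[int, int]:
    """Tuning: compute the slack in the link and the next clock period to request."""
    slack = period_ps - end_ps  # negative slack means the clock is too fast
    return slack, end_ps + margin_ps


link = make_link(start_ps=400, end_ps=650)
start, end = find_transition(link, range(0, 2000, 50))
slack, new_period = retune(period_ps=1000, end_ps=end, margin_ps=50)
```

The sweep here finds the start of the transition before its end, whereas the slide lists the end first; either order yields the same bracket. The resolution of the measurement is set by the sweep step, which in hardware corresponds to the step of the variable delay element.
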
Prototyping: 180 nm
Variable Delay Elements (VDE)
• Variable delay from 118 ps to 1.47 ns
• 10 bits of resolution
• 502 transistors
Digitally Controlled Oscillator (DCO)
• Clock period from 240 ps to 2.97 ns
• 10 bits of resolution
• 528 transistors
Digital Clock Divider (DCD)
• Min input clock period 480 ps
• 8 bits of resolution
• 1127 transistors
Allows tuning links up to 2.083 GHz from a reference clock of 8.13 MHz

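The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the control codes are linearly spaced across each range and that the 8-bit divider divides by up to 2^8 = 256. Neither assumption is stated on the slide, so the derived step sizes are estimates only.

```python
def step_ps(min_ps: float, max_ps: float, bits: int) -> float:
    """Average step between adjacent codes, assuming linear spacing (an assumption)."""
    return (max_ps - min_ps) / (2 ** bits - 1)


def freq_ghz(period_ps: float) -> float:
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


vde_step = step_ps(118.0, 1470.0, 10)       # about 1.32 ps per code
dco_step = step_ps(240.0, 2970.0, 10)       # about 2.67 ps per code
max_link_ghz = freq_ghz(480.0)              # about 2.083 GHz
ref_mhz = max_link_ghz * 1000.0 / 2 ** 8    # about 8.14 MHz

print(f"VDE step {vde_step:.2f} ps, DCO step {dco_step:.2f} ps")
print(f"max link clock {max_link_ghz:.3f} GHz, reference {ref_mhz:.2f} MHz")
```

Under these assumptions the 480 ps minimum input period of the divider gives the 2.083 GHz upper limit, and dividing that by 256 lands close to the 8.13 MHz reference quoted on the slide.
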
Extensions
• Modulate link widths
• Modulate buffer organizations
  • Channels/depth
• Feedback between local congestion detection and link and buffer resources

Summary
• Application demands will be time varying
• Technology will introduce time-varying hardware characteristics
• Continuous cooperative HW/SW tuning provides a methodology for addressing these concerns
  • Need the support of abstractions for tuning
• Influence of prior applications to datapaths (Razor, UMich), communication systems (Vizor, GT), and reliable links (Stanford/EPFL)
• Build on existing research in cache performance & power management