Sculptor: Flexible Approximation with Selective Dynamic Loop Perforation
Shikai Li, Sunghyun Park, Scott Mahlke
Compilers Creating Custom Processors (CCCP) Research Group, EECS Department, University of Michigan
Problem
The end of Dennard scaling and the approaching end of Moore's Law.
(Source: https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/)
VS
Data explosion and emerging compute-intensive applications in deep learning, data mining, etc.
(Source: K. Gomez, "Quantum Tunneling through the Memory Wall," http://www.pdl.cmu.edu/SDI/2013/slides/QTttMW7.pdf)
Opportunity
Compute-intensive applications in various domains are error-tolerant.
-- Machine Learning
-- Data Mining
-- Image Processing
-- Video Processing
-- Gaming
…
[Figure: the same image at output quality 100%, 95%, 90%, and 85%, an example of increasing quality loss. From Mehrzad Samadi et al., "SAGE: Self-Tuning Approximation for Graphics Engines"]
Solution
Rising Computation Demands + Emerging Error-Tolerant Applications
Approximate Computing: trade off output accuracy for performance improvement or energy reduction.
Approximate Computing
• Reduce the amount of computation
• Replace accurate computation with fuzzy computation
• Perform computation without correctness guarantees
Previous Work
Hardware
• Neural Network Accelerator
-- ASIC (H. Esmaeilzadeh, MICRO 2012)
-- Analog (R. St. Amant, ISCA 2014)
-- FPGA (T. Moreau, HPCA 2015)
-- GPU (A. Yazdanbakhsh, MICRO 2015)
• Approximate Value Prediction
-- J. S. Miguel, MICRO 2014
-- A. Yazdanbakhsh, TACO 2016
• Cache and Memory System
-- Doppelgänger Cache (J. S. Miguel, MICRO 2015)
-- Bunker Cache (J. S. Miguel, MICRO 2016)
-- Concise Loads & Stores (A. Jain, MICRO 2016)
• Approximate Operation and Storage
-- CPU (H. Esmaeilzadeh, ASPLOS 2012)
-- GPU (D. Wong, HPCA 2016)
Software
• Programmer-Assisted Framework
-- Green (W. Baek, PLDI 2010)
-- EnerJ (A. Sampson, PLDI 2011)
• Automatic Framework for GPU
-- SAGE (M. Samadi, MICRO 2013)
-- Paraprox (M. Samadi, ASPLOS 2013)
• Unleash Parallelism
-- QuickStep (S. Misailovic, TECS 2013)
-- Helix-Up (S. Campanoni, CGO 2015)
• Approximation Dynamism
-- M. A. Laurenzano, PLDI 2016
-- S. Mitra, CGO 2017
• Task Skipping and Loop Perforation
-- M. Rinard, et al. SC 2016, MIT Tech Report 2009, SAS 2011, FSE 2011
Loop Perforation
Loops are transformed to periodically skip subsets of their iterations.
[Figure labels: Periodically, Entirely]
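A minimal Python sketch of conventional loop perforation, assuming a simple averaging kernel. The function name `perforated_mean` and the `rate` parameter are illustrative and do not come from the slides. With `rate = 2`, every second iteration is skipped.

```python
def perforated_mean(values, rate=1):
    """Approximate the mean by executing only every `rate`-th iteration.

    rate=1 runs every iteration (exact); rate=2 skips every other one.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    total = 0.0
    executed = 0
    for i in range(0, len(values), rate):  # perforated loop: stride equals rate
        total += values[i]
        executed += 1
    return total / executed if executed else 0.0


data = [float(x) for x in range(100)]
exact = perforated_mean(data, rate=1)
approx = perforated_mean(data, rate=2)
relative_error = abs(approx - exact) / abs(exact)
```

On this smooth input the perforated loop does half the work and the result stays close to the exact mean; inputs with less regular data would lose more accuracy.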
Skipping Different Instructions
Skipping different instructions affects accuracy differently.
[Figure: final output errors caused by skipping a single instruction at rate 2 inside the kernel loop of Hotspot from Rodinia; instructions grouped as Data, Addr, Mem, Cond]
Skipping Different Iterations
Skipping different iterations affects accuracy differently.
[Figure: final output errors caused by skipping a single iteration inside a kernel loop of Bodytrack from PARSEC; x-axis: Iteration ID]
Optimized Loop Perforation
• Traditional Loop Perforation
• Selective Instruction Loop Perforation
• Dynamic Iteration Loop Perforation
• Selective Dynamic Loop Perforation
System Overview
Methodology
• Selective Instruction Loop Perforation
• Dynamic Iteration Loop Perforation
• Runtime Error Management
Selective Instruction Loop Perforation
Loops are transformed to skip a subset of instructions in each iteration.
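A minimal Python sketch of the idea, assuming that a skipped instruction reuses its most recent result; the slides do not spell out this detail. The kernel, the function name `selective_perforation`, and the choice of which instruction to skip are all hypothetical.

```python
import math


def selective_perforation(xs, rate=2):
    """Run every iteration, but recompute the expensive value only on every
    `rate`-th iteration; the other iterations reuse the last computed value."""
    out = []
    weight = 0.0
    for i, x in enumerate(xs):
        if i % rate == 0:
            weight = math.exp(-x * x)  # perforated (expensive) instruction
        out.append(weight * x)         # instruction kept in every iteration
    return out


xs = [0.0, 0.1, 0.2, 0.3]
exact = selective_perforation(xs, rate=1)
approx = selective_perforation(xs, rate=2)
```

Unlike whole-iteration perforation, every output element is still produced; only the cost of the skipped instruction is saved.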
Selective Perforation Methodology
• Instruction Level Selective Perforation
• Load Based Selective Perforation
• Store Based Selective Perforation
Instruction Level Selective Perforation
1. Selection Stage
2. Expansion Stage
3. Transformation Stage
1. Selection Stage
a. Performance Impact
b. Program Corruption
c. Output Error
   Good temporal data similarity: … 101, 102, 103, 104, 105, 106, 107, 108, 109 …
   Bad temporal data similarity: … 100, 200, 100, 300, 200, 500, 200, 300, 500 …
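The two value sequences on the slide can be compared with a simple score. The sketch below uses the mean relative change between consecutive values as a stand-in for temporal data similarity; this metric and the name `mean_relative_step` are assumptions for illustration, not the measure used in the work.

```python
def mean_relative_step(values):
    """Average relative change between consecutive values (lower = more similar)."""
    steps = [abs(b - a) / max(abs(a), 1e-12) for a, b in zip(values, values[1:])]
    return sum(steps) / len(steps) if steps else 0.0


good = [101, 102, 103, 104, 105, 106, 107, 108, 109]
bad = [100, 200, 100, 300, 200, 500, 200, 300, 500]
good_score = mean_relative_step(good)
bad_score = mean_relative_step(bad)
```

A low score suggests that reusing a stale value in place of a skipped instruction's result would introduce little output error.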
2. Expansion Stage
Perforate more instructions without additional output error:
1) Instructions that only use results of perforated instructions or loop invariants.
2) Instructions whose results are only used by perforated instructions.
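The two expansion rules can be sketched as a fixed-point computation over a small dependence graph. The representation below (a dict from instruction name to the names it reads) and the function `expand_perforation` are hypothetical, and the sketch assumes the instructions have no side effects beyond producing their result.

```python
def expand_perforation(operands, perforated, invariants):
    """Grow the perforated set to a fixed point using the two expansion rules.

    operands:   dict mapping each instruction name to the names it reads.
    perforated: instructions chosen by the selection stage.
    invariants: loop-invariant values.
    """
    selected = set(perforated)
    users = {}  # users[v] = instructions that read the value v
    for inst, ops in operands.items():
        for op in ops:
            users.setdefault(op, set()).add(inst)
    changed = True
    while changed:
        changed = False
        for inst, ops in operands.items():
            if inst in selected:
                continue
            # Rule 1: every input is a perforated result or a loop invariant.
            rule1 = bool(ops) and all(op in selected or op in invariants for op in ops)
            # Rule 2: the result is consumed only by perforated instructions.
            consumers = users.get(inst, set())
            rule2 = bool(consumers) and consumers <= selected
            if rule1 or rule2:
                selected.add(inst)
                changed = True
    return selected


loop_body = {
    "addr": ["base", "i"],        # address computation from the induction variable
    "load": ["addr"],             # instruction picked by the selection stage
    "scale": ["load", "k"],       # reads only the load and an invariant
    "acc": ["scale", "acc_prev"], # accumulates into a loop-carried value
}
result = expand_perforation(loop_body, {"load"}, {"base", "k"})
```

Here `addr` is added by rule 2 (its only user is the perforated load) and `scale` by rule 1, while `acc` stays because it also reads a loop-carried value.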
3. Transformation Stage
Reduce control divergence overhead with compiler optimizations:
1) Instruction Re-ordering
2) Loop Unswitching
3) Loop Unrolling
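To illustrate why such transformations help, the sketch below contrasts a selectively perforated loop that tests the skip condition in every iteration with a version unrolled by the skip rate, where the test disappears. Both functions and the kernel are hypothetical; Python only shows the structure here, not the performance effect a compiler would obtain.

```python
def with_branch(xs, rate):
    """Selective perforation with a skip check evaluated in every iteration."""
    out, cached = [], 0.0
    for i, x in enumerate(xs):
        if i % rate == 0:           # per-iteration branch
            cached = x * x          # perforated instruction
        out.append(cached + x)
    return out


def unrolled(xs, rate):
    """Same computation, unrolled by `rate` so no skip check is needed."""
    out = []
    n = len(xs)
    for start in range(0, n, rate):
        cached = xs[start] * xs[start]      # full iteration
        out.append(cached + xs[start])
        for j in range(start + 1, min(start + rate, n)):
            out.append(cached + xs[j])      # perforated iterations reuse cached
    return out


xs = [float(v) for v in range(10)]
same = all(with_branch(xs, r) == unrolled(xs, r) for r in (1, 2, 3))
```

Both versions produce identical outputs; the unrolled form simply moves the decision of which iterations run the perforated instruction out of the loop body.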
Dynamic Iteration Loop Perforation
Loops are transformed to skip a flexible subset of iterations during program execution.
Dynamic Perforation Methodology
• Dynamic Perforation Rate
• Dynamic Start Point
Dynamic Rate
Adapt approximation aggressiveness by changing skip rates under different circumstances during program execution.
Active Function Call Based Dynamic Rate
Loop executions tend to have different accuracy impacts during different function calls.
Active Loop Iteration Based Dynamic Rate
Loop executions tend to have different accuracy impacts during different "outer-loop" iterations.
[Figure: x-axis: Iteration ID]
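A minimal sketch of an outer-loop-based dynamic rate, assuming the skip rate is looked up once per outer-loop iteration. The policy in `example_policy` is invented for illustration; the slides do not state how rates are assigned.

```python
def run_with_dynamic_rate(grid, rate_for_outer):
    """Sum each row with a skip rate chosen per outer-loop iteration."""
    results = []
    for outer_id, row in enumerate(grid):
        if not row:
            results.append(0.0)
            continue
        rate = rate_for_outer(outer_id)   # rate may change between outer iterations
        sampled = row[::rate]             # perforated inner loop
        # Scale the partial sum to estimate the full-row sum.
        results.append(sum(sampled) * len(row) / len(sampled))
    return results


def example_policy(outer_id):
    # Hypothetical policy: the first two outer iterations run exactly,
    # later ones skip every other inner iteration.
    return 1 if outer_id < 2 else 2


grid = [[1.0] * 8 for _ in range(4)]
sums = run_with_dynamic_rate(grid, example_policy)
```

The same structure applies to the function-call-based variant: the rate lookup would be keyed on the active call instead of the outer iteration ID.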
Dynamic Start
Coverage guarantees that each iteration is executed at least once.
Fairness gives each iteration an equal chance of being executed.
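One way to obtain both properties is to rotate the first executed iteration on every invocation of the perforated loop. The round-robin scheme and the class `DynamicStart` below are assumptions for illustration; the slides state only the coverage and fairness goals.

```python
class DynamicStart:
    """Rotate the first executed iteration across invocations of a perforated loop."""

    def __init__(self, rate):
        self.rate = rate
        self.calls = 0

    def indices(self, n):
        start = self.calls % self.rate  # start point advances on every invocation
        self.calls += 1
        return range(start, n, self.rate)


sched = DynamicStart(rate=3)
counts = [0] * 10
for _ in range(3):                      # three invocations of the same loop
    for i in sched.indices(10):
        counts[i] += 1
```

After `rate` invocations every iteration index has been executed exactly once, so no iteration is permanently skipped and none is favored.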
Runtime Error Management
A calibration-based mechanism adjusts approximation aggressiveness to manage error at runtime.
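A minimal sketch of what a calibration step could look like, assuming the exact and perforated versions are occasionally run on the same input and the skip rate is adjusted from the observed error. The function `calibrate`, the thresholds, and the step size of 1 are all hypothetical.

```python
def calibrate(rate, exact, approx, budget, max_rate=8):
    """Return an adjusted skip rate after one calibration run.

    exact / approx: outputs of the exact and perforated versions on the same input.
    budget: tolerated relative error (e.g. 0.10 for 10%).
    """
    error = abs(approx - exact) / max(abs(exact), 1e-12)
    if error > budget:
        return max(1, rate - 1)          # too much error: be less aggressive
    if error < budget / 2:
        return min(max_rate, rate + 1)   # comfortable margin: be more aggressive
    return rate


lowered = calibrate(4, exact=100.0, approx=80.0, budget=0.10)
raised = calibrate(4, exact=100.0, approx=99.0, budget=0.10)
kept = calibrate(4, exact=100.0, approx=92.0, budget=0.10)
```

Calibration runs cost extra time because the exact version must execute, so how often to calibrate is itself a trade-off.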
Evaluation
Benchmarks:
• 7 benchmarks from PARSEC
• 2 additional benchmarks from Rodinia
Error Metric:
• Most error metrics are based on relative mean error
• Clustering applications use the NMI score as the error metric
Evaluation Platform:
• LLVM 4.0
• Clang 4.0, -O3
• Ubuntu 16.04
• Intel Skylake i7-6700 CPU @ 3.40 GHz
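For reference, one common reading of relative mean error is the average of the per-element relative differences between the exact and approximate outputs. The sketch below follows that reading; the exact per-benchmark definitions are not given on the slide, so treat this formula and the name `relative_mean_error` as assumptions.

```python
def relative_mean_error(exact, approx, eps=1e-12):
    """Mean of |approx - exact| / |exact| over all output elements."""
    if len(exact) != len(approx):
        raise ValueError("outputs must have the same length")
    if not exact:
        return 0.0
    total = sum(abs(a - e) / max(abs(e), eps) for e, a in zip(exact, approx))
    return total / len(exact)


exact_out = [1.0, 2.0, 4.0]
approx_out = [1.1, 1.8, 4.0]
err = relative_mean_error(exact_out, approx_out)
within_budget = err <= 0.10   # compare against a 10% error budget
```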
Selective & Dynamic Perforation
Selective Dynamic Loop Perforation speedup with different error budgets:
• 5% error budget (left): average speedup improved from 1.47x to 2.89x
• 10% error budget (right): average speedup improved from 1.93x to 4.07x
Selective / Dynamic Loop Perforation
• Selective Loop Perforation with an error budget of 10%: average speedup 2.62x, compared to 4.07x for Selective Dynamic Loop Perforation
• Dynamic Loop Perforation with an error budget of 10%: average speedup 2.91x, compared to 4.07x for Selective Dynamic Loop Perforation
Conclusion
Motivation:
• Space Limitation in Loop Perforation
• Time Limitation in Loop Perforation
Methodology:
• Selective Instruction Loop Perforation
• Dynamic Iteration Loop Perforation
Evaluation:
• Average Speedup 1.93x -> 4.07x
Q & A
Thank you!