
Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August
University of Michigan (Intel, Northrop Grumman, UIUC, Princeton)
MICRO-44, December 6, 2011
University of Michigan, Electrical Engineering and Computer Science

Computational Efficiency Landscape
• Energy dilemma
• More gates can fit on a die
• But power constraints limit their use
• To scale performance, need to increase efficiency
[Figure: computational efficiency landscape; plotted designs: AMD 6850, S1070, GTX 295, GTX 280, IBM Cell, Core i7, AMD Opteron, Core 2, Pentium M, embedded processors]

Where Does The Energy Go?
• Energy used in a single-issue RISC in-order core [Dally'08]
• Instruction fetch and decode energy dominates
• Actual execution barely consumes 10%
Plenty of opportunities to save energy…

Increasing Efficiency with Accelerators
Application regularity defines success:
1. Small dominant code segments
2. Little control flow
3. Narrow application set
4. Data parallelism
[Figure: flexibility versus efficiency and performance, spanning general purpose processors, FPGAs, ASIPs, DSPs, SIMD, loop accelerators, and ASICs]
• Accelerators can give 10-50X efficiency


Utility Factor for Accelerators
• What fraction of the code gets accelerated?
• Most solutions fail for “irregular” or “general-purpose” code
[Figure: flexibility versus efficiency and performance, with an open gap ("???") between general purpose processors and ASIPs, DSPs, SIMD, loop accelerators, and ASICs]
Goal: A design to target irregular codes
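The utility-factor question can be made concrete with an Amdahl-style estimate. The sketch below is an illustration rather than anything shown in the talk: it assumes program energy splits cleanly into an accelerated fraction and a remainder, and the name `overall_efficiency_gain` and the sample coverage values are hypothetical. The 10x and 50x gains echo the accelerator range quoted on the previous slide.

```python
def overall_efficiency_gain(coverage: float, accel_gain: float) -> float:
    """Amdahl-style whole-program efficiency gain (illustrative model only).

    coverage   -- fraction of baseline energy spent in code the accelerator runs (0..1)
    accel_gain -- efficiency improvement on that code (e.g. 10 for 10x)
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    if accel_gain <= 0:
        raise ValueError("accel_gain must be positive")
    # Energy left after acceleration, relative to a baseline of 1.0.
    remaining = (1.0 - coverage) + coverage / accel_gain
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical coverage values, chosen only to show the trend.
    for coverage in (0.1, 0.5, 0.9):
        for gain in (10, 50):
            total = overall_efficiency_gain(coverage, gain)
            print(f"coverage={coverage:.0%} accel={gain}x -> overall {total:.2f}x")
```

Under this model, a large accelerator gain helps little when coverage is low, which is why the fraction of code that gets accelerated matters as much as the peak gain.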

The BERET Architecture
• A compute engine for “hot regular regions” in irregular codes
• Key insights:
1. Exploits recurring instructions (traces) to save on redundant fetches and decodes
2. Uses a bundled execution model to save on redundant register reads/writes
[Figure: a program with hot regions; the CPU and BERET share the L1 I$ and L1 D$; live-ins are copied from the CPU to BERET and live-outs are copied back]
BERET: Bundled Execution of REcurring Traces

Insight 1: Recurring Instructions
• How about such loops?
► Typical loops in irregular codes are large and control intensive!
• We leverage looping traces for savings
1. Straight-line code: simple hardware
2. Typically short: easy to buffer
3. Significant fetch/decode savings for buffered instructions
[Figure: control flow graph (CFG) of basic blocks with branch probabilities (85%/15%, 90%/10%, 50%) and loop exits; the hot basic blocks form a looping trace]
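One way to picture how a looping trace falls out of a profiled CFG is a greedy walk that always follows the most likely successor until control returns to the loop header. This is a minimal sketch under stated assumptions, not the selection algorithm from the talk: the CFG below is hypothetical (the slide's figure is not recoverable in full), and `hottest_looping_trace` is an illustrative name.

```python
from typing import Dict, List, Tuple

# Hypothetical CFG: block -> list of (successor, branch probability).
CFG: Dict[str, List[Tuple[str, float]]] = {
    "BB1": [("BB2", 0.85), ("BB3", 0.15)],
    "BB2": [("BB5", 0.90), ("BB4", 0.10)],
    "BB3": [("BB5", 1.0)],
    "BB4": [("BB5", 1.0)],
    "BB5": [("BB1", 0.50), ("EXIT", 0.50)],  # back edge listed first, so it wins the tie
}


def hottest_looping_trace(
    cfg: Dict[str, List[Tuple[str, float]]], header: str
) -> Tuple[List[str], float]:
    """Greedily follow the most likely successor from `header`.

    Returns (trace, loop_back_probability). The probability is the product of
    the edge probabilities along the trace, including the back edge. It is 0.0
    if the greedy path leaves the loop or revisits a non-header block.
    """
    trace = [header]
    prob = 1.0
    block = header
    while True:
        successors = cfg.get(block, [])
        if not successors:
            return trace, 0.0  # fell out of the loop
        nxt, p = max(successors, key=lambda edge: edge[1])
        prob *= p
        if nxt == header:
            return trace, prob  # closed the loop
        if nxt in trace:
            return trace, 0.0  # inner cycle that skips the header
        trace.append(nxt)
        block = nxt


if __name__ == "__main__":
    trace, p = hottest_looping_trace(CFG, "BB1")
    print(trace, round(p, 4))
```

In this sketch every branch off the trace becomes a potential side exit, so a low loop-back probability signals a trace that would leave the accelerator often.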

Frequency of Recurring Instructions
Offload stable traces in irregular loops

Insight 2: Bundled Execution
• Traditional processors issue and execute instructions in isolation…
[Figure: one dataflow graph of LD, +, /, &, >>, <<, and ST operations executed two ways. In isolation: 11 instrs, 14 reads, 10 writes. Bundled execution: 3 instrs, 6 reads, 2 writes]
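The read and write savings can be illustrated by counting register-file accesses for a small trace, first with every instruction issued in isolation and then grouped into bundles. This is a sketch under assumptions, not the slide's graph: the six-instruction trace is hypothetical, register names are assumed to be written once (SSA-like), each external operand counts as one read, and a value is written to the register file only if a later bundle or the live-out set needs it.

```python
from dataclasses import dataclass
from typing import Sequence, Set, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: str               # "" when no register is produced (e.g. a store)
    srcs: Tuple[str, ...]   # source registers


def isolated_cost(trace: Sequence[Instr]) -> Tuple[int, int]:
    """Register-file (reads, writes) when every instruction issues on its own."""
    reads = sum(len(i.srcs) for i in trace)
    writes = sum(1 for i in trace if i.dest)
    return reads, writes


def bundled_cost(bundles: Sequence[Sequence[Instr]], live_out: Set[str]) -> Tuple[int, int]:
    """Register-file (reads, writes) when values stay inside a bundle where possible."""
    reads = writes = 0
    for idx, bundle in enumerate(bundles):
        # Registers that some later bundle reads.
        used_later = {s for later in bundles[idx + 1:] for i in later for s in i.srcs}
        produced: Set[str] = set()
        for instr in bundle:
            # Operands produced earlier in this bundle are forwarded internally.
            reads += sum(1 for s in instr.srcs if s not in produced)
            if instr.dest:
                produced.add(instr.dest)
        # Only values needed outside the bundle reach the register file.
        writes += sum(1 for r in produced if r in used_later or r in live_out)
    return reads, writes


# Hypothetical trace, split into two bundles of three instructions.
TRACE = [
    Instr("LD", "t1", ("a",)),
    Instr("ADD", "t2", ("t1", "b")),
    Instr("ST", "", ("t2", "c")),
    Instr("LD", "t3", ("d",)),
    Instr("SHR", "t4", ("t3", "e")),
    Instr("AND", "t5", ("t4", "t2")),
]
BUNDLES = [TRACE[:3], TRACE[3:]]

if __name__ == "__main__":
    print("isolated:", isolated_cost(TRACE))
    print("bundled: ", bundled_cost(BUNDLES, {"t5"}))
```

The temporaries `t1`, `t3`, and `t4` never leave their bundles in this model, which is the effect the slide attributes to bundled execution.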

Efficiency of Bundled Execution
[Figure: normalized perf/power versus bundle length; all results normalized to a bundle length of 1]
Bundled execution increases datapath efficiency by more than 2x

BERET Hardware Design
• Hardware design objectives:
► Capable of executing straight-line code in a loop (traces)
► Support for bundled execution of trace instructions
► Handle trace side-exits, and transfer control to the main processor
[Figure: BERET datapath with I$, configuration RAM (CRAM) selected by index bits, a MUX feeding config. bits to SEB 1 through SEB N, input and output latches, internal register file, store buffer, D$, and writeback bus; an example SEB chains LD, ALU, <<, and ALU units. Stages: Configure SEB (1-2 cycles), Execute SEB (1-5 cycles), Writeback (1-2 cycles)]
SEB: Subgraph Execution Block

Compiler Support
1. Trace Detection
2. Mapping traces to SEBs
[Figure: the compiler extracts hot traces (with high loop back probability) from the program, converts trace-exit branches (BR) into asserts, breaks the trace's data flow into subgraphs of operations such as LD, ADD, SUB, MPY, AND, OR, SHIFT, and ST, and emits a configuration that maps the subgraphs onto SEB 0 through SEB 3 of BERET]

CPU-BERET Execution Flow
[Figure: execution timeline. CPU execution, copy live-ins, BERET execution (header and body repeated per iteration, alternating between RF-0 and RF-1), assert on a side exit, copy live-outs, CPU execution]
• Assert discovered, last BERET iteration squashed
• Registers copied to main processor
• Program executes back on main processor
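The hand-off can be sketched as a small simulation: copy live-ins, run iterations on a scratch copy of the registers, and on a failed assert discard the scratch copy before copying live-outs back. This is one reading of the slide rather than a description of the actual hardware; the scratch copy stands in for the RF-0/RF-1 pair, memory and the store buffer are not modeled, and `run_trace_on_beret`, `sum_nonnegative`, and `DATA` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Tuple

Regs = Dict[str, int]
# An iteration takes a scratch register state and returns (new_state, assert_ok).
Iteration = Callable[[Regs], Tuple[Regs, bool]]


def run_trace_on_beret(
    cpu_regs: Regs,
    live_ins: Iterable[str],
    live_outs: Iterable[str],
    iteration: Iteration,
    max_iters: int = 1_000_000,
) -> Tuple[Regs, int]:
    """Return (CPU registers after the hand-back, number of committed iterations)."""
    committed = {r: cpu_regs[r] for r in live_ins}      # copy live-ins
    completed = 0
    while completed < max_iters:
        working, ok = iteration(dict(committed))        # run on a scratch copy
        if not ok:
            break                                       # squash: drop the scratch copy
        committed = working                             # commit the iteration
        completed += 1
    result = dict(cpu_regs)
    for r in live_outs:                                 # copy live-outs
        if r in committed:
            result[r] = committed[r]
    return result, completed


DATA = [3, 4, 5, -1, 7]  # hypothetical input; the negative value triggers a side exit


def sum_nonnegative(regs: Regs) -> Tuple[Regs, bool]:
    """One trace iteration: acc += DATA[i]; i += 1; assert the value was >= 0."""
    i = regs["i"]
    if i >= len(DATA):
        return regs, False          # loop exit also leaves the trace
    x = DATA[i]
    regs["acc"] += x                # speculative update, dropped if the assert fails
    regs["i"] = i + 1
    return regs, x >= 0


if __name__ == "__main__":
    regs, n = run_trace_on_beret(
        {"i": 0, "acc": 0, "sp": 100}, ["i", "acc"], ["i", "acc"], sum_nonnegative
    )
    print(regs, n)
```

After the hand-back the CPU sees the state from the last committed iteration, so it can re-execute the squashed iteration itself and take the side exit.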

Energy Savings
[Figure: energy savings for the training set and the test set]

Performance Impact

Concluding Remarks
• Scaling program performance in an energy-constrained environment requires improving computational efficiency
• Most accelerators exploit program regularity for savings
• BERET is a configurable engine that saves energy by:
► Exploiting hot traces to avoid redundant fetches and decodes
► Using a bundled execution model to reduce temporary variable reads and writes
Energy Saving: ~35%
Performance Enhancement, Area Overhead: ~10%, 20%

Questions
• For more
► See http://cccp.eecs.umich.edu

Fine Grain Program Phase Behavior
• Traditional phases too coarse-grained to match accelerator
[Figure: traditional phases versus fine-grain phases over a window from 0M to 10M; accelerate the pink portions]
Hypothesis of This Work
Irregular programs are composed of fine-grain periods of high degrees of regularity. We can identify these periods and run them on an accelerator customized for “simple” execution.