Composite Cores: Pushing Heterogeneity into a Core
Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke
University of Michigan
MICRO-45, May 8th 2012
University of Michigan Electrical Engineering and Computer Science

High Performance Cores
[Figure: performance and energy of a high-performance core over time]
• High energy yields high performance
• Low performance DOES NOT yield low energy
• High performance cores waste energy on low performance phases

Core Energy Comparison
[Figure: core energy comparison, Out-of-Order (Brooks, ISCA '00) vs. In-Order (Dally, IEEE Computer '08)]
• Out-of-Order contains performance-enhancing hardware
  – Not necessary for correctness
• Do we always need the extra hardware?

Previous Solution: Heterogeneous Multicore
• 2+ cores
• Same ISA, different implementations
  – High performance, but more energy
  – Energy efficient, but less performance
• Share memory at a high level
  – Shared L2 cache (Kumar '04)
  – Coherent L2 caches (ARM's big.LITTLE)
• Operating System (or programmer) maps each application to the smallest core that provides the needed performance

Current System Limitations
• Migration between cores incurs high overheads
  – ~20K cycles (ARM's big.LITTLE)
• Sample-based schedulers
  – Sample each core's performance, then decide whether to reassign the application
  – Assume stable performance within a phase
  – Phase must be long to be recognized and exploited: 100M-500M instructions in length
• Do finer-grained phases exist? Can we exploit them?

Performance Change in GCC
[Figure: IPC of the Big and Little cores over the instruction stream of GCC]
• Average IPC over a 1M-instruction window (quantum)
• Average IPC over 2K quanta
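The quantum averaging above reduces a raw cycle trace to one IPC point per fixed-size instruction window. A minimal sketch of that bookkeeping (the function name and cycle counts are made up for illustration):

```python
def quantum_ipcs(cycle_counts, quantum=1_000_000):
    """IPC per quantum: `quantum` instructions retired divided by the
    cycles the core spent retiring them (one entry per window)."""
    return [quantum / cycles for cycles in cycle_counts]

# Hypothetical cycle counts for three consecutive 1M-instruction quanta:
big_ipc = quantum_ipcs([400_000, 500_000, 2_000_000])
# -> [2.5, 2.0, 0.5]: the last quantum is a low-performance phase
```

A low-IPC window like the last one is exactly the kind of phase the scheduler would want to run on a Little core.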

Finer Quantum
[Figure: IPC over instructions 160K-180K of GCC for the Big and Little cores]
• A 20K-instruction window from GCC
• Average IPC over 100-instruction quanta
• What if we could map these phases to a Little core?

Our Approach: Composite Cores
• Hypothesis: Exploiting fine-grained phases allows more opportunities to run on a Little core
• Problems
  I. How to minimize switching overheads?
  II. When to switch cores?
• Questions
  I. How fine-grained should we go?
  II. How much energy can we save?

Problem I: State Transfer
[Diagram: per-core state; the iCache, iTLB, branch predictor, dCache, and dTLB hold 10s of KB, while the RAT, decode state, and register file are <1 KB]
• State transfer costs can be very high: ~20K cycles (ARM's big.LITTLE)
• Limits switching to coarse granularity: 100M instructions (Kumar '04)

Creating a Composite Core
[Diagram: the Big uEngine (O3 execute, RAT, register file, load/store queue) and the Little uEngine (in-order execute, register file) share the fetch stage, iCache, iTLB, branch predictor, dCache, and dTLB; a controller transfers <1 KB of state between them]
• Only one uEngine is active at a time

Hardware Sharing Overheads
• Big uEngine needs
  – High fetch width
  – Complex branch prediction
  – Multiple outstanding data cache misses
• Little uEngine wants
  – Low fetch width
  – Simple branch prediction
  – Single outstanding data cache miss
• Must build shared units for the Big uEngine, over-provisioned for the Little uEngine
  – Little pays ~8% energy overhead to use the over-provisioned fetch + caches
• Assume clock gating for the inactive uEngine
  – Still has static leakage energy

Problem II: When to Switch
• Goal: Maximize time on the Little uEngine, subject to a maximum performance loss
  – User-configurable
• Traditional OS-based schedulers won't work
  – Decisions are too frequent
  – Need to be made in hardware
• Traditional sampling-based approaches won't work
  – Performance is not stable for long enough
  – Frequent switching just to sample wastes energy

What uEngine to Pick
[Figure: Big and Little core IPC and their difference over the instruction stream, with regions marked "Run on Big" and "Run on Little"]
• This value is hard to determine a priori; it depends on the application
  – Let the user configure the target value
  – Use a controller to learn the appropriate value over time

Reactive Online Controller
[Diagram: the user-selected performance and a threshold feed the controller, which uses Big and Little performance models; the switching controller routes execution to the Little uEngine (true) or the Big uEngine (false)]
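The decision this controller makes each quantum can be sketched as follows. The class, method names, and threshold value are hypothetical; in the actual design the threshold is adjusted over time to track the user-selected performance target:

```python
class ReactiveController:
    """Per-quantum switching decision (a sketch, not the paper's hardware)."""

    def __init__(self, threshold):
        # Maximum estimated IPC loss acceptable to run on the Little uEngine.
        self.threshold = threshold

    def choose_engine(self, big_ipc, little_ipc):
        # One IPC is measured on the active uEngine; the other is
        # estimated for the inactive one by its performance model.
        if big_ipc - little_ipc <= self.threshold:
            return "little"   # gap small enough: save energy
        return "big"          # gap too large: stay on Big

ctrl = ReactiveController(threshold=0.5)
ctrl.choose_engine(2.15, 1.66)  # gap ~0.49 -> "little"
ctrl.choose_engine(2.15, 1.00)  # gap ~1.15 -> "big"
```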

uEngine Modeling
• Collect metrics from the active uEngine
  – iL1, dL1 cache misses
  – L2 cache misses
  – Branch mispredicts
  – ILP, MLP, CPI
• Use a linear model to estimate the inactive uEngine's performance
• Example: while(flag){ foo(); flag = bar(); } measures IPC 1.66 on the Little uEngine; the model estimates the Big uEngine's IPC (here 2.15)
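A linear performance model of this kind is just a weighted sum of the observed metrics plus a constant. The metric names, weights, and numbers below are invented for illustration, not the paper's trained coefficients:

```python
def estimate_inactive_ipc(metrics, weights, constant):
    """Estimate the inactive uEngine's IPC as a linear combination of
    performance counters observed on the active uEngine."""
    return constant + sum(weights[name] * value
                          for name, value in metrics.items())

# Hypothetical weights for a Little -> Big estimate:
weights = {"l2_misses": -0.002, "branch_mispredicts": -0.001, "mlp": 0.1}
metrics = {"l2_misses": 50, "branch_mispredicts": 100, "mlp": 1.5}
estimate_inactive_ipc(metrics, weights, constant=2.0)
# -> approximately 1.95
```

In hardware this reduces to a handful of multiply-accumulates per quantum, which is what makes a per-quantum decision cheap enough to take.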

Evaluation
• Big uEngine: 3-wide O3 @ 1.0 GHz, 12-stage pipeline, 128 ROB entries, 128-entry register file
• Little uEngine: 2-wide in-order @ 1.0 GHz, 8-stage pipeline, 32-entry register file
• Memory system: 32 KB L1 i/d caches (1-cycle access), 1 MB L2 cache (15-cycle access), 1 GB main memory (80-cycle access)
• Controller: 5% performance loss relative to the all-Big core

Little Engine Utilization
[Figure: Little engine utilization vs. quantum length (instructions), from fine-grained quanta to the traditional OS-based range, for astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng, and the average]
• 3-wide O3 (Big) vs. 2-wide in-order (Little)
• 5% performance loss relative to all-Big
• More time on the little engine with the same performance loss

Engine Switches
[Figure: switches per million instructions vs. quantum length, per benchmark, with callouts at ~1 switch / 306 instructions and ~1 switch / 2800 instructions]
• Need LOTS of switching to maximize utilization

Performance Relative to Big
[Figure: Composite Cores performance relative to the all-Big core (80%-105%) vs. quantum length, per benchmark, with a callout at astar (quantum length = 1000)]
• Switching overheads negligible until ~1000 instructions

Fine-Grained vs. Coarse-Grained
• The Little uEngine's average power is 8% higher
  – Due to shared hardware structures
• Fine-grained can map 41% more instructions to the Little uEngine than coarse-grained
• Results in an overall 27% decrease in average power over coarse-grained

Decision Techniques
1. Oracle: knows both uEngines' performance for all quanta
2. Perfect Past: knows both uEngines' past performance perfectly
3. Model: knows only the active uEngine's past; models the inactive uEngine using default weights
• All models target 95% of the all-Big uEngine's performance

Dynamic Instructions on Little Engine
[Figure: Little engine utilization under Oracle, Perfect Past, and Model, for each benchmark and the average]
• Maps 25% of the dynamic instructions onto the Little uEngine
• High utilization for memory-bound applications
• Issue width dominates for computation-bound applications

Energy Savings Relative to Big
[Figure: energy savings under Oracle, Perfect Past, and Model, for each benchmark and the average]
• 18% reduction in energy consumption
• Includes the overhead of shared hardware structures

User-Configured Performance
[Figure: Little engine utilization, overall performance, and energy savings at 1%, 5%, 10%, and 20% allowed performance loss]
• 1% performance loss yields 4% energy savings
• 20% performance loss yields 44% energy savings

More Details in the Paper
• Estimated uEngine area overheads
• uEngine model accuracy
• Switching timing diagram
• Hardware sharing overheads analysis

Conclusions
• Even high performance applications experience fine-grained phases of low throughput
  – Map those to a more efficient core
• Composite Cores allows
  – Fine-grained migration between cores
  – Low overhead switching
• 18% energy savings by mapping 25% of the instructions to the Little uEngine with a 5% performance loss
Questions?

Back Up

The DVFS Question
• Lower voltage is useful when:
  – L2 miss (stalled on commit)
• Little uArch is useful when:
  – Stalled on L2 miss (stalled at issue)
  – Frequent branch mispredicts (shorter pipeline)
  – Dependent computation
http://www.arm.com/files/downloads/big_LITTLE_Final.pdf

Sharing Overheads
[Figure: average power relative to the Big core; Big uEngine vs. Big core and Little uEngine vs. Little core, for each benchmark and the average]

Performance Relative to Big
[Figure: performance under Oracle, Perfect Past, and Model (90%-103%), for each benchmark and the average]
• 5% performance loss

Model Accuracy
[Figure: histograms of the model's percent deviation from actual performance, for the Little-to-Big and Big-to-Little estimates]

Regression Coefficients
[Figure: relative coefficient magnitudes for the Little-to-Big and Big-to-Little models: L2 miss, branch mispredicts, ILP, L2 hit, MLP, active uEngine cycles, and a constant]

Different Than Kumar et al.
• Kumar et al.
  – Coarse-grained switching
  – OS managed
  – Minimal shared state (L2s)
  – Requires sampling
  – 6-wide O3 vs. 8-wide O3; has an in-order core, but never uses it!
• Composite Cores
  – Fine-grained switching
  – Hardware managed
  – Maximizes shared state (L2s, L1s, branch predictor, TLBs)
  – On-the-fly prediction
  – 3-wide O3 vs. 2-wide in-order
• Coarse-grained vs. fine-grained

Register File Transfer
[Diagram: the RAT maps register numbers to physical registers; values move from the active engine's registers to the other engine's register file]
• 3-stage pipeline:
  1. Map to the physical register in the RAT
  2. Read the physical register
  3. Write to the new register file
• If commit updates a register, repeat
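The three steps above can be sketched in software as a walk over the RAT; the dict-based representation and names are assumptions for illustration:

```python
def transfer_registers(rat, physical_regs):
    """Copy architectural state between register files, mirroring the
    slide's 3-stage pipeline: RAT lookup, read, write."""
    new_regfile = {}
    for arch_reg, phys_num in rat.items():  # 1. map via the RAT
        value = physical_regs[phys_num]     # 2. read the physical register
        new_regfile[arch_reg] = value       # 3. write the new register file
    return new_regfile

# If commit updates a register mid-transfer, the hardware repeats the
# affected entries; this sketch assumes a quiesced core.
transfer_registers({"r1": 5, "r2": 9}, {5: 42, 9: 7})
# -> {"r1": 42, "r2": 7}
```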

uEngine Model
[Equation slide: the uEngine performance model]