
Composite Cores: Pushing Heterogeneity into a Core
Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
Micro 45, May 8th 2012

High Performance Cores
[Figure: Performance and Energy vs. Time]
• High energy yields high performance
• Low performance DOES NOT yield low energy
• High performance cores waste energy on low performance phases

Core Energy Comparison
[Figure: core energy comparison, Out-of-Order (Brooks, ISCA ’00) vs. In-Order (Dally, IEEE Computer ’08)]
• Out-of-Order contains performance-enhancing hardware
• Not necessary for correctness
Do we always need the extra hardware?

Previous Solution: Heterogeneous Multicore
• 2+ Cores
• Same ISA, different implementations
  – High performance, but more energy
  – Energy efficient, but less performance
• Share memory at high level
  – Share L2 cache (Kumar ’04)
  – Coherent L2 caches (ARM’s big.LITTLE)
• Operating System (or programmer) maps application to smallest core that provides needed performance

Current System Limitations
• Migration between cores incurs high overheads
  – 20K cycles (ARM’s big.LITTLE)
• Sample-based schedulers
  – Sample different cores’ performance and then decide whether to reassign the application
  – Assume stable performance within a phase
• Phase must be long to be recognized and exploited
  – 100M-500M instructions in length
Do finer grained phases exist? Can we exploit them?

Performance Change in GCC
[Figure: Instructions / Cycle (0 to 3) vs. Instructions for the Big Core and Little Core]
• Average IPC over a 1M instruction window (Quantum)
• Average IPC over 2K Quanta

Finer Quantum
[Figure: Instructions / Cycle (0 to 3) vs. Instructions (160K to 180K) for the Big Core and Little Core]
• 20K instruction window from GCC
• Average IPC over 100 instruction quanta
What if we could map these to a Little Core?
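The per-quantum averaging behind these two plots can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `ipc_per_quantum` and the `retire_cycles` input (the cycle at which each instruction retired) are assumptions made for the example. It shows how a coarse quantum averages away a short low-IPC phase that a fine quantum exposes.

```python
def ipc_per_quantum(retire_cycles, quantum):
    """Average IPC over consecutive fixed-size instruction quanta.

    retire_cycles[i] is the (hypothetical) cycle at which instruction i retired.
    Trailing instructions that do not fill a whole quantum are ignored.
    """
    ipcs = []
    start_cycle = 0
    for end in range(quantum, len(retire_cycles) + 1, quantum):
        end_cycle = retire_cycles[end - 1]
        cycles = end_cycle - start_cycle
        ipcs.append(quantum / cycles if cycles > 0 else float("inf"))
        start_cycle = end_cycle
    return ipcs


# Toy trace: four fast instructions followed by four slow ones.
trace = [1, 1, 2, 2, 4, 6, 8, 10]
fine = ipc_per_quantum(trace, 4)    # two quanta: a fast phase and a slow phase
coarse = ipc_per_quantum(trace, 8)  # one quantum: the slow phase is averaged away
```

With the toy trace, the fine quanta separate a high-IPC phase from a low-IPC one, while the single coarse quantum reports only their blend.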

Our Approach: Composite Cores
• Hypothesis: Exploiting fine-grained phases allows more opportunities to run on a Little core
• Problems
  I. How to minimize switching overheads?
  II. When to switch cores?
• Questions
  I. How fine-grained should we go?
  II. How much energy can we save?

Problem I: State Transfer
[Figure: two separate cores. Out-of-order pipeline: Fetch, Decode, Rename, O3 Execute, with iCache, iTLB, Branch Pred, RAT, Reg File, dCache, dTLB. In-order pipeline: Fetch, Decode, InO Execute, with iCache, iTLB, Branch Pred, Reg File, dCache, dTLB. State size labels: 10s of KB, <1 KB, 10s of KB]
• State transfer costs can be very high: ~20K cycles (ARM’s big.LITTLE)
• Limits switching to coarse granularity: 100M Instructions (Kumar ’04)

Creating a Composite Core
[Figure: a Big uEngine (Decode, RAT, Reg File, O3 Execute, Load/Store Queue) and a Little uEngine (Decode, Reg File, Mem, InO Execute) sharing Fetch, iCache, iTLB, Branch Pred, dCache, dTLB, and a Controller; state size label: <1 KB]
• Only one uEngine active at a time

Hardware Sharing Overheads
• Big uEngine needs
  – High fetch width
  – Complex branch prediction
  – Multiple outstanding data cache misses
• Little uEngine wants
  – Low fetch width
  – Simple branch prediction
  – Single outstanding data cache miss
• Must build shared units for Big uEngine
  – Over-provision for Little uEngine
• Assume clock gating for inactive uEngine
  – Still has static leakage energy
Little pays ~8% energy overhead to use over-provisioned fetch + caches

Problem II: When to Switch
• Goal: Maximize time on the Little uEngine subject to maximum performance loss
• User-Configurable
• Traditional OS-based schedulers won’t work
  – Decisions too frequent
  – Needs to be made in hardware
• Traditional sampling-based approaches won’t work
  – Performance not stable for long enough
  – Frequent switching just to sample wastes

What uEngine to Pick
[Figure: Instructions / Cycle (0 to 3) vs. Instructions for the Big Core and Little Core, with the Difference between them and regions marked Run on Big and Run on Little]
• This value is hard to determine a priori, depends on application
  – Use a controller to learn appropriate value over time
Let user configure the target value

Reactive Online Controller
[Figure: block diagram with Big Model, Little Model, User-Selected Performance, Threshold Controller, and Switching Controller; the Switching Controller output (True / False) selects between the Little uEngine and the Big uEngine]
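One plausible reading of this block diagram is sketched below. The slide names the blocks only, so everything else is an assumption for illustration: the class name `ReactiveController`, the integral-style threshold update, the `gain` value, and the rule of comparing the estimated IPC gap against the threshold. It is not presented as the authors' controller.

```python
class ReactiveController:
    """Toy feedback loop: a threshold controller tunes the allowed IPC gap,
    and a switching controller compares model estimates against it."""

    def __init__(self, target_fraction, gain=0.1):
        self.target_fraction = target_fraction  # e.g. 0.95 of all-Big performance
        self.gain = gain                        # hypothetical feedback gain
        self.threshold = 0.0                    # allowed IPC gap (Big minus Little)
        self.big_cycles = 0.0                   # estimated cycles had everything run on Big
        self.actual_cycles = 0.0                # cycles actually spent

    def choose_little(self, big_ipc_est, little_ipc_est):
        # Switching controller: run on Little when the estimated loss is small enough.
        return (big_ipc_est - little_ipc_est) <= self.threshold

    def update(self, quantum, big_ipc_est, achieved_ipc):
        # Threshold controller: loosen the threshold while ahead of the
        # user-selected target, tighten it when behind.
        self.big_cycles += quantum / big_ipc_est
        self.actual_cycles += quantum / achieved_ipc
        relative_perf = self.big_cycles / self.actual_cycles
        self.threshold += self.gain * (relative_perf - self.target_fraction)
        self.threshold = max(self.threshold, 0.0)
```

Called once per quantum, `update` keeps a running estimate of performance relative to an all-Big run, so the threshold reacts to accumulated slack rather than to a single quantum.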

uEngine Modeling
while(flag){ foo(); flag = bar(); }
Little uEngine IPC: 1.66
Big uEngine IPC: ??? (2.15)
Collect metrics of active uEngine
• iL1, dL1 cache misses
• L2 cache misses
• Branch Mispredicts
• ILP, MLP, CPI
Use a linear model to estimate inactive uEngine’s performance
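A linear model of this kind can be sketched as a weighted sum of the collected metrics plus a constant. The sketch below is an assumption-laden illustration: the feature names follow the terms listed on the backup slide "Regression Coefficients", the choice to predict cycles and then convert to IPC is ours, and every weight is a made-up placeholder rather than a fitted coefficient.

```python
# Feature names borrowed from the "Regression Coefficients" backup slide.
FEATURES = ("l2_miss", "branch_mispredicts", "ilp", "l2_hit", "mlp", "active_cycles")


def estimate_inactive_ipc(metrics, weights, constant, quantum):
    """Estimate the inactive uEngine's IPC for one quantum.

    metrics:  counters observed on the active uEngine during the quantum
    weights:  per-feature coefficients (hypothetical values in this example)
    constant: intercept term of the linear model
    quantum:  number of instructions in the quantum
    """
    est_cycles = constant + sum(weights[name] * metrics[name] for name in FEATURES)
    return quantum / est_cycles


# Placeholder numbers only: a 1000-instruction quantum observed on the Little uEngine.
metrics = {"l2_miss": 2, "branch_mispredicts": 6, "ilp": 0, "l2_hit": 0,
           "mlp": 0, "active_cycles": 600}
weights = {"l2_miss": 20.0, "branch_mispredicts": 10.0, "ilp": 0.0, "l2_hit": 0.0,
           "mlp": 0.0, "active_cycles": 0.5}
big_ipc_estimate = estimate_inactive_ipc(metrics, weights, constant=100.0, quantum=1000)
```

A separate weight set would be needed for each direction (Little -> Big and Big -> Little), since the active uEngine supplying the metrics differs.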

Evaluation
Architectural Feature    Parameters
Big uEngine              3 wide O3 @ 1.0 GHz
                         12 stage pipeline
                         128 ROB Entries
                         128 entry register file
Little uEngine           2 wide InOrder @ 1.0 GHz
                         8 stage pipeline
                         32 entry register file
Memory System            32 KB L1 i/d cache, 1 cycle access
                         1 MB L2 cache, 15 cycle access
                         1 GB Main Mem, 80 cycle access
Controller               5% performance loss relative to all big core

Little Engine Utilization
[Figure: Little Engine Utilization (0% to 100%) vs. Quantum Length (Instructions), from Fine-Grained Quantum to Traditional OS-Based, for astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng, and average]
• 3-Wide O3 (Big) vs. 2-Wide InOrder (Little)
• 5% performance loss relative to all Big
More time on little engine with same performance loss

Engine Switches
[Figure: Switches / Million Instructions (0 to 5000) vs. Quantum Length (Instructions) for astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng, and average; annotations: ~1 Switch / 306 Instructions and ~1 Switch / 2800 Instructions]
Need LOTS of switching to maximize utilization
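The two annotations convert to the plot's units with simple arithmetic, sketched here. Pairing these switch rates with the ~20K cycle migration cost quoted earlier for ARM's big.LITTLE is our own illustration of why low-overhead switching matters, not a measurement from the talk; the function names are hypothetical.

```python
def switches_per_million(instructions_per_switch):
    """Convert 'one switch every N instructions' to switches per million instructions."""
    return 1_000_000 / instructions_per_switch


def migration_cycles_per_million(instructions_per_switch, cycles_per_switch):
    """Total switching cycles paid per million instructions at a given per-switch cost."""
    return switches_per_million(instructions_per_switch) * cycles_per_switch


fine_rate = switches_per_million(306)     # about 3268 switches per million instructions
coarse_rate = switches_per_million(2800)  # about 357 switches per million instructions

# Hypothetical pairing: a 20K cycle migration at the finer switch rate would cost
# roughly 65 million cycles per million instructions.
heavy_cost = migration_cycles_per_million(306, 20_000)
```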

Performance Loss
[Figure: Composite Cores Performance Relative to Big (80% to 105%) vs. Quantum Length (Instructions) for astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng, and average; annotation: (Quantum Length = 1000)]
Switching overheads negligible until ~1000 instructions

Fine-Grained vs. Coarse-Grained
• Little uEngine’s average power 8% higher
  – Due to shared hardware structures
• Fine-Grained can map 41% more instructions to the Little uEngine over Coarse-Grained
• Results in overall 27% decrease in average power over Coarse-Grained

Decision Techniques
1. Oracle: Knows both uEngines’ performance for all quanta
2. Perfect Past: Knows both uEngines’ past performance perfectly
3. Model: Knows only active uEngine’s past, models inactive uEngine using default weights
All models target 95% of the all Big uEngine’s performance
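The three techniques differ only in what information the decision sees, which the sketch below tries to make concrete. It is an illustration under stated assumptions: the shared rule `prefers_little` (a fixed allowed IPC gap), the trace layout of `(big_ipc, little_ipc)` pairs per quantum, the choice to start on Big, and the `estimate_other` callback are all hypothetical.

```python
def prefers_little(big_ipc, little_ipc, max_gap):
    """Run on Little when the IPC given up by leaving Big is at most max_gap."""
    return big_ipc - little_ipc <= max_gap


def oracle(trace, q, max_gap):
    # Sees the true performance of both uEngines for the quantum being decided.
    big, little = trace[q]
    return prefers_little(big, little, max_gap)


def perfect_past(trace, q, max_gap):
    # Sees the true performance of both uEngines, but only for the previous quantum.
    if q == 0:
        return False  # no history yet: start on Big (an arbitrary choice)
    big, little = trace[q - 1]
    return prefers_little(big, little, max_gap)


def model(trace, q, max_gap, on_little, estimate_other):
    # Sees only the active uEngine's previous quantum; the other side is estimated.
    if q == 0:
        return False
    big, little = trace[q - 1]
    if on_little:
        big = estimate_other(little, True)
    else:
        little = estimate_other(big, False)
    return prefers_little(big, little, max_gap)


# Toy trace of (big_ipc, little_ipc) per quantum: a brief low-throughput phase at q = 1.
trace = [(2.0, 1.0), (1.2, 1.1), (2.0, 1.0)]
oracle_choices = [oracle(trace, q, 0.2) for q in range(3)]
past_choices = [perfect_past(trace, q, 0.2) for q in range(3)]
```

On the toy trace the oracle picks Little exactly during the low-throughput quantum, while the past-based decision reacts one quantum late.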

Little Engine Utilization
[Figure: Dynamic Instructions On Little (0% to 100%) under Oracle, Perfect Past, and Model for Astar, Bzip2, Gcc, GoBmk, H264ref, Hmmer, Mcf, OmnetPP, Sjeng, and Average]
High utilization for memory bound application
Issue width dominates computation bound
Maps 25% of the dynamic instructions onto the Little uEngine

Energy Savings
[Figure: Energy Savings Relative to Big (0% to 100%) under Oracle, Perfect Past, and Model for Astar, Bzip2, Gcc, GoBmk, H264ref, Hmmer, Mcf, OmnetPP, Sjeng, and Average]
• Includes the overhead of shared hardware structures
18% reduction in energy consumption

User-Configured Performance
[Figure: Utilization, Overall Performance, and Energy Savings (0% to 100%) at 1%, 5%, 10%, and 20% allowed performance loss]
1% performance loss yields 4% energy savings
20% performance loss yields 44% energy savings

More Details in the Paper
• Estimated uEngine area overheads
• uEngine model accuracy
• Switching timing diagram
• Hardware sharing overheads analysis

Conclusions
• Even high performance applications experience fine-grained phases of low throughput
  – Map those to a more efficient core
• Composite Cores allows
  – Fine-grained migration between cores
  – Low overhead switching
• 18% energy savings by mapping 25% of the instructions to Little uEngine with a 5% performance loss
Questions?


Back Up

The DVFS Question
• Lower voltage is useful when:
  – L2 Miss (stalled on commit)
• Little uArch is useful when:
  – Stalled on L2 Miss (stalled at issue)
  – Frequent branch mispredicts (shorter pipeline)
  – Dependent Computation
http://www.arm.com/files/downloads/big_LITTLE_Final.pdf

Sharing Overheads
[Figure: Average Power Relative to the Big Core (0% to 110%) for the Big uEngine, Little Core, and Little uEngine across astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng, and average]

Performance Relative to Big
[Figure: Performance Relative to Big (90% to 103%) under Oracle, Perfect Past, and Model for Astar, Bzip2, Gcc, GoBmk, H264ref, Hmmer, Mcf, OmnetPP, Sjeng, and Average]
5% performance loss

Model Accuracy
[Figure: two histograms, Little -> Big and Big -> Little, of Percent of Quantums vs. Percent Deviation From Actual (-100% to 100%), comparing Model and Average Performance]

Regression Coefficients
[Figure: Relative Coefficient Magnitude (0% to 100%) for Little -> Big and Big -> Little; terms: L2 Miss, Branch Mispredicts, ILP, L2 Hit, MLP, Active uEngine Cycles, Constant]