Mirage Cores: The Illusion of Many Out-of-order Cores Using In-order Hardware
Shruti Padmanabha, Andrew Lukefahr*, Reetuparna Das, Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
MICRO-50, Boston, Oct 18, 2017
*Now at Indiana University

General purpose computer architectures
  • System throughput
  • Energy efficiency
  • Single-thread performance

Heterogeneous CMP architectures: out-of-order (OoO) core
  • High performance
  • Dynamically reorders instructions
  • Large, complex, one-size-fits-all design
  • Low energy efficiency
[Figure: CMP area occupied by an OoO core; axes: system throughput, energy efficiency, single-thread performance]

Heterogeneous CMP architectures: in-order (InO) core
  • Smaller, simpler design
  • Low area
  • Low power
  • Issues instructions in program order
  • Low performance
[Figure: chip area of an InO core compared with an OoO core; axes: system throughput, energy efficiency, single-thread performance]

Heterogeneous CMP architectures
[Figure: chip area shared by InO cores and an OoO core; axes: system throughput, energy efficiency, single-thread performance]

Mirage Cores - Objective
  • More InO cores
  • More OoO-like InO cores
[Figure: axes: single-thread performance, system throughput, energy efficiency]

Background: DynaMOS
"More OoO-like InO cores"
  • 70% of the traces have equivalent schedules for most of their lifetimes
  • OinO vs InO: performance 1.4x, area 1.2x, energy 1.4x
[Figure: program traces run on the OoO; their schedules are memoized ("Memoize!") in a schedule cache (Sched$) and reused by an InO core with additional OinO hardware]
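The memoization idea on this slide can be illustrated with a minimal sketch. It assumes a schedule is simply a list of instruction indices and models the Sched$ as a small map keyed by trace ID with FIFO eviction. The names `ScheduleCache` and `run_trace`, the capacity, and the eviction policy are illustrative assumptions, not details taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ScheduleCache:
    """Toy Sched$: maps a trace ID to a memoized issue order (hypothetical model)."""
    capacity: int = 4
    entries: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def lookup(self, trace_id):
        """Return the memoized schedule, or None on a miss."""
        schedule = self.entries.get(trace_id)
        if schedule is None:
            self.misses += 1
        else:
            self.hits += 1
        return schedule

    def memoize(self, trace_id, schedule):
        """Store a schedule, evicting the oldest entry when full (assumed FIFO)."""
        if trace_id not in self.entries and len(self.entries) >= self.capacity:
            # dicts preserve insertion order, so the first key is the oldest.
            self.entries.pop(next(iter(self.entries)))
        self.entries[trace_id] = list(schedule)


def run_trace(cache, trace_id, program_order, ooo_schedule):
    """Return the issue order an in-order core would use for this trace."""
    memoized = cache.lookup(trace_id)
    if memoized is not None:
        return memoized                      # OinO mode: replay the OoO schedule
    cache.memoize(trace_id, ooo_schedule)    # simplification: schedule arrives on a miss
    return list(program_order)               # InO mode: fall back to program order
```

In this toy model the first execution of a trace issues in program order and the second replays the reordered schedule, which is the behaviour the "Memoize!" label suggests.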

Mirage Cores: Motivations
  • Memoization opportunities vary based on program/phase characteristics
  • InO cores can utilize memoized traces for phases of millions of instructions
[Figure: execution timeline alternating between OinO and OoO phases]

Mirage Cores: Concept
[Figure: one OoO+ (OoO acting as schedule producer) alongside several InO+ (InO cores with OinO capability) on the chip area; axes: system throughput, energy efficiency, single-thread performance]

Mirage Cores: Challenges
  • Efficiently time-share the OoO
      • Architecture
      • Number of OinOs per OoO
      • Minimize overheads
  • Effectively arbitrate between applications
      • Metrics?
      • Goals?
[Figure: OoO+ (OoO schedule producer) shared by InO+ (OinO) cores]

Mirage Cores: Architecture
[Figure: block diagram with these labels: OoO core, arbitrator, and interconnect; multiple InO+ (OinO) cores, each with its own L1 i$, L1 d$, and Sched$; connections to a shared L2]

Arbitration Between Applications
[Figure: timeline divided into 1-million-cycle intervals; the arbitrator reads execution metrics from App 0, App 1, App 2, App 3, … and selects the OoO candidate for each interval]

Metrics for arbitration

Execution metric   Determines                   Measure
Memoizability      Single-application speedup   ∆Sched$-MPKI

Memoizability - ∆Sched$-MPKI
[Figure: relationship between performance (IPC) and Sched$-MPKI for bzip2, with the delta marked; x-axis: program in increasing order of 1M-cycle intervals]
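As a rough illustration of the measure, the sketch below computes Sched$ misses per thousand instructions (MPKI) for each 1M-cycle interval and takes the change between consecutive intervals. The slides do not spell out how the delta is defined, so the interval-to-interval difference used here is an assumption, as are the function names.

```python
def sched_mpki(misses, instructions):
    """Schedule-cache misses per thousand instructions for one interval."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


def delta_sched_mpki(intervals):
    """Change in Sched$-MPKI from each interval to the next.

    `intervals` is a list of (misses, instructions) pairs, one per
    1M-cycle interval, in program order. The difference between
    consecutive intervals is an assumed definition of the delta.
    """
    mpki = [sched_mpki(m, n) for m, n in intervals]
    return [later - earlier for earlier, later in zip(mpki, mpki[1:])]
```

Under this reading, a large positive delta would suggest that memoized schedules have gone stale (for example at a phase change), while a delta near zero would suggest the in-order core is still replaying schedules effectively.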

Metrics for arbitration

Execution metric   Determines                   Measure
Memoizability      Single-application speedup   ∆Sched$-MPKI
Slowdown           System throughput
Time on OoO        Fairness

Goals for arbitration
  1. Maximize energy efficiency
  2. Maximize system throughput
  3. Guarantee fair/priority-based resource allocation
[Figure labels: OoO, Traditional Heterogeneous CMP]
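One way to picture how the three metrics could feed a single arbitration decision is the sketch below. The slides do not give the actual policy, so the scoring formula, the `fairness_weight` parameter, and the names `AppStats` and `pick_ooo_candidate` are purely hypothetical; only the 1-million-cycle interval and the three metrics come from the slides.

```python
from dataclasses import dataclass

INTERVAL_CYCLES = 1_000_000  # arbitration interval shown on the slides


@dataclass
class AppStats:
    name: str
    delta_sched_mpki: float  # memoizability signal for the last interval
    slowdown: float          # how much the app is slowed on its InO core
    ooo_intervals: int = 0   # intervals already spent on the OoO (fairness)


def pick_ooo_candidate(apps, fairness_weight=0.1):
    """Return the app that gets the OoO for the next interval.

    Hypothetical policy: favour apps whose schedules have gone stale and
    that are running slowly, and penalise time already spent on the OoO.
    """
    def score(app):
        return (app.delta_sched_mpki * app.slowdown
                - fairness_weight * app.ooo_intervals)

    chosen = max(apps, key=score)
    chosen.ooo_intervals += 1
    return chosen
```

Raising `fairness_weight` shifts this toy policy from throughput toward fair sharing, which mirrors the tension between goals 2 and 3 above.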

Evaluation Methodology

Architectural feature   Parameters
OoO core                3-wide O3 @ 2 GHz; 12-stage pipeline; 128 ROB entries; 128-entry PRF; 32-entry LSQ
InO core                3-wide in-order @ 2 GHz; 8-stage pipeline; 128-entry PRF; 32-entry LSQ
Memory system           32 KB L1 i/d cache, 2-cycle access; 8 KB schedule cache, 1-cycle access; 1 MB L2 cache, 15-cycle access; 1 GB main memory, 100-cycle access
Simulator               gem5
Energy model            McPAT

Evaluation: Experimental parameters

Parameter         Value
Number of cores   n InO + 1 OoO
Baseline          n OoO
Workloads         Random mixes of n benchmarks from SPEC CPU2006; each run for a 1-billion-instruction SimPoint
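The workload construction can be sketched as follows. The slides only say that mixes are random, so sampling with replacement, the seed handling, and the restriction to the SPEC CPU2006 integer benchmarks are assumptions made for this example; `random_mix` is a hypothetical helper.

```python
import random

# SPEC CPU2006 integer benchmarks; using only this subset is an assumption.
SPEC_INT_2006 = [
    "perlbench", "bzip2", "gcc", "mcf", "gobmk", "hmmer",
    "sjeng", "libquantum", "h264ref", "omnetpp", "astar", "xalancbmk",
]


def random_mix(n, seed=None):
    """Pick n benchmarks at random, one per InO core.

    Sampling with replacement is assumed so that n may exceed the
    number of benchmarks in the list (for example, a 16-core cluster).
    """
    rng = random.Random(seed)
    return [rng.choice(SPEC_INT_2006) for _ in range(n)]
```

Passing a fixed `seed` makes a mix reproducible, so the same workload can be replayed on each architecture being compared.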

Architectures for comparison (8:1 configuration)
  • Traditional heterogeneous CMP: InO cores + OoO (booster)
  • Mirage Core CMP: InO/OinO cores + OoO with schedule producer
  • Homogeneous OoO CMP: OoO cores only
  • Homogeneous InO CMP: InO cores only

8 InOs with 1 OoO
  • 54% energy savings over a homogeneous OoO CMP with a 16% STP loss, or
  • 24% STP gain over a homogeneous InO CMP with a 14% energy overhead
[Figure: performance relative to Homo-OoO and energy relative to Homo-OoO for Traditional Het-CMP (no memoization), Mirage Cores, Homo-OoO, and Homo-InO; application labels App 0 through App 7]
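The two headline numbers can be combined into a single figure of merit. The sketch below computes a relative energy-delay product (EDP), assuming that energy is measured for a fixed amount of work and that delay scales as the inverse of system throughput. EDP does not appear on the slides; it is introduced here only to show the arithmetic.

```python
def relative_edp(relative_energy, relative_throughput):
    """Energy-delay product relative to a baseline (lower is better).

    Assumes delay for a fixed amount of work is 1 / throughput.
    """
    if relative_throughput <= 0:
        raise ValueError("relative_throughput must be positive")
    return relative_energy / relative_throughput


# Versus homogeneous OoO: 54% energy savings, 16% STP loss.
edp_vs_homo_ooo = relative_edp(1 - 0.54, 1 - 0.16)

# Versus homogeneous InO: 14% energy overhead, 24% STP gain.
edp_vs_homo_ino = relative_edp(1 + 0.14, 1 + 0.24)
```

Under these assumptions the relative EDP comes out at roughly 0.55 against the homogeneous OoO CMP and roughly 0.92 against the homogeneous InO CMP, so the Mirage configuration would improve on both baselines by this measure.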

Size of cluster
  • The OoO is oversubscribed for n >= 12
[Figure: two panels, performance relative to Homo-OoO and utilization of the OoO (0% to 100%), plotted against the number of OinO cores per OoO (4, 8, 12, 16); series: Traditional Het-CMP (no memoization), Mirage Cores, Homo-OoO, Homo-InO]

Conclusion
  • Achieve > 80% of the performance of a homogeneous OoO CMP with > 20% area and > 50% energy savings
[Figure: OoO+ (OoO schedule producer) supplies reordered schedules to InO+ (OinO) cores; axes: system throughput, energy efficiency, single-thread performance]

Questions?