Erasing Core Boundaries for Robust and Configurable Performance
43rd International Symposium on Microarchitecture, December 7, 2010
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke
University of Michigan, Ann Arbor
University of Michigan Electrical Engineering and Computer Science

Multicore Architectures
• Industry-wide move to multicores: 2–16 cores on a single die
• Multiple challenges confront them:
  ► Single-thread performance
  ► Reliability
  ► Power density
  ► Memory bandwidth
  ► …
[Figure: die photos of Sun Niagara 2, IBM Cell, and Intel 4-core Nehalem]
• Our hypothesis: A highly configurable architecture can handle these issues in a unified manner.

Multicore Performance Challenge
1. Good throughput / parallel performance with more cores
2. Stagnating sequential performance
[Figure: CPU performance (log scale) for 486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, Core 2 Quad, and Core i7, with the power wall marked]
Spectrum of Applications
• Parallel workloads (scientific computing, newer web browsers, video decoding)
• Sequential workloads (legacy workloads, most mobile/desktop apps)
Need flexibility to provide both Parallel and Sequential performance

Solution: Configurable Performance
• Assign resources where they are needed…
• In an N core chip:
  ► Use all N cores for best Parallel Performance
  ► Group M cores together for Serial Performance (M < N)
[Figure: Parallel / Throughput versus Serial / Sequential configurations. Source: Mark D. Hill]
• Core Fusion, ISCA'07; Composable Lightweight Processors, MICRO'07

Multicore Reliability Challenge
[Figure labels, Todd Austin, GSRC Sep 08]
• Parametric Variability
• Intra-die variations in ILD thickness
• Hard Faults
• Electromigration (EM)
• Oxide breakdown (OBD)
• Negative Bias Threshold Inversion
• Thermal Runaway: Increased Heating, Higher Transistor Leakage, Higher Power Dissipation
• Manufacturing Defects That Escape Testing (Inefficient Burn-in Testing)
Need mechanisms for in-field silicon failures

Solution: Isolate Broken Resources
CORE level
• ElastIC, DT'06
• Configurable Isolation, ISCA'07
MODULE level
• Online Diagnosis of Hard Faults, MICRO'05
• Ultra Low-Cost Defect Protection, ASPLOS'06
STAGE level
• StageNet, MICRO'08
• Core Cannibalization, PACT'08
[Figure: Stage 1, Stage 2, Stage 3, … Stage N]
- StageNet decouples the pipeline stages
- Regular fabric, no global interconnections
- Any set of stages can be connected to form a pipeline

Point Solutions: Summary and Limitations
• Configurable Performance: Fuse cores for higher single-thread performance
• Reliability: Stage level isolation
[Figure: pipelines built from Stage 1, Stage 2, Stage 3, … Stage N]

Point Solutions: Summary and Limitations
1. Solve only one challenge at a time
2. Incur additive overheads, no resource overlap
3. Are incompatible with one another
Fuse cores for higher single-thread performance
• Tightly coupled resources
• Centralized structures for data and control management
Stage level isolation
• Decoupled resources
• Distributed data and control management
Our Goal: Design an architectural solution, which
1. Simultaneously targets configurable performance and reliability
2. Overlaps hardware changes, and
3. Resolves any conflicting requirements

The CoreGenesis (CG) Architecture
[Figure: grid of Fetch, Decode, Issue, and Ex/Mem stages joined by crossbar switches, with distributed structures]
• Regular grid of pipeline stages. No explicit core boundary.
• Stages interconnected by full crossbars
• Distributed structures for data and control management
1. Throughput
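The benefit of a grid without core boundaries can be sketched with a small count. This is a minimal illustration, not the authors' model: it assumes one stage of each type per original core, crossbars that never fail, and hypothetical function names.

```python
from collections import Counter

# Assumption: every pipeline needs exactly one working stage of each type.
STAGE_TYPES = ("fetch", "decode", "issue", "exmem")


def pipelines_with_core_disabling(num_cores, broken):
    """Count usable cores when one broken stage disables its whole core.

    `broken` is a set of (core_index, stage_type) pairs.
    """
    dead_cores = {core for core, _ in broken}
    return num_cores - len(dead_cores)


def pipelines_with_stage_isolation(num_cores, broken):
    """Count pipelines when any working stage of each type can be connected.

    With full crossbars between stage types, the count is limited by the
    stage type that has the fewest working instances.
    """
    broken_per_type = Counter(stage for _, stage in broken)
    return min(num_cores - broken_per_type[s] for s in STAGE_TYPES)


broken = {(0, "fetch"), (1, "decode"), (2, "issue")}
print(pipelines_with_core_disabling(4, broken))   # 1
print(pipelines_with_stage_isolation(4, broken))  # 3
```

With three faults spread over three cores, core disabling keeps one pipeline while stage level isolation keeps three, because the surviving stages can be regrouped.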

The CoreGenesis (CG) Architecture
[Figure: a single pipeline processor and a conjoined pipelines processor formed from Fetch, Decode, Issue, and Ex/Mem stages]
1. Throughput
2. Reliability
3. Configurable Performance
Advantages:
1. Unified performance / reliability solution
2. Overlaps hardware overheads
3. Regular fabric
4. No centralized resources for fetch, issue, operand copying

CG – Microarchitectural Hurdles
• Hurdles: Control Flow, Register Data Flow, Memory Data Flow, Instruction Issue
• Single Pipeline: Solved by the StageNet design, MICRO'08 (Instruction Issue: N/A)
  ► Stream Identification bits for Control Flow
  ► Bypass cache (inside EXEC-stage) for Register Data Flow
• Conjoined Pipelines: all four hurdles open (?)
  Control Flow
  - Instruction sequence needs to be managed across fetch stages
  Register and Memory Data Flow
  - Detection of cross pipeline register and memory data flow violations
  - Recovery to a consistent architectural state
  Instruction Issue
  - Segregate data flow chains between pipelines

CG – Overview
Conjoined pipelines processor
[Figure: Fetch, Decode, Issue, and Ex/Mem stages annotated with the mechanisms below]
• Distributed fetch
• Distributed decode
• Decentralized instruction issue (broadcasted)
• Detection of data flow violations
• In-order writeback
1. Control Flow
2. Register Data Flow tracking
3. Memory Data Flow tracking
4. Replay Mechanism
5. Instruction Issue

CG – Control Flow
Distributed Fetch
- Pipelines fetch alternate instructions
- Branch predictors are kept in sync
[Figure: one Fetch/Decode pair takes instructions 1, 3, 5, 7, 9; the other takes 2, 4, 6, 8, 10]
Advantages
- Evenly splits the work (fetch, decode, issue) between two pipelines
- No explicit communication required for control decisions
- Consistent control decisions due to mirrored branch predictors
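The alternating fetch pattern can be sketched in a few lines. The function name `distributed_fetch` is hypothetical, and instructions are represented only by their position in program order.

```python
def distributed_fetch(instructions, num_pipelines=2):
    """Split a sequential instruction stream round-robin across pipelines."""
    return [instructions[p::num_pipelines] for p in range(num_pipelines)]


stream = list(range(1, 11))  # instructions 1..10 in program order
pipe_a, pipe_b = distributed_fetch(stream)
print(pipe_a)  # [1, 3, 5, 7, 9]
print(pipe_b)  # [2, 4, 6, 8, 10]
```

Each pipeline can compute its own share from the program order alone, which is consistent with the slide's point that no explicit communication is needed for control decisions.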

CG – Data Flow
Across-pipeline dependencies are tricky…
[Figure: Split instruction stream, then local decisions and execution, then compare notes at commit time, then replay if any dependency was violated]
Register Data Flow
1. Issue stages locally maintain a table of source registers
2. Issue stages monitor write-backs, and detect if any other pipeline updates a source for an outstanding instruction
3. Missed dependency: initiate a light-weight replay
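The three register data flow steps can be sketched as a toy issue stage. This is an illustration under stated assumptions, not the hardware design: the class and method names are hypothetical, instructions carry a program-order sequence number, and only a write-back from an older instruction on another pipeline counts as a missed dependency.

```python
class IssueStage:
    """Toy model of one issue stage's register data flow tracking."""

    def __init__(self, pipeline_id):
        self.pipeline_id = pipeline_id
        # Step 1: local table, sequence number -> source registers of
        # instructions that have issued but not yet committed.
        self.outstanding = {}

    def issue(self, seq, sources):
        self.outstanding[seq] = set(sources)

    def commit(self, seq):
        self.outstanding.pop(seq, None)

    def observe_writeback(self, writer_pipeline, writer_seq, dest):
        """Steps 2 and 3: return sequence numbers that must replay."""
        if writer_pipeline == self.pipeline_id:
            return []  # assumption: local producers are handled locally
        return sorted(
            seq
            for seq, sources in self.outstanding.items()
            if dest in sources and writer_seq < seq
        )


issue1 = IssueStage(pipeline_id=1)
issue1.issue(seq=3, sources=["R2"])  # instruction 3 reads R2 on pipeline 1
replays = issue1.observe_writeback(writer_pipeline=2, writer_seq=2, dest="R2")
print(replays)  # [3]
```

Here instruction 3 issued on pipeline 1 before pipeline 2 wrote back R2, so the monitor flags instruction 3 for replay.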

CG – Register Data Flow: Example
[Figure: pipeline 1 issues and executes instructions 1 and 3; pipeline 2 issues and executes instructions 2 and 4]
Scenario A
1. R1 = …
2. R2 = …
3. … = R1
4. … = R2
Scenario B
1. R1 = …
2. R2 = …
3. … = R2
Data flow violation! Pipeline 1 used a stale value of R2
How can we avoid these violations?

CG – Instruction Issue
• Instructions can be:
  ► Straight steered
  ► Cross steered
[Figure: two Issue stages, each with a straight path and a cross path to the Ex/Mem stages]
• Objective: match producers and consumers
• Mismatch → Data Flow violation → Replay
• Solution: Use static compiler analysis to generate steering hints
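The effect of steering on producer/consumer matching can be sketched by counting mismatches for two policies. Everything here is an assumption for illustration: the greedy `follow_producer` policy stands in for the compiler's static analysis (it is not the clustering approach used in the talk), the four-instruction program is hypothetical, and a mismatch is counted whenever a consumer executes on a different pipeline than the last producer of one of its sources.

```python
def count_mismatches(instructions, steer):
    """Count consumers that execute away from one of their producers.

    `instructions` is a list of (dest, sources) tuples in program order.
    `steer(index, fetch_pipe, producer_pipes)` returns the execute pipeline.
    """
    last_writer_pipe = {}  # register -> pipeline that last produced it
    mismatches = 0
    for index, (dest, sources) in enumerate(instructions):
        fetch_pipe = index % 2  # pipelines fetch alternate instructions
        producer_pipes = [
            last_writer_pipe[s] for s in sources if s in last_writer_pipe
        ]
        exec_pipe = steer(index, fetch_pipe, producer_pipes)
        if any(p != exec_pipe for p in producer_pipes):
            mismatches += 1
        if dest is not None:
            last_writer_pipe[dest] = exec_pipe
    return mismatches


def always_straight(index, fetch_pipe, producer_pipes):
    """Execute wherever the instruction was fetched."""
    return fetch_pipe


def follow_producer(index, fetch_pipe, producer_pipes):
    """Cross steer only when a known producer sits on the other pipeline."""
    return producer_pipes[0] if producer_pipes else fetch_pipe


program = [
    ("R1", []),      # 1. R1 = ...
    ("R2", []),      # 2. R2 = ...
    (None, ["R2"]),  # 3. ... = R2
    (None, ["R1"]),  # 4. ... = R1 (hypothetical fourth instruction)
]
print(count_mismatches(program, always_straight))  # 2
print(count_mismatches(program, follow_producer))  # 0
```

In this toy program both consumers are fetched on the opposite pipeline from their producers, so straight steering mismatches twice while steering each consumer to its producer removes both.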

CG – Instruction Issue: Example
[Figure: data flow graph of instructions 1 to 10 with a critical cross dependency; fetch order is 1, 3, 5, 7, 9 on one Issue stage and 2, 4, 6, 8, 10 on the other, each with straight and cross paths to Ex/Mem]
Always straight steering
• Ignores data dependencies
• Number of replays = 5
Compiler orchestrated steering
• Use clustering algorithms
• Accounts for dependencies and communication delays
• Number of replays = 0

CG – Design Summary
[Figure: Fetch, Decode, Issue, and Ex/Mem stages]
1. Control Flow
  - Pipelines fetch alternate instructions
  - Branch predictors kept in sync
2. Register Data Flow
  - Maintain local data flow information
  - Check the decisions at writeback
3. Memory Data Flow tracking
4. Replay Mechanism
5. Instruction Issue
  - Steer consumers to producers
  - Leverage static compiler analysis

Evaluation Methodology
• Liberty Simulation Infrastructure
  ► For cycle accurate simulations
• Trimaran Compilation System
  ► For instruction steering hints
Microarchitectural Parameters
  Branch predictor: Global, 16-bit, gshare predictor
  Level 1 I/D cache: 4-way, 16 KB, 1 cycle latency
  Level 2 unified cache: 8-way, 64 KB, 5 cycle latency
• Experiments:
  ► Single-thread performance gain from conjoining
  ► Throughput improvement from conjoining (at low utilizations)
  ► Throughput sustainability (in face of failures)
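For readers unfamiliar with the gshare predictor listed in the parameters, a textbook version can be sketched as below. This is not the simulator's implementation: it assumes "16-bit" means 16 bits of global history, uses 2-bit saturating counters, and models predictors kept in sync simply by feeding both copies the same branch outcomes.

```python
class GsharePredictor:
    """Textbook gshare: PC XOR global history indexes 2-bit counters."""

    def __init__(self, history_bits=16):
        self.mask = (1 << history_bits) - 1
        self.history = 0
        # One 2-bit counter per index, starting at weakly not-taken.
        self.counters = [1] * (1 << history_bits)

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Predict taken when the counter is in one of the two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


# Two mirrored predictors that observe the same outcomes stay identical.
pred_a, pred_b = GsharePredictor(), GsharePredictor()
outcomes = [(0x40, True), (0x44, False), (0x40, True), (0x48, True)]
for pc, taken in outcomes:
    for predictor in (pred_a, pred_b):
        predictor.update(pc, taken)
print(pred_a.predict(0x40) == pred_b.predict(0x40))  # True
```

Because the predictor is deterministic, two copies with identical update streams always agree, which is the property the mirrored predictors rely on.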

Sequential Performance
[Figure: Normalized IPC (0 to 2.5) per benchmark for Baseline (1 issue), CG Single Pipeline (1 issue), CG Conjoint Pipelines (2 issue), and Baseline (2 issue). Benchmarks: 164.gzip, 175.vpr, 176.gcc, 197.parser, 256.bzip2, 300.twolf, 172.mgrid, 173.applu, 177.mesa, 179.art, 183.equake, 188.ammp, vitoneloop, rls, sobel, heat, Average]

Throughput at varying utilization
[Figure: Throughput (IPC, 0 to 8) versus system utilization (number of threads / number of cores, 0 to 1) for an 8-core traditional multicore and an 8-core CoreGenesis chip]

Throughput Sustainability (Reliability)
[Figure: Throughput (IPC, 0 to 7) versus time (years, 0 to 9) for an 8-core traditional multicore and an 8-core CoreGenesis chip]

Conclusions
• Architectural flexibility can tackle multiple multicore challenges
• CoreGenesis is our attempt at a unified performance and reliability solution
  ► Decentralized instruction flow management to combine resources for higher single-thread performance
  ► Decoupled pipeline architecture to allow stage level reconfiguration
• Results:
  ► Combining two single issue pipelines gives 40% speedup
  ► Sustains the same throughput for up to 70% longer
  ► Overheads: 20% area, 17% power

Thank you
Erasing Core Boundaries for Robust and Configurable Performance
http://cccp.eecs.umich.edu

Back up slides

Traditional Solutions and CoreGenesis (CG)
Traditional point solutions
(A) Dynamic Multicore. Cores can fuse together when sequential performance is needed.
(B) Core Disabling. Isolates broken cores (red). Sustains throughput only in low failure rates.
(C) Heterogeneous CMP. Maintains a variety of cores to offer power-proportional computing.
CoreGenesis Vision
[Figure labels: Throughput, Sequential]
The architecture is composed of a sea of building blocks (B). These blocks can be configured for:
• Throughput computing: By forming single-issue pipelines
• Single-thread performance: By forming wider-issue pipelines
• Fault-tolerance: By decommissioning broken blocks.
• Customized processing: Heterogeneous building blocks can be introduced in the fabric to form customized pipelines.

CG Instance: A Unified Performance-Reliability Solution
Provides…
1. Configurable Performance: By merging varying number of stages
2. Reliability: By isolating broken stages
Design Characteristics
• Elementary pipeline stages form the building blocks
• Stages interconnected using full crossbars.
• No global flush, stall or forwarding signals.
• No modifications to the cache hierarchy
Summary of Challenges
[Table: Control Flow, Register Data Flow, Memory Data Flow, and Instruction Steering, for Single Pipeline and Conjoint Pipelines]

Decoupling Stages in a Pipeline [MICRO'08]
1. Control Flow: Stream ID (SID)
  • Control flow handling
  • Eliminates flush signals
2. Data Flow: Bypass $
  • Stores previous results
  • Fully associative structure
  • Emulates data forwarding
3. Transmission Delays: Macro-Ops
  • Send instruction bundles
  • Amortizes transfer delay
  • Increases system utilization
[Figure: decoupled Fetch, Decode, Issue, and Ex/Mem stages with double buffers, SID registers, branch predictor, PC generator, register file, macro-op generator, and bypass $]

Replays
[Figure: Share of cycles (0% to 100%) per benchmark, split into MemFlow replay cycles, RegFlow replay cycles, and normal operation cycles. Benchmarks: 164.gzip, 175.vpr, 176.gcc, 197.parser, 256.bzip2, 300.twolf, 172.mgrid, 173.applu, 177.mesa, 179.art, 183.equake, 188.ammp, vitoneloop, rls, sobel, heat]

Area

Power