18-742 Spring 2011 Parallel Computer Architecture
Lecture 10: Asymmetric Multi-Core III
Prof. Onur Mutlu, Carnegie Mellon University

Project Proposals

- We’ve read your proposals
- Get feedback from us on your progress

Reviews

- Due Today (Feb 9) before class
  - Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001.
- Due Friday (Feb 11) midnight
  - Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993.
- Due Tuesday (Feb 15) midnight
  - Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979.
  - Dally, “Route packets, not wires: on-chip interconnection networks,” DAC 2001.
  - Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip Networks,” ISCA 2010.

Last Lecture

- Discussion on hardware support for debugging parallel programs
- Asymmetric multi-core for energy efficiency
- Accelerated critical sections (ACS)

Today

- Speculative Lock Elision
- Data Marshaling
- Dynamic Core Combining (Core Fusion)

Alternatives to ACS

- Transactional memory (Herlihy+)
  - ACS does not require code modification
- Transactional Lock Removal (Rajwar+), Speculative Synchronization (Martinez+), Speculative Lock Elision (Rajwar)
  - Hide critical section latency by increasing concurrency; ACS reduces the latency of each critical section
  - Overlap execution of critical sections that have no data conflicts; ACS accelerates ALL critical sections
  - Do not improve locality of shared data; ACS improves locality of shared data
- ACS outperforms TLR (Rajwar+) by 18% (details in the ASPLOS 2009 paper)

Speculative Lock Elision

- Many programs use locks for synchronization
- Many locks are not necessary
  - Stores occur infrequently during execution
  - Threads update different parts of the data structure
- Idea:
  - Speculatively assume the lock is not necessary and execute the critical section without acquiring the lock
  - Check for conflicts within the critical section
  - Roll back if the assumption is incorrect
- Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001.
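Conceptually, elision means trying the critical section first without touching the lock, and falling back to a normal acquire only if a conflict is detected. Below is a minimal C sketch of that control flow, using made-up speculation primitives (spec_begin, spec_conflict_detected, spec_commit, spec_abort) as stand-ins for what SLE does in hardware:

    #include <stdbool.h>

    /* Placeholder stubs: real SLE performs these steps in hardware.  The stubs
     * model a machine with no speculation support, so execution always falls
     * back to the lock. */
    static bool spec_begin(void)             { return false; } /* checkpoint, enter speculation */
    static bool spec_conflict_detected(void) { return true;  } /* read/write set conflict? */
    static void spec_commit(void)            { }               /* make buffered stores visible */
    static void spec_abort(void)             { }               /* discard speculative state */

    /* Critical section with the lock elided when speculation succeeds. */
    static void locked_increment(volatile int *lock, int *shared)
    {
        if (spec_begin()) {                   /* assume the lock is unnecessary */
            (*shared)++;                      /* run the critical section speculatively */
            if (!spec_conflict_detected()) {
                spec_commit();                /* atomicity held: no lock ever acquired */
                return;
            }
            spec_abort();                     /* conflict: roll back */
        }
        while (__sync_lock_test_and_set(lock, 1))  /* fall back: acquire the lock */
            ;
        (*shared)++;
        __sync_lock_release(lock);
    }

In actual SLE the elided acquire/release pair is detected dynamically with no source changes; the stubs above only illustrate the speculate, check, commit-or-rollback structure.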

Dynamically Unnecessary Synchronization

Speculative Lock Elision: Issues

- Either the entire critical section is committed or none of it
- How to detect the lock
- How to keep track of dependencies and conflicts in a critical section
  - Read set and write set
- How to buffer speculative state
- How to check whether “atomicity” is violated
  - Dependence violations with another thread
- How to support commit and rollback

Maintaining Atomicity

- If atomicity is maintained, all locks can be removed
- Conditions for atomicity:
  - Data read is not modified by another thread until the critical section is complete
  - Data written is not accessed by another thread until the critical section is complete
- If we know the beginning and end of a critical section, we can monitor the memory addresses read or written by the critical section and check for conflicts
  - Using the underlying coherence mechanism

SLE Implementation

- Checkpoint register state before entering SLE mode
- In SLE mode:
  - Store: Buffer the update in the write buffer (do not make it visible to other processors), request exclusive access
  - Store/Load: Set the “access” bit for the block in the cache
  - Trigger misspeculation on some coherence actions
    - If an external invalidation arrives for a block with the “access” bit set
    - If an exclusive access request arrives for a block with the “access” bit set
  - If there is not enough buffering space, trigger misspeculation
- If the end of the critical section is reached without misspeculation, commit all writes (needs to appear instantaneous)
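A rough software model of the misspeculation triggers listed above, assuming a hypothetical per-block “access” bit and an abort routine; real designs differ in how speculative state is buffered and tracked:

    #include <stdbool.h>

    struct cache_block {
        bool access_bit;      /* set when a speculative load/store touches this block */
        /* tag, data, coherence state omitted */
    };

    static bool in_sle_mode = true;     /* illustrative global flag */

    static void sle_misspeculate(void)
    {
        /* Restore the register checkpoint and re-execute the critical
         * section, this time acquiring the lock normally (placeholder). */
    }

    /* Invoked when a coherence request from another core arrives for blk. */
    static void on_external_request(struct cache_block *blk,
                                    bool is_invalidation, bool wants_exclusive)
    {
        if (in_sle_mode && blk->access_bit &&
            (is_invalidation || wants_exclusive))
            sle_misspeculate();         /* another thread conflicts: roll back */
    }

    /* Invoked when the write buffer or access-bit tracking runs out of space. */
    static void on_buffer_overflow(void)
    {
        if (in_sle_mode)
            sle_misspeculate();         /* not enough buffering: give up elision */
    }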

ACS vs. SLE

- ACS Advantages over SLE
  + Speeds up each individual critical section
  + Keeps shared data and locks in a single cache (improves shared data and lock locality)
  + Does not incur re-execution overhead, since it does not speculatively execute critical sections in parallel
- ACS Disadvantages over SLE
  - Needs transfer of private data and control to a large core (reduces private data locality and incurs overhead)
  - Executes non-conflicting critical sections serially
  - Large core can reduce parallel throughput (assuming no SMT)

ACS Summary

- Critical sections reduce performance and limit scalability
- Accelerate critical sections by executing them on a powerful core
- ACS reduces average execution time by:
  - 34% compared to an equal-area SCMP
  - 23% compared to an equal-area ACMP
- ACS improves scalability of 7 of the 12 workloads
- Generalizing the idea: Accelerate “critical paths” or “critical stages” by executing them on a powerful core

Staged Execution Model (I)

- Goal: speed up a program by dividing it up into pieces
- Idea
  - Split program code into segments
  - Run each segment on the core best suited to run it
  - Each core is assigned a work-queue storing the segments to be run
- Benefits
  - Accelerates segments/critical paths using specialized/heterogeneous cores
  - Exploits inter-segment parallelism
  - Improves locality of within-segment data
- Examples
  - Accelerated critical sections [Suleman et al., ASPLOS 2009]
  - Producer-consumer pipeline parallelism
  - Task parallelism (Cilk, Intel TBB, Apple Grand Central Dispatch)
  - Special-purpose cores and functional units
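A minimal sketch of the per-core work-queue idea in C, with hypothetical segment and queue types (not taken from the lecture or any particular runtime); synchronization and scheduling policy are omitted:

    #include <stddef.h>

    #define NUM_CORES   4
    #define QUEUE_DEPTH 64

    /* A segment is just a function plus its argument. */
    struct segment {
        void (*run)(void *arg);
        void *arg;
    };

    /* One work queue per core; segments wait here until the core runs them. */
    struct work_queue {
        struct segment items[QUEUE_DEPTH];
        size_t head, tail;
    };

    static struct work_queue queues[NUM_CORES];

    /* Spawn a segment on the core best suited to run it. */
    static void spawn_segment(int core, void (*fn)(void *), void *arg)
    {
        struct work_queue *q = &queues[core];
        q->items[q->tail % QUEUE_DEPTH] = (struct segment){ fn, arg };
        q->tail++;                       /* real code needs synchronization here */
    }

    /* Dispatcher loop executed by each core. */
    static void core_loop(int core)
    {
        struct work_queue *q = &queues[core];
        while (q->head != q->tail) {
            struct segment s = q->items[q->head % QUEUE_DEPTH];
            q->head++;
            s.run(s.arg);                /* run the segment locally */
        }
    }

Each core repeatedly pops segments from its own queue, so a segment’s code and working set stay resident on the core chosen to run it.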

Staged Execution Model (II)

    LOAD X
    STORE Y
    LOAD Y
    ....
    STORE Z
    LOAD Z
    ....

Staged Execution Model (III): Split code into segments

    Segment S0:   LOAD X
                  STORE Y
    Segment S1:   LOAD Y
                  ....
                  STORE Z
    Segment S2:   LOAD Z
                  ....

Staged Execution Model (IV)

[Figure: Core 0, Core 1, and Core 2 each have a work-queue; Core 0 runs instances of S0, Core 1 runs instances of S1, and Core 2 runs instances of S2]

Staged Execution Model: Segment Spawning

[Figure: S0 (LOAD X, STORE Y) on Core 0 spawns S1 (LOAD Y, ..., STORE Z) on Core 1, which in turn spawns S2 (LOAD Z, ...) on Core 2]

Staged Execution Model: Two Examples

- Accelerated Critical Sections [Suleman et al., ASPLOS 2009]
  - Idea: Ship critical sections to a large core in an asymmetric CMP
    - Segment 0: Non-critical section
    - Segment 1: Critical section
  - Benefit: Faster execution of critical sections, reduced serialization, improved lock and shared data locality
- Producer-Consumer Pipeline Parallelism
  - Idea: Split a loop iteration into multiple “pipeline stages,” where one stage consumes data produced by the previous stage; each stage runs on a different core
    - Segment N: Stage N
  - Benefit: Stage-level parallelism, better locality, faster execution
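To make the pipeline example concrete, here is a hedged sketch that splits one loop iteration into two stages running on different cores, reusing the hypothetical spawn_segment() from the work-queue sketch earlier; compress_block() and write_block() are made-up stage bodies:

    struct item { int id; /* ... per-iteration data ... */ };

    static void compress_block(struct item *it) { (void)it; /* stage 1 work */ }
    static void write_block(struct item *it)    { (void)it; /* stage 2 work */ }

    /* Stage 2 (Segment 2): consumes what stage 1 produced. */
    static void stage2(void *arg) { write_block((struct item *)arg); }

    /* Stage 1 (Segment 1): produces data, then spawns stage 2 on core 2. */
    static void stage1(void *arg)
    {
        struct item *it = (struct item *)arg;
        compress_block(it);
        spawn_segment(2, stage2, it);    /* inter-segment data: *it */
    }

    /* The original loop becomes a producer that feeds stage 1 on core 1. */
    static void run_pipeline(struct item *items, int n)
    {
        for (int i = 0; i < n; i++)
            spawn_segment(1, stage1, &items[i]);
    }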

Problem: Locality of Inter-segment Data

[Figure: Block Y, written by S0 on Core 0, must be transferred to Core 1, where S1’s LOAD Y incurs a cache miss; likewise block Z, written by S1, must be transferred to Core 2, where S2’s LOAD Z incurs a cache miss]

Problem: Locality of Inter-segment Data

- Accelerated Critical Sections [Suleman et al., ASPLOS 2009]
  - Idea: Ship critical sections to a large core in an ACMP
  - Problem: The critical section incurs a cache miss when it touches data produced in the non-critical section (i.e., thread-private data)
- Producer-Consumer Pipeline Parallelism
  - Idea: Split a loop iteration into multiple “pipeline stages”; each stage runs on a different core
  - Problem: A stage incurs a cache miss when it touches data produced by the previous stage
- Performance of Staged Execution is limited by inter-segment cache misses

Terminology

- Inter-segment data: Cache block written by one segment and consumed by the next segment
- Generator instruction: The last instruction to write to an inter-segment cache block in a segment

[Figure: In the running example, STORE Y in S0 and STORE Z in S1 are the generator instructions; blocks Y and Z are the inter-segment data transferred between cores]

Data Marshaling: Key Observation and Idea

- Observation: The set of generator instructions is stable over execution time and across input sets
- Idea:
  - Identify the generator instructions
  - Record the cache blocks produced by generator instructions
  - Proactively send such cache blocks to the next segment’s core before initiating the next segment
- Suleman et al., “Data Marshaling for Multi-Core Architectures,” ISCA 2010.

Data Marshaling

- Compiler/Profiler:
  1. Identify generator instructions
  2. Insert marshal instructions
- Binary contains generator prefixes & marshal instructions
- Hardware:
  1. Record generator-produced addresses
  2. Marshal recorded blocks to the next core
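A rough software model of the hardware half of this flow, assuming a small per-core marshal buffer of cache-block addresses and a placeholder push_block_to_core() transfer routine; the actual mechanism is described in the ISCA 2010 paper:

    #include <stdint.h>

    #define MARSHAL_BUF_ENTRIES 16

    static uintptr_t marshal_buf[MARSHAL_BUF_ENTRIES];
    static int marshal_count;

    /* Placeholder for the cache-to-cache transfer the hardware performs. */
    static void push_block_to_core(uintptr_t block_addr, int core)
    {
        (void)block_addr; (void)core;
    }

    /* Executed when a store carrying the generator (G:) prefix retires. */
    static void on_generator_store(uintptr_t block_addr)
    {
        if (marshal_count < MARSHAL_BUF_ENTRIES)       /* buffer is small; overflow is dropped */
            marshal_buf[marshal_count++] = block_addr; /* record inter-segment block */
    }

    /* Executed for MARSHAL Cn, just before the next segment is initiated. */
    static void on_marshal(int next_core)
    {
        for (int i = 0; i < marshal_count; i++)
            push_block_to_core(marshal_buf[i], next_core);  /* proactively send blocks */
        marshal_count = 0;
    }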

Profiling Algorithm

[Figure: In the example code, blocks Y and Z are identified as inter-segment data, and STORE Y and STORE Z are marked as generator instructions]
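A toy version of such a profiling pass, assuming a memory trace annotated with segment IDs and using small fixed-size arrays where a real profiler would hash; structures, limits, and names are illustrative only:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_BLOCKS 4096     /* toy limit on distinct cache blocks  */
    #define MAX_PCS    65536    /* toy limit on distinct instruction addresses */

    struct access {
        int      seg;           /* which segment issued the access */
        uint32_t pc;            /* instruction address (< MAX_PCS)  */
        uint32_t block;         /* cache-block index   (< MAX_BLOCKS) */
        bool     is_store;
    };

    /* Mark gen[pc] = true for every generator instruction: the last store to a
     * block within one segment whose block is later touched by a following segment. */
    static void mark_generators(const struct access *trace, int n, bool gen[MAX_PCS])
    {
        static bool     written[MAX_BLOCKS];
        static uint32_t last_store_pc[MAX_BLOCKS];
        static int      last_store_seg[MAX_BLOCKS];

        for (int i = 0; i < n; i++) {
            const struct access *a = &trace[i];
            /* Block produced in an earlier segment, touched now: inter-segment data. */
            if (written[a->block] && last_store_seg[a->block] < a->seg)
                gen[last_store_pc[a->block]] = true;
            if (a->is_store) {
                written[a->block]        = true;
                last_store_pc[a->block]  = a->pc;
                last_store_seg[a->block] = a->seg;
            }
        }
    }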

Marshal Instructions

    S0:   LOAD X
          G: STORE Y        <- generator prefix
          MARSHAL C1        <- when to send (MARSHAL), where to send (C1)

    S1:   LOAD Y
          ....
          G: STORE Z
          MARSHAL C2

    S2:   0x5: LOAD Z
          ....

Data Marshaling

- Compiler/Profiler:
  1. Identify generator instructions
  2. Insert marshal instructions
- Binary contains generator prefixes & marshal instructions
- Hardware:
  1. Record generator-produced addresses
  2. Marshal recorded blocks to the next core

Hardware Support and DM Example

[Figure: On Core 0, the generator store G: STORE Y records address Y in the Marshal Buffer; MARSHAL C1 pushes block Y into Core 1’s L2 cache, so S1’s LOAD Y on Core 1 becomes a cache hit]

DM: Advantages, Disadvantages

- Advantages
  - Timely data transfer: Push data to the core before it is needed
  - Can marshal any arbitrary sequence of lines: Identifies generators, not patterns
  - Low hardware cost: The profiler marks generators; no need for hardware to find them
- Disadvantages
  - Requires profiler and ISA support
  - Not always accurate (generator set is conservative): Pollution at the remote core, wasted bandwidth on the interconnect
    - Not a large problem, as the number of inter-segment blocks is small

Accelerated Critical Sections

[Figure: The small core records address Y (from G: STORE Y) in its Marshal Buffer; CSCALL marshals block Y into the large core’s L2 cache, so the critical section’s LOAD Y on the large core is a cache hit; the critical section ends with G: STORE Z and CSRET]

Accelerated Critical Sections: Methodology

- Workloads: 12 critical-section-intensive applications
  - Data mining kernels, sorting, database, web, networking
  - Different training and simulation input sets
- Multi-core x86 simulator
  - 1 large and 28 small cores
  - Aggressive stream prefetcher employed at each core
- Details:
  - Large core: 2 GHz, out-of-order, 128-entry ROB, 4-wide, 12-stage
  - Small core: 2 GHz, in-order, 2-wide, 5-stage
  - Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  - On-chip interconnect: bi-directional ring, 5-cycle hop latency

DM on Accelerated Critical Sections: Results

[Figure: Results chart; DM improves ACS performance by 8.7% on average (two bars, labeled 168 and 170, extend beyond the axis)]

Pipeline Parallelism

[Figure: The same DM support applied to pipeline stages: Core 0 records address Y from G: STORE Y in its Marshal Buffer, and MARSHAL C1 pushes block Y into Core 1’s L2 cache, so S1’s LOAD Y on Core 1 is a cache hit]

Pipeline Parallelism: Methodology

- Workloads: 9 applications with pipeline parallelism
  - Financial, compression, multimedia, encoding/decoding
  - Different training and simulation input sets
- Multi-core x86 simulator
  - 32-core CMP: 2 GHz, in-order, 2-wide, 5-stage
  - Aggressive stream prefetcher employed at each core
  - Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  - On-chip interconnect: bi-directional ring, 5-cycle hop latency

DM on Pipeline Parallelism: Results

[Figure: Results chart; DM improves pipeline-parallel performance by 16% on average]

DM Coverage, Accuracy, Timeliness

- High coverage of inter-segment misses in a timely manner
- Medium accuracy does not impact performance
  - Only 5.0 and 6.8 cache blocks marshaled for the average segment

DM Scaling Results

- DM performance improvement increases with
  - More cores
  - Higher interconnect latency
  - Larger private L2 caches
- Why? Inter-segment data misses become a larger bottleneck
  - More cores -> more communication
  - Higher latency -> longer stalls due to communication
  - Larger L2 cache -> communication misses remain

Other Applications of Data Marshaling

- Can be applied to other Staged Execution models
  - Task parallelism models
    - Cilk, Intel TBB, Apple Grand Central Dispatch
  - Special-purpose remote functional units
  - Computation spreading [Chakraborty et al., ASPLOS’06]
  - Thread motion/migration [e.g., Rangan et al., ISCA’09]
- Can be an enabler for more aggressive SE models
  - Lowers the cost of data migration
    - An important overhead in remote execution of code segments
  - Remote execution of finer-grained tasks can become more feasible -> finer-grained parallelization in multi-cores

How to Build a Dynamic ACMP

- Frequency boosting / DVFS
- Core combining: Core Fusion
  - Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007.
  - Idea: Dynamically fuse multiple small cores to form a single large core

Core Fusion: Motivation

- Programs are incrementally parallelized in stages
- Each parallelization stage is best executed on a different “type” of multi-core

Core Fusion Idea

- Combine multiple simple cores dynamically to form a larger, more powerful core

Core Fusion Microarchitecture

- Concept: Add enveloping hardware to make cores combinable