18-742 Spring 2011 Parallel Computer Architecture
Lecture 10: Asymmetric Multi-Core III
Prof. Onur Mutlu, Carnegie Mellon University
Project Proposals
- We've read your proposals
- Get feedback from us on your progress
Reviews
- Due Today (Feb 9) before class
  - Rajwar and Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," MICRO 2001.
- Due Friday (Feb 11) midnight
  - Herlihy and Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," ISCA 1993.
- Due Tuesday (Feb 15) midnight
  - Patel, "Processor-Memory Interconnections for Multiprocessors," ISCA 1979.
  - Dally, "Route Packets, Not Wires: On-Chip Interconnection Networks," DAC 2001.
  - Das et al., "Aergia: Exploiting Packet Latency Slack in On-Chip Networks," ISCA 2010.
Last Lecture
- Discussion on hardware support for debugging parallel programs
- Asymmetric multi-core for energy efficiency
- Accelerated critical sections (ACS)
Today
- Speculative Lock Elision
- Data Marshaling
- Dynamic Core Combining (Core Fusion)
Alternatives to ACS
- Transactional memory (Herlihy+)
  - ACS does not require code modification
- Transactional Lock Removal (Rajwar+), Speculative Synchronization (Martinez+), Speculative Lock Elision (Rajwar+)
  - Hide critical section latency by increasing concurrency; ACS instead reduces the latency of each critical section
  - Overlap execution of critical sections with no data conflicts; ACS accelerates ALL critical sections
  - Do not improve locality of shared data; ACS improves locality of shared data
- ACS outperforms TLR (Rajwar+) by 18% (details in the ASPLOS 2009 paper)
Speculative Lock Elision
- Many programs use locks for synchronization
- Many locks are not necessary
  - Stores occur infrequently during execution
  - Different threads update different parts of the data structure
- Idea:
  - Speculatively assume the lock is not necessary and execute the critical section without acquiring the lock
  - Check for conflicts within the critical section
  - Roll back if the assumption is incorrect
- Rajwar and Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," MICRO 2001.
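The elide/check/rollback flow above can be sketched in software. This is only an analogy: real SLE happens in hardware, transparently to the program. The names `try_speculative`, `conflict`, and the `shared` dictionary are illustrative, not from the lecture; `conflict` stands in for the hardware's coherence-based conflict detection.

```python
import threading

lock = threading.Lock()
shared = {"count": 0}

def try_speculative(update, conflict):
    """Execute update speculatively; 'conflict' models a detected data
    conflict (in hardware, a coherence event on a watched block)."""
    snapshot = dict(shared)       # checkpoint state before speculating
    update()
    if conflict():                # atomicity violated?
        shared.clear()
        shared.update(snapshot)   # roll back to the checkpoint
        return False
    return True                   # commit: the lock was never needed

def critical_section(update, conflict=lambda: False):
    if try_speculative(update, conflict):
        return                    # fast path: lock elided
    with lock:                    # slow path: acquire the lock, re-execute
        update()

critical_section(lambda: shared.__setitem__("count", shared["count"] + 1))
# shared["count"] == 1
```

Note that on misspeculation the critical section is simply re-executed with the lock held, so correctness never depends on the speculation succeeding.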
Dynamically Unnecessary Synchronization
Speculative Lock Elision: Issues
- Either the entire critical section is committed or none of it
- How to detect the lock
- How to keep track of dependencies and conflicts in a critical section
  - Read set and write set
- How to buffer speculative state
- How to check if "atomicity" is violated
  - Dependence violations with another thread
- How to support commit and rollback
Maintaining Atomicity
- If atomicity is maintained, all locks can be removed
- Conditions for atomicity:
  - Data read is not modified by another thread until the critical section is complete
  - Data written is not accessed by another thread until the critical section is complete
- If we know the beginning and end of a critical section, we can monitor the memory addresses it reads or writes and check for conflicts
  - Using the underlying coherence mechanism
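The two atomicity conditions map directly onto read-set/write-set conflict checks. A minimal sketch (class and method names are illustrative; real hardware tracks this per cache block via the coherence protocol, not in software sets):

```python
class SpeculativeSection:
    """Tracks the blocks a critical section touches while speculating."""

    def __init__(self):
        self.read_set = set()    # blocks read inside the critical section
        self.write_set = set()   # blocks written inside the critical section

    def load(self, addr):
        self.read_set.add(addr)

    def store(self, addr):
        self.write_set.add(addr)

    def conflicts_with(self, other_writes, other_accesses):
        # Condition 1: data we read was modified by another thread
        if self.read_set & other_writes:
            return True
        # Condition 2: data we wrote was accessed (read or written)
        # by another thread
        if self.write_set & other_accesses:
            return True
        return False
```

If `conflicts_with` ever returns True before the critical section ends, atomicity cannot be guaranteed and the speculation must be rolled back.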
SLE Implementation
- Checkpoint register state before entering SLE mode
- In SLE mode:
  - Store: buffer the update in the write buffer (do not make it visible to other processors); request exclusive access
  - Store/Load: set the "access" bit for the block in the cache
  - Trigger misspeculation on some coherence actions:
    - External invalidation to a block with the "access" bit set
    - External exclusive-access request to a block with the "access" bit set
  - If there is not enough buffering space, trigger misspeculation
- If the end of the critical section is reached without misspeculation, commit all writes (commit needs to appear instantaneous)
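The mechanisms on this slide can be sketched as a toy model: a bounded write buffer, per-block access bits, and misspeculation triggers. All names and the buffer capacity are illustrative, and memory outside the write buffer is abstracted away.

```python
class SLECore:
    WRITE_BUFFER_CAPACITY = 4  # illustrative limit, not a real parameter

    def __init__(self):
        self.write_buffer = {}    # addr -> value, invisible to other cores
        self.access_bits = set()  # blocks touched while in SLE mode
        self.misspeculated = False

    def load(self, addr):
        self.access_bits.add(addr)               # set "access" bit
        return self.write_buffer.get(addr)       # forward own buffered store

    def store(self, addr, value):
        if (addr not in self.write_buffer and
                len(self.write_buffer) >= self.WRITE_BUFFER_CAPACITY):
            self.misspeculated = True            # out of buffering space
            return
        self.write_buffer[addr] = value          # buffered, not yet visible
        self.access_bits.add(addr)

    def external_request(self, addr):
        # External invalidation or exclusive-access request: if it hits a
        # block with the access bit set, atomicity may be violated.
        if addr in self.access_bits:
            self.misspeculated = True

    def commit(self):
        if self.misspeculated:
            self.write_buffer.clear()            # roll back to checkpoint
            return False
        return True  # all buffered writes become visible "instantaneously"
```

The register checkpoint taken before entering SLE mode is omitted here; rollback is modeled simply by discarding the write buffer.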
ACS vs. SLE
- ACS advantages over SLE:
  + Speeds up each individual critical section
  + Keeps shared data and locks in a single cache (improves shared data and lock locality)
  + Does not incur re-execution overhead since it does not speculatively execute critical sections in parallel
- ACS disadvantages over SLE:
  - Needs transfer of private data and control to a large core (reduces private data locality and incurs overhead)
  - Executes non-conflicting critical sections serially
  - Large core can reduce parallel throughput (assuming no SMT)
ACS Summary
- Critical sections reduce performance and limit scalability
- Accelerate critical sections by executing them on a powerful core
- ACS reduces average execution time by:
  - 34% compared to an equal-area SCMP
  - 23% compared to an equal-area ACMP
- ACS improves scalability of 7 of the 12 workloads
- Generalizing the idea: accelerate "critical paths" or "critical stages" by executing them on a powerful core
Staged Execution Model (I)
- Goal: speed up a program by dividing it up into pieces
- Idea:
  - Split program code into segments
  - Run each segment on the core best suited to run it
  - Each core is assigned a work queue, storing segments to be run
- Benefits:
  - Accelerates segments/critical paths using specialized/heterogeneous cores
  - Exploits inter-segment parallelism
  - Improves locality of within-segment data
- Examples:
  - Accelerated critical sections [Suleman et al., ASPLOS 2009]
  - Producer-consumer pipeline parallelism
  - Task parallelism (Cilk, Intel TBB, Apple Grand Central Dispatch)
  - Special-purpose cores and functional units
Staged Execution Model (II)
  LOAD X
  STORE Y
  LOAD Y
  ...
  STORE Z
  LOAD Z
  ...
Staged Execution Model (III)
Split code into segments:
- Segment S0: LOAD X ... STORE Y
- Segment S1: LOAD Y ... STORE Z
- Segment S2: LOAD Z ...
Staged Execution Model (IV)
- Core 0 runs instances of S0, Core 1 runs instances of S1, Core 2 runs instances of S2
- Each core has its own work queue of segments to execute
Staged Execution Model: Segment Spawning
- Core 0: S0 (LOAD X ... STORE Y) spawns S1 on Core 1
- Core 1: S1 (LOAD Y ... STORE Z) spawns S2 on Core 2
- Core 2: S2 (LOAD Z ...)
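The per-core work-queue organization can be mimicked in software with one thread per "core" and a queue between adjacent segments. This is only an analogy for the model; the stage functions and values are made up for illustration.

```python
from queue import Queue
from threading import Thread

q01, q12 = Queue(), Queue()   # work queues between adjacent cores
results = []

def stage0(items):            # S0: produces "Y" values for Core 1
    for x in items:
        q01.put(x + 1)
    q01.put(None)             # sentinel: no more work

def stage1():                 # S1: consumes Y, produces "Z" for Core 2
    while (y := q01.get()) is not None:
        q12.put(y * 2)
    q12.put(None)

def stage2():                 # S2: consumes Z
    while (z := q12.get()) is not None:
        results.append(z)

threads = [Thread(target=stage0, args=([1, 2, 3],)),
           Thread(target=stage1), Thread(target=stage2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results == [4, 6, 8]
```

Each stage only ever touches the queue it consumes from and the queue it produces into, which is exactly the structure that makes inter-segment data locality the interesting problem in the next slides.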
Staged Execution Model: Two Examples
- Accelerated Critical Sections [Suleman et al., ASPLOS 2009]
  - Idea: ship critical sections to a large core in an asymmetric CMP
    - Segment 0: non-critical section
    - Segment 1: critical section
  - Benefit: faster execution of critical sections, reduced serialization, improved lock and shared data locality
- Producer-Consumer Pipeline Parallelism
  - Idea: split a loop iteration into multiple "pipeline stages" where each stage consumes data produced by the previous stage; each stage runs on a different core
    - Segment N: stage N
  - Benefit: stage-level parallelism, better locality, faster execution
Problem: Locality of Inter-segment Data
- S0 on Core 0 stores Y; when S1 on Core 1 loads Y, the transfer of Y causes a cache miss
- S1 on Core 1 stores Z; when S2 on Core 2 loads Z, the transfer of Z causes a cache miss
Problem: Locality of Inter-segment Data
- Accelerated Critical Sections [Suleman et al., ASPLOS 2009]
  - Idea: ship critical sections to a large core in an ACMP
  - Problem: the critical section incurs a cache miss when it touches data produced in the non-critical section (i.e., thread-private data)
- Producer-Consumer Pipeline Parallelism
  - Idea: split a loop iteration into multiple "pipeline stages"; each stage runs on a different core
  - Problem: a stage incurs a cache miss when it touches data produced by the previous stage
- Performance of Staged Execution is limited by inter-segment cache misses
Terminology
- Inter-segment data: a cache block written by one segment and consumed by the next segment (e.g., Y between S0 and S1, Z between S1 and S2)
- Generator instruction: the last instruction to write to an inter-segment cache block in a segment
Data Marshaling: Key Observation and Idea
- Observation: the set of generator instructions is stable over execution time and across input sets
- Idea:
  - Identify the generator instructions
  - Record cache blocks produced by generator instructions
  - Proactively send such cache blocks to the next segment's core before initiating the next segment
- Suleman et al., "Data Marshaling for Multi-Core Architectures," ISCA 2010.
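The record-then-push idea can be sketched as follows. The `Core` class, `store`, and `marshal` are illustrative names, and the private L2 is abstracted as a dictionary; in the real design the marshal buffer holds addresses recorded by generator-prefixed stores and the blocks are pushed over the interconnect.

```python
class Core:
    def __init__(self):
        self.l2 = {}             # private cache, abstracted as addr -> data
        self.marshal_buffer = [] # addresses written by generator instructions

def store(core, addr, data, generator=False):
    core.l2[addr] = data
    if generator:                # the instruction carries a generator prefix
        core.marshal_buffer.append(addr)

def marshal(src, dst):
    # At segment end: proactively push recorded blocks to the next
    # segment's core, before that segment starts running.
    for addr in src.marshal_buffer:
        dst.l2[addr] = src.l2[addr]
    src.marshal_buffer.clear()

core0, core1 = Core(), Core()
store(core0, "Y", 42, generator=True)   # G: STORE Y
marshal(core0, core1)                   # MARSHAL C1
assert "Y" in core1.l2                  # S1's LOAD Y now hits in the cache
```

Because only generator-produced blocks are recorded, the marshal buffer stays small even though the blocks form an arbitrary, non-strided set of addresses.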
Data Marshaling
- Compiler/Profiler:
  1. Identify generator instructions
  2. Insert marshal instructions
- Binary containing generator prefixes & marshal instructions
- Hardware:
  1. Record generator-produced addresses
  2. Marshal recorded blocks to the next core
Profiling Algorithm
- Profile the code and identify inter-segment data (e.g., Y written in S0 and read in S1; Z written in S1 and read in S2)
- Mark the last writer of each inter-segment cache block as a generator instruction
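A minimal sketch of such a profiler over a memory trace: track the most recent writer of every block, and whenever a later segment reads a block, mark that writer as a generator. The trace format and PCs are made up for illustration; this is not the paper's exact algorithm.

```python
def find_generators(trace):
    """trace: iterable of (segment_id, pc, op, addr) tuples."""
    last_writer = {}   # addr -> (segment, pc) of the most recent store
    generators = set()
    for seg, pc, op, addr in trace:
        if op == "store":
            last_writer[addr] = (seg, pc)      # candidate generator
        elif op == "load" and addr in last_writer:
            wseg, wpc = last_writer[addr]
            if wseg != seg:                    # block crossed a segment
                generators.add(wpc)            # mark its last writer
    return generators

trace = [(0, "0x1", "load",  "X"),
         (0, "0x2", "store", "Y"),   # last write to Y in S0
         (1, "0x3", "load",  "Y"),   # consumed by S1: inter-segment
         (1, "0x4", "store", "Z"),   # last write to Z in S1
         (2, "0x5", "load",  "Z")]   # consumed by S2: inter-segment
# find_generators(trace) == {"0x2", "0x4"}
```

The key-observation slide above is what makes this practical: because the generator set is stable across inputs, a profiling run on a training input identifies essentially the same generators that appear at run time.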
Marshal Instructions
  S0:
    LOAD X
    G: STORE Y    ; generator prefix
    MARSHAL C1    ; when to send (MARSHAL), where to send (C1)
  S1:
    LOAD Y
    ...
    G: STORE Z
    MARSHAL C2
  S2:
    0x5: LOAD Z
    ...
Hardware Support and DM Example
- Core 0 executes S0; G: STORE Y records address Y in the Marshal Buffer
- MARSHAL C1 pushes block Y (Addr Y, Data Y) from Core 0's L2 cache to Core 1's L2 cache
- When S1 executes LOAD Y on Core 1, the access is a cache hit
DM: Advantages, Disadvantages
- Advantages:
  - Timely data transfer: pushes data to the core before it is needed
  - Can marshal any arbitrary sequence of lines: identifies generators, not access patterns
  - Low hardware cost: the profiler marks generators, so no hardware is needed to find them
- Disadvantages:
  - Requires profiler and ISA support
  - Not always accurate (the generator set is conservative): pollution at the remote core, wasted bandwidth on the interconnect
    - Not a large problem, as the number of inter-segment blocks is small
Accelerated Critical Sections
- Small Core 0 executes LOAD X and G: STORE Y; the generator store records address Y in the Marshal Buffer
- CSCALL marshals block Y (Addr Y, Data Y) from the small core's L2 cache to the large core's L2 cache
- The critical section on the large core (LOAD Y ... G: STORE Z ... CSRET) hits in the cache on Y
Accelerated Critical Sections: Methodology
- Workloads: 12 critical-section-intensive applications
  - Data mining kernels, sorting, database, web, networking
  - Different training and simulation input sets
- Multi-core x86 simulator
  - 1 large and 28 small cores
  - Aggressive stream prefetcher employed at each core
- Details:
  - Large core: 2 GHz, out-of-order, 128-entry ROB, 4-wide, 12-stage
  - Small core: 2 GHz, in-order, 2-wide, 5-stage
  - Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  - On-chip interconnect: bi-directional ring, 5-cycle hop latency
DM on Accelerated Critical Sections: Results
- Performance improvement: 8.7%
Pipeline Parallelism
- Core 0 executes S0; G: STORE Y records address Y in the Marshal Buffer
- MARSHAL C1 pushes block Y (Addr Y, Data Y) from Core 0's L2 cache to Core 1's L2 cache
- S1's LOAD Y on Core 1 is a cache hit
Pipeline Parallelism: Methodology
- Workloads: 9 applications with pipeline parallelism
  - Financial, compression, multimedia, encoding/decoding
  - Different training and simulation input sets
- Multi-core x86 simulator
  - 32-core CMP: 2 GHz, in-order, 2-wide, 5-stage
  - Aggressive stream prefetcher employed at each core
  - Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  - On-chip interconnect: bi-directional ring, 5-cycle hop latency
DM on Pipeline Parallelism: Results
- Performance improvement: 16%
DM Coverage, Accuracy, Timeliness
- High coverage of inter-segment misses in a timely manner
- Medium accuracy does not impact performance
  - Only 5.0 and 6.8 cache blocks are marshaled per segment on average
DM Scaling Results
- DM performance improvement increases with:
  - More cores
  - Higher interconnect latency
  - Larger private L2 caches
- Why? Inter-segment data misses become a larger bottleneck:
  - More cores → more communication
  - Higher latency → longer stalls due to communication
  - Larger L2 cache → communication misses remain
Other Applications of Data Marshaling
- Can be applied to other Staged Execution models
  - Task parallelism models (Cilk, Intel TBB, Apple Grand Central Dispatch)
  - Special-purpose remote functional units
  - Computation spreading [Chakraborty et al., ASPLOS'06]
  - Thread motion/migration [e.g., Rangan et al., ISCA'09]
- Can be an enabler for more aggressive SE models
  - Lowers the cost of data migration, an important overhead in remote execution of code segments
  - Remote execution of finer-grained tasks becomes more feasible → finer-grained parallelization in multi-cores
How to Build a Dynamic ACMP
- Frequency boosting, DVFS
- Core combining: Core Fusion
  - Idea: dynamically fuse multiple small cores to form a single large core
  - Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," ISCA 2007.
Core Fusion: Motivation
- Programs are incrementally parallelized in stages
- Each parallelization stage is best executed on a different "type" of multi-core
Core Fusion Idea
- Combine multiple simple cores dynamically to form a larger, more powerful core
Core Fusion Microarchitecture
- Concept: add enveloping hardware to make cores combinable