18-742 Spring 2011 Parallel Computer Architecture
Lecture 9: More Asymmetry
Prof. Onur Mutlu, Carnegie Mellon University

Reviews

• Due Wednesday (Feb 9):
  – Rajwar and Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," MICRO 2001.
• Due Friday (Feb 11):
  – Herlihy and Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," ISCA 1993.

Last Lecture

• Finished multi-core evolution (IBM perspective)
• Asymmetric multi-core
  – Accelerating serial bottlenecks
  – Benefits and disadvantages
• How to achieve asymmetry?
  – Static versus dynamic
  – Boosting frequency
  – Asymmetry as EPI throttling
• Design tradeoffs in asymmetric multi-core

Review: How to Achieve Asymmetry

• Static
  – Type and power of cores fixed at design time
  – Two approaches to design "faster cores":
    · High frequency
    · Build a more complex, powerful core with an entirely different uarch
  – Is static asymmetry natural? (chip-wide variations in frequency)
• Dynamic
  – Type and power of cores change dynamically
  – Two approaches to dynamically create "faster cores":
    · Boost frequency dynamically (limited power budget)
    · Combine small cores to enable a more complex, powerful core
    · Is there a third, fourth, fifth approach?

Review: Possible EPI Throttling Techniques

• Grochowski et al., "Best of Both Latency and Throughput," ICCD 2004.

Review: Design Tradeoffs in ACMP (I)

• Hardware design effort vs. programmer effort
  - ACMP requires more design effort
  + Performance becomes less dependent on the length of the serial part
  + Can reduce programmer effort: serial portions are not as bad for performance with ACMP
• Migration overhead vs. accelerated serial bottleneck
  + Performance gain from faster execution of the serial portion
  - Performance loss when state is migrated
  - Serial portion incurs cache misses when it needs data generated by the parallel portion
  - Parallel portion incurs cache misses when it needs data generated by the serial portion

Review: Design Tradeoffs in ACMP (II)

• Fewer threads vs. accelerated serial bottleneck
  + Performance gain from the accelerated serial portion
  - Performance loss due to unavailability of L threads in the parallel portion
    · This need not be the case: the large core can implement multithreading to improve parallel throughput
    · As the number of cores (threads) on chip increases, the fractional loss in parallel performance decreases

Uses of Asymmetry

• So far: improvement in serial performance (sequential bottleneck)
• What else can we do with asymmetry?
  – Energy reduction?
  – Energy/performance tradeoff?
  – Improvement in parallel portion?

Use of Asymmetry for Energy Efficiency

• Kumar et al., "Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction," MICRO 2003.
• Idea:
  – Implement multiple types of cores on chip
  – Monitor characteristics of the running thread (e.g., sample energy/performance on each core periodically)
  – Dynamically pick the core that provides the best energy/performance tradeoff for a given phase
    · "Best core" depends on the optimization metric
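A minimal sketch of the sampling-based selection idea (my illustration, not the paper's mechanism): run the thread's current phase briefly on each core type, then pick the core with the lowest energy-delay product. The core names and sampled numbers below are hypothetical, and EDP is just one possible optimization metric.

```python
# Sketch: choose the core type whose sampled energy-delay product (EDP)
# is lowest for the current program phase. Core names/numbers are made up.

def best_core(samples):
    """samples: {core_name: (energy_joules, time_seconds)} from briefly
    running the phase on each core type. Minimizes energy * delay;
    energy alone or ED^2 could be substituted as the metric."""
    return min(samples, key=lambda c: samples[c][0] * samples[c][1])

# Hypothetical samples for one phase:
phase = {"big_ooo": (2.0, 1.0),        # fast but power-hungry: EDP 2.0
         "small_inorder": (0.6, 2.5)}  # slow but frugal:      EDP 1.5
print(best_core(phase))  # -> small_inorder
```

The same skeleton picks the big core for phases where the small core's slowdown outweighs its energy savings.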

Use of Asymmetry for Energy Efficiency

[figure slides]

Use of Asymmetry for Energy Efficiency

• Advantages
  + More flexibility in the energy-performance tradeoff
  + Can migrate computation to the core that is best suited for it (in terms of energy)
• Disadvantages/issues
  - Incorrect predictions/sampling of the wrong core → reduced performance or increased energy
  - Overhead of core switching
  - Disadvantages of asymmetric CMP (e.g., designing multiple cores)
  - Need phase monitoring and matching algorithms
    · What characteristics should be monitored?
    · Once characteristics are known, how do you pick the core?

Use of ACMP to Improve Parallel Portion Performance

• Mutual exclusion:
  – Threads are not allowed to update shared data concurrently
  – Accesses to shared data are encapsulated inside critical sections
  – Only one thread can execute a critical section at a given time
• Idea: ship critical sections to a large core
• Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009, IEEE Micro Top Picks 2010.
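As a minimal illustration of these three properties, here is an ordinary lock-protected critical section in Python (standard software locking, not the hardware mechanism the lecture builds toward):

```python
# Mutual exclusion: four threads increment a shared counter, but only
# inside a lock-protected critical section, so no updates are lost.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # critical section: one thread at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- every increment survives
```

Without the lock, the read-modify-write of `counter` could interleave across threads and drop updates; the lock serializes exactly the shared-data access.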

A Critical Section

[figure]

Example of Critical Section from MySQL

[figure: a list of open tables A–E, with marks showing which of Threads 0–3 has each table open; Thread 2 calls OpenTables(D, E), Thread 3 calls CloseAllTables()]

Example of Critical Section from MySQL

[figure: open-table list A–E annotated with the IDs of the threads that have each table open]

Example of Critical Section from MySQL

End of transaction:

  LOCK_open.Acquire()
  foreach (table opened by thread)
    if (table.temporary)
      table.close()
  LOCK_open.Release()

Contention for Critical Sections

[figure: timeline of Threads 1–4 alternating parallel execution, critical sections, and idle time; when critical sections execute 2x faster, idle (waiting) time shrinks]
• Accelerating critical sections helps not only the thread executing the critical section, but also the waiting threads

Contention for Critical Sections

• Contention for critical sections leads to serial execution (serialization) of threads in the parallel program portion
• Contention is likely to increase with large critical sections

Impact of Critical Sections on Scalability

• Contention for critical sections increases with the number of threads and limits scalability

  LOCK_open.Acquire()
  foreach (table locked by thread)
    table.lock.release()
    table.file.release()
    if (table.temporary)
      table.close()
  LOCK_open.Release()

[figure: speedup vs. chip area (cores, 0–32) for MySQL (oltp-1)]
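The scalability limit can be captured with a back-of-envelope model (my simplification, not the paper's analysis): if each thread spends fraction c of its work holding one shared lock, the lock is a serial resource that is busy c of the time per unit of work, so speedup with n threads is capped at roughly min(n, 1/c).

```python
# Idealized speedup of n threads when fraction c of each thread's work
# holds a single contended lock: the lock saturates at 1/c, so adding
# threads past that point buys nothing (and in practice costs).

def speedup(n, c):
    return min(n, 1.0 / c)

for n in (1, 8, 16, 32):
    print(n, speedup(n, c=0.1))   # saturates at 10x once n >= 10
```

This is why the speedup curve flattens as chip area (cores) grows: the critical-section fraction, not the core count, sets the ceiling.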

Conventional ACMP

  EnterCS()
  PriorityQ.insert(…)
  LeaveCS()

1. P2 encounters a critical section
2. P2 sends a request for the lock
3. P2 acquires the lock
4. P2 executes the critical section
5. P2 releases the lock

[figure: large core P1 and small cores P2–P4 on an on-chip interconnect; the core encountering the critical section executes it itself]

Accelerated Critical Sections (ACS)

[figure: one large core with a Critical Section Request Buffer (CSRB), surrounded by Niagara-like small cores]
• ACMP approach: accelerate Amdahl's serial part and critical sections using the large core
• Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009, IEEE Micro Top Picks 2010.

Accelerated Critical Sections (ACS)

  EnterCS()
  PriorityQ.insert(…)
  LeaveCS()

1. P2 encounters a critical section
2. P2 sends a CSCALL request to the CSRB
3. P1 executes the critical section
4. P1 sends a CSDONE signal

[figure: large core P1 with the Critical Section Request Buffer (CSRB) and small cores P2–P4 on an on-chip interconnect]

Accelerated Critical Sections (ACS)

Small core:
  A = compute()
  PUSH A
  CSCALL X, Target PC    ← CSCALL request: send X, TPC, STACK_PTR, CORE_ID
  …  (waiting)           ← request waits in the Critical Section Request Buffer (CSRB)
  POP result             ← after the CSDONE response
  print result

Large core:
  TPC: Acquire X
       POP A
       result = CS(A)
       PUSH result
       Release X
       CSRET X

(The original code on the small core was: A = compute(); LOCK X; result = CS(A); UNLOCK X; print result.)

ACS Architecture Overview

• ISA extensions
  – CSCALL LOCK_ADDR, TARGET_PC
  – CSRET LOCK_ADDR
• Compiler/library inserts CSCALL/CSRET
• On a CSCALL, the small core:
  – Sends a CSCALL request to the large core
    · Arguments: lock address, target PC, stack pointer, core ID
  – Stalls and waits for CSDONE
• Large core:
  – Critical Section Request Buffer (CSRB)
  – Executes the critical section and sends CSDONE to the requesting core
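A software analogue of this flow (a sketch under stated assumptions, not the hardware mechanism): "small cores" are client threads that ship critical-section work through a request buffer to one "large core" server thread, stalling until completion. The names mirror the slide's CSCALL/CSDONE/CSRB vocabulary; a `queue.Queue` stands in for the CSRB and an `Event` for the CSDONE signal.

```python
# Delegation-style critical sections: clients submit work to one server
# thread; shared data is touched only by the server, which also gives
# the locality benefit ACS claims (shared data stays at one core).
import queue
import threading

csrb = queue.Queue()       # Critical Section Request Buffer analogue
shared = {"total": 0}      # touched only by the "large core" thread

def large_core():
    while True:
        fn, done = csrb.get()     # take the next CSCALL request
        if fn is None:            # shutdown sentinel
            break
        fn(shared)                # execute the critical section
        done.set()                # CSDONE

def cscall(fn):
    done = threading.Event()
    csrb.put((fn, done))          # CSCALL: ship work to the large core
    done.wait()                   # small core stalls until CSDONE

server = threading.Thread(target=large_core)
server.start()
for _ in range(100):
    cscall(lambda s: s.update(total=s["total"] + 1))
csrb.put((None, None))            # stop the server
server.join()
print(shared["total"])            # 100 -- all updates serialized at one place
```

No client ever touches `shared` directly, so there is no lock at all in the client code; mutual exclusion falls out of the single-server design, much as ACS serializes critical sections at the large core.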

False Serialization

• ACS can serialize independent critical sections
• Selective Acceleration of Critical Sections (SEL)
  – Saturating counters track false serialization

[figure: CSRB at the large core receiving CSCALL(A), CSCALL(A), CSCALL(B) from small cores; per-lock saturating counter values shown for locks A and B]
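One way to sketch the SEL idea (my simplification; the counter width, threshold, and update policy here are assumptions, not the paper's exact design): keep a saturating counter per lock, bump it when a CSCALL had to wait behind a CSCALL for a *different* lock, and stop shipping a lock's critical sections to the large core once its counter saturates.

```python
# Per-lock saturating counters for detecting false serialization.
# Widths/thresholds below are hypothetical.

SAT_MAX = 15        # assumed 4-bit saturating counter
THRESHOLD = 8       # assumed disable threshold

counters = {}       # lock address -> counter value

def record(lock, falsely_serialized):
    c = counters.get(lock, 0)
    # increment on false serialization, decay otherwise; saturate at both ends
    counters[lock] = min(c + 1, SAT_MAX) if falsely_serialized else max(c - 1, 0)

def should_accelerate(lock):
    # ship this lock's critical sections to the large core?
    return counters.get(lock, 0) < THRESHOLD

for _ in range(10):
    record(0xA, falsely_serialized=True)    # lock A keeps colliding with others
record(0xB, falsely_serialized=False)       # lock B behaves well
print(should_accelerate(0xA), should_accelerate(0xB))  # False True
```

Locks that repeatedly cause false serialization fall back to conventional local execution, while well-behaved locks keep the acceleration and locality benefits.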

ACS Performance Tradeoffs

• Fewer threads vs. accelerated critical sections
  – Accelerating critical sections offsets the loss in throughput
  – As the number of cores (threads) on chip increases:
    · The fractional loss in parallel performance decreases
    · Increased contention for critical sections makes acceleration more beneficial
• Overhead of CSCALL/CSDONE vs. better lock locality
  – ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
• More cache misses for private data vs. fewer misses for shared data

Cache Misses for Private Data

  PriorityHeap.insert(NewSubProblems)

• Private data: NewSubProblems
• Shared data: the priority heap
(Puzzle benchmark)

ACS Performance Tradeoffs

• Fewer threads vs. accelerated critical sections
  – Accelerating critical sections offsets the loss in throughput
  – As the number of cores (threads) on chip increases:
    · The fractional loss in parallel performance decreases
    · Increased contention for critical sections makes acceleration more beneficial
• Overhead of CSCALL/CSDONE vs. better lock locality
  – ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
• More cache misses for private data vs. fewer misses for shared data
  – Cache misses reduce if shared data footprint > private data footprint

ACS Comparison Points

• SCMP: all small cores; conventional locking
• ACMP: one large core (area-equal to 4 small cores) plus small cores; conventional locking
• ACS: ACMP with a CSRB; accelerates critical sections

[figure: chip floorplans of the three configurations built from Niagara-like small cores, with the ACMP/ACS configurations replacing four small cores with one large core]

ACS Simulation Parameters

• Workloads
  – 12 critical-section-intensive applications from various domains
  – 7 use coarse-grain locks and 5 use fine-grain locks
• Simulation parameters:
  – x86 cycle-accurate processor simulator
  – Large core: similar to Pentium-M with 2-way SMT; 2 GHz, out-of-order, 128-entry ROB, 4-wide issue, 12-stage pipeline
  – Small core: similar to Pentium 1; 2 GHz, in-order, 2-wide issue, 5-stage pipeline
  – Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  – On-chip interconnect: bi-directional ring

Workloads with Coarse-Grain Locks

Equal-area comparison; number of threads = best number of threads for each configuration
• Chip area = 16 cores: SCMP = 16 small cores; ACMP/ACS = 1 large + 12 small cores
• Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large + 28 small cores

[figure: execution time of SCMP and ACS normalized to ACMP for ep, is, pagemine, puzzle, qsort, sqlite, tsp, and their geometric mean]

Workloads with Fine-Grain Locks

Equal-area comparison; number of threads = best number of threads for each configuration
• Chip area = 16 cores: SCMP = 16 small cores; ACMP/ACS = 1 large + 12 small cores
• Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large + 28 small cores

[figure: execution time of SCMP and ACS normalized to ACMP for iplookup, oltp-1, oltp-2, specjbb, webcache, and their geometric mean]

Equal-Area Comparisons

Number of threads = number of cores

[figure: speedup over a single small core vs. chip area (in small cores, up to 32) for SCMP and ACS across 12 benchmarks: (a) ep, (b) is, (c) pagemine, (d) puzzle, (e) qsort, (f) tsp, (g) sqlite, (h) iplookup, (i) oltp-1, (j) oltp-2, (k) specjbb, (l) webcache]

ACS on Symmetric CMP

• Majority of the benefit is from the large core

[figure: execution time of ACS on a symmetric CMP (symm. ACS) and ACS, normalized to SCMP, across the 12 benchmarks and their geometric mean]

Alternatives to ACS

• Transactional memory (Herlihy+)
  – ACS does not require code modification
• Transactional Lock Removal (Rajwar+), Speculative Synchronization (Martinez+), Speculative Lock Elision (Rajwar)
  – They hide critical section latency by increasing concurrency; ACS reduces the latency of each critical section
  – They overlap execution of critical sections with no data conflicts; ACS accelerates ALL critical sections
  – They do not improve locality of shared data; ACS improves locality of shared data
• ACS outperforms TLR (Rajwar+) by 18% (details in the ASPLOS 2009 paper)