18 447 Computer Architecture Lecture 32 Heterogeneous Systems

  • Slides: 75
Download presentation
18 -447 Computer Architecture Lecture 32: Heterogeneous Systems Prof. Onur Mutlu Carnegie Mellon University

18 -447 Computer Architecture Lecture 32: Heterogeneous Systems Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 4/20/2015

Where We Are in Lecture Schedule n n n The memory hierarchy Caches, caches,

Where We Are in Lecture Schedule n n n The memory hierarchy Caches, caches, more caches Virtualizing the memory hierarchy: Virtual Memory Main memory: DRAM Main memory control, scheduling Memory latency tolerance techniques Non-volatile memory Multiprocessors Coherence and consistency In-memory computation and predictable performance Multi-core issues (e. g. , heterogeneous multi-core) Interconnection networks 2

First, Some Administrative Things 3

First, Some Administrative Things 3

Midterm II and Midterm II Review n Midterm II is this Friday (April 24,

Midterm II and Midterm II Review n Midterm II is this Friday (April 24, 2015) q q n 12: 30 -2: 30 pm, CIC Panther Hollow Room (4 th floor) Please arrive 5 minutes early and sit with 1 -seat separation Same rules as Midterm I except you get to have 2 cheat sheets Covers all topics we have examined so far, with more focus on Lectures 17 -32 (Memory Hierarchy and Multiprocessors) Midterm II Review is Wednesday (April 22) q q Come prepared with questions on concepts and lectures Detailed homework and exam questions and solutions study on your own and ask TAs during office hours 4

Suggestions for Midterm II n Solve past midterms (and finals) on your own… q

Suggestions for Midterm II n Solve past midterms (and finals) on your own… q q n n n http: //www. ece. cmu. edu/~ece 447/s 15/doku. php? id=exams Do Homework 7 and go over past homeworks. Study and internalize the lecture material well. q n n n And, check your solutions vs. the online solutions Questions will be similar in spirit Buzzwords can help you. Ditto for slides and videos. Understand how to solve all homework & exam questions. Study hard. Also read: https: //piazza. com/class/i 3540 xiz 8 ku 40 a? cid=335 5

Lab 8: Multi-Core Cache Coherence n Due May 3; Last submission accepted on May

Lab 8: Multi-Core Cache Coherence n Due May 3; Last submission accepted on May 10, 11: 59 pm Cycle-level modeling of the MESI cache coherence protocol n Since this is the last lab n q q An automatic extension of 7 days granted for everyone No other late days accepted 6

Reminder on Collaboration on 447 Labs n Reminder of 447 policy: q q n

Reminder on Collaboration on 447 Labs n Reminder of 447 policy: q q n Absolutely no form of collaboration allowed No discussions, no code sharing, no code reviews with fellow students, no brainstorming, … All labs and all portions of each lab has to be your own work q Just focus on doing the lab yourself, alone 7

We Have Another Course for Collaboration n 740 is the next course in sequence

We Have Another Course for Collaboration n 740 is the next course in sequence n n Tentative Time: Lect. MW 7: 30 -9: 20 pm, (Rect. T 7: 30 pm) Content: q q Lectures: More advanced, with a different perspective Recitations: Delving deeper into papers, advanced topics Readings: Many fundamental and research readings; will do many reviews Project: More open ended research project. Proposal milestones final poster and presentation n n q q Done in groups of 1 -3 Focus of the course is the project (and papers) Exams: lighter and fewer Homeworks: None 8

A Note on Testing Your Own Code n We provide the reference simulator to

A Note on Testing Your Own Code n We provide the reference simulator to aid you Do not expect it to be given, and do not rely on it much n In real life, there are no reference simulators n n n n The The The architect architect designs the reference simulator verifies it tests it fixes it makes sure there are no bugs ensures the simulator matches the specification 9

Lab 6 Grade Distribution 10

Lab 6 Grade Distribution 10

Lab 6 Extra Credit Recognitions n Stay tuned… 11

Lab 6 Extra Credit Recognitions n Stay tuned… 11

Lab 4 -5 Special Recognition n Limited out-of-order execution q Terence An 12

Lab 4 -5 Special Recognition n Limited out-of-order execution q Terence An 12

Where We Are in Lecture Schedule n n n The memory hierarchy Caches, caches,

Where We Are in Lecture Schedule n n n The memory hierarchy Caches, caches, more caches Virtualizing the memory hierarchy: Virtual Memory Main memory: DRAM Main memory control, scheduling Memory latency tolerance techniques Non-volatile memory Multiprocessors Coherence and consistency In-memory computation and predictable performance Multi-core issues (e. g. , heterogeneous multi-core) Interconnection networks 13

Today n Heterogeneity (asymmetry) in system design n Evolution of multi-core systems n Handling

Today n Heterogeneity (asymmetry) in system design n Evolution of multi-core systems n Handling serial and parallel bottlenecks better n Heterogeneous multi-core systems 14

Heterogeneity (Asymmetry) 15

Heterogeneity (Asymmetry) 15

Heterogeneity (Asymmetry) Specialization n Heterogeneity and asymmetry have the same meaning q n n

Heterogeneity (Asymmetry) Specialization n Heterogeneity and asymmetry have the same meaning q n n n Contrast with homogeneity and symmetry Heterogeneity is a very general system design concept (and life concept, as well) Idea: Instead of having multiple instances of the same “resource” to be the same (i. e. , homogeneous or symmetric), design some instances to be different (i. e. , heterogeneous or asymmetric) Different instances can be optimized to be more efficient in executing different types of workloads or satisfying different requirements/goals q Heterogeneity enables specialization/customization 16

Why Asymmetry in Design? (I) n Different workloads executing in a system can have

Why Asymmetry in Design? (I) n Different workloads executing in a system can have different behavior q q n Different applications can have different behavior Different execution phases of an application can have different behavior The same application executing at different times can have different behavior (due to input set changes and dynamic events) E. g. , locality, predictability of branches, instruction-level parallelism, data dependencies, serial fraction, bottlenecks in parallel portion, interference characteristics, … Systems are designed to satisfy different metrics at the same time q q There is almost never a single goal in design, depending on design point E. g. , Performance, energy efficiency, fairness, predictability, reliability, availability, cost, memory capacity, latency, bandwidth, … 17

Why Asymmetry in Design? (II) n Problem: Symmetric design is one-size-fits-all It tries to

Why Asymmetry in Design? (II) n Problem: Symmetric design is one-size-fits-all It tries to fit a single-size design to all workloads and metrics n It is very difficult to come up with a single design n q q n that satisfies all workloads even for a single metric that satisfies all design metrics at the same time This holds true for different system components, or resources q q Cores, caches, memory, controllers, interconnect, disks, servers, … Algorithms, policies, … 18

Asymmetry Enables Customization C C C 2 C 1 C C C C 4

Asymmetry Enables Customization C C C 2 C 1 C C C C 4 C 4 C C C 5 C 5 C 5 Symmetric n Asymmetric Symmetric: One size fits all q n C 3 C Energy and performance suboptimal for different “workload” behaviors Asymmetric: Enables customization and adaptation q q Processing requirements vary across workloads (applications and phases) Execute code on best-fit resources (minimal energy, adequate perf. ) 19

We Have Already Seen Examples Before (in 447)design: scalar + vector pipelines n CRAY-1

We Have Already Seen Examples Before (in 447)design: scalar + vector pipelines n CRAY-1 n n n Modern processors: scalar instructions + SIMD extensions Decoupled Access Execute: access + execute processors Thread Cluster Memory Scheduling: different memory scheduling policies for different thread clusters RAIDR: Heterogeneous refresh rate Hybrid memory systems q q q n DRAM + Phase Change Memory Fast, Costly DRAM + Slow, Cheap DRAM Reliable, Costly DRAM + Unreliable, Cheap DRAM … 20

An Example Asymmetric Design: CRAY-1 n n n n Russell, “The CRAY-1 computer system,

An Example Asymmetric Design: CRAY-1 n n n n Russell, “The CRAY-1 computer system, ” CACM 1978. Scalar and vector modes 8 64 -element vector registers 64 bits per element 16 memory banks 8 64 -bit scalar registers 8 24 -bit address registers 21

Remember: Hybrid Memory Systems CPU DRAM Fast, durable Small, leaky, volatile, high-cost DRAM Ctrl

Remember: Hybrid Memory Systems CPU DRAM Fast, durable Small, leaky, volatile, high-cost DRAM Ctrl PCM Ctrl Phase Change Memory (or Tech. X) Large, non-volatile, low-cost Slow, wears out, high active energy Hardware/software manage data allocation and movement to achieve the best of multiple technologies Meza+, “Enabling Efficient and Scalable Hybrid Memories, ” IEEE Comp. Arch. Letters, 2012. Yoon, Meza et al. , “Row Buffer Locality Aware Caching Policies for Hybrid Memories, ” ICCD 2012 Best Paper Award.

Remember: Throughput vs. Fairness Throughput biased approach Prioritize less memory-intensive threads Fairness biased approach

Remember: Throughput vs. Fairness Throughput biased approach Prioritize less memory-intensive threads Fairness biased approach Take turns accessing memory Good for throughput Does not starve thread A less memory intensive thread B thread C higher priority starvation unfairness thread C thread A thread B not prioritized reduced throughput Single policy for all threads is insufficient 23

Remember: Achieving the Best of Both Worlds higher priority thread For Throughput Prioritize memory-non-intensive

Remember: Achieving the Best of Both Worlds higher priority thread For Throughput Prioritize memory-non-intensive threads thread thread For Fairness Unfairness caused by memory-intensive being prioritized over each other • Shuffle thread ranking Memory-intensive threads have different vulnerability to interference • Shuffle asymmetrically 24

Remember: Heterogeneous Retention Times in DRAM 25

Remember: Heterogeneous Retention Times in DRAM 25

Aside: Examples from Life n Heterogeneity is abundant in life q n n n

Aside: Examples from Life n Heterogeneity is abundant in life q n n n n both in nature and human-made components Humans are heterogeneous Cells are heterogeneous specialized for different tasks Organs are heterogeneous Cars are heterogeneous Buildings are heterogeneous Rooms are heterogeneous … 26

General-Purpose vs. Special. Purpose n Asymmetry is a way of enabling specialization n It

General-Purpose vs. Special. Purpose n Asymmetry is a way of enabling specialization n It bridges the gap between purely general purpose and purely special purpose q q q n Purely general purpose: Single design for every workload or metric Purely special purpose: Single design per workload or metric Asymmetric: Multiple sub-designs optimized for sets of workloads/metrics and glued together The goal of a good asymmetric design is to get the best of both general purpose and special purpose 27

Asymmetry Advantages and Disadvantages n Advantages over Symmetric Design + Can enable optimization of

Asymmetry Advantages and Disadvantages n Advantages over Symmetric Design + Can enable optimization of multiple metrics + Can enable better adaptation to workload behavior + Can provide special-purpose benefits with general-purpose usability/flexibility n Disadvantages over Symmetric Design - Higher overhead and more complexity in design, verification - Higher overhead in management: scheduling onto asymmetric components - Overhead in switching between multiple components can lead to degradation 28

Yet Another Example n Modern processors integrate general purpose cores and GPUs q q

Yet Another Example n Modern processors integrate general purpose cores and GPUs q q CPU-GPU systems Heterogeneity in execution models 29

Three Key Problems in Future Systems n Memory system q q n Efficiency (performance

Three Key Problems in Future Systems n Memory system q q n Efficiency (performance and energy) scalability q q n Applications are increasingly data intensive Data storage and movement limits performance & efficiency Enables scalable systems new applications Enables better user experience new usage models Predictability and robustness q q Resource sharing and unreliable hardware causes Qo. S issues Predictable performance and Qo. S are first class constraints s n g i s e D c i r t Asymme s m e l b o r P e s e h T e v l o S p Can Hel 30

Multi-Core Design 31

Multi-Core Design 31

Many Cores on Chip n n Simpler and lower power than a single large

Many Cores on Chip n n Simpler and lower power than a single large core Large scale parallelism on chip Intel Core i 7 AMD Barcelona 8 cores IBM Cell BE IBM POWER 7 Intel SCC Tilera TILE Gx 8+1 cores 8 cores 4 cores Sun Niagara II 8 cores Nvidia Fermi 448 “cores” 48 cores, networked 100 cores, networked 32

With Many Cores on Chip n What we want: q n N times the

With Many Cores on Chip n What we want: q n N times the performance with N times the cores when we parallelize an application on N cores What we get: q q Amdahl’s Law (serial bottleneck) Bottlenecks in the parallel portion 33

Caveats of Parallelism n Amdahl’s Law q q f: Parallelizable fraction of a program

Caveats of Parallelism n Amdahl’s Law q q f: Parallelizable fraction of a program N: Number of processors 1 Speedup = 1 -f q n n + f N Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities, ” AFIPS 1967. Maximum speedup limited by serial portion: Serial bottleneck Parallel portion is usually not perfectly parallel q q q Synchronization overhead (e. g. , updates to shared data) Load imbalance overhead (imperfect parallelization) Resource sharing overhead (contention among N processors) 34

The Problem: Serialized Code Sections n Many parallel programs cannot be parallelized completely n

The Problem: Serialized Code Sections n Many parallel programs cannot be parallelized completely n Causes of serialized code sections q q n Sequential portions (Amdahl’s “serial part”) Critical sections Barriers Limiter stages in pipelined programs Serialized code sections q q q Reduce performance Limit scalability Waste energy 35

Example from My. SQL Asymmetric Critical Section Access Open Tables Cache 8 7 Speedup

Example from My. SQL Asymmetric Critical Section Access Open Tables Cache 8 7 Speedup Open database tables 6 5 4 3 2 Perform the operations …. Today 1 Parallel 0 0 8 16 24 32 Chip Area (cores) 36

Demands in Different Code Sections n What we want: n In a serialized code

Demands in Different Code Sections n What we want: n In a serialized code section one powerful “large” core n In a parallel code section many wimpy “small” cores n These two conflict with each other: q q If you have a single powerful core, you cannot have many cores A small core is much more energy and area efficient than a large core 37

“Large” vs. “Small” Cores Large Core Out-of-order Wide fetch e. g. 4 -wide Deeper

“Large” vs. “Small” Cores Large Core Out-of-order Wide fetch e. g. 4 -wide Deeper pipeline Aggressive branch predictor (e. g. hybrid) • Multiple functional units • Trace cache • Memory dependence speculation • • Small Core • • In-order Narrow Fetch e. g. 2 -wide Shallow pipeline Simple branch predictor (e. g. Gshare) • Few functional units Large Cores are power inefficient: e. g. , 2 x performance for 4 x area (power) 38

Large vs. Small Cores n Grochowski et al. , “Best of both Latency and

Large vs. Small Cores n Grochowski et al. , “Best of both Latency and Throughput, ” ICCD 2004. 39

Meet Large: IBM POWER 4 n n n Tendler et al. , “POWER 4

Meet Large: IBM POWER 4 n n n Tendler et al. , “POWER 4 system microarchitecture, ” IBM J R&D, 2002. A symmetric multi-core chip… Two powerful cores 40

IBM POWER 4 n n n 2 cores, out-of-order execution 100 -entry instruction window

IBM POWER 4 n n n 2 cores, out-of-order execution 100 -entry instruction window in each core 8 -wide instruction fetch, issue, execute Large, local+global hybrid branch predictor 1. 5 MB, 8 -way L 2 cache Aggressive stream based prefetching 41

IBM POWER 5 n Kalla et al. , “IBM Power 5 Chip: A Dual-Core

IBM POWER 5 n Kalla et al. , “IBM Power 5 Chip: A Dual-Core Multithreaded Processor, ” IEEE Micro 2004. 42

Meet Small: Sun Niagara (Ultra. SPARC T 1) A 32 -Way Multithreaded SPARC n

Meet Small: Sun Niagara (Ultra. SPARC T 1) A 32 -Way Multithreaded SPARC n Kongetira et al. , “Niagara: Processor, ” IEEE Micro 2005. 43

Niagara Core n n n 4 -way fine-grain multithreaded, 6 -stage, dual-issue in-order Round

Niagara Core n n n 4 -way fine-grain multithreaded, 6 -stage, dual-issue in-order Round robin thread selection (unless cache miss) Shared FP unit among cores 44

Remember the Demands n What we want: n In a serialized code section one

Remember the Demands n What we want: n In a serialized code section one powerful “large” core n In a parallel code section many wimpy “small” cores n These two conflict with each other: q q n If you have a single powerful core, you cannot have many cores A small core is much more energy and area efficient than a large core Can we get the best of both worlds? 45

Performance vs. Parallelism Assumptions: 1. Small cores takes an area budget of 1 and

Performance vs. Parallelism Assumptions: 1. Small cores takes an area budget of 1 and has performance of 1 2. Large core takes an area budget of 4 and has performance of 2 46

Tile-Large Approach Large core “Tile-Large” Tile a few large cores n IBM Power 5,

Tile-Large Approach Large core “Tile-Large” Tile a few large cores n IBM Power 5, AMD Barcelona, Intel Core 2 Quad, Intel Nehalem + High performance on single thread, serial code sections (2 units) - Low throughput on parallel program portions (8 units) n 47

Tile-Small Approach Small core Small core Small core Small core “Tile-Small” Tile many small

Tile-Small Approach Small core Small core Small core Small core “Tile-Small” Tile many small cores n Sun Niagara, Intel Larrabee, Tilera TILE (tile ultra-small) + High throughput on the parallel part (16 units) - Low performance on the serial part, single thread (1 unit) n 48

Can we get the best of both worlds? n Tile Large + High performance

Can we get the best of both worlds? n Tile Large + High performance on single thread, serial code sections (2 units) - Low throughput on parallel program portions (8 units) n Tile Small + High throughput on the parallel part (16 units) - Low performance on the serial part, single thread (1 unit), reduced single-thread performance compared to existing single thread processors n Idea: Have both large and small on the same chip Performance asymmetry 49

Asymmetric Multi-Core 50

Asymmetric Multi-Core 50

Asymmetric Chip Multiprocessor (ACMP) Large core “Tile-Large” Small core Small core Small core Small

Asymmetric Chip Multiprocessor (ACMP) Large core “Tile-Large” Small core Small core Small core Small core Small core “Tile-Small” Small core Small core Small core Large core ACMP Provide one large core and many small cores + Accelerate serial part using the large core (2 units) + Execute parallel part on small cores and large core for high throughput (12+2 units) n 51

Accelerating Serial Bottlenecks Single thread Large core Small core Small core Small core ACMP

Accelerating Serial Bottlenecks Single thread Large core Small core Small core Small core ACMP Approach 52

Performance vs. Parallelism Assumptions: 1. Small cores takes an area budget of 1 and

Performance vs. Parallelism Assumptions: 1. Small cores takes an area budget of 1 and has performance of 1 2. Large core takes an area budget of 4 and has performance of 2 53

ACMP Performance vs. Parallelism Area-budget = 16 small cores Large core Small Small core

ACMP Performance vs. Parallelism Area-budget = 16 small cores Large core Small Small core core Large core Small core core Small Small Small Small core core core “Tile-Small” ACMP “Tile-Large” Large Cores 4 0 1 Small Cores 0 16 12 Serial Performance 2 1 2 2 x 4=8 1 x 16 = 16 1 x 2 + 1 x 12 = 14 Parallel Throughput 54 54

Amdahl’s Law Modified n n Simplified Amdahl’s Law for an Asymmetric Multiprocessor Assumptions: q

Amdahl’s Law Modified n n Simplified Amdahl’s Law for an Asymmetric Multiprocessor Assumptions: q q q Serial portion executed on the large core Parallel portion executed on both small cores and large cores f: Parallelizable fraction of a program L: Number of large processors S: Number of small processors X: Speedup of a large processor over a small one Speedup = 1 1 -f X + f S + X*L 55

Caveats of Parallelism, Revisited n Amdahl’s Law q q f: Parallelizable fraction of a

Caveats of Parallelism, Revisited n Amdahl’s Law q q f: Parallelizable fraction of a program N: Number of processors 1 Speedup = 1 -f q n n + f N Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities, ” AFIPS 1967. Maximum speedup limited by serial portion: Serial bottleneck Parallel portion is usually not perfectly parallel q q q Synchronization overhead (e. g. , updates to shared data) Load imbalance overhead (imperfect parallelization) Resource sharing overhead (contention among N processors) 56

Accelerating Parallel Bottlenecks n n Serialized or imbalanced execution in the parallel portion can

Accelerating Parallel Bottlenecks n n Serialized or imbalanced execution in the parallel portion can also benefit from a large core Examples: q q n Critical sections that are contended Parallel stages that take longer than others to execute Idea: Dynamically identify these code portions that cause serialization and execute them on a large core 57

Accelerated Critical Sections M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N.

Accelerated Critical Sections M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt, "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures" Proceedings of the 14 th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2009 58

Contention for Critical Sections Critical Section 12 iterations, 33% instructions inside the critical section

Contention for Critical Sections Critical Section 12 iterations, 33% instructions inside the critical section Parallel Idle P=4 P=3 P=2 33% in critical section P=1 0 1 2 3 4 5 6 7 8 9 10 11 12 59

Contention for Critical Sections Critical Section 12 iterations, 33% instructions inside the critical section

Contention for Critical Sections Critical Section 12 iterations, 33% instructions inside the critical section Parallel Idle P=4 Accelerating critical sections increases performance and scalability P=3 Critical Section Accelerated by 2 x P=2 P=1 0 1 2 3 4 5 6 7 8 9 10 11 12 60

Impact of Critical Sections on n Contention for critical sections leads to serial execution

Impact of Critical Sections on n Contention for critical sections leads to serial execution Scalability 8 7 Speedup n (serialization) of threads in the parallel program portion Contention for critical sections increases with the number of threads and limits scalability Asymmetric 6 5 4 3 2 Today 1 0 0 8 16 24 32 Chip Area (cores) My. SQL (oltp-1) 61

A Case for Asymmetry n n Execution time of sequential kernels, critical sections, and

A Case for Asymmetry n n Execution time of sequential kernels, critical sections, and limiter stages must be short It is difficult for the programmer to shorten these serialized sections q q q n n Insufficient domain-specific knowledge Variation in hardware platforms Limited resources Goal: A mechanism to shorten serial bottlenecks without requiring programmer effort Idea: Accelerate serialized code sections by shipping them to powerful cores in an asymmetric multi-core (ACMP) 62

An Example: Accelerated Critical Sections Idea: HW/SW ships critical sections to a large, powerful

An Example: Accelerated Critical Sections Idea: HW/SW ships critical sections to a large, powerful core in an n asymmetric multi-core architecture n Benefit: q q q n n Reduces serialization due to contended locks Reduces the performance impact of hard-to-parallelize sections Programmer does not need to (heavily) optimize parallel code fewer bugs, improved productivity Suleman et al. , “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures, ” ASPLOS 2009, IEEE Micro Top Picks 2010. Suleman et al. , “Data Marshaling for Multi-Core Architectures, ” ISCA 2010, IEEE Micro Top Picks 2011. 63

Accelerated Critical Sections Enter. CS() Priority. Q. insert(…) Leave. CS() 1. P 2 encounters

Accelerated Critical Sections Enter. CS() Priority. Q. insert(…) Leave. CS() 1. P 2 encounters a critical section (CSCALL) 2. P 2 sends CSCALL Request to CSRB 3. P 1 executes Critical Section 4. P 1 sends CSDONE signal Core executing critical section P 1 P 2 Critical Section Request Buffer (CSRB) P 3 P 4 Onchip. Interconnect 64

Accelerated Critical Sections (ACS) Small Core A = compute() PUSH A CSCALL X, Target

Accelerated Critical Sections (ACS) Small Core A = compute() PUSH A CSCALL X, Target PC LOCK X result = CS(A) UNLOCK X print result … … … … Large Core CSCALL Request Send X, TPC, STACK_PTR, CORE_ID … Waiting in Critical Section … Request Buffer … (CSRB) TPC: Acquire X POP A result = CS(A) PUSH result Release X CSRET X CSDONE Response POP result print result n Suleman et al. , “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures, ” ASPLOS 2009. 65

False Serialization n ACS can serialize independent critical sections n Selective Acceleration of Critical

False Serialization n ACS can serialize independent critical sections n Selective Acceleration of Critical Sections (SEL) q Saturating counters to track false serialization To large core A 2 3 4 CSCALL (A) B 5 4 CSCALL (A) Critical Section Request Buffer (CSRB) CSCALL (B) From small cores 66

ACS Performance Tradeoffs n Pluses + Faster critical section execution + Shared locks stay

ACS Performance Tradeoffs n Pluses + Faster critical section execution + Shared locks stay in one place: better lock locality + Shared data stays in large core’s (large) caches: better shared data locality, less ping-ponging n Minuses - Large core dedicated for critical sections: reduced parallel throughput - CSCALL and CSDONE control transfer overhead - Thread-private data needs to be transferred to large core: worse private data locality 67

ACS Performance Tradeoffs n Fewer parallel threads vs. accelerated critical sections q q Accelerating

ACS Performance Tradeoffs n Fewer parallel threads vs. accelerated critical sections q q Accelerating critical sections offsets loss in throughput As the number of cores (threads) on chip increase: n n n Overhead of CSCALL/CSDONE vs. better lock locality q n Fractional loss in parallel performance decreases Increased contention for critical sections makes acceleration more beneficial ACS avoids “ping-ponging” of locks among caches by keeping them at the large core More cache misses for private data vs. fewer misses for shared data 68

Cache Misses for Private Data Priority. Heap. insert(New. Sub. Problems) Private Data: New. Sub.

Cache Misses for Private Data Priority. Heap. insert(New. Sub. Problems) Private Data: New. Sub. Problems Shared Data: The priority heap Puzzle Benchmark 69

ACS Performance Tradeoffs n Fewer parallel threads vs. accelerated critical sections q q Accelerating

ACS Performance Tradeoffs n Fewer parallel threads vs. accelerated critical sections q q Accelerating critical sections offsets loss in throughput As the number of cores (threads) on chip increase: n n n Overhead of CSCALL/CSDONE vs. better lock locality q n Fractional loss in parallel performance decreases Increased contention for critical sections makes acceleration more beneficial ACS avoids “ping-ponging” of locks among caches by keeping them at the large core More cache misses for private data vs. fewer misses for shared data q Cache misses reduce if shared data > private data This problem can be solved 70

ACS Comparison Points Small core Small core Small core Small core Small core Conventional

ACS Comparison Points Small core Small core Small core Small core Small core Conventional locking Small core Small core Small core Large core n n Conventional locking Large core executes Amdahl’s serial part Small core Small core Small core Large core ACMP SCMP n Small core ACS n Large core executes Amdahl’s serial part and critical sections 71

Accelerated Critical Sections: Methodology n Workloads: 12 critical section intensive applications q n Multi-core

Accelerated Critical Sections: Methodology n Workloads: 12 critical section intensive applications q n Multi-core x 86 simulator q q n Data mining kernels, sorting, database, web, networking 1 large and 28 small cores Aggressive stream prefetcher employed at each core Details: q q Large core: 2 GHz, out-of-order, 128 -entry ROB, 4 -wide, 12 -stage Small core: 2 GHz, in-order, 2 -wide, 5 -stage Private 32 KB L 1, private 256 KB L 2, 8 MB shared L 3 On-chip interconnect: Bi-directional ring, 5 -cycle hop latency 72

ACS Performance Chip Area = 32 small cores Equal-area comparison Number of threads =

ACS Performance Chip Area = 32 small cores Equal-area comparison Number of threads = Best threads SCMP = 32 small cores ACMP = 1 large and 28 small cores 269 180 185 Coarse-grain locks Fine-grain locks 73

------ SCMP ------ ACS Equal-Area Comparisons Number of threads = No. of cores Speedup

------ SCMP ------ ACS Equal-Area Comparisons Number of threads = No. of cores Speedup over a small core 3, 5 3 2, 5 2 1, 5 1 0, 5 0 3 5 2, 5 4 2 3 1, 5 2 1 0, 5 1 0 0 0 8 16 24 32 (a) ep (b) is 6 10 5 8 4 2 1 2 0 0 0 8 16 24 32 (c) pagemine (d) puzzle 6 (g) sqlite (h) iplookup 0 8 16 24 32 (e) qsort (f) tsp 12 10 2, 5 10 8 2 8 6 1, 5 6 4 1 4 2 0, 5 2 0 0 0 8 16 24 32 3 2 0 8 16 24 32 14 12 10 8 6 4 2 0 12 4 4 3, 5 3 2, 5 2 1, 5 1 0, 5 0 0 8 16 24 32 8 6 3 7 6 5 4 3 2 1 0 0 8 16 24 32 (i) oltp-1 (i) oltp-2 Chip Area (small cores) 74 0 8 16 24 32 (k) specjbb 0 8 16 24 32 (l) webcache

ACS Summary n n n Critical sections reduce performance and limit scalability Accelerate critical

ACS Summary n n n Critical sections reduce performance and limit scalability Accelerate critical sections by executing them on a powerful core ACS reduces average execution time by: q q n n 34% compared to an equal-area SCMP 23% compared to an equal-area ACMP ACS improves scalability of 7 of the 12 workloads Generalizing the idea: Accelerate all bottlenecks (“critical paths”) by executing them on a powerful core 75