ECE 454 Computer Systems Programming
Parallel Architectures and Performance Implications (I)
Ding Yuan, ECE Dept., University of Toronto
http://www.eecg.toronto.edu/~yuan
Big picture
• We know that we need parallelization
• But will more parallelization always yield better performance?
  [Figure: throughput vs. # of threads]
• After this lecture, you will have a clear understanding of “why”
• Still many open research questions
• Many of the results are from the most recent research papers
Topics
• Basic parallel architectures
• Cache coherence
• Avoiding false sharing
• Performance of cache coherence and its implications for software design (next lecture)
What does “Parallel” Mean?
• A parallel computer:
  • A collection of processing elements
  • Some method to communicate between them
• Communication mechanisms are key
[Figure: processors (P), each with a cache (C), connected to memory (M)]
[Figure: 48-core AMD Opteron]
Data Communication Mechanisms
• Shared Memory (aka shared address space, SAS): focus for now
  • Any processor can directly reference any memory location
  • Communication occurs implicitly through loads and stores
  • Requires separate synchronization mechanisms
• Message Passing: later
  • Communication via explicit I/O operations (messages)
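The contrast between the two mechanisms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `shared_memory_sum` and `message_passing_sum` are hypothetical, and threads within one process stand in for processors. In the first function the workers communicate implicitly by updating the same variable and need a separate lock; in the second they communicate only by explicitly sending a message.

```python
import queue
import threading


def shared_memory_sum(num_threads=4, increments=1000):
    """Workers communicate implicitly through loads and stores to `state`."""
    state = {"total": 0}
    lock = threading.Lock()  # the separate synchronization mechanism

    def worker():
        for _ in range(increments):
            with lock:
                state["total"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["total"]


def message_passing_sum(num_threads=4, increments=1000):
    """Workers keep private data and communicate via explicit messages."""
    mailbox = queue.Queue()

    def worker():
        local = 0
        for _ in range(increments):
            local += 1
        mailbox.put(local)  # explicit send of the partial result

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Explicit receive: one message per worker.
    return sum(mailbox.get() for _ in range(num_threads))
```

Both functions compute the same total; the difference is whether the communication is visible in the code (the `put`/`get` calls) or hidden inside ordinary memory accesses.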
Cache Coherence
The Problem
• Multi-core results in multiple copies of the same data being cached. How do you keep them consistent with each other?
• Consistency (or coherence): “the behavior is equivalent to there being only a single copy of the data except for the performance benefit of the cache.” [Gray and Cheriton 83]
[Figure: a processor with an L1 cache holding blocks (cache lines), e.g. address 0x40130000; below it, the other cache levels, interconnect, and memory holding the same address]
Writes to shared data complicate the problem!
Example: The Cache Coherence Problem
• Thread A: Store X=3
• Thread B: Load X
[Figure: two processors, each with a private cache (tag, data), over shared memory (X=2); one cache holds X=2, the other is empty. What does the load return: X=?]
Problems with the Intuition
• “Reading a location should return latest value written”
• “Latest” is not well-defined!
• Even in the sequential case:
  • “last” is defined in terms of program order, not time
  • Program order is the order of operations in the machine language presented to the processor
  • The processor may execute operations out of order
• In the parallel case:
  • Program order is defined within a thread
  • But we need to make sense of orders across threads
• Must define meaningful semantics
  • “Cache coherence” of a single location
Easy Solution: A Single Shared Cache
• Processors share a single L1 cache
  • Avoids the problem of coherence: no redundancy, no coherence problem
• Useful for very small machines
  + fine-grain sharing and prefetch effects
  - limited cache bandwidth
  - does not scale
[Figure: several processors (P) attached to one cache]
Shared Cache Example: Prefetch Effects
• Thread A: Load X. Miss! The shared cache is filled with X=2 from memory (X=2)
• Thread B: Load X. Hit! Thread A's earlier miss has already brought X into the shared cache
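The prefetch effect can be mimicked with a toy model. This is a minimal sketch, assuming one cache shared by all threads; the class name `SharedCache` and its `log` field are hypothetical and exist only for illustration.

```python
class SharedCache:
    """Toy model of a single cache shared by all processors."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}   # address -> cached value
        self.log = []     # (thread, address, "hit" or "miss")

    def load(self, thread, addr):
        hit = addr in self.lines
        if not hit:
            # Miss: fill the shared cache from memory.
            self.lines[addr] = self.memory[addr]
        self.log.append((thread, addr, "hit" if hit else "miss"))
        return self.lines[addr]


cache = SharedCache({"X": 2})
value_a = cache.load("A", "X")  # Thread A: Load X, misses and fills
value_b = cache.load("B", "X")  # Thread B: Load X, hits on A's fill
```

After these two loads, `cache.log` records a miss for thread A followed by a hit for thread B on the same address.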
• Modern processors have a private L1 cache per core
[Figure: processors (P) with private (L1) caches (C) over memory (M); labels: dual-core, motherboard, SMP (symmetric multiprocessing)]
Coherence problem
• Initially: both private caches are empty; shared memory holds X=2
• Thread B: Load X. B's cache sends a Read to memory and is filled with X=2
• Thread A: Store X=3. A's cache now holds X=3
• Inconsistency! A's cache holds X=3, while B's cache and shared memory still hold X=2
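The inconsistency can be reproduced with a toy model of private write-back caches that have no coherence protocol at all. This is a sketch under that assumption; `PrivateCache` is a hypothetical name, and a Python dict stands in for shared memory.

```python
class PrivateCache:
    """Toy private write-back cache with NO coherence protocol."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> locally cached value

    def load(self, addr):
        if addr not in self.lines:
            # Miss: fill from shared memory.
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        # The write stays in this cache; nobody else is told.
        self.lines[addr] = value


memory = {"X": 2}
cache_a = PrivateCache(memory)
cache_b = PrivateCache(memory)

first_read = cache_b.load("X")   # Thread B: Load X, fills X=2
cache_a.store("X", 3)            # Thread A: Store X=3
stale_read = cache_b.load("X")   # Thread B hits on its old copy
```

Thread B's second load hits in its own cache and returns the old value, which is exactly the situation a coherence protocol has to rule out.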
How to solve it?
• Add three states to each cache line
  • Dirty: you have written to the cache line
    • Inconsistent with the primary storage (RAM or last-level cache)
    • Cannot share with other cores
  • Read-only: you are reading the cache line
    • Consistent with the primary storage
    • Can share with other cores
  • Invalid: the line holds no valid copy
• Also called the MSI (modified, shared, invalid) protocol
  • Modified = Dirty
  • Shared = Read-only
  • Invalid
MSI Coherence
• Initially: both caches are Invalid; shared memory holds X=2
• Load X: the loading core sends a Read to memory, its cache is filled with X=2, and the line is marked Shared
• Store X=3 on the other core: it sends an Invalidation, which invalidates all other copies, so the loader's line becomes Invalid
• The storing core now holds X=3 in state Modified; shared memory still holds X=2 (stale)
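The MSI walk-through can be replayed with a small simulation. This is a simplified sketch, not a hardware-accurate model: it tracks a single cache line, the names `MSISystem`, `load`, and `store` are hypothetical, and it assumes thread A (index 0) is the one storing and thread B (index 1) the one loading.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class MSISystem:
    """Toy MSI model of one cache line across private write-back caches."""

    def __init__(self, num_caches, initial_value):
        self.memory = initial_value
        self.state = [State.INVALID] * num_caches
        self.data = [None] * num_caches
        self.invalidation_requests = 0

    def load(self, cache):
        if self.state[cache] is State.INVALID:
            # A Modified copy elsewhere must be written back first.
            for i, s in enumerate(self.state):
                if s is State.MODIFIED:
                    self.memory = self.data[i]
                    self.state[i] = State.SHARED
            self.data[cache] = self.memory
            self.state[cache] = State.SHARED
        return self.data[cache]

    def store(self, cache, value):
        if self.state[cache] is not State.MODIFIED:
            # Sent even when no other cache holds a copy.
            self.invalidation_requests += 1
            for i in range(len(self.state)):
                if i != cache:
                    if self.state[i] is State.MODIFIED:
                        self.memory = self.data[i]
                    self.state[i] = State.INVALID
                    self.data[i] = None
        self.data[cache] = value
        self.state[cache] = State.MODIFIED


A, B = 0, 1
system = MSISystem(num_caches=2, initial_value=2)
loaded = system.load(B)                  # Load X: B becomes Shared
system.store(A, 3)                       # Store X=3: B is invalidated
state_after_store = list(system.state)   # A Modified, B Invalid
memory_after_store = system.memory       # still the stale value
reloaded = system.load(B)                # forces A's write-back
```

Reloading X on the invalidated core is not part of the slides; it is included to show that the stale copy can no longer be read.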
Improving MSI
• MSI works, but how about performance?
• What if there are a lot of reads and writes on a cache line that only exists in one cache?
  • Every write to a line in the Shared state generates an invalidation request, even when no other cache holds a copy!
• Solution: add an Exclusive state
  • MSI -> MESI (aka Illinois protocol)
  • Exclusive: the current core has the only copy of that block, and it is consistent with memory
  • A write to a line in the Exclusive state will NOT generate an invalidation request
MESI Details: Writing
• If attempting to write to a block that is “shared” or “invalid”
  • Called a “write miss”
  • Must first get the block in exclusive state before writing to it
  • Generates a read-exclusive request
    • Causes invalidations to be sent to any cache with a copy
    • Completes when there are no more valid copies
  • Can then perform the write and enter the “modified” state
• MESI is typically built on caches that are “write-back”
  • Writes don’t propagate beyond the cache
  • If a “modified” block is replaced, write it back to the next level
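The write rules above can be sketched as a toy MESI model of a single cache line. It is a simplification under stated assumptions: `MESILine` and its counter `read_exclusive_requests` are hypothetical names, there is no interconnect timing, and a write-back is modeled as a plain assignment to `memory`.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class MESILine:
    """Toy MESI model of one cache line across private write-back caches."""

    def __init__(self, num_caches, initial_value):
        self.memory = initial_value
        self.state = [State.INVALID] * num_caches
        self.data = [None] * num_caches
        self.read_exclusive_requests = 0

    def _others(self, cache):
        return [i for i in range(len(self.state)) if i != cache]

    def load(self, cache):
        if self.state[cache] is State.INVALID:
            holders = [i for i in self._others(cache)
                       if self.state[i] is not State.INVALID]
            for i in holders:
                if self.state[i] is State.MODIFIED:
                    self.memory = self.data[i]   # write back dirty data
                self.state[i] = State.SHARED
            self.data[cache] = self.memory
            # Only copy in the system: Exclusive; otherwise Shared.
            self.state[cache] = State.SHARED if holders else State.EXCLUSIVE
        return self.data[cache]

    def store(self, cache, value):
        if self.state[cache] in (State.SHARED, State.INVALID):
            # "Write miss": issue read-exclusive, invalidate other copies.
            self.read_exclusive_requests += 1
            for i in self._others(cache):
                if self.state[i] is State.MODIFIED:
                    self.memory = self.data[i]
                self.state[i] = State.INVALID
                self.data[i] = None
        # Exclusive or Modified: the write stays local, no request.
        self.data[cache] = value
        self.state[cache] = State.MODIFIED


line = MESILine(num_caches=2, initial_value=2)
line.load(0)                                   # only copy: Exclusive
line.store(0, 3)                               # Exclusive -> Modified
requests_after_exclusive_write = line.read_exclusive_requests
line.load(1)                                   # write-back, both Shared
line.store(1, 4)                               # Shared: write miss
```

The first store finds the line Exclusive and completes without any request; the second store finds it Shared and has to issue a read-exclusive, which invalidates the other copy.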
Cache Coherence Example 1
Example 1: MESI Coherence
• Initially: both caches are Invalid; shared memory holds X=2
• First Load X: the core sends a Read and its cache is filled with X=2. No other cache has a copy, so the line is Exclusive
• Second Load X, from the other thread: its core sends a Read; the first cache is notified and changes from Exclusive to Shared
• The second cache is filled with X=2 in state Shared; both caches now hold X=2 as Shared, consistent with memory
Cache Coherence Example 2
Example 2: MESI Coherence
• Initially: both caches are Invalid; shared memory holds X=2
• Load X: the core sends a Read and its cache is filled with X=2 in state Exclusive
• Store X=3 on the other thread: its core issues a Read-Exclusive; read-exclusive invalidates all other copies, so the loader's line becomes Invalid
• The storing cache is filled and holds X=3 in state Dirty; shared memory still holds X=2 (out of date)
• The state ‘dirty’ implies exclusiveness
Cache Coherence Example 3
Example 3: Basic Coherence
• Initially: neither cache holds X; shared memory holds X=2
• Thread A: Store X=5. A's cache issues a Read-Exclusive, is filled, and holds X=5 in state Dirty; shared memory is out of date
• Thread B: Load X. B's cache sends a Read, and the read request reaches A's cache, which holds the Dirty copy
• A's cache updates shared memory (X=5) and changes to Shared
• B's Read is then filled with X=5
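The three examples can also be read as walks through one per-cache state table. The table below is a sketch: the event names (`local_read_alone`, `remote_write`, and so on) are hypothetical labels chosen for this illustration, and side effects such as write-backs are left out so that only the state changes remain.

```python
# Per-cache MESI transitions: (current state, event) -> next state.
# "local" events come from this cache's own core, "remote" events
# are requests observed from another core.
MESI = {
    ("I", "local_read_alone"): "E",    # no other cache has a copy
    ("I", "local_read_shared"): "S",   # some other cache has a copy
    ("I", "local_write"): "M",         # after a read-exclusive
    ("E", "local_write"): "M",         # silent upgrade, no request
    ("E", "remote_read"): "S",
    ("E", "remote_write"): "I",
    ("S", "local_write"): "M",         # after a read-exclusive
    ("S", "remote_read"): "S",
    ("S", "remote_write"): "I",
    ("M", "remote_read"): "S",         # after writing the block back
    ("M", "remote_write"): "I",        # after writing the block back
}


def replay(events, state="I"):
    """Return the list of states one cache passes through."""
    trace = [state]
    for event in events:
        # Pairs not listed in the table leave the state unchanged.
        state = MESI.get((state, event), state)
        trace.append(state)
    return trace


# Example 1: two loads.
example1_first = replay(["local_read_alone", "remote_read"])
example1_second = replay(["remote_read", "local_read_shared"])
# Example 2: a load, then a store from the other thread.
example2_loader = replay(["local_read_alone", "remote_write"])
example2_writer = replay(["remote_read", "local_write"])
# Example 3: a store, then a load from the other thread.
example3_a = replay(["local_write", "remote_read"])
example3_b = replay(["remote_write", "local_read_shared"])
```

Each trace starts in Invalid, so `example3_a` passes through Invalid, Modified, Shared, matching the Dirty-then-Share sequence for thread A's cache in Example 3.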
Avoiding False Sharing
False Sharing
• X and Y are on the same cache line; shared memory starts with X=0, Y=0
• Thread A: Store X=5. A's cache issues a Read-Exclusive, is filled, and holds (X, Y) = (5, 0) in state Dirty; memory is out of date
• Thread B: Store Y=2. B's cache issues a Read-Exclusive, which sends an Invalidation to A's cache
• A's cache writes the line back (update: memory X=5, Y=0) and becomes Invalid
• B's cache is filled and holds (X, Y) = (5, 2) in state Dirty; memory is out of date again
• With Thread A running while(1) Store X=5 and Thread B running while(1) Store Y=2, the X, Y cache line will ping-pong back & forth
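A rough count of how often the line changes hands shows the cost of the ping-pong. This is a toy model under stated assumptions: a 64-byte cache line, strictly alternating stores by the two threads, and one transfer charged whenever a thread stores to a line it does not currently own. `count_line_transfers` is a hypothetical helper, not a measurement tool.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; hardware dependent)


def count_line_transfers(addr_x, addr_y, iterations):
    """Thread A stores to addr_x, thread B to addr_y, alternating."""
    owner = {}      # cache line index -> thread holding it as Modified
    transfers = 0
    for _ in range(iterations):
        for thread, addr in (("A", addr_x), ("B", addr_y)):
            line = addr // LINE_SIZE
            if owner.get(line) != thread:
                # Read-exclusive plus invalidation of the old owner.
                transfers += 1
                owner[line] = thread
    return transfers


same_line = count_line_transfers(0, 8, 1000)        # X and Y share a line
separate_lines = count_line_transfers(0, 64, 1000)  # X and Y one line apart
```

In this model every store misses when X and Y share a line, while with separate lines each thread pays for a single initial transfer and then writes locally.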
False Sharing Summary
• Cache blocks attempt to exploit locality
  • Multiple data elements move together on one block
• Cache blocks can cause “false sharing”
  • 2 threads accessing distinct locations on one block
  • Block will ping-pong as if both were accessing the same location
• Avoid false sharing by careful data arrangement
  • Ensure that elements accessed separately by different threads are mapped to separate cache blocks
  • E.g., insert padding (unused data between shared items)
• Whose responsibility? compiler? malloc? programmer?
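The padding idea can be checked by inspecting struct layouts with Python's `ctypes` module. This is a sketch under stated assumptions: a 64-byte cache line, and a struct that starts on a cache-line boundary (which the allocator does not guarantee by default). The names `Unpadded`, `Padded`, and `on_same_line` are hypothetical.

```python
import ctypes

CACHE_LINE = 64  # bytes (assumed; hardware dependent)


class Unpadded(ctypes.Structure):
    # x and y are adjacent, so they land on the same cache line.
    _fields_ = [("x", ctypes.c_int64), ("y", ctypes.c_int64)]


class Padded(ctypes.Structure):
    # Unused bytes after x push y onto the next cache line.
    _fields_ = [
        ("x", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_int64))),
        ("y", ctypes.c_int64),
    ]


def on_same_line(struct_type):
    """True if x and y share a cache line, assuming the struct itself
    starts on a cache-line boundary."""
    return (struct_type.x.offset // CACHE_LINE
            == struct_type.y.offset // CACHE_LINE)


unpadded_shares = on_same_line(Unpadded)
padded_shares = on_same_line(Padded)
```

The padded layout trades memory for isolation: the struct grows from 16 to 72 bytes so that the two fields no longer move together on one block.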