
ECE 454 Computer Systems Programming: Parallel Architectures and Performance Implications (II). Ding Yuan, ECE Dept., University of Toronto. http://www.eecg.toronto.edu/~yuan

What we already learnt
• How to benefit from multi-cores by parallelizing a sequential program into a multi-threaded program
• Watch out for locks: atomic regions are serialized
• Use fine-grained locks, and avoid locking if possible (see the per-bucket sketch below)
• But are these all?
  • As long as you do the above, will your multi-threaded program run Nx faster on an N-core machine?
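A minimal sketch of the fine-grained-locking idea mentioned above: a hash table whose buckets each carry their own mutex, so threads touching different buckets do not serialize on one global lock. The layout and names (NBUCKETS, table_init, put) are illustrative assumptions, not from the lecture.

    #include <pthread.h>
    #include <stddef.h>

    #define NBUCKETS 64

    struct node { int key, value; struct node *next; };

    struct bucket {
        pthread_mutex_t lock;    /* one lock per bucket: fine-grained */
        struct node *head;
    } table[NBUCKETS];

    void table_init(void) {
        for (int i = 0; i < NBUCKETS; i++) {
            pthread_mutex_init(&table[i].lock, NULL);
            table[i].head = NULL;
        }
    }

    /* Insert a caller-allocated node; only its bucket is serialized. */
    void put(struct node *n) {
        struct bucket *b = &table[(unsigned)n->key % NBUCKETS];
        pthread_mutex_lock(&b->lock);
        n->next = b->head;
        b->head = n;
        pthread_mutex_unlock(&b->lock);
    }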

Putting it all together [1]
• Performance implications for parallel architecture
• Background: architecture of the two testing machines
• Cache-coherence performance and implications for parallel software design

[1] Everything you always wanted to know about synchronization but were afraid to ask. David et al., SOSP'13

Two case studies
• 48-core AMD Opteron
• 80-core Intel Xeon
Question to keep in mind: which machine would you use?

48-core AMD Opteron
[Diagram: 8 dies of 6 cores each (each socket contains 2 dies); every core has a private L1; each die has its own Last Level Cache; dies communicate over a cross-die interconnect on the motherboard to RAM]
• LLC NOT shared
• Directory-based cache coherence

80-core Intel Xeon
[Diagram: 8 sockets of 10 cores each; every core has a private L1; the 10 cores on a die share a Last Level Cache; sockets communicate over a cross-socket interconnect on the motherboard to RAM]
• LLC shared
• Snooping-based cache coherence

Interconnect between sockets
• Cross-socket communication can take 2 hops

Performance of memory operations

Local caches and memory latencies
• Memory access to a line cached locally (cycles)
  • Best case: L1 < 10 cycles (remember this)
  • Worst case: RAM 136–355 cycles (remember this)
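These numbers are measurements reported in the paper; as background, per-load latency of this kind is often estimated with a dependent pointer-chase microbenchmark. The sketch below illustrates that technique only; it is not the paper's benchmark, and the array size, iteration count, and nanosecond (rather than cycle) reporting are assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N      (64UL * 1024 * 1024 / sizeof(void *))  /* ~64 MB: larger than the LLC */
    #define ITERS  10000000UL

    int main(void) {
        void **arr = malloc(N * sizeof(void *));
        size_t *idx = malloc(N * sizeof(size_t));
        if (!arr || !idx) return 1;

        /* Build a random cyclic permutation so the prefetcher cannot guess the next address. */
        for (size_t i = 0; i < N; i++) idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {               /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++)
            arr[idx[i]] = &arr[idx[(i + 1) % N]];          /* each slot points at the next one */

        struct timespec t0, t1;
        void **p = &arr[idx[0]];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < ITERS; i++)
            p = (void **)*p;                               /* dependent loads: one miss at a time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg %.1f ns per load (p = %p)\n", ns / ITERS, (void *)p);  /* printing p defeats dead-code elimination */
        free(arr); free(idx);
        return 0;
    }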

Latency of remote access: read (cycles)
[Table: read latency in cycles, per machine; "State" is the MESI state of the cache line in a remote cache (the local state is Invalid)]
• Cross-socket communication is expensive!
  • Xeon: loading from the Shared state is 7.5 times more expensive over two hops than within a socket
  • Opteron: cross-socket latency is even larger than RAM
• Opteron: uniform latency regardless of the cache state
  • Directory-based protocol (the directory is distributed across all LLCs; here we assume the directory lookup stays on the same die)
• Xeon: loads from the "Shared" state are much faster than from the "M" and "E" states
  • "Shared"-state reads are served from the LLC instead of from the remote cache

Latency of remote access: write (cycles)
[Table: write latency in cycles, per machine; "State" is the MESI state of the cache line in a remote cache]
• Cross-socket communication is expensive!
• Opteron: a store to a "Shared" cache line is much more expensive
  • The directory-based protocol is incomplete: it does not keep track of the sharers, so an invalidation is equivalent to a broadcast and has to wait for all invalidations to complete
• Xeon: store latency is similar regardless of the previous cache line state
  • Snooping-based coherence

How about synchronization?

Synchronization implementation
• Hardware support is required to implement synchronization primitives
  • In the form of atomic instructions
  • Common examples include: test-and-set, compare-and-swap, etc.
• Used to implement high-level synchronization primitives
  • e.g., lock/unlock, semaphores, barriers, condition variables, etc.
• We will only discuss test-and-set
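The slide names compare-and-swap without showing it; as an illustrative aside (not from the lecture), this is how CAS is exposed through C11 atomics and how it can implement a lock-free counter increment.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Retry the CAS until no other thread changed *ctr between our load and our swap. */
    void atomic_increment(atomic_int *ctr) {
        int old = atomic_load(ctr);
        while (!atomic_compare_exchange_weak(ctr, &old, old + 1))
            ;   /* on failure, old is refreshed with the current value */
    }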

Test-And-Set
• The semantics of test-and-set are:
  • Record the old value
  • Set the value to TRUE (this is a write!)
  • Return the old value
• Hardware executes it atomically!

    bool test_and_set(bool *flag) {   /* the whole body executes atomically */
        bool old = *flag;
        *flag = true;
        return old;
    }

• Hardware implementation:
  • Read-exclusive (invalidations)
  • Modify (change state)
  • Memory barrier
    • completes all the memory operations before this TAS
    • cancels all the memory operations after this TAS
• When executing test-and-set on "flag":
  • What is the value of flag afterwards if it was initially False? True?
  • What is the return result if flag was initially False? True?
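For context (not part of the slide), the same semantics are available portably in standard C: C11's atomic_flag is essentially a hardware test-and-set behind a small API. The function names here are illustrative.

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag flag = ATOMIC_FLAG_INIT;        /* starts clear (False) */

    bool tas_example(void) {
        /* Atomically sets the flag and returns its previous value:
         * record old, set to true, return old. */
        return atomic_flag_test_and_set(&flag);
    }

    void clear_example(void) {
        atomic_flag_clear(&flag);               /* back to False */
    }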

Using Test-And-Set
• Here is our lock implementation with test-and-set:

    struct lock { int held; };   /* held == 0 means the lock is free */
    void acquire(struct lock *lock) { while (test_and_set(&lock->held)) ; /* spin */ }
    void release(struct lock *lock) { lock->held = 0; }

• When will the while return? What is the value of held?
• Does it work? What about multiprocessors?
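As a usage example, here is a compilable sketch of this spinlock protecting a shared counter. It assumes GCC/Clang's __sync_lock_test_and_set builtin as the TAS primitive; the thread count, iteration count, and names such as worker are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    struct lock { volatile int held; };
    struct lock l = { 0 };
    long counter = 0;

    void acquire(struct lock *lk) { while (__sync_lock_test_and_set(&lk->held, 1)) ; }
    void release(struct lock *lk) { __sync_lock_release(&lk->held); }  /* sets held back to 0 */

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            acquire(&l);     /* the while() returns once TAS sees held == 0 */
            counter++;       /* critical section */
            release(&l);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* 2000000 if the lock works */
        return 0;
    }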

TAS and cache coherence (step 1)
[Diagram: Thread A executes acquire(lock); its TAS issues a Read-Exclusive request for the cache line holding lock->held, which currently exists only in shared memory with lock->held = 0]

TAS and cache coherence (step 2)
[Diagram: the line is filled into Thread A's cache in the Dirty state with lock->held = 1 (written by the TAS); shared memory still holds the stale lock->held = 0]

TAS and cache coherence (step 3)
[Diagram: Thread B now executes acquire(lock); its TAS issues a Read-Exclusive request, which sends an invalidation to Thread A's Dirty copy (lock->held = 1)]

TAS and cache coherence (step 4)
[Diagram: Thread A's copy becomes Invalid; its dirty value is written back, updating shared memory to lock->held = 1]

TAS and cache coherence (step 5)
[Diagram: the line is filled into Thread B's cache in the Dirty state with lock->held = 1; Thread A's copy remains Invalid]

What if there are contentions?
[Diagram: Thread A and Thread B both spin in while(TAS(lock)) ; with shared memory holding lock->held = 1; because every TAS is a write, the lock's cache line keeps bouncing between the two processors' caches]

How bad can it be?
[Chart: latency of TAS vs. a plain Store under contention]
• Recall: TAS essentially is a Store + Memory Barrier
• Takeaway: heavy lock contention may lead to worse performance than serializing the execution!

How to optimize?
• When the lock is being held, a contending "acquire" keeps modifying the lock variable to 1
  • Not necessary! Test the lock with a plain read before attempting the TAS (test-and-test-and-set, see the sketch below):

    void acquire(struct lock *lock) {
        do {
            while (lock->held == 1)
                ;                              /* spin on a plain read */
        } while (test_and_set(&lock->held));   /* TAS only when the lock looks free */
    }
    void release(struct lock *lock) { lock->held = 0; }
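For completeness, a sketch of the same test-and-test-and-set lock written with C11 atomics; the memory-order choices follow common practice and are an assumption, not something the slide specifies.

    #include <stdatomic.h>

    typedef struct { atomic_int held; } tatas_lock;

    void tatas_acquire(tatas_lock *lk) {
        do {
            /* Spin on a plain read: the line stays in the Shared state, no coherence traffic. */
            while (atomic_load_explicit(&lk->held, memory_order_relaxed) == 1)
                ;
        } while (atomic_exchange_explicit(&lk->held, 1, memory_order_acquire));
        /* Only the final attempt performs the invalidating atomic write. */
    }

    void tatas_release(tatas_lock *lk) {
        atomic_store_explicit(&lk->held, 0, memory_order_release);
    }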

What if there are contentions? (test-and-test-and-set, step 1)
[Diagram: Thread A holds the lock, its cache has the line in the Dirty state with lock->held = 1; Thread B spins on the plain read while(lock->held == 1) and issues a Read request; shared memory still holds the stale lock->held = 0]

What if there are contentions? (test-and-test-and-set, step 2)
[Diagram: Thread B's read downgrades Thread A's copy to the Shared state and updates shared memory to lock->held = 1; Thread B's cache receives the line in the Shared state with lock->held = 1]

What if there are contentions? (test-and-test-and-set, step 3)
[Diagram: both caches now hold the line in the Shared state with lock->held = 1]
• Repeated reads to a "Shared" cache line: no cache coherence traffic!

Let’s put everything together
[Chart: latency of TAS, Load, and Write operations compared with local access]

Implications to programmers
• Cache coherence is expensive (more than you thought)
  • Avoid unnecessary sharing (e.g., false sharing)
  • Avoid unnecessary coherence (e.g., TAS -> TATAS)
• Clear understanding of the performance
  • Crossing sockets is a killer
    • Can be slower than running the same program on a single core!
  • pthread provides a CPU affinity mask (see the sketch below)
    • Pin cooperative threads on cores within the same die
  • Loads and stores can be as expensive as atomic operations
• Programming gurus understand the hardware
  • So do you now!
  • Have fun hacking!
More details in "Everything you always wanted to know about synchronization but were afraid to ask", David et al., SOSP'13
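Two small sketches tied to the bullets above, both illustrative rather than from the lecture: pinning the calling thread to a core with pthread's Linux-specific affinity interface, and padding per-thread counters so neighbours do not share a cache line. The 64-byte line size and all names are assumptions.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to one core (Linux-specific, non-portable). */
    int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);                  /* allow only this core */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* Avoid false sharing: give each per-thread counter its own cache line
     * (64 bytes is an assumed line size). */
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];
    };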