
ECE 454 Computer Systems Programming: Parallel Architectures and Performance Implications (II). Ding Yuan, ECE Dept., University of Toronto. http://www.eecg.toronto.edu/~yuan

What we already learnt
• How to benefit from multi-cores by parallelizing a sequential program into a multi-threaded program
• Watch out for locks: atomic regions are serialized
• Use fine-grained locks, and avoid locking if possible (see the per-bucket sketch below)
• But are these all?
  • As long as you do the above, will your multi-threaded program run Nx faster on an N-core machine?
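A minimal sketch of the fine-grained-locking idea mentioned above: a hash table whose buckets each carry their own mutex, so threads touching different buckets do not serialize on one global lock. The layout and names (NBUCKETS, table_init, put) are illustrative assumptions, not from the lecture.

    #include <pthread.h>
    #include <stddef.h>

    #define NBUCKETS 64

    struct node { int key, value; struct node *next; };

    struct bucket {
        pthread_mutex_t lock;    /* one lock per bucket: fine-grained */
        struct node *head;
    } table[NBUCKETS];

    void table_init(void) {
        for (int i = 0; i < NBUCKETS; i++) {
            pthread_mutex_init(&table[i].lock, NULL);
            table[i].head = NULL;
        }
    }

    /* Insert a caller-allocated node; only its bucket is serialized. */
    void put(struct node *n) {
        struct bucket *b = &table[(unsigned)n->key % NBUCKETS];
        pthread_mutex_lock(&b->lock);
        n->next = b->head;
        b->head = n;
        pthread_mutex_unlock(&b->lock);
    }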

Putting it all together [1]
• Performance implications for parallel architecture
• Background: architecture of the two testing machines
• Cache-coherence performance and implications for parallel software design

[1] Everything you always wanted to know about synchronization but were afraid to ask. David et al., SOSP'13

Two case studies
• 48-core AMD Opteron
• 80-core Intel Xeon
Question to keep in mind: which machine would you use?

48-core AMD Opteron
[Diagram: 8 dies of 6 cores each (each socket contains 2 dies); every core has a private L1; each die has its own Last Level Cache; dies communicate over a cross-die interconnect on the motherboard to RAM]
• LLC NOT shared
• Directory-based cache coherence

80-core Intel Xeon
[Diagram: 8 sockets of 10 cores each; every core has a private L1; the 10 cores on a die share a Last Level Cache; sockets communicate over a cross-socket interconnect on the motherboard to RAM]
• LLC shared
• Snooping-based cache coherence

Interconnect between sockets
• Cross-socket communication can take 2 hops

Performance of memory operations

Local caches and memory latencies
• Memory access to a line cached locally (cycles)
  • Best case: L1 < 10 cycles (remember this)
  • Worst case: RAM 136–355 cycles (remember this)
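These numbers are measurements reported in the paper; as background, per-load latency of this kind is often estimated with a dependent pointer-chase microbenchmark. The sketch below illustrates that technique only; it is not the paper's benchmark, and the array size, iteration count, and nanosecond (rather than cycle) reporting are assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N      (64UL * 1024 * 1024 / sizeof(void *))  /* ~64 MB: larger than the LLC */
    #define ITERS  10000000UL

    int main(void) {
        void **arr = malloc(N * sizeof(void *));
        size_t *idx = malloc(N * sizeof(size_t));
        if (!arr || !idx) return 1;

        /* Build a random cyclic permutation so the prefetcher cannot guess the next address. */
        for (size_t i = 0; i < N; i++) idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {               /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++)
            arr[idx[i]] = &arr[idx[(i + 1) % N]];          /* each slot points at the next one */

        struct timespec t0, t1;
        void **p = &arr[idx[0]];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < ITERS; i++)
            p = (void **)*p;                               /* dependent loads: one miss at a time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg %.1f ns per load (p = %p)\n", ns / ITERS, (void *)p);  /* printing p defeats dead-code elimination */
        free(arr); free(idx);
        return 0;
    }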

Latency of remote access: read (cycles)
[Table: read latency in cycles, per machine; "State" is the MESI state of the cache line in a remote cache (the local state is Invalid)]
• Cross-socket communication is expensive!
  • Xeon: loading from the Shared state is 7.5 times more expensive over two hops than within a socket
  • Opteron: cross-socket latency is even larger than RAM
• Opteron: uniform latency regardless of the cache state
  • Directory-based protocol (the directory is distributed across all LLCs; here we assume the directory lookup stays on the same die)
• Xeon: loads from the "Shared" state are much faster than from the "M" and "E" states
  • "Shared"-state reads are served from the LLC instead of from the remote cache

Latency of remote access: write (cycles)
[Table: write latency in cycles, per machine; "State" is the MESI state of the cache line in a remote cache]
• Cross-socket communication is expensive!
• Opteron: a store to a "Shared" cache line is much more expensive
  • The directory-based protocol is incomplete: it does not keep track of the sharers, so an invalidation is equivalent to a broadcast and has to wait for all invalidations to complete
• Xeon: store latency is similar regardless of the previous cache line state
  • Snooping-based coherence

How about synchronization?

Synchronization implementation
• Hardware support is required to implement synchronization primitives
  • In the form of atomic instructions
  • Common examples include: test-and-set, compare-and-swap, etc.
• Used to implement high-level synchronization primitives
  • e.g., lock/unlock, semaphores, barriers, condition variables, etc.
• We will only discuss test-and-set
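The slide names compare-and-swap without showing it; as an illustrative aside (not from the lecture), this is how CAS is exposed through C11 atomics and how it can implement a lock-free counter increment.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Retry the CAS until no other thread changed *ctr between our load and our swap. */
    void atomic_increment(atomic_int *ctr) {
        int old = atomic_load(ctr);
        while (!atomic_compare_exchange_weak(ctr, &old, old + 1))
            ;   /* on failure, old is refreshed with the current value */
    }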

Test-And-Set
• The semantics of test-and-set are:
  • Record the old value
  • Set the value to TRUE (this is a write!)
  • Return the old value
• Hardware executes it atomically!

    bool test_and_set(bool *flag) {   /* the whole body executes atomically */
        bool old = *flag;
        *flag = true;
        return old;
    }

• Hardware implementation:
  • Read-exclusive (invalidations)
  • Modify (change state)
  • Memory barrier
    • completes all the memory operations before this TAS
    • cancels all the memory operations after this TAS
• When executing test-and-set on "flag":
  • What is the value of flag afterwards if it was initially False? True?
  • What is the return result if flag was initially False? True?
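For context (not part of the slide), the same semantics are available portably in standard C: C11's atomic_flag is essentially a hardware test-and-set behind a small API. The function names here are illustrative.

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag flag = ATOMIC_FLAG_INIT;        /* starts clear (False) */

    bool tas_example(void) {
        /* Atomically sets the flag and returns its previous value:
         * record old, set to true, return old. */
        return atomic_flag_test_and_set(&flag);
    }

    void clear_example(void) {
        atomic_flag_clear(&flag);               /* back to False */
    }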

Using Test-And-Set
• Here is our lock implementation with test-and-set:

    struct lock { int held; };   /* held == 0 means the lock is free */
    void acquire(struct lock *lock) { while (test_and_set(&lock->held)) ; /* spin */ }
    void release(struct lock *lock) { lock->held = 0; }

• When will the while return? What is the value of held?
• Does it work? What about multiprocessors?
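As a usage example, here is a compilable sketch of this spinlock protecting a shared counter. It assumes GCC/Clang's __sync_lock_test_and_set builtin as the TAS primitive; the thread count, iteration count, and names such as worker are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    struct lock { volatile int held; };
    struct lock l = { 0 };
    long counter = 0;

    void acquire(struct lock *lk) { while (__sync_lock_test_and_set(&lk->held, 1)) ; }
    void release(struct lock *lk) { __sync_lock_release(&lk->held); }  /* sets held back to 0 */

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            acquire(&l);     /* the while() returns once TAS sees held == 0 */
            counter++;       /* critical section */
            release(&l);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* 2000000 if the lock works */
        return 0;
    }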

TAS and cache coherence (step 1)
[Diagram: Thread A executes acquire(lock); its TAS issues a Read-Exclusive request for the cache line holding lock->held, which currently exists only in shared memory with lock->held = 0]

TAS and cache coherence (step 2)
[Diagram: the line is filled into Thread A's cache in the Dirty state with lock->held = 1 (written by the TAS); shared memory still holds the stale lock->held = 0]

TAS and cache coherence (step 3)
[Diagram: Thread B now executes acquire(lock); its TAS issues a Read-Exclusive request, which sends an invalidation to Thread A's Dirty copy (lock->held = 1)]

TAS and cache coherence (step 4)
[Diagram: Thread A's copy becomes Invalid; its dirty value is written back, updating shared memory to lock->held = 1]

TAS and cache coherence (step 5)
[Diagram: the line is filled into Thread B's cache in the Dirty state with lock->held = 1; Thread A's copy remains Invalid]

What if there are contentions?
[Diagram: Thread A and Thread B both spin in while(TAS(lock)) ; with shared memory holding lock->held = 1; because every TAS is a write, the lock's cache line keeps bouncing between the two processors' caches]

How bad can it be?
[Chart: latency of TAS vs. a plain Store under contention]
• Recall: TAS essentially is a Store + Memory Barrier
• Takeaway: heavy lock contention may lead to worse performance than serializing the execution!

How to optimize?
• When the lock is being held, a contending "acquire" keeps modifying the lock variable to 1
  • Not necessary! Test the lock with a plain read before attempting the TAS (test-and-test-and-set, see the sketch below):

    void acquire(struct lock *lock) {
        do {
            while (lock->held == 1)
                ;                              /* spin on a plain read */
        } while (test_and_set(&lock->held));   /* TAS only when the lock looks free */
    }
    void release(struct lock *lock) { lock->held = 0; }
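For completeness, a sketch of the same test-and-test-and-set lock written with C11 atomics; the memory-order choices follow common practice and are an assumption, not something the slide specifies.

    #include <stdatomic.h>

    typedef struct { atomic_int held; } tatas_lock;

    void tatas_acquire(tatas_lock *lk) {
        do {
            /* Spin on a plain read: the line stays in the Shared state, no coherence traffic. */
            while (atomic_load_explicit(&lk->held, memory_order_relaxed) == 1)
                ;
        } while (atomic_exchange_explicit(&lk->held, 1, memory_order_acquire));
        /* Only the final attempt performs the invalidating atomic write. */
    }

    void tatas_release(tatas_lock *lk) {
        atomic_store_explicit(&lk->held, 0, memory_order_release);
    }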

What if there are contentions? (test-and-test-and-set, step 1)
[Diagram: Thread A holds the lock, its cache has the line in the Dirty state with lock->held = 1; Thread B spins on the plain read while(lock->held == 1) and issues a Read request; shared memory still holds the stale lock->held = 0]

What if there are contentions? (test-and-test-and-set, step 2)
[Diagram: Thread B's read downgrades Thread A's copy to the Shared state and updates shared memory to lock->held = 1; Thread B's cache receives the line in the Shared state with lock->held = 1]

What if there are contentions? (test-and-test-and-set, step 3)
[Diagram: both caches now hold the line in the Shared state with lock->held = 1]
• Repeated reads to a "Shared" cache line: no cache coherence traffic!

Let’s put everything together
[Chart: latency of TAS, Load, and Write operations compared with local access]

Implications to programmers
• Cache coherence is expensive (more than you thought)
  • Avoid unnecessary sharing (e.g., false sharing)
  • Avoid unnecessary coherence (e.g., TAS -> TATAS)
• Clear understanding of the performance
  • Crossing sockets is a killer
    • Can be slower than running the same program on a single core!
  • pthread provides a CPU affinity mask (see the sketch below)
    • Pin cooperative threads on cores within the same die
  • Loads and stores can be as expensive as atomic operations
• Programming gurus understand the hardware
  • So do you now!
  • Have fun hacking!
More details in "Everything you always wanted to know about synchronization but were afraid to ask", David et al., SOSP'13
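Two small sketches tied to the bullets above, both illustrative rather than from the lecture: pinning the calling thread to a core with pthread's Linux-specific affinity interface, and padding per-thread counters so neighbours do not share a cache line. The 64-byte line size and all names are assumptions.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to one core (Linux-specific, non-portable). */
    int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);                  /* allow only this core */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* Avoid false sharing: give each per-thread counter its own cache line
     * (64 bytes is an assumed line size). */
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];
    };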