Everything you still don’t know about synchronization and how it scales
Anshu Raina, Suhas Pai, Mohit Verma, Vikas Goel, Yuvraj Patel
Motivation – Why do we need synchronization?
• Many-core architectures are a common phenomenon
• Challenging to scale systems
• Synchronization
  • Crucial to ensure coordination and correctness
  • A hindrance to scalability
• Synchronization is ensured using primitives
  • They rely on hardware artifacts – sometimes the gory details of the h/w are not known
• Hard to predict whether applications will scale w.r.t. a specific synchronization scheme
What are we trying to study?
• Synchronization under scaling
  • How do various hardware artifacts scale?
  • How do the higher-level synchronization primitives scale?
  • Does the hardware architecture impact performance?
  • What overheads pop up while scaling?
How do you synchronize?
• Basic hardware artifacts
  • CAS, TAS, FAI, and other atomic instructions
• Mutexes, semaphores, spin locks, condition variables, barriers
  • Each serves a different purpose
  • A structure is shared by all threads using the primitive
  • The primitives use the above hardware artifacts to update the shared structure atomically
Synchronization Primitives 101
• Basic hardware artifacts
  • CAS – uses lock cmpxchg
  • TAS – uses xchg
  • FAI – uses lock xadd
• Higher-level synchronization primitives
  • Mutex – ensures mutual exclusion; ownership crucial – lock/unlock
  • Semaphore – signaling mechanism; ownership not important – wait/post(signal)
  • Spinlock – locking mechanism, generally used for smaller critical sections
  • Futex
    • Used for performance – avoids syscalls to acquire locks
    • Syscall done only under contention
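To make the mapping concrete, here is a minimal C11 sketch (ours, not from the talk) of the three artifacts; on x86-64, compilers typically lower these builtins to the lock cmpxchg, xchg, and lock xadd instructions listed above.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* CAS: set *p to desired only if it still holds expected.
       Typically compiles to `lock cmpxchg` on x86-64. */
    static bool cas(atomic_int *p, int expected, int desired) {
        return atomic_compare_exchange_strong(p, &expected, desired);
    }

    /* TAS: store 1, return the previous value.
       Typically compiles to `xchg` on x86-64. */
    static int tas(atomic_int *p) {
        return atomic_exchange(p, 1);
    }

    /* FAI: add 1, return the previous value.
       Typically compiles to `lock xadd` on x86-64. */
    static int fai(atomic_int *p) {
        return atomic_fetch_add(p, 1);
    }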
Experiments
• Parameters
  • Different configurations (intra-socket, inter-socket, hyperthreading)
  • Thread scaling (1, 2, 4, 8, 14, 28, 56)
    • 28 and 56 not run for intra-socket (only 14 physical cores per socket)
    • 56 not run for inter-socket without hyperthreading
  • Varying critical section (CS) size
    • Pseudo-code for the CS (count is volatile; a C sketch follows below): FOR (0 … LOOP_COUNT) { count := count + 1; }
    • Experiments run for LOOP_COUNT = 100, 1000, 10000
• Layered study
  • Basic hardware artifacts – CAS, FAI, TAS
  • Higher-level synchronization primitives (musl library) – mutex, semaphore, spinlocks
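A minimal sketch of the per-thread benchmark body implied by the pseudo-code, here shown with a pthread mutex as the lock under test; the worker name and harness structure are ours, not the authors' actual code.

    #include <pthread.h>

    #define LOOP_COUNT 100              /* also run with 1000 and 10000 */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile long count;         /* volatile keeps the increments in the CS */

    /* Each thread acquires the lock once and runs the whole loop as its
       critical section; LOOP_COUNT controls the CS size. */
    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        for (long i = 0; i < LOOP_COUNT; i++)
            count = count + 1;
        pthread_mutex_unlock(&lock);
        return NULL;
    }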
Platform/Architecture
• Intel Xeon E5 v3 (Haswell-EP)
  • 2 sockets, 14 active physical cores per socket (possibly using an 18-core die)
  • Hyperthreaded, 2 threads per core
  • 8 cores, 8 L3 slices, and 1 memory controller are connected to one bi-directional ring; the remainder are connected to another bi-directional ring
  • The ring topology is hidden from the OS in the default configuration
  • COD splits the processor into 2 clusters, so the topology now has 4 NUMA nodes (but we see only 2 NUMA nodes; enabling COD also doesn't show 4 NUMA nodes)
• Cache coherence mechanisms
  • MESIF implementation
  • Implemented by Caching Agents (CAs) within the L3 slices and Home Agents (HAs) within the memory controllers
  • Modes
    • Source mode (enabled by default)
    • Home mode
    • Cluster-on-Die (COD) mode
How Do Atomic Instructions Scale?
• How do atomic instructions scale with varying contention?
• Does placement of threads affect scaling?
  • Single socket – with or without hyperthreading
  • Two sockets – with or without hyperthreading
• How do different atomic instructions vary in latency?
  • Locks are implemented using these atomics
  • Spin locks, mutexes, and semaphores use CAS
• Does the coherence state of the cache line affect the latency of operations?
Atomics – latency trends with increasing threads [chart]
Effect of thread placement on latencies [chart]
Effect of thread placement on a single CAS [chart]
Insights
• Latencies of all instructions increase linearly with increasing contention
• Threads placed on hyperthreaded cores provide improved performance
• The effects of hyperthreading are more pronounced when threads are on different sockets
• CAS latency can be very large if threads are placed across sockets (2× more!)
  • Significant because CAS is used widely in the implementation of locks (spin locks, mutexes) – more in subsequent slides!
Spinlocks, Mutexes, and Binary Semaphores
• What should I use if my critical section is small?
• Does the number of threads in my application matter?
• Does thread placement matter?
• What is the worst and best performance I can get?
Binary Semaphore/Mutex Behavior as Critical Section Size Changes
• Spinlocks are usually used when the critical section is small
• What about binary semaphores and mutexes?
  • See for yourself!
[Chart slides: NHT2s_100, NHT2s_10000, and NHT2s_1000 – binary semaphore/mutex behavior as the critical section size changes]
General Behavior as Critical Section Size Changes
• We looked at what happens when 14 threads contend at once
• When the CS is large, every thread makes the syscall once
  • How are threads woken up? FCFS!
• When the CS is small, there is no contention
  • The CS is over before the other threads are even scheduled!
• When the CS size is intermediate, some threads make the syscall more than once
  • Since the CS is not big enough, some threads that weren't scheduled yet start contending with the thread just woken up
  • FCFS wakeup does not imply FCFS entry into the CS
Spinlock Scaling as the Number of Threads Varies
• Spinlocks are mostly used with small CS sizes
  • How does their performance vary with the number of threads?
• They do not scale well, even if the CS is small
  • Actually worse than mutexes and binary semaphores
  • Why? Back-off in mutexes and semaphores – more later! (A minimal spinlock sketch follows below.)
[Charts: max spinlock latency vs. number of threads (1–56) for critical sections of 100 and 1000 loop count, across the NHT1s, NHT2s, and HT configurations]
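For reference, a naive test-and-set spinlock sketch (a simplification; musl's actual pthread_spin_lock differs in detail): every failed iteration is an atomic write, so contending cores keep bouncing the cache line, and there is no back-off of the kind the mutex and semaphore get from their futex path.

    #include <stdatomic.h>
    #include <immintrin.h>              /* _mm_pause, x86-specific */

    typedef struct { atomic_int locked; } spinlock_t;

    /* Naive TAS spinlock: each spin is an atomic exchange, i.e. a write
       that invalidates the line in every other contender's cache. */
    static void spin_lock(spinlock_t *l) {
        while (atomic_exchange(&l->locked, 1))
            _mm_pause();                /* CPU hint; does not curb coherence traffic */
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_store(&l->locked, 0);
    }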
How does mutex/semaphore scale with the number of threads? [charts]
Why Don't They Scale Well? What's Going On Inside? (Mutex)
1. Try CAS to get the lock
2. Fail? Try CAS again to get the lock
3. Fail? Spin for some time if there are other waiters
4. Try CAS on the lock again
5. Fail?
   1. Register yourself as a waiter
   2. Syscall to futex
(A simplified sketch of this path follows below.)
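A simplified futex-based sketch of this staged path. The futex syscall, FUTEX_WAIT, and FUTEX_WAKE are real Linux interfaces; the spin budget and the 0/1/2 state encoding are our assumptions, and musl's real mutex additionally tracks ownership.

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_TRIES 100              /* assumed spin budget */

    static atomic_int lock_word;        /* 0 = free, 1 = held, 2 = held with waiters */

    static int try_cas(int expected, int desired) {
        return atomic_compare_exchange_strong(&lock_word, &expected, desired);
    }

    static void mutex_lock(void) {
        if (try_cas(0, 1)) return;                    /* 1st CAS */
        if (try_cas(0, 1)) return;                    /* 2nd CAS */
        for (int i = 0; i < SPIN_TRIES; i++)          /* spin, then more CAS tries */
            if (try_cas(0, 1)) return;
        /* Register as a waiter and sleep until the holder wakes us. */
        while (atomic_exchange(&lock_word, 2) != 0)
            syscall(SYS_futex, &lock_word, FUTEX_WAIT, 2, NULL, NULL, 0);
    }

    static void mutex_unlock(void) {
        if (atomic_exchange(&lock_word, 0) == 2)      /* waiters present: wake one */
            syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }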
Why Don't They Scale Well? What's Going On Inside? (Semaphore)
1. Check the semaphore to see if you can enter the critical section
2. Fail? Try CAS in a loop until you successfully update the semaphore
3. Fail? Spin for some time if there are other waiters
4. Try CAS to update the semaphore again
5. Fail?
   • Register yourself as a waiter
   • CAS to update the semaphore
   • Syscall to futex
(A simplified sketch of this path follows below.)
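And the matching simplified sketch for the semaphore wait path (again a simplification of musl's sem_wait: the separate waiter count and cancellation handling are omitted).

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static atomic_int sem_val;          /* 1/0 for a binary semaphore */

    static void sem_wait_sketch(void) {
        for (;;) {
            int v = atomic_load(&sem_val);
            while (v > 0)               /* CAS loop to claim a unit */
                if (atomic_compare_exchange_weak(&sem_val, &v, v - 1))
                    return;
            /* Value is 0: block until a post wakes us, then retry. */
            syscall(SYS_futex, &sem_val, FUTEX_WAIT, 0, NULL, NULL, 0);
        }
    }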
Why Don't They Scale Well? What's Going On Inside?
[Charts: mutex lock overhead breakup for 14 threads – intra-socket no HT, intra-socket HT, and inter-socket no HT; y-axis latency (ns), x-axis thread ID; stacked components: 1st CAS, 2nd CAS, trylock (3rd CAS), spin time, while loop trying trylock]
How does the behavior change as thread placement varies? (For 14 threads)
Mutex: % of threads completing at each stage (the semaphore has the same behavior)

Config       | 1st CAS | 2nd CAS | 3rd CAS | while-loop trylock | Syscall | Spins not completed | No. of syscalls
NHT1s_1000   | 21.43   | 0       | 0       | 0                  | 78.57   | 1                   | 7-1/4-2
NHT2s_1000   | 14.29   | 0       | 0       | 7.14               | 78.57   | 0                   | 4-1/6-2/1-3
HT_1000      | 21.43   | 0       | 0       | 14.29              | 64.29   | 3                   | 6-1/3-2
NHT1s_100    | 85.71   | 0       | 0       | 14.29              | 0       | 2                   | 0
NHT2s_100    | 42.86   | 0       | 14.29   | 42.86              | 0       | 2                   | 0
HT_100       | 78.57   | 7.14    | 0       | 14.29              | 0       | 0                   | 0

• It might make sense not to spin in mutex/semaphore when the critical section is large
  • Most threads block in the syscall even after spinning
How does the behavior change as thread placement varies?
Spinlock: spin count variation with thread placement

Config       | CS size | Max spin count
NHT1s        | 1000    | 3719
NHT2s        | 1000    | 4275
HT1s         | 1000    | 3242
NHT1s        | 100     | 50
NHT2s        | 100     | 151
HT           | 100     | 28
Variation of max and min overheads as the number of threads varies
[Charts: semaphore wait latency for critical section size 1000]
Variation of Max and Min Overheads as the Number of Threads Varies
• For 14 threads, semaphore wait latency across sockets is worse
• For smaller numbers of threads, inter-socket wait latency can be smaller
  • The timeline shows threads across sockets are scheduled late, resulting in less contention
• If threads are on hyperthreaded cores, wait latency can be smaller
  • Compared to the wait latency of threads on non-hyperthreaded cores
• The behavior of the mutex is similar to that of the semaphore
How do mutexes, spinlocks, and binary semaphores compare with each other?
• Mutexes and binary semaphores have similar latency
[Chart: critical section – 100 LOOP_COUNT]
How do mutexes, spinlocks, and binary semaphores compare with each other?
• Do not use spin locks if there are lots of threads, even for a small CS
[Chart: critical section – 100 LOOP_COUNT]
What about semaphore post & mutex unlock?
• The post/unlock latency increases linearly with scale (a sketch of the post path follows below)
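The post side, in the same simplified futex style as the wait sketch earlier (sem_val is the same word as in that sketch; whether wake cost is what drives the linear growth is our conjecture, not something the slides break down).

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static atomic_int sem_val;          /* same semaphore word as in the wait sketch */

    /* Make a unit available, then wake one sleeping waiter. */
    static void sem_post_sketch(void) {
        atomic_fetch_add(&sem_val, 1);
        syscall(SYS_futex, &sem_val, FUTEX_WAKE, 1, NULL, NULL, 0);
    }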
Other Observations
• Locks are handed to threads in a clustered fashion
  • In inter-socket experiments, the lock is acquired consecutively by threads belonging to the same socket (3–4 threads in one go)
Conclusion
• Synchronization is hard
• Basic hardware artifacts are closely tied to the software synchronization primitives
• Inter-socket performance is usually worse than same-socket and hyperthreaded performance
• To get the best performance from software, you should know everything about the architecture
  • But if you have a Haswell-EP machine, use our slides!
Backup
Variation of max and min overheads as thread placement varies
• The worst-case semaphore wait latency is highest inter-socket
[Chart]
How does the behavior change as thread placement varies?

Mutex: % of threads completing at each stage
Config       | 1st CAS | 2nd CAS | 3rd CAS | while trylock | Syscall | Spins not completed | Num syscalls
NHT1s_1000   | 21.43   | 0       | 0       | 0             | 78.57   | 1                   | 7-1/4-2
NHT2s_1000   | 14.29   | 0       | 0       | 7.14          | 78.57   | 0                   | 4-1/6-2/1-3
HT_1000      | 21.43   | 0       | 0       | 14.29         | 64.29   | 3                   | 6-1/3-2
NHT1s_100    | 85.71   | 0       | 0       | 14.29         | 0       | 2                   | 0
NHT2s_100    | 42.86   | 0       | 14.29   | 42.86         | 0       | 2                   | 0
HT_100       | 78.57   | 7.14    | 0       | 14.29         | 0       | 0                   | 0

Semaphore
Config       | 1st CAS | TRY    | Spins not completed | Num syscalls
NHT1s_1000   | 21.43   | 78.57  | 1                   | 8-1/3-2
NHT2s_1000   | 14.29   | 85.71  | 0                   | 9-1/3-2
HT_1000      | 14.29   | 85.71  | 0                   | 9-1/2-2
NHT1s_100    | 85.71   | 14.29  | 0                   | 0
NHT2s_100    | 42.86   | 57.14  | 0                   | 0
HT_100       | 85.71   | 14.29  | 0                   | 0

Spinlock: max spin count
Config       | Max spin count
NHT1s_1000   | 3719
NHT2s_1000   | 4275
HT_1000      | 3242
NHT1s_100    | 50
NHT2s_100    | 151
HT_100       | 28

Does it make sense to spin in mutex/semaphore?