A Methodology for Creating Fast Wait-Free Data Structures
Alex Kogan and Erez Petrank
Computer Science, Technion, Israel

Concurrency & (Non-blocking) synchronization
• Concurrent data structures require (fast and scalable) synchronization
• Non-blocking synchronization:
  - no thread is blocked waiting for another thread to complete
  - no locks / critical sections

Lock-free (LF) algorithms
Among all threads trying to apply operations on the data structure, one will succeed
• Opportunistic approach
  - read some part of the data structure
  - make an attempt to apply an operation
  - when failed, retry
• Many scalable and efficient algorithms
• Global progress is guaranteed
• All but one thread may starve
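The opportunistic read / attempt / retry pattern above can be sketched with a compare-and-set loop. This is a minimal illustration on a shared counter, not the queue from the talk; the class name `LockFreeCounter` and the helper `runDemo` are made up for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LockFreeCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Lock-free increment: some thread always succeeds, but a given
    // thread may lose the CAS race an unbounded number of times.
    int increment() {
        while (true) {
            int seen = value.get();                    // read part of the structure
            if (value.compareAndSet(seen, seen + 1)) { // attempt to apply the operation
                return seen + 1;
            }
            // CAS failed because another thread got in first: retry
        }
    }

    int get() {
        return value.get();
    }

    // Hypothetical helper: runs numThreads workers and returns the final count.
    static int runDemo(int numThreads, int opsPerThread) {
        LockFreeCounter counter = new LockFreeCounter();
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < opsPerThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1000));
    }
}
```

No increment is lost, so the final count equals the number of threads times the operations per thread. What the loop does not give is a per-thread bound on the number of retries, which is exactly the gap wait-freedom closes.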

Wait-free (WF) algorithms
• A thread completes its operation in a bounded number of steps
  - regardless of what other threads are doing
• Particularly important property in several domains
  - e.g., real-time systems and operating systems
• Commonly regarded as too inefficient and too complicated to design

The overhead of wait-freedom
• Much of the overhead is because of helping
  - key mechanism employed by most WF algorithms
  - controls the way threads help each other with their operations
Can we eliminate the overhead?
• The goal: average-case efficiency of lock-freedom and worst-case bound of wait-freedom

Why is helping slow?
• A thread helps others immediately when it starts its operation
• All threads help others in exactly the same order
  - contention
  - redundant work
• Each operation has to be applied exactly once
  - usually results in a higher number of expensive atomic operations

                    Lock-free MS-queue (PODC, 1996)    Wait-free KP-queue (PPoPP, 2011)
# CASs in enqueue   2                                  3
# CASs in dequeue   1                                  4

Reducing the overhead of helping
Main observation:
• “Bad” cases happen, but are very rare
• Typically a thread can complete without any help
  - if only it had a chance to do that...
Main ideas:
• Ask for help only when you really need it
  - i.e., after trying several times to apply the operation
• Help others only after giving them a chance to proceed on their own
  - delayed helping

Fast-path-slow-path methodology
• Start the operation by running its (customized) lock-free implementation (fast path)
• Upon several failures, switch into a (customized) wait-free implementation (slow path)
  - notify others that you need help
  - keep trying
• Once in a while, threads on the fast path check if their help is needed and provide help (delayed helping)

Fast-path-slow-path generic scheme
(flowchart)
Start -> Do I need to help?
  - yes: help someone, then continue
  - no: continue
Apply my op using fast path (at most N times) -> Success?
  - yes: return
  - no: apply my op using slow path (until success), then return
Different threads may run on two paths concurrently!
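The control flow of the scheme can be sketched as a small driver. This is only an illustration of the flowchart: the `Operation` interface, its three hooks and the toy `failingFirst` operation are hypothetical names, and the hooks stand in for the customized lock-free and wait-free algorithms, which are not reproduced here.

```java
final class FastPathSlowPath {
    // Hypothetical hooks standing in for the real customized algorithms.
    interface Operation {
        void helpIfNeeded();    // delayed-helping check ("Do I need to help?")
        boolean tryFastPath();  // one attempt of the lock-free algorithm
        void runSlowPath();     // wait-free algorithm; returns only on success
    }

    // Runs one operation and reports which path completed it.
    static String apply(Operation op, int maxFailures) {
        op.helpIfNeeded();
        int trials = 0;
        while (trials++ < maxFailures) {   // at most maxFailures fast-path attempts
            if (op.tryFastPath()) {
                return "fast";
            }
        }
        op.runSlowPath();                  // ask for help and keep trying
        return "slow";
    }

    // Toy operation whose fast path fails a fixed number of times, then succeeds.
    static Operation failingFirst(int failures) {
        return new Operation() {
            private int remaining = failures;
            public void helpIfNeeded() { }
            public boolean tryFastPath() { return remaining-- <= 0; }
            public void runSlowPath() { }
        };
    }

    public static void main(String[] args) {
        System.out.println(apply(failingFirst(2), 3));
        System.out.println(apply(failingFirst(5), 3));
    }
}
```

With a limit of 3 attempts, an operation that fails twice finishes on the fast path, while one that fails five times falls through to the slow path. A limit of 0 sends every operation straight to the slow path.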

Fast-path-slow-path: queue example
• Fast path: MS-queue
• Slow path: KP-queue

Fast-path-slow-path: queue example
Internal structures: state (one entry per thread)

Thread ID   0   1   2
phase       9   4   9

Each entry also holds the fields pending, enqueue and node (the slide shows the sample values true, false and null; the full cell layout did not survive extraction).
• phase: counts the number of ops on the slow path
• pending: is there a pending operation on the slow path?
• enqueue, node: what is the pending operation?
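One way to picture the state array is as an array of immutable per-thread descriptors that are swapped atomically. The sketch below assumes that layout; the names `StateArray`, `OpDesc`, `announce`, `complete` and `isPending` are illustrative, and only the four field names come from the slide.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

final class StateArray {
    // Immutable descriptor for one thread; field names follow the slide.
    static final class OpDesc {
        final long phase;       // counts this thread's ops on the slow path
        final boolean pending;  // is there a pending operation on the slow path?
        final boolean enqueue;  // what is the pending operation? (true = enqueue)
        final Object node;      // node involved in the operation, or null

        OpDesc(long phase, boolean pending, boolean enqueue, Object node) {
            this.phase = phase;
            this.pending = pending;
            this.enqueue = enqueue;
            this.node = node;
        }
    }

    private final AtomicReferenceArray<OpDesc> state;

    StateArray(int numThreads) {
        state = new AtomicReferenceArray<>(numThreads);
        for (int i = 0; i < numThreads; i++) {
            state.set(i, new OpDesc(0, false, true, null));
        }
    }

    // Called by the owner thread: bump my phase and announce my operation.
    long announce(int tid, boolean enqueue, Object node) {
        long phase = state.get(tid).phase + 1;
        state.set(tid, new OpDesc(phase, true, enqueue, node));
        return phase;
    }

    // Called by the owner or a helper: clear the pending flag exactly once.
    boolean complete(int tid, long phase) {
        OpDesc cur = state.get(tid);
        if (!cur.pending || cur.phase != phase) {
            return false;
        }
        return state.compareAndSet(tid, cur,
                new OpDesc(phase, false, cur.enqueue, cur.node));
    }

    boolean isPending(int tid, long phase) {
        OpDesc d = state.get(tid);
        return d.pending && d.phase == phase;
    }
}
```

Replacing the whole descriptor with a single CAS means a helper that arrives late sees either the old pending entry or the completed one, never a mix of fields.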

Fast-path-slow-path: queue example
Internal structures: helpRecords (one record per thread)

Thread ID   0   1   2
curTid      1   0   0
lastPhase   4   5   9
nextCheck   3   8   0

• curTid: ID of the next thread that I will try to help
• lastPhase: phase number of that thread at the time the record was created
• nextCheck: decrements with every operation of mine; check if my help is needed when this counter reaches 0
• HELPING_DELAY controls the frequency of helping checks
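The bookkeeping in a help record can be sketched as follows. This is a single-threaded illustration under assumptions: the plain `phases` and `pending` arrays stand in for the shared state array, the round-robin choice of the next thread is assumed rather than stated on the slide, and the method names and the `-1` return convention are made up for this example.

```java
final class HelpRecord {
    int curTid;      // ID of the next thread that I will try to help
    long lastPhase;  // phase of that thread when this record was created
    int nextCheck;   // decremented on every operation of mine; check at 0

    private final int numThreads;
    private final int helpingDelay;  // plays the role of HELPING_DELAY

    HelpRecord(int myTid, int numThreads, int helpingDelay, long[] phases) {
        this.numThreads = numThreads;
        this.helpingDelay = helpingDelay;
        this.curTid = myTid;
        reset(phases);
    }

    // Move on to the next thread (assumed round-robin) and snapshot its phase.
    void reset(long[] phases) {
        curTid = (curTid + 1) % numThreads;
        lastPhase = phases[curTid];
        nextCheck = helpingDelay;
    }

    // Called at the start of every operation.
    // Returns the ID of a thread that needs help, or -1 if none.
    int helpIfNeeded(long[] phases, boolean[] pending) {
        if (nextCheck-- > 0) {
            return -1;                 // not time to check yet
        }
        int candidate = curTid;
        // Help only if the same operation is still pending since the snapshot,
        // i.e. the thread had a chance to finish on its own and did not.
        boolean needsHelp = pending[candidate] && phases[candidate] == lastPhase;
        reset(phases);
        return needsHelp ? candidate : -1;
    }
}
```

With a delay of 3, the first three operations skip the check and the fourth one inspects the recorded thread. A larger delay makes checks rarer, which is cheaper on average but lets a stuck thread wait longer for help.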

Fast-path-slow-path: queue example
Fast path
1. help_if_needed()
2. int trials = 0
   while (trials++ < MAX_FAILURES) {
       apply_op_with_customized_LF_alg   (finish if succeeded)
   }
3. switch to slow path
MAX_FAILURES controls the number of trials on the fast path.
• LF algorithm customization is required to synchronize operations run on two paths

Fast-path-slow-path: queue example
Slow path
1. my phase ++
2. announce my operation (in state)
3. apply_op_with_customized_WF_alg (until finished)
• WF algorithm customization is required to synchronize operations run on two paths

Performance evaluation
• 32-core Ubuntu server with OpenJDK 1.6
  - 8 quad-core AMD 8356 processors at 2.3 GHz
• The queue is initially empty
• Each thread iteratively performs (100k times):
  - Enqueue-Dequeue benchmark: enqueue and then dequeue
• Measure completion time as a function of the number of threads

Performance evaluation
[Figure: completion time (sec) vs. number of threads (1 to 61). Series: MS-queue, KP-queue.]

Performance evaluation
[Figure: completion time (sec) vs. number of threads (1 to 61). Series: MS-queue, KP-queue, fast WF (0, 0), fast WF (100, 100). The pair in parentheses is (MAX_FAILURES, HELPING_DELAY).]

Performance evaluation
[Figure: completion time (sec) vs. number of threads (1 to 61). Series: MS-queue, KP-queue, fast WF (0, 0), fast WF (3, 3), fast WF (10, 10), fast WF (20, 20), fast WF (100, 100).]

The impact of configuration parameters
[Figure: completion time (sec) vs. number of threads (1 to 61), with annotations for MAX_FAILURES and HELPING_DELAY. Recoverable series: fast WF (0, 0), fast WF (10, 10).]

The use of the slow path
[Figure: two panels, enqueue and dequeue. Each plots % ops on slow path (0 to 100) vs. number of threads (1 to 61), with annotations for HELPING_DELAY and MAX_FAILURES.]

Tuning performance parameters
• Why not just always use large values for both parameters (MAX_FAILURES, HELPING_DELAY)?
  - (almost) always eliminate slow path
• Lemma: The number of steps required for a thread to complete an operation on the queue in the worst case is O(MAX_FAILURES + HELPING_DELAY * n^2)
→ Tradeoff between average-case performance and worst-case completion time bound
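To get a feel for the tradeoff in the lemma, the expression inside the O() can be evaluated for two of the configurations used in the evaluation. This assumes n is the number of threads and picks n = 64 purely for illustration; the constant hidden by the O() is ignored, so the numbers show relative growth only, not actual step counts.

```latex
\begin{align*}
T_{\text{worst}} &= O\!\left(\text{MAX\_FAILURES} + \text{HELPING\_DELAY}\cdot n^{2}\right) \\
(10, 10),\; n = 64:\quad & 10 + 10 \cdot 64^{2} = 10 + 40\,960 = 40\,970 \\
(100, 100),\; n = 64:\quad & 100 + 100 \cdot 64^{2} = 100 + 409\,600 = 409\,700
\end{align*}
```

Raising both parameters tenfold raises the worst-case expression roughly tenfold, and the HELPING_DELAY term dominates because it is multiplied by n^2.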

Summary
• A novel methodology for creating fast wait-free data structures
  - key ideas: two execution paths + delayed helping
  - good performance when the fast path is extensively utilized
  - concurrent operations can proceed on both paths in parallel
• Can be used in other scenarios
  - e.g., running real-time and non-real-time threads side-by-side

Thank you! Questions?