Kendo Efficient Deterministic Multithreading in Software Marek Olszewski

  • Slides: 64
Download presentation
Kendo: Efficient Deterministic Multithreading in Software Marek Olszewski Jason Ansel Saman Amarasinghe Commit Group

Kendo: Efficient Deterministic Multithreading in Software Marek Olszewski Jason Ansel Saman Amarasinghe Commit Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; #pragma omp

Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; #pragma omp parallel for reduction(+: sum) for (int i = 1; i < 10000000; i++) inv_sum += 1. 0 / i; printf(“inv_sum: %. 64 gn”, inv_sum);

Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; #pragma omp

Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; #pragma omp parallel for reduction(+: sum) for (int i = 1; i < 10000000; i++) inv_sum += 1. 0 / i; printf(“inv_sum: %. 64 gn”, inv_sum);

Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; Threads Critical

Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; Threads Critical section #pragma omp parallel for reduction(+: sum) for (int i = 1; i < 10000000; i++) inv_sum += 1. 0 / i; printf(“inv_sum: %. 64 gn”, inv_sum); Run 1: inv_sum: 16. 69531126586006308798459940589964389801025390625 Run 2: inv_sum: 16. 695311265860066640698278206400573253631591796875 Re n io cdtuc u tio d n Re

Another Example Threads Global State Critical section data Non-commutative updates data Global data structure

Another Example Threads Global State Critical section data Non-commutative updates data Global data structure Lock Locks Threads perform repeated well-synchronized updates to global state � Common parallel programming paradigm: ◦ Radiosity (Singh et al. 1994) ◦ Locus. Route (Rose 1988) ◦ Delaunay Triangulation (Kulkarni et al. 2008)

Another Example Threads Global State Critical section data Non-commutative updates data Global data structure

Another Example Threads Global State Critical section data Non-commutative updates data Global data structure Lock Locks � Non-deterministic � Difficult internal states and output to eliminate using today’s programming idioms

Non-Determinism � Hard to create programs with repeatable results ◦ Determinism is often part

Non-Determinism � Hard to create programs with repeatable results ◦ Determinism is often part of program specifications, eg: �Don’t want a verilog compiler to generate different circuits every time �Multi-threaded replicas in fault-tolerant systems must be deterministic ? ? ?

Non-Determinism � Hard to create programs with repeatable results ◦ Determinism is often part

Non-Determinism � Hard to create programs with repeatable results ◦ Determinism is often part of program specifications, eg: �Don’t want a verilog compiler to generate different circuits every time �Multi-threaded replicas in fault-tolerant systems must be deterministic � Debugging becomes more difficult ◦ Heisenbugs ◦ Difficult to perform cyclic debugging �Common debugging method used for sequential programs � Testing ? ? ? offers weak guarantees ◦ Will the code pass the test again? ? ? ?

Deterministic Execution Model � Non-determinism causes many problems ◦ Why do we put up

Deterministic Execution Model � Non-determinism causes many problems ◦ Why do we put up with it? � Present parallel programmer with deterministic execution model ◦ Interleave critical sections deterministically ◦ Only allow one interleaving �Find a good interleaving that preserves the parallel performance

Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A

Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A

Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A

Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A

Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A

Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A

Token Algorithm Thread 2 Thread Progress Thread 1

Token Algorithm Thread 2 Thread Progress Thread 1

Token Algorithm Thread 2 det_lock(A) Thread Progress Thread 1

Token Algorithm Thread 2 det_lock(A) Thread Progress Thread 1

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token

Token Algorithm Thread 1 Thread 2 det_lock(A) det_unlock(A) wait_for_token() lock(A) pass_token() Guarantees that thread

Token Algorithm Thread 1 Thread 2 det_lock(A) det_unlock(A) wait_for_token() lock(A) pass_token() Guarantees that thread 1 will always acquire lock before thread 2 Thread Progress Token

Token Algorithm Thread 1 Thread 2 Thread Progress Token � Load imbalance! � Allow

Token Algorithm Thread 1 Thread 2 Thread Progress Token � Load imbalance! � Allow threads to pass token outside of critical sections � High overhead! � Too much serialization!

What Do We Need? � Method of tracking thread progress ◦ Must be deterministic

What Do We Need? � Method of tracking thread progress ◦ Must be deterministic ◦ Must match true progress of thread in physical time as close as possible ◦ Must be cheap to compute � Ability to pass the token in advance (before it is received) ◦ Decouples threads

Logical Time Algorithm � Each thread keeps a low overhead counter called its “Logical

Logical Time Algorithm � Each thread keeps a low overhead counter called its “Logical Clock” ◦ Incremented often, and in a way that tries to match progress of thread as close as possible ◦ Clocks collectively create a notion of “logical time” �Abstract counterpart to physical time � Threads take turns holding a “virtual token” ◦ Thread’s turn when its clock is a global minimum � Thread passes the “virtual token” by incrementing its clock ◦ Does not have to wait for its turn ◦ Allows threads to execute asynchronously while outside of critical sections

Logical Time Algorithm Physical Time Thread 2 t=3 Threads racing to acquire t=3 Lock

Logical Time Algorithm Physical Time Thread 2 t=3 Threads racing to acquire t=3 Lock A Deterministic Logical Time Thread 1

Logical Time Algorithm Physical Time Thread 2 t=6 t=8 Threads racing to acquire Lock

Logical Time Algorithm Physical Time Thread 2 t=6 t=8 Threads racing to acquire Lock A Deterministic Logical Time Thread 1

Logical Time Algorithm Physical Time det_lock(A) Thread 2 t=8 t=18 Deterministic Logical Time Thread

Logical Time Algorithm Physical Time det_lock(A) Thread 2 t=8 t=18 Deterministic Logical Time Thread 1

Logical Time Algorithm Physical Time det_lock(A) wait_for_turn(); lock(A); Thread 2 t=8 t=18 Deterministic Logical

Logical Time Algorithm Physical Time det_lock(A) wait_for_turn(); lock(A); Thread 2 t=8 t=18 Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) wait_for_turn(); lock(A); t=18 Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) wait_for_turn(); lock(A); t=18 Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 t=20 Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 t=20 Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=22 t=20 Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=22 t=20 Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time det_lock(A) t=24 t=22 wait_for_turn(); lock(A); Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time det_lock(A) t=24 t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time det_lock(A) t=24 t=22 wait_for_turn(); lock(A); Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time det_lock(A) t=24 t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time det_lock(A) det_unlock(A) t=22 t=26 wait_for_turn(); lock(A); Deterministic

Logical Time Algorithm Thread 2 Physical Time det_lock(A) det_unlock(A) t=22 t=26 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=24 t=29 Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=24 t=29 Deterministic Logical Time Thread 1

Logical Time Algorithm Physical Time Thread 2 t=3 Threads racing to acquire t=3 Lock

Logical Time Algorithm Physical Time Thread 2 t=3 Threads racing to acquire t=3 Lock A Deterministic Logical Time Thread 1

Logical Time Algorithm Physical Time Thread 2 t=6 Threads racing to acquire t=8 Lock

Logical Time Algorithm Physical Time Thread 2 t=6 Threads racing to acquire t=8 Lock A Deterministic Logical Time Thread 1

Logical Time Algorithm Physical Time Thread 2 t=8 t=18 Threads racing to acquire Lock

Logical Time Algorithm Physical Time Thread 2 t=8 t=18 Threads racing to acquire Lock A Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn();

Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn();

Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn();

Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=20 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time t=20 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=23 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time t=23 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=23 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical

Logical Time Algorithm Thread 2 Physical Time t=23 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time det_lock(A) det_unlock(A) t=22 t=26 wait_for_turn(); lock(A); Deterministic

Logical Time Algorithm Thread 2 Physical Time det_lock(A) det_unlock(A) t=22 t=26 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1

Logical Time Algorithm Thread 2 Physical Time t=24 t=29 Guarantees that thread 1 will

Logical Time Algorithm Thread 2 Physical Time t=24 t=29 Guarantees that thread 1 will always acquire lock before thread 2 Deterministic Logical Time Thread 1

Are We Done? � Unfortunately, no. .

Are We Done? � Unfortunately, no. .

Nested Locks det_lock(A) Thread 2 t=25 det_lock(B) Lock A Lock B t=27 Deterministic Logical

Nested Locks det_lock(A) Thread 2 t=25 det_lock(B) Lock A Lock B t=27 Deterministic Logical Time Thread 1

Nested Locks Thread 2 wait_for_turn(); lock(B); det_lock(A) t=25 det_lock(B) t=27 wait_for_turn(); lock(A); Cyclic Dependencies

Nested Locks Thread 2 wait_for_turn(); lock(B); det_lock(A) t=25 det_lock(B) t=27 wait_for_turn(); lock(A); Cyclic Dependencies Deadlock! Deterministic Logical Time Thread 1

Preventing the Deadlock � Make threads spin in deterministic logical time � Must do

Preventing the Deadlock � Make threads spin in deterministic logical time � Must do so deterministically ◦ Threads wait for its turn after every increment ◦ Ensure that it performs the same number of increments on every run �Can’t wait for lock to be released in physical time �Wait for lock to be released in deterministic logical time ◦ Releasing thread stores its logical clock in the lock ◦ Spinning thread spins until its logical clock is greater than the time recorded in the lock See paper for more details

Kendo Prototype �A prototype deterministic locking framework ◦ Supports C and C++ code �

Kendo Prototype �A prototype deterministic locking framework ◦ Supports C and C++ code � Implements a subset of the pthreads API � Runs on commodity hardware today! � Uses performance counters to construct logical clocks ◦ Efficient and cheap, but �Not all track physical time �Some are non-deterministic �Not visible to other threads ◦ Processor traps every X amount of performance counter events �Increments thread’s logical clock ◦ Use “retired stores” performance counter event

Evaluation � Methodology ◦ Converted Splash 2 benchmark suite to run use the Kendo

Evaluation � Methodology ◦ Converted Splash 2 benchmark suite to run use the Kendo framework ◦ Eliminated data-races ◦ Checked determinism by examining output and the final deterministic logical clocks of each thread � Experimental Framework ◦ Processor: Intel Xeon 16 -way SMP (4 quad-cores) running at 2. 4 GHz ◦ OS: Linux 2. 6. 25 (modified for performance counter support)

Performance 1. 75 Execution Time (Relative to Non-Deterministic) 1. 50 1. 25 1. 00

Performance 1. 75 Execution Time (Relative to Non-Deterministic) 1. 50 1. 25 1. 00 0. 75 0. 50 0. 25 0. 00 quicksort tsp ocean barnes radiosity raytrace Benchmark 4 Processors fmm volrend water-nsqrd. Geomean

Performance 2. 00 Execution Time (Relative to Non-Deterministic) 1. 75 1. 50 1. 25

Performance 2. 00 Execution Time (Relative to Non-Deterministic) 1. 75 1. 50 1. 25 1. 00 0. 75 0. 50 0. 25 0. 00 quicksort tsp ocean barnes radiosity raytrace Benchmark 4 Processors 8 Processors fmm volrend water-nsqrd. Geomean

Performance 2. 75 Execution Time (Relative to Non-Deterministic) 2. 50 2. 25 2. 00

Performance 2. 75 Execution Time (Relative to Non-Deterministic) 2. 50 2. 25 2. 00 1. 75 1. 50 1. 25 1. 00 0. 75 0. 50 0. 25 0. 00 quicksort tsp ocean barnes radiosity raytrace fmm Benchmark 4 Processors 8 Processors 16 Processors volrend water-nsqrd. Geomean

Overhead Breakdown 4 Processors Execution Time (Relative to Non-Deterministic) 1. 8 1. 6 1.

Overhead Breakdown 4 Processors Execution Time (Relative to Non-Deterministic) 1. 8 1. 6 1. 4 1. 2 1. 0 0. 8 0. 6 0. 4 0. 2 0. 0 quicksort tsp ocean barnes radiosity raytrace fmm volrend Benchmarks Wait Overhead Interrupt Overhead Application Time water-nsqrd Geomean

Effect of Interrupt Frequency 5. 0 Execution Time (Relative to Non-Deterministic) 4. 5 4.

Effect of Interrupt Frequency 5. 0 Execution Time (Relative to Non-Deterministic) 4. 5 4. 0 3. 5 3. 0 2. 5 2. 0 1. 5 1. 0 0. 5 0. 0 64 128 256 512 1 K 2 K 4 K 8 K 16 K Interrupt Period Application Time Interrupt Overhead Deterministic Wait Overhead

Deterministic Multithreading Taxonomy � Weak Determinism ◦ Deterministic interleaving of all lock acquisitions for

Deterministic Multithreading Taxonomy � Weak Determinism ◦ Deterministic interleaving of all lock acquisitions for a given input ◦ Provided by Kendo ◦ Cheap to enforce � Strong Determinism ◦ Deterministic interleaving for all accesses to memory for a given input ◦ Implemented in DMP work by Devietti et. al ◦ Attractive, but difficult to achieve efficiently in software

Weak Determinism Can Be Sufficient � Offers same guarantees as strong determinism for data-race-

Weak Determinism Can Be Sufficient � Offers same guarantees as strong determinism for data-race- free program executions ◦ Checkable with a specialized dynamic race detector! � Provides a systematic way of debugging code ◦ Non-race bugs are always reproducible ◦ First data race detectible via race detector � Like Cilk + “Nondeterminator” race detector combination ◦ But for arbitrary multithreaded code and arbitrary non-commutative critical sections.

Conclusion � An efficient software approach for deterministic multithreading � Kendo: A prototype implementing

Conclusion � An efficient software approach for deterministic multithreading � Kendo: A prototype implementing this approach ◦ Simple yet efficient! ◦ Runs on today’s commodity hardware � Introduced Strong/Weak Determinism ◦ Weak determinism provides a systematic method of debugging multithreaded code ? ? ? ? http: //groups. csail. mit. edu/commit