Kendo Efficient Deterministic Multithreading in Software Marek Olszewski
- Slides: 64
Kendo: Efficient Deterministic Multithreading in Software Marek Olszewski Jason Ansel Saman Amarasinghe Commit Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; #pragma omp parallel for reduction(+: sum) for (int i = 1; i < 10000000; i++) inv_sum += 1. 0 / i; printf(“inv_sum: %. 64 gn”, inv_sum);
Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; #pragma omp parallel for reduction(+: sum) for (int i = 1; i < 10000000; i++) inv_sum += 1. 0 / i; printf(“inv_sum: %. 64 gn”, inv_sum);
Example � Simple Open. MP Parallel Code: double inv_sum = 0. 0; Threads Critical section #pragma omp parallel for reduction(+: sum) for (int i = 1; i < 10000000; i++) inv_sum += 1. 0 / i; printf(“inv_sum: %. 64 gn”, inv_sum); Run 1: inv_sum: 16. 69531126586006308798459940589964389801025390625 Run 2: inv_sum: 16. 695311265860066640698278206400573253631591796875 Re n io cdtuc u tio d n Re
Another Example Threads Global State Critical section data Non-commutative updates data Global data structure Lock Locks Threads perform repeated well-synchronized updates to global state � Common parallel programming paradigm: ◦ Radiosity (Singh et al. 1994) ◦ Locus. Route (Rose 1988) ◦ Delaunay Triangulation (Kulkarni et al. 2008)
Another Example Threads Global State Critical section data Non-commutative updates data Global data structure Lock Locks � Non-deterministic � Difficult internal states and output to eliminate using today’s programming idioms
Non-Determinism � Hard to create programs with repeatable results ◦ Determinism is often part of program specifications, eg: �Don’t want a verilog compiler to generate different circuits every time �Multi-threaded replicas in fault-tolerant systems must be deterministic ? ? ?
Non-Determinism � Hard to create programs with repeatable results ◦ Determinism is often part of program specifications, eg: �Don’t want a verilog compiler to generate different circuits every time �Multi-threaded replicas in fault-tolerant systems must be deterministic � Debugging becomes more difficult ◦ Heisenbugs ◦ Difficult to perform cyclic debugging �Common debugging method used for sequential programs � Testing ? ? ? offers weak guarantees ◦ Will the code pass the test again? ? ? ?
Deterministic Execution Model � Non-determinism causes many problems ◦ Why do we put up with it? � Present parallel programmer with deterministic execution model ◦ Interleave critical sections deterministically ◦ Only allow one interleaving �Find a good interleaving that preserves the parallel performance
Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A
Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A
Token Algorithm Thread 2 Thread Progress Thread 1 Threads racing to acquire Lock A
Token Algorithm Thread 2 Thread Progress Thread 1
Token Algorithm Thread 2 det_lock(A) Thread Progress Thread 1
Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token
Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token
Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token
Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token
Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token
Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token
Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token
Token Algorithm Thread 1 Thread 2 det_lock(A) wait_for_token() lock(A) pass_token() Thread Progress Token
Token Algorithm Thread 1 Thread 2 det_lock(A) det_unlock(A) wait_for_token() lock(A) pass_token() Guarantees that thread 1 will always acquire lock before thread 2 Thread Progress Token
Token Algorithm Thread 1 Thread 2 Thread Progress Token � Load imbalance! � Allow threads to pass token outside of critical sections � High overhead! � Too much serialization!
What Do We Need? � Method of tracking thread progress ◦ Must be deterministic ◦ Must match true progress of thread in physical time as close as possible ◦ Must be cheap to compute � Ability to pass the token in advance (before it is received) ◦ Decouples threads
Logical Time Algorithm � Each thread keeps a low overhead counter called its “Logical Clock” ◦ Incremented often, and in a way that tries to match progress of thread as close as possible ◦ Clocks collectively create a notion of “logical time” �Abstract counterpart to physical time � Threads take turns holding a “virtual token” ◦ Thread’s turn when its clock is a global minimum � Thread passes the “virtual token” by incrementing its clock ◦ Does not have to wait for its turn ◦ Allows threads to execute asynchronously while outside of critical sections
Logical Time Algorithm Physical Time Thread 2 t=3 Threads racing to acquire t=3 Lock A Deterministic Logical Time Thread 1
Logical Time Algorithm Physical Time Thread 2 t=6 t=8 Threads racing to acquire Lock A Deterministic Logical Time Thread 1
Logical Time Algorithm Physical Time det_lock(A) Thread 2 t=8 t=18 Deterministic Logical Time Thread 1
Logical Time Algorithm Physical Time det_lock(A) wait_for_turn(); lock(A); Thread 2 t=8 t=18 Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) wait_for_turn(); lock(A); t=18 Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 t=20 Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=22 t=20 Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time det_lock(A) t=24 t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time det_lock(A) t=24 t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time det_lock(A) det_unlock(A) t=22 t=26 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=24 t=29 Deterministic Logical Time Thread 1
Logical Time Algorithm Physical Time Thread 2 t=3 Threads racing to acquire t=3 Lock A Deterministic Logical Time Thread 1
Logical Time Algorithm Physical Time Thread 2 t=6 Threads racing to acquire t=8 Lock A Deterministic Logical Time Thread 1
Logical Time Algorithm Physical Time Thread 2 t=8 t=18 Threads racing to acquire Lock A Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=16 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time det_lock(A) wait_for_turn(); lock(A); t=18 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=20 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=23 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=23 det_lock(A) t=22 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time det_lock(A) det_unlock(A) t=22 t=26 wait_for_turn(); lock(A); Deterministic Logical Time Thread 1
Logical Time Algorithm Thread 2 Physical Time t=24 t=29 Guarantees that thread 1 will always acquire lock before thread 2 Deterministic Logical Time Thread 1
Are We Done? � Unfortunately, no. .
Nested Locks det_lock(A) Thread 2 t=25 det_lock(B) Lock A Lock B t=27 Deterministic Logical Time Thread 1
Nested Locks Thread 2 wait_for_turn(); lock(B); det_lock(A) t=25 det_lock(B) t=27 wait_for_turn(); lock(A); Cyclic Dependencies Deadlock! Deterministic Logical Time Thread 1
Preventing the Deadlock � Make threads spin in deterministic logical time � Must do so deterministically ◦ Threads wait for its turn after every increment ◦ Ensure that it performs the same number of increments on every run �Can’t wait for lock to be released in physical time �Wait for lock to be released in deterministic logical time ◦ Releasing thread stores its logical clock in the lock ◦ Spinning thread spins until its logical clock is greater than the time recorded in the lock See paper for more details
Kendo Prototype �A prototype deterministic locking framework ◦ Supports C and C++ code � Implements a subset of the pthreads API � Runs on commodity hardware today! � Uses performance counters to construct logical clocks ◦ Efficient and cheap, but �Not all track physical time �Some are non-deterministic �Not visible to other threads ◦ Processor traps every X amount of performance counter events �Increments thread’s logical clock ◦ Use “retired stores” performance counter event
Evaluation � Methodology ◦ Converted Splash 2 benchmark suite to run use the Kendo framework ◦ Eliminated data-races ◦ Checked determinism by examining output and the final deterministic logical clocks of each thread � Experimental Framework ◦ Processor: Intel Xeon 16 -way SMP (4 quad-cores) running at 2. 4 GHz ◦ OS: Linux 2. 6. 25 (modified for performance counter support)
Performance 1. 75 Execution Time (Relative to Non-Deterministic) 1. 50 1. 25 1. 00 0. 75 0. 50 0. 25 0. 00 quicksort tsp ocean barnes radiosity raytrace Benchmark 4 Processors fmm volrend water-nsqrd. Geomean
Performance 2. 00 Execution Time (Relative to Non-Deterministic) 1. 75 1. 50 1. 25 1. 00 0. 75 0. 50 0. 25 0. 00 quicksort tsp ocean barnes radiosity raytrace Benchmark 4 Processors 8 Processors fmm volrend water-nsqrd. Geomean
Performance 2. 75 Execution Time (Relative to Non-Deterministic) 2. 50 2. 25 2. 00 1. 75 1. 50 1. 25 1. 00 0. 75 0. 50 0. 25 0. 00 quicksort tsp ocean barnes radiosity raytrace fmm Benchmark 4 Processors 8 Processors 16 Processors volrend water-nsqrd. Geomean
Overhead Breakdown 4 Processors Execution Time (Relative to Non-Deterministic) 1. 8 1. 6 1. 4 1. 2 1. 0 0. 8 0. 6 0. 4 0. 2 0. 0 quicksort tsp ocean barnes radiosity raytrace fmm volrend Benchmarks Wait Overhead Interrupt Overhead Application Time water-nsqrd Geomean
Effect of Interrupt Frequency 5. 0 Execution Time (Relative to Non-Deterministic) 4. 5 4. 0 3. 5 3. 0 2. 5 2. 0 1. 5 1. 0 0. 5 0. 0 64 128 256 512 1 K 2 K 4 K 8 K 16 K Interrupt Period Application Time Interrupt Overhead Deterministic Wait Overhead
Deterministic Multithreading Taxonomy � Weak Determinism ◦ Deterministic interleaving of all lock acquisitions for a given input ◦ Provided by Kendo ◦ Cheap to enforce � Strong Determinism ◦ Deterministic interleaving for all accesses to memory for a given input ◦ Implemented in DMP work by Devietti et. al ◦ Attractive, but difficult to achieve efficiently in software
Weak Determinism Can Be Sufficient � Offers same guarantees as strong determinism for data-race- free program executions ◦ Checkable with a specialized dynamic race detector! � Provides a systematic way of debugging code ◦ Non-race bugs are always reproducible ◦ First data race detectible via race detector � Like Cilk + “Nondeterminator” race detector combination ◦ But for arbitrary multithreaded code and arbitrary non-commutative critical sections.
Conclusion � An efficient software approach for deterministic multithreading � Kendo: A prototype implementing this approach ◦ Simple yet efficient! ◦ Runs on today’s commodity hardware � Introduced Strong/Weak Determinism ◦ Weak determinism provides a systematic method of debugging multithreaded code ? ? ? ? http: //groups. csail. mit. edu/commit
- Marek olszewski
- Fedrigoni paper suppliers
- Quid scrabble
- Kendo kata 1-10
- Kendo kata 1-10
- Productively efficient vs allocatively efficient
- Allocative efficiency
- C b a d
- Productively efficient vs allocatively efficient
- Productive inefficiency and allocative inefficiency
- What is simultaneous multithreading
- Thread dalam java
- Multitasking in java
- Multithreading models
- Fine grained multithreading
- Fine grained multithreading
- Coarse-grained multithreading
- Fine grained multithreading
- Fine grained multithreading
- Multithreading models in os
- Multithreading patterns
- Multithreading adalah
- Ue4 concurrency
- Multitasking vs multithreading in java
- What is hardware multithreading
- Spanning tree of a graph
- Non deterministic algorithm for sorting
- Non-deterministic algorithm
- Deterministic demand vs stochastic demand
- Vicarious reinforcement
- Non-deterministic algorithm
- Nfa non deterministic finite automata
- Convert nfa to dfa
- Statistical vs deterministic relationship
- Deterministic games examples
- Deterministic seismic hazard analysis
- Dpda vs pda
- Non-deterministic algorithm
- Deterministic operations research
- Seventuple
- Agent a chapter 2
- Deterministic definition
- Deterministic and probabilistic inventory models
- Statistical versus deterministic relationship
- Deterministic lockstep
- Deterministic game
- Deterministic cross device tracking
- Dfa stands for in automata
- Non deterministic algorithm for sorting
- Deterministic attribution
- Statistical vs deterministic relationship
- Deterministic finite automaton
- A deterministic turing machine is: *
- Deterministic
- State yang merupakan non-deterministic ditandai dengan
- Deterministic finite state automata
- Deterministic and stochastic inventory models
- Ing marek pavlik
- Marek pavlik prednasky
- Ing marek pavlik phd
- Jan marek memorial
- Marek jamro
- Marek fapšo
- Marek darecki
- Słowosieć