What is the Cost of Determinism?
Cedomir Segulja, Tarek S. Abdelrahman
University of Toronto


Non-Determinism
• Same program + same input ≠ same output
• This is bad for …
  • Testing
    • Too many interleavings to test
  • Debugging
    • Hard to debug when behavior is not repeatable
  • Selling
    • CAD tool users expect each run to produce the same circuit

Determinism
• Is good, but costly
  1. What is the fundamental cost of determinism?
  2. What is this cost across various execution environments?
     • “Determinism in the field”

Deterministic scheduler                      Maximum slowdown
DMP [Devietti et al. 2009]                   1.7x
Kendo [Olszewski et al. 2009]                1.6x
Grace [Berger et al. 2009]                   3.6x
CoreDet [Bergan et al. 2010]                 10x
Calvin [Hower et al. 2011]                   1.7x
RCDC [Devietti et al. 2011]                  1.7x
Dthreads [Liu et al. 2011]                   4x
Conversion [Merrifield and Eriksson 2013]    5x
Parrot [Cui et al. 2013]                     3.8x
RFDet [Lu et al. 2014]                       2.6x

Source: [Bergan et al. 2011] and the respective papers
*Only to show that determinism comes at a cost, and not to be used for a direct comparison (different features, benchmarks, # threads, etc.)

What is Determinism?
• Property that requires observing the same output whenever the program runs with the same input
• Synchronization-order determinism [Lu and Scott 11]
  • Requires the same program result and the same order of synchronization
  • More flexible than internal determinism
  • Still greatly eases testing [Cui et al. 13]
• We assume data-race freedom
  • Determinism during debugging is needed
  • But the cost of determinism matters the most in production
  • All data races are bugs [Boehm 2008, S. Adve 2010, Marino et al. 2010, Lucia et al. 2010, …]
  • Data races in general do not help performance [Boehm 12]
[Diagram labels: External, Sync. Order, Internal]

What is the impact of enforcing a fixed synchronization order on program execution time?

Schedule-Record-Replay Framework
[Diagram labels: application (thread 1, thread 2); scheduler (serial, round-robin, dynamic-S, dynamic-A, hybrid); recorder; schedule; replayer]

Replayer
• Force threads to wait only when absolutely necessary under the schedule
• And do so with as little overhead as possible
[Figure: normalized execution time (axis from 0.9 to 1.1) of non-deterministic execution vs. non-deterministic execution with the replayer’s overhead, for the splash and parsec benchmarks]
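As a rough illustration of the first goal, the sketch below makes a thread wait only while the recorded schedule says that another thread must perform the next synchronization operation. It is a minimal Python model and not the authors' replayer: the `Replayer` class, its `sync` method, and the `replay` helper are hypothetical names, and a schedule is assumed to be a plain list of thread ids.

```python
import threading


class Replayer:
    """Hypothetical replayer: enforces a recorded order of synchronization operations."""

    def __init__(self, schedule):
        self.schedule = list(schedule)  # recorded order of thread ids (assumed format)
        self.pos = 0                    # index of the next scheduled operation
        self.cond = threading.Condition()

    def sync(self, tid, action):
        """Run `action` as thread `tid`'s next synchronization operation."""
        with self.cond:
            # Wait only while the schedule names a different thread for this slot.
            while self.pos < len(self.schedule) and self.schedule[self.pos] != tid:
                self.cond.wait()
            result = action()
            self.pos += 1
            self.cond.notify_all()  # let the next scheduled thread re-check its turn
            return result


def replay(schedule, n_threads):
    """Run n_threads workers under the schedule; return the observed order."""
    replayer = Replayer(schedule)
    log = []
    ops = {tid: schedule.count(tid) for tid in range(n_threads)}

    def worker(tid):
        for _ in range(ops[tid]):
            replayer.sync(tid, lambda: log.append(tid))

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Calling `replay([0, 1, 1, 0, 1], 2)` returns the same list on every run, because each thread proceeds only when its id is next in the schedule. Code outside `sync` runs without any waiting, which is the part of the execution the replayer leaves untouched.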

Schedules
• When does a thread pass its turn?
  • At the end – serial
  • After each synchronization operation – round-robin
  • After each instruction/store – dynamic-A/dynamic-S
  • After N instructions – hybrid
    • N = 100,000
    • No “reduced serial mode”

Deterministic scheduler                      Schedule
Grace [Berger et al. 2009]                   serial
Dthreads [Liu et al. 2011]                   round-robin
Conversion [Merrifield and Eriksson 2013]    round-robin
Parrot [Cui et al. 2013]                     round-robin
Kendo [Olszewski et al. 2009]                dynamic
RCDC [Devietti et al. 2011]                  dynamic
RFDet [Lu et al. 2014]                       dynamic
DMP [Devietti et al. 2009]                   hybrid
CoreDet [Bergan et al. 2010]                 hybrid
Calvin [Hower et al. 2011]                   hybrid
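To make the turn-passing rules concrete, here is a toy Python model of the synchronization order each schedule would produce. It is a sketch under strong assumptions, not any of the listed schedulers: each thread is described only by the number of instructions it executes before each synchronization operation, ties go to the lower thread id, and the function name `sync_order` and its parameters are hypothetical.

```python
def sync_order(threads, policy, quantum=100_000):
    """Return the order (as thread ids) in which synchronization operations occur.

    threads[i] lists how many instructions thread i executes before each of
    its synchronization operations (an assumed, simplified program model).
    """
    events = []
    for tid, gaps in enumerate(threads):
        clock = 0  # instructions executed so far by this thread
        for op, gap in enumerate(gaps):
            clock += gap
            if policy == "serial":          # turn passes when the thread ends
                key = (tid, op)
            elif policy == "round-robin":   # turn passes after each sync operation
                key = (op, tid)
            elif policy == "dynamic":       # turn passes after each instruction
                key = (clock, tid)
            elif policy == "hybrid":        # turn passes after `quantum` instructions
                key = ((clock - 1) // quantum, tid, clock)
            else:
                raise ValueError(f"unknown policy: {policy}")
            events.append((key, tid))
    return [tid for _, tid in sorted(events)]


# Thread 0 synchronizes rarely (every 5 instructions), thread 1 often (every 1).
example = [[5, 5], [1, 1, 1]]
print(sync_order(example, "serial"))       # [0, 0, 1, 1, 1]
print(sync_order(example, "round-robin"))  # [0, 1, 0, 1, 1]
print(sync_order(example, "dynamic"))      # [1, 1, 1, 0, 0]
```

In this model the hybrid rule with a quantum of 1 matches the dynamic order, and with a quantum larger than every thread's instruction count it matches the serial order, which is one way to see it as sitting between the two.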

Platform •

Slowdown under each schedule, splash and parsec benchmarks

Schedule       Average slowdown   Maximum slowdown
serial         3.61               7.72
round-robin    1.60               5.27
dynamic-S      1.09               1.63
dynamic-A      1.04               1.33
hybrid         1.17               2.67

splash benchmarks: barnes, cholesky, fft, fmm, lu_cb, lu_ncb, ocean_cp, ocean_ncp, radiosity, radix, raytrace, volrend, water_nsquared, water_spatial
parsec benchmarks: blackscholes, bodytrack, dedup, facesim, ferret, fluidanimate, raytrace, streamcluster, swaptions, vips
[Table: per-benchmark slowdown for each schedule; the individual rows are not preserved in this transcript]

For this set of benchmarks and our platform, with implementation overhead set aside, the fundamental cost of determinism is small.

What is the performance cost of insisting on the same schedule across different environments?

Schedule-Record-Perturb-Replay Framework
[Diagram labels: application (thread 1, thread 2); scheduler (serial, round-robin, dynamic-S, dynamic-A, hybrid); recorder; schedule; perturber (idle, small perturbations, background processes, DVFS, NUMA, architectures); replayer]

Perturber
• Small perturbations (context switches, thread migrations, page faults)
  • Simulate first order effects by inserting small delays (μs and ms)
• Background processes
  • Spawn additional threads and control their work to sleep ratio
• Dynamic voltage and frequency scaling (DVFS)
  • Use Linux’s cpufreq system to explore different DVFS policies
• Non-uniform memory access (NUMA)
  • Spread threads over two NUMA nodes
• Asymmetric architectures
  • Use DVFS to create asymmetry [Shelepov et al. 2009]
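The background-process perturbation could be approximated along the following lines. This is a minimal Python sketch, not the authors' perturber: the function names `duty_cycle` and `background_load`, the 10 ms default period, and the busy-wait loop are all assumptions made for illustration.

```python
import threading
import time


def duty_cycle(work_to_sleep, period_s=0.01):
    """Split one period into (work, sleep) seconds for a given work:sleep ratio."""
    if work_to_sleep < 0:
        raise ValueError("work_to_sleep must be non-negative")
    work_s = period_s * work_to_sleep / (1.0 + work_to_sleep)
    return work_s, period_s - work_s


def background_load(stop, work_to_sleep, period_s=0.01):
    """Alternate busy work and sleep until `stop` (a threading.Event) is set."""
    work_s, sleep_s = duty_cycle(work_to_sleep, period_s)
    while not stop.is_set():
        deadline = time.perf_counter() + work_s
        while time.perf_counter() < deadline:
            pass  # busy-wait to occupy the CPU during the work phase
        time.sleep(sleep_s)


if __name__ == "__main__":
    stop = threading.Event()
    # Two background threads, each working three times as long as it sleeps.
    loaders = [threading.Thread(target=background_load, args=(stop, 3.0))
               for _ in range(2)]
    for t in loaders:
        t.start()
    time.sleep(0.1)  # the application under test would run here
    stop.set()
    for t in loaders:
        t.join()
```

In CPython the busy loops contend for the interpreter lock, so a real perturber would more likely run such loads as separate processes or native threads; the sketch only shows how a work-to-sleep ratio translates into a duty cycle.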

Metric •

Slowdown in each execution environment, splash and parsec benchmarks
[Table: per-benchmark slowdown in each environment; the individual rows are not preserved in this transcript]
Column headings: Quiet; Small perturbations; Background proc. (balanced, unbalanced); DVFS (auto, manual); NUMA; Asym. Arch. (4/4, 1/7)
avg. slowdown: values range from 1.04 to 1.19
max. slowdown: values range from 1.29 to 1.94
splash benchmarks: barnes, cholesky, fft, fmm, lu_cb, lu_ncb, ocean_cp, ocean_ncp, radiosity, radix, raytrace, volrend, water_nsquared, water_spatial
parsec benchmarks: blackscholes, bodytrack, dedup, facesim, ferret, fluidanimate, raytrace, streamcluster, swaptions, vips

Insisting on the same schedule in the presence of skewed conditions can slow down execution by a factor of almost 2x.

Conclusions
• Employed the schedule-record-replay framework to divorce implementation overhead from the fundamental cost of enforcing deterministic execution
  • Fundamental cost of determinism is small (4% on avg., 33% max.)
  • There is room for lowering overheads in current deterministic systems
• Measured this fundamental cost across a range of execution environments
  • The cost rises to almost 2x when threads face skewed conditions
  • Do we need a more relaxed definition of determinism?
• Quantified various sources of non-determinism
  • Deterministic logical clocks are not deterministic (not only due to performance counter imperfections [Weaver et al. 2013])

Thank you!