Relaxed Consistency Deterministic Computer: "deterministic deeds, done dirt cheap"

Relaxed Consistency Deterministic Computer
"deterministic deeds, done dirt cheap"
Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze, Dan Grossman

determinism improves the software development cycle
  • Test: testing results are reproducible; no need to stress test
  • Debug: reverse debugging is possible; production bugs can be reproduced in-house
  • Deploy: tested inputs behave identically in production; more robust production code

History of Deterministic Execution
  • Deterministic execution for arbitrary programs: DMP [ASPLOS '09], CoreDet [ASPLOS '10], dOS [OSDI '10], Determinator [OSDI '10], Calvin [HPCA '11], [ASPLOS '11]
  • Deterministic execution for restricted programs: Kendo [ASPLOS '09], Grace [OOPSLA '09]

History of Deterministic Execution
  • systems: DMP [ASPLOS '09], CoreDet [ASPLOS '10], [ASPLOS '11]
  • consistency models: seq. consistency → total store order → DRF0 [ISCA '90]
[Comic: "Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com, © 2008]

Contributions / Outline
  1. DMP-HB: a new deterministic consistency model based on DRF0 with improved performance
  2. a low-complexity hw/sw deterministic execution system
     hw: store buffers and instruction counting
     sw: everything else
  3. hardware simulation using Pin
  4. C/C++ compiler based on LLVM, runs on commodity multicore

starting simple: serialization
  • deterministic quantum size + deterministic scheduling → determinism
[Figure: threads T1, T2, T3 executing quanta one after another, grouped into quantum rounds; x-axis: time →]
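
The serialization idea can be sketched as a toy scheduler. This is only an illustration under stated assumptions, not the system from the talk: `run_serialized`, the 0-based thread IDs, the thread-ID scheduling order, and the string "instructions" are all hypothetical.

```python
def run_serialized(programs, quantum_size):
    """Run threads one at a time, quantum by quantum; return the global order."""
    pcs = [0] * len(programs)  # next instruction index for each thread
    trace = []
    while any(pc < len(prog) for pc, prog in zip(pcs, programs)):
        # One quantum round: every thread gets one quantum, in thread-ID order.
        for tid, prog in enumerate(programs):
            end = min(pcs[tid] + quantum_size, len(prog))
            trace.extend((tid, insn) for insn in prog[pcs[tid]:end])
            pcs[tid] = end
    return trace


# Three threads of different lengths, quantum size of 2 instructions.
programs = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
trace = run_serialized(programs, quantum_size=2)
print(trace)
```

Because both the quantum size and the scheduling order are fixed, the interleaving depends only on the programs themselves, so every run produces the same trace.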

recovering parallelism with DMP-TSO
  • parallel mode: buffer all stores (no communication)
  • commit mode: deterministically publish buffers
  • serial mode: for atomic ops
[Figure: T1, T2, T3 moving through parallel, commit, and serial modes, with operations wr A, rd A, lock A, lock B; x-axis: time →]
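
A minimal sketch of parallel mode and commit mode, assuming a dictionary as shared memory and thread-ID order as the deterministic commit order. `BufferedThread` and `commit` are hypothetical names, and serial mode is not modelled.

```python
class BufferedThread:
    """A thread whose stores stay in a private buffer until commit."""

    def __init__(self, tid, memory):
        self.tid = tid
        self.memory = memory  # shared memory: addr -> value
        self.buffer = {}      # private store buffer: addr -> value

    def store(self, addr, value):
        self.buffer[addr] = value  # parallel mode: no communication

    def load(self, addr):
        # A thread sees its own buffered stores first, then shared memory.
        if addr in self.buffer:
            return self.buffer[addr]
        return self.memory.get(addr, 0)


def commit(threads, memory):
    """Commit mode: publish all buffers in a fixed (thread-ID) order."""
    for t in sorted(threads, key=lambda t: t.tid):
        memory.update(t.buffer)
        t.buffer.clear()


memory = {}
t1, t2 = BufferedThread(1, memory), BufferedThread(2, memory)
t1.store("A", 10)
seen_by_t2 = t2.load("A")  # T1's store is still buffered, so T2 reads 0
t2.store("A", 20)
commit([t2, t1], memory)   # list order does not matter; thread-ID order does
print(seen_by_t2, memory["A"])
```

In this sketch the final value of A is decided by the commit order alone, not by which thread happened to execute its store first.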

Why is DMP-TSO slow?
  • sources of slowdown: serialization, imbalance
  • parallel-mode synchronization complements relaxed consistency
[Figure: T1, T2, T3 across parallel and commit modes, annotated with Kendo [ASPLOS '09] and DMP-HB; x-axis: time →]

synchronization in parallel mode with Kendo [Olszewski et al., ASPLOS '09]
  • the thread with the globally min insn count can do an atomic op
  • example: T2 does not have the globally min insn count
[Figure: T1, T2, T3 with a lock A operation; x-axis: instruction count →]
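
The "globally min insn count" rule can be sketched as a single predicate. This is a simplified illustration: `may_do_atomic_op` is a hypothetical helper, the counts are made up, and breaking ties by thread ID is an assumption made here so that the check always picks exactly one thread.

```python
def may_do_atomic_op(tid, insn_counts):
    """True if thread `tid` holds the globally minimal instruction count.

    insn_counts maps thread ID -> instructions executed so far.
    Ties are broken by thread ID (an assumption of this sketch).
    """
    mine = (insn_counts[tid], tid)
    return all(mine <= (count, other) for other, count in insn_counts.items())


counts = {1: 120, 2: 300, 3: 250}
print(may_do_atomic_op(2, counts))  # T2 must wait: T1 and T3 are behind it
print(may_do_atomic_op(1, counts))  # T1 has the minimum and may proceed

counts[1] = 400                     # T1 keeps executing; now T3 is the minimum
print(may_do_atomic_op(3, counts))
```

The decision depends only on instruction counts, which are a property of the program's execution rather than of wall-clock timing, so the order in which threads win atomic operations is repeatable.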

DRF0: happens-before consistency [Adve and Hill, ISCA '90]
  • happens-before edges defined by synchronization operations
  • remote updates visible via cross-thread happens-before edges
  • SC for DRF programs
  • upholds C++/Java memory models
  • programmer-visible model doesn't change
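
A toy model of "remote updates visible via cross-thread happens-before edges". Everything here is a hypothetical illustration: each thread has a private view of memory, a release copies that view onto the lock, and an acquire merges the lock's view into the acquiring thread. It ignores the ordering of conflicting writes, so it only makes sense for data-race-free examples.

```python
class ToyThread:
    def __init__(self, name):
        self.name = name
        self.view = {}  # what this thread can currently see: addr -> value


class ToyLock:
    def __init__(self):
        self.published = {}  # updates made visible by past releases


def write(thread, addr, value):
    thread.view[addr] = value


def read(thread, addr):
    return thread.view.get(addr, 0)


def release(thread, lock):
    # Source of a happens-before edge: publish everything the thread has seen.
    lock.published.update(thread.view)


def acquire(thread, lock):
    # Sink of a happens-before edge: pick up everything published on the lock.
    thread.view.update(lock.published)


t1, t3, lock_a = ToyThread("T1"), ToyThread("T3"), ToyLock()
write(t1, "x", 1)
before = read(t3, "x")  # no happens-before edge yet, so T3 still reads 0
release(t1, lock_a)
acquire(t3, lock_a)
after = read(t3, "x")   # the edge T1 release -> T3 acquire makes x = 1 visible
print(before, after)
```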

sync in parallel mode (Kendo) + relaxed consistency (DRF0) + deterministic scheduling (DMP) = DMP-HB

DMP-HB: happens-before determinism
  • no serial mode
  • less imbalance
  • explicit fences rarely necessary
  • explicit fence iff inter-thread HB edge doesn't cross commit
[Figure: T1, T2, T3 across parallel and commit modes, with lock A / unlock A and a later lock A on another thread; consistency-model labels TSO, RC, DRF0; x-axis: time →]
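
The fence rule on this slide can be written as a small predicate. This sketch assumes that a commit separates consecutive quantum rounds and that each synchronization operation is tagged with the round it executes in; `needs_explicit_fence` and the round numbers are hypothetical.

```python
def needs_explicit_fence(release_round, acquire_round):
    """Explicit fence iff the inter-thread HB edge does not cross a commit.

    release_round: quantum round in which one thread releases (e.g. unlock A)
    acquire_round: quantum round in which another thread acquires (e.g. lock A)
    A commit is assumed to sit between consecutive rounds.
    """
    if acquire_round < release_round:
        raise ValueError("an acquire cannot precede the release it observes")
    # Same round: no commit in between, so buffered stores are not yet published.
    return acquire_round == release_round


print(needs_explicit_fence(release_round=3, acquire_round=4))  # edge crosses a commit
print(needs_explicit_fence(release_round=3, acquire_round=3))  # same parallel phase
```

When the acquire lands in a later round, the intervening commit has already published the releasing thread's buffered stores, which is why the slide can say explicit fences are rarely necessary.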

Architecture
  • runtime system
  • Store Buffers in Private $: StoreToSB, CommitSB, SaveSB, RestoreSB
  • Precise Insn Counting: StartInsnCount, StopInsnCount, ReadInsnCount
  • Traps: SBFull, QuantumReached
  • application/OS can choose nondeterminism
  • align context switches with quantum boundaries
[Figure: Core, L1$, and L2$ with the hardware additions above, beneath the runtime system]
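
A software model of the hardware interface named on this slide, to show how a runtime might drive it. Only the operation and trap names come from the slide; the semantics below (capacity counted in distinct addresses, traps raised as exceptions, a `retire` hook standing in for instruction execution) are assumptions of this sketch.

```python
class SBFull(Exception):
    """Trap: the store buffer has no room for another address."""


class QuantumReached(Exception):
    """Trap: the instruction count has reached the quantum size."""


class CoreModel:
    def __init__(self, memory, sb_capacity, quantum_size):
        self.memory = memory  # shared memory: addr -> value
        self.sb = {}          # store buffer: addr -> value
        self.sb_capacity = sb_capacity
        self.quantum_size = quantum_size
        self.insn_count = 0
        self.counting = False

    # Store buffer operations
    def store_to_sb(self, addr, value):
        if addr not in self.sb and len(self.sb) >= self.sb_capacity:
            raise SBFull()
        self.sb[addr] = value

    def commit_sb(self):
        self.memory.update(self.sb)
        self.sb.clear()

    def save_sb(self):
        saved, self.sb = self.sb, {}  # e.g. at a context switch
        return saved

    def restore_sb(self, saved):
        self.sb = dict(saved)

    # Instruction counting operations
    def start_insn_count(self):
        self.counting = True

    def stop_insn_count(self):
        self.counting = False

    def read_insn_count(self):
        return self.insn_count

    def retire(self, n=1):
        """Hypothetical hook: the core retires n instructions."""
        if self.counting:
            self.insn_count += n
            if self.insn_count >= self.quantum_size:
                raise QuantumReached()


core = CoreModel(memory={}, sb_capacity=2, quantum_size=3)
core.start_insn_count()
core.store_to_sb("A", 1)
core.retire()
core.commit_sb()
print(core.memory, core.read_insn_count())
```

In this model the runtime would catch `SBFull` and `QuantumReached` to end the current quantum, which matches the slide's split of "hw: store buffers and instruction counting" and "sw: everything else".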

Experimental Setup

Pin-based simulator
  • 1 IPC, except for memory ops
  • PARSEC v2.1 with simsmall inputs

    structure     size             access latency
    private L1    8-way, 32 KB     1 cycle
    private L2    8-way, 256 KB    10 cycles
    shared L3     16-way, 8 MB     35 cycles
    memory        -                120 cycles

extended CoreDet C/C++ compiler [ASPLOS '10]
  • 8-core Intel Harpertown @ 2.8 GHz, 10 GB RAM
  • PARSEC v2.1 with simlarge inputs

Simulation: overhead < 60% in worst case
[Figure: bar chart "Overheads"; y-axis: % overhead compared to nondet, 0% to 70%; series: 2p, 4p, 8p, 16p; x-axis: blacksch, dedup, ferret, fluid, streamcl, swaptions, vips, x264, each labelled with its quantum size (1k, 25k, or 50k insns)]

Compiler: DMP-HB vs. DMP-TSO
[Figure: bar chart; y-axis: % overhead compared to nondet, 0% to 450%; series: hb, tso; x-axis: 2, 4, and 8 threads for blackscholes (quantum size 200k insns), swaptions (200k), fluidanimate (50k), fmm (50k)]

Conclusions
  • DMP-HB: a new deterministic consistency model
  • : a new deterministic multiprocessor design
    – no speculation
    – lightweight hardware support
  • Relaxed consistency is a natural optimization for determinism
source code and data available at http://sampa.cs.washington.edu

Thanks! Questions?
source code and data available at http://sampa.cs.washington.edu

DRF0 hardware requirements [ISCA '90]
  1. Intra-processor dependencies are preserved.
  2. All writes to the same location can be totally ordered based on their commit times, and this is the order in which they are observed by all processors.
  3. All synchronization operations to the same location can be totally ordered based on their commit times, and this is also the order in which they are globally performed. Further, if S1 and S2 are synchronization operations and S1 is committed and globally performed before S2, then all components of S1 are committed and globally performed before any in S2.
  4. A new access is not generated by a processor until all its previous synchronization operations (in program order) are committed.
  5. Once a synchronization operation S by processor Pi is committed, no other synchronization operations on the same location by another processor can commit until after all reads of Pi before S (in program order) are committed and all writes of Pi before S are globally performed.