DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism

DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism
Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve
Department of Computer Science, University of Illinois at Urbana-Champaign
Illinois-Intel Parallelism Center

Motivation
• Shared memory is the de facto model for multicore SW and HW
• BUT …
  – Complex SW: data races, unstructured parallelism, memory model, …
  – Inefficient HW: complex coherence/consistency, unnecessary traffic, …
• Recent work on disciplined shared memory
  – SW: easier programming model
  – HW: if SW is more disciplined, can we build more efficient HW?

Disciplined Shared Memory
Disciplined Shared-Memory = Global address space + explicit, structured side-effects
(in place of implicit, anywhere communication and synchronization)

Disciplined Shared Memory
Deterministic Parallel Java (DPJ) – strong safety properties (OOPSLA ’09)
• Determinism-by-default, simple semantics
Disciplined Shared Memory = explicit effects + structured parallel control
DeNovo – performance, complexity and power efficient (PACT ’11)
• Simplify coherence and consistency

Limitation
• DeNovo for deterministic programs
  – Important assumptions
    1. No conflicting concurrent accesses, only barrier synchronization
    2. Known side-effects
  – Allowed DeNovo to eliminate design complexity and inefficiency
• Challenges for nondeterministic programs
  – The assumptions no longer hold
    1. Can have conflicting concurrent accesses; must support lock synchronization
    2. Side-effects unknown in critical sections

Contribution
Deterministic Parallel Java (DPJ) – strong safety properties
• Determinism-by-default, simple semantics
• Explicit & safe non-determinism (POPL ’11)
Disciplined Shared Memory = explicit effects + structured parallel control
DeNovoND: non-deterministic codes with the benefits of DeNovo
• Minimal additional HW for non-determinism
• Comparable performance to MESI
• 30% lower network traffic than MESI
• PLUS all advantages of DeNovo for deterministic codes

Outline
• Motivation
• Background
  – DPJ/DeNovo for deterministic codes
  – DPJ support for disciplined non-determinism
• DeNovoND Design
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work

DPJ for Deterministic Codes
• Structured parallel control
  – Fork-join parallelism
• Explicit region and effect
  – Regions divide heap
  – Read or write effects on regions
• Data-race freedom guarantee
  – Simple, modular type checking
[Figure: parallel tasks issuing LD and ST operations, with a write effect on a heap region]

DPJ for Deterministic Codes
• Java-compatible type system
• Structured parallel control
  – Fork-join parallelism
• Explicit region and effect
  – Regions divide heap
  – Read or write effects on regions
• Data-race freedom guarantee
• Hardware: simplify coherence problems!
[Figure: parallel tasks issuing LD and ST operations, with a write effect on a heap region]

DeNovo for Deterministic Codes
• Coherence Enforcement
  1. Invalidate stale copies in private cache
  2. Track up-to-date copy
• Explicit effects
  – Compiler knows all writeable regions in this parallel phase
  – Cache can self-invalidate before next parallel phase
• Registration
  – Directory keeps track of one up-to-date copy
  – Writer registers itself before next parallel phase

DeNovo for Deterministic Codes
• No space overhead
  – Keep valid data or registered core id
  – LLC data arrays double as directory (registry)
• No transient states
• No invalidation traffic
• No false sharing
[Figure: state diagram with three states (Invalid, Valid, Registered) and Read/Write transitions]
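The three states and the phase-boundary behavior described on these slides can be sketched as a toy model. This is only an illustration under stated assumptions, not the DeNovo hardware protocol: the class and function names (`L1Cache`, `run_phase_example`), the word-granularity dictionary, and the idea of passing the set of possibly written addresses to `self_invalidate` are all hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    VALID = "V"
    REGISTERED = "R"


class L1Cache:
    """Toy private cache holding one coherence state per word address."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.state = {}

    def read(self, addr):
        # A read of an Invalid (or absent) word misses and fetches a Valid copy.
        if self.state.get(addr, State.INVALID) is State.INVALID:
            self.state[addr] = State.VALID
            return "miss"
        return "hit"

    def write(self, addr, registry):
        # A write registers this core as the holder of the up-to-date copy.
        registry[addr] = self.core_id
        self.state[addr] = State.REGISTERED

    def self_invalidate(self, written_addrs):
        # Before the next phase, drop Valid copies of data that may have been
        # written in this phase. Registered words are up to date and are kept.
        for addr in written_addrs:
            if self.state.get(addr) is State.VALID:
                self.state[addr] = State.INVALID


def run_phase_example():
    registry = {}                  # shared L2: address -> registered core id
    c1, c2 = L1Cache("C1"), L1Cache("C2")
    X, Y = 0x10, 0x20              # hypothetical addresses
    c1.read(Y)                     # C1 holds a Valid copy of Y before the phase
    c1.write(X, registry)          # in the phase, C1 writes X and C2 writes Y
    c2.write(Y, registry)
    for cache in (c1, c2):         # phase boundary
        cache.self_invalidate({X, Y})
    return registry, c1.state, c2.state
```

In this sketch no invalidation message is ever sent to another core: C1 discards its own stale copy of Y, and the registry alone records where the up-to-date copy lives.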

Example Run
[Figure: X and Y are in DeNovo regions. The L1s of Core 1 and Core 2 each issue a store (ST) and send a Registration to the shared L2, which records the registered core (C1 or C2) and replies with an Ack; self-invalidate( ) is then called. Word states shown: Registered (R), Valid (V), Invalid (I).]

DPJ Support for Safe Non-Determinism
• Nondeterminism comes from conflicting concurrent accesses
• Isolate these accesses as “atomic”
  – Enclosed in “atomic” sections
  – “Atomic” regions and effects
→ “Disciplined” non-determinism
  – Race freedom, strong isolation
  – Determinism-by-default semantics
• DeNovoND converts “atomic” statements into locks
[Figure: LD and ST operations enclosed in an atomic section]

Outline
• Motivation
• Background
• DeNovoND Design
  – Memory Consistency Model
  – Distributed Queue-based Lock
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work

Memory Consistency Model
• Deterministic accesses: a load observes the last store to the address from
  1. The same task in this parallel phase
  2. Or before this parallel phase
[Figure: ST 0xa and LD 0xa shown relative to a parallel phase, labeled “DeNovo Coherence Mechanism”]

Memory Consistency Model
• Non-deterministic accesses: a load observes the last store to the address from
  1. The same task in this parallel phase
  2. Or before this parallel phase
  3. Or in preceding critical sections
[Figure: ST 0xa and LD 0xa in critical sections within a parallel phase]

Coherence for non-deterministic data
• Coherence Enforcement
  1. Invalidate stale copies in private cache
  2. Track up-to-date copy
• When to invalidate?
  – Between the start of critical section and any read
• What to invalidate?
  – Entire cache? Regions with “atomic” effect?
  – Track atomic writes in a signature, transfer with lock
• Registration
  – Writer updates before next critical section
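The "track atomic writes in a signature, transfer with lock" idea can be sketched in a few lines. This is a minimal sketch, assuming an exact Python `set` stands in for the hardware signature and that a core merges the incoming signature into its own on acquire; the names `Lock`, `Core`, `atomic_read`, and `atomic_write` are hypothetical and do not come from the slides.

```python
class Lock:
    def __init__(self):
        self.signature = set()      # atomic writes carried along with the lock


class Core:
    def __init__(self, name):
        self.name = name
        self.valid = set()          # addresses cached in Valid state
        self.registered = set()     # addresses this core wrote (up to date)
        self.signature = set()      # atomic writes this core knows about

    def acquire(self, lock):
        # The previous holder's signature arrives together with the lock.
        self.signature |= lock.signature

    def release(self, lock):
        lock.signature = set(self.signature)

    def atomic_write(self, addr):
        self.signature.add(addr)
        self.valid.discard(addr)
        self.registered.add(addr)

    def atomic_read(self, addr):
        if addr in self.registered:
            return "hit"            # own write, already up to date
        if addr in self.valid and addr in self.signature:
            self.valid.discard(addr)    # possibly stale: self-invalidate
        if addr in self.valid:
            return "hit"
        self.valid.add(addr)        # fetch the up-to-date copy
        return "miss"
```

Only addresses named in the signature are invalidated, so the rest of the cache survives a lock acquire. Note that in this naive version a repeated read of the same address self-invalidates again, because the address stays in the signature.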

Distributed Queue-based Lock
• Lock primitive that works on DeNovoND
  – No directory, no write invalidation → no spinning for lock
• Modeled after QOSB lock
  – Lock requests form a distributed queue
  – But much simpler
• Details in the paper
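As a rough intuition for "lock requests form a distributed queue", the sequential sketch below keeps only a tail pointer at the lock's home and a successor pointer at each requester, so a release hands the lock directly to the next waiter. It is an assumption-laden illustration in the spirit of software queue locks, not the QOSB or DeNovoND protocol (those details are in the paper); `LockHome`, `Core`, and their fields are hypothetical names.

```python
class LockHome:
    """Home node of the lock: remembers only the tail of the queue."""

    def __init__(self):
        self.tail = None


class Core:
    def __init__(self, name):
        self.name = name
        self.next = None            # successor in the queue, if any
        self.holds_lock = False

    def acquire(self, home):
        # Swap ourselves in as the new tail of the queue.
        prev, home.tail = home.tail, self
        if prev is None:
            self.holds_lock = True  # queue was empty: lock granted at once
        else:
            prev.next = self        # link behind predecessor, wait for handoff
        return self.holds_lock

    def release(self, home):
        assert self.holds_lock
        self.holds_lock = False
        succ, self.next = self.next, None
        if succ is not None:
            succ.holds_lock = True  # direct handoff, nobody polls the home
        elif home.tail is self:
            home.tail = None        # no waiters: the lock becomes free
        return succ
```

Each waiter learns that it owns the lock from its predecessor's handoff, which is why no core needs to spin on a shared location that would otherwise require write invalidations.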

Outline
• Motivation
• Background
• DeNovoND Design
• DeNovoND Implementation
• Evaluation
• Conclusion and Future Work

Access Signatures
• Simple and small hardware Bloom filter per core
  – Track accesses with “atomic” effects only
  – Only 256 bits suffice
• Operations on Bloom filter
  – On write: insert address
  – On read: query filter for address for self-invalidation
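A 256-bit Bloom filter with the two operations listed above can be sketched as follows. The slides give only the size; the number of hash functions (two here), the hash functions themselves, and the `union` and `reset` helpers are assumptions made for illustration.

```python
class AccessSignature:
    """Toy 256-bit Bloom filter over word addresses."""

    BITS = 256

    def __init__(self):
        self.bits = 0               # the 256-bit vector, stored as an int

    def _positions(self, addr):
        # Two illustrative hash functions (assumed, not from the slides).
        word = addr >> 2
        h1 = word % self.BITS
        h2 = ((word * 2654435761) >> 8) % self.BITS
        return (h1, h2)

    def insert(self, addr):
        # On an atomic write: set the bits selected by each hash.
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def may_contain(self, addr):
        # On an atomic read: True means "possibly written, self-invalidate".
        # False positives are possible; false negatives are not.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))

    def union(self, other):
        # Merge a signature received from another core.
        self.bits |= other.bits

    def reset(self):
        self.bits = 0
```

A false positive only costs an unnecessary self-invalidation and refetch, while the absence of false negatives is what keeps a stale copy from being read, which is why a small filter is an acceptable trade-off in this setting.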

Example Run
[Figure: X and Y are in DeNovo regions; Z and W are in atomic DeNovo regions. Core 1 and Core 2 issue LD and ST operations to Z and W. Events shown between the two L1s and the shared L2: read miss, Registration, Ack, lock transfer, self-invalidate( ), and reset filter. Word states shown: Registered (R), Valid (V), Invalid (I).]

Optimization to reduce self-invalidation
• Loads that need not self-invalidate:
  1. Loads in Registered state
  2. “Touched-atomic” bit
    – Set on first atomic load
    – Subsequent loads don’t self-invalidate
• More in the paper
[Figure: X and Y in DeNovo regions, Z and W in atomic DeNovo regions; an ST followed by several LDs, with self-invalidate( )]
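The two cases above can be expressed as a small decision function. This is a sketch under assumptions: the slides do not say when the "touched-atomic" bit is cleared, so clearing it at the start of a critical section is a guess, and `Word`, `atomic_load`, and `start_critical_section` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    state: str = "I"        # "I" Invalid, "V" Valid, "R" Registered
    touched: bool = False   # the "touched-atomic" bit


def atomic_load(word, in_signature):
    """Return "hit" or "miss" for a load inside a critical section."""
    if word.state == "R":
        return "hit"        # case 1: Registered words are up to date
    if word.touched and word.state == "V":
        return "hit"        # case 2: already checked by an earlier load
    word.touched = True     # first atomic load of this word
    if word.state == "V" and not in_signature:
        return "hit"        # the signature says nobody wrote it
    word.state = "V"        # stale or absent: self-invalidate and refetch
    return "miss"


def start_critical_section(words):
    # Assumption: the bits are cleared when a new critical section begins.
    for word in words:
        word.touched = False
```

With the bit in place, only the first load of a word in a critical section pays for the signature check and possible refetch; later loads of the same word hit.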

Overheads
• Hardware Bloom filter
  – 256 bits per core
• Storage overhead
  – One additional state, but no storage overhead (2 bits)
  – “Touched-atomic” bit per word in L1
• Communication overhead
  – Bloom filter piggybacked on lock transfer message
  – Writeback messages for locks
    • Lock writebacks carry more info

Evaluation Methodology
• Simulator: Simics + GEMS + Garnet
• System parameters
  – 16 in-order cores
• Workloads
  – SPLASH-2, PARSEC and STAMP
  – Unchanged except region/effect and self-invalidation
• Protocols
  – MESI and DeNovoND
  – With idealized locks and realistic locks

MESI vs. DeNovoND: Idealized lock
[Figure: per-application comparison for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
• DeNovoND performs comparably to MESI for all apps
  – For both DIL-INF and DIL-256

MESI vs. DeNovoND: Realistic lock
[Figure: per-application comparison for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
• pthread lock vs. distributed queue-based lock
• DeNovoND performs comparably to or better than MESI

Network Traffic (Realistic lock)
[Figure: per-application network traffic for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]
• DeNovoND has 33% less traffic than MESI (67% max)
  – No invalidation traffic
  – Reduced load misses due to lack of false sharing

Conclusions and Future Work
• DeNovoND: efficient HW support for non-determinism
  – Minimal additional HW for safe non-determinism
  – Comparable performance to MESI
  – 30% lower network traffic than MESI
  – PLUS all advantages of DeNovo for deterministic codes
• Future work: broaden the application space further
  – Pipeline parallelism, “lock-free” data structures, …