ECE 1147, Parallel Computation
Oct. 30, 2006
Implementation and Performance of Munin (Distributed Shared Memory System)
Dongying Li (Original Authors: J. B. Carter, et al.)
Department of Electrical and Computer Engineering, University of Toronto

Distributed Shared Memory
• Shared address space spanning the processors of a distributed memory multiprocessor
[Figure: proc 1, proc 2, proc 3; X=0]

Distributed Shared Memory
[Figure: shared memory built over a network of nodes; mem 0, mem 1, mem 2, ..., mem N attached to proc 0, proc 1, proc 2, ..., proc N]

Distributed Shared Memory
• Design objectives
  – Performance comparable to shared memory programs
  – No significant deviation from the shared memory programming model
  – Low communication and message passing overhead

Munin System
• Characteristic features
  – Software release consistency
  – Multiple consistency protocols
• Same interface as the shared memory programming model
  – Threads, synchronization, data sharing, etc.
  – Deviations
    • All shared variables are annotated with their access pattern
    • Synchronization is explicitly visible to the runtime system (important for release consistency!)

Contents
• Basic concepts
  – Shared object
  – Software release consistency
  – Multiple consistency protocols
• Software implementation
  – Prototype overview
  – Execution process
  – Advanced programming features
  – Data object directory and delayed update queue
  – Synchronization
• Performance
• Overview of other DSM systems
• Conclusion

Basic Concepts

Shared Object
[Figure: 8-kilobyte object; variables x, y]

Software Release Consistency
• Sequential consistency
  – All processors observe the same order of operations
  – That order must correspond to some serial order
  – The only ordering constraint is that the reads/writes of a given processor (e.g. P1) appear in its own program order; there is no restriction on the relative ordering between processors
• Synchronous read/write
  – Writes must be propagated before moving on to the next operation

Software Release Consistency
• A special weak consistency protocol
• Reduces message passing overhead
• Two categories of shared variable operations
  – Ordinary access
    • Read
    • Write
  – Synchronization access (lock, semaphore, barrier)
    • Acquire
    • Release

Software Release Consistency
• Before an ordinary access (read, write) is allowed, all previous acquires must have been performed
• Before a release is allowed, all previous ordinary accesses must have been performed
• Before an acquire is allowed, all previous releases must have been performed
• Before a release is allowed, all previous acquires must have been performed
• In short: the results of writes made before a release are propagated before the next processor acquires the released lock

Release Consistency
• Writes are propagated at release

Multiple Consistency Protocols
• No single consistency protocol suits every parallel program
• Shared variables are accessed in different ways within a single program
• A variable's access pattern can change during execution
• Multiple protocols allow each shared variable's protocol to be tuned to its access pattern

Multiple Consistency Protocols
• High-level sharing pattern annotation
  – Specified in the shared variable declaration
  – Corresponds to a combination of low-level protocol parameters
• Low-level protocol parameter
  – Specified in the shared variable directory
  – Controls one specific aspect of the protocol

Protocol Parameters
• I: Propagate an invalidation or an update after a modification?
• R: Replicas allowed on other nodes?
• D: Delayed operations (update, invalidation) allowed?
• FO: Fixed owner (no writes at other nodes)?
• M: Multiple writers allowed?
• S: Stable sharing pattern (accessed by a fixed set of threads)?
• FL: Flush changes to the owner and invalidate the local copy?
• W: Writable?

Sharing annotations
• Read-only
  – Simplest pattern: once initialized, never modified
  – Suitable for constants, etc.
• Migratory
  – Only one thread accesses the object at any given time
  – Suitable for variables accessed only inside a critical section
• Write-shared
  – Can be written concurrently by multiple threads
  – Different threads update different words of the variable
• Producer-consumer
  – Written by only one thread and read by others
  – Replicate the object and update the copies rather than invalidate them

Sharing annotations
• Example: producer-consumer
    /* some number of timesteps/iterations */
    for (t = 0; t < timesteps; t++) {
        /* interior points only, so every neighbour index stays in bounds */
        for (i = 1; i < n - 1; i++)
            for (j = 1; j < n - 1; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
        for (i = 1; i < n - 1; i++)
            for (j = 1; j < n - 1; j++)
                grid[i][j] = temp[i][j];
    }

Sharing annotations
• Reduction
  – Accessed by fetch-and-op operations (read, write, then release)
  – Example: min(), a++
• Result
  – Phase 1: multiple writers allowed
  – Phase 2: one thread (the one collecting the result) accesses the object exclusively
• Conventional
  – Conventional consistency protocol for shared variables
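The reduction pattern (a read followed by a write, performed as one indivisible step) can be illustrated on a single machine. The sketch below uses C11 atomics as a stand-in; it is not how Munin implements reduction objects across nodes, and `global_min`, `counter`, `fetch_and_min`, and `fetch_and_increment` are hypothetical names chosen for the min() and a++ examples.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical reduction variables; names are illustrative only. */
static atomic_int global_min = 1000000;
static atomic_int counter = 0;

/* fetch-and-min: atomically store min(*obj, v) and return the value
 * that was observed before the operation. */
static int fetch_and_min(atomic_int *obj, int v)
{
    assert(obj != NULL);
    int old = atomic_load(obj);
    /* Retry until the stored value is already <= v or our swap succeeds.
     * On failure, compare_exchange refreshes old with the current value. */
    while (v < old && !atomic_compare_exchange_weak(obj, &old, v)) {
        /* retry with the refreshed old */
    }
    return old;
}

/* fetch-and-add by one: the "a++" case. Returns the previous value. */
static int fetch_and_increment(atomic_int *obj)
{
    assert(obj != NULL);
    return atomic_fetch_add(obj, 1);
}
```

Each call behaves as if the read and the write happened inside one critical section, which is the property the annotation relies on.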

Sharing annotations
[Figure: access timeline with w(x), r(x), w(x)]

Sharing annotations
Protocol parameters for each sharing annotation (parameter columns: I R D FO M S FL W)
Read-only: N Y - - - N
Migratory: Y N - N Y
Write-shared: N Y Y N N Y
Producer-consumer: N Y Y N Y
Reduction: N Y N - N Y
Result: N Y Y - Y Y
Conventional: Y Y N N N - N Y
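One way to picture how an annotation maps onto the eight protocol parameters is a pair of bit masks, one for parameters answered Y and one for parameters marked "-". This is a minimal sketch, assuming an arbitrary bit layout; the type and constant names are hypothetical, and only the Conventional row (Y Y N N N - N Y) is encoded because it is the only row quoted here in full.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per protocol parameter; bit positions are an arbitrary choice. */
enum munin_param {
    P_I  = 1 << 0,  /* I:  invalidate or update? */
    P_R  = 1 << 1,  /* R:  replicas allowed? */
    P_D  = 1 << 2,  /* D:  delayed operations allowed? */
    P_FO = 1 << 3,  /* FO: fixed owner? */
    P_M  = 1 << 4,  /* M:  multiple writers allowed? */
    P_S  = 1 << 5,  /* S:  stable sharing pattern? */
    P_FL = 1 << 6,  /* FL: flush changes to owner? */
    P_W  = 1 << 7   /* W:  writable? */
};

struct protocol {
    unsigned yes;        /* parameters answered Y */
    unsigned dont_care;  /* parameters marked "-" */
};

/* Conventional row: I=Y R=Y D=N FO=N M=N S=- FL=N W=Y */
static const struct protocol CONVENTIONAL = { P_I | P_R | P_W, P_S };

/* True if the parameter is answered Y for this protocol. */
static bool param_is_set(const struct protocol *p, enum munin_param which)
{
    assert((p->dont_care & which) == 0);  /* "-" entries have no Y/N answer */
    return (p->yes & which) != 0;
}
```

Keeping the "-" entries in a separate mask avoids silently treating "not applicable" as N.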

Software Implementation

Prototype Overview
• A simple preprocessor that converts annotations to a suitable format
• A linker that creates the shared memory segment
• Library routines linked into the program
• Operating system support for page fault handling and page table manipulation

Execution Process
• Compiling
[Figure: sharing annotations -> Munin preprocessor -> auxiliary files -> linker -> shared data segment and shared data description table]

Execution Process
• Initialization
[Figure: P1 with the Munin root thread, User_init(), and the user root thread; code copy and data segment going to P2, ..., Pn, each with a Munin worker thread]

Execution Process
• Synchronization
[Figure: a synchronization operation from a user thread, handled by the Munin root thread and the Munin worker threads on P1, P2, ..., Pn]

Advanced Programming Features
• Associate data and synchronization
[Figure: message timelines with the events rel(m), acq(m), w(x), r(x)]

Advanced Programming Features
• PhaseChange()
  – Changes the producer-consumer relationship
  – Example: adaptive mesh SOR
• ChangeAnnotation()
  – Changes the access pattern during execution
• Invalidate()
• Flush()
• SingleObject()
• PreAcquire()

Data Object Directory
• Start address and size
• Protocol parameters
• Object state (valid, writable, invalid)
• Copyset (which remote nodes hold copies)
• Synchq (the corresponding synchronization object)
• Probable owner
• Home node
• Access control semaphore
• Links
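The directory fields listed above can be sketched as a C record with a lookup by address. This is an illustration only: the field types, the linked-list layout, and the names `dir_entry` and `dir_lookup` are assumptions, not Munin's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum obj_state { OBJ_INVALID, OBJ_VALID, OBJ_WRITABLE };

/* Hypothetical layout of one directory entry; types are assumptions. */
struct dir_entry {
    uintptr_t start;          /* start address of the object */
    size_t size;              /* size in bytes */
    unsigned params;          /* protocol parameter bits */
    enum obj_state state;     /* valid, writable, or invalid */
    uint32_t copyset;         /* bit i set: node i holds a copy */
    int synchq;               /* id of the associated sync object, -1 if none */
    int probable_owner;       /* best guess at the current owner node */
    int home_node;            /* node where the object was created */
    int access_sem;           /* placeholder for the access control semaphore */
    struct dir_entry *next;   /* link to the next entry */
};

/* Find the entry whose address range contains addr, or NULL. */
static struct dir_entry *dir_lookup(struct dir_entry *head, uintptr_t addr)
{
    for (struct dir_entry *e = head; e != NULL; e = e->next) {
        assert(e->size > 0);
        if (addr >= e->start && addr - e->start < e->size)
            return e;
    }
    return NULL;
}
```

A fault handler would typically perform such a lookup with the faulting address and then act according to the entry's state and protocol parameters.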

Delayed Update Queue
[Figure: timeline with acq(m), w(x), w(y), rel(m), and a queue holding x and y]
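The delayed update queue can be pictured as a list of modified variables that is drained at a release, so that repeated writes to the same variable cost one update. The sketch below is a single-process simulation under that assumption: `local_copy`, `remote_copy`, `shared_write`, and `release_lock` are hypothetical names, and copying into `remote_copy` stands in for sending an update message.

```c
#include <assert.h>
#include <stddef.h>

#define DUQ_CAP 16
#define NVARS 4

/* Two simulated nodes, each with its own copy of the shared variables. */
static int local_copy[NVARS];
static int remote_copy[NVARS];

static int duq[DUQ_CAP];   /* indices of variables with pending updates */
static size_t duq_len;

/* Write locally and remember that the variable must be propagated later. */
static void shared_write(int var, int value)
{
    assert(var >= 0 && var < NVARS);
    local_copy[var] = value;
    for (size_t i = 0; i < duq_len; i++)
        if (duq[i] == var)
            return;               /* already queued: writes are merged */
    assert(duq_len < DUQ_CAP);
    duq[duq_len++] = var;
}

/* Release: flush every queued update, then empty the queue.
 * Returns the number of simulated update messages. */
static size_t release_lock(void)
{
    size_t sent = duq_len;
    for (size_t i = 0; i < duq_len; i++)
        remote_copy[duq[i]] = local_copy[duq[i]];
    duq_len = 0;
    return sent;
}
```

Two writes to x followed by one write to y produce two updates at the release, not three, and the remote copy sees nothing before the release.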

Multiple Writer Handling
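Only the title of this slide survives. One common way for a software DSM to merge concurrent writes to different words of the same object is to keep an unmodified copy (a "twin") and send only the words that differ from it (a "diff"). The sketch below illustrates that general idea and is not taken from the slide; `make_diff`, `apply_diff`, and the fixed object size `WORDS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS 8   /* object size in words; arbitrary for the example */

struct diff_entry {
    size_t index;  /* which word changed */
    int value;     /* its new value */
};

/* Compare the modified copy with its twin and record the changed words.
 * out must have room for WORDS entries. Returns the number recorded. */
static size_t make_diff(const int *twin, const int *copy, struct diff_entry *out)
{
    size_t n = 0;
    for (size_t i = 0; i < WORDS; i++) {
        if (copy[i] != twin[i]) {
            out[n].index = i;
            out[n].value = copy[i];
            n++;
        }
    }
    return n;
}

/* Merge a diff into another copy of the object. */
static void apply_diff(int *target, const struct diff_entry *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(d[i].index < WORDS);
        target[d[i].index] = d[i].value;
    }
}
```

Because each writer's diff touches only the words it changed, diffs from writers that update different words can be applied in any order without overwriting each other.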

Synchronization
• Queue-based synchronization
• Request, reply, and lock-forwarding mechanism
• CreateLock(), AcquireLock(), ReleaseLock(), CreateBarrier(), WaitAtBarrier()
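The request, reply, and forward steps of a queue-based lock can be simulated in one process. This is a minimal sketch, assuming a FIFO of waiting node ids; the `sim_` functions are hypothetical and are not the Munin routines named above, and message passing is reduced to return values.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8
#define NO_OWNER (-1)

struct sim_lock {
    int owner;                 /* node holding the lock, or NO_OWNER */
    int waiters[MAX_WAITERS];  /* FIFO ring buffer of waiting node ids */
    size_t head, count;
};

static void sim_create(struct sim_lock *l)
{
    l->owner = NO_OWNER;
    l->head = 0;
    l->count = 0;
}

/* Request: returns 1 if the lock is granted at once (reply),
 * 0 if the requester is appended to the queue. */
static int sim_acquire(struct sim_lock *l, int node)
{
    if (l->owner == NO_OWNER) {
        l->owner = node;
        return 1;
    }
    assert(l->count < MAX_WAITERS);
    l->waiters[(l->head + l->count) % MAX_WAITERS] = node;
    l->count++;
    return 0;
}

/* Release: forward the lock to the first waiter, if any.
 * Returns the new owner, or NO_OWNER when nobody is waiting. */
static int sim_release(struct sim_lock *l, int node)
{
    assert(l->owner == node);   /* only the holder may release */
    if (l->count == 0) {
        l->owner = NO_OWNER;
        return NO_OWNER;
    }
    l->owner = l->waiters[l->head];
    l->head = (l->head + 1) % MAX_WAITERS;
    l->count--;
    return l->owner;
}
```

Forwarding on release hands the lock directly to the next waiter, so a waiting node does not have to poll or re-request.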

Performance

Matrix Multiply

Matrix Multiply Optimized

SOR

Effect of Multiple Protocols
Protocol        Matrix Multiply   SOR
Multiple        72.41             27.64
Write-shared    75.59             64.48
Conventional    75.85             67.64

Performance Problem with Munin
• Note: poor performance for the task-queue model (TSP-Q, quicksort, etc.)
• E.g. speedup with MPI for TSP (16 procs): code II 8.9 13.4
• Speedup with Munin: code II 6.0 8.9
• Major overhead: time threads spend waiting at the lock that protects the work queue, caused by transferring the whole work queue between threads

Overview of Other DSM Systems
• Clouds: per-segment (object) based consistency protocol
• Mirage: per-page based
• Orca: reliable ordered broadcast protocol
• Amber: user responsible for the data distribution among processors
• Linda: shared variables in a tuple space; atomic operations: insertion, removal, reading
• Midway: uses entry consistency (a weaker consistency than release consistency)
• DASH: hardware DSM

Conclusion
• Objective: an efficient DSM system with a programming model similar to shared memory programming and low message passing overhead
• Special features: multiple protocols, software release consistency
• Implementation: synchronization realized by the Munin root thread and Munin worker threads

Thank you