Distributed Shared Memory (part 1)

Distributed Shared Memory (DSM)
[Figure: processors proc 0, proc 1, proc 2, ..., proc N, each with its own memory mem 0, mem 1, mem 2, ..., mem N, connected by a network and presented with one shared memory.]

Shared memory programming
• Standard
  – pthread
• Synchronizations
  – Barriers
  – Locks
  – Semaphores

Sequential SOR
    /* iterations (name assumed here): number of timesteps to run */
    for (t = 0; t < iterations; t++) {
        /* interior points only: rows and columns 1..n-1 */
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                grid[i][j] = temp[i][j];
    }

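Parallel SOR with Barriers (1 of 2)
    void* sor (void* arg) {
        int slice = (int)(intptr_t)arg;      /* thread index 0..p-1; needs <stdint.h> */
        int from = (slice * (n-1))/p + 1;    /* first row of this slice */
        int to = ((slice+1) * (n-1))/p + 1;  /* one past the last row of this slice */
        for some number of iterations { … }
    }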

Parallel SOR with Barriers (2 of 2)
    for (i=from; i<to; i++)
        for (j=1; j<n; j++)
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                 grid[i][j-1] + grid[i][j+1]);
    barrier();
    for (i=from; i<to; i++)
        for (j=1; j<n; j++)
            grid[i][j] = temp[i][j];
    barrier();

Differences between SMP and Software DSM
• Delay: higher latency shifts tradeoffs, such as block size
• Software => traps: read/write misses are detected by traps, which makes them costly
• Goals of caches: multiprocessor = performance, distributed system = transparency
• Bus vs. long networks: SMP protocols rely on the bus for serialization and broadcast

Consequent differences in protocols and applications
• Bigger block size
  – Cost amortization, higher hit ratio for larger blocks?
  – Reduced overhead
• But therefore...
  – Migration vs. replication
  – False sharing increases
• DSM protocol more complex: must handle lost, corrupted, and out-of-order packets
• The above, coupled with the cost of traps, => SDSM consistency cost is much higher!

Results of high consistency costs
• Manage sharing more carefully
• Align data to page boundaries
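Aligning data to page boundaries can be done at allocation time. The following is a minimal sketch, assuming a POSIX system where `sysconf(_SC_PAGESIZE)` and `posix_memalign` are available; `alloc_page_aligned` is a hypothetical helper name, not part of any DSM system.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns a block that starts on a page boundary and spans whole pages,
   so that no unrelated data shares its first or last page.
   Returns NULL on failure. Release the block with free(). */
void *alloc_page_aligned(size_t nbytes) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;
    size_t psz = (size_t)page;
    /* Round the request up to a whole number of pages (at least one). */
    size_t rounded = (nbytes + psz - 1) / psz * psz;
    if (rounded == 0)
        rounded = psz;
    void *ptr = NULL;
    if (posix_memalign(&ptr, psz, rounded) != 0)
        return NULL;
    return ptr;
}
```

Giving each processor's slice of a shared array its own `alloc_page_aligned` block is one way to keep two writers off the same page, at the cost of some wasted memory.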

Consistency Models
• Sequential Consistency
  – All processors observe the same order of operations.
  – That order must correspond to some serial order.
  – The only ordering constraint is that the reads/writes of each processor (e.g. P1) appear in that processor's own program order; there are no restrictions on the relative ordering between processors.
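A small enumeration makes the definition concrete. The sketch below uses a standard two-processor test that is not on the slides: P1 runs `x = 1; r1 = y` and P2 runs `y = 1; r2 = x`, with `x` and `y` initially 0. It tries every interleaving that keeps each processor's program order and records which `(r1, r2)` results occur. The names `explore`, `seen`, and `enumerate_sc_outcomes` are hypothetical.

```c
#include <stdbool.h>

/* seen[r1][r2] becomes true if some sequentially consistent execution
   ends with those register values. */
static bool seen[2][2];

/* pc1 and pc2 count how many of each processor's two operations have run. */
static void explore(int pc1, int pc2, int x, int y, int r1, int r2) {
    if (pc1 == 2 && pc2 == 2) {
        seen[r1][r2] = true;
        return;
    }
    /* Next step taken by P1 (program order: x = 1, then r1 = y). */
    if (pc1 == 0)
        explore(1, pc2, 1, y, r1, r2);
    else if (pc1 == 1)
        explore(2, pc2, x, y, y, r2);
    /* Next step taken by P2 (program order: y = 1, then r2 = x). */
    if (pc2 == 0)
        explore(pc1, 1, x, 1, r1, r2);
    else if (pc2 == 1)
        explore(pc1, 2, x, y, r1, x);
}

void enumerate_sc_outcomes(void) {
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            seen[a][b] = false;
    explore(0, 0, 0, 0, 0, 0);
}
```

After `enumerate_sc_outcomes()`, `seen[0][0]` stays false: no serial order lets both reads return 0, while the other three outcomes are all reachable because the relative order between the two processors is unconstrained.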

Common consistency protocols
• Write update
  – Multicast update to all replicas
• Write invalidate
  – Invalidate cached copies in p2, p3
  – Cache miss if p2/p3 access X
  – Valid data from other cache

Conventional Implementation
• As proposed by Li & Hudak, TOCS '86.
• Use virtual memory to implement sharing.
• Shared memory divided up by virtual memory pages.
• Use single-writer, multiple-reader write-invalidate coherence protocol.
• Keep pages in one of three states:
  – invalid, read-only, read-write
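The three page states and the single-writer, multiple-reader rule can be sketched as a small state machine for one page. This is a simulation only: it leaves out the virtual memory traps and the network messages a real system would use, and `NPROC`, `dsm_init`, `dsm_read`, and `dsm_write` are hypothetical names.

```c
#include <assert.h>

#define NPROC 4   /* assumed number of processors */

typedef enum { INVALID, READ_ONLY, READ_WRITE } page_state;

/* state[p] is processor p's access right to the one simulated page. */
static page_state state[NPROC];

/* The initial owner starts with the only (writable) copy. */
void dsm_init(int owner) {
    assert(owner >= 0 && owner < NPROC);
    for (int q = 0; q < NPROC; q++)
        state[q] = INVALID;
    state[owner] = READ_WRITE;
}

/* Read by p. Returns 0 on a hit, 1 if the access faulted. */
int dsm_read(int p) {
    assert(p >= 0 && p < NPROC);
    if (state[p] != INVALID)
        return 0;                       /* read hit on a local copy */
    /* Read fault: a writer, if any, is downgraded, then the page is
       replicated at p. */
    for (int q = 0; q < NPROC; q++)
        if (state[q] == READ_WRITE)
            state[q] = READ_ONLY;
    state[p] = READ_ONLY;
    return 1;
}

/* Write by p. Returns 0 on a hit, 1 if the access faulted. */
int dsm_write(int p) {
    assert(p >= 0 && p < NPROC);
    if (state[p] == READ_WRITE)
        return 0;                       /* write hit */
    /* Write fault (from invalid or read-only): every other copy is
       invalidated so that p becomes the single writer. */
    for (int q = 0; q < NPROC; q++)
        if (q != p)
            state[q] = INVALID;
    state[p] = READ_WRITE;
    return 1;
}
```

For example, after `dsm_init(0)`, a `dsm_read(1)` faults and leaves processors 0 and 1 both read-only; a following `dsm_write(2)` faults and invalidates both copies.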

Example [figure: shared memory over proc 0, proc 1, proc 2, ..., proc N]

Example: Read Access Hit [figure: read; proc 0, proc 1, proc 2, ..., proc N]

Example: Write Access Hit [figure: write; proc 0, proc 1, proc 2, ..., proc N]

Example: Read Access Miss [figure: read; proc 0, proc 1, proc 2, ..., proc N]

Example: Read Fault [figure: read, fault; proc 0, proc 1, proc 2, ..., proc N]

Example: Replication on Read [figure: read; proc 0, proc 1, proc 2, ..., proc N]

Example: Write Access Miss [figure: write; proc 0, proc 1, proc 2, ..., proc N]

Example: Write Fault [figure: write, fault; proc 0, proc 1, proc 2, ..., proc N]

Example: Write Invalidation [figure: write; proc 0, proc 1, proc 2, ..., proc N]

Example: Write Access to Read-Only [figure: write; proc 0, proc 1, proc 2, ..., proc N]

Example: Write Fault [figure: write, fault; proc 0, proc 1, proc 2, ..., proc N]

Example: Write Invalidation [figure: write; proc 0, proc 1, proc 2, ..., proc N]

How to Remember Locations?
• Broadcast on miss (as in SMP).
• Static home.
• Dynamic home or owner.

Ownership and Owner Location
• Owner is the last writer.
• Owner maintains copyset.
• Every processor maintains a probable owner (not always the real owner).

Ownership Location
• Every read or write miss is sent to the (locally recorded) probable owner.
• If the receiver is the owner, it handles the request appropriately; otherwise it forwards the request to its own probable owner.

Ownership Modification
• If write miss, the new writer becomes owner, and all forwarders set probable owner to the requester.
• If read miss, set probable owner to the responding processor.

Example
• Initially, owner(page 0) = p0, and probable owner(page 0) = p0 everywhere.
• Write miss by p1: sends message to its probable owner (p0), handled there, new owner = p1, probable owner(0) on p0 = p1.
• Read miss by p2: sends message to probable owner (p0), forwarded to probable owner (p1), handled there, probable owner(0) on p2 becomes p1.
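The forwarding rules and the example above can be traced with a short simulation for a single page. This is a sketch under simplifying assumptions: requests are handled one at a time, the old owner is updated the same way as a forwarder, and the global `owner` variable stands in for each node knowing locally whether it is the owner. All names (`po_init`, `write_miss`, `read_miss`) are hypothetical.

```c
#include <assert.h>

#define NPROC 3   /* p0, p1, p2 as in the example */

static int owner;               /* true owner of the page */
static int prob_owner[NPROC];   /* each processor's guess */

void po_init(int first_owner) {
    assert(first_owner >= 0 && first_owner < NPROC);
    owner = first_owner;
    for (int q = 0; q < NPROC; q++)
        prob_owner[q] = first_owner;
}

/* Write miss by req. Returns the number of messages needed to reach
   the owner (1 = direct hit on the owner, more = forwarded). */
int write_miss(int req) {
    assert(req >= 0 && req < NPROC);
    if (req == owner)
        return 0;                    /* already the owner, nothing to locate */
    int hops = 0;
    int cur = prob_owner[req];
    for (;;) {
        hops++;
        int next = prob_owner[cur];
        prob_owner[cur] = req;       /* forwarders and old owner point at requester */
        if (cur == owner)
            break;
        cur = next;
    }
    owner = req;                     /* new writer becomes owner */
    prob_owner[req] = req;
    return hops;
}

/* Read miss by req. Ownership does not change; the requester records
   the responding processor as its probable owner. */
int read_miss(int req) {
    assert(req >= 0 && req < NPROC);
    if (req == owner)
        return 0;
    int hops = 0;
    int cur = prob_owner[req];
    for (;;) {
        hops++;
        if (cur == owner)
            break;
        cur = prob_owner[cur];
    }
    prob_owner[req] = cur;
    return hops;
}
```

Replaying the example: `po_init(0)`, then `write_miss(1)` takes one message and sets `prob_owner[0]` to 1; `read_miss(2)` takes two messages (p0, then p1) and sets `prob_owner[2]` to 1, so a later miss by p2 goes straight to p1.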

Implement synchronizations
• Use messages to implement synchronizations

Barriers
• Designate one processor as barrier manager.
• When a process reaches a barrier, it sends an arrival message to the barrier manager and waits.
• When the barrier manager has received all arrival messages, it sends a departure message to all processes.
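The manager's side of this protocol is just a counter. The following is a minimal sketch that models message arrival as a function call rather than real network traffic; `NPROC`, `barrier_mgr`, and `barrier_arrive` are hypothetical names.

```c
#define NPROC 4   /* assumed number of participating processes */

typedef struct {
    int arrived;   /* arrival messages received in the current episode */
    int episode;   /* number of completed barrier episodes */
} barrier_mgr;

void barrier_mgr_init(barrier_mgr *m) {
    m->arrived = 0;
    m->episode = 0;
}

/* Called once per arrival message. Returns 1 when this arrival is the
   last one, i.e. when the manager would now send the departure message
   to every waiting process; returns 0 otherwise. */
int barrier_arrive(barrier_mgr *m) {
    m->arrived++;
    if (m->arrived < NPROC)
        return 0;
    m->arrived = 0;   /* reset so the barrier can be reused */
    m->episode++;
    return 1;
}
```

Each barrier episode costs one arrival and one departure message per process, all passing through the manager.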

Locks
• Designate one process as the lock manager for a particular lock.
• When a process wants to acquire a lock, it sends an acquire message to the manager and waits.
• The manager forwards the message to the last acquirer.
• If the lock is free, the last acquirer sends a lock grant message.
• If the lock is held, it holds on to the request until the lock is free, and then sends the lock grant message.
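The forwarding scheme above can be sketched as a simulation of one lock. It assumes that the last acquirer, not the manager, is the one that sends the grant, and that process 0 initially counts as the last acquirer with the lock free; messages are modeled as function calls. The names `lock_init`, `lock_acquire`, and `lock_release` are hypothetical.

```c
#include <assert.h>

#define NPROC 4
#define NONE (-1)

typedef enum { IDLE, WAITING, HOLDING } lock_state;

static int last_acquirer;          /* kept by the lock manager */
static lock_state lstate[NPROC];   /* per-process view of the lock */
static int waiter[NPROC];          /* request parked at this process, or NONE */

void lock_init(void) {
    last_acquirer = 0;             /* assumption: process 0 starts as last acquirer */
    for (int q = 0; q < NPROC; q++) {
        lstate[q] = IDLE;
        waiter[q] = NONE;
    }
}

/* Process p asks for the lock. Returns 1 if it is granted at once,
   0 if the request is parked at the previous acquirer. */
int lock_acquire(int p) {
    assert(p >= 0 && p < NPROC && lstate[p] == IDLE);
    int prev = last_acquirer;      /* manager forwards the request to prev */
    last_acquirer = p;
    if (lstate[prev] == IDLE) {    /* lock is free at prev: grant */
        lstate[p] = HOLDING;
        return 1;
    }
    waiter[prev] = p;              /* prev holds or awaits the lock: park request */
    lstate[p] = WAITING;
    return 0;
}

/* Process p releases the lock. Returns the process that is granted the
   lock next, or NONE if no request was parked at p. */
int lock_release(int p) {
    assert(p >= 0 && p < NPROC && lstate[p] == HOLDING);
    lstate[p] = IDLE;
    int next = waiter[p];
    waiter[p] = NONE;
    if (next != NONE)
        lstate[next] = HOLDING;    /* send the lock grant message */
    return next;
}
```

Because each request is parked at the process that acquired just before it, waiting requests form a distributed queue, and a release needs only one message to hand the lock to the next waiter.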

Problem: False Sharing
• Concurrent access to different data within the same consistency unit.
• With page as consistency unit, lots of opportunity for false sharing.
• Two flavors:
  – read-write
  – write-write

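Read-Write False Sharing [figure: variables x and y in the same consistency unit]

Read-Write False Sharing (Cont.) [figure: timeline with w(x), r(y), r(x), r(y), synch]

Write-Write False Sharing [figure: timeline with w(x), w(y), r(x), synch]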
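Whether two variables are exposed to false sharing comes down to whether they fall in the same consistency unit. The following is a minimal sketch that works on byte offsets within a page-aligned shared region; the 4096-byte `PAGE_SIZE` and the two example structs are assumptions made for illustration.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u   /* assumed size of the consistency unit */

/* Page number of a byte offset inside a page-aligned shared region. */
size_t page_of(size_t offset) {
    return offset / PAGE_SIZE;
}

/* Two distinct data items can be falsely shared when they are
   different bytes on the same page. */
int can_false_share(size_t off_a, size_t off_b) {
    return off_a != off_b && page_of(off_a) == page_of(off_b);
}

/* x and y sit next to each other: same page. */
struct packed_pair {
    int x;
    int y;
};

/* Padding pushes y onto the following page. */
struct padded_pair {
    int x;
    char pad[PAGE_SIZE - sizeof(int)];
    int y;
};
```

With `offsetof`, `can_false_share(offsetof(struct packed_pair, x), offsetof(struct packed_pair, y))` is true, while the same check on `struct padded_pair` is false: a writer of `x` and a writer of `y` no longer invalidate each other's page, at the price of almost a full page of padding.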

Summary
• Software shared memory on distributed memory hardware.
  – Uses virtual memory.
• Home migration to improve locality
  – Important because of high latencies.
• Sequential consistency suffers from false sharing.