
Distributed Shared Memory: A Survey of Issues and Algorithms, B. Nitzberg and V. Lo, University of Oregon

INTRODUCTION • Distributed shared memory is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space

Why bother with DSM? • Key idea is to build fast parallel computers that are – Cheaper than shared memory multiprocessor architectures – As convenient to use

Conventional parallel architecture: [diagram: several CPUs, each with its own cache, connected to a shared memory]

Today’s architecture • Clusters of workstations are much more cost effective – No need to develop complex bus and cache structures – Can use off-the-shelf networking hardware • Gigabit Ethernet • Myrinet (1.5 Gb/s) – Can quickly integrate newest microprocessors

Limitations of cluster approach • Communication within a cluster of workstations is through message passing – Much harder to program than concurrent access to a shared memory • Many big programs were written for shared memory architectures – Converting them to a message-passing model would be costly

Distributed shared memory: [diagram of the workstations' main memories] DSM = one shared global address space

Distributed shared memory • DSM makes a cluster of workstations look like a shared memory parallel computer – Easier to write new programs – Easier to port existing programs • Key problem is that DSM only provides the illusion of having a shared memory architecture – Data must still move back and forth

Basic approaches • Hardware implementations: – Use extensions of traditional hardware caching architecture • Operating system/library implementations: – Use virtual memory mechanisms • Compiler implementations – Compiler handles all shared accesses

Design Issues (I) 1. Structure and granularity – Big units are more efficient • Virtual memory pages – Can have false sharing whenever a page contains different variables that are accessed at the same time by different processors

False Sharing: [diagram: one processor accesses x, another accesses y] The page containing x and y will move back and forth between the main memories of the workstations.
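
A minimal sketch (not from the paper) of the access pattern behind false sharing; the thread routines, variable names, and iteration counts are illustrative, and on a real page-based DSM the two threads would run on different workstations:

/* Two threads update distinct variables that happen to live on the same
 * page.  Under a page-based DSM, every write can force the whole page to
 * move, even though neither thread ever touches the other's data. */
#include <pthread.h>

struct shared_page {
    long x;   /* updated only by thread 1 */
    long y;   /* updated only by thread 2 */
} page;       /* both fields land on the same virtual memory page */

static void *update_x(void *arg) {
    (void)arg;
    for (long i = 0; i < 1000000; i++)
        page.x++;     /* each write may invalidate or migrate the page */
    return NULL;
}

static void *update_y(void *arg) {
    (void)arg;
    for (long i = 0; i < 1000000; i++)
        page.y++;     /* same page, different variable */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, update_x, NULL);
    pthread_create(&t2, NULL, update_y, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}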

Design Issues (II) 1. Structure and granularity (cont'd) – Shared objects can also be • Objects from a distributed object-oriented system • Data types from an extant language

Design Issues (III) 2. Coherence semantics – Strict consistency is not possible – Various authors have proposed weaker consistency models • Cheaper to implement • Harder to use in a correct fashion

Design Issues (IV) 3. Scalability – Possibly very high but limited by • Central bottlenecks • Global knowledge operations and storage

Design Issues (V) 4. Heterogeneity – Possible but complex to implement

Portability Issues (not in paper) • Portability of programs – Some DSMs allow programs written for a multiprocessor architecture to run on a cluster of workstations without any modifications (dusty decks) – More efficient DSMs require more changes • Portability of the DSM itself

Implementation Issues (I) 1. Data Location and Access: • Keep data at a single centralized location • Let data migrate (better) but must have a way to locate it • Centralized server (bottleneck) • Have a "home" node associated with each piece of data
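
A hedged sketch of the "home" node idea: the home of a page is derived from its address, so any workstation can locate the manager of a piece of data without asking a central server. The page size, node count, and hashing rule below are illustrative assumptions, not taken from the paper:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096   /* assumed DSM granularity */
#define N_NODES   4      /* assumed cluster size    */

/* Every node can compute the home of a shared address locally. */
static int home_node(uintptr_t addr) {
    return (int)((addr / PAGE_SIZE) % N_NODES);
}

int main(void) {
    uintptr_t shared_addr = 0x10002000;   /* some shared virtual address */
    printf("page of %#lx is managed by node %d\n",
           (unsigned long)shared_addr, home_node(shared_addr));
    return 0;
}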

Implementation Issues (II) 1. Data Location and Access (cont'd): • Can either • Maintain a single copy of each piece of data • Replicate it on demand • Must either • Propagate updates to all replicas • Use an invalidation protocol

Invalidation protocol: [diagram] • Before the update: every copy holds X = 0 • At update time: the writer sets X = 5 and the other copies are marked INVALID

Main advantage • Locality of updates: – A page that is being modified has a high likelihood of being modified again • Invalidation mechanism minimizes consistency overhead – One single invalidation replaces many updates
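
A hedged sketch of one write-invalidate step for a single page; the names (page_t, send_invalidate, N_NODES) are illustrative, not any particular DSM's API. Before a node may write, every other replica is invalidated, so a single invalidation replaces the many updates the writer will then perform locally:

#include <stdbool.h>
#include <stdio.h>

#define N_NODES 4

typedef struct {
    int  owner;             /* node holding the writable copy       */
    bool copyset[N_NODES];  /* which nodes hold a read-only replica */
} page_t;

/* Stand-in for the DSM runtime's invalidation message. */
static void send_invalidate(int node, page_t *page) {
    (void)page;
    printf("copy on node %d is now INVALID\n", node);
}

static void acquire_write(page_t *page, int writer) {
    for (int n = 0; n < N_NODES; n++) {
        if (n != writer && page->copyset[n]) {
            send_invalidate(n, page);   /* remote copy becomes INVALID */
            page->copyset[n] = false;
        }
    }
    page->owner = writer;               /* writer now owns the page    */
    page->copyset[writer] = true;
}

int main(void) {
    page_t p = { .owner = 0, .copyset = { true, true, true, false } };
    acquire_write(&p, 3);   /* node 3 is about to set X = 5 */
    return 0;
}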

A realization: Munin • Developed at Rice University • Based on software objects (variables) • Used the processor's virtual memory hardware to detect accesses to the shared objects • Included several techniques for reducing consistency-related communication • Only ran on top of the V kernel

Munin main strengths • Excellent performance • Portability of programs – Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimum number of changes (dusty decks)

Munin main weakness • Very poor portability of Munin itself – Depended on some features of the V kernel • Not maintained since the late 80's

Consistency model • Munin uses software release consistency – Only requires the memory to be consistent at specific synchronization points

SW release consistency (I) • Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables – P(&mutex) and V(&mutex) – lock(&csect) and unlock(&csect) – acquire( ) and release( ) • Unprotected accesses can produce unpredictable results

SW release consistency (II) • SW release consistency will only guarantee correctness of operations performed within an acquire/release pair • No need to export the new values of shared variables until the release • Must guarantee that a workstation has received the most recent values of all shared variables when it completes an acquire

SW release consistency (III)

Process 1:
shared int x;
acquire( );
x = 1;
release( );      // export x = 1

Process 2:
shared int x;
acquire( );      // wait for new value of x
x++;
release( );      // export x = 2

SW release consistency (IV) • Must still decide how to release updated values – Munin uses eager release: • New values of shared variables are propagated at release time

SW release consistency (V): Eager release [diagram] Each release forwards the update to the two other processors.
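
A hedged sketch of eager release; the processor count and message primitive are illustrative assumptions, not Munin's actual interface. At release time the new values of the shared variables are pushed to the other processors before the release completes:

#include <stdio.h>

#define N_NODES 3

struct update { const char *var; int new_value; };

/* Stand-in for the runtime's point-to-point update message. */
static void send_update(int node, struct update u) {
    printf("push %s = %d to node %d\n", u.var, u.new_value, node);
}

/* Eager release: the releasing node forwards the update to the
 * two other processors before returning. */
static void release_eager(int self, struct update u) {
    for (int n = 0; n < N_NODES; n++)
        if (n != self)
            send_update(n, u);
}

int main(void) {
    struct update u = { "x", 1 };
    release_eager(0, u);   /* node 0 releases after writing x = 1 */
    return 0;
}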

Multiple write protocol • Designed to fight false sharing • Uses a copy-on-write mechanism • Whenever a process is granted access to write-shared data, the page containing these data is marked copy-on-write • The first attempt to modify the contents of the page results in the creation of an unmodified copy of the page (the twin)

Creating a twin (not in paper): [diagram]

Example (not in paper): [diagram] Before: x = 1, y = 2. The first write access creates the twin (x = 1, y = 2). After: x = 3, y = 2. Comparing the modified page with the twin shows that the new value of x is 3.
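
A hedged, single-process sketch of twinning and diffing that follows the example above; the page size and function names are illustrative, not Munin's implementation:

#include <stdio.h>
#include <string.h>

#define PAGE_WORDS 1024

static int page[PAGE_WORDS];   /* the write-shared page                  */
static int twin[PAGE_WORDS];   /* pristine copy taken on the first write */

/* First write to a copy-on-write page: save an unmodified copy (the twin). */
static void first_write_fault(void) {
    memcpy(twin, page, sizeof page);
}

/* At release time: compare the page with its twin, word by word,
 * and report only the words that changed. */
static void encode_diff(void) {
    for (int i = 0; i < PAGE_WORDS; i++)
        if (page[i] != twin[i])
            printf("word %d changed: %d -> %d\n", i, twin[i], page[i]);
}

int main(void) {
    page[0] = 1;  page[1] = 2;   /* before: x = 1, y = 2            */
    first_write_fault();         /* twin keeps x = 1, y = 2         */
    page[0] = 3;                 /* the write sets x = 3            */
    encode_diff();               /* only x is found to have changed */
    return 0;
}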

Other DSM Implementations (I) • Software release consistency with lazy release (Treadmarks) – Faster and designed to be portable • Sequentially-Consistent Software DSM (IVY): – Sends messages to other copies at each write – Much slower

Other DSM Implementations (II) • Entry consistency (Midway): – Requires each variable to be associated with a synchronization object (typically a lock) – Acquire/release operations on a given synchronization object only involve the variables associated with that object – Requires less data traffic
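
A hedged sketch of the entry-consistency idea using an ordinary mutex as the synchronization object; the guarded_int type is illustrative, not Midway's API. Acquiring the lock only has to bring the variables bound to that lock up to date:

#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;    /* the synchronization object          */
    int             value;   /* the one variable associated with it */
} guarded_int;

static guarded_int counter = { PTHREAD_MUTEX_INITIALIZER, 0 };

static void increment(guarded_int *g) {
    pthread_mutex_lock(&g->lock);     /* acquire: fetch only g->value   */
    g->value++;
    pthread_mutex_unlock(&g->lock);   /* release: publish only g->value */
}

int main(void) {
    increment(&counter);
    printf("counter = %d\n", counter.value);
    return 0;
}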

Other DSM Implementations (III) • Structured DSM Systems (Linda): – Offer the programmer a shared tuple space accessed through specific synchronized operations – Require a very different programming style
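
A toy, single-process illustration (not Linda itself) of the tuple-space programming style: work is exchanged through out()/in() operations on a shared space rather than through shared variables. Real Linda adds pattern matching, blocking, and distribution; this sketch only mimics the call pattern:

#include <stdio.h>
#include <string.h>

#define MAX_TUPLES 16

struct tuple { char tag[16]; int value; int used; };
static struct tuple space[MAX_TUPLES];   /* the shared tuple space */

/* Deposit a tuple into the space. */
static void out(const char *tag, int value) {
    for (int i = 0; i < MAX_TUPLES; i++)
        if (!space[i].used) {
            strncpy(space[i].tag, tag, sizeof space[i].tag - 1);
            space[i].value = value;
            space[i].used = 1;
            return;
        }
}

/* Withdraw a matching tuple; real Linda would block until one appears. */
static int in(const char *tag) {
    for (int i = 0; i < MAX_TUPLES; i++)
        if (space[i].used && strcmp(space[i].tag, tag) == 0) {
            space[i].used = 0;
            return space[i].value;
        }
    return -1;
}

int main(void) {
    out("task", 42);                          /* producer deposits work */
    printf("consumer got %d\n", in("task"));  /* consumer withdraws it  */
    return 0;
}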

TODAY'S IMPACT • Very low: – According to W. Zwaenepoel, the truth is that computer clusters are "only suitable for coarse-grained parallel computation" and this is "[a] fortiori true for DSM" – DSM competed with the OpenMP model, and the OpenMP model won