Improving the Efficiency of Fault-Tolerant Distributed Shared-Memory Algorithms

Improving the Efficiency of Fault-Tolerant Distributed Shared-Memory Algorithms
Eli Sadovnik and Steven Homberg
Second Annual MIT PRIMES Conference, May 19-20, 2012

Introduction
• Shared memory supports concurrent access
  – Read & write interface
• Memory models: single writer, multiple reader (SWMR) and multiple writer, multiple reader (MWMR)
  – Consistency is important
• Strong consistency provides useful semantics
• Abstraction for message-passing networks
  – Shared memory can be emulated
  – Difficult to do, but solutions exist
  – For example, Internet applications such as Dropbox

Our Research Project
THE RAMBO PROJECT
• Framework for emulating shared memory
  – Introduced by Lynch and Shvartsman, extended by Gilbert
  – Implements the MWMR model with strong consistency
  – Designed for dynamic distributed message-passing settings
OUR GOAL
• RAMBO is elegant but not always efficient
• Extend RAMBO with intelligent data management

Consistency & Atomicity
• There are many consistency models
• We are interested in atomicity
[Figure: three timelines of a write(8) over the initial value 0 with concurrent reads. Panel "Atomicity": the reads return 8. Panels labeled "Violation", "(Safety)", and "(Regularity)": read sequences read(8), read(0), read(8) and read(3), read(0), read(8).]

Emulating Shared Memory
[Diagram: a single server (Status: WORKING, Data: 5). User 2 is the writer; User 1 and User 3 are readers and each read 5.]

Weakness of the Centralized Approach
[Diagram: the single server has Status: FAILED and no data. User 2 is the writer; User 1 and User 3 (readers) each get an error.]

Replication in Distributed Setting
[Diagram: two server replicas, one FAILED and one WORKING, each labeled Data: 5. User 2 is the writer; User 1 and User 3 (readers) each read 5.]

The ABD Algorithm
Hagit Attiya, Amotz Bar-Noy, Danny Dolev
A SWMR algorithm
• Operation-level wait-freedom
  – Termination unaffected by concurrency
• Designed for a message-passing setting
  – Allows limited failures
  – Communication is reliable
  – Messages can be delayed

Quorum Systems and ABD
• ABD is a quorum-based algorithm
  – A quorum system is a collection of intersecting sets
• For example, a voting majority quorum system
• Data is replicated in a quorum system
  – Quorum system members are networked servers
• Guarantee of atomicity
  – Quorum intersection and read/write protocols
• Reads must write! (… sometimes, as we will see later)
  – A reader must write the latest data
  – Writer cannot be trusted to complete
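
The intersection property behind a majority quorum system can be checked directly on a small example. The sketch below is only an illustration: the server names and the helper functions `majority_quorums` and `all_intersect` are hypothetical and are not taken from ABD or RAMBO.

```python
from itertools import combinations


def majority_quorums(servers):
    """Return every subset of `servers` that contains a strict majority."""
    servers = sorted(servers)
    smallest = len(servers) // 2 + 1  # smallest size that is a majority
    return [
        frozenset(subset)
        for size in range(smallest, len(servers) + 1)
        for subset in combinations(servers, size)
    ]


def all_intersect(quorums):
    """Check the defining property: any two quorums share a member."""
    return all(a & b for a, b in combinations(quorums, 2))


# Hypothetical three-server deployment.
quorums = majority_quorums({"s1", "s2", "s3"})
print(len(quorums), all_intersect(quorums))
```

Because any two majorities overlap in at least one server, a reader that contacts one quorum always reaches a server that took part in the most recent completed write to another quorum.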

Phased Read/Write Protocols
[Diagram: WORKING servers grouped into quorums Q1 and Q2, holding 5; User 2 is the writer, User 1 and User 3 are readers.]
User 2 writes its data, a 5, to quorum Q1.

Phased Read/Write Protocols
[Diagram: WORKING servers grouped into quorums Q1 and Q2, now all holding 5; User 2 is the writer, User 1 and User 3 are readers.]
User 1 queries quorum Q2, sees that the latest data is a 5, and writes it back to the server that does not have the latest data.

Data Versions & Timestamps
[Diagram: WORKING servers in quorums Q1 and Q2; the versions (5, t=1) and (7, t=2) coexist across the servers and users, and User 3 (reader) still holds (5, t=1).]
Timestamps allow us to distinguish among different versions of the data.

Data Versions & Timestamps
[Diagram: the WORKING servers in quorums Q1 and Q2 and the readers User 1 and User 3 all hold (7, t=2).]
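
The two-phase protocol with timestamps from the preceding slides can be sketched in a few lines. This is a minimal single-process illustration under simplifying assumptions (no real network, no failures, one writer that picks increasing timestamps); the `Replica` class and the `write` and `read` functions are hypothetical names, not the ABD authors' code.

```python
class Replica:
    """One server holding a value tagged with a timestamp."""

    def __init__(self):
        self.value, self.ts = 0, 0  # initial value 0 at timestamp 0

    def get(self):
        return self.value, self.ts

    def put(self, value, ts):
        # Keep only the newest version; older updates are ignored.
        if ts > self.ts:
            self.value, self.ts = value, ts


def write(quorum, value, ts):
    """Writer: store the new version on every replica of a quorum."""
    for replica in quorum:
        replica.put(value, ts)


def read(quorum):
    """Reader: query a quorum, then write the latest version back."""
    # Phase 1 (query): pick the version with the highest timestamp.
    value, ts = max((r.get() for r in quorum), key=lambda pair: pair[1])
    # Phase 2 (propagate): bring the whole quorum up to date.
    for replica in quorum:
        replica.put(value, ts)
    return value


a, b, c = Replica(), Replica(), Replica()
q1, q2 = [a, b], [b, c]  # two intersecting quorums, sharing replica b
write(q1, 5, ts=1)
first = read(q2)
write(q1, 7, ts=2)
second = read(q2)
print(first, second, c.get())
```

The reader only ever contacts Q2, yet it sees each write made to Q1 because the two quorums share replica `b`; the write-back in phase 2 is what brings replica `c` up to date.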

Quorum Viability
[Diagram: of the servers in quorums Q1 and Q2, one is FAILED and one is WORKING with (7, t=2); User 1 and User 3 (readers) each get an error.]
A weakness of the ABD algorithm is that it depends on a quorum of servers always being viable. When no quorum is available, operations are blocked.

The RAMBO Framework (Reconfigurable Atomic Memory for Basic Objects)
Seth Gilbert, Nancy Lynch, Alexander Shvartsman

Quorum Reconfiguration
[Diagram: quorums Q1 and Q2; one old server is FAILED with no data, and the WORKING servers hold (7, t=2).]
RAMBO uses quorum reconfiguration to ensure service longevity. A new quorum system (a new set of servers) is installed to replace the old one, allowing progress in spite of failures.

Replica Transfer
[Diagram: quorums Q1 and Q2; one old server is FAILED with no data, and the value (7, t=2) is copied from the WORKING servers to the new ones.]
After a new set of servers is installed, these servers do not have any information. The replica information (copies of data) must be transferred to the new configuration.

Garbage Collection
[Diagram: quorums Q1 and Q2; one old server is FAILED with no data, and the WORKING servers hold (7, t=2).]
After information is transferred to the new servers, the old servers are phased out of use. This process is called 'garbage collection'. The mechanism for garbage collection has two phases and is analogous to read/write operations (introduced in the next slides).

Read/Write Operations: Multi-Configuration Access
[Diagram: quorums Q1 and Q2; one old server is FAILED with no data, the WORKING servers hold (7, t=2), and User 1 (reader) reads (7, t=2).]
What if reads and writes occur during reconfiguration? Concurrent operations contact all existing configurations to ensure the latest information is accessed.

Read/Write Operations: Garbage Collection
[Diagram: quorums Q1 and Q2; one old server is FAILED with no data, the WORKING servers hold (7, t=2), and User 1 (reader) reads (7, t=2).]
Old configurations need to be removed from use. Ongoing read/write operations use their existing configuration knowledge. New operations ignore the old configuration.

Research Questions
Q1: Can a reader (respectively writer) avoid contacting configurations that it learned have been marked as garbage collected?
Q2: When can a reader avoid its second phase, and can a reader propagate selectively?
Q3: Can we propagate to the most recent configuration only?

Concurrent Garbage Collection (Q1)
[Diagram: a read by User 1 shown as numbered steps 1 to 7 across quorums Q1 and Q2, whose WORKING servers hold a mix of (5, t=1), (7, t=2), and (0, t=0); the read returns 7.]
We believe that the garbage-collected configuration can in fact be ignored because the reader learns of the configuration's information regardless.

Improved Configuration Management (Q1)
• The authors of RAMBO conjecture that operations must contact all configurations that are discovered during the query (respectively propagate) phase.
• Communicating with configurations learned to be garbage collected mid-operation is unnecessary
  – Garbage-collected configurations are discovered mid-operation from another server
  – That server knows a tag at least as recent as any known in the old configurations
• IMPACT: improves operation liveness
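
The proposed rule can be sketched as a filter over the set of configurations an operation still plans to contact. Everything in this sketch is an assumption made for illustration: configurations are identified by integer indices, and each server reply carries the hypothetical fields `new_configs` (indices the server knows about) and `gc_below` (every index below it has been garbage collected). RAMBO's actual bookkeeping is richer than this.

```python
def remaining_configurations(pending, replies):
    """Return the configurations an operation still needs to contact.

    pending: set of configuration indices known when the phase started.
    replies: iterable of dicts with the hypothetical keys
        'new_configs' and 'gc_below' described above.
    """
    pending = set(pending)
    gc_below = 0
    for reply in replies:
        # Learn about newer configurations from the replying server.
        pending |= set(reply["new_configs"])
        # Remember the highest garbage-collection mark seen so far.
        gc_below = max(gc_below, reply["gc_below"])
    # Skip every configuration learned to be garbage collected.
    return {index for index in pending if index >= gc_below}


# The operation starts with configurations 1 and 2, then a server
# reports configuration 3 and says that everything below 2 is gone.
todo = remaining_configurations(
    {1, 2}, [{"new_configs": [2, 3], "gc_below": 2}]
)
print(sorted(todo))
```

Dropping configuration 1 means the operation no longer waits for a quorum of servers that may already have left the system, which is where the claimed liveness benefit would come from.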

Improved Bookkeeping (Q2)
[Diagram: User 1 (reader) queries quorums Q1 and Q2; the WORKING servers all reply with (7, t=2).]
After querying, the reader learns that a majority of nodes has the up-to-date information, thus making propagation needless.

Semi-Fast Read Operations (Q2)
• Read operations always propagate
  – Regardless of the actual replica dissemination
  – Redundant messages and slow operation
• The proposed solution
  – During the query phase, the reader records the latest timestamp of each server with which it communicated
  – The reader contacts only servers that are not up to date
  – Sometimes this allows omitting the propagation phase entirely ('semi-fast' read operations)
• IMPACT: improves operation latency and reduces communication costs
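
The bookkeeping described above can be sketched as a decision made at the end of the query phase. This is an illustrative sketch only: the function name `semi_fast_read`, the shape of `replies`, and the use of a plain majority size as the quorum threshold are assumptions, not part of RAMBO.

```python
def semi_fast_read(replies, quorum_size):
    """Decide what a reader does after its query phase.

    replies: dict mapping server name -> (value, timestamp), as recorded
        by the reader during the query phase.
    quorum_size: number of up-to-date servers that makes propagation
        unnecessary (assumed here to be a majority).
    Returns (value, stale_servers, skip_propagation).
    """
    value, latest = max(replies.values(), key=lambda pair: pair[1])
    # Servers whose recorded timestamp is older than the latest one.
    stale = sorted(s for s, (_, ts) in replies.items() if ts < latest)
    up_to_date = len(replies) - len(stale)
    # If a quorum already holds the latest version, skip phase two.
    return value, stale, up_to_date >= quorum_size


fast = semi_fast_read(
    {"s1": (7, 2), "s2": (7, 2), "s3": (5, 1)}, quorum_size=2
)
slow = semi_fast_read(
    {"s1": (7, 2), "s2": (5, 1), "s3": (5, 1)}, quorum_size=2
)
print(fast)
print(slow)
```

In the first call two of three servers already hold (7, t=2), so the read can return without a propagation phase. In the second call only `s2` and `s3` need to be contacted, rather than every server.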

Overly Extensive Propagation (Q3)
[Diagram: quorums Q1 and Q2; one old server is FAILED with no data, the WORKING servers hold (7, t=2), and User 1 (writer) writes (7, t=2).]
Currently, RAMBO both queries and propagates to all active configurations. In fact, just the query phase covering all active configurations is sufficient for atomicity.

Propagate to the Latest Configuration (Q3)
• We believe it is not necessary to propagate to any configuration but the last active configuration.
• Properties of configuration information
  • All configurations are totally ordered.
  • Configurations have a forward link.
  • Discovery is faster than reconfiguration.
  • Operations query all active configurations.
• IMPACT: reduces communication cost
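
The difference in communication cost can be illustrated by comparing which servers receive a propagation message under each policy. The sketch assumes configurations are identified by totally ordered integer indices and that one message is sent per distinct server; the function `propagation_targets` and the membership data are hypothetical.

```python
def propagation_targets(config_members, latest_only):
    """Return the distinct servers contacted in the propagation phase.

    config_members: dict mapping configuration index -> list of servers.
    latest_only: if True, propagate only to the highest-indexed
        (most recent) active configuration, as proposed for Q3.
    """
    ordered = sorted(config_members)  # indices are totally ordered
    chosen = ordered[-1:] if latest_only else ordered
    return sorted({s for index in chosen for s in config_members[index]})


# Two active configurations that share the server "c".
active = {1: ["a", "b", "c"], 2: ["c", "d", "e"]}
baseline = propagation_targets(active, latest_only=False)
proposed = propagation_targets(active, latest_only=True)
print(len(baseline), len(proposed))
```

The query phase is unchanged and still covers every active configuration; only the propagation phase shrinks, here from five servers to three.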

Summary
• Algorithmic optimizations
• Opportunistic benefits
  – A clear advantage when
    • Servers gossip, and
    • Configurations have members in common
• Changes are minimally intrusive
  – Modest increase in bookkeeping and the size of messages

Future Work
• Formal reasoning
  – Use the Input/Output Automata framework to demonstrate that the new changes preserve the consistency guarantees of RAMBO
• Simulation
  – Use the TEMPO toolkit to simulate RAMBO executions and build confidence in our proofs
• Empirical experiments
  – Augment the existing implementations of RAMBO and collect behavior data on PlanetLab

Special Thanks to:
The MIT PRIMES Program
Supervisor: Prof. Nancy Lynch
Mentor: Dr. Peter Musial