Dynamic Atomic Storage Without Consensus Alex Shraer Technion

Dynamic Atomic Storage Without Consensus
Alex Shraer (Technion)
Joint work with: Marcos K. Aguilera (MSR), Idit Keidar (Technion), Dahlia Malkhi (MSR)

The Goal
[Figure: clients accessing servers (processes)]
• Reliable replicated storage
• Using unreliable components
• Asynchrony: tolerate unpredictable network delays

Designing an Asynchronous Replicated System
• State machine replication (e.g., Paxos)
  – Any object
  – Impossible in asynchronous systems
• Atomic R/W Register [Attiya, Bar-Noy, Dolev 95]
  – Simple object: read(), write(v)
  – Possible in asynchronous system
  – Atomic (linearizable)
  – Liveness: if #failures < #servers/2 then every operation invoked on a correct server eventually completes.
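The majority-quorum idea behind the static atomic register can be sketched as a small single-process simulation, in the style of [Attiya, Bar-Noy, Dolev 95]. This is only an illustration under stated assumptions: the names `Server`, `Register`, and `responsive` are hypothetical, there is no real network or concurrency, and a server marked unresponsive simply stands in for one that is crashed or too slow to reply to the current operation.

```python
from dataclasses import dataclass

@dataclass
class Server:
    ts: tuple = (0, "")      # (counter, writer id), compared lexicographically
    value: object = None
    responsive: bool = True  # assumption: models "replies to this operation"

class Register:
    """Static majority-quorum register (simulation only, no real network)."""

    def __init__(self, n):
        self.servers = [Server() for _ in range(n)]

    def _majority(self):
        # Pick some majority of responsive servers; a real operation would
        # block (not raise) while no majority answers.
        alive = [s for s in self.servers if s.responsive]
        need = len(self.servers) // 2 + 1
        if len(alive) < need:
            raise RuntimeError("no majority available")
        return alive[:need]

    def _store(self, ts, value):
        # Install (ts, value) at a majority, never overwriting newer data.
        for s in self._majority():
            if ts > s.ts:
                s.ts, s.value = ts, value

    def write(self, writer, value):
        # Phase 1: learn the highest counter held by a majority.
        counter = max(s.ts for s in self._majority())[0]
        # Phase 2: store the value under a strictly larger timestamp.
        self._store((counter + 1, writer), value)
        return "OK"

    def read(self):
        # Phase 1: collect from a majority and keep the latest pair.
        latest = max(self._majority(), key=lambda s: s.ts)
        ts, value = latest.ts, latest.value
        # Phase 2: write back so later reads cannot return an older value.
        self._store(ts, value)
        return value
```

Because any two majorities of the same server set intersect, a read that contacts a different majority than the preceding write still meets at least one server holding the written value.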

Breaking the Minority Barrier
• Over a long period of time, #failures < #servers/2 is not good enough
• Reconfiguration!
  – Increasing resilience by changing the set of servers
  – Example: 3 failures out of 5
  [Figure: servers A, B, C, D, E]
• Semantics of Reconfigurable R/W register:
  – Atomic (linearizable)
  – Liveness: ?
Our first contribution: first "black box" definition (in terms of user interface)

Reconfigurable Register: User Interface
• read() (returns a value)
• write(value) (returns OK)
• reconfig(c) (returns OK)
  – c is a set of changes (relative to the current config.)
  – Each change is either (Add, pid) or (Remove, pid)
  – Example: c = {+C, +E, –D}
• Only processes that were successfully added can invoke ops
• Universe of processes (servers):
  – Unknown, unbounded, possibly infinite
  – At any given time, only a finite number has been added
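To make the change-set notation concrete, here is a minimal sketch of how a set of changes could be applied to a configuration. The function name `apply_changes`, the `("+", pid)` / `("-", pid)` encoding, and the starting configuration {A, B, D} are assumptions made for illustration; the talk only specifies that each change is an Add or a Remove of a process id.

```python
def apply_changes(config, changes):
    """Return the configuration obtained by applying a set of changes.

    config  -- frozenset of process ids (the current configuration)
    changes -- iterable of ("+", pid) or ("-", pid) pairs
    """
    added = {pid for op, pid in changes if op == "+"}
    removed = {pid for op, pid in changes if op == "-"}
    return frozenset((set(config) | added) - removed)

# Hypothetical starting configuration; c mirrors the slide's {+C, +E, -D}.
current = frozenset({"A", "B", "D"})
c = {("+", "C"), ("+", "E"), ("-", "D")}
print(sorted(apply_changes(current, c)))  # ['A', 'B', 'C', 'E']
```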

Definitions
• Current(t): servers in the system at time t, the “current configuration”
• AddPending(t): servers whose Add is pending at t
• RemovePending(t): servers whose Remove is pending at t
• Faulty(t): servers that have crashed by t
• pi is active in an execution if
  – During the execution, pi does not crash
  – Some process invokes reconfig adding pi
  – No process invokes reconfig removing pi

Dynamic System Liveness
• Static system: operations complete if #failures < #servers/2
• What should this be in a dynamic system?
• Try #1: for every t, a minority of Current(t) is in Faulty(t)
  – What if processes crash while others are removed?
  [Figure: servers A, B, C; reconfig({–A}) returns OK]
  – No operation is guaranteed to complete in the new configuration!
• Try #2: for every t, a minority of Current(t) is in Faulty(t) ∪ RemovePending(t)

Adding Servers
[Figure: timeline for servers A to G; reconfig({+F}) and reconfig({+G}) are invoked, with OK responses; time t0 is marked]
Q: At time t0, who can crash from {A, B, ..., G}?
A: A minority of {A, B, ..., E}, and in addition,
  – in this scenario G can crash
  – in a different scenario F can crash
• Simple condition: any 2 servers can fail (fewer than |Current(t)|/2)

Dynamic Service Liveness
If
• #reconfigs invoked in the execution is finite, and
• at every time t in the execution, fewer than |Current(t)|/2 processes out of Current(t) ∪ AddPending(t) are in Faulty(t) ∪ RemovePending(t)
Then:
• Eventually, every active process that was successfully added can invoke operations
• Every operation invoked by an active process eventually completes
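The per-time-t part of this condition can be written as a small predicate over four sets. This is a sketch for checking examples by hand; the function name and the representation of each set as a Python `set` of process ids are assumptions, and the "finitely many reconfigs" clause is a property of whole executions that is not modelled here.

```python
def liveness_condition_holds(current, add_pending, faulty, remove_pending):
    """Check the failure condition at a single time t.

    Counts the processes of Current(t) | AddPending(t) that are in
    Faulty(t) | RemovePending(t) and compares against |Current(t)| / 2.
    """
    considered = current | add_pending
    unavailable = (faulty | remove_pending) & considered
    return len(unavailable) < len(current) / 2

# Removing A from {A, B, C} while B has crashed: 2 is not fewer than 1.5.
print(liveness_condition_holds({"A", "B", "C"}, set(), {"B"}, {"A"}))  # False

# Current = {A..E} with F and G being added: any 2 servers may fail.
print(liveness_condition_holds(set("ABCDE"), {"F", "G"}, {"A", "G"}, set()))  # True
```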

Reconfigurable Solutions
Many previous solutions: all use consensus (or similar)
• State machine replication (Paxos)
  – Use state machine to agree on set of servers
• Virtual Synchrony based solutions
  – Membership service stronger than consensus (equivalent to P)
  – e.g., [Yeger-Lotem, Keidar, Dolev 97]
• R/W register + reconfiguration service
  – [Lynch, Shvartsman 97], [Englert, Shvartsman 00]: one designated “reconfigurer”
  – Rambo [Lynch, Shvartsman 02], Rambo II [Gilbert, Lynch, Shvartsman 03], Long Lived Rambo [Georgiou, Musial, Shvartsman 04]: consensus to agree on next configuration
• Is consensus really necessary?
Our second contribution: consensus is NOT needed! DynaStore: an algorithm for a completely asynchronous system

“Old” and “New” Configurations
• A reconfiguration transfers the state from a majority of the old config. to a majority of the new config.
• What if there are concurrent reconfigurations?
• Suppose that initial configuration is {A, B, C, D}
  – A invokes reconfig({+E}); C invokes reconfig({–D})
  – A writes to {A, D, E}, a majority of {A, B, C, D, E}
  – C reads from {B, C}, a majority of {A, B, C}
  – No intersection: atomicity is violated!
• Simple solution: consensus on the sequence of configurations
• But how can we do this without consensus?
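The counterexample can be checked mechanically. The short sketch below uses the configurations and quorums from the slide; the helper name `is_majority` and the variable names are illustrative assumptions.

```python
def is_majority(subset, config):
    """True if subset lies within config and holds more than half of it."""
    return subset <= config and len(subset) > len(config) / 2

config_after_add = {"A", "B", "C", "D", "E"}  # initial config plus E
config_after_remove = {"A", "B", "C"}         # initial config minus D

write_quorum = {"A", "D", "E"}
read_quorum = {"B", "C"}

assert is_majority(write_quorum, config_after_add)
assert is_majority(read_quorum, config_after_remove)
print(write_quorum & read_quorum)  # set(): the two quorums share no server
```

Each quorum is a majority of its own configuration, yet a value written to the first would be invisible to a read of the second, which is why the two reconfigurations cannot proceed independently.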

The approach in DynaStore
• For each configuration c, we use a (weak) snapshot nextConfig(c) to store the next configuration
• (Weak) snapshot objects are (easily) implemented in an asynchronous environment
• Processes update nextConfig(c) to suggest the next configuration after c (concurrent updates possible)
• Sequence of Established Configurations (simplified):
  – The initial configuration is established
  – If c is established, then the first snapshot update to nextConfig(c), which is included in every scan from nextConfig(c), is the next established configuration after c

Transferring the State
• A scan of nextConfig(c) returns a set of configs that follow c
  – If c is established, one config in the returned set is the next established config after c
• Scanning nextConfig for each returned config returns a further set, etc.; this creates a DAG of configurations
  – This DAG contains the sequence of established configs
• A reconfiguration transfers state along all paths in the DAG
  – This guarantees that state is transferred along the sequence of established configurations
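The repeated scanning that builds the DAG can be sketched as a breadth-first traversal. This is a simplified illustration, not the DynaStore protocol itself: `traverse` and `scan` are hypothetical names, `scan(c)` stands in for a scan of the weak snapshot nextConfig(c), and the sketch ignores updates that arrive while the traversal is running.

```python
from collections import deque

def traverse(start, scan):
    """Collect every configuration reachable from start.

    scan(c) stands in for scanning nextConfig(c): it returns the set of
    configurations proposed to follow c (empty if there are none).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        c = queue.popleft()
        for nxt in scan(c):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical snapshot contents: only C1 has been proposed after C0.
C0 = frozenset("ABCD")
C1 = frozenset("ABCDE")
next_config = {C0: {C1}}
reachable = traverse(C0, lambda c: next_config.get(c, set()))
```

A state transfer in this simplified model would read from a majority of every configuration in `reachable` before writing the latest value to the final one.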

Example
• Suppose that initial configuration is {A, B, C, D}
• A invokes reconfig({+E}); C invokes reconfig({–D})
[Figure: C0 = {A, B, C, D}, C1 = {A, B, C, D, E}]
• A updates nextConfig(C0) to C1
• A scans nextConfig(C0) to check for concurrent updates. Scan returns {C1}, i.e., no concurrent updates detected
  – C1 is the next established config after C0
• A’s state transfer:
  – Read from maj. of C0 and maj. of C1
  – Write latest value found to maj. of C1

Example
• Suppose that initial configuration is {A, B, C, D}
• A invokes reconfig({+E}); C invokes reconfig({–D})
[Figure: C0 = {A, B, C, D}, C1 = {A, B, C, D, E}, C2 = {A, B, C}, C3 = {A, B, C, E}]
• Concurrently, C updates nextConfig(C0) to C2 and scans it. Scan returns {C1, C2}, implying that A’s update was concurrent
• C updates nextConfig(C1) and nextConfig(C2) to C3. No concurrent updates detected
  – C3 is an established configuration
• C’s state transfer:
  – Read from maj. of each config on every path found from C0 to C3
  – Write latest value found to maj. of C3
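The paths that C's state transfer covers in this example can be enumerated with a few lines of code. The successor map below encodes the DAG from the slide (C0 to C1 and C2, both leading to C3); the function `all_paths` and the string labels are assumptions made for illustration.

```python
def all_paths(dag, src, dst):
    """Enumerate every path from src to dst in an acyclic successor map."""
    if src == dst:
        return [[src]]
    paths = []
    for nxt in sorted(dag.get(src, ())):
        for tail in all_paths(dag, nxt, dst):
            paths.append([src] + tail)
    return paths

# Configurations from the example, keyed by their labels.
members = {"C0": "ABCD", "C1": "ABCDE", "C2": "ABC", "C3": "ABCE"}
dag = {"C0": {"C1", "C2"}, "C1": {"C3"}, "C2": {"C3"}}

paths = all_paths(dag, "C0", "C3")
print(paths)  # [['C0', 'C1', 'C3'], ['C0', 'C2', 'C3']]

# Configurations whose majorities are read during the transfer.
to_read = sorted({c for path in paths for c in path})
```

Reading from a majority of every configuration in `to_read` covers both branches, so the transfer sees state written in C1 as well as in C2.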

Example
• Suppose that initial configuration is {A, B, C, D}
• A invokes reconfig({+E}); C invokes reconfig({–D})
[Figure: C0 = {A, B, C, D}, C1 = {A, B, C, D, E}, C2 = {A, B, C}, C3 = {A, B, C, E}]
• A invokes a write(newValue) operation in C1
• In this scenario, DynaStore guarantees:
  – Either C’s state transfer finds newValue in C1, or A’s write op discovers C3 and ends after writing newValue to maj. of C3
  – Read operations also traverse the DAG, and will find newValue on the path of established configurations, intersecting the write

Conclusions
• First “black box” definition of dynamic R/W register
  – In terms of events visible to user
  – A natural failure model: resilience changes dynamically
  – Possibly useful for specifying other dynamic problems
• DynaStore: first asynch. dynamic storage protocol
  – Implements a Reconfigurable Atomic MWMR register
  – In a completely asynchronous system (consensus impossible)
  – Proves that R/W storage is really easier than consensus (not only in a static system)