Eventual Consistency
Jinyang

Sequential consistency
• Sequential consistency properties:
  – Latest read must see latest write
    • Handles caching
  – All writes are applied in a single order
    • Handles concurrent writes
• Realizing sequential consistency:
  – Reads/writes from a single node execute one at a time
  – All reads/writes to address X must be ordered by one memory/storage module responsible for X

Realizing sequential consistency
[Figure: caches or replicas issuing W(A)1, W(A)2, W(B)3 and R(B), with an "Invalidate" message]

Disadvantages of sequential consistency
• Requires highly available connections
  – Lots of chatter between clients/servers
• Not suitable for certain scenarios:
  – Disconnected clients (e.g. your laptop)
  – Apps might prefer potential inconsistency to loss of availability

Why (not) eventual consistency?
• Support disconnected operations
  – Better to read a stale value than nothing
  – Better to save writes somewhere than nothing
• Potentially anomalous application behavior
  – Stale reads and conflicting writes…

Operating w/o total connectivity
• Client writes to its local replica
• Sync w/ server resolves non-conflicting changes, reports conflicting ones to user
• No sync between clients
[Figure: replicas with writes W(A)1 and W(A)2]

Pair-wise synchronization
• Pair-wise sync resolves non-conflicting changes, reports conflicting ones to users
[Figure: three replicas with writes W(A)1, W(B)3 and W(A)2]

Example usages?
• File synchronizers
  – One user, many gadgets

File synchronizer
• Goal
  1. All replica contents eventually become identical
  2. No lost updates
     – Do not replace a new version with an old one

Prevent lost updates
• Detect if updates were sequential
  – If so, replace old version with new one
  – If not, detect conflict
• “Optimistic” vs. “Pessimistic”
  – Eventual Consistency: Let updates happen, worry about whether they can be serialized later
  – Sequential Consistency: Updates cannot take effect unless they are serialized first

How to prevent lost updates?
[Figure: H1 performs W(f)a (f mtime: 15648), then W(f)b (f mtime: 16679); H2 performs W(f)c; mtimes shown for H2: 12354, 15648, 23657]
• Strawman: use mtime to decide which version should replace the other
• Problem w/ wallclock: cannot detect disagreement on ordering
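
The strawman can be sketched in a few lines. This is only an illustration: the `pick_by_mtime` helper and the dictionary layout are hypothetical, and the scenario assumes H2 wrote c without ever having seen b.

```python
def pick_by_mtime(x, y):
    """Strawman rule: the copy with the larger wall-clock mtime wins."""
    return x if x["mtime"] >= y["mtime"] else y

# Assumed scenario: H1 wrote a and then b; H2 wrote c without seeing b.
h1_copy = {"host": "H1", "content": "b", "mtime": 16679}
h2_copy = {"host": "H2", "content": "c", "mtime": 23657}

winner = pick_by_mtime(h1_copy, h2_copy)
print(winner["content"])  # c
```

The rule always produces a winner, so b is silently overwritten. Two mtimes alone carry no evidence that b and c were written concurrently.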

Strawman fix
[Figure: after W(f)a the history is (H1: 15648); after W(f)b it is (H1: 15648, H1: 16679); after W(f)c on H2 it is (H1: 15648, H2: 23657)]
• Carry the entire modification history
• If history X is a prefix of Y, Y is newer
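
A minimal sketch of the prefix rule, assuming a history is a list of (host, mtime) pairs. The function names are made up for illustration.

```python
def is_prefix(x, y):
    """True if history x is a prefix of history y."""
    return len(x) <= len(y) and y[:len(x)] == x

def compare_histories(x, y):
    """Classify two modification histories using the prefix rule."""
    if x == y:
        return "same"
    if is_prefix(x, y):
        return "y is newer"
    if is_prefix(y, x):
        return "x is newer"
    return "conflict"  # neither history extends the other

hist_a = [("H1", 15648)]
hist_b = [("H1", 15648), ("H1", 16679)]
hist_c = [("H1", 15648), ("H2", 23657)]

print(compare_histories(hist_a, hist_b))  # y is newer
print(compare_histories(hist_b, hist_c))  # conflict
```

Unlike the mtime comparison, the histories of b and c diverge after the first entry, so the disagreement is detected instead of hidden.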

Compress version history
[Figure: W(f)a gives history (H1:1); W(f)b gives (H1:1, H1:2), compressed to H1:2; W(f)c on H2 gives (H1:1, H2:1)]
• H1:2 implies H1:1, so we only need one number per host
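
The compression step can be sketched as keeping the largest counter per host. The `compress` helper and the list-of-pairs history format are assumptions for illustration.

```python
def compress(history):
    """Keep only the highest counter seen for each host."""
    vector = {}
    for host, counter in history:
        vector[host] = max(counter, vector.get(host, 0))
    return vector

print(compress([("H1", 1), ("H1", 2)]))  # {'H1': 2}
print(compress([("H1", 1), ("H2", 1)]))  # {'H1': 1, 'H2': 1}
```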

Compare vector timestamp
[Figure: (H1:1, H2:3, H3:2) < (H1:1, H2:5, H3:7); third vector shown: (H1:2, H2:1, H3:7)]
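
A small sketch of the comparison, using the three vectors from the slide. Representing a vector as a dict and treating a missing host as 0 are assumptions made here.

```python
def leq(a, b):
    """True if every entry of a is <= the matching entry of b (missing = 0)."""
    return all(a.get(h, 0) <= b.get(h, 0) for h in set(a) | set(b))

def compare(a, b):
    """Return how vector timestamp a relates to vector timestamp b."""
    a_le_b, b_le_a = leq(a, b), leq(b, a)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # each vector is ahead on some host

v1 = {"H1": 1, "H2": 3, "H3": 2}
v2 = {"H1": 1, "H2": 5, "H3": 7}
v3 = {"H1": 2, "H2": 1, "H3": 7}

print(compare(v1, v2))  # before
print(compare(v2, v3))  # concurrent
```

Vector timestamps form only a partial order: v3 is ahead of v2 on H1 but behind on H2, so neither is newer.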

Using vector timestamp
[Figure: H1 performs W(f)a (H1:1) and W(f)b (H1:2); H2 performs W(f)c; vectors shown: H1:1, H1:2, (H1:1, H2:1)]

Using vector timestamp
[Figure: H1 performs W(f)a (H1:1) and W(f)b (H1:2); H2 performs W(f)c, giving (H1:1, H2:1)]
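
The a/b/c scenario can be replayed with vector timestamps. This is a sketch under assumptions: `write` and `sync` are hypothetical helpers, and H2 is assumed to have received version a before writing c.

```python
def write(vector, host):
    """A local write bumps the writing host's own entry."""
    new = dict(vector)
    new[host] = new.get(host, 0) + 1
    return new

def dominates(a, b):
    """True if a is at least as new as b on every host."""
    return all(a.get(h, 0) >= b.get(h, 0) for h in set(a) | set(b))

def sync(local, remote):
    """Decide what to do with the remote copy of a file."""
    if dominates(local, remote):
        return "keep local"
    if dominates(remote, local):
        return "take remote"
    return "conflict"

v_a = write({}, "H1")    # W(f)a on H1 -> {'H1': 1}
v_b = write(v_a, "H1")   # W(f)b on H1 -> {'H1': 2}
v_c = write(v_a, "H2")   # W(f)c on H2 -> {'H1': 1, 'H2': 1}

print(sync(v_a, v_b))  # take remote
print(sync(v_b, v_c))  # conflict
```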

How to deal w/ conflicts?
• Easy: mailboxes w/ two different sets of messages
• Medium: changes to different lines of a C source file
• Hard: changes to same line of a C source file
• After conflict resolution, what should the vector timestamp be?

What about file deletion?
• Can we forget about the vector timestamp for deleted files?
• Simple solution: treat deletion as a write
  – Conflicts involving a deleted file are easy
• Downside:
  – Need to remember vector timestamp for deleted files indefinitely

Tra [Cox, Josephson]
• What are Tra’s novel properties?
  – Easy to compress storage of vector timestamps
  – No need to check every file’s version vector during sync
  – Allows partial sync of subtrees
  – No need to keep timestamp for deleted files forever

Tra’s key technique
• Two vector timestamps:
  1. One represents modification time
     – Tracks what a host has
  2. One represents synchronization time
     – Tracks what a host knows
• A sync time implies that no modification has happened between the mod time and the sync time
[Figure: two vector timestamps, (H1:1, H2:5, H3:7) and (H1:10, H2:20, H3:25)]
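
One plausible way to use the two timestamps during a one-way sync is sketched below. The three-way rule (skip, copy, conflict), the function names and the example values are assumptions for illustration, not taken from the slides.

```python
def leq(a, b):
    """True if every entry of a is <= the matching entry of b (missing = 0)."""
    return all(a.get(h, 0) <= b.get(h, 0) for h in set(a) | set(b))

def sync_decision(src, dst):
    """Decide what a one-way sync from src to dst should do for one file."""
    if leq(src["mtime"], dst["synctime"]):
        return "skip"      # dst already knows about src's version
    if leq(dst["mtime"], src["synctime"]):
        return "copy"      # src knows about dst's version, so src's is newer
    return "conflict"      # neither side has seen the other's change

h1 = {"mtime": {"H1": 2}, "synctime": {"H1": 2, "H2": 0}}
h2_stale = {"mtime": {"H1": 1}, "synctime": {"H1": 1, "H2": 0}}
h2_current = {"mtime": {"H1": 2}, "synctime": {"H1": 2, "H2": 0}}
h2_diverged = {"mtime": {"H1": 1, "H2": 1}, "synctime": {"H1": 1, "H2": 1}}

print(sync_decision(h1, h2_stale))     # copy
print(sync_decision(h1, h2_current))   # skip
print(sync_decision(h1, h2_diverged))  # conflict
```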

Using sync time
[Figure: hosts H1 and H2; H1 performs W(f1)a and W(f2)b; each copy of f1 and f2 is annotated with vector timestamps such as H1:1, H1:2, H1:0 and (H1:2, H2:0)]

Compress mtime and synctime
• dir synctime = element-wise min of child sync times
• dir mtime = element-wise max of child mod times
• Sync(d1 → d1’)
  – Skip d1 if mtime of d1 is less than synctime of d1’
• Can we achieve this with single mtime?
  – Skip d1 if mtime of d1 is less than mtime of d1’
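
The directory aggregation and the skip test can be sketched as follows. The helper names and the example state are hypothetical, and the skip test uses element-wise "less than or equal", which is an assumption about how the comparison is meant.

```python
def elementwise(op, vectors):
    """Combine vector timestamps entry by entry with op (min or max)."""
    hosts = set().union(*vectors)
    return {h: op(v.get(h, 0) for v in vectors) for h in hosts}

def leq(a, b):
    """True if every entry of a is <= the matching entry of b (missing = 0)."""
    return all(a.get(h, 0) <= b.get(h, 0) for h in set(a) | set(b))

def can_skip(src_dir_mtime, dst_dir_synctime):
    """The whole directory can be skipped if dst already knows everything in it."""
    return leq(src_dir_mtime, dst_dir_synctime)

# Hypothetical state: H1 holds f1 and f2 under directory d.
h1_mtimes = {"f1": {"H1": 1}, "f2": {"H1": 2}}
d_mtime = elementwise(max, list(h1_mtimes.values()))

# H2 has synced only f1 with H1 so far.
h2_synctimes = {"f1": {"H1": 2}, "f2": {"H1": 0}}
d_synctime = elementwise(min, list(h2_synctimes.values()))
print(can_skip(d_mtime, d_synctime))  # False

# After f2 has been synced as well, the directory can be skipped.
h2_synctimes["f2"] = {"H1": 2}
d_synctime_after = elementwise(min, list(h2_synctimes.values()))
print(can_skip(d_mtime, d_synctime_after))  # True
```

Taking the min for sync times is what keeps a partial sync honest: the directory's synctime stays low until every child has been synced.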

Synctime enables partial synchronization
• Directory d1 contains f1 and f2; suppose a host syncs only the subtree d1/f1
  – With synctime+mtime: synctime of d1 does not change. Mtime of d1 increases
  – With mtime only: Mtime of d1 increases
• Host later syncs subtree d1/f2
  – With synctime+mtime: will pull in modifications in f2 because synctime of d1 is smaller
  – With mtime only: skips d1 because mtime is high enough

Using sync time
[Figure: H1 performs W(f1)a and W(f2)b in directory d (f1 at H1:1, f2 at H1:2, d at H1:2); H2 performs "Sync f1 only" and "Sync f2 only"; each copy of f1, f2 and d is annotated with mod and sync vector timestamps such as H1:0, H1:1, H1:2, (H1:0, H2:0) and (H1:2, H2:0)]

How to deal w/ deletion
• Deletion notice for a deleted file contains its sync time
[Figure: H1 performs W(f1)a and D(f2) in directory d; the copies of f1, f2 and d on H1 and H2 are annotated with vector timestamps such as H1:0, H1:1, H1:2, (H1:0, H2:0) and (H1:2, H2:0)]

How to deal w/ deletion
• Deletion notice for a deleted file contains its sync time
[Figure: H1 performs W(f1)a and D(f2) in directory d while H2 holds f2 at H2:1; the copies of f1, f2 and d are annotated with vector timestamps such as H1:0, H1:1, H2:1, (H1:0, H2:1), (H1:2, H2:0) and (H1:2, H2:1)]

Another definition of eventual consistency
• Eventual consistency (Tra)
  – All replica contents are eventually identical
  – Do not care about individual writes, just overwrite old replica w/ new one
• Eventual consistency (Bayou)
  – Writes are eventually applied in total order
  – Reads might not see most recent writes in total order

Bayou
[Figure: nodes N0, N1 and N2, each with a write log and a version vector (0:0, 1:0, 2:0)]

Bayou propagation
[Figure: N0 has write log 1:0 W(x), 2:0 W(y), 3:0 W(z) and version vector (0:3, 1:0, 2:0); N1 has write log 1:1 W(x) and version vector (0:0, 1:1, 2:0); N2 has version vector (0:0, 1:0, 2:0); being propagated: 1:0 W(x), 2:0 W(y), 3:0 W(z) with (0:3, 1:0, 2:0)]

Bayou propagation
[Figure: N0 has write log 1:0 W(x), 2:0 W(y), 3:0 W(z) and version vector (0:3, 1:0, 2:0); N1 has write log 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z) and version vector (0:3, 1:4, 2:0); N2 has version vector (0:0, 1:0, 2:0); being propagated: 1:1 W(x)]
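
The exchange in these two figures can be sketched as a simple anti-entropy step: the sender ships every write the receiver's version vector does not cover. The `Write` tuple and both helpers are hypothetical, and the sketch does not model any clock adjustment on receipt, so its version vectors need not match every number in the figures.

```python
from collections import namedtuple

# ts is the writer's local timestamp, node is the writer's id.
Write = namedtuple("Write", ["ts", "node", "op"])

def writes_to_send(log, receiver_vv):
    """Writes the receiver has not seen, judged by its version vector."""
    return [w for w in log if w.ts > receiver_vv.get(w.node, 0)]

def receive(log, vv, incoming):
    """Merge incoming writes into the log, ordered by (ts, node)."""
    merged = sorted(set(log) | set(incoming), key=lambda w: (w.ts, w.node))
    new_vv = dict(vv)
    for w in incoming:
        new_vv[w.node] = max(new_vv.get(w.node, 0), w.ts)
    return merged, new_vv

n0_log = [Write(1, 0, "W(x)"), Write(2, 0, "W(y)"), Write(3, 0, "W(z)")]
n0_vv = {0: 3, 1: 0, 2: 0}
n1_log = [Write(1, 1, "W(x)")]
n1_vv = {0: 0, 1: 1, 2: 0}

# N0 -> N1: N1 has seen nothing from node 0, so all three writes are sent.
n1_log, n1_vv = receive(n1_log, n1_vv, writes_to_send(n0_log, n1_vv))
print([(w.ts, w.node) for w in n1_log])  # [(1, 0), (1, 1), (2, 0), (3, 0)]

# N1 -> N0: only the write from node 1 is new to N0.
back = writes_to_send(n1_log, n0_vv)
print([(w.ts, w.node) for w in back])    # [(1, 1)]
```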

Bayou propagation
[Figure: N0 and N1 both have write log 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z); version vectors shown: (0:3, 1:4, 2:0) and (0:4, 1:4, 2:0); N2 has version vector (0:0, 1:0, 2:0)]
• Which portion of the log is stable?

Bayou propagation
[Figure: N0, N1 and N2 all have write log 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z); version vectors shown: (0:3, 1:4, 2:0), (0:4, 1:4, 2:0) and, for N2, (0:3, 1:4, 2:5)]

Bayou propagation
[Figure: N0, N1 and N2 all have write log 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z); version vectors shown: (0:4, 1:4, 2:0), (0:3, 1:6, 2:5), (0:3, 1:4, 2:5) and (0:4, 1:4, 2:5)]

Bayou uses a primary to commit a total order
• Why is it important to make log stable?
  – Stable writes can be committed
  – Stable portion of the log can be truncated
• Problem: If any node is offline, the stable portion of all logs stops growing
• Bayou’s solution:
  – A designated primary defines a total commit order
  – Primary assigns CSNs (commit-seq-no)
  – Any write with a known CSN is stable
  – All stable writes are ordered before tentative writes
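
The resulting log order can be sketched with a sort key that puts the CSN first. The `Write` tuple, the use of infinity for "no CSN yet" and the helper names are assumptions for illustration.

```python
import math
from collections import namedtuple

# csn is the commit sequence number; tentative writes have none yet.
Write = namedtuple("Write", ["csn", "ts", "node", "op"])
TENTATIVE = math.inf

def log_order(writes):
    """Stable writes first (by CSN), then tentative writes by (ts, node)."""
    return sorted(writes, key=lambda w: (w.csn, w.ts, w.node))

def stable_prefix(writes):
    """The part of the log whose order can no longer change."""
    return [w for w in log_order(writes) if w.csn != TENTATIVE]

log = [
    Write(TENTATIVE, 1, 1, "W(x)"),
    Write(1, 1, 0, "W(x)"),
    Write(2, 2, 0, "W(y)"),
    Write(3, 3, 0, "W(z)"),
]
print([(w.ts, w.node) for w in log_order(log)])  # [(1, 0), (2, 0), (3, 0), (1, 1)]
print(len(stable_prefix(log)))                    # 3

# Once the primary assigns CSN 4 to the tentative write, it becomes stable too.
committed = [w._replace(csn=4) if w.csn == TENTATIVE else w for w in log]
print(len(stable_prefix(committed)))              # 4
```

Note that the committed order differs from the timestamp order (1:0, 1:1, 2:0, 3:0): the write from node 1 ends up last because the primary committed it last.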

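Bayou propagation
[Figure: log entries now carry a CSN prefix. N0 has write log 1:1:0 W(x), 2:2:0 W(y), 3:3:0 W(z) and version vector (0:3, 1:0, 2:0); N1 has the tentative write ∞:1:1 W(x) and version vector (0:0, 1:1, 2:0); N2 has version vector (0:0, 1:0, 2:0); being propagated: ∞:1:1 W(x) with (0:0, 1:1, 2:0)]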

Bayou propagation
[Figure: N0 has write log 1:1:0 W(x), 2:2:0 W(y), 3:3:0 W(z), 4:1:1 W(x) and version vector (0:4, 1:1, 2:0); N1 has the tentative write ∞:1:1 W(x) and version vector (0:0, 1:1, 2:0); N2 has version vector (0:0, 1:0, 2:0); being propagated: 1:1:0 W(x), 2:2:0 W(y), 3:3:0 W(z), 4:1:1 W(x) with (0:4, 1:1, 2:0)]

Bayou’s limitations
• Primary cannot fail
• Server creation & retirement makes nodeID grow arbitrarily long
• Anomalous behaviors for apps?
  – Calendar app