Replication techniques
Primary-backup, RSM, Paxos
Jinyang Li
Fault tolerance => replication
• How to recover a single node from power failure?
  – Wait for reboot
    • Data is durable, but service is temporarily unavailable
  – Use multiple nodes to provide service
    • Another node takes over to provide service
Replicated state machine (RSM)
• RSM is a general replication method
  – Lab 6: apply RSM to the lock service
• RSM rules (see the sketch below):
  – All replicas start in the same initial state
  – Every replica applies operations in the same order
  – All operations must be deterministic
• Result: all replicas end up in the same state
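A minimal Python sketch of these rules (illustrative only, not lab 6's actual lock-service code): replicas that start identically and apply the same deterministic ops in the same order converge to the same state.

    class Replica:
        def __init__(self):
            self.state = {}                 # same initial state on every replica

        def apply(self, op):
            # deterministic: the result depends only on (state, op)
            key, value = op
            self.state[key] = value

    ops = [("x", 1), ("y", 2), ("x", 3)]    # the single agreed-upon order
    r1, r2 = Replica(), Replica()
    for op in ops:
        r1.apply(op)
        r2.apply(op)
    assert r1.state == r2.state             # all replicas end in the same state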
RSM
[diagram: clients submit concurrent ops A and B to the replicas]
• How to maintain a single order in the face of concurrent client requests?
RSM: primary/backup
[diagram: clients send ops A and B; the primary orders them and forwards them to the backup]
• Primary/backup ensures a single order of ops:
  – Primary orders operations
  – Backups execute operations in order
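A toy sketch of this split (function names are mine, not a real API): the primary stamps concurrent ops with sequence numbers, and backups converge no matter the delivery order because they apply in sequence-number order.

    def primary_order(concurrent_ops):
        # only the primary assigns the order
        return list(enumerate(concurrent_ops))       # (seqno, op) pairs

    def backup_apply(state, numbered_ops):
        # backups execute strictly in seqno order, whatever the arrival order
        for _, (key, value) in sorted(numbered_ops):
            state[key] = value
        return state

    log = primary_order([("x", 1), ("x", 2)])        # ops A and B
    assert backup_apply({}, log) == backup_apply({}, list(reversed(log)))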
Case study: Hypervisor [Bressoud and Schneider]
• Hypervisor's goal: fault tolerance
  – Banks, telephone exchanges, and NASA need fault-tolerant computing
  – CPUs are the most likely components to fail, due to their complexity
• Hypervisor: primary/backup replication
  – If the primary fails, the backup takes over
  – Caveat: assumes failure detection is perfect
Hypervisor replicates at the VM level
• Why replicate at the VM level?
  – Hardware fault-tolerant machines were big business in the '80s
  – A software solution is more economical than a hardware one
  – Replicating at the O/S level is messy (many interfaces)
  – Replicating at the app level requires programmer effort (web developers are not used to writing RSM code)
  – Replicating at the VM level gives a clean interface (no need to change the O/S or apps)
• Primary and backup execute the same sequence of machine instructions
A strawman design
[diagram: two machines, each with its own memory and disk]
• Two identical machines
• Same initial memory/disk contents
• Start executing on both machines
• Will they perform the same computation?
A strawman design
• Property: nodes executing the same sequence of ops see the same effect if the ops are deterministic
• Which operations are deterministic?
  – ADD, MUL, etc.
  – Read time-of-day register, cycle counter, privilege level?
  – Read memory?
  – Read disk?
  – Interrupt timing?
  – External input devices (network, keyboard)?
  – External output (network, display)?
Deal with I/O devices
[diagram: primary and backup connected by ethernet, sharing I/O devices over a SCSI bus]
• Strawman replicates disks at both machines
  – Problem: hardware might not behave identically (e.g., fail at different sectors)
• Hypervisor connects devices to both machines
  – Only the primary reads/writes to devices
  – Primary sends read values to the backup
  – Only the primary handles interrupts from h/w
  – Primary sends interrupts to the backup
Hypervisor executes in epochs
• How to execute interrupts at the same point on both nodes?
• Strawman: execute instructions one at a time
  – Backup waits for the primary to send interrupts at the end of each instruction
  – Very slow…
• Hypervisor executes in epochs (see the sketch below)
  – CPU h/w interrupts every N instructions (so both nodes stop at the same point)
  – Primary delays all interrupts till the end of an epoch
  – Primary sends all interrupts to the backup
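A hedged sketch of the epoch mechanism (all names here are invented; the real Hypervisor uses the HP PA-RISC recovery counter to trap every N instructions): both CPUs run exactly one epoch, then the buffered interrupts are delivered at the identical instruction count on each node.

    EPOCH = 1000                 # h/w counter traps every N instructions

    class Cpu:
        def __init__(self):
            self.icount = 0      # instructions executed so far
            self.log = []        # (icount, interrupt) delivery record

        def run_epoch(self, deferred_interrupts):
            self.icount += EPOCH                 # stop at the epoch boundary
            for irq in deferred_interrupts:      # deliver in a fixed order
                self.log.append((self.icount, irq))

    primary, backup = Cpu(), Cpu()
    pending = ["disk-done", "timer"]             # primary buffered these
    primary.run_epoch(pending)                   # primary delivers at the boundary
    backup.run_epoch(pending)                    # backup replays the shipped list
    assert primary.log == backup.log             # identical interrupt timing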
Hypervisor failover
• If the primary fails, the backup must handle I/O
• How does the backup know the primary has failed?
• How does the backup know whether the primary handled I/O writes in the last epoch?
  – Relies on the O/S to retry the I/O
  – The device needs to support repeated ops
    • OK for disk writes/reads
    • OK for the network (TCP will figure it out)
    • How about a keyboard, printer, or ATM cash machine?
Hypervisor implementation
• Hypervisor needs to trap every nondeterministic instruction
  – Time-of-day register
  – HP TLB replacement
  – HP branch-and-link instruction
  – Memory-mapped I/O loads/stores
• Performance penalty is reasonable
  – A factor-of-two slowdown
  – How about its performance on modern hardware?
Caveats in Hypervisor
• What if the network between the primary and backup fails?
  – The primary is still running
  – The backup becomes a new primary
  – Two primaries at the same time!
• Can timeouts detect failures correctly?
  – Pings from backup to primary are lost
  – Pings from backup to primary are delayed
Paxos: general approach
• One (or more) node decides to be the leader
• The leader proposes a value and solicits acceptance from others
• The leader announces the result or tries again
Paxos: fault-tolerant agreement
• Paxos lets all nodes agree on the same value despite node failures, network failures, and delays
• Extremely useful:
  – e.g., nodes agree that X is the primary
  – e.g., nodes agree that the last instruction executed is I
Paxos requirements
• Correctness:
  – All nodes agree on the same value
  – The agreed value X was proposed by some node
• Fault tolerance:
  – If fewer than N/2 nodes fail, the remaining nodes should eventually reach agreement w.h.p.
  – Liveness is not guaranteed
Why is agreement hard?
• What if more than one node becomes leader simultaneously?
• What if there is a network partition?
• What if a leader crashes in the middle of solicitation?
• What if a leader crashes after deciding but before announcing the result?
• What if the new leader proposes a value different from the already-decided one?
Paxos setup
• Each node proposes a value for a particular view
  – Agreement is of the form vid=<XYZ>
• Each node runs as a proposer, acceptor, or learner
Strawman
• Designate a single node X as the acceptor (e.g., the one with the smallest id)
  – Each proposer sends its value to X
  – X decides on one of the values
  – X announces its decision to all learners
• Problem?
  – Failure of the single acceptor halts the decision
  – Need multiple acceptors!
Strawman 2: multiple acceptors
• Each proposer (leader) proposes to all acceptors
• Each acceptor accepts the first proposal it receives
• If the leader receives positive replies from a majority of acceptors, it decides on its own value
  – There is at most one majority, hence only a single value is decided
• The leader sends the decided value to all learners
• Problem:
  – What if multiple leaders propose simultaneously, so that no majority accepts the same proposal?
Paxos solution
• Proposals are ordered by proposal #
• Each acceptor may accept multiple proposals
  – If a proposal with value v is chosen, all higher proposals have value v
Paxos operation: node state
• Each node maintains:
  – na, va: highest proposal # accepted by me and its corresponding value
  – nh: highest proposal # seen
  – myn: my proposal # in the current Paxos round
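In code form, the per-node state might look like this (a Python sketch; field names follow the slide, not lab 5's actual classes):

    class PaxosNode:
        def __init__(self):
            self.na = None    # highest proposal # accepted by me
            self.va = None    # value of that accepted proposal
            self.nh = 0       # highest proposal # seen in any message
            self.myn = 0      # my proposal # in the current round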
Paxos operation: 3-phase protocol
• Phase 1 (Prepare)
  – A node decides to be leader (and to propose)
  – Leader chooses myn > nh
  – Leader sends <prepare, myn> to all nodes
  – Upon receiving <prepare, n>:
    If n < nh: reply <prepare-reject>
    Else: nh = n; reply <prepare-ok, na, va>
Paxos operation
• Phase 2 (Accept)
  – If the leader gets prepare-ok from a majority:
    V = the non-null value corresponding to the highest na received
    If V is null, the leader may pick any V
    Send <accept, myn, V> to all nodes
  – If the leader fails to get a majority of prepare-ok:
    • Delay and restart Paxos
  – Upon receiving <accept, n, V>:
    If n ≥ nh: na = n; va = V; reply <accept-ok>
    Else: reply <reject>
Paxos operation
• Phase 3 (Decide)
  – If the leader gets accept-ok from a majority:
    • Send <decide, va> to all nodes
  – If the leader fails to get accept-ok from a majority:
    • Delay and restart Paxos
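Putting the three phases together, here is a single-process sketch of one round over the PaxosNode state sketched earlier (my own simplification: no RPCs, failures, or restarts, and not the lab's interface). It shows the key rule: a new leader must adopt the value of the highest-numbered accepted proposal it hears about.

    def on_prepare(node, n):                       # phase 1, acceptor side
        if n < node.nh:
            return ("prepare-reject",)
        node.nh = n
        return ("prepare-ok", node.na, node.va)

    def on_accept(node, n, v):                     # phase 2, acceptor side
        if n >= node.nh:   # prepare already set nh = myn, so the leader's own accept passes
            node.na, node.va = n, v
            return "accept-ok"
        return "reject"

    def run_round(nodes, leader, value):           # leader side, phases 1-3
        leader.myn = leader.nh + 1                 # choose myn > nh
        oks = [r for r in (on_prepare(n, leader.myn) for n in nodes)
               if r[0] == "prepare-ok"]
        if len(oks) <= len(nodes) // 2:
            return None                            # no majority: restart later
        accepted = [(na, va) for _, na, va in oks if va is not None]
        v = max(accepted)[1] if accepted else value    # adopt highest na's value
        acks = [on_accept(n, leader.myn, v) for n in nodes]
        if acks.count("accept-ok") <= len(nodes) // 2:
            return None                            # no majority: restart later
        return v                                   # phase 3: send <decide, v> to all

    nodes = [PaxosNode() for _ in range(3)]
    assert run_round(nodes, nodes[0], "val1") == "val1"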
Paxos operation: an example
[message-sequence diagram: leader N1 (nh = N1:0) runs a round with proposal # N1:1 among N0, N1, N2: prepare → prepare-ok (na = va = null) → accept (N1:1, val1) → accept-ok → decide (val1); nodes finish with na = N1:1, va = val1]
Paxos properties
• When is the value V chosen?
  1. When the leader receives prepare-ok from a majority and proposes V
  2. When a majority of nodes accept V
  3. When the leader receives accept-ok from a majority for value V
Understanding Paxos
• What if more than one leader is active?
• Suppose two leaders use different proposal numbers, N0:10 and N1:11
• Can both leaders see a majority of prepare-ok?
Understanding Paxos
• What if the leader fails while sending accepts?
• What if a node fails after receiving an accept?
  – If it doesn't restart, …?
  – If it reboots, …?
• What if a node fails after sending prepare-ok?
  – If it reboots, …?
Using Paxos for RSM
• A fault-tolerant RSM requires consistent replica membership
  – Membership: <primary, backups>
  – The RSM goes through a series of membership changes: <vid-0, primary, backups>, <vid-1, primary, backups>, …
• Use Paxos to agree on the <primary, backups> for a particular vid
Lab 5: Using Paxos for RSM
• All nodes start with the static config vid1: {N1}
• N2 joins: a majority in vid1 ({N1}) accepts vid2: {N1, N2}
• N3 joins: a majority in vid2 ({N1, N2}) accepts vid3: {N1, N2, N3}
• N3 fails: a majority in vid3 ({N1, N2, N3}) accepts vid4: {N1, N2}
(see the view-chain sketch below)
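A bookkeeping sketch of this view chain (illustrative only; `majority_accepts` stands in for running a full Paxos round among the previous view's members, which lab 5 does for real):

    views = {1: {"N1"}}                      # static initial config

    def majority_accepts(old_members, proposal):
        return True                          # stand-in for a real Paxos round

    def form_view(vid, members):
        old = views[vid - 1]                 # Paxos runs among the old view
        if majority_accepts(old, (vid, members)):
            views[vid] = members

    form_view(2, {"N1", "N2"})               # N2 joins
    form_view(3, {"N1", "N2", "N3"})         # N3 joins
    form_view(4, {"N1", "N2"})               # N3 fails and is dropped
    assert views[4] == {"N1", "N2"}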
Lab 5: Using Paxos for RSM
[diagram: N3 joins while the current view is vid2 = {N1, N2}; N3 sends <prepare, vid2, N3:1> and a member replies <oldview, vid2 = {N1, N2}>]
Lab 5: Using Paxos for RSM
[diagram: having learned the current view, N3 sends <prepare, vid3, N3:1> to both N1 and N2]
Lab 6: re-configurable RSM
• Use RSM to replicate the lock_server
• The primary in each view assigns a viewstamp to each client request
  – A viewstamp is a tuple (vid: seqno)
  – (0:0) (0:1) (0:2) (0:3) (1:0) (1:1) (1:2) (2:0) (2:1)
• All replicas execute client requests in viewstamp order (see the check below)
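Because a viewstamp is just a (vid, seqno) pair, viewstamp order is plain tuple comparison; a quick Python check of the sequence above (my own illustration, not lab 6 code):

    vss = [(0, 0), (0, 1), (0, 2), (0, 3),   # view 0
           (1, 0), (1, 1), (1, 2),           # view 1: seqno restarts at 0
           (2, 0), (2, 1)]                   # view 2
    assert vss == sorted(vss)                # execution order == sorted order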
Lab 6: Viewstamp replication
• To execute an op with viewstamp vs, a replica must have executed all ops < vs
• A replica can transfer state from another node to ensure its state reflects the execution of all ops < vs
Lab 5: Using Paxos for RSM
[diagram: N2 joins while the view is vid1 = {N1} and N1's largest viewstamp is (1:50); after the prepare, accept, decide exchange, N1 transfers state to bring N2 up to date]