CS 162 Section 10 Two-phase commit Fault-tolerant computing


Administrivia
• Project 3 code due Thursday 4/17 by 11:59 PM
• Project 4 design due Thursday 4/24 by 11:59 PM
• Midterm II is April 28th, 4-5:30 PM, in 245 Li Ka Shing and 100 GPB
  – Covers Lectures #13-24
  – Closed book and notes, no calculators
  – One double-sided handwritten page of notes allowed
  – Review session: Fri Apr 25, 4-6 PM in 245 Li Ka Shing
4/14/2014 CS 162 ©UCB Spring 2014 Sec 10.2

QUIZ

Quiz
True/False
1. Two-phase commit is used to guarantee consistency. False
2. You always need 2PC if you want multiple systems to stay consistent. False (e.g., use consistent client-side hashing to distribute keys among slaves, with no replication.)
3. If a master server wakes up after crashing in the WAIT state, it resends VOTE_REQ in order to recount the slaves. False
4. Increasing mean-time-to-repair decreases availability (as defined in lecture). True
5. Bohrbugs won’t be fixed by restarting a task or system. True
Short answer
6. What was the purpose of the TLS heartbeat, the target of the Heartbleed attack? To make sure NATs/firewalls in the middle don’t shut down the connection.

TWO-PHASE COMMIT

Durability and Atomicity, distributed
• How do you make sure transaction results persist in the face of failures (e.g., disk failures)?
• Replicate database
  – Commit transaction to each replica
• What happens if you have failures during a transaction commit?
  – Need to ensure atomicity: either transaction is committed on all replicas or none at all
• How can we replicate with atomicity?

Two-Phase Commit
Coordinator Algorithm
• Coordinator sends VOTE-REQ to all workers
  – If it receives VOTE-COMMIT from all N workers, it sends GLOBAL-COMMIT to all workers
  – If it doesn’t receive VOTE-COMMIT from all N workers, it sends GLOBAL-ABORT to all workers
Worker Algorithm
• Wait for VOTE-REQ from coordinator
  – If ready, send VOTE-COMMIT to coordinator
  – If not ready, send VOTE-ABORT to coordinator, and immediately abort
  – If receive GLOBAL-COMMIT then commit
  – If receive GLOBAL-ABORT then abort
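
The two algorithms on this slide can be sketched as a failure-free, in-memory simulation. This is only an illustration under stated assumptions: the `Worker` class, its `ready` flag, and the `two_phase_commit` function are hypothetical names, and message passing is replaced by direct method calls.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE-COMMIT"
    ABORT = "VOTE-ABORT"


class Worker:
    """Hypothetical in-memory worker; `ready` stands in for its local checks."""

    def __init__(self, name, ready=True):
        self.name = name
        self.ready = ready
        self.state = "INIT"

    def on_vote_req(self):
        if self.ready:
            self.state = "READY"
            return Vote.COMMIT
        self.state = "ABORT"  # a worker that votes abort aborts immediately
        return Vote.ABORT

    def on_global(self, decision):
        # Only workers still in READY are waiting on the coordinator's decision.
        if self.state == "READY":
            self.state = "COMMIT" if decision == "GLOBAL-COMMIT" else "ABORT"


def two_phase_commit(workers):
    # Phase 1: send VOTE-REQ to every worker and collect the votes.
    votes = [w.on_vote_req() for w in workers]
    # Phase 2: commit only if all N workers voted to commit.
    if all(v is Vote.COMMIT for v in votes):
        decision = "GLOBAL-COMMIT"
    else:
        decision = "GLOBAL-ABORT"
    for w in workers:
        w.on_global(decision)
    return decision
```

Calling `two_phase_commit` with one worker created as `ready=False` yields GLOBAL-ABORT, and every worker ends in ABORT, which is the all-or-nothing outcome the protocol is meant to provide.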

Worker, Master states
Master
  INIT → WAIT: Recv START, Send VOTE-REQ
  WAIT → ABORT: Recv VOTE-ABORT, Send GLOBAL-ABORT
  WAIT → COMMIT: Recv VOTE-COMMIT, Send GLOBAL-COMMIT
Worker
  INIT → ABORT: Recv VOTE-REQ, Send VOTE-ABORT
  INIT → READY: Recv VOTE-REQ, Send VOTE-COMMIT
  READY → ABORT: Recv GLOBAL-ABORT
  READY → COMMIT: Recv GLOBAL-COMMIT

Failure Free Example Execution
[Timeline diagram: the coordinator sends VOTE-REQ to worker 1, worker 2, and worker 3; each worker replies VOTE-COMMIT; the coordinator then sends GLOBAL-COMMIT.]

Dealing with Worker Failures
• How to deal with worker failures?
  – Failure only affects states in which the node is waiting for messages
  – Coordinator only waits for votes in “WAIT” state
  – In WAIT, if it doesn’t receive N votes, it times out and sends GLOBAL-ABORT
[State diagram (coordinator): INIT → WAIT on Recv START / Send VOTE-REQ; WAIT → ABORT on Recv VOTE-ABORT / Send GLOBAL-ABORT; WAIT → COMMIT on Recv VOTE-COMMIT / Send GLOBAL-COMMIT]
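
The timeout rule for the WAIT state might look like the sketch below. It assumes votes arrive on a `queue.Queue` inbox (a hypothetical stand-in for the network) and applies the timeout to each vote separately rather than as one overall deadline; `collect_votes` and its parameters are illustrative names, not part of the course code.

```python
import queue


def collect_votes(vote_queue, n_workers, timeout_s):
    """Coordinator in WAIT: wait for N votes and return the global decision."""
    for _ in range(n_workers):
        try:
            vote = vote_queue.get(timeout=timeout_s)
        except queue.Empty:
            # A worker never answered (crashed or unreachable): time out and abort.
            return "GLOBAL-ABORT"
        if vote == "VOTE-ABORT":
            return "GLOBAL-ABORT"
    # All N workers voted VOTE-COMMIT.
    return "GLOBAL-COMMIT"
```

With three expected workers and only two VOTE-COMMIT messages in the queue, the third `get` times out and the function returns GLOBAL-ABORT.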

Dealing with Coordinator Failure
• How to deal with coordinator failures?
  – Worker waits for VOTE-REQ in INIT
    » Worker can time out and abort (coordinator handles it)
  – Worker waits for GLOBAL-* message in READY
    » If coordinator fails, workers must BLOCK waiting for coordinator to recover and send GLOBAL-* message
[State diagram (worker): INIT → ABORT on Recv VOTE-REQ / Send VOTE-ABORT; INIT → READY on Recv VOTE-REQ / Send VOTE-COMMIT; READY → ABORT on Recv GLOBAL-ABORT; READY → COMMIT on Recv GLOBAL-COMMIT]
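
The difference between the two waiting states can be written as a small decision function. This is a sketch; `worker_on_timeout` and its return strings are hypothetical names chosen for illustration.

```python
def worker_on_timeout(state):
    """Return what a worker may do when it stops hearing from the coordinator."""
    if state == "INIT":
        # No vote has been sent yet, so aborting unilaterally is safe.
        return "abort"
    if state == "READY":
        # The worker already sent VOTE-COMMIT, so the coordinator may have
        # decided either way; the worker must wait for a GLOBAL-* message.
        return "block"
    # COMMIT or ABORT: the outcome is already known, nothing to do.
    return "none"
```

The READY case is what makes two-phase commit a blocking protocol: a worker that has voted to commit cannot pick an outcome on its own without risking disagreement with the other workers.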

Example of Coordinator Failure
[Timeline diagram: the coordinator sends VOTE-REQ to worker 1, worker 2, and worker 3; the workers reply VOTE-COMMIT and move from INIT to READY; the coordinator fails and the workers block waiting for the coordinator; once the coordinator is restarted, it sends GLOBAL-ABORT.]

Remembering Where We Were (Durability)
• All nodes use stable storage* to store which state they are in
• Upon recovery, a node can restore its state and resume:
  – Coordinator aborts in INIT, WAIT, or ABORT
  – Coordinator commits in COMMIT
  – Worker aborts in INIT, ABORT
  – Worker commits in COMMIT
  – Worker asks Coordinator in READY
* Stable storage is non-volatile storage (e.g., backed by disk) that guarantees atomic writes.
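
One way to approximate this on an ordinary file system is to write the state to a temporary file and rename it over the old one. The sketch below assumes a POSIX file system, where `os.replace` is atomic; the function names and the JSON file layout are hypothetical, and real stable storage needs more care (for example, syncing the containing directory).

```python
import json
import os
import tempfile


def save_state(path, state):
    """Persist `state` so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"state": state}, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    os.replace(tmp_path, path)  # atomic rename on POSIX file systems


def load_state(path):
    """Return the last saved state, or INIT if nothing was ever saved."""
    try:
        with open(path) as f:
            return json.load(f)["state"]
    except FileNotFoundError:
        return "INIT"


def recovery_action(role, state):
    """Map a recovered state to the action listed on the slide."""
    if state == "COMMIT":
        return "commit"
    if role == "worker" and state == "READY":
        return "ask coordinator"
    # Coordinator in INIT, WAIT, or ABORT; worker in INIT or ABORT.
    return "abort"
```

A node would call `save_state` before sending the message that goes with a transition, then call `recovery_action(role, load_state(path))` after a restart.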

FAULT-TOLERANT COMPUTING

Dependability: The 3 ITIES
• Reliability / Integrity: does the right thing. (Need large MTBF)
• Availability: does it now. (Need small MTTR / (MTBF + MTTR))
• System Availability: if 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time)
[Diagram labels: Integrity, Reliability, Security, Availability]
MTBF or MTTF = Mean Time Between (To) Failure
MTTR = Mean Time To Repair (see next slide)
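
The 89% figure follows from multiplying the two availabilities, which assumes a transaction needs both a terminal and the DB and that the two fail independently. A small sketch of that arithmetic, with hypothetical function names:

```python
def availability(mtbf, mttr):
    """Fraction of time a component is up: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


def series_availability(*parts):
    """Availability of a system that needs every part up (independence assumed)."""
    result = 1.0
    for a in parts:
        result *= a
    return result


# Slide example: 90% of terminals up and 99% of DB up.
system = series_availability(0.90, 0.99)  # 0.891, i.e. about 89%
```

`availability` is one minus the MTTR / (MTBF + MTTR) ratio on the slide, so shrinking MTTR or growing MTBF both push it toward 1.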

Mean Time to Recovery
• Critical time as further failures can occur during recovery
• Total Outage duration (MTTR) =
    Time to Detect (need good monitoring)
  + Time to Diagnose (need good docs/ops, best practices)
  + Time to Decide (need good org/leader, best practices)
  + Time to Act (need good execution!)
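
As a small worked example of the sum above, the sketch below averages outage durations over several incidents. The `Outage` class and the minute values in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """Time spent in each phase of one outage, in minutes (hypothetical data)."""
    detect: float
    diagnose: float
    decide: float
    act: float

    def duration(self):
        return self.detect + self.diagnose + self.decide + self.act


def mean_time_to_repair(outages):
    """Average total outage duration over a list of outages."""
    return sum(o.duration() for o in outages) / len(outages)
```

For two outages of 5 + 20 + 10 + 25 = 60 minutes and 1 + 9 + 5 + 5 = 20 minutes, the mean is 40 minutes; shortening any one phase lowers it.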

Traditional Fault Tolerance Techniques
• Fail fast modules: work or stop
• Spare modules: yield instant repair time
• Process/Server pairs: Mask HW and SW faults
• Transactions: yields ACID semantics (simple fault model)

Fail-Fast is Good, but Repair is Needed
[Diagram: lifecycle of a module: Fault, Detect, Repair, Return]
• Fail-fast gives short fault latency
• High Availability is low UN-Availability
  – Unavailability ~ MTTR / MTBF
• Improving either MTTR or MTBF gives benefit
• Simple redundancy does not help much (can actually hurt!)
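
The approximation on this slide holds when MTBF is much larger than MTTR, since MTTR / (MTBF + MTTR) is then close to MTTR / MTBF. A sketch with hypothetical function names:

```python
def unavailability(mtbf, mttr):
    """Exact fraction of time the module is down."""
    return mttr / (mtbf + mttr)


def approx_unavailability(mtbf, mttr):
    """Slide approximation, reasonable when MTBF is much larger than MTTR."""
    return mttr / mtbf


# Hypothetical module: fails every 1000 hours on average, takes 1 hour to repair.
base = unavailability(1000, 1)
faster_repair = unavailability(1000, 0.5)  # halve MTTR
fewer_failures = unavailability(2000, 1)   # double MTBF
```

Halving MTTR and doubling MTBF lower unavailability by the same factor here, which is the sense in which improving either one gives benefit.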

Fail-Fast and High-Availability Execution
Process Pairs: Instant repair
• Use defensive programming to make a process fail-fast
• Have separate backup process ready to “take over” if primary faults
• SW fault is a Bohrbug => no repair: “wait for the next release” or “get an emergency bug fix” or “get a new vendor”
• SW fault is a Heisenbug => restart process: “reboot and retry”
• Yields millisecond repair times
• Tolerates some HW faults
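
The process-pair idea can be illustrated with two plain functions standing in for the primary and backup processes. This is only a sketch: `FailFastError`, `make_flaky_handler`, and `process_pair` are hypothetical names, and a real process pair runs separate processes and keeps the backup's state up to date.

```python
class FailFastError(Exception):
    """Raised by a process that detects a fault and stops instead of continuing."""


def make_flaky_handler(failures):
    """Hypothetical handler that hits a transient, Heisenbug-like fault `failures` times."""
    remaining = {"count": failures}

    def handler(request):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            # Defensive check failed: stop right away rather than return bad data.
            raise FailFastError("invariant violated; stopping")
        return request.upper()

    return handler


def process_pair(primary, backup, request):
    """Try the primary; if it fails fast, the backup takes over the request."""
    try:
        return "primary", primary(request)
    except FailFastError:
        return "backup", backup(request)
```

With `primary = make_flaky_handler(1)` and `backup = make_flaky_handler(0)`, the first request is served by the backup and the next one by the primary. A Bohrbug would fail the same way in both handlers on every attempt, so a takeover or retry would not help.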