Unreliable Failure Detectors for Reliable Distributed Systems • Tushar Deepak Chandra, Sam Toueg • Presentation for EECS 454 • Lawrence Leinweber

Two-Army Problem • Unreliable Channel – Can’t Guarantee Correct Communication – Last Message May be Lost

Byzantine Generals Problem (1) [Figure: generals exchanging reported troop levels] • Unreliable Processors (Traitors) – Report Incorrect Values (Troop Levels)

Byzantine Generals Problem (2) [Figure: vectors of troop levels reported to each general] • Loyal Generals Need to Verify Reports – Use Reports as Votes on Correct Values – That’s About It with the Color Diagrams

Distributed System 1. System of Processors 2. Connected In a Network 3. Running Independently 4. Solving Problems Together

Types of Failure 1. Unreliable Communication Channels 2. Processors Crash or Create Mischief 3. Synchronizing Processors • Atomic Broadcast 4. Problems Agreeing On Results • Consensus

Scope of This Solution 1. Processors Can Crash • Crashed Processors Never Recover • Processors are Not Malicious 2. Reliable Communication Channels 3. Asynchronous • Synchronize After a Finite Number of Steps 4. At Least One Processor is Correct • Every Down Processor is Detected By at Least One Up Processor • At Least One Up Processor is Detected By All Up Processors

Failure Detectors • Attached to Each Processor • Determine the Crash State of Some Processors – Processors Communicate Crash State Information • Imperfect – Suspect Processors Crashed – Slow Processors Might Become “Unsuspected” – Cause Host Processor to Abandon Other Processors
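One common way to build such an imperfect detector, which the slides do not spell out, is a heartbeat timeout. The sketch below is only an illustration under that assumption: the class `TimeoutDetector`, its method names and the timeout-doubling policy are hypothetical. It shows a slow processor being suspected and later "unsuspected".

```python
class TimeoutDetector:
    """Suspects a peer when its heartbeat is overdue (illustrative sketch)."""

    def __init__(self, peers, timeout=1.0):
        self.timeout = {q: timeout for q in peers}   # per-peer patience
        self.last_seen = {q: 0.0 for q in peers}     # time of last heartbeat
        self.suspected = set()

    def heartbeat(self, q, now):
        """Record a heartbeat from q; unsuspect q if it was only slow."""
        self.last_seen[q] = now
        if q in self.suspected:
            self.suspected.discard(q)
            self.timeout[q] *= 2   # assumed policy: wait longer next time

    def check(self, now):
        """Suspect every peer whose heartbeat is overdue; return the suspects."""
        for q, seen in self.last_seen.items():
            if now - seen > self.timeout[q]:
                self.suspected.add(q)
        return set(self.suspected)


d = TimeoutDetector(["q", "r"], timeout=1.0)
d.heartbeat("q", now=0.5)
d.heartbeat("r", now=0.5)
first = d.check(now=2.0)    # both heartbeats are overdue, so both are suspected
d.heartbeat("q", now=2.1)   # q was only slow, so it becomes unsuspected
second = d.check(now=2.5)   # r is still silent
```

Growing the timeout after each mistake is one way such a detector could stop making the same mistake about a slow but live processor.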

Completeness & Accuracy • Completeness – Down Processors are Abandoned • Accuracy – Up Processors are Not Abandoned

Function Definitions • abandons(p, q, t) – Processor p Abandons Processor q at Time t • isDown(q, t) – Processor q is Really Down at Time t

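Completeness • Strong Completeness – Every Down Processor is Abandoned by Every Up Processor Eventually – $\forall p, \forall q, \exists t_0, \forall t > t_0: \mathit{isDown}(q, t) \Rightarrow \mathit{abandons}(p, q, t)$ • Weak Completeness – Every Down Processor is Abandoned by At Least One Up Processor Eventually – $\forall q, \exists p, \exists t_0, \forall t > t_0: \mathit{isDown}(q, t) \Rightarrow \mathit{abandons}(p, q, t)$ • In Both Formulas p Ranges Over Up Processors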

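Accuracy • Strong Accuracy (Perpetual/Eventual) – Every Up Processor is Not Abandoned by Every Processor Ever/Eventually – Perpetual: $\forall p, \forall q, \forall t: \lnot\mathit{isDown}(q, t) \Rightarrow \lnot\mathit{abandons}(p, q, t)$ – Eventual: $\forall p, \forall q, \exists t_0, \forall t > t_0: \lnot\mathit{isDown}(q, t) \Rightarrow \lnot\mathit{abandons}(p, q, t)$ • Weak Accuracy (Perpetual/Eventual) – At Least One Up Processor is Not Abandoned by Any Processor Ever/Eventually – Perpetual: $\exists q, \forall p, \forall t: \lnot\mathit{isDown}(q, t) \Rightarrow \lnot\mathit{abandons}(p, q, t)$ – Eventual: $\exists q, \forall p, \exists t_0, \forall t > t_0: \lnot\mathit{isDown}(q, t) \Rightarrow \lnot\mathit{abandons}(p, q, t)$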
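The completeness and perpetual accuracy properties can be checked mechanically on a recorded run. The sketch below is an illustration, not part of the original presentation: the trace layout (`history[t][p]` is the set p abandons at time t, `crash_time[q]` is when q crashed) is an assumption, and "eventually" is approximated by the last step of a finite trace.

```python
def is_down(crash_time, q, t):
    """isDown(q, t): q crashed at or before time t."""
    return q in crash_time and t >= crash_time[q]


def abandons(history, p, q, t):
    """abandons(p, q, t): p's detector suspects q at time t."""
    return q in history[t][p]


def completeness(history, crash_time, procs, strong):
    last = len(history) - 1            # finite-trace stand-in for "eventually"
    up = [p for p in procs if not is_down(crash_time, p, last)]
    down = [q for q in procs if is_down(crash_time, q, last)]
    combine = all if strong else any   # every up processor vs. at least one
    return all(combine(abandons(history, p, q, last) for p in up) for q in down)


def perpetual_accuracy(history, crash_time, procs, strong):
    last = len(history) - 1
    up = [q for q in procs if not is_down(crash_time, q, last)]

    def never_abandoned(q):
        # q is not abandoned by any running processor at any time
        return all(not abandons(history, p, q, t)
                   for t in range(len(history))
                   for p in procs if not is_down(crash_time, p, t))

    combine = all if strong else any   # every up processor vs. at least one
    return combine(never_abandoned(q) for q in up)


procs = ["a", "b", "c"]
crash_time = {"c": 1}                  # c crashes at time 1
history = [
    {"a": set(), "b": set(), "c": set()},
    {"a": {"c"}, "b": {"a"}, "c": set()},   # b wrongly abandons a
    {"a": {"c"}, "b": {"c"}, "c": set()},
]
strong_complete = completeness(history, crash_time, procs, strong=True)
strong_accurate = perpetual_accuracy(history, crash_time, procs, strong=True)
weak_accurate = perpetual_accuracy(history, crash_time, procs, strong=False)
```

In this trace b is never abandoned but a is, so the run satisfies strong completeness and weak accuracy without strong accuracy.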

Classes of Failure Detectors • 8 Combinations of Completeness and Accuracy

                      Perpetual Accuracy      Eventual Accuracy
                      Strong      Weak        Strong      Weak
Strong Completeness   P           S           ◇P          ◇S
Weak Completeness     Q           W           ◇Q          ◇W

Reducibility (Emulation) • Some Classes are More Powerful Than Others – Strong Complete Can Emulate Weak Complete • Some Classes Can Emulate Others Using an Algorithm: – Up Processors Share Lists of Abandoned Processors, Exclude Themselves – Abandoned by One Becomes Abandoned by All – Weak Complete Can Emulate Strong Complete
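The list-sharing step can be sketched as follows. This is a simplified, round-based illustration rather than the asynchronous algorithm itself: the function name, the `rounds` parameter and the assumption that every message between up processors arrives within a round are all hypothetical.

```python
def emulate_strong_completeness(local_suspects, up, rounds=1):
    """Turn weakly complete suspect lists into strongly complete ones.

    local_suspects[p] is what p's own detector reports; up lists the
    processors that are still running and can exchange messages.
    """
    output = {p: set() for p in up}
    for _ in range(rounds):
        for q in up:                # q shares its list of abandoned processors
            for p in up:            # p receives it
                # Adopt q's suspects, but exclude q: it is evidently alive.
                output[p] = (output[p] | local_suspects[q]) - {q}
    return output


# d has crashed, but only a's detector noticed (weak completeness).
local = {"a": {"d"}, "b": set(), "c": set()}
shared = emulate_strong_completeness(local, up=["a", "b", "c"])
```

After one exchange every up processor abandons d, so a processor abandoned by one becomes abandoned by all.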

Completeness Classes Are Equivalent • 4 Distinct Accuracy Classes

                      Perpetual Accuracy      Eventual Accuracy
                      Strong      Weak        Strong      Weak
Strong Completeness   P           S           ◇P          ◇S
Weak Completeness     Q           W           ◇Q          ◇W

Relationship of Accuracy Classes • Perpetual is More Powerful Than Eventual – Perpetual: $\forall t$ – Eventual: $\exists t_0, \forall t > t_0$ • Strong is More Powerful Than Weak – Strong: $\forall q$ – Weak: $\exists q$

Relationship of Failure Detector Classes • P is Most Powerful; ◇S is Least Powerful

                      Perpetual Accuracy      Eventual Accuracy
                      Strong      Weak        Strong      Weak
Strong Completeness   P           S           ◇P          ◇S
Weak Completeness     Q           W           ◇Q          ◇W

The Consensus Problem • Processors Reach Agreement on a Value – Termination: All Up Processors Eventually Decide – Agreement: All Agree to Same Value – Integrity: Decision is Final – Validity: A Proposed Value is Chosen • If They Can Agree on One Thing, They Can Agree on Anything • Algorithms for S and ◇S Detectors – At Least One Up Processor Using S Detectors – A Majority of Up Processors Using ◇S Detectors

Algorithm for S Detectors • S Detectors – At Least One Up Processor is Not Abandoned by Any Up Processor Ever 1. Collect Proposed Values from Each Processor – or the News That the Process Crashed 2. Collect Other Processors’ Knowledge of Proposed Values – Discard Values not Known to All 3. Pick (Consistently) a Value from Known Values • All Processors Get Phase 1 & 2 Information from the Processor That is Never Abandoned
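The three phases can be illustrated with a deliberately simplified simulation. The assumptions are loud ones: phase 1 is collapsed into a single exchange whose outcome is given as input (`heard[p]` is the set of processors whose proposal p collected), every up processor's phase 2 vector reaches every other up processor, and the function name and data layout are hypothetical. It is a sketch of the idea, not the full asynchronous algorithm.

```python
def decide_with_S(proposals, heard):
    """proposals[q]: value proposed by q; heard[p]: whose values p collected."""
    procs = sorted(proposals)
    up = sorted(heard)

    # Phase 1: each up processor knows a value, or None for "abandoned".
    view = {p: {q: proposals[q] if q in heard[p] else None for q in procs}
            for p in up}

    # Phase 2: exchange views and discard values not known to all.
    final = {}
    for p in up:
        merged = dict(view[p])
        for q in up:
            for k in procs:
                if view[q][k] is None:
                    merged[k] = None
        final[p] = merged

    # Phase 3: pick consistently, here the first surviving value in id order.
    decisions = {}
    for p in up:
        decisions[p] = next(final[p][k] for k in procs
                            if final[p][k] is not None)
    return decisions


proposals = {"a": 1, "b": 2, "c": 3, "d": 4}
# d crashed; b wrongly abandoned a; c is never abandoned by anyone.
heard = {"a": {"a", "b", "c"}, "b": {"b", "c"}, "c": {"a", "b", "c"}}
decisions = decide_with_S(proposals, heard)
```

Because c is never abandoned, its value survives in every view, so phase 3 always has something to pick and all up processors pick the same thing.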

Algorithm for ◇S Detectors • Rotating Coordinator – Each Processor Takes Its Turn – Tries to Make Decision – If the Processor is Up and is Not Abandoned by Any Up Processor, the Decision is Made

Each Round of ◇S Algorithm • At Least One Up Processor is Not Abandoned by Any Up Processor Eventually 1. All Processors Send Value and the Round Number to Coordinator 2. Coordinator Waits for a Majority and Sends the Value with the Latest Round Number to All Processors 3. Each Processor Indicates If It Abandoned Coordinator 4. Coordinator Waits for a Majority; If No Processor Abandoned Coordinator, the Value is Decided • Repeat Until a Coordinator is Not Abandoned, Which Happens Eventually
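A single round of the rotating coordinator scheme can be sketched as below. The simulation is sequential and simplified: which processors abandon the coordinator in a round is supplied as input, "the first majority to arrive" is taken to be the first entries of `up`, and all names are hypothetical.

```python
def run_round(r, procs, up, estimates, suspecting):
    """Run round r; return the decided value, or None if the round fails.

    estimates[p] is a (value, round) pair; suspecting is the set of
    processors that abandon this round's coordinator.
    """
    majority = len(procs) // 2 + 1
    assert len(up) >= majority, "a majority of processors must be up"
    coordinator = procs[(r - 1) % len(procs)]   # processors take turns
    if coordinator not in up:
        return None                              # crashed coordinator: no decision

    # Steps 1-2: coordinator gathers a majority of (value, round) pairs
    # and proposes the value carrying the latest round number.
    received = [estimates[p] for p in up][:majority]
    value, _ = max(received, key=lambda e: e[1])

    # Step 3: each processor either adopts the value or reports that it
    # abandoned the coordinator.
    replies = []
    for p in up:
        if p in suspecting:
            replies.append(False)
        else:
            estimates[p] = (value, r)
            replies.append(True)

    # Step 4: decide only if nobody in the majority abandoned the coordinator.
    return value if all(replies[:majority]) else None


procs = ["a", "b", "c"]
up = ["a", "b"]                                   # c has crashed
estimates = {"a": ("x", 0), "b": ("y", 0)}
suspicions = {1: {"b"}}                           # in round 1, b abandons a

decision, rounds_used = None, 0
for r in range(1, 10):
    decision = run_round(r, procs, up, estimates, suspicions.get(r, set()))
    if decision is not None:
        rounds_used = r
        break
```

Round 1 fails because b abandons coordinator a, but a has already adopted "x" with round number 1, so coordinator b must propose "x" in round 2, where it is decided.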

Atomic Broadcast • All Processors Receive the Same Messages in the Same Order • Atomic Broadcast is Equivalent to Consensus – Each Can Be Reduced to the Other – Solution to Consensus Applies to Atomic Broadcast

Atomic Broadcast Reduces to Consensus • Atomic Broadcast Can Be Implemented Using a Consensus Algorithm – Each Processor Proposes a Message – Consensus is Used to Decide Which Message is Recognized as the Next Atomically Broadcast Message
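The reduction can be sketched with consensus treated as a black box. In the illustration below, `consensus` is only a stand-in that returns the smallest proposal (a real implementation would run one of the consensus algorithms), and `pending` and the function names are assumptions.

```python
def consensus(proposals):
    """Stand-in for a real consensus algorithm: returns one proposed value."""
    return min(proposals)


def atomic_broadcast(pending):
    """pending[p]: messages p has received but not delivered, in arrival order.

    Returns the single delivery order that every processor follows.
    """
    remaining = {p: list(msgs) for p, msgs in pending.items()}
    delivered = []
    while any(remaining.values()):
        # Each processor proposes a message it has not yet delivered.
        proposals = [msgs[0] for msgs in remaining.values() if msgs]
        chosen = consensus(proposals)      # agreed next message
        delivered.append(chosen)
        for msgs in remaining.values():
            if chosen in msgs:
                msgs.remove(chosen)
    return delivered


# The two processors received the same messages in different orders.
order = atomic_broadcast({"a": ["m2", "m1"], "b": ["m1", "m2"]})
```

Each loop iteration is one consensus instance, so the processors agree on the next message regardless of the order in which they first received it.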

Consensus Reduces to Atomic Broadcast • Consensus Can Be Implemented Using An Atomic Broadcast Algorithm – To Propose a Value, a Process Atomically Broadcasts It – To Decide, a Process Picks the First Value Delivered – Go to Lunch Early
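In this direction each processor atomically broadcasts its proposal and decides on the first value delivered. A minimal sketch follows, in which `total_order` is a hypothetical stand-in for an atomic broadcast primitive (here it simply sorts the messages so every processor sees the same order).

```python
def total_order(messages):
    """Stand-in for atomic broadcast: the same delivery order for everyone."""
    return sorted(messages)


def consensus_via_atomic_broadcast(proposals):
    """proposals[p]: value proposed (and atomically broadcast) by p."""
    order = total_order(list(proposals.values()))
    # Every processor decides on the first value it is delivered.
    return {p: order[0] for p in proposals}


decisions = consensus_via_atomic_broadcast({"a": 7, "b": 3, "c": 5})
```

Agreement follows from the shared delivery order, and validity holds because only proposed values are ever broadcast.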

Summary • Reliable Distributed Systems • Unreliable Failure Detectors • Relationship of Detector Classes • Algorithms for Consensus • Equivalence with Atomic Broadcast