Broadcast Variants Distributed Systems (DNR)

Why broadcasts?
• distributed systems are inherently group oriented, so it is more useful to talk about one-to-all or one-to-many communication, that is, broadcast and multicast within the broader context of group communication
• most useful where replication is expected, as in database replication and state machine replication, where every server replica has to respond to the same sequence of requests

• compared to unicast communication, broadcast is made more complex by message ordering (at the receiving end) and reliability (the sending process may crash)
• message ordering and reliability are orthogonal to each other, and hybrid models that combine the two are common

[Figure: p1 broadcasts in FIFO order but the messages are received out of order; p2 crashes in the middle of a broadcast]

• message ordering definitions:
– FIFO order – if a process p sends m1 before it sends m2, then m2 is not delivered at a process q before m1 (easily implemented using message sequence numbers)
– total order – if a process (correct or faulty) p delivers a message m1 before m2, then every process delivers m2 only after it has delivered m1
– causal order – if m1 happens before m2 (m1 → m2), then no process q delivers m2 before it has delivered m1
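
The remark that FIFO order is easily implemented with sequence numbers can be sketched as follows. This is a minimal illustration, not part of the slides: the class name `FifoReceiver` and its fields are hypothetical, and each sender is assumed to number its messages 0, 1, 2, ...

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers the messages of each sender in the order they were sent."""

    def __init__(self):
        self.next_seq = defaultdict(int)  # next expected sequence number per sender
        self.held = defaultdict(dict)     # sender -> {seq: message} that arrived early
        self.delivered = []               # delivery log, in delivery order

    def receive(self, sender, seq, message):
        if seq < self.next_seq[sender]:
            return  # duplicate of an already delivered message
        self.held[sender][seq] = message
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            m = self.held[sender].pop(self.next_seq[sender])
            self.delivered.append((sender, m))
            self.next_seq[sender] += 1


q = FifoReceiver()
q.receive("p", 1, "m2")  # arrives early and is held back
q.receive("p", 0, "m1")  # fills the gap, so m1 and then m2 are delivered
q.receive("p", 0, "m1")  # duplicate, ignored
```

The receiver never reorders across senders; it only withholds a message until all lower-numbered messages from the same sender have been delivered.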

• causal ordering ⇒ single-source FIFO ordering
• total ordering does not imply FIFO or causal ordering
• a combination is possible: FIFO-total order broadcast (which enforces single-source FIFO), or causal-total order broadcast (which preserves causality)

[Figure: m1 → m2 (FIFO) and m1 → m3 (causal) are both maintained in the total order seen by p1 and p3]

• we will discuss:
– best effort broadcast (BEBcast)
– reliable broadcast (RBcast)
– terminating reliable broadcast (TRBcast)
– uniform reliable broadcast (URBcast)
– (uniform reliable) causal order broadcast (COBcast)
– (uniform reliable) total order broadcast (ABcast, or atomic broadcast)

Assumptions
• groups are static: dynamic groups are not addressed here
• processes do not have access to stable storage (no fail-recovery)
• the system is asynchronous, and communication at the network level is point-to-point
• processes are fail-stop unless otherwise stated

• channels: two interpretations of the liveness criterion:
– reliable channel – a reliable channel between processes p and q ensures the following: if p executes send(m) and q is correct, then q eventually receives m
– quasi-reliable channel – a quasi-reliable channel between processes p and q ensures the following: if p and q are correct and p executes send(m), then q eventually receives m

• reliable vs. quasi-reliable:
– let process q be correct; a reliable channel implies that if p executes send(m) at time t and crashes at time t+1, then q must still eventually receive m; this is a useful model of a shared persistent space
– a quasi-reliable channel is weaker: the guarantee holds only if both p and q are correct; this is a useful model of TCP with error recovery

Best effort broadcast (BEBcast)
• the burden of ensuring reliability is only on the sender: as long as the sender of a message does not crash, the properties of a quasi-reliable channel ensure that all correct processes eventually deliver the message
• operations:
– at p, BEBcast(m): for every process q ≠ p, send(m) by reliable unicast
– on receive(m) at q: BEBdeliver(m) at q

• transport-level mechanisms: reliable unicast over TCP (which suffers from the ack-implosion problem), or IP multicast

• properties:
– validity (a liveness property) – for any two correct processes p and q, every message broadcast by p is eventually delivered by q
– integrity (a safety property) – for any message m, every correct process q delivers m at most once, and only if m was previously broadcast by some process p
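
A minimal in-memory sketch of best effort broadcast may help to see why the guarantee depends on the sender. Everything here is an illustrative assumption rather than part of the slides: the `Process` class, the `group` dictionary standing in for the static group, direct method calls standing in for reliable unicast, and the `crash_after` parameter used to simulate a sender crash.

```python
class Process:
    """A process in a static group; `group` maps pid -> Process."""

    def __init__(self, pid, group):
        self.pid = pid
        self.group = group
        self.delivered = []
        group[pid] = self

    def beb_cast(self, m, crash_after=None):
        # Send m to every other process over (assumed reliable) unicast.
        # crash_after=k simulates the sender crashing after k sends.
        targets = [q for q in sorted(self.group) if q != self.pid]
        for sent, q in enumerate(targets):
            if crash_after is not None and sent == crash_after:
                return  # sender crashed in the middle of the broadcast
            self.group[q].receive(self.pid, m)

    def receive(self, sender, m):
        # BEBdeliver at most once per (sender, message) pair.
        if (sender, m) not in self.delivered:
            self.delivered.append((sender, m))


group = {}
p, q, r = (Process(name, group) for name in ("p", "q", "r"))
p.beb_cast("m1")                 # sender stays correct: q and r both deliver
p.beb_cast("m2", crash_after=1)  # sender crashes after one send: only q delivers
```

After the second call, q and r disagree on m2, which is exactly the gap that reliable broadcast closes.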

Reliable broadcast (RBcast)
• in best effort broadcast, if the sender fails immediately after broadcasting, end-to-end error recovery is no longer possible, so the correct processes might disagree on whether or not to deliver the message
• reliable broadcast ensures that correct processes agree on the messages they deliver even when the sender crashes, i.e., it adheres to the properties of a reliable channel

• reliable broadcast is built on top of best effort broadcast plus a failure detector abstraction

• operations:
– at p, RBcast(m) ⇒ BEBcast(m)
– at q, BEBdeliver(m) ⇒ RBdeliver(m)
– if q (unreliably) detects that p has crashed, then q executes BEBcast(m) for the messages it has delivered from p
– note: other correct processes that receive these retransmissions must handle duplicates properly

• properties:
– validity – if a correct process p broadcasts a message m, then p eventually delivers m
– integrity – for a message m, a correct process q delivers m at most once, and only if m was previously broadcast by some process p
– agreement (a liveness property) – if a correct process p delivers a message m, then m is eventually delivered by every correct process q
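
The relay-on-suspicion idea behind reliable broadcast can be sketched as below. This is a simplified single-threaded simulation under stated assumptions: `RBProcess`, `on_crash` (the failure detector callback) and the `targets` parameter (used only to cut a broadcast short) are hypothetical names, and method calls stand in for message passing.

```python
class RBProcess:
    def __init__(self, pid, group):
        self.pid, self.group = pid, group
        self.crashed = False
        self.delivered = []     # RBdelivered (origin, message) pairs
        self.suspected = set()  # processes reported as crashed by the failure detector
        group[pid] = self

    def beb_cast(self, origin, m, targets=None):
        for q in (sorted(self.group) if targets is None else targets):
            self.group[q].beb_deliver(origin, m)

    def rb_cast(self, m, targets=None):
        # `targets` lets the demo cut a broadcast short to mimic a crash.
        self.beb_cast(self.pid, m, targets)

    def beb_deliver(self, origin, m):
        if self.crashed or (origin, m) in self.delivered:
            return                          # crashed, or duplicate: drop
        self.delivered.append((origin, m))  # RBdeliver
        if origin in self.suspected:
            self.beb_cast(origin, m)        # origin already suspected: relay at once

    def on_crash(self, p):
        self.suspected.add(p)
        for origin, m in list(self.delivered):
            if origin == p:
                self.beb_cast(origin, m)    # relay everything delivered from p


group = {}
p, q, r = (RBProcess(name, group) for name in ("p", "q", "r"))
p.rb_cast("m", targets=["p", "q"])  # p crashes before reaching r
p.crashed = True
before = list(r.delivered)
q.on_crash("p")                     # q's failure detector fires, so q relays m
```

The duplicate check in `beb_deliver` is what keeps the relays from violating integrity.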

• is the following run acceptable?
– process p executes RBcast(m) and later crashes; some process q RBdelivers m and then crashes; all other processes are correct, but none of them RBdelivers m
– yes: p crashes, so validity is not violated, and q is not a correct process, so agreement is not violated either

Uniform reliable broadcast (URBcast)
• consider the scenario discussed earlier: process p1 executes RBcast(m) and later crashes; some process p2 RBdelivers m and then crashes; all other processes are correct, but none of them RBdelivers m
• this run satisfies reliable broadcast, yet it seems to be lacking in some respect

• the problem is that q RBdelivers m first and only rebroadcasts it later, if the failure of the source is detected
• URBcast ensures that a process (correct or not) delivers a message only when it knows that the message has been seen (BEBdelivered) by all correct processes
• the uniform property is important if, say, processes interact with the outside world: the fact that a process delivered a message matters even if it crashed afterwards, because it may have communicated with the external world before crashing; other processes must be aware of this situation

• the agreement property is replaced by uniform agreement – if some process (correct or not) p delivers a message m, then m is eventually delivered by every correct process q
• the reliable channel assumption holds: if p executes send(m) to q and q is correct, then q eventually receives m

• operations:
– at p, URBcast(m) ⇒ BEBcast(m)
– at q, on BEBdeliver(m): if m is received by q for the first time and q ≠ p, then BEBcast(m); then URBdeliver(m)

Causal order broadcast (COBcast)
• reliable broadcast does not guarantee any ordering among messages delivered by different processes
• single-source FIFO ordering is a special case of causal ordering, where messages from the same process should be delivered in the order they were broadcast

• practical scenario:
– on a publish-subscribe whiteboard, p1 broadcasts a proposal m1 to all, which p2 sees and replies to with a comment m2, also sent to all
– here m1 → m2
– due to an arbitrary delay, p3 receives m2 before m1 and has to withhold m2 until m1 arrives
– a suitable "middleware" for causal ordering would relieve the programmer from performing such a task

• we say that a message m1 may potentially have caused another message m2 (written m1 → m2) if any of the following applies:
– m1 and m2 were broadcast by the same process p, and m1 was broadcast before m2
– m1 was delivered by process p, and m2 was broadcast by p after the delivery of m1
– there exists some message m' such that m1 → m' and m' → m2

• additional property:
– causal delivery – no process p delivers a message m2 unless p has already delivered every message m1 such that m1 → m2
• causally ordered broadcast can be achieved in the presence of crash failures
• when RBcast is replaced by URBcast, we get a uniform reliable causally ordered broadcast
• two implementations are discussed:

no-waiting causal broadcast
• whenever a process RBdelivers m, it COdelivers m without waiting for other messages to be RBdelivered
• algorithm outline:
– each message m carries a control field pastm, which includes all messages that causally precede m

– when a message m is RBdelivered, pastm is inspected first: all messages in pastm that have not yet been COdelivered must be COdelivered before m itself is COdelivered
– each process memorises all messages it has COBcast or COdelivered in a variable past_list
– past_list and pastm are ordered sets

at pi:
  init: past_list = delivered_list = empty;
  upon <COBcast(m)> {
    RBcast(m, past_list);
    past_list = past_list ∪ {m};
  }
  upon <RBdeliver(pj, pastm, m)>
    if (m ∉ delivered_list) then {
      for all messages m' ∈ pastm not delivered so far {  // in deterministic order
        COdeliver(m');
        delivered_list = delivered_list ∪ {m'};
        past_list = past_list ∪ {m'};
      }
      COdeliver(pj, m);
      delivered_list = delivered_list ∪ {m};
      past_list = past_list ∪ {m};
    }
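
The no-waiting algorithm can be tried out with a small Python sketch. It is an illustration only: the class `NoWaitCausal` is a hypothetical name, a Python list plays the role of the ordered set, and the packet returned by `co_broadcast` is assumed to be handed to an underlying RBcast that is not modelled here.

```python
class NoWaitCausal:
    def __init__(self, pid):
        self.pid = pid
        self.past = []       # ordered causal past: (sender, msg) broadcast or delivered
        self.delivered = []  # COdelivered (sender, msg) pairs, in delivery order

    def co_broadcast(self, m):
        packet = (self.pid, m, list(self.past))  # snapshot of the causal past
        self.past.append((self.pid, m))
        return packet                            # would be handed to RBcast

    def rb_deliver(self, packet):
        sender, m, past_m = packet
        if (sender, m) in self.delivered:
            return                 # already delivered through someone's past
        for entry in past_m:       # past_m is already in causal order
            if entry not in self.delivered:
                self._co_deliver(entry)
        self._co_deliver((sender, m))

    def _co_deliver(self, entry):
        self.delivered.append(entry)
        if entry not in self.past:
            self.past.append(entry)


p1, p2, p4 = NoWaitCausal("p1"), NoWaitCausal("p2"), NoWaitCausal("p4")
pkt1 = p1.co_broadcast("m1")
p2.rb_deliver(pkt1)           # p2 sees m1 ...
pkt2 = p2.co_broadcast("m2")  # ... and replies, so m1 -> m2
p4.rb_deliver(pkt2)           # m2 arrives first, carrying m1 in its past
p4.rb_deliver(pkt1)           # the late m1 is discarded as a duplicate
```

Because m2 carries m1 in its past, p4 never has to wait; the cost is that every packet grows with the causal history.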

• in the accompanying figure, p4 RBdelivers m2 first, but since the message carries m1 in its pastm, m1 and m2 are COdelivered in order; when m1 is finally RBdelivered from p1, it is discarded
• weakness: large message size, due to the past causal history that every message carries

waiting causal order broadcast
• instead of keeping a record of all past messages, history is now represented by vector clocks
• vector clocks capture the causal precedence between messages
• as before, waiting COBcast relies on the underlying RBcast and RBdeliver primitives

• every process p maintains a vector clock that records the number of messages p has COdelivered from every other process, i.e., VCp[j], j = 1..n, j ≠ p, and the number of messages it has itself COBcast, i.e., VCp[p]
• this vector is attached to every message m that p COBcasts
• a process q that RBdelivers m interprets this vector time stamp to determine how many messages are missing (if any), and from which process

• the messages that precede m are the VCp[p]-1 earlier messages from p itself, together with all messages delivered by p before it sent m, that is, VCp[k] messages from each k ≠ p
• process q needs to COdeliver all of these missing messages before it can COdeliver m

• at p2, the vector time stamp [0, 2, 0] on a message from p1 implies that one earlier message from p1 is still missing, that the stamped message itself is RBdelivered but must wait to be COdelivered, and that nothing is missing from p0

at pi:
  init: pending = empty; ∀i, j: VCi[j] = 0;
  // pending is kept in increasing order of vector time
  upon COBcast(m) {
    COdeliver(pi, m);  // deliver locally
    VCi[i]++;          // count m before stamping it, so the first message carries 1
    RBcast(VCi, pi, m);
  }
  upon RBdeliver(VCj, pj, m), for i ≠ j {  // ignore messages from self
    augment pending with (VCj, pj, m);
    wait until VCj[j] = VCi[j] + 1 and ∀k ≠ j: VCj[k] ≤ VCi[k];
    remove (VCj, pj, m) from pending;
    COdeliver(pj, m);
    VCi[j]++;
  }
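
The waiting rule can be exercised with a short Python sketch. It assumes that the sender increments its own clock entry before stamping the message (so the first message from pj carries VCj[j] = 1) and that the comparison on the other entries ranges over k ≠ j; the class `WaitingCausal` and the integer process ids are hypothetical, and the returned packet stands in for a call to RBcast.

```python
class WaitingCausal:
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n    # vc[j]: messages COdelivered from j (own broadcasts for j == pid)
        self.pending = []    # RBdelivered packets that are not yet COdeliverable
        self.delivered = []  # COdelivered (sender, msg) pairs

    def co_broadcast(self, m):
        self.delivered.append((self.pid, m))  # deliver locally
        self.vc[self.pid] += 1
        return (list(self.vc), self.pid, m)   # packet for RBcast

    def rb_deliver(self, packet):
        if packet[1] == self.pid:
            return                            # ignore messages from self
        self.pending.append(packet)
        progress = True
        while progress:                       # one delivery may unblock others
            progress = False
            for pkt in list(self.pending):
                if self._deliverable(pkt):
                    self.pending.remove(pkt)
                    self.delivered.append((pkt[1], pkt[2]))
                    self.vc[pkt[1]] += 1
                    progress = True

    def _deliverable(self, pkt):
        ts, j, _ = pkt
        # Next message from j, and nothing j had seen is still missing here.
        return ts[j] == self.vc[j] + 1 and all(
            ts[k] <= self.vc[k] for k in range(len(ts)) if k != j)


p0, p1, p2 = (WaitingCausal(i, 3) for i in range(3))
pkt1 = p0.co_broadcast("m1")  # stamped [1, 0, 0]
p1.rb_deliver(pkt1)
pkt2 = p1.co_broadcast("m2")  # stamped [1, 1, 0], so m1 -> m2
p2.rb_deliver(pkt2)           # must wait: m1 from p0 is missing
waiting = list(p2.pending)
p2.rb_deliver(pkt1)           # m1 arrives, then m2 becomes deliverable
```

Compared with the no-waiting variant, each packet carries only n counters, at the price of holding messages back.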

Total order broadcast (TOBcast)
• causal order broadcast enforces a global ordering for all messages that causally depend on each other
• messages that do not are said to be concurrent and could be delivered in any order
• a total order abstraction orders all messages, even those that are concurrent
• it is sometimes possible to have a total order that does not respect causal order
• a convenient abstraction for managing replicated state machines (e.g., in fault-tolerant servers)

• totally ordered reliable broadcast cannot be achieved in the presence of crash failures when the underlying communication is asynchronous
• this is because totally ordered broadcast is equivalent to consensus; recall that consensus cannot be solved in an asynchronous system with failures (FLP result)
• assumptions: asynchronous with no process failures, or synchronous with fail-stop processes
• how do we achieve causal-total order broadcast?

• properties:
– validity – if a correct process p broadcasts a message m, then p eventually delivers m
– integrity – for a message m, a correct process q delivers m at most once, and only if m was previously broadcast by some process p
– uniform agreement (atomicity in delivery) – if a process p delivers a message m, then m is eventually delivered by every correct process q
– uniform total order (an order property) – if a process (correct or faulty) p delivers a message m1 before m2, then every process delivers m2 only after it has delivered m1

• algorithm 1 – asynchronous with no process failures
• assume reliable channels (the stronger condition is acceptable under the no-failure assumption) with single-source FIFO delivery (each process stamps sequence numbers)
• each process maintains an increasing counter, a time stamp, which is attached to every message it broadcasts
• each process also maintains a vector with estimates of the time stamps of all other processes

• suppose ts[j] is the vector element on pi that corresponds to pj; it says that pi will never again receive a message from pj with a time stamp smaller than or equal to this value
• processes use special time stamp update messages to keep the estimates up to date
• RBdelivered messages are queued in a pending list in increasing order of <ts(m), pid> pairs; the pid is used to break ties
• ABdeliver can be done for the message at the head of the pending list once its time stamp is no greater than any element of the process's current vector, since no message with a smaller time stamp can arrive any more

at pi (0 ≤ i ≤ n-1):
  init: ts[j] = 0 (0 ≤ j ≤ n-1); pending = empty;
  ABcast(m) {
    ts[i]++;
    add (m, ts[i], pi) to pending;
    RBcast(m, ts[i], pi);
  }
  upon RBdeliver(m, ts(msg), pj), j ≠ i {  // ignore own messages
    ts[j] = ts(msg);
    add (m, ts(msg), pj) to pending;
    if (ts(msg) > ts[i]) then {
      ts[i] = ts(msg);
      RBcast(new_ts, ts[i], pi);
    }
  }
  upon RBdeliver(new_ts, ts(new_ts), pj), j ≠ i  // ignore own messages
    ts[j] = ts(new_ts);
  delivery_test() {  // at any time
    while (m, ts(msg), pj) is at the head of pending and ∀k: ts(msg) ≤ ts[k] {
      remove (m, ts(msg), pj) from pending;
      ABdeliver(m);
    }
  }
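
A small failure-free simulation of the time stamp algorithm is sketched below. The names `TSProcess`, `outbox` and `run` are hypothetical; `run` is a stand-in for reliable FIFO broadcast that hands each queued packet to every process in sending order, and a heap ordered by (time stamp, pid) plays the role of the pending list.

```python
import heapq


class TSProcess:
    def __init__(self, pid, n):
        self.pid, self.ts = pid, [0] * n
        self.pending = []    # heap of (time stamp, sender pid, message)
        self.delivered = []  # ABdelivered messages, in delivery order
        self.outbox = []     # packets waiting to be RBcast: (kind, message, ts, sender)

    def ab_cast(self, m):
        self.ts[self.pid] += 1
        heapq.heappush(self.pending, (self.ts[self.pid], self.pid, m))
        self.outbox.append(("msg", m, self.ts[self.pid], self.pid))
        self._delivery_test()

    def rb_deliver(self, packet):
        kind, m, t, j = packet
        if j == self.pid:
            return                       # ignore own messages
        self.ts[j] = t
        if kind == "msg":
            heapq.heappush(self.pending, (t, j, m))
            if t > self.ts[self.pid]:    # catch up and tell the others
                self.ts[self.pid] = t
                self.outbox.append(("new_ts", None, t, self.pid))
        self._delivery_test()

    def _delivery_test(self):
        # The head is safe once every process is known to have reached its time stamp.
        while self.pending and all(self.pending[0][0] <= t for t in self.ts):
            self.delivered.append(heapq.heappop(self.pending)[2])


def run(procs):
    """Deliver queued packets to everyone, preserving per-sender FIFO order."""
    progress = True
    while progress:
        progress = False
        for p in procs:
            while p.outbox:
                pkt = p.outbox.pop(0)
                for q in procs:
                    q.rb_deliver(pkt)
                progress = True


procs = [TSProcess(i, 3) for i in range(3)]
procs[0].ab_cast("a")  # concurrent broadcasts, both stamped 1
procs[1].ab_cast("b")
run(procs)
```

The two concurrent messages tie on the time stamp, and the pid breaks the tie the same way at every process.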

[Figure: total order broadcast with time stamps]

Total order broadcast by consensus
• uses reliable broadcast and consensus as building blocks
• messages are first disseminated using a reliable broadcast primitive and are stored in a bag of unordered messages at every process
• processes then use consensus to order the messages in the bag

• the algorithm works in rounds
• there is one consensus instance per round
• the messages to be delivered in a round are agreed upon before proceeding to the next round
• RBcast can be replaced with URBcast to give "uniform total order broadcast"
• algorithm 2 – synchronous with fail-stop processes

init: unordered = delivered = empty; round = 1; wait = false;
TOBcast(m) {
  RBcast(m);
}
upon RBdeliver(m) {
  if (m ∉ delivered) then unordered = unordered ∪ {m};
}
upon ((unordered ≠ empty) ∧ (wait = false)) {
  wait = true;
  propose(round, unordered);  // propose() and decide() are consensus primitives
}
upon (m' = decide(round)) {  // may take f+1 rounds in case of failures
  delivered = delivered ∪ m';
  unordered = unordered \ m';
  TOdeliver(m');  // in a deterministic order within the decided batch
  round++;
  wait = false;
}
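
The round structure can be illustrated with a Python sketch in which consensus is replaced by a stand-in. The function `consensus_round` simply picks the first non-empty proposal and hands it to every process; a real consensus protocol would be needed in practice, and all names here (`TOProcess`, `on_decide`, `consensus_round`) are hypothetical.

```python
class TOProcess:
    def __init__(self):
        self.unordered = set()  # RBdelivered but not yet ordered
        self.delivered = []     # TOdelivered messages, in delivery order

    def rb_deliver(self, m):
        if m not in self.delivered:
            self.unordered.add(m)

    def propose(self):
        return frozenset(self.unordered)

    def on_decide(self, batch):
        for m in sorted(batch):  # same deterministic order at every process
            if m not in self.delivered:
                self.delivered.append(m)
        self.unordered -= set(batch)


def consensus_round(procs):
    """Stand-in for one consensus instance: every process gets the same decision."""
    proposals = [p.propose() for p in procs if p.unordered]
    if not proposals:
        return False
    decision = proposals[0]      # assumption: any proposed value may be chosen
    for p in procs:
        p.on_decide(decision)
    return True


p0, p1 = TOProcess(), TOProcess()
p0.rb_deliver("a")       # the two processes see the messages in different orders
p1.rb_deliver("b")
consensus_round([p0, p1])  # round 1 decides {"a"}
p0.rb_deliver("b")
p1.rb_deliver("a")       # already delivered, so it is ignored
while consensus_round([p0, p1]):
    pass                 # round 2 decides {"b"}
```

Both processes end with the same delivery sequence even though reliable broadcast handed them the messages in different orders.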

Terminating reliable broadcast (TRBcast)
• uniform reliable broadcast says that if some process (correct or not) p delivers a message m, then m is eventually delivered by every correct process q
• however, q cannot decide whether it should wait for m or not; q has no means to distinguish the case where some process has delivered m, and where q can indeed wait for m, from the case where no process will ever deliver m, in which case q should definitely not keep waiting for m

• suppose a process r URBcasts a message m but crashes while doing so, and another process p detects that r has crashed without having seen m
• this does not mean that m was not broadcast
• this nuance is captured by terminating reliable broadcast
• TRBcast ensures precisely that every process q delivers either the message m or some indication F that m will never be delivered (by any process); the abstraction is defined for a specific originator process src

• properties:
– validity – if the sender src is correct and broadcasts a message m, then src eventually delivers m
– integrity – if a correct process delivers a message m, then either m = F or m was previously broadcast by src
– uniform agreement – if any process delivers a message m, then m is eventually delivered by every correct process
• assumptions: synchronous with fail-stop processes

• underlying abstractions: a perfect failure detector, consensus, and best effort broadcast
• the source src identifies itself as the originator in the message m that it best-effort broadcasts to all
• a participant joins in the TRBcast by broadcasting a special null message
• every process waits until it either gets the message broadcast by the sender or detects the crash of the sender
• all processes then run a consensus instance to agree on whether to deliver m or the failure notification F
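
The steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration: `trb`, `consensus` and the `received` map are hypothetical, the perfect failure detector is modelled by a `None` entry (meaning "this process detected the crash of src before seeing m"), and the consensus stand-in simply returns the proposal of the lowest-numbered process.

```python
F = "F"  # failure indication delivered in place of the message


def consensus(proposals):
    # Stand-in: a real consensus returns one of the proposed values to everyone.
    return proposals[min(proposals)]


def trb(received):
    """received maps pid -> message BEBdelivered from src, or None when the
    process instead detected the crash of src."""
    proposals = {pid: (m if m is not None else F) for pid, m in received.items()}
    decision = consensus(proposals)
    return {pid: decision for pid in proposals}  # every process TRBdelivers the decision


partial = trb({1: "m", 2: None, 3: None})  # src crashed after reaching only process 1
silent = trb({1: None, 2: None, 3: None})  # src crashed before sending anything
```

When some processes saw m and others saw only the crash, either outcome is legal; what matters is that consensus makes all of them deliver the same one.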

init: proposal = decision = null;
TRBcast(m, psrc) {
  BEBcast(m);
}
upon (BEBdeliver(m, psrc) ∧ (proposal = null)) {
  proposal = m;
  propose(proposal);
}
upon ((psrc_crash) ∧ (proposal = null)) {
  proposal = Fsrc;
  propose(proposal);
}
upon decide(decision)  // consensus round
  TRBdeliver(decision, psrc);

[Scanned figures in the slides have been extracted from the textbooks of R. Guerraoui and H. Attiya]