Savitribai Phule Pune University
Fourth Year of Computer Engineering (2015 Course)
Elective II 410245(A): Distributed Systems
Teaching Scheme: TH: 03 Hours/Week
Credit: 03
Examination Scheme: In-Sem (Paper): 30 Marks; End-Sem (Paper): 70 Marks
(Unit-5) Group Communication
Prof. (Mr.) Rahul B. Diwate, www.rahuldiwate.com
Department of Computer Engineering, AISSMSIOIT, Pune

Acknowledgement
All the contents of this presentation are taken from the textbook:
Sukumar Ghosh, “Distributed Systems: An Algorithmic Approach”, Chapman and Hall, CRC Press, Second Edition, 2015, ISBN 10: 1584885645, ISBN 13: 9781584885641.

Unit-5 Contents
- Group Communication: Atomic multicast, IP Multicast, Application layer multicast, Ordered multicast, Reliable multicast, Open groups.
- Replicated Data Management: Architecture of replicated Data Management, Data-Centric Consistency models, Client centric consistency protocols, Implementation of Data-Centric Consistency models, Quorum based protocols, Replica Placement, Brewer's CAP theorem.

Sr. No. | Topic Name | Book Ref. | Page No.
1 | Group Communication: Atomic multicast | T2 | 317
2 | IP Multicast, Application layer multicast | T2 | 320, 321
3 | Ordered multicast | T2 | 322
4 | Reliable multicast | T2 | 326
5 | Open groups | T2 | 329
6 | Replicated Data Management: Architecture of replicated Data Management | T2 | 339, 340
7 | Data-Centric Consistency models | T2 | 351, 352
8 | Implementation of Data-Centric Consistency models, Quorum based protocols | T2 | 354, 355
9 | Replica Placement, Brewer's CAP theorem | |
Text Book (T2): Sukumar Ghosh, “Distributed Systems: An Algorithmic Approach”, Chapman and Hall, CRC Press, Second Edition, 2015, ISBN 10: 1584885645, ISBN 13: 9781584885641.

Group Communication
A group is a collection of users sharing some common interest. Group-based activities are steadily increasing. There are many types of groups:
- Open group (anyone can join, e.g., customers of Walmart)
- Closed group (membership is closed, e.g., the class of 2000)
- Peer-to-peer group (all members have equal status, e.g., graduate students of a CS department, or participants in a videoconference / netmeeting)
- Hierarchical group (one or more members are distinguished from the rest, e.g., the president and the employees of a company, or distance learning)

Group Communication
- In fault-tolerant distributed computing, an atomic broadcast or total order broadcast is a broadcast where all correct processes in a system of multiple processes receive the same set of messages in the same order; that is, the same sequence of messages.
- The broadcast is termed "atomic" because it either eventually completes correctly at all participants, or all participants abort without side effects.
- Atomic broadcasts are an important distributed computing primitive.

Group Communication
- A group is a collection of users or objects sharing a common interest or working toward a common goal. With the rapid growth of the WWW and electronic commerce, group-oriented activities have substantially increased in recent years.
- Examples of groups are as follows:
  (1) the batch of students who graduated from a high school in a given year,
  (2) the members of a particular travel club,
  (3) the students of a long-distance education course,
  (4) the members participating in a videoconference,
  (5) a set of replicated servers forming a highly available service, etc.

ATOMIC MULTICAST
- A multicast in a group is called atomic when the message is received either by every nonfaulty (i.e., functioning) member or by no member at all.
- Atomic multicast is a basic requirement of all group-oriented activities.
- Cases where some nonfaulty members receive a particular message but others do not lead to inconsistent updates of the members' states and are not acceptable.

For example,
- Consider a group of people forming a travel club. If a multicast about a special travel opportunity for the coming Christmas season is sent out, then every member of the travel club should receive it.
- Another example is a group of replicated servers. If the primary server fails, then a backup server is expected to take over.
- For this to happen, the states of all servers must be identical to that of the primary server before the failure occurred (via the multicasts from the primary server). However, this will not be possible if one or more backup servers fail to receive some of the updates from the primary server.

- Failures play a prominent role in the implementation of atomic multicast.
- Accordingly, we will consider two broad classes of atomic multicasts: basic and reliable.
- Basic multicast rules out process crashes (or does not provide any guarantee when processes crash), whereas reliable multicast takes process crashes into account and provides guarantees. If failures are ruled out, then every basic multicast is trivially atomic.
Reliable atomic multicasts should satisfy the following three properties:
- Validity: If a correct process multicasts a message m, then it eventually delivers m.
- Agreement: If a correct process delivers m, then all correct processes eventually deliver m.
- Integrity: Every correct process delivers a message m at most once, and only if some process in the group multicasts that message. The reception of spurious messages is ruled out.

IP MULTICAST
- IP multicast is a method of sending Internet Protocol (IP) datagrams to a group of interested receivers in a single transmission. It is the IP-specific form of multicast and is used for streaming media and other network applications. It uses specially reserved multicast address blocks in IPv4 and IPv6.
- IP multicast is a popular technique for multicasting at the network layer.
- It is a bandwidth-conserving technology that reduces traffic by simultaneously delivering a single stream of information to multiple clients.
- This is particularly suitable for large-scale applications; examples include distance learning, videoconferencing, and the distribution of software, stock quotes, and news.
- The source sends only one copy, which is replicated by the routers.

- An arbitrary set of clients forms a group before receiving the multicast. The Internet Assigned Numbers Authority (IANA) has assigned class D IP addresses for IP multicast.
- This means that all IP multicast group addresses belong to the range 224.0.0.0 to 239.255.255.255. The multicast group address serves as a virtual channel.
- Group members select the channel by selecting the appropriate address, and the network configures itself to deliver the multicast traffic to the group members.
- The data are distributed via a distribution tree. Members of groups can join or leave at any time, so the distribution trees must be dynamically updated.
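The class D range can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the helper name `is_ipv4_multicast` and the sample addresses are illustrative and not taken from the textbook.

```python
import ipaddress


def is_ipv4_multicast(addr: str) -> bool:
    """Return True if addr is an IPv4 address in 224.0.0.0 to 239.255.255.255."""
    ip = ipaddress.ip_address(addr)
    return ip.version == 4 and ip.is_multicast


# Illustrative addresses: the two ends of the class D range and two outsiders.
for sample in ["224.0.0.0", "239.255.255.255", "192.168.1.10", "240.0.0.1"]:
    print(sample, is_ipv4_multicast(sample))
```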

Distribution trees
[Figure: a source tree rooted at the source, and a shared tree rooted at a rendezvous point.]
- Source tree: Routers maintain and update distribution trees whenever members join or leave a group. This puts too much load on routers. Application layer multicast overcomes this.
- Shared tree: All multicasts are routed via the rendezvous point. The shared tree is also called a core-based tree.

- Figure 15.1a shows a source tree where a host connected to router B is the source. For a different source, the tree will be different.
- The source sends one copy to each neighboring router across the shortest path links. These routers replicate it and forward a copy to each of their neighbors. The shortest path property optimizes network latency and works well for streaming data.
- This optimization does come with a price, though: the routers must maintain path information for each source.
- In a network that has thousands of sources and thousands of groups, this quickly becomes a resource issue for the routers.

- A shared tree uses a single common root, the rendezvous point (RP) (Figure 15.1b). All routers must forward the group communication traffic from their local hosts toward the RP, which forwards it to the appropriate destinations via a common shortest path tree rooted at the RP.
- The overall memory requirement for the routers of a network that allows only shared trees is lower.
- The disadvantages of shared trees are primarily twofold:
  (1) the load on the RP is large, and
  (2) the paths between the source and the destination nodes may not be optimal, introducing additional latency (notice the path from B to F in Figure 15.1b, where E is the RP).

Application layer multicast
- In IP multicast, the routers have to know about group composition. As group memberships change, routers have to be updated.
- Maintaining data about group composition and replicating a copy for each member creates too much load on routers.
- Application layer multicast overcomes this. The responsibility for multicast is left to the application layer. Each multicast is implemented as a series of unicasts, and their routes are carefully planned to reduce the stress on the links.

Figure 15.3 shows an example. Here, host 0 multicasts a message to hosts 1, 2, and 3. In Figure 15.3a, the same message is sent to router A three times (so the stress on the link from 0 to A is 3). Figure 15.3b uses a different routing strategy (using the paths 0 A C 2 and 0 A B 1 B D 3), which reduces the load on the links from 0 to A and from A to B.
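Link stress can be counted directly from the unicast paths. The sketch below is only an illustration: the topology is inferred from the two paths quoted above, and the three direct unicast paths in `unicast_from_source` are an assumption about what Figure 15.3a depicts.

```python
from collections import Counter


def link_stress(paths):
    """Count how many copies of the message cross each undirected link."""
    stress = Counter()
    for path in paths:
        for u, v in zip(path, path[1:]):
            stress[frozenset((u, v))] += 1
    return stress


def stress_on(paths, u, v):
    """Stress on the single link between u and v."""
    return link_stress(paths)[frozenset((u, v))]


# Assumed paths for Figure 15.3a: host 0 unicasts separately to 1, 2 and 3.
unicast_from_source = [
    ["0", "A", "B", "1"],
    ["0", "A", "C", "2"],
    ["0", "A", "B", "D", "3"],
]
# Paths quoted in the text for Figure 15.3b: host 1 relays the message to 3.
relayed = [
    ["0", "A", "C", "2"],
    ["0", "A", "B", "1", "B", "D", "3"],
]

print(stress_on(unicast_from_source, "0", "A"), stress_on(relayed, "0", "A"))
print(stress_on(unicast_from_source, "A", "B"), stress_on(relayed, "A", "B"))
```

Under these assumed paths the relayed strategy lowers the stress on 0-A and A-B, at the cost of using the link between B and host 1 twice.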

[Figure: (a) IP layer multicast; (b) application layer multicast; (c) overlay tree]

Ordered multicast
- In multicast groups, there are two orthogonal issues: reliability and order.
- Reliable multicast addresses only the reliability issue by guaranteeing that each member receives every message sent out by the other members in spite of process failures. It is silent about the order in which these messages are delivered.
- However, even in basic multicast, many applications require a guarantee stronger than atomicity: here, the order of message delivery becomes important.
- Even if the underlying communication is reliable, guaranteeing the order of message delivery can be far from trivial. One such version requires all messages to be delivered to every group member in the same total order.

- For example, a group of replicated files cannot be in the same state unless all replicas apply the updates from their users in the same order.
- Other applications may have a weaker ordering requirement.
- Guaranteeing ordered message delivery in the presence of process crashes (i.e., implementing the reliable version of ordered multicast) is much more challenging than the basic versions.

Three Types
1. Local order multicast (also called single-source FIFO)
2. Causal order multicast
3. Total order multicast

Local order multicast
- In local order multicast, if a process multicasts two messages in the order (m1, m2), then every correct process in the group must deliver m1 before m2.
- There are many applications of local order multicast. One is in the implementation of distributed shared memory (DSM), where the primary copy of each variable is maintained by an exclusive process, and all other processes use cached copies of it.
- Whenever the primary copy is updated, the owner of the primary copy multicasts the updates to the holders of the cached copies, and these copies are updated in the same order.
- Other applications include video distribution and software distribution.
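One common way to obtain single-source FIFO delivery is to number each sender's messages and hold back anything that arrives early. The sketch below assumes that approach; the class `FifoReceiver` and its fields are hypothetical names, not the textbook's.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in the order they were multicast."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected number per sender
        self.holdback = defaultdict(dict)  # sender -> {seq: message}
        self.delivered = []                # (sender, message) in delivery order

    def receive(self, sender, seq, msg):
        if seq < self.next_seq[sender]:
            return  # duplicate of an already delivered message
        self.holdback[sender][seq] = msg
        # Deliver every message that is now in sequence for this sender.
        while self.next_seq[sender] in self.holdback[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, self.holdback[sender].pop(n)))
            self.next_seq[sender] = n + 1


receiver = FifoReceiver()
receiver.receive("p", 1, "m2")  # arrives early, held back
receiver.receive("p", 0, "m1")  # releases m1 and then m2
print(receiver.delivered)
```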

Causal order multicast
- Let m1 and m2 be a pair of messages in a group, such that sent(m1) ≺ sent(m2).
- Then causal order multicast requires that every process in the system must deliver m1 before m2. For two messages from the same sender, this is exactly local order; causal order multicast strengthens it by imposing delivery orders among causally ordered messages from distinct senders too.
- Here is an example: A group of students scattered across a large campus are preparing for an upcoming quiz through a shared bulletin board. Someone comes up with a question and throws it to the entire group, and whoever knows the answer multicasts it to the entire group. The delivery of a question to each student must happen before the delivery of the corresponding answer, since these are causally related. It will be awkward (and a violation of the rules of causal order multicast) if some student receives the answer first and then the question.

Total order multicast
- In total order atomic multicast, every member of the group is required to deliver all messages sent within the group in identical order. It implies that if every process i maintains a queue Q.i (initially empty) to which a message is appended as soon as it is delivered, then eventually, for any two distinct processes i and j, Q.i = Q.j.
- Note that the order in which the messages are delivered has no connection with the real time at which these messages were sent out.

Ordered multicasts: Basic versions only
- Total order multicast. Every member must receive all updates in the same order. Example: consistent update of replicated data on servers.
- Causal order multicast. If a, b are two updates and a happened before b, then every member must accept a before accepting b. Example: implementation of a bulletin board.
- Local order (a.k.a. single-source FIFO). Example: video distribution, distance learning using “push technology.”

Implementing total order multicast
First method. Basic multicast using a sequencer.
{The sequencer S}
define seq: integer {initially 0}
do receive m → multicast (m, seq); seq := seq + 1; deliver m od
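The sequencer idea can be sketched in a few lines. The version below simulates the message exchange inside one Python process; the `Member` class with its hold-back buffer is an assumption about how recipients use the sequence numbers and is not spelled out on the slide.

```python
class Sequencer:
    """Stamps every incoming message with the next sequence number."""

    def __init__(self):
        self.seq = 0

    def stamp(self, m):
        stamped = (m, self.seq)
        self.seq += 1
        return stamped


class Member:
    """Delivers stamped messages strictly in sequence-number order."""

    def __init__(self):
        self.expected = 0
        self.holdback = {}
        self.delivered = []

    def receive(self, m, seq):
        self.holdback[seq] = m
        while self.expected in self.holdback:
            self.delivered.append(self.holdback.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
stamped = [sequencer.stamp(m) for m in ["x", "y", "z"]]
first, second = Member(), Member()
for m, seq in stamped:                                 # arrives in order
    first.receive(m, seq)
for m, seq in [stamped[2], stamped[0], stamped[1]]:    # arrives out of order
    second.receive(m, seq)
print(first.delivered, second.delivered)
```

Both members end up with the same delivery sequence even though the network reordered the messages for the second one.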

Implementing total order multicast Second method. Basic multicast without a sequencer. Uses the idea of 2 PC (two-phase commit) Each process will deliver the messages from the senders in the order (p, r, q)

Implementing total order multicast
Step 1. Sender i sends (m, ts) to all.
Step 2. Receiver j saves it in a holdback queue and sends an ack (a, ts).
Step 3. The sender receives all acks and picks the largest ts. It then sends (m, ts, commit) to all.
Step 4. The receiver removes it from the holdback queue and delivers m in the ascending order of timestamps.
Why does it work?
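The four steps can be simulated in a single process to see why all members agree. This is a sketch under stated assumptions: each receiver proposes a timestamp from its own logical clock, ties between equal final timestamps are broken by message id, and a committed message is delivered only when no message with a smaller timestamp is still waiting. All names are hypothetical.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.holdback = {}   # msg_id -> [timestamp, committed?]
        self.delivered = []

    def on_message(self, msg_id, ts):
        """Step 2: hold the message back and acknowledge with a proposed ts."""
        self.clock = max(self.clock, ts) + 1
        self.holdback[msg_id] = [self.clock, False]
        return self.clock

    def on_commit(self, msg_id, final_ts):
        """Step 4: record the agreed ts, then deliver whatever is safe."""
        self.clock = max(self.clock, final_ts)
        self.holdback[msg_id] = [final_ts, True]
        while self.holdback:
            first_id, (ts, committed) = min(
                self.holdback.items(), key=lambda kv: (kv[1][0], kv[0])
            )
            if not committed:
                break  # an earlier message is still waiting for its commit
            self.delivered.append(first_id)
            del self.holdback[first_id]


procs = [Process(i) for i in range(3)]
# Steps 1-2: messages "a" and "b" reach the processes in different orders.
proposals = {"a": [], "b": []}
for proc, order in zip(procs, [["a", "b"], ["b", "a"], ["a", "b"]]):
    for msg_id in order:
        proposals[msg_id].append(proc.on_message(msg_id, ts=0))
# Step 3: each sender picks the largest proposed timestamp.
final = {msg_id: max(ts_list) for msg_id, ts_list in proposals.items()}
# Step 4: commits also arrive in different orders.
for proc, order in zip(procs, [["b", "a"], ["a", "b"], ["a", "b"]]):
    for msg_id in order:
        proc.on_commit(msg_id, final[msg_id])
print([proc.delivered for proc in procs])
```

The key observation is that a receiver's proposed timestamp never exceeds the final one, so a committed message at the head of the queue cannot later be overtaken.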

Implementing causal order multicast
Basic multicast only. Uses vector clocks. Recipient i will deliver a message from j iff
1. VC_j(j) = LC_j(i) + 1
2. ∀k: k ≠ j :: VC_k(j) ≤ LC_k(i)
Here VC(j) is the incoming vector clock carried by the message from j, LC(i) is the local vector clock of recipient i, the subscript selects a component of the vector, and ∀k means "for all k".
Note the slight difference in the implementation of the vector clocks.
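The two delivery conditions translate almost literally into code. The sketch below assumes vector clocks that count multicasts only, and a recipient that sets its component for the sender when it delivers; `CausalReceiver` and the bulletin-board messages are illustrative names.

```python
def can_deliver(vc, lc, j):
    """vc: vector clock on the message from sender j; lc: recipient's clock."""
    next_from_j = vc[j] == lc[j] + 1
    nothing_missing = all(vc[k] <= lc[k] for k in range(len(vc)) if k != j)
    return next_from_j and nothing_missing


class CausalReceiver:
    def __init__(self, n):
        self.lc = [0] * n
        self.pending = []
        self.delivered = []

    def receive(self, j, vc, msg):
        self.pending.append((j, vc, msg))
        progress = True
        while progress:  # keep scanning until no held-back message is ready
            progress = False
            for item in list(self.pending):
                sender, clock, text = item
                if can_deliver(clock, self.lc, sender):
                    self.pending.remove(item)
                    self.delivered.append(text)
                    self.lc[sender] = clock[sender]
                    progress = True


# Process 0 posts a question; process 1 answers after seeing it.
student = CausalReceiver(3)
student.receive(1, [1, 1, 0], "answer")    # held back: question not yet seen
student.receive(0, [1, 0, 0], "question")  # releases question, then answer
print(student.delivered)
```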

Reliable multicast Tolerates process crashes. The additional requirements are: Only correct processes are required to receive the messages from all correct processes in the group. Multicasts by faulty processes will either be received by every correct process, or by none at all.

A theorem on reliable multicast Theorem. In an asynchronous distributed system, total order reliable multicasts cannot be implemented when even a single process undergoes a crash failure. (Hint) The implementation will violate the FLP impossibility result. Complete the arguments!

Scalable Reliable Multicast
IP multicast or application layer multicast provides an unreliable datagram service. Reliability requires the detection of message omissions followed by retransmission. This can be done using acknowledgements (ACKs). However, for large groups (as in distance learning applications or software distribution), scalability is a major problem. The reduction of acknowledgements and retransmissions is the main contribution of Scalable Reliable Multicast (SRM) (Floyd et al.).

Scalable Reliable Multicast
- If omission failures are rare, then instead of using ACKs, receivers only report the nonreceipt of messages using NACKs.
- If several members of a group fail to receive a message, then each such member waits for a random period of time before sending its NACK. This helps to suppress redundant NACKs.
- The sender multicasts the missing copy only once.
- The use of cached copies in the network and selective point-to-point retransmission further reduces the traffic.
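The effect of the random wait can be illustrated with a very simplified model. It assumes that a NACK is multicast to the whole group and is heard by everyone after a fixed `propagation_delay`, and that a member cancels its own NACK once it hears another one; this is a sketch of the suppression idea, not the SRM protocol itself.

```python
import random


def nacks_sent(timers, propagation_delay=0.0):
    """Return the members that actually send a NACK.

    timers maps each member that missed the message to its random wait.
    A member stays silent if the earliest NACK reaches it before its own
    timer expires.
    """
    first = min(timers.values())
    return sorted(
        member
        for member, wait in timers.items()
        if wait == first or wait < first + propagation_delay
    )


fixed = {"r1": 0.30, "r2": 0.10, "r3": 0.12}
print(nacks_sent(fixed))        # instant propagation
print(nacks_sent(fixed, 0.05))  # r3 fires before it hears r2's NACK

rng = random.Random(7)  # seeded only to make the demo repeatable
many = {f"r{i}": rng.uniform(0.0, 1.0) for i in range(100)}
print(len(nacks_sent(many, 0.01)), "NACKs instead of", len(many))
```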

Scalable Reliable Multicast
[Figure: the source sends m[0], m[1], m[2], …; two receivers miss m[7] and send NACKs; copies of m[7] cached in the network are used for the retransmission.]

Open Groups
- Open groups allow members to spontaneously join and leave. Changing group sizes adds a new twist to the problems of group communication and needs a precise specification of the requirements.
- Consider a group g that initially consists of four members {0, 1, 2, 3}. Assume that each member knows the current membership of this group. We call this a view of the group and represent it as V(g) = {0, 1, 2, 3}.

Dealing with open groups
The view of a process is its current knowledge of the membership. It is important that all processes have identical views. Inconsistent views can lead to problems.
Example: Four members (0, 1, 2, 3) will send out 144 emails. Assume that 3 left the group but only 2 knows about it. So,
- 0 will send 144/4 = 36 emails (first quarter, 1-36)
- 1 will send 144/4 = 36 emails (second quarter, 37-72)
- 2 will send 144/3 = 48 emails (last one-third, 97-144)
- 3 has left.
The mails 73-96 will not be delivered!
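The arithmetic of this example can be reproduced with a short script. It assumes each member splits the 144 emails evenly over the members in its own view, in order of member id; `my_share` is a hypothetical helper and requires the total to be divisible by the view size.

```python
def my_share(total, view, me):
    """Inclusive (first, last) range of emails that member `me` sends,
    computed from its own view of the group."""
    members = sorted(view)
    size = total // len(members)  # assumes total divides evenly
    first = members.index(me) * size + 1
    return first, first + size - 1


# Members 0 and 1 still believe 3 is present; only 2 knows that 3 has left.
views = {0: [0, 1, 2, 3], 1: [0, 1, 2, 3], 2: [0, 1, 2]}
covered = set()
for me, view in views.items():
    first, last = my_share(144, view, me)
    print(me, "sends", first, "to", last)
    covered.update(range(first, last + 1))

missing = sorted(set(range(1, 145)) - covered)
print("undelivered:", missing[0], "to", missing[-1])
```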

Dealing with open groups Views can change unpredictably, and no member may have exact information about who joined and who left at any given time. These views and their changes should propagate in the same order to all members.

Dealing with open groups
Example. Current view (of all processes) v0(g) = {0, 1, 2, 3}. Let 1, 2 leave and 4 join the group concurrently. This view change can be serialized in many ways:
- {0, 1, 2, 3}, {0, 1, 3}, {0, 3, 4}, OR
- {0, 1, 2, 3}, {0, 2, 3}, {0, 3, 4}, OR
- {0, 1, 2, 3}, {0, 3}, {0, 3, 4}
To make sure that every member observes these changes in the same order, changes in the view should be sent via total order multicast.

View propagation
{Process 0}: v0(g); send m1, ...; v1(g); send m2, send m3; v2(g);
{Process 1}: v0(g); send m4, send m5; v1(g); send m6; v2(g) ...;
where v0(g) = {0, 1, 2, 3}, v1(g) = {0, 1, 3}, v2(g) = {0, 3, 4}

View delivery guidelines Rule 1. If a process j joins and continues its membership in a group g that already contains a process i, then eventually j appears in all views delivered by process i. Rule 2. If a process j permanently leaves a group g that contains a process i, then eventually j is excluded from all views delivered by process i.

View-synchronous communication
Rule. With respect to each message, all correct processes have the same view.
m sent in view V ⇒ m received in view V
This is also known as virtual synchrony.

View-synchronous communication
- Agreement. If a correct process k delivers a message m in vi(g) before delivering the next view vi+1(g), then every correct process j ∈ vi(g) ∩ vi+1(g) must deliver m before delivering vi+1(g).
- Integrity. If a process j delivers a view vi(g), then vi(g) must include j.
- Validity. If a process k delivers a message m in view vi(g) and another process j ∈ vi(g) does not deliver that message m, then the next view vi+1(g) delivered by k must exclude j.
[Figure: sender k and receiver j both deliver m in vi(g) before the view changes to vi+1(g).]

Example
Let process 1 deliver m and then crash.
- Possibility 1. No one delivers m, but each delivers the new view {0, 2, 3}.
- Possibility 2. Processes 0, 2, 3 deliver m and then deliver the new view {0, 2, 3}.
- Possibility 3. Processes 2, 3 deliver m and then deliver the new view {0, 2, 3}, but process 0 first delivers the view {0, 2, 3} and then delivers m.
Are these acceptable?
[Figure: timelines of processes 0 to 3 for Possibility 3, with the view changing from {0, 1, 2, 3} to {0, 2, 3}.]
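One way to explore the question is to encode each possibility as per-process delivery logs and test them against the Agreement rule. The check below is a simplified reading of that rule (either every surviving process delivers m before the new view, or none does); the log format and function names are assumptions made for this sketch.

```python
def delivers_in_old_view(log, m, new_view):
    """True if message m appears in the log before the new view is delivered."""
    for event in log:
        if event == ("msg", m):
            return True
        if event == ("view", new_view):
            return False
    return False


def agreement_holds(logs, m, new_view):
    """logs: delivery logs of the correct processes that are in both views."""
    flags = [delivers_in_old_view(log, m, new_view) for log in logs.values()]
    return all(flags) or not any(flags)


NEW_VIEW = frozenset({0, 2, 3})
M, V = ("msg", "m"), ("view", NEW_VIEW)
# Process 1 crashed, so only processes 0, 2 and 3 are checked.
possibility_1 = {0: [V], 2: [V], 3: [V]}
possibility_2 = {0: [M, V], 2: [M, V], 3: [M, V]}
possibility_3 = {0: [V, M], 2: [M, V], 3: [M, V]}

for name, logs in [("1", possibility_1), ("2", possibility_2), ("3", possibility_3)]:
    print("Possibility", name, agreement_holds(logs, "m", NEW_VIEW))
```

Under this reading, the first two possibilities pass the check and the third fails, because process 0 delivers m in a different view from processes 2 and 3.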

Replicated Data Management: Architecture of replicated Data Management
- Data replication is an age-old technique for tolerating faults, increasing data availability, and reducing latency in data access. With online activities taking over our lives, our critical data, both personal and professional, are increasingly being stored in data centers somewhere on the globe.
- Such data are replicated across multiple data centers either proactively or reactively, so that even when some of the copies are lost due to a crash or become inaccessible due to a network partition, our data still remain intact and accessible.
- Cached copies of downloaded data on our personal devices enable us to use them even if network connectivity is absent.
- Apart from data backup, replication is widely used in the implementation of distributed shared memory (DSM), distributed file systems, and bulletin boards.

Reliability versus Availability
- Two primary motivations behind data replication are reliability and availability. Data replicated in the cloud not only serve as a backup and safeguard against failures but also facilitate anytime, anywhere availability through a simple browser.
- In the WWW, proxy servers provide service when the main server becomes overloaded. All these illustrate how replication can improve data or service availability. There may be cases in which the service provided by a proxy server is not as efficient as the service provided by the main server, but it is certainly better than having no service at all.

Two different ways of sharing a file F: (a) users share a single copy and (b) each user has a local replica of the file.
Consider two users, Alice and Bob, updating a shared file F. There may be a single copy of F, or Alice and Bob each may want to maintain a private replica of F (Figure 16.1). Each user's life is as follows:
{Alice and Bob sharing a file F}
do true → read F; modify F; write F od

Links & References
- http://homepage.divms.uiowa.edu/~ghosh/
- Sukumar Ghosh, “Distributed Systems: An Algorithmic Approach”, Chapman and Hall, CRC Press, Second Edition, 2015, ISBN 10: 1584885645, ISBN 13: 9781584885641.