Reliable and Highly Available Distributed Publish/Subscribe Systems
Reza Sherafat, Hans-Arno Jacobsen
University of Toronto
September 2009, Symposium on Reliable Distributed Systems (SRDS'09)
Distributed Publish/Subscribe Systems
[Figure: publishers (P) and subscribers (S) exchanging messages through a pub/sub overlay]
• Many-to-many communication
• High-level operations: "subscribe" and "publish"
• Decoupling between sources and sinks
• Flexible content-based messaging
Agenda
• Existing approaches
• δ-Fault-tolerance
• Architecture
• Reliable publication delivery protocol
• Experimental results
Store-and-Forward
[Figure: publication forwarded hop-by-hop from source to destination, with ACKs sent back at each hop]
• A copy is first preserved on disk and then forwarded
• Intermediate hops send an ACK to the previous hop after preserving the message
• ACKed copies can be discarded from disk
• Upon failure, unacknowledged copies survive and are re-transmitted after recovery
◦ This ensures reliable delivery but may cause delays while the machine is down
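For concreteness, a minimal Java sketch of this hop logic. This illustrates generic store-and-forward, not the paper's code; the names (onMessage, sendTo, sendAck) are hypothetical, and an in-memory map stands in for stable disk storage.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Minimal store-and-forward hop: persist first, then forward, then ACK back.
    public class StoreAndForwardHop {
        // Stands in for stable (disk) storage of in-flight messages.
        private final Map<Long, String> diskLog = new ConcurrentHashMap<>();

        public void onMessage(long seq, String payload, String prevHop, String nextHop) {
            diskLog.put(seq, payload);      // 1. preserve a copy before anything else
            sendTo(nextHop, seq, payload);  // 2. forward downstream
            sendAck(prevHop, seq);          // 3. the previous hop may now discard its copy
        }

        public void onAck(long seq) {
            diskLog.remove(seq);            // downstream preserved it; our copy is no longer needed
        }

        public void onRecovery(String nextHop) {
            // Unacknowledged copies survived the crash; re-transmit after recovery.
            diskLog.forEach((seq, payload) -> sendTo(nextHop, seq, payload));
        }

        private void sendTo(String hop, long seq, String payload) { /* network send elided */ }
        private void sendAck(String hop, long seq) { /* network send elided */ }
    }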
Mesh-Based Overlay Networks [Snoeren et al., SOSP 2001]
[Figure: publication forwarded concurrently over disjoint mesh paths from source to destination]
• Use a mesh network to concurrently forward messages on disjoint paths
• Upon failures, the message is delivered using alternative routes
• Pros: minimal impact on delivery delay
• Cons: imposes additional traffic and the possibility of duplicate delivery
Replica-based Approach [Bhola et al., DSN 2002]
[Figure: broker replicas grouped into a virtual node]
• Replicas are grouped into virtual nodes
• Replicas have identical routing information
• We compare against this approach in the evaluation section
Next
• Existing approaches
• δ-Fault-tolerance
• Architecture
• Reliable publication delivery protocol
• Experimental results
δ-Fault-Tolerance
• δ is a configuration parameter of the distributed messaging system
• A δ-fault-tolerant pub/sub system ensures reliable delivery when there are up to δ concurrent crash failures
• Reliability:
◦ Failed brokers may be down for a long time
◦ Concurrent failures are common
◦ Reliable message delivery is essential
◦ Exactly-once delivery of publications to matching subscribers
◦ Per-source FIFO-ordered message delivery
Next
• Existing approaches
• δ-Fault-tolerance
• Architecture
• Reliable publication delivery protocol
• Experimental results
Architecture
• Brokers are organized in a tree-based overlay network
• In our approach, δ-fault-tolerance is closely related to how much brokers know about the broker tree
• (δ+1)-neighborhood: all brokers within distance δ+1
[Figure: concentric 1-, 2-, and 3-neighborhoods around a broker in the tree]
• This information is stored in a data structure called the topology map
◦ Topology maps are updated as brokers enter/leave the network
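A minimal sketch of what such a topology map might look like in Java, assuming brokers are indexed by hop distance; the class and method names are illustrative, not taken from the paper.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Topology map: every broker within distance delta+1, keyed by hop distance.
    public class TopologyMap {
        private final int delta;
        private final Map<Integer, Set<String>> brokersAtDistance = new HashMap<>();

        public TopologyMap(int delta) { this.delta = delta; }

        // Called as join/leave notifications update this broker's view of the tree.
        public void addBroker(String brokerId, int distance) {
            if (distance > delta + 1) return; // outside the (delta+1)-neighborhood; not tracked
            brokersAtDistance.computeIfAbsent(distance, d -> new HashSet<>()).add(brokerId);
        }

        public void removeBroker(String brokerId) {
            brokersAtDistance.values().forEach(ids -> ids.remove(brokerId));
        }

        // All brokers at most d hops away, e.g. candidates for bypassing a failed neighbor.
        public Set<String> brokersWithin(int d) {
            Set<String> result = new HashSet<>();
            brokersAtDistance.forEach((dist, ids) -> { if (dist <= d) result.addAll(ids); });
            return result;
        }
    }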
Join Algorithm
[Figure: joining broker attaching to a joinpoint, with the δ- and (δ+1)-neighborhoods shown]
1. Joining broker connects to a joinpoint
2. A joinRequest message is sent to the joinpoint
3. The joinpoint replies with a subset of its topology map
4. The joinRequest is propagated in the network
5. Receiving brokers update their topology maps
6. Confirmation messages propagated from edge brokers are sent back
7. Joining broker receives the confirmation: join is complete
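A condensed, hypothetical view of this flow from the joining broker's side; it reuses the TopologyMap sketch above, and the remaining names (connect, sendJoinRequest) are stand-ins rather than the paper's API.

    // Joining broker's half of the join handshake (steps 1-3 and 7 are local).
    public class JoiningBroker {
        private final TopologyMap topologyMap;
        private boolean joined = false;

        public JoiningBroker(TopologyMap topologyMap) { this.topologyMap = topologyMap; }

        public void join(String joinpoint) {
            connect(joinpoint);          // step 1: attach to the joinpoint
            sendJoinRequest(joinpoint);  // step 2: announce ourselves
            // Steps 4-6 happen remotely: the request propagates, brokers update
            // their topology maps, and edge brokers send confirmations back.
        }

        // Step 3 (local side): the joinpoint's reply seeds our (delta+1)-neighborhood view.
        public void onTopologyMapReply(String brokerId, int distance) {
            topologyMap.addBroker(brokerId, distance);
        }

        // Step 7: a confirmation propagated back from the edge completes the join.
        public void onConfirmation() {
            joined = true;
        }

        private void connect(String broker) { /* open a link; elided */ }
        private void sendJoinRequest(String broker) { /* elided */ }
    }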
Subscription Routing Information
• The subscription routing protocol is used to construct forwarding paths
• Subscription messages encapsulate:
◦ pred: conjunctive predicates specifying the client's interests
◦ from: a BrokerID pointing back to the broker δ+1 hops closer to the subscriber
• Subscriptions are sent hop-by-hop throughout the network
◦ Brokers update from as the message is forwarded
◦ Brokers handle confirmation messages similarly to join
◦ Confirmed subscriptions are inserted into the subscription routing table
[Figure (δ=2): subscription travelling from subscriber S through brokers A-E, with s.from updated along the way]
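Sketched in Java, a subscription message and routing-table handling consistent with this slide. The Subscription fields mirror pred and from; everything else (string predicates, the simplified from-update rule) is an assumption for illustration.

    import java.util.ArrayList;
    import java.util.List;

    public class SubscriptionRouting {
        // A subscription carries the client's predicates plus a back-pointer.
        public static class Subscription {
            final String pred;  // conjunctive predicates over publication content
            String from;        // broker at most delta+1 hops closer to the subscriber

            Subscription(String pred, String from) { this.pred = pred; this.from = from; }
        }

        private final List<Subscription> routingTable = new ArrayList<>();

        // Called at each hop as the subscription travels away from the subscriber.
        public void forward(Subscription s, String myBrokerId, String nextHop) {
            s.from = updatedFrom(s.from, myBrokerId); // keep 'from' within delta+1 hops
            send(nextHop, s);
        }

        // Confirmations are handled like join confirmations; once confirmed,
        // the subscription becomes usable for matching publications.
        public void onConfirmed(Subscription s) {
            routingTable.add(s);
        }

        private String updatedFrom(String oldFrom, String myId) { return myId; /* simplified */ }
        private void send(String hop, Subscription s) { /* elided */ }
    }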
Next
• Existing approaches
• δ-Fault-tolerance
• Architecture
• Reliable publication forwarding protocol
• Experimental results
Publication Forwarding Algorithm (No Failure Case)
[Figure: publication P travelling downstream from broker A through its (δ+1)-neighborhood]
1. Received publications are placed in a FIFO message queue and kept until processing is complete
2. Using subscription info, the subscriptions matching the publication are identified
3. The matching subscriptions' from fields are inserted into the recipientSet
4. Using the topology map, the publication is sent to the closest available brokers towards matching subscribers (the outgoingSet)
5. Receiving downstream brokers similarly forward the publication until it is delivered to subscribers
6. Confirmations from all downstream brokers are received
7. Clean-up: once all confirmations arrive, the publication is discarded from the queue
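A Java sketch of steps 1-7. Matching and the topology-map lookup are stubbed; apart from recipientSet and outgoingSet, which come from the slide, the names are illustrative.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    public class PublicationForwarder {
        static class Publication {
            long seq;
            String content;
            Set<String> pendingConfirms = new HashSet<>(); // downstream brokers yet to confirm
        }

        // FIFO message queue; a publication survives here until fully confirmed.
        private final Deque<Publication> messageQueue = new ArrayDeque<>();

        public void onPublication(Publication pub) {
            messageQueue.addLast(pub);                                 // step 1
            Set<String> recipientSet = matchSubscriptions(pub);        // steps 2-3: 'from' fields
            Set<String> outgoingSet = closestAvailable(recipientSet);  // step 4: via topology map
            outgoingSet.forEach(next -> { pub.pendingConfirms.add(next); send(next, pub); });
        }

        public void onConfirmation(Publication pub, String fromBroker) {
            pub.pendingConfirms.remove(fromBroker);
            if (pub.pendingConfirms.isEmpty()) {
                messageQueue.remove(pub);                              // step 7: clean-up
            }
        }

        private Set<String> matchSubscriptions(Publication p) { return new HashSet<>(); /* elided */ }
        private Set<String> closestAvailable(Set<String> recipients) { return recipients; /* elided */ }
        private void send(String broker, Publication p) { /* elided */ }
    }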
Publication Forwarding Algorithm (Failure Case)
[Figure: publication rerouted around a failed broker to its downstream neighbors]
• Brokers use heartbeats to monitor the availability of their connected peers
• Once failures are detected, the broker reconnects the topology by creating new links to the downstream neighbors of the failed brokers
◦ Bypass failed brokers
• Unconfirmed publications are retransmitted from the message queue
• Subsequent publications are forwarded via the new links instead
• Multiple concurrent failures (up to δ) are handled similarly
◦ In the worst case, δ brokers in a row have failed
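The failure path, sketched under the same assumptions: on a missed heartbeat, link to the failed broker's downstream neighbors and replay anything it never confirmed. All names here are illustrative stand-ins for the paper's protocol.

    import java.util.HashSet;
    import java.util.Set;

    public class FailureHandler {
        public void onHeartbeatTimeout(String failedBroker) {
            // Reconnect the tree: new links to the failed broker's downstream
            // neighbors, which are known from the topology map.
            Set<String> bypass = downstreamNeighborsOf(failedBroker);
            bypass.forEach(this::openLink);

            // Unconfirmed publications still sit in the message queue; replay them.
            retransmitUnconfirmed(failedBroker, bypass);

            // Subsequent publications are routed over the new links; up to delta
            // concurrent failures are bypassed the same way, hop by hop.
        }

        private Set<String> downstreamNeighborsOf(String broker) { return new HashSet<>(); /* elided */ }
        private void openLink(String broker) { /* elided */ }
        private void retransmitUnconfirmed(String failedBroker, Set<String> newTargets) { /* elided */ }
    }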
Eliminating the Need for Confirmation Messages
• For each publication message sent over a link, a confirmation message is sent back
◦ Increased network traffic
• We use an aggregated acknowledgement mechanism called Depth Acknowledgements (DACK)
◦ DACK messages substitute for per-publication confirmation messages
Discarding Publications Using DACK Messages
[Figure: downstream brokers B and C reporting arrived/discarded sequence numbers upstream to A via DACK updates]
• B and C keep track of the highest sequence numbers they have received and discarded (prefix-based) from A, and periodically report them upstream using DACK messages
• Brokers append their own information to DACK messages and also relay portions of their neighbors' DACK messages
• For each publication, A evaluates safety conditions for all brokers in the publication's recipientSet
• Safety conditions:
◦ All intermediate brokers have reported an arrived seq# that is higher than the publication's seq#, OR
◦ Some intermediate broker has reported a discarded prefix seq# that is higher than the publication's seq# (necessary when there are failures)
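The two safety conditions, expressed as a small Java predicate. DackInfo and its fields are assumed names for the per-broker state reported in DACK messages; only the conditions themselves come from the slide.

    import java.util.Collection;

    public class DackSafety {
        static class DackInfo {
            long highestArrived;         // highest seq# this broker reported as arrived
            long highestDiscardedPrefix; // highest seq# of the prefix it reported discarded
        }

        // A publication may be purged once, for the brokers on the path to each
        // recipient, either condition 1 or condition 2 holds.
        public boolean safeToDiscard(long pubSeq, Collection<DackInfo> pathBrokers) {
            boolean allArrived = pathBrokers.stream()
                    .allMatch(d -> d.highestArrived > pubSeq);          // condition 1
            boolean prefixDiscarded = pathBrokers.stream()
                    .anyMatch(d -> d.highestDiscardedPrefix > pubSeq);  // condition 2
            return allArrived || prefixDiscarded;
        }
    }

Per the slide, the second (prefix-based) condition is what keeps purging possible when intermediate brokers fail and their individual arrival reports are unavailable.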
Next
• Existing approaches
• δ-Fault-tolerance
• Architecture
• Reliable publication forwarding protocol
• Experimental results
Experimental Setup
• Algorithms implemented in Java
• We run the system on a cluster computer:
◦ 21 nodes, each with 4 cores
◦ Gigabit Ethernet
• Topology setup (δ=3):
◦ Consists of 83 brokers
◦ #subscriptions: 2600
◦ #publishers: 26, at varied publication rates
• We inject failures at R1, R2, and R3 and perform measurements
[Figure: broker topology with the measured brokers R1, R2, R3]
Publication Delivery Delay
• Impact of failures on publication delivery delay
◦ Use a stream of publications (10 msg/s)
◦ Measure delivery delay between publishing and subscribing endpoints
• 3 separate runs with different numbers of simultaneous failures (1, 2, and 3 failures)
[Plot: delivery delay over time for the 1-, 2-, and 3-failure runs]
• After a short-lived jump, the delivery delay quickly returns to normal
◦ The difference corresponds to the failure detection timeout
Change in Load After Failures
• Impact of the failure of R3 on the non-faulty brokers R1 and R2:
◦ Input msg traffic: no change
◦ Output msg traffic: increase
◦ CPU utilization: increase
• Output rate and CPU utilization are affected by nearby failures
• Spikes at R2 after brokers reconnect; smaller spikes at R1
• R2's input traffic stabilizes at exactly the same rate; its output traffic stabilizes at a slightly higher rate
• R1 sees no change
[Plots: input msg rate, output msg rate, and CPU load at R1, R2, and R3 around the failure of R3]
Comparison with Replica-based Approach
[Figure: our topology (δ=2) vs. the replica-based topology with virtual nodes; R2 and R3 fail]
• Topology setup:
◦ Our approach: δ=2
◦ Replica-based: 2 replicas per virtual node
◦ Considered the situation after 2 failures (R2 and R3 fail)
• Compared the load on R1 after the failures occur
• In our approach, the CPU load on R1 is about 30% lower
Conclusions
• Our system delivers reliable pub/sub service in the face of up to δ concurrent broker failures
• We also proposed optimizations:
◦ Aggregated acknowledgement (DACK) messages to reduce network traffic
• Ongoing and future work:
◦ Explore multi-path forwarding
• http://research.msrg.utoronto.ca/Padres/WebHome
Questions? Thanks!
Backup slides …
Sample DACK Propagation and Publication Purging (δ=3)
[Animated figure, steps 1-10: direction of publication forwarding, illustrating the first and second safety conditions. Legend: node receives pub / node holds pub in MQ / node discards pub]
Publication Propagation and Purging Using DACK Info (δ=3)
[Animated figure, steps 1-10]
Publication Propagation and Purging Using DACK Info with Failures (δ=3)