Load Management and High Availability in Borealis
Magdalena Balazinska, Jeong-Hyon Hwang, and the Borealis team
MIT, Brown University, and Brandeis University

Borealis is a distributed stream processing system (DSPS) based on Aurora and Medusa.

HA Semantics and Algorithms

Goal: Streaming applications can tolerate different types of failure recovery:
• Gap recovery: may lose tuples
• Rollback recovery: produces duplicates but does not lose tuples
• Precise recovery: takes over precisely from the point of failure

Challenges: Operator and processing non-determinism. Operators range from arbitrary (Union, operators with timeouts) and deterministic, through convergent (BSort, Resample, Aggregate), to repeatable (Filter, Map, Join).

Approaches:
• Passive Standby (checkpoint and trim): most suitable for precise recovery
• Active Standby: shortest recovery time
• Upstream Backup (ACK and replay): lowest runtime overhead
[Figure: node diagrams for each approach, showing checkpoints, ACKs, trimming, and replay between a primary B, its secondary B', and its neighbors A and C]

Network Partitions

Goal: Handle network partitions in a distributed stream processing system
Challenges:
• Maximize availability
• Minimize reprocessing
• Maintain consistency
Approach: Favor availability and use updates to achieve consistency
• Use connection points to create replicas and stream versions
• Downstream nodes monitor upstream nodes, reconnect to an available upstream replica, and continue processing with minimal disruption

Contract-Based Load Management

Goals:
• Manage load through collaborations between autonomous participants
• Ensure an acceptable allocation, where each node's load is below threshold
Challenges: Incentives, efficiency, and customization

Approach:
1 - Offline, participants negotiate and establish bilateral contracts that:
• Fix or tightly bound the price per unit of load (e.g., a fixed price p or a small range [p, p+e])
• Are private and customizable (e.g., performance or availability guarantees, SLAs)
[Figure: participants connected by contracts, e.g., a contract specifying that A will pay C $p per unit of load, and a contract involving D at price 0.8p]
2 - At runtime, load moves only between participants that have a contract.
Movements are based on marginal costs:
• Each participant has a private convex cost function
• Load moves when it is cheaper to pay a partner than to process locally
• Task t moves from A to B if the unit marginal cost of t is greater than p at A and less than p at B (sketched below)
[Figure: total cost (delay, $) as a convex function of offered load (msgs/sec), with the marginal costs MC(t) at A and MC(t) at B compared against the contract price p]

Properties:
• Simple, efficient, and low overhead (provably small bounds)
• Provable incentives to participate in the mechanism
• Experimental result: a small number of contracts and small price ranges suffice to achieve an acceptable allocation
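To make the runtime load-movement rule concrete, here is a minimal sketch, not Borealis code: it assumes a hypothetical Node class with a private quadratic (convex) cost function and illustrative numbers, and checks the condition above, that a task moves from A to B only when the task's per-unit marginal cost exceeds the contract price p at A and falls below p at B.

```python
# Hypothetical sketch of the marginal-cost load-movement rule under a fixed-price
# contract. The Node class, the quadratic cost function, and all numbers are
# illustrative assumptions, not part of Borealis.

class Node:
    def __init__(self, name, load, cost_coeff):
        self.name = name
        self.load = load              # current offered load (msgs/sec)
        self.cost_coeff = cost_coeff  # private convex cost: total cost = cost_coeff * load^2

    def marginal_cost(self, task_load):
        """Increase in total cost per unit of load if task_load is added."""
        before = self.cost_coeff * self.load ** 2
        after = self.cost_coeff * (self.load + task_load) ** 2
        return (after - before) / task_load

    def marginal_savings(self, task_load):
        """Decrease in total cost per unit of load if task_load is removed."""
        before = self.cost_coeff * self.load ** 2
        after = self.cost_coeff * (self.load - task_load) ** 2
        return (before - after) / task_load


def should_move(task_load, sender, receiver, price):
    """Move the task only when both partners benefit at the contracted price."""
    return (sender.marginal_savings(task_load) > price and
            receiver.marginal_cost(task_load) < price)


A = Node("A", load=900.0, cost_coeff=1e-3)   # heavily loaded
B = Node("B", load=200.0, cost_coeff=1e-3)   # lightly loaded
print(should_move(task_load=100.0, sender=A, receiver=B, price=1.5))  # True: A pays B to take the task
```

At the contracted price, such a move lowers the sender's total cost (its processing savings exceed the payment) and the receiver's (the payment exceeds its added processing cost), which is where the incentive to participate comes from.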
Load Management Demonstration

Setup: All nodes process a network monitoring query over real traces of connection summaries.
Query: Count the connections established by each IP over 60 sec and the number of distinct ports to which each IP connected (a per-window sketch of this query follows the scenario below).
[Figure: query diagram over the connection-information stream: a 60 s group-by-IP count followed by Filter > 100 flags IPs that establish many connections; a 60 s group-by-IP count of distinct ports followed by Filter > 10 flags IPs that connect over many ports; and a 60 s group-by-IP-prefix sum followed by Filter > 100 flags clusters of IPs that establish many connections]
[Figure: nodes A, B, and C hold contracts at price p; node D holds a contract at price 0.8p]

Scenario:
1) Three nodes with identical contracts and an uneven initial load distribution.
2) As node A becomes overloaded, it sheds load to its partners B and C until the system reaches an acceptable allocation.
3) Load increases at node B, causing system overload.
4) Node D joins the system. Load flows from node B to C and from C to D until the system reaches an acceptable allocation.
[Figure: per-node load over time, annotated with "A sheds load to B then to C", "acceptable allocation", "system overload", "node D joins", and "load flows from C to D and from B to C"]
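For reference, here is a small sketch of the demo query's per-window logic on a batch of connection summaries. It assumes tuples of the form (timestamp, source IP, destination port) and tumbling 60-second windows; the function and field names are illustrative, and the actual demo runs this logic as Borealis stream operators distributed over the nodes.

```python
# Illustrative batch version of the demo query (not Borealis operators):
# per source IP and 60-second window, count connections and distinct
# destination ports, then apply the two filters from the query diagram.
from collections import defaultdict

WINDOW = 60  # seconds

def heavy_hitters(connections, count_threshold=100, port_threshold=10):
    """connections: iterable of (timestamp, src_ip, dst_port) connection summaries."""
    counts = defaultdict(int)   # (window index, src_ip) -> number of connections
    ports = defaultdict(set)    # (window index, src_ip) -> distinct destination ports
    for ts, src_ip, dst_port in connections:
        key = (int(ts // WINDOW), src_ip)
        counts[key] += 1
        ports[key].add(dst_port)

    many_connections = {k for k, c in counts.items() if c > count_threshold}   # Filter > 100
    many_ports = {k for k, p in ports.items() if len(p) > port_threshold}      # Filter > 10
    return many_connections, many_ports
```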
High Availability Demonstration

Setup: Identical queries traverse nodes that use different high availability approaches:
• B0 → B1: Passive Standby (B0' is a statically assigned secondary)
• C0 → C1: Active Standby (secondary C0')
• D0 → D1: Upstream Backup (secondary D0')
• E0 → E1: Upstream Backup & Duplicate Elimination (secondary E0')

Procedure:
1) The four primaries, B0, C0, D0, and E0, run on one laptop.
2) All other nodes run on the other laptop.
3) We compare the runtime overhead of the approaches.
4) We kill all primaries at the same time.
5) We compare the recovery time and the effects on tuple delay and duplication.

Results:
• Active standby has the highest runtime overhead.
• Passive standby adds the most end-to-end delay.
• Upstream backup has the highest overhead during recovery and produces duplicate tuples, which the duplicate-elimination variant ("UB no dups") removes (sketched below).
[Figure: tuples received, end-to-end delay, and duplicate tuples over time around the failure, for passive standby, active standby, upstream backup, and upstream backup with duplicate elimination]
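As a rough illustration of the combination used by the E nodes, the sketch below tags each output tuple with a sequence number, buffers it upstream until it is acknowledged, replays the buffer after a failure, and drops already-seen sequence numbers downstream. The classes and the sequence-number scheme are assumptions for illustration, not the Borealis mechanism, and the pipeline is collapsed to a single producer and consumer.

```python
# Simplified sketch of upstream backup with duplicate elimination (assumed
# sequence-number scheme; not Borealis code).
from collections import deque

class UpstreamNode:
    def __init__(self):
        self.backup = deque()   # (seq, tuple) pairs not yet acknowledged downstream
        self.next_seq = 0

    def emit(self, tup):
        out = (self.next_seq, tup)
        self.backup.append(out)      # keep a copy until the downstream node ACKs it
        self.next_seq += 1
        return out

    def ack(self, seq):
        # Trim the backup queue up to and including the acknowledged sequence number.
        while self.backup and self.backup[0][0] <= seq:
            self.backup.popleft()

    def replay(self):
        # After the downstream primary fails, replay the buffered tuples to its replacement.
        return list(self.backup)

class Consumer:
    def __init__(self):
        self.last_seq = -1
        self.received = []

    def receive(self, seq, tup):
        if seq <= self.last_seq:
            return                   # duplicate caused by the replay; drop it
        self.last_seq = seq
        self.received.append(tup)
```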
Network Partition Demonstration

Setup and scenario:
1) The initial query distribution crosses computer boundaries: nodes A, B, R, and C are spread across laptop 1 and laptop 2.
2) We unplug the cable connecting the laptops.
3) Node C detects that node B has become unreachable.
4) Node C identifies node R as a reachable alternate replica: R's output stream has the same name as B's but a different version.
5) Node C connects to node R and continues processing from the same point on the stream (sketched below).
6) Node C changes the version of its output stream.
7) When the partition heals, node C remains connected to R and continues processing uninterrupted.

Results:
• No duplications and no losses after the network partition.
• End-to-end tuple delay increases while C detects the network partition and reconnects to R.
[Figure: sequence number of received tuples over time, distinguishing tuples received through B before the partition from tuples received through R afterwards]
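A minimal sketch of the failover behavior node C exhibits in this demo, under the assumption of a hypothetical replica object exposing read_from(position): the downstream node remembers the position of the last tuple it received, switches to an alternate replica of the same logical stream when its current upstream becomes unreachable, resumes from the next position (hence no losses and no duplicates), and bumps the version of its own output stream.

```python
# Hypothetical failover sketch for a downstream node (not the Borealis API).
class ReplicaFailover:
    """Subscription to one logical stream that is served by several upstream replicas."""

    def __init__(self, replicas):
        self.replicas = list(replicas)  # each replica exposes read_from(pos) -> iterable of (pos, tuple)
        self.current = self.replicas[0]
        self.last_pos = -1              # position of the last tuple received on the stream
        self.output_version = 0         # version of this node's own output stream

    def tuples(self):
        """Yield tuples, transparently failing over when the current upstream is unreachable."""
        while True:
            try:
                for pos, tup in self.current.read_from(self.last_pos + 1):
                    self.last_pos = pos
                    yield tup
                return                  # the upstream closed the stream normally
            except ConnectionError:
                self._fail_over()       # partition detected; switch replicas and resume

    def _fail_over(self):
        # Resume from last_pos + 1 on another replica of the same stream, so no tuples
        # are lost and none are duplicated; bump the output stream version so nodes
        # further downstream can tell that the data now flows through a different path.
        alternates = [r for r in self.replicas if r is not self.current]
        if not alternates:
            raise ConnectionError("no reachable alternate replica")
        self.current = alternates[0]
        self.output_version += 1
```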