Infrastructure-based Resilient Routing
Ben Y. Zhao, Ling Huang, Jeremy Stribling, Anthony Joseph and John Kubiatowicz
University of California, Berkeley
ICSI Lunch Seminar, January 2004
ravenben@eecs.berkeley.edu

Motivation
- Network connectivity is not reliable
  - Disconnections frequent in the Internet (UMich TR 98, IMC 02)
    - 50% of backbone links have MTBF < 10 days
    - 20% of faults last longer than 10 mins
  - IP-level repair relatively slow
    - Wide-area: BGP 3 mins
    - Local-area: IS-IS 5 seconds
- Next generation wide-area network applications
  - Streaming media, VoIP, B2B transactions
  - Low tolerance of delay, jitter and faults

The Challenge
- Routing failures are diverse
  - Many causes
    - Misconfigurations, cut fiber, planned downtime, software bugs
  - Occur anywhere with local or global impact:
    - Single fiber cut can disconnect AS pairs
  - One event can lead to complex protocol interactions
- Isolating failures is difficult
  - End user symptoms often dynamic or intermittent
  - WAN measurement research is ongoing (Rocketfuel, etc.)
  - Observations:
    - Fault detection from multiple distributed vantage points
    - In-network decision making necessary for timely responses

Talk Overview
- Motivation
- A structured overlay approach
- Mechanisms and policy
- Evaluation
- Some questions

An Infrastructure Approach
- Our goals
  - Resilient overlay to route around failures
  - Respond in milliseconds (not seconds)
- Our approach (data & control plane)
  - Nodes are observation points (similar to Plato’s NEWS service)
  - Nodes are also points of traffic redirection (forwarding path determination and data forwarding)
  - No edge node involvement
    - Fast response time, security focused on infrastructure
    - Fully transparent, no application awareness necessary

Why Structured Overlays
- Resilient Overlay Networks (MIT)
- Fully connected mesh
- Each node has full knowledge of network
  - Fast, independent calculation of routes
  - Nodes can construct any path, maximum flexibility
- Cost of flexibility
  - Protocol needs to choose the “right” route/nodes
  - Per node O(n) state
    - Monitors n - 1 paths
    - O(n^2) total path monitoring is expensive
[Figure: mesh overlay with nodes labeled S and D]

The Big Picture
[Figure: overlay nodes layered above the Internet]
- Locate nearby overlay proxy
- Establish overlay path to destination host
- Overlay routes traffic resiliently

Traffic Tunneling
[Figure: legacy nodes A and B (A, B are IP addresses) each register with a proxy on the structured peer-to-peer overlay. The proxies issue put(hash(A), P’(A)) and put(hash(B), P’(B)); A's proxy issues get(hash(B)) and receives P’(B).]
- Store mapping from end host IP to its proxy’s overlay ID
- Similar to approach in Internet Indirection Infrastructure (I3)
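
The register / put / get exchange on this slide can be sketched roughly as follows. This is a minimal illustration, not the Tapestry implementation: `OverlayDirectory` is a hypothetical in-memory stand-in for the overlay's storage interface, SHA-1 is an assumed choice of hash, and the IP addresses and proxy IDs are made up.

```python
import hashlib


class OverlayDirectory:
    """Hypothetical in-memory stand-in for the overlay's put/get interface."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)


def ip_key(ip):
    # Assumption: SHA-1 of the dotted-quad string; the slide only says hash().
    return hashlib.sha1(ip.encode("ascii")).hexdigest()


def register(directory, legacy_ip, proxy_id):
    """Proxy stores the mapping hash(IP) -> its own overlay ID."""
    directory.put(ip_key(legacy_ip), proxy_id)


def lookup_proxy(directory, dest_ip):
    """Sender's proxy resolves the destination's proxy via get(hash(IP))."""
    return directory.get(ip_key(dest_ip))


directory = OverlayDirectory()
register(directory, "10.0.0.1", "proxy-A")  # legacy node A
register(directory, "10.0.0.2", "proxy-B")  # legacy node B
```

Once A's proxy has resolved `lookup_proxy(directory, "10.0.0.2")`, it can tunnel A's traffic across the overlay to B's proxy; an unregistered address simply returns `None`.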

Pros and Cons
- Leverage small neighbor sets
  - Fewer neighbor paths to monitor: O(n) -> O(log(n))
    - Reduction in probing bandwidth
    - Faster fault detection
- Actively maintain static route redundancy
  - Manageable for “small” # of paths
  - Redirect traffic immediately when a failure is detected
  - Eliminate on-the-fly calculation of new routes
  - Restore redundancy in background after failure
- Fast fault detection + precomputed paths = more responsiveness
- Cons: overlay imposes routing stretch (mostly < 2)
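
To make the O(n) versus O(log(n)) monitoring cost concrete, here is a rough per-node count. It is a back-of-the-envelope sketch under stated assumptions: a prefix-routing table with base-16 digits (hypothetical), every routing entry filled (an upper bound), and 2 backups per entry as in the simulation slide. The function names are illustrative only.

```python
def routing_levels(n, base):
    """Smallest number of digits (levels) needed to name n nodes in this base."""
    levels = 0
    capacity = 1
    while capacity < n:
        capacity *= base
        levels += 1
    return levels


def mesh_paths_per_node(n):
    """Fully connected mesh: each node monitors every other node."""
    return n - 1


def structured_paths_per_node(n, base=16, backups=2):
    """Upper bound: (base - 1) entries per level, each with primary + backups."""
    return routing_levels(n, base) * (base - 1) * (1 + backups)
```

Under these assumptions a 5000-node mesh monitors 4999 paths per node, while the structured overlay monitors at most 180, and the latter grows only when the overlay needs another routing level.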

In-network Resiliency Details
- Active periodic probes for fault-detection
  - Exponentially weighted moving average link quality estimation
    - Avoid route flapping due to short term loss artifacts
    - Loss rate L_n = (1 - α) L_{n-1} + α p
  - Simple approach taken, much ongoing research
    - Smart fault-detection / propagation (Zhuang 04)
    - Intelligent and cooperative path selection (Seshardri 04)
- Maintaining backup paths
  - Create and store backup routes at node insertion
  - Query neighbors after failures to restore redundancy
    - Ask any neighbor at or above routing level of faulty node
    - e.g. ABCD sees ABDE failed, can ask any AB?? node for info
  - Simple policies to choose among redundant paths
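
The exponentially weighted moving average used for link quality estimation can be sketched as below. This is a minimal illustration: `alpha` is the filter constant, and `p` is assumed here to be 1 for a lost probe and 0 for an answered one; the class name and the zero initial estimate are hypothetical.

```python
class LinkQualityEstimator:
    """EWMA loss estimate: L_n = (1 - alpha) * L_{n-1} + alpha * p."""

    def __init__(self, alpha, initial_loss=0.0):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.loss = initial_loss

    def update(self, probe_lost):
        # Assumption: p is 1.0 when the probe was lost, 0.0 when it was answered.
        p = 1.0 if probe_lost else 0.0
        self.loss = (1.0 - self.alpha) * self.loss + self.alpha * p
        return self.loss
```

A single lost probe moves the estimate only by `alpha`, which is what damps route flapping on short-term loss; a larger `alpha` reacts faster but is more easily fooled by transient drops.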

First Reachable Link Selection (FRLS)
- Use link quality estimation to choose shortest “usable” path
- Use shortest path with minimal quality > T
- Correlated failures
  - Reduce with intelligent topology construction
- Goal: leverage redundancy available
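
A rough sketch of the selection policy, assuming each routing entry keeps its primary and backup next hops ordered shortest-first and that quality is a number in [0, 1] (for example, one minus the estimated loss rate). The function name and the tuple layout are illustrative, not taken from the talk.

```python
def first_reachable_link(routes, threshold):
    """Return the first next hop whose quality exceeds the threshold T.

    routes: list of (next_hop, quality) pairs, ordered shortest path first.
    Returns None when every candidate is at or below the threshold.
    """
    for next_hop, quality in routes:
        if quality > threshold:
            return next_hop
    return None


# Hypothetical routing entry: a degraded primary and two healthy backups.
entry = [("primary", 0.40), ("backup-1", 0.90), ("backup-2", 0.95)]
```

With T = 0.7 this entry selects `backup-1`, the shortest path that is still usable. The `None` case, where no path clears the threshold, is the situation the backup slides address with constrained multicast.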

Evaluation
- Metrics for evaluation
  - How much routing resiliency can we exploit?
  - How fast can we adapt to faults (responsiveness)?
- Experimental platforms
  - Event-based simulations on transit-stub topologies
    - Data collected over multiple 5000-node topologies
  - PlanetLab measurements
    - Microbenchmarks on responsiveness

Exploiting Route Redundancy (Sim)
- Simulation of Tapestry, 2 backup paths per routing entry
- Transit-stub topology shown, results from TIER and AS graphs similar

Responsiveness to Faults (PlanetLab)
[Figure: response time vs. probe period, curves for α = 0.2 and α = 0.4; plot annotations 300 and 660]
- Two reasonable values for filter constant
- Response time scales linearly with probe period
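
The linear scaling follows from the moving-average filter: after a link fails completely, the number of lost probes needed to push the loss estimate past a threshold depends only on the filter constant, so detection time is that count times the probe period. The sketch below illustrates this under assumptions that are not from the talk: a loss threshold of 0.5, an initial estimate of 0, and every probe lost after the failure. It does not attempt to reproduce the measured PlanetLab numbers.

```python
def probes_to_detect(alpha, threshold):
    """Count consecutive lost probes until the EWMA loss estimate exceeds threshold."""
    loss = 0.0
    probes = 0
    while loss <= threshold:
        loss = (1.0 - alpha) * loss + alpha * 1.0  # every probe is lost
        probes += 1
    return probes


def detection_time_ms(alpha, threshold, probe_period_ms):
    """Detection time grows linearly with the probe period for a fixed alpha."""
    return probes_to_detect(alpha, threshold) * probe_period_ms
```

With the hypothetical threshold of 0.5, α = 0.2 needs four lost probes and α = 0.4 needs two, so doubling the probe period doubles the detection time in both cases.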

Link Probing Bandwidth (PlanetLab)
- Bandwidth increases logarithmically with overlay size
- Medium sized routing overlays incur low probing bandwidth

Conclusion
- Trading flexibility for scalability and responsiveness
  - Structured routing has low path maintenance costs
    - Allows “caching” of backup paths for quick failover
  - Can no longer construct arbitrary paths
    - But simple policy exploits available redundancy well
- Fast enough for most interactive applications
  - 300 ms beacon period -> response time < 700 ms
  - ~300 nodes, b/w cost = 7 KB/s

Ongoing Questions
- Is this the right approach?
  - Is there a lower bound on desired responsiveness?
  - Is this responsive enough for VoIP?
    - If not, is multipath routing the solution?
- What about deployment issues?
  - How does inter-domain deployment happen?
  - A third-party approach? (Akamai for routing)

Related Work
- Redirection overlays
  - Detour (IEEE Micro 99)
  - Resilient Overlay Networks (SOSP 01)
  - Internet Indirection Infrastructure (SIGCOMM 02)
  - Secure Overlay Services (SIGCOMM 02)
- Topology estimation techniques
  - Adaptive probing (IPTPS 03)
  - Internet tomography (IMC 03)
  - Routing underlay (SIGCOMM 03)
- Many, many other structured peer-to-peer overlays
Thanks to Dennis Geels / Sean Rhea for their work on BMark

Backup Slides

Another Perspective on Reachability
[Figure: breakdown of all pairwise paths into four regions]
- Portion of all pairwise paths where no failure-free paths remain
- A path exists, but neither IP nor FRLS can locate the path
- Portion of all paths where IP and FRLS both route successfully
- FRLS finds path, where short-term IP routing fails

Constrained Multicast
- Used only when all paths are below quality threshold
- Send duplicate messages on multiple paths
- Leverage route convergence
  - Assign unique message IDs
  - Mark duplicates
  - Keep moving window of IDs
  - Recognize and drop duplicates
- Limitations
  - Assumes loss not from congestion
  - Ideal for local area routing
[Figure: example routes among nodes 2299, 2274, 2046, 2281, 1111, 2225, 2286, 2530]
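
The duplicate suppression step at a convergence point might look roughly like this. It is a minimal sketch, assuming message IDs are hashable values and that the moving window holds a fixed number of the most recently seen IDs; the class name and window policy are illustrative and not taken from the talk.

```python
from collections import deque


class DuplicateFilter:
    """Remember the last window_size message IDs and reject repeats."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self._window_size = window_size
        self._order = deque()  # IDs in arrival order, oldest first
        self._seen = set()     # same IDs, for O(1) membership checks

    def accept(self, message_id):
        """Return True to forward the message, False to drop a duplicate."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self._order.append(message_id)
        if len(self._order) > self._window_size:
            # Slide the window: forget the oldest ID.
            self._seen.discard(self._order.popleft())
        return True
```

The window bounds memory at each node, at the cost that a duplicate arriving after its ID has slid out of the window is forwarded again, so the window should cover the expected delay spread between the multicast paths.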

Latency Overhead of Misrouting

Bandwidth Cost of Constrained Multicast