Delayed Internet Routing Convergence Craig Labovitz Abha Ahuja

Delayed Internet Routing Convergence
Craig Labovitz, Abha Ahuja, Abhijit Bose, Farnam Jahanian
Presented by Harpal Singh Bassali

Introduction
Ø Conventional wisdom: the Internet restores connectivity and reroutes rapidly after a link or router failure.
Ø Actual convergence time is on the order of minutes.
Ø What happens to data packets until then?
   Ø Loss of connectivity
   Ø Packet loss
   Ø Increased latency

Infrastructure
Ø Used both passive data collection and fault-injection machines.
Ø Data collected over a 2-year period.
Ø Injected over 250,000 routing faults from diverse locations.
Ø Used RouteViews probes to monitor BGP updates in core Internet routers.
Ø Active probe machines measured end-to-end performance by sending ICMP echo messages to random web sites.

Infrastructure

Taxonomy
Tup: A previously unavailable route is announced as available.
Tdown: A previously available route is withdrawn.
Tshort: An active route with a long ASPath is implicitly replaced by a new route with a shorter ASPath.
Tlong: An active route with a short ASPath is implicitly replaced by a new route with a longer ASPath.
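The taxonomy can be read as a rule for labelling a single route change by comparing the old and new ASPath. The sketch below is an illustration only, not the authors' tooling: the function name `classify_transition` and the representation of a path as a list of AS numbers (with `None` for "no route") are assumptions.

```python
from typing import Optional, Sequence


def classify_transition(old_path: Optional[Sequence[int]],
                        new_path: Optional[Sequence[int]]) -> Optional[str]:
    """Label a route change as Tup, Tdown, Tshort or Tlong.

    A path is a sequence of AS numbers; None means "no route".
    Returns None when the change fits none of the four classes.
    """
    if old_path is None and new_path is None:
        return None        # no route before or after: nothing to classify
    if old_path is None:
        return "Tup"       # previously unavailable route is announced
    if new_path is None:
        return "Tdown"     # previously available route is withdrawn
    if len(new_path) < len(old_path):
        return "Tshort"    # implicit replacement by a shorter ASPath
    if len(new_path) > len(old_path):
        return "Tlong"     # implicit replacement by a longer ASPath
    return None            # same length: outside the taxonomy


if __name__ == "__main__":
    # Hypothetical AS numbers from the private-use range.
    print(classify_transition([65001, 65002], [65001, 65003, 65002]))
```

Comparing only path lengths is a simplification; a real classifier would also need to track which prefix each update refers to.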

Routing Measurements Latency Vs Number of BGP updates

Observations
Ø Long-tailed distribution.
Ø 20% of Tlong and 40% of Tdown events take more than 3 minutes to converge.
Ø (Tshort, Tup) and (Tlong, Tdown) form equivalence classes.
Ø A 20-second separation between Tlong and Tdown.
Ø Tdown and Tlong had twice as many update messages as Tshort and Tup.
Ø Strong correlation between number of updates and latency.

Routing Measurements Latency Vs Type of BGP update

Observations
Ø Significant variation in convergence latencies for the ISPs.
Ø No correlation between convergence latency and geographic or network distance.
Ø Factors contributing to Internet fail-over delay are independent of network load and congestion.

End-to-End Measurements

Observations: Packet Loss Vs Type of BGP update
Ø Less than 1% packet loss throughout the 10-minute period.
Ø The Tlong event shows 17% and the Tshort event 32% packet loss.
Ø Wider curve for Tlong due to the slower speed of routing table convergence.

Observations: Latency Vs Type of BGP update
Ø Wider curve for Tlong due to the slower speed of routing table convergence.
Ø The Tup event had all its packets within 1 minute.

BGP Convergence Upper Bound on Convergence

Assumptions
Ø Each AS is a single node.
Ø The ASes form a complete graph.
Ø MinRouteAdver is excluded from the analysis.
Ø BGP processing is modeled as a single, linear, global queue.
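To see why a complete graph of ASes is a demanding case, it helps to count how many loop-free ASPaths one node could hold toward a single destination. The sketch below is a counting illustration under the single-node-per-AS and complete-graph assumptions, not the authors' proof; the function name `count_simple_paths` is hypothetical.

```python
from math import factorial


def count_simple_paths(n: int) -> int:
    """Count loop-free paths from one node to a fixed destination in a
    complete graph of n nodes.

    A path visits k of the other n - 2 nodes in some order
    (k = 0 .. n - 2) and then reaches the destination, so the count is
    the sum over k of (n - 2)! / (n - 2 - k)!.
    """
    if n < 2:
        raise ValueError("need at least a source and a destination")
    m = n - 2  # intermediate nodes available
    return sum(factorial(m) // factorial(m - k) for k in range(m + 1))


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, count_simple_paths(n))
```

The count grows factorially with the number of ASes, which is why the number of candidate paths matters for how long path exploration can last.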

BGP Convergence Upper Bound on Convergence

Results
Ø If loop detection is performed on both the sender and the receiver side, all mutual dependencies can be discovered and eliminated in a single round.
Ø Convergence latency is independent of geographic and network distance.
Ø Variations in latency are directly related to topological factors such as the length and number of possible paths between ASes.
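The two sides of loop detection mentioned above can be sketched as two membership checks on the ASPath. This is a minimal illustration, assuming a path is a list of AS numbers; the function names are hypothetical and do not come from any BGP implementation.

```python
from typing import Sequence


def receiver_accepts(local_as: int, as_path: Sequence[int]) -> bool:
    """Receiver-side check: reject a path that already contains our own AS."""
    return local_as not in as_path


def sender_should_announce(neighbor_as: int, as_path: Sequence[int]) -> bool:
    """Sender-side check: skip a path that the neighbor would reject anyway
    because the neighbor's AS is already on it."""
    return neighbor_as not in as_path


if __name__ == "__main__":
    path = [65003, 65002, 65001]  # hypothetical ASPath toward the destination
    print(receiver_accepts(65002, path))
    print(sender_should_announce(65004, path))
```

With only the receiver-side check, the looping path is still transmitted and then discarded on arrival; the sender-side check avoids sending it in the first place.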

The Impact of Internet Policy and Topology on Delayed Routing Convergence
Craig Labovitz, Roger Wattenhofer, Srinivasan Venkatachary and Abha Ahuja

Major Results
Ø Internet fail-over convergence time depends on n, the length of the longest backup path between source and destination.
Ø Customers of bigger ISPs exhibit faster convergence.
Ø Errant paths are frequently explored during delayed convergence.

Methodology
Ø Injected BGP route transitions into more than 10 geographically and topologically diverse providers.
Ø A set of probe machines actively injected faults at random intervals of roughly 2 hours.
Ø Generated faults over a six-month period.
Ø The cooperating providers treated the address space as a customer with respect to policy and filtering.
Ø Logged periodic routing table snapshots and all BGP updates from an additional 20 ISPs.

Inter-provider Relationships
Peer: Bilateral exchange of customer and backbone routing information. Routes learnt from other peers and upstream providers are not exchanged.
Customer/Transit: The customer announces its backbone and downstream routes to an upstream provider.
Backup transit: A peer relationship in which a provider only provides transit after detection of a fault. Both are peers in steady state, but after a failure the backup transit peer begins advertising its now-downstream peer's backbone and customer routes.
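The peer and customer/transit rules above amount to an export filter: what a provider announces depends on where it learned the route and who the neighbor is. The sketch below is a simplified illustration. The labels, the function name `should_export`, and the rule that a provider exports everything to its customers are assumptions, and backup transit is not modeled.

```python
# Routes a provider treats as its own: its backbone and its customers' routes.
OWN_ROUTES = {"backbone", "customer"}


def should_export(learned_from: str, neighbor: str) -> bool:
    """Decide whether to announce a route to a neighbor.

    learned_from: "backbone", "customer", "peer" or "provider"
    neighbor:     "peer", "provider" or "customer"
    """
    if learned_from not in OWN_ROUTES | {"peer", "provider"}:
        raise ValueError(f"unknown route source: {learned_from}")
    if neighbor in ("peer", "provider"):
        # Peers and upstream providers only hear backbone and customer
        # routes; routes learnt from other peers or upstreams are held back.
        return learned_from in OWN_ROUTES
    if neighbor == "customer":
        # Assumption: a transit provider announces all routes to customers.
        return True
    raise ValueError(f"unknown neighbor type: {neighbor}")


if __name__ == "__main__":
    print(should_export("customer", "peer"))
    print(should_export("peer", "peer"))
```

Filters like this limit which ASPaths are possible, so they also limit which backup paths can be explored after a failure.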

Relationships

Convergence Topologies

Observations

Conclusions
Ø Vagabond paths are responsible for delays in convergence.
Ø The more densely a router is peered, the longer it takes to converge.
Ø MinRouteAdver is responsible for significant additional latency during delayed convergence.

Topology Impact on Convergence

Observations
Ø Long-tailed distribution due to vagabond paths.
Ø ISP 3 exhibits significantly slower convergence times.
Ø Average convergence latency for a route failure corresponds to the longest possible backup path allowed by policy and topology.
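Since the last observation ties latency to the longest backup path allowed by policy and topology, a small search makes that quantity concrete. The sketch below is an illustration under assumptions: the topology is hypothetical, an edge u -> v means policy lets u use v as a next hop, and the function name `longest_simple_path` is not from the source.

```python
from typing import Dict, List, Sequence


def longest_simple_path(graph: Dict[str, Sequence[str]],
                        src: str, dst: str) -> List[str]:
    """Return one longest loop-free path from src to dst, or [] if none.

    Exhaustive depth-first search; fine for small example topologies only.
    """
    best: List[str] = []

    def dfs(node: str, path: List[str]) -> None:
        nonlocal best
        if node == dst:
            if len(path) > len(best):
                best = list(path)
            return
        for nxt in graph.get(node, ()):
            if nxt not in path:  # keep the path loop-free
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(src, [src])
    return best


if __name__ == "__main__":
    # Hypothetical policy graph with four ASes.
    topology = {"A": ["B", "C"], "B": ["C", "D"], "C": ["B", "D"], "D": []}
    print(longest_simple_path(topology, "A", "D"))
```

Removing edges from the graph, as stricter policy would, shortens the longest path the search can find.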

Observations (contd.): Latency Vs Longest ASPath explored

Observations (contd.): Provider Type Vs Observed ASPath length

Conclusions
Ø Customers sensitive to fail-over latency should multi-home to larger providers.
Ø Smaller providers should limit their number of transit and backup transit interconnections.
Ø The large number of vagabond paths suggests a need for better route validation and authentication mechanisms.