CS 4700 CS 5700 Network Fundamentals Lecture 10


CS 4700 / CS 5700 Network Fundamentals
Lecture 10: Inter-Domain Routing (It’s all about the Money)
Revised 2/4/2014

Network Layer, Control Plane
• Function: Set up routes between networks
• Key challenges:
    • Implementing provider policies
    • Creating stable paths
[Figure: protocol stack (Application, Presentation, Session, Transport, Network, Data Link, Physical); labels: Data Plane, Control Plane (RIP, OSPF, BGP)]

Outline
• BGP Basics
• Stable Paths Problem
• BGP in the Real World
• Debugging BGP Path Problems

ASs, Revisited
[Figure: three ASs (AS-1, AS-2, AS-3); labels: Interior Routers, BGP Routers]

AS Numbers
• Each AS is identified by an AS number (ASN)
    • 16-bit values (latest protocol supports 32-bit ones)
    • 64512 – 65535 are reserved
• Currently, there are > 20000 ASNs
    • AT&T: 5074, 6341, 7018, …
    • Sprint: 1239, 1240, 6211, 6242, …
    • Northeastern: 156
    • North America ASs: ftp://ftp.arin.net/info/asn.txt

Inter-Domain Routing
• Global connectivity is at stake!
    • Thus, all ASs must use the same protocol
    • Contrast with intra-domain routing
• What are the requirements?
    • Scalability
    • Flexibility in choosing routes
        • Cost
        • Routing around failures
• Question: link state or distance vector?
    • Trick question: BGP is a path vector protocol

BGP
• Border Gateway Protocol
    • De facto inter-domain protocol of the Internet
    • Policy based routing protocol
    • Uses a Bellman-Ford path vector protocol
• Relatively simple protocol, but…
    • Complex, manual configuration
    • Entire world sees advertisements
        • Errors can screw up traffic globally
    • Policies driven by economics
        • How much $$$ does it cost to route along a given path?
        • Not by performance (e.g. shortest paths)

BGP Relationships
• Customer pays provider
• Peers do not pay each other
• Peer 2 has no incentive to route traffic between Peer 1 and Peer 3
[Figure: providers above customers, with Peer 1, Peer 2, and Peer 3 side by side]

Tier-1 ISP Peering
[Figure: peering among Tier-1 ISPs: Inteliquent, Centurylink, Verizon Business, AT&T, Sprint, Level 3, XO Communications]

Peering Wars
• Peer
    • Reduce upstream costs
    • Improve end-to-end performance
    • May be the only way to connect to parts of the Internet
• Don’t Peer
    • You would rather have customers
    • Peers are often competitors
    • Peering agreements require periodic renegotiation
• Peering struggles in the ISP world are extremely contentious; agreements are usually confidential

Two Types of BGP Neighbors
• Exterior routers also speak IGP
[Figure: labels: IGP, eBGP, iBGP]

Full iBGP Meshes
• Question: why do we need iBGP?
    • OSPF does not include BGP policy info
    • Prevents routing loops within the AS
• iBGP updates do not trigger announcements
[Figure: labels: eBGP, iBGP]

Path Vector Protocol
• AS-path: sequence of ASs a route traverses
    • Like distance vector, plus additional information
• Used for loop detection and to apply policy
• Default choice: route with fewest # of ASs
[Figure: AS 1 through AS 5; AS 4 is labeled 120.10.0.0/16, AS 3 is labeled 130.10.0.0/16, AS 5 is labeled 110.0.0/16]
Routes shown in the figure:
    120.10.0.0/16: AS 2 AS 3 AS 4
    130.10.0.0/16: AS 2 AS 3
    110.0.0/16: AS 2 AS 5
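The two uses of the AS-path named on this slide, loop detection and the fewest-ASs default, can be sketched in a few lines. This is a minimal illustration, not router code: the function names are made up for this example, and only the path AS 2 AS 3 AS 4 comes from the slide; the other candidate paths are hypothetical.

```python
def accept(my_asn, as_path):
    """Loop detection: refuse a route whose AS-path already contains our own ASN."""
    return my_asn not in as_path


def best_route(as_paths):
    """Default choice: the AS-path with the fewest ASs (ties go to the first one seen)."""
    return min(as_paths, key=len)


def advertise(my_asn, as_path):
    """Before passing a route on, an AS adds itself to the front of the AS-path."""
    return [my_asn] + list(as_path)


# Candidate AS-paths toward 120.10.0.0/16 as AS 1 might hear them.
# Only [2, 3, 4] appears on the slide; the other two are hypothetical.
candidates = [[5, 2, 3, 4], [2, 3, 4], [2, 1, 3, 4]]
usable = [p for p in candidates if accept(1, p)]   # drops the path that loops through AS 1
best = best_route(usable)
```

Keeping the whole path, rather than just a distance, is what lets `accept` detect loops with a simple membership test.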

BGP Operations (Simplified)
• Establish session on TCP port 179
• Exchange active routes
• Exchange incremental updates
[Figure: BGP session between AS-1 and AS-2]

Four Types of BGP Messages
• Open: Establish a peering session.
• Keep Alive: Handshake at regular intervals.
• Notification: Shuts down a peering session.
• Update: Announce new routes or withdraw previously announced routes.
    • announcement = IP prefix + attribute values

BGP Attributes
• Attributes used to select “best” path
    • LocalPref
        • Local preference policy to choose most preferred route
        • Overrides default fewest AS behavior
    • Multi-exit Discriminator (MED)
        • Specifies path for external traffic destined for an internal network
        • Chooses peering point for your network
    • Import Rules
        • What route advertisements do I accept?
    • Export Rules
        • Which routes do I forward to whom?

Route Selection Summary
1. Highest Local Preference (enforce relationships)
2. Shortest AS Path
3. Lowest MED
4. Lowest IGP Cost to BGP Egress
5. Lowest Router ID (when all else fails, break ties)
Steps 2 through 4 are labeled “Traffic engineering” on the slide.
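One way to read the summary above is as a lexicographic comparison: each rule is consulted only when all earlier rules tie. The sketch below encodes just the five steps on the slide. The `Route` class and its field names are assumptions made for this example, and real routers apply more steps (for instance, MED is normally compared only between routes learned from the same neighboring AS).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    local_pref: int   # higher is better
    as_path: tuple    # shorter is better
    med: int          # lower is better
    igp_cost: int     # lower is better (cost to the BGP egress)
    router_id: int    # lower wins the final tie-break


def preference_key(route):
    """Tuple that sorts the most preferred route first."""
    return (-route.local_pref, len(route.as_path), route.med,
            route.igp_cost, route.router_id)


def select_best(routes):
    """Apply the five rules in order by taking the minimum key."""
    return min(routes, key=preference_key)
```

For example, a route with `local_pref=200` and a four-AS path beats a route with `local_pref=100` and a two-AS path, because local preference is compared first.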

Shortest AS Path != Shortest Path
[Figure: two routes from Source to Destination: one with 4 hops across 4 ASs, the other with 9 hops across 2 ASs]

Hot Potato Routing
[Figure: two routes from Source to Destination: one with 5 hops total and 2 hops cost, the other with 3 hops total and 3 hops cost]

Importing Routes
[Figure: an ISP imports routes from its provider, from its peers, and from its customers; labels: From Provider, From Peer, From Customer, ISP Routes]

Exporting Routes
• To Provider and To Peer: customer and ISP routes only ($$$ generating routes)
• To Customer: customers get all routes
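The export rules on this slide fit in a small predicate. This is a sketch under the slide's simplified model; the relationship constants and the convention that `None` marks the ISP's own routes are assumptions made for the example.

```python
CUSTOMER, PEER, PROVIDER = "customer", "peer", "provider"


def should_export(learned_from, send_to):
    """Decide whether to export a route to a neighbor.

    learned_from: relationship of the neighbor the route came from,
                  or None for the ISP's own routes.
    send_to:      relationship of the neighbor we would export to.
    """
    if send_to == CUSTOMER:
        return True                            # customers get all routes
    # Providers and peers only hear customer routes and the ISP's own routes.
    return learned_from in (None, CUSTOMER)
```

The asymmetry follows the money: carrying traffic toward a customer earns revenue, while carrying traffic between two providers or peers does not.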

Modeling BGP
• AS relationships
    • Customer/provider
    • Peer
    • Sibling, IXP
• Gao-Rexford model
    • AS prefers to use customer path, then peer, then provider
        • Follow the money!
    • Valley-free routing
    • Hierarchical view of routing (incorrect but frequently used)
[Figure: AS hierarchy with links labeled P-P, P-C, and C-P]
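Valley-free routing can be checked mechanically: a path may climb customer-to-provider links, cross at most one peer link, and then only descend provider-to-customer links. The sketch below reuses the figure's link labels and assumes that C-P means a step from a customer up to its provider and P-C a step from a provider down to its customer, read in the direction of travel.

```python
def is_valley_free(links):
    """links: labels along the path in travel order: "C-P", "P-P", or "P-C"."""
    climbing = True                # still allowed to go up or cross a peer link
    for link in links:
        if link == "C-P":
            if not climbing:
                return False       # going up again after the peak is a valley
        elif link == "P-P":
            if not climbing:
                return False       # at most one peer link, and only at the peak
            climbing = False
        elif link == "P-C":
            climbing = False       # once we descend, we can only keep descending
        else:
            raise ValueError(f"unknown link label: {link}")
    return True
```

A path such as `["C-P", "P-P", "P-C"]` passes, while `["P-C", "C-P"]` fails because it would make a customer carry traffic between two of its providers for free.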

AS Relationships: It’s Complicated
• GR Model is strictly hierarchical
    • Each AS pair has exactly one relationship
    • Each relationship is the same for all prefixes
• In practice it’s much more complicated
    • Rise of widespread peering
    • Regional, per-prefix peerings
    • Tier-1’s being shoved out by “hypergiants”
    • IXPs dominating traffic volume
• Modeling is very hard, very prone to error
    • Huge potential impact for understanding Internet behavior

Other BGP Attributes
• AS_SET
    • Instead of a single AS appearing at a slot, it’s a set of ASes
    • Why?
• Communities
    • Arbitrary number that is used by neighbors for routing decisions
        • Export this route only in Europe
        • Do not export to your peers
    • Usually stripped after first interdomain hop
    • Why?
• Prepending
    • Lengthening the route by adding multiple instances of ASN
    • Why?
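Prepending is easy to show concretely. One common answer to the slide's "Why?" is that a longer AS-path makes a route less attractive to neighbors that fall back on the fewest-ASs default, so an origin can steer inbound traffic toward a primary link. The sketch below assumes exactly that default; the ASN 65001 is a hypothetical value taken from the reserved range.

```python
def prepend(as_path, asn, extra):
    """Return as_path with `extra` additional copies of asn at the front."""
    return [asn] * extra + list(as_path)


def preferred(as_paths):
    """A neighbor using the default rule picks the shortest AS-path."""
    return min(as_paths, key=len)


ORIGIN = 65001                          # hypothetical ASN from the reserved range
primary = [ORIGIN]                      # announced unmodified on the primary link
backup = prepend([ORIGIN], ORIGIN, 2)   # announced with two extra copies on the backup link
```

A neighbor that hears both announcements and applies only the path-length rule chooses `primary`; `backup` remains available if the primary route is withdrawn.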

Outline
• BGP Basics
• Stable Paths Problem
• BGP in the Real World
• Debugging BGP Path Problems

What Problem is BGP Solving?
Underlying Problem | Distributed Solution
Shortest Paths     | RIP, OSPF, IS-IS, etc.
???                | BGP
• Knowing ??? can:
    • Aid in the analysis of BGP policy
    • Aid in the design of BGP extensions
    • Help explain BGP routing anomalies
    • Give us a deeper understanding of the protocol

The Stable Paths Problem
• An instance of the SPP:
    • Graph of nodes and edges
    • Node 0, called the origin
    • A set of permitted paths from each node to the origin
        • Each set contains the null path
    • Each set of paths is ranked
        • Null path is always least preferred
[Figure: nodes 0 through 5 with ranked permitted paths: node 1: 130, 10; node 2: 210, 20; node 3: 30; node 4: 420, 430; node 5: 5210]

A Solution to the SPP
• A solution is an assignment of permitted paths to each node such that:
    • Node u’s path is either null or uwP, where path wP is assigned to node w and edge u–w exists
    • Each node is assigned the highest ranked path that is consistent with their neighbors
• Solutions need not use shortest paths, or form a spanning tree
[Figure: the same instance as on the previous slide, with ranked permitted paths 130, 10; 210, 20; 30; 420, 430; 5210]

Simple SPP Example
• Each node gets its preferred route
• Totally stable topology
[Figure: nodes 0 through 4 with ranked permitted paths: node 1: 10, 130; node 2: 20, 210; node 3: 30; node 4: 420 and 430 (ranking not recoverable from the extracted text)]

Good Gadget
• Not every node gets preferred route
• Topology is still stable
• Only one stable configuration
    • No matter which router chooses first!
[Figure: nodes 0 through 4 with ranked permitted paths: node 1: 130, 10; node 2: 210, 20; node 3: 30; node 4: 430, 420]

SPP May Have Multiple Solutions
[Figure: nodes 0, 1, 2 with ranked permitted paths: node 1: 120, 10; node 2: 210, 20; the same instance is drawn with different stable assignments]
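Because an SPP instance is just ranked lists of permitted paths, small instances can be solved by brute force. The sketch below checks every assignment against the stability condition. Paths are written as tuples ending at the origin 0, the null path is the empty tuple, and the two instances are transcribed from the garbled figure labels on the preceding slides, so treat the exact rankings as assumptions.

```python
from itertools import product


def best_available(node, ranked, assignment):
    """Highest-ranked permitted path of `node` that matches its next hop's current path."""
    for path in ranked[node]:                  # ranked[node] is ordered best first
        next_hop, rest = path[1], path[1:]
        if next_hop == 0 or assignment[next_hop] == rest:
            return path
    return ()                                  # null path, always least preferred


def is_solution(ranked, assignment):
    """Stable if every node already holds its best available path."""
    return all(best_available(n, ranked, assignment) == assignment[n] for n in ranked)


def solutions(ranked):
    """Enumerate every assignment and keep the stable ones."""
    nodes = sorted(ranked)
    choices = [ranked[n] + [()] for n in nodes]
    return [a for a in (dict(zip(nodes, combo)) for combo in product(*choices))
            if is_solution(ranked, a)]


# Two-node instance from "SPP May Have Multiple Solutions" (as read from the figure).
TWO_NODE = {1: [(1, 2, 0), (1, 0)], 2: [(2, 1, 0), (2, 0)]}
# Good Gadget (as read from the figure).
GOOD = {1: [(1, 3, 0), (1, 0)], 2: [(2, 1, 0), (2, 0)],
        3: [(3, 0)], 4: [(4, 3, 0), (4, 2, 0)]}
```

Under these readings, `solutions(TWO_NODE)` finds two stable assignments and `solutions(GOOD)` finds exactly one, in which node 2 settles for its second choice. Exhaustive search is only practical for toy instances like these.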

Bad Gadget
• That was only one round of oscillation!
• This keeps going, infinitely
• Problem stems from:
    • Local (not global) decisions
    • Ability of one node to improve its path selection
[Figure: nodes 0 through 4 with permitted-path labels 130, 10, 20, 3420, 30, 430]
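The oscillation can be reproduced with a tiny simulation in which every node repeatedly switches to its best currently available path. This is a sketch under two assumptions: the Bad Gadget rankings are taken to be node 1: 130, 10; node 2: 210, 20; node 3: 3420, 30; node 4: 420, 430 (the figure labels are only partly recoverable), and all nodes update at the same time, which is just one of many possible activation orders.

```python
def best_available(node, ranked, assignment):
    """Highest-ranked permitted path of `node` that matches its next hop's current path."""
    for path in ranked[node]:                  # ordered best first
        next_hop, rest = path[1], path[1:]
        if next_hop == 0 or assignment[next_hop] == rest:
            return path
    return ()                                  # null path


def run(ranked, max_rounds=100):
    """Update all nodes together until the state stops changing or repeats."""
    state = {n: () for n in ranked}            # everyone starts with the null path
    seen = [state]
    for _ in range(max_rounds):
        new = {n: best_available(n, ranked, state) for n in ranked}
        if new == state:
            return "stable", state
        if new in seen:
            return "oscillates", new           # we are back in an earlier state
        seen.append(new)
        state = new
    return "undecided", state


# Assumed rankings; see the note above.
BAD = {1: [(1, 3, 0), (1, 0)], 2: [(2, 1, 0), (2, 0)],
       3: [(3, 4, 2, 0), (3, 0)], 4: [(4, 2, 0), (4, 3, 0)]}
GOOD = {1: [(1, 3, 0), (1, 0)], 2: [(2, 1, 0), (2, 0)],
        3: [(3, 0)], 4: [(4, 3, 0), (4, 2, 0)]}
```

With these inputs `run(GOOD)` reports a stable state, while `run(BAD)` revisits an earlier state and so cycles forever under this schedule. Each node's choice is locally sensible, which is the slide's point about local rather than global decisions.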

SPP Explains BGP Divergence
• BGP is not guaranteed to converge to stable routing
    • Policy inconsistencies may lead to “livelock”
    • Protocol oscillation
[Figure: classification of gadgets; labels: Solvable, Can Diverge, Must Converge, Must Diverge, Good Gadgets, Bad Gadgets, Naughty Gadgets]

Beware of Backup Policies
• BGP is not robust
• It may not recover from link failure
[Figure: nodes 0 through 4 with ranked permitted paths: node 1: 130, 10; node 2: 210, 20; node 3: 3420, 30; node 4: 40, 420, 430]

BGP is Precarious
• If node 1 uses path 1 0, this is solvable
• No longer stable
[Figure: nodes 0 through 6 with permitted-path labels: node 1: 120, 10; node 2: 210, 20; node 3: 310, 3120; node 4: 4310, 453120; node 5: 5310, 563120; node 6: 6310, 643120, 63120]

Can BGP Be Fixed?
• Unfortunately, SPP is NP-complete
• Possible Solutions
    • Static Approach: Automated Analysis of Routing Policies (This is very hard)
    • Dynamic Approach: Extend BGP to detect and suppress policy-based oscillations?
    • Inter-AS coordination
• These approaches are complementary

Outline
• BGP Basics
• Stable Paths Problem
• BGP in the Real World
• Debugging BGP Path Problems

Motivation
• Routing reliability/fault-tolerance on small time scales (minutes) not previously a priority
• Transaction oriented and interactive applications (e.g. Internet Telephony) will require higher levels of end-to-end network reliability
• How well does the Internet routing infrastructure tolerate faults?

Conventional Wisdom
• Internet routing is robust under faults
    • Supports path re-routing
    • Path restoration on the order of seconds
• BGP has good convergence properties
    • Does not exhibit looping/bouncing problems of RIP
• Internet fail-over will improve with faster routers and faster links
• More redundant connections (multi-homing) will always improve fault-tolerance

Delayed Routing Convergence
• Conventional wisdom about routing convergence is not accurate
    • Measurement of BGP convergence in the Internet
    • Analysis/intuition behind delayed BGP routing convergence
    • Modifications to BGP implementations which would improve convergence times

Open Question
• After a fault in a path to a multi-homed site, how long does it take for the majority of Internet routers to fail over to the secondary path?
[Figure: Customer connected to a Primary ISP and a Backup ISP; labels: Route Withdrawn, Traffic, Routing table convergence, Stable end-to-end paths]

Bad News
• With unconstrained policies:
    • Divergence
    • Possible to create unsatisfiable policies
    • NP-complete to identify these policies
    • Happening today?
• With constrained policies (e.g. shortest path first)
    • Transient oscillations
    • BGP usually converges
    • It may take a very long time…
• BGP Beacons: focuses on constrained policies

16 Month Study of Convergence
• Instrument the Internet
    • Inject BGP faults (announcements/withdrawals) of varied prefix and AS path length into topologically and geographically diverse ISP peering sessions
    • Monitor impact of faults through
        • Recording BGP peering sessions with 20 tier 1/tier 2 ISPs
        • Active ICMP measurements (512 byte/second to 100 random web sites)
    • Wait two years (and 250,000 faults)

Measurement Architecture
[Figure: researchers pretending to be an AS]

Announcement Scenarios
• Tup: a new route is advertised
• Tdown: a route is withdrawn
    • i.e. single-homed failure
• Tshort: advertise a shorter/better AS path
    • i.e. primary path repaired
• Tlong: advertise a longer/worse AS path
    • i.e. primary path fails

Major Convergence Results
• Routing convergence requires an order of magnitude longer than expected
    • 10s of minutes
• Routes converge more quickly following Tup/Repair than Tdown/Failure events
    • Bad news travels more slowly
• Withdrawals (Tdown) generate several more announcements than new routes (Tup)

Example
• BGP log of updates from AS 2117 for route via AS 2129
• One withdrawal triggers 6 announcements and one withdrawal from 2117
• Increasing AS path length until final withdrawal

Why So Many Announcements?
Events from AS 2117:
1. Route Fails: AS 2129
2. Announce: 5696 2129
3. Announce: 1 5696 2129
4. Announce: 2041 3508 2129
5. Announce: 1 2041 3508 2129
6. Route Withdrawn: 2129
[Figure: topology with AS 2117, AS 1, AS 5696, AS 2041, AS 3508, AS 2129]

How Many Announcements Does it Take For an AS to Withdraw a Route?
• Answer: up to 19

BGP Routing Table Convergence Times
[Figure: convergence-time distribution; series: New Route, Failure, Long->Short Failover, Short->Long Failover]
• Less than half of Tdown events converge within two minutes
• Tup/Tshort and Tdown/Tlong form equivalence classes
• Long tailed distribution (up to 15 minutes)

Failures, Fail-overs and Repairs
• Bad news does not travel fast…
• Repairs (Tup) exhibit similar convergence as long-short AS path fail-over
• Failures (Tdown) and short-long fail-overs (e.g. primary to secondary path) also similar
    • Slower than Tup (e.g. a repair)
    • 80% take longer than two minutes
    • Fail-over times degrade as the degree of multi-homing increases

Intuition for Delayed Convergence
• There exists a possible ordering of messages such that BGP will explore ALL possible AS paths of ALL possible lengths
• BGP is O(N!), where N is the number of default-free BGP routers in a complete graph with default policy
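To get a feel for the factorial growth, the sketch below counts the loop-free paths between two fixed routers in a complete graph, which is the pool of AS paths that path exploration could walk through in the worst case. It illustrates the counting only and is not a model of BGP timing; the function names are made up for this example.

```python
from itertools import permutations
from math import factorial


def count_simple_paths(n):
    """Loop-free paths from one node to another in a complete graph of n nodes."""
    others = n - 2                             # possible intermediate nodes
    # Choose k intermediates, in order, for every k from 0 to n - 2.
    return sum(factorial(others) // factorial(others - k) for k in range(others + 1))


def enumerate_simple_paths(n):
    """Brute-force listing of the same paths (node 0 to node n - 1), as a cross-check."""
    mids = range(1, n - 1)
    return [(0,) + p + (n - 1,) for k in range(n - 1) for p in permutations(mids, k)]
```

The count is 5 for four routers and 16 for five, and it keeps growing faster than any fixed power of N.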

Impact of Delayed Convergence
• Why do we care about routing table convergence?
    • It impacts end-to-end connectivity for Internet paths
• ICMP experiment results
    • Loss of connectivity, packet loss, latency, and packet re-ordering for an average of 3-5 minutes after a fault
• Why?
    • Routers drop packets when next hop is unknown
    • Path switching spikes latency/delay
    • Multi-pathing causes reordering

In real life …
• Discussed worst case BGP behavior
• In practice, BGP policy prevents worst case from happening
• BGP timers also provide synchronization and limit possible orderings of messages

Outline
• BGP Basics
• Stable Paths Problem
• BGP in the Real World
• Debugging BGP Path Problems

Control plane vs. Data Plane
• Control:
    • Make sure that if there’s a path available, data is forwarded over it
    • BGP sets up such paths at the AS-level
• Data:
    • For a destination, send packet to most-preferred next hop
    • Routers forward data along IP paths
• How does the control plane know if a data path is broken?
    • Direct-neighbor connectivity

Why Network Reliability Remains Hard
• Visibility
    • IP provides no built-in monitoring
    • Economic disincentives to share information publicly
• Control
    • Routing protocols optimize for policy, not reliability
    • Outage affecting your traffic may be caused by distant network
• Detecting, isolating and repairing network problems for Internet paths remains largely a slow, manual process

Improving Internet Availability
• New Internet design
    • Monitoring everywhere in the network
    • Visibility into all available routes
    • Any operator can impact routes affecting her traffic
• Challenges
    • What should we monitor?
    • What do we do with additional visibility?
    • How to use additional control?

A Practical Approach
• We can do this already in today’s Internet
    • Crowdsourcing monitoring
    • Use existing protocols/systems in unintended ways
• Allows us to address problems today
• Also informs future Internet designs

Operators Struggle to Locate Failures
“Traffic attempting to pass through Level 3’s network in the Washington, DC area is getting lost in the abyss. Here's a trace from Verizon residential to Level 3.” (Outages mailing list, Dec. 2010)
Mailing List User 1:
    1 Home router
    2 Verizon in Baltimore
    3 Verizon in Philly
    4 Alter.net in DC
    5 Level 3 in DC
    6 ***
    7 ***
Mailing List User 2:
    1 Home router
    2 Verizon in DC
    3 Alter.net in DC
    4 Level 3 in DC
    5 Level 3 in Chicago
    6 Level 3 in Denver
    7 ***
    8 ***

Reasons for Long-Lasting Outages
• Long-term outages are:
    • Repaired over slow, human timescales
    • Not well understood
    • Caused by routers advertising paths that do not work
        • E.g., corrupted memory on line card causes black hole
        • E.g., bad cross-layer interactions cause failed MPLS tunnel

Key Challenges for Internet Repair
• Lack of visibility
    • Where is the outage?
    • Which networks are (un)affected?
    • Who caused the outage?
• Lack of control
    • Reverse paths determined by possibly distant ASes
    • Limited means to affect such paths

Goals and Approach
• Improve availability through:
    • Failure isolation and remediation
    • Identifying the AS(es) responsible for path changes
• Key techniques:
    • Visibility
        • Active measurements from distributed vantage points
        • Passive collection of BGP feeds
    • Control
        • On-demand BGP prepending to route around outages
        • Active BGP measurements to identify alternative paths

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
• Locate the ISP / link causing the problem
    • Building blocks
    • Example
    • Description of technique
• Suggest that other ISPs reroute around the problem

Building blocks for failure isolation
• LIFEGUARD can use:
    • Ping to test reachability
    • Traceroute to measure forward path
    • Distributed vantage points (VPs)
        • PlanetLab for our experiments
        • Some can source spoof
    • Reverse traceroute to measure reverse path (NSDI ’10)
        • I’ll teach you about this during the security lecture
    • Atlas of historical forward/reverse paths between VPs and targets

How does LIFEGUARD locate a failure?
Before outage:
• Historical atlas enables reasoning about changes
• Traceroute yields only path from GMU to target
• Reverse traceroute reveals path asymmetry
[Figure: historical and current paths between GMU and the target]

How does LIFEGUARD locate a failure?
During outage:
• Problem with ZSTTK?
• Forward path works
[Figure: historical and current paths; probe “Ping? Fr: VP” answered by “Ping! To: VP”]

How does LIFEGUARD locate a failure?
During outage:
• Forward path works
[Figure: historical and current paths; probe at NTT “Ping? Fr: GMU” answered at GMU by “Ping! Fr: NTT”]

How does LIFEGUARD locate a failure?
During outage:
• Forward path works
• Rostelecom is not forwarding traffic towards GMU
[Figure: historical and current paths; probe at Rostelecom “Ping? Fr: GMU”]

How LIFEGUARD Locates Failures
LIFEGUARD:
1. Maintains background historical atlas
2. Isolates direction of failure, measures working direction
3. Tests historical paths in failing direction in order to prune candidate failure locations
4. Locates failure as being at the horizon of reachability
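Steps 3 and 4 can be pictured as a walk along a historical path in the failing direction, stopping at the first hop that no longer answers. The sketch below is a simplified illustration, not LIFEGUARD's actual algorithm: the function name, the `responds` callback, and the ordering of the example hops (taken from the names on the preceding slides) are all assumptions.

```python
def horizon_of_reachability(historical_path, responds):
    """Find where reachability ends along a historical path.

    historical_path: hops ordered outward from the side that can still be reached.
    responds:        function hop -> bool, True if a probe of that hop gets an answer.
    Returns (last reachable hop, first unreachable hop); either may be None.
    """
    last_reachable = None
    for hop in historical_path:
        if not responds(hop):
            return last_reachable, hop         # failure lies at this boundary
        last_reachable = hop
    return last_reachable, None                # every hop answered: no failure found


# Hypothetical ordering of hops from the example slides.
PATH = ["NTT", "Rostelecom", "ZSTTK"]
suspect = horizon_of_reachability(PATH, lambda hop: hop == "NTT")
```

In this toy run NTT answers and Rostelecom does not, so the horizon sits between them, which matches the conclusion on the earlier slide that Rostelecom is not forwarding traffic towards GMU.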

Our Approach and Outline
LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
• Locate the ISP / link causing the problem
• Suggest that other ISPs reroute around the problem
    • What would we like to add to BGP to enable this?
    • What can we deploy today, using only available protocols and router support?

Our Goal for Failure Avoidance
• Enable content / service providers to repair persistent routing problems affecting them, regardless of which ISP is causing them
• Setting
    • Assume we can locate problem
    • Assume we are multi-homed / have multiple data centers
    • Assume we speak BGP
        • We use TransitPortal to speak BGP to the real

Self-Repair of Forward Paths


A Mechanism for Failure Avoidance
• Forward path: Choose route that avoids ISP or ISP-ISP link
• Reverse path: Want others to choose paths to my prefix P that avoid ISP or ISP-ISP link X
• Want a BGP announcement AVOID(X, P):
    • Any ISP with a route to P that avoids X uses such a route
    • Any ISP not using X need only pass on the announcement

Ideal Self-Repair of Reverse Paths
[Figure: AVOID(L3, WS)]

Do paths exist that AVOID problem?
• LIFEGUARD repairs outages by instructing others to avoid particular routes.
• Q: Do alternative routes exist?
• A: Alternate policy-compliant paths exist in 90% of simulated AVOID(X, P) announcements.
    • Simulated 10 million AVOIDs on actual measured routes.

Practical Self-Repair of Reverse Paths
[Figure: routes toward WS before poisoning:
    WS
    Qwest → WS
    Sprint → Qwest → WS
    AISP → Qwest → WS
    UW → L3 → ATT → WS]

Practical Self-Repair of Reverse Paths
[Figure: to approximate AVOID(L3, WS), the announced path becomes WS → L3 → WS; routes toward WS after poisoning:
    Qwest → WS → L3 → WS
    Sprint → Qwest → WS → L3 → WS
    AISP → Qwest → WS → L3 → WS
    ATT → WS → L3 → WS
    UW → Sprint → Qwest → WS → L3 → WS
    L3: ?]
• BGP loop prevention encourages switch to working path.
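The poisoning trick relies only on ordinary loop detection: if the origin inserts the ISP to be avoided into its own announcement, that ISP sees itself in the AS-path and discards the route, while everyone else accepts it. The sketch below uses the AS names from the figure as plain strings; the helper names are invented for this example.

```python
def poisoned_path(origin, avoid):
    """AS-path announced by the origin: origin, the AS to avoid, origin again."""
    return [origin, avoid, origin]


def accepts(asn, as_path):
    """Standard BGP loop prevention: reject any path that already contains us."""
    return asn not in as_path


announcement = poisoned_path("WS", "L3")
takers = [isp for isp in ["L3", "ATT", "Qwest", "Sprint", "AISP"]
          if accepts(isp, announcement)]
```

Only L3 rejects the announcement, so ISPs that used to reach WS through L3 have to fall back on a path that avoids it.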

Other results
• Results from real poisonings
    • Poisoning in the wild / poisoning anomalies
    • Case study of restoring connectivity
• Making poisoning flexible
    • Monitoring broken path while it is disabled
    • Allowing ISPs w/o alternatives to use disabled route
• LIFEGUARD’s scalability
    • Overhead and speed of failure location
    • Router update load if many ISPs deploy our approach
• Alternatives to poisoning
    • Compatibility with secure routing (BGPSEC, etc.)
    • Comparing to other route control mechanisms

Can poisoning approximate AVOID effects?
• LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.
• Q: Does poisoning disrupt working routes?
• A: No. As I will describe:
    (a) Under certain circumstances, we can disable a link without disabling the full ISP.
    (b) We can speed BGP convergence by carefully crafting announcements.

What if some routes in an ISP still work?
• We only want C3 to change its route, to avoid A-B2
• Forward direction is easy: choose a different route
• Poisoning seems blunt, disabling an entire ISP
• Selective advertising via just D1 is also blunt
• If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1

Can poisoning approximate AVOID effects?
• LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.
• Q: Does poisoning disrupt working routes?
• A: No. As I will describe:
    (a) “Selective poisoning” can avoid 73% of links without disabling entire AS.
        ‣ Real-world results from 5 provider BGP-Mux testbed
    (b) We can speed BGP convergence by carefully crafting announcements.

Naive Poisoning Causes Transient Loss
• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss
[Figure: AVOID(X, P)]

Prepend to Reduce Path Exploration
• Most routing decisions based on:
    (1) next hop ISP
    (2) path length
• Keep these fixed to speed convergence
• Prepending prepares ISPs for later poison
[Figure: AVOID(X, P)]
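The idea of keeping next hop and path length fixed can be checked with a small example. If the origin normally announces a prepended baseline of the same length as the later poisoned path, an ISP that never routed through X sees neither of its two decision inputs change when the poison goes out. The names below (WS, L3, Qwest) are borrowed from the earlier figure, and the baseline length of three is an assumption chosen to match the poisoned path.

```python
def baseline(origin, length=3):
    """Prepended baseline: the origin's ASN repeated `length` times."""
    return [origin] * length


def poisoned(origin, avoid):
    """Poisoned announcement of the same length: origin, avoided AS, origin."""
    return [origin, avoid, origin]


def seen_by(transit, announcement):
    """AS-path as received after the listed transit ASs have added themselves."""
    return list(transit) + list(announcement)


def decision_inputs(as_path):
    """The two inputs named on the slide: (next hop ISP, path length)."""
    return as_path[0], len(as_path)


before = seen_by(["Qwest"], baseline("WS"))
after = seen_by(["Qwest"], poisoned("WS", "L3"))
```

Both `before` and `after` yield the same pair from `decision_inputs`, so an ISP routing via Qwest has no reason to explore other paths when the poison is announced.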

Prepending Speeds Convergence
• With no prepend, only 65% of unaffected ISPs converge instantly
• With prepending, 95% of unaffected ISPs re-converge instantly, 98% in under half a minute
• Also speeds convergence to new paths for affected peers

LIFEGUARD Summary
• We increasingly depend on the Internet, but availability lags
• Much of Internet unavailability due to long-lasting outages
• LIFEGUARD: Let edge networks reroute around failures
• Location challenge: Find problem, given unidirectional failures and tools that depend on connectivity
    • Use reverse traceroute, isolate directions, use historical view
• Avoidance challenge: Reroute without participation of transit networks
    • BGP poisoning gives control to the destination

Inter-Domain Routing Summary
• BGP4 is the only inter-domain routing protocol currently in use world-wide
• Issues?
    • Lack of security
    • Ease of misconfiguration
    • Poorly understood interaction between local policies
    • Poor convergence
    • Lack of appropriate information hiding
    • Non-determinism
    • Poor overload behavior

Lots of research into how to fix this
• Security
    • BGPSEC, RPKI
• Misconfigurations, inflexible policy
    • SDN
• Policy Interactions
    • PoiRoot (root cause analysis)
• Convergence
    • Consensus Routing
• Inconsistent behavior
    • LIFEGUARD, among others

Why are these still issues?
• Backward compatibility
• Buy-in / incentives for operators
• Stubbornness
• Very similar issues to IPv6 deployment