Internet Routing (COS 598A) Today: Intradomain Routing

Internet Routing (COS 598A)
Today: Intradomain Routing Convergence
Jennifer Rexford
http://www.cs.princeton.edu/~jrex/teaching/spring2005
Tuesdays/Thursdays 11:00am-12:20pm

General Course Stuff
• Tuesday March 1
  – Lecture by Larry Peterson on PlanetLab
  – No assignment for written reviews of papers
  – I’ll be attending a Computing Research Association (CRA) Board of Directors meeting in D.C.
• Course projects
  – Good to start thinking up a project idea
  – Make an appointment to chat with me
  – Written report due on Dean’s Date, Tue May 10
  – Oral presentations during exam period
• No formal “leading a class discussion” item

Splitting Convergence into Two Lectures
• One class is just not enough
  – Today on intradomain routing
  – Next Thursday on interdomain routing
• Impact on reading you already did
  – One paper on intradomain, one on interdomain
  – Though class will focus just on intradomain
• Impact on reading for Thursday March 3
  – One paper on intradomain, one on interdomain
  – Though class will focus just on interdomain
• Push off the next topic to following class

Outline
• Routing-protocol convergence
  – Steps in reacting to a link failure
  – Effects on the data packets
• Intradomain routing protocols
  – Steps in protocol convergence
  – Implementation overhead and timers
• Operational practices to reduce convergence
  – Multiple shortest paths between routers
  – Costing out during planned maintenance

Routing Convergence
• The only constant is change
  – Equipment failures, or new deployment
  – Routing-protocol configuration changes
  – Planned maintenance on the network
• Routing protocols adapt
  – Detect the change
  – Propagate messages
  – Compute new routes
  – Update the forwarding tables

Converging After a Failure
• Failure detection
  – Router recognizes an incident link has failed
• Failure notification
  – Router informs other routers about the change
• Path re-computation
  – Routers compute new paths avoiding the link
• Forwarding-table update
  – Routers update their forwarding tables
  – Data traffic starts to flow over the new path
[Slide annotations: “Routing convergence” and “Forwarding convergence” label groups of these steps]

Bad Things Happen During Convergence
• Transient inconsistencies
  – Routers have different views of the network
  – Forwarding decisions may be inconsistent
• Effects on data traffic
  – Black-hole: packet loss
  – Loops: packets going in circles
  – Delay: packets going on very long paths
  – Out-of-order: new packets arrive before old ones
• Want to minimize convergence delay
  – … and especially the effects on the data traffic

Example: Black-hole Causing Packet Loss
• Router forwarding to dead link
  – Doesn’t know (yet) that the link is dead
  – Or, hasn’t computed a new forwarding entry
[Figure labels: s, d, drop]
Fortunately, IP only promises “best effort” delivery!

Example: Forwarding Loop
• Set of routers disagree
  – One router acting on old information
  – Another router acting on new information
[Figure labels: s, d, Loop!]
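
The black-hole and loop examples can be reproduced with a small simulation. This is only a sketch: the four-router topology (s, a, b, d), the table contents, and the `trace` helper are assumptions made for illustration, not something taken from the slides.

```python
def trace(tables, dead_links, src, dst, max_hops=16):
    """Follow per-router next hops from src toward dst and classify the outcome."""
    node, visited = src, [src]
    for _ in range(max_hops):
        if node == dst:
            return "delivered", visited
        nxt = tables[node].get(dst)
        # No entry, or an entry that still points at a dead link: packet is lost.
        if nxt is None or frozenset((node, nxt)) in dead_links:
            return "black-hole", visited
        if nxt in visited:
            return "loop", visited + [nxt]
        visited.append(nxt)
        node = nxt
    return "loop", visited


# Hypothetical topology: links s-a, s-b, a-d, b-d; link a-d has just failed.
dead = {frozenset(("a", "d"))}

# Nobody has reacted yet: a still forwards onto the dead link.
stale = {"s": {"d": "a"}, "a": {"d": "d"}, "b": {"d": "d"}}
# a has switched to its new path (via s), but s still uses its old entry (via a).
mixed = {"s": {"d": "a"}, "a": {"d": "s"}, "b": {"d": "d"}}
# Everyone has converged.
converged = {"s": {"d": "b"}, "a": {"d": "s"}, "b": {"d": "d"}}

for name, tables in [("stale", stale), ("mixed", mixed), ("converged", converged)]:
    print(name, trace(tables, dead, "s", "d"))
```

The loop lasts only as long as s and a disagree, which is why shortening convergence delay shortens the disruption.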

Intradomain Routing Convergence

Interior Gateway Protocols (IGPs)
• Routers running OSPF or IS-IS:
  – Flood link-state advertisements (LSAs)
  – Compute shortest paths from link weights
  – Determine “next hop” to other routers…
[Figure: example topology with link weights]

Knowing a Link is Dead: Heart-Beats
• Periodic “hello” packets (hello_interval, 10 sec)
  – Timeout if not received (dead_interval, 40 sec)
  – Declare failure and flood the info to others
• Small values lead to faster detection, but also:
  – Higher bandwidth consumption for “hellos”
  – False detection during congestion interval
  – False detection if router CPU falls a little behind
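
The trade-off can be made concrete with a little arithmetic. The sketch below assumes a simplified model in which the dead timer restarts on every hello received; the `HelloMonitor` class and `detection_delay_bounds` function are hypothetical names, not any router's API.

```python
class HelloMonitor:
    """Tracks one neighbor; all times are in seconds."""

    def __init__(self, dead_interval=40):
        self.dead_interval = dead_interval
        self.last_heard = 0.0

    def on_hello(self, now):
        # Each hello received restarts the dead timer.
        self.last_heard = now

    def is_dead(self, now):
        return now - self.last_heard >= self.dead_interval


def detection_delay_bounds(hello_interval=10, dead_interval=40):
    """Approximate (best, worst) delay between the failure and its detection.

    Worst case: the link fails right after a hello arrived.
    Best case: it fails just before the next hello was due.
    """
    return dead_interval - hello_interval, dead_interval


monitor = HelloMonitor(dead_interval=40)
monitor.on_hello(now=100)
print(monitor.is_dead(now=139), monitor.is_dead(now=140))
print(detection_delay_bounds(10, 40))
print(detection_delay_bounds(1, 4))
```

With the values on the slide, detection takes roughly 30 to 40 seconds; cutting both timers by a factor of ten shrinks that to 3 to 4 seconds at the price of ten times as many hellos.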

Knowing the Link is Dead: Interface Support
• Smart interface hardware
  – Detects loss of connectivity at lower layer
  – Interrupts the router CPU about the failure
  – Common in Packet Over SONET technology
  – E.g., Sprint paper sees delays less than 100 msec
• But…
  – Some media don’t support it (e.g., Ethernet, ATM)
  – … so, you often need heartbeats anyway
  – Also, want heartbeats to detect failures the hardware cannot detect on its own

Flooding the Link-State Advertisement
• After detecting the failure
  – Router sends LSA out each link
  – Each router does the same
  – … and so on
• Flooding delay
  – (CPU delay at each hop) * (diameter of the network)
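
The flooding-delay estimate can be checked with a breadth-first sketch. It assumes a fixed per-hop processing delay and ignores propagation and pacing delays; the chain topology and the `flood_arrival_times` name are made up for illustration.

```python
from collections import deque


def flood_arrival_times(adj, origin, per_hop_delay_ms):
    """Time at which each router first receives an LSA flooded from origin."""
    times = {origin: 0}
    queue = deque([origin])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in times:  # duplicate copies of the LSA are ignored
                times[v] = times[u] + per_hop_delay_ms
                queue.append(v)
    return times


# Hypothetical chain a - b - c - d: the diameter is 3 hops.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
times = flood_arrival_times(adj, "a", per_hop_delay_ms=10)
print(times)
print(max(times.values()))
```

The last router hears about the failure after (per-hop delay) * (hop distance from the origin), which is bounded by the network diameter.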

Computing the Shortest Paths
• Each router re-computes
  – Shortest-path tree rooted at this router
  – Determine next-hop to every other router
[Figure: example topology (routers A through J) with link weights, and the resulting shortest-path tree]
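
A minimal sketch of this computation, using Dijkstra's algorithm with a binary heap. The four-router graph is an assumption for illustration (the topology from the slide's figure is not recoverable), and `shortest_paths` is a hypothetical helper name.

```python
import heapq


def shortest_paths(graph, root):
    """Dijkstra from root; graph[u][v] is the weight of link u-v.

    Returns (dist, next_hop), where next_hop[x] is the neighbor of root
    that is the first hop on a shortest path to x.
    """
    dist = {root: 0}
    next_hop = {}
    heap = [(0, root, None)]  # (distance, node, first hop from root)
    done = set()
    while heap:
        d, u, first = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if first is not None:
            next_hop[u] = first
        for v, w in graph[u].items():
            nd = d + w
            if v not in done and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v, v if u == root else first))
    return dist, next_hop


graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
dist, next_hop = shortest_paths(graph, "A")
print(dist)
print(next_hop)
```

Only the next hop goes into the forwarding table; the rest of the tree is discarded once the table is built.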

Reducing the Computational Overhead
• Good system
  – Fast processor
  – High-speed memory
• Good algorithms
  – Traditional approach computes from scratch
  – Incremental algorithms compute only the changes
  – Especially nice if only one edge changes
• Pre-computation
  – Pre-compute effects of certain failure scenarios
  – E.g., all single-link or single-router failures

Updating the Forwarding Table
• Forwarding table
  – Map destination prefix to outgoing link(s)
  – Copy of table on each interface card
  – Highly optimized for fast lookups
• Updating the forwarding table
  – Computing the new forwarding table
  – Making updates to the copy on the line card
• Important source of delay
  – Sprint end-to-end study: around 1 second
  – AT&T router-level study: 100 msec to 300 msec
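
To make the "prefix to outgoing link(s)" mapping concrete, here is a toy forwarding table with longest-prefix matching. It is a sketch only: the linear scan is nothing like the highly optimized structures the slide refers to, and the prefixes and interface names are invented.

```python
import ipaddress


class ForwardingTable:
    """Toy table mapping destination prefixes to lists of outgoing links."""

    def __init__(self):
        self.entries = {}  # ip_network -> list of outgoing links

    def update(self, prefix, links):
        self.entries[ipaddress.ip_network(prefix)] = list(links)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.entries if addr in net]
        if not matches:
            return None  # no route: the packet would be dropped
        # Longest-prefix match: the most specific entry wins.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.entries[best]


fib = ForwardingTable()
fib.update("10.0.0.0/8", ["if0"])
fib.update("10.1.0.0/16", ["if1", "if2"])  # two equal-cost outgoing links
print(fib.lookup("10.1.2.3"))
print(fib.lookup("10.2.0.1"))
print(fib.lookup("192.168.0.1"))
```

After a failure, every affected entry has to be rewritten on every line card that holds a copy, which is one reason this step contributes noticeably to the delay.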

All Together: Looking Inside the Router
[Figure: Route Processor (CPU) running the OSPF process (LSA processing, LSA flooding, topology view, SPF calculation, FIB update); interface cards, each holding a FIB and forwarding data packets; switching fabric connecting the interface cards; LSAs and LS Acks exchanged with neighbors]

Significance of Protocol Timers
• Hello and dead intervals
  – Failure-detection delay vs. false diagnosis
• Pacing the link-state advertisements
  – Combining LSAs vs. longer convergence delay
  – Some routers wait till after re-running Dijkstra!
• Delaying start of shortest-path computation
  – Reducing # computations vs. convergence delay
  – Especially useful if failure affects multiple links
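
The last trade-off can be illustrated with a simplified timer model: the first LSA to arrive schedules a shortest-path computation `spf_delay_ms` later, and any LSA arriving before that deadline is absorbed into the same run. The model and the `spf_runs` name are assumptions; real implementations use more elaborate (often adaptive) timers.

```python
def spf_runs(lsa_arrivals_ms, spf_delay_ms):
    """Return the times at which the shortest-path computation runs."""
    runs = []
    deadline = None
    for t in sorted(lsa_arrivals_ms):
        if deadline is not None and t <= deadline:
            continue  # absorbed into the computation already scheduled
        deadline = t + spf_delay_ms
        runs.append(deadline)
    return runs


# A failure that takes down three links produces three LSAs close together,
# followed by an unrelated LSA much later.
arrivals = [0, 10, 20, 500]
print(spf_runs(arrivals, spf_delay_ms=0))
print(spf_runs(arrivals, spf_delay_ms=50))
```

With no delay the router runs Dijkstra four times; with a 50 msec delay it runs twice, but every reaction starts 50 msec later.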

Operational Practices

Reducing the Effects of Convergence
• Long convergence delay is bad
  – Transient problems with loss and delay
  – Disruptive for VoIP and online gaming
• Solution #1: better equipment
  – Interfaces that detect failures automatically
  – Cranking down the values of the timers
  – Faster CPUs and path-computation algorithms
• Solution #2: network design and operation
  – Improve forwarding-plane convergence
  – Improve convergence during maintenance

Equal-Cost Multi-Path (ECMP)
• Multiple shortest paths
  – Router can compute multiple shortest paths
  – Forwarding table has multiple outgoing links
  – Router splits traffic evenly over the links
[Figure: example topology with link weights]
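
The slide does not say how the even split is done. One common approach, assumed here, is to hash each flow's header fields so that all packets of a flow take the same link; the `pick_link` helper and the flow tuples are hypothetical.

```python
import zlib


def pick_link(links, flow):
    """Choose one of the equal-cost outgoing links for a flow.

    flow is a tuple of header fields; hashing it keeps every packet of a
    flow on the same link while spreading different flows over the links.
    """
    key = "|".join(str(field) for field in flow).encode()
    return links[zlib.crc32(key) % len(links)]


links = ["if1", "if2"]
flows = [("10.0.0.1", "10.9.9.9", 6, port, 80) for port in range(1024, 2024)]
counts = {link: 0 for link in links}
for flow in flows:
    counts[pick_link(links, flow)] += 1
print(counts)
```

The split is even only in aggregate over many flows; a few very large flows can still leave the links unbalanced.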

ECMP Reduces Forwarding-Plane Convergence
• Suppose one of the outgoing links fails
  – Incident router detects the failure
  – Quick recomputation of paths without this link
  – Local forwarding table updated to use other link
  – Other routers have no forwarding-table change!!!
[Figure: example topology with link weights. Only red router changes its forwarding table!]
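
A small check of this observation, as a sketch under stated assumptions: link weights are symmetric, the five-router topology is invented, and `distances_to`, `next_hop_table`, and `without_link` are hypothetical helpers.

```python
import heapq


def distances_to(graph, dst):
    """Dijkstra from dst; with symmetric weights this is the distance to dst."""
    dist = {dst: 0}
    heap = [(0, dst)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def next_hop_table(graph, dst):
    """All equal-cost next hops toward dst, for every other router."""
    dist = distances_to(graph, dst)
    return {
        u: sorted(v for v, w in graph[u].items()
                  if v in dist and u in dist and w + dist[v] == dist[u])
        for u in graph if u != dst
    }


def without_link(graph, a, b):
    """Copy of graph with the link a-b removed in both directions."""
    return {u: {v: w for v, w in nbrs.items() if {u, v} != {a, b}}
            for u, nbrs in graph.items()}


# Hypothetical topology: S has two equal-cost paths to D (via A and via B).
graph = {
    "U": {"S": 1},
    "S": {"U": 1, "A": 1, "B": 1},
    "A": {"S": 1, "D": 1},
    "B": {"S": 1, "D": 1},
    "D": {"A": 1, "B": 1},
}
before = next_hop_table(graph, "D")
after = next_hop_table(without_link(graph, "S", "A"), "D")
changed = [r for r in before if before[r] != after[r]]
print(before)
print(after)
print(changed)
```

Only S, the router incident to the failed link, ends up with a different entry; U keeps forwarding to S and never has to touch its table.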

Exploiting This Observation in Traffic Engineering
• Traffic engineering
  – Given a topology and a traffic matrix
  – … set link weights to control the flow of traffic
  – … to minimize some objective function
• Bias toward solutions with “ties”
  – Penalize solutions with just one shortest path
  – Favor solutions that lead to multiple paths
  – … even if the link loads are a little less balanced
• Applied in some traffic-engineering tools
  – Demand from ISPs buying the tools
  – … with customers demanding fast convergence

Examples of Planned Failures
• Upgrades
  – Changing link to higher capacity
  – Loading new operating system on a router
  – Swapping out an old interface card
• Maintenance
  – Fixing a flaky optical amplifier
  – Configuration changes that require a reboot
• Cable intrusions
  – Construction activities near a fiber

Planned Events Happen Often
• Sprint study
  – Maintenance window
    • From 10 pm to 6 am EST, covering east to west
    • Period of low network traffic, so less congestion
    • Not much business-critical traffic
  – Responsible for 50% of intradomain failures
• Significance
  – Planned events should be easier to handle
  – The operator knows the failure(s) will happen
  – … but, how to tell the routing protocol?
  – … or, how to prepare the network in advance?

“Costing Out” of Equipment
• Increase cost of link to high value
  – Triggers immediate flooding of LSAs
• Leads to new shortest paths avoiding the link
  – While the link still exists to forward during convergence
• Then, can safely disconnect the link
  – New flooding of LSAs, but no influence on forwarding
[Figure: example topology with link weights]
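
The effect can be checked on a small example. In this sketch the topology, the weights, the cost-out value of 1000, and the `next_hops` helper (a plain Bellman-Ford style relaxation) are all assumptions chosen for illustration.

```python
def next_hops(links, dst):
    """Next hop toward dst for every router; links maps (u, v) to a weight.

    Links are treated as undirected with the same weight in both directions.
    """
    nodes = {n for link in links for n in link}
    dist = {n: float("inf") for n in nodes}
    dist[dst] = 0
    hop = {}
    for _ in range(len(nodes) - 1):  # Bellman-Ford style relaxation rounds
        for (u, v), w in links.items():
            for a, b in ((u, v), (v, u)):
                if dist[b] + w < dist[a]:
                    dist[a] = dist[b] + w
                    hop[a] = b
    return hop


normal = {("s", "a"): 1, ("a", "d"): 1, ("s", "b"): 2, ("b", "d"): 2}
costed_out = dict(normal)
costed_out[("a", "d")] = 1000  # step 1: raise the weight, link stays up
removed = {k: w for k, w in normal.items() if k != ("a", "d")}  # step 2

print(next_hops(normal, "d"))
print(next_hops(costed_out, "d"))
print(next_hops(removed, "d"))
```

The next hops change when the weight is raised, while the link can still carry packets, and they do not change again when the link is finally disconnected.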

Bigger Picture
• Learn about a planned event
  – E.g., replace optical amplifier
• Map the event to the IP equipment
  – E.g., find link(s) that traverse the amplifier
• Increase the weight on each link
  – Slowly, perhaps one at a time to reduce overhead
• Disable the equipment
  – Disconnect amplifier and replace with new one
• Reintroduce the links into the network
  – Slowly, change one link weight at a time

Even Bigger Picture
• What if maintenance would cause congestion?
  – Reducing the capacity of the network
  – Link weights not optimized to new topology
• Compute weight changes to make
  – Re-optimize the setting of the link weights
  – … based on the soon-to-be new topology
• Then, do the maintenance
  – Cost out the IP links
  – Fix/upgrade the equipment
  – Cost in the IP links
• Then, go back to the old weight setting

Project Ideas
• Multi-path routing
  – Protocols that allow more multi-path routing
  – … not just the equal-cost paths (as in ECMP)
• Maintenance schedules
  – Compute a sequence of weight changes
  – Avoid link congestion in each step
• Convergence models
  – What actually happens during convergence?
  – Simulation of forwarding-plane behavior
• Effective pre-computation on routers
  – Routers precompute reactions to certain failures
  – E.g., all single-link failures or single-router failures

For Next Time, on Tuesday: PlanetLab
• Guest lecture
  – Professor Larry Peterson
• Three papers (two short, one regular)
  – “A Blueprint for Introducing Disruptive Technology into the Internet”
  – “Overcoming the Internet Impasse through Virtualization”
  – “Operating System Support for Planetary-Scale Network Services”
• No written reviews
  – But, be ready to ask hard questions
• PlanetLab is very useful for course projects

Next Thursday: Interdomain Convergence
• Two papers (intradomain and interdomain)
  – “Experience in Black-box OSPF Measurement”
  – “Route Flap Damping Exacerbates Internet Routing Convergence”
• Written reviews
  – Summary
  – Reasons to accept
  – Reasons to reject
  – Avenues for future work
• Optional
  – NANOG video about the second paper
  – Really great essay on “You and Your Research”