Network Architecture for Joint Failure Recovery and Traffic

































- Slides: 33
Network Architecture for Joint Failure Recovery and Traffic Engineering Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford
Failure Recovery and Traffic Engineering in IP Networks o Uninterrupted data delivery when links or routers fail o Failure recovery essential for n Backbone network operators n Large datacenters n Local enterprise networks o Goal: re-balance the network load after failure 2
Challenges of Failure Recovery o Existing solutions reroute traffic to avoid failures n Can use, e. g. , MPLS local or global protection primary tunnel global backup tunnel local backup tunnel o Drawbacks of the existing solutions n Local path protection highly suboptimal n Global path protection often requires dynamic recalculation of the tunnels 3
Overview I. Architecture: goals and proposed design II. Optimizations: of routing and load balancing III. Evaluation: using synthetic and realistic topologies IV. Conclusion 4
Architectural Goals 1. Simplify the network n Allow use of minimalist cheap routers n Simplify network management 2. Balance the load n Before, during, and after each failure 3. Detect and respond to failures 5
The Architecture – Components o Minimal functionality in routers n Path-level failure notification n Static configuration n No coordination with other routers o Management system n Knows topology, approximate traffic demands, potential failures n Sets up multiple paths and calculates load splitting ratios 6
The Architecture • topology design • list of shared risks • traffic demands • fixed paths • splitting ratios t 0. 25 s 0. 25 0. 5 7
The Architecture • fixed paths • splitting ratios t 0. 25 s 0. 25 0. 5 link cut 8
The Architecture • fixed paths • splitting ratios t 0. 25 s 0. 25 0. 5 path pr obing link cut 9
The Architecture • fixed paths • splitting ratios t 0. 5 s 0. 5 0 path pr obing link cut 10
Overview I. Architecture: goals and proposed design II. Optimizations: of routing and load balancing III. Evaluation: using synthetic and realistic topologies IV. Conclusion 11
Goal I: Find Paths Resilient to Failures o A working path needed for each allowed failure state (shared risk link group) R 1 e 4 e 2 e 3 R 2 e 5 Example of failure states: S = {e 1}, { e 2}, { e 3}, { e 4}, { e 5}, {e 1, e 2}, {e 1, e 5} 12
Goal II: Minimize Link Loads links indexed by e cost Φ(ues ) failure states indexed by s aggregate congestion cost weighted for all failures: ues =1 minimize ∑s ws∑e Φ(ues) while routing all traffic link utilization ues failure state weight Cost function is a penalty for approaching capacity 13
Possible Solutions congestion Suboptimal solution Good performance and practical? Solution not scalable capabilities of routers o Overly simple solutions do not do well o Diminishing returns when adding functionality 14
The Idealized Optimal Solution: Finding the Paths o Assume edge router can learn which links failed o Custom splitting ratios for each failure configuration: 0. 4 e 1 0. 4 e 3 e 5 0. 2 Failure Splitting Ratios - 0. 4, 0. 2 e 4 0. 7, 0, 0. 3 e 1 & e 2 … 0, 0. 6, 0. 4 e 2 one entry per failure … 0. 7 e 4 e 6 0. 3 15
The Idealized Optimal Solution: Finding the Paths o Solve a classical multicommodity flow for each failure case s: min s. t. load balancing objective flow conservation demand satisfaction edge flow non-negativity o Decompose edge flow into paths and splitting ratios 16
1. State-Dependent Splitting: Per Observable Failure o Edge router observes which paths failed o Custom splitting ratios for each observed combination of failed paths o NP-hard unless paths are fixed configuration: 0. 4 0. 2 Failure Splitting Ratios - 0. 4, 0. 2 p 2 0. 6, 0, 0. 4 … … p 1 0. 4 p 2 p 3 at most 2#paths entries 0. 6 0. 4 17
1. State-Dependent Splitting: Per Observable Failure o Heuristic: use the same paths as the idealized optimal solution o If paths fixed, can find optimal splitting ratios: min s. t. load balancing objective flow conservation demand satisfaction path flow non-negativity 18
2. State-Independent Splitting: Across All Failure Scenarios o Edge router observes which paths failed o Fixed splitting ratios for all observable failures o Non-convex optimization even with fixed paths configuration: p 1, p 2, p 3: 0. 4, 0. 2 0. 4 0. 2 p 1 0. 4 p 2 p 3 0. 667 0. 333 19
2. State-Independent Splitting: Across All Failure Scenarios o Heuristic to compute paths n Paths from the idealized optimal solution o Heuristic to compute splitting ratios n Averages of the idealized optimal solution weighted by all failure case weights ri = ∑ s w s r i s fraction of traffic on the i-th path 20
The Two Solutions 1. State-dependent splitting 2. State-independent splitting How do they compare to the idealized optimal solution? How well do they work in practice? 21
Overview I. Architecture: goals and proposed design II. Optimizations: of routing and load balancing III. Evaluation: using synthetic and realistic topologies IV. Conclusion 22
Simulations on a Range of Topologies Topology Nodes Edges Demands Tier-1 50 180 625 Abilene 11 28 110 Hierarchical 50 148 - 212 2, 450 Random 50 - 100 228 - 403 2, 450 – 9, 900 Waxman 50 169 - 230 2, 450 o Shared risk failures for Tier-1 topology n 954 failures, up to 20 links simultaneously o Single link failures 23
objective value Congestion Cost – Tier-1 IP Backbone with SRLG Failures State-independent splitting How do we compare to OSPF? not optimal but Usesimple optimized OSPF link weights [Fortz, Thorup ’ 02]. State-dependent splitting indistinguishable from optimum increasing load network traffic Additional router capabilities improve performance up to a point 24
objective value Congestion Cost – Tier-1 IP Backbone with SRLG Failures OSPF with optimized link weights can be suboptimal increasing load network traffic OSPF uses equal splitting on shortest paths. This restriction makes the performance worse. 25
Average Traffic Propagation Delay in Tier-1 Backbone o Service Level Agreements guarantee 37 ms mean traffic propagation delay o Need to ensure mean delay doesn’t increase much Algorithm Delay (ms) OSPF (current) 31. 38 Optimum 31. 75 State dep. splitting 31. 51 State indep. splitting 31. 76 26
cdf Number of Paths (Tier-1) For higher traffic load slightly more paths numberofofpaths Number of paths almost independent of the load 27
relative traffic volume (%) Traffic is Highly Variable (Tier-1) No single traffic peak (international traffic) number paths Time of (GMT) Can a static configuration cope with the variability? 28
objective value Performance with Static Router Configuration (Tier-1) number paths time of (GMT) State dependent splitting again nearly optimal 29
Overview I. Architecture: goals and proposed design II. Optimizations: of routing and load balancing III. Evaluation: using synthetic and realistic topologies IV. Conclusion 30
Conclusion o Simple mechanism combining path protection and traffic engineering o Favorable properties of state-dependent splitting algorithm: (i) Simplifies network architecture (ii) Optimal load balancing with static configurations (iii) Small number of paths (iv) Delay comparable to current OSPF o Path-level failure information is just as good as complete failure information 31
Thank You! 32
Size of Routing Tables (Tier-1) routing table size Largest routing table among all routers number of traffic paths network Size after compression: use the same label for routes that share path 33