Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network

Jian Wu (University of Michigan)
Z. Morley Mao (University of Michigan)
Jennifer Rexford (Princeton University)
Jia Wang (AT&T Labs Research)

Motivation

[Figure: network diagram with a source, a destination, AS 1 through AS 4, and routers A to D; annotations: Failure, Disruption, Congestion, Mitigation.]

A backbone network is vulnerable to routing changes that occur in other domains.

Goal

• Identify important routing anomalies
  - Lost reachability
  - Persistent flapping
  - Large traffic shifts

Contributions:
• Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time.
• Use the tool to characterize routing disruptions in an operational network.

Interdomain Routing: Border Gateway Protocol

[Figure: AS 1 originates 12.34.158.0/24 and announces “I can reach 12.34.158.0/24” over eBGP to AS 2; AS 2 distributes the route over iBGP and announces “I can reach 12.34.158.0/24 via AS 1” over eBGP to AS 3; data traffic for 12.34.158.5 flows in the opposite direction.]

• Prefix-based: one route per prefix
• Path-vector: list of ASes in the path
• Incremental: every update indicates a change
• Policy-based: local ranking of routes
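
The prefix-based and path-vector properties can be illustrated with a minimal Python sketch. The function names (in_prefix, announce, accepts) are hypothetical and chosen only for this example; the sketch ignores policy and incremental updates.

```python
import ipaddress

def in_prefix(address, prefix):
    """Prefix-based: True if the address falls inside the announced prefix."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(prefix)

def announce(received_path, local_as):
    """Path-vector: an AS prepends its own number before re-announcing."""
    return [local_as] + list(received_path)

def accepts(received_path, local_as):
    """Loop prevention: reject any path that already contains the local AS."""
    return local_as not in received_path

path_from_as1 = announce([], 1)             # AS 1 originates the prefix
path_from_as2 = announce(path_from_as1, 2)  # AS 2 re-announces it to AS 3
print(path_from_as2)                        # [2, 1], i.e. "reachable via AS 1"
print(in_prefix("12.34.158.5", "12.34.158.0/24"))
```

Here AS 3 would forward a packet for 12.34.158.5 along the route for 12.34.158.0/24, and AS 1 would reject the path [2, 1] if it ever heard it back.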

Capturing Routing Changes

A large operational network (8/16/2004 – 10/10/2004)

[Figure: routers receive updates over eBGP and send their best routes over iBGP to a monitor; labels: eBGP updates, iBGP best routes, CPE, BGP Monitor.]

Challenges

• Large volume of BGP updates
  - Millions daily, very bursty
  - Too much for an operator to manage
• Different from root-cause analysis
  - Identify changes and their effects
  - Focus on actionable events rather than diagnosis
  - Diagnose causes in/near the AS

System Architecture

BGP Updates (10^6)
  → BGP Update Grouping → Events (10^5); also outputs Persistent Flapping Prefixes (10^1)
  → Event Classification → “Typed” Events
  → Event Correlation → Clusters (10^3); also outputs Frequent Flapping Prefixes (10^1)
  → Traffic Impact Prediction (using Netflow data) → Large Disruptions (10^1)

From millions of updates to a few dozen reports.

Grouping BGP Updates into Events

Challenge: A single routing change
  - leads to multiple update messages
  - affects routing decisions at multiple routers

BGP Updates → Grouping → Events, Persistent Flapping Prefixes

Approach:
• Group together all updates for a prefix with inter-arrival time < 70 seconds.
• Flag prefixes with changes lasting > 10 minutes as persistent flapping prefixes.
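
The grouping rule can be sketched in a few lines of Python. This is an offline, batch illustration under stated assumptions, not the online tool described in the talk: the input format (timestamp in seconds, prefix) and the names group_updates and persistent_flappers are hypothetical.

```python
from collections import defaultdict

EVENT_TIMEOUT = 70          # seconds, threshold from the slide
CONVERGENCE_TIMEOUT = 600   # 10 minutes, threshold from the slide

def group_updates(updates):
    """Group (timestamp, prefix) updates into per-prefix (start, end) events."""
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)
    events = {}
    for prefix, stamps in by_prefix.items():
        stamps.sort()
        spans = [[stamps[0], stamps[0]]]
        for ts in stamps[1:]:
            if ts - spans[-1][1] < EVENT_TIMEOUT:
                spans[-1][1] = ts        # same event: extend its end time
            else:
                spans.append([ts, ts])   # quiet gap: start a new event
        events[prefix] = [tuple(span) for span in spans]
    return events

def persistent_flappers(events):
    """Prefixes with at least one event lasting longer than 10 minutes."""
    return {prefix for prefix, spans in events.items()
            if any(end - start > CONVERGENCE_TIMEOUT for start, end in spans)}
```

A prefix that receives an update every 60 seconds for more than 10 minutes never leaves its first event and is therefore flagged, while two bursts separated by a few quiet minutes become two separate events.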

Grouping Thresholds

• Based on our understanding of BGP and data analysis
• Event timeout: 70 seconds
  - 2 * MRAI timer + 10 seconds
  - 98% of inter-arrival times are < 70 seconds
• Convergence timeout: 10 minutes
  - BGP usually converges within a few minutes
  - 99.9% of events last < 10 minutes

Persistent Flapping Prefixes

A surprising finding: 15.2% of updates were caused by persistent-flapping prefixes even though flap damping is enabled.

• Types of persistent flapping
  - Conservative damping parameters (78.6%)
  - Protocol oscillations due to MED (18.3%)
  - Unstable interfaces or BGP sessions (3.0%)

Example: Unstable eBGP Session

[Figure: ISP routers A, B, C, and D with a Peer and a Customer; prefix p.]

• Flap damping parameters are session-based
• Damping is not implemented for iBGP sessions

Event Classification

Challenge: Major concerns in network management
  - Changes in reachability
  - Heavy load of routing messages on the routers
  - Changes in the flow of traffic through the network

Events → Event Classification → “Typed” Events, e.g., Loss/Gain of Reachability

Solution: classify events by the severity of their impact

Event Category – “No Disruption”

[Figure: prefix p reached through AS 1 and AS 2; ISP routers A, B, C, D, and E; annotation: No Traffic Shift.]

“No Disruption”: no border routers have any traffic shift. (50.3%)

Event Category – “Internal Disruption”

[Figure: prefix p reached through AS 1 and AS 2; ISP routers A, B, C, D, and E; annotation: Internal Traffic Shift.]

“Internal Disruption”: all traffic shifts are internal. (15.6%)

Event Category – “Single External Disruption”

[Figure: prefix p reached through AS 1 and AS 2; ISP routers A, B, C, D, and E; annotation: External Traffic Shift.]

“Single External Disruption”: only one of the traffic shifts is external. (20.7%)

Statistics on Event Classification

Category                        Events    Updates
No Disruption                   50.3%     48.6%
Internal Disruption             15.6%      3.4%
Single External Disruption      20.7%      7.9%
Multiple External Disruption     7.4%     18.2%
Loss/Gain of Reachability        6.0%     21.9%

• The first 3 categories have significant day-to-day variations
• The number of updates per event depends on the type of event and the number of affected routers
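
The five categories can be expressed as a small decision procedure. The sketch below is only an illustration: the slides do not define precisely what makes a traffic shift internal or external, so the rule in is_external is an assumption, as are the route representation and both function names.

```python
def is_external(old, new):
    # Assumption (not from the slides): a shift counts as external when either
    # route was learned over eBGP or is missing; a switch between two
    # iBGP-learned routes counts as internal.
    return any(r is None or r[0] == "eBGP" for r in (old, new))

def classify_event(before, after):
    """before/after map router -> None or (learned_via, egress_router)."""
    routers = set(before) | set(after)
    had_route = any(before.get(r) is not None for r in routers)
    has_route = any(after.get(r) is not None for r in routers)
    if had_route != has_route:
        return "Loss/Gain of Reachability"
    # A traffic shift is a router whose selected route differs after the event.
    shifts = [(before.get(r), after.get(r)) for r in routers
              if before.get(r) != after.get(r)]
    if not shifts:
        return "No Disruption"
    external = sum(1 for old, new in shifts if is_external(old, new))
    if external == 0:
        return "Internal Disruption"
    if external == 1:
        return "Single External Disruption"
    return "Multiple External Disruption"
```

For example, if router B switches from egress A to egress C while A and C keep their eBGP-learned routes, the only shift is between two iBGP-learned routes and the event is classified as an internal disruption.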

Event Correlation

Challenge: A single routing change
  - affects multiple destination prefixes

“Typed” Events → Event Correlation → Clusters

Solution: group events of the same type that occur close together in time
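
One simple way to group same-type, close-occurring events is a single pass over time-sorted events. This is a sketch under assumptions: the slides do not give the time window, so the 60-second default is a placeholder, and the event tuple layout and the name correlate are hypothetical.

```python
def correlate(events, window=60):
    """Cluster (timestamp, event_type, prefix) events by type and time proximity.

    window is a hypothetical gap in seconds; the slides do not specify one.
    """
    clusters = []
    latest = {}  # event_type -> most recent cluster of that type
    for ts, etype, prefix in sorted(events):
        cluster = latest.get(etype)
        if cluster is not None and ts - cluster["end"] <= window:
            cluster["prefixes"].add(prefix)   # close in time: same cluster
            cluster["end"] = ts
        else:
            cluster = {"type": etype, "start": ts, "end": ts,
                       "prefixes": {prefix}}
            clusters.append(cluster)
            latest[etype] = cluster
    return clusters
```

Each resulting cluster records its type, time span, and the set of affected prefixes, which is the information the later traffic-weighting step needs.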

EBGP Session Reset

• Caused most of the “single external disruption” events
• Check if the number of prefixes using that session as the best route changes dramatically

[Figure: plot of number of prefixes versus time, annotated with session failure and session recovery.]

• Validation with Syslog router report (95%)
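
The check for a dramatic change in prefix count can be sketched as follows. The slides do not quantify "dramatically", so the 50% threshold is an assumed placeholder, and the input format and the name detect_session_resets are hypothetical.

```python
def detect_session_resets(counts, change_fraction=0.5):
    """Flag sharp drops and rises in the prefix count of one eBGP session.

    counts: time-ordered list of (timestamp, prefixes_using_session).
    change_fraction is an assumed threshold, not a value from the slides.
    """
    changes = []
    for (_, prev), (ts, cur) in zip(counts, counts[1:]):
        base = max(prev, cur)
        if base == 0:
            continue  # session carried no best routes in either sample
        if prev - cur >= change_fraction * base:
            changes.append((ts, "failure"))
        elif cur - prev >= change_fraction * base:
            changes.append((ts, "recovery"))
    return changes
```

With samples of 1000, 990, 5, 8, and 1000 prefixes, the sketch reports a failure at the third sample and a recovery at the fifth, matching the shape of the plot on the slide.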

Hot-Potato Changes

[Figure: prefix P reachable from ISP routers A, B, and C; IGP distances 11, 9, and 10.]

“Hot-potato routing” = route to closest egress point

• Caused “internal disruption” events
• Validation with OSPF measurement (95%) [Teixeira et al., SIGMETRICS ’04]
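
Hot-potato routing reduces to picking the egress point with the smallest IGP distance, so a purely internal cost change can move traffic without any external BGP update. The sketch below uses made-up egress names and costs, and the name closest_egress is hypothetical.

```python
def closest_egress(igp_cost):
    """Return the egress router with the lowest IGP distance.

    igp_cost: {egress_router: IGP distance from the ingress router}.
    Ties are broken by router name so the result is deterministic.
    """
    return min(igp_cost, key=lambda egress: (igp_cost[egress], egress))

# Illustrative costs only; not the topology from the slide.
before = {"A": 9, "B": 10}
after = {"A": 11, "B": 10}   # an internal link cost toward A increases

print(closest_egress(before))  # A
print(closest_egress(after))   # B
```

The egress changes from A to B even though both external routes are unchanged, which is why such events show up as internal disruptions.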

Traffic Impact Prediction

Challenge: Routing changes have different impacts on the network, depending on the popularity of the destinations

Clusters + Netflow Data → Traffic Impact Prediction → Large Disruptions

Solution: weigh each cluster by traffic volume

Traffic Impact Prediction

• Traffic weight
  - Per-prefix measurement from Netflow
  - 10% of prefixes account for 90% of traffic
• Traffic weight of a cluster
  - The sum of the “traffic weight” of its prefixes
  - A small number of large clusters have large traffic weight
  - Mostly session resets and hot-potato changes
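
The cluster weight is a plain sum over per-prefix weights, which the following sketch makes concrete. The reporting threshold, the data layout, and the names cluster_weight and large_disruptions are assumptions for illustration; the slides do not say how the cutoff for a large disruption is chosen.

```python
def cluster_weight(prefixes, traffic_weight):
    """Sum the per-prefix traffic weights; unmeasured prefixes count as 0."""
    return sum(traffic_weight.get(prefix, 0) for prefix in prefixes)

def large_disruptions(clusters, traffic_weight, threshold):
    """Return (cluster_id, weight) pairs at or above a hypothetical threshold,
    heaviest first.

    clusters: {cluster_id: set of prefixes}.
    """
    weighted = [(cid, cluster_weight(prefixes, traffic_weight))
                for cid, prefixes in clusters.items()]
    heavy = [(cid, weight) for cid, weight in weighted if weight >= threshold]
    return sorted(heavy, key=lambda item: item[1], reverse=True)
```

Because a small share of prefixes carries most of the traffic, a cluster with only a few popular prefixes can outweigh one with many unpopular prefixes.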

Performance Evaluation

• Memory
  - Static memory: “current routes”, 600 MB
  - Dynamic memory: “clusters”, 300 MB
• Speed
  - 99% of 1-second intervals of updates can be processed within 1 second
  - Occasional execution lag
  - Every 70-second interval of updates can be processed within 70 seconds

Measurements were based on a 900 MHz CPU.

Conclusion

• BGP troubleshooting system
  - Fast, online fashion
  - Operators’ concerns (reachability, flapping, traffic)
  - Significant information reduction: from millions of updates to a few dozen large disruptions
• Uncovered important network behavior
  - Hot-potato changes
  - Session resets
  - Persistent-flapping prefixes