Internet Routing (COS 598A) Today: Root-Cause Analysis

Internet Routing (COS 598A) Today: Root-Cause Analysis
Jennifer Rexford
http://www.cs.princeton.edu/~jrex/teaching/spring2005
Tuesdays/Thursdays 11:00 am-12:20 pm

Outline
• Network troubleshooting
  – Motivation for network troubleshooting
  – Investigating from the edge vs. inside
• Active probing
  – Traceroute
  – Mapping IP addresses to AS numbers
• Passive monitoring
  – Analyzing BGP update streams
  – Identifying location and cause of routing change
  – Limitations of the approach

Network Troubleshooting
“Why can’t I reach www.cnn.com?”
“Why is the performance bad?”
[Figure labels: Internet, www.cnn.com]

Reachability Problems: What Could be Wrong?
• End-host problem
  – Web server down
  – DNS server down, or misconfigured
• Forwarding-path problem
  – Packet filter or firewall restricting access
  – Mismatch in Maximum Transmission Unit (MTU)
• Routing problem
  – User or server disconnected from Internet
  – Blackhole dropping all packets
  – Persistent loop

Performance Problem: What Could be Wrong?
• End-host problems
  – Overloaded Web server
  – Overloaded DNS server
  – Overloaded user machine
• Forwarding-path problem
  – High round-trip time
  – Link congestion
• Routing problem
  – Long-term routing instability
  – Transient disruption during convergence

Motivation for Troubleshooting
• Improving performance
  – Detect, diagnose, and fix the problem
  – Pick a path through another provider
  – Pick a different path in any overlay network
• Establishing accountability
  – Enforce Service Level Agreements
  – Rate service providers
• Characterizing the Internet
  – Understand causes of performance problems
  – Understand challenges of troubleshooting

Troubleshooting Outside vs. Inside
• Outside: from the network edge (today's topic)
  – Who: users and researchers, and operators troubleshooting problems outside their network
  – Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms
  – Challenges: inference from very limited data
• Inside: from inside the network
  – Who: operators running a network
  – Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files
  – Challenges: collecting and joining the data

Active Probing

Pros and Cons of Active Probing
• Advantages
  – Can run from any end system
  – Measure the actual forwarding path
    • See black-holes, loops, and delays directly
• Disadvantages
  – Effects of routing changes, not the cause
  – Current path, not the path used in the past
    • Requires frequent probes to observe the changes
  – Shows only properties of round-trip path
    • Hard to tell if problem is on forward vs. reverse

Traceroute: Measuring the Forwarding Path
• Time-To-Live field in IP packet header
  – Source sends a packet with a TTL of n
  – Each router along the path decrements the TTL
  – “TTL exceeded” sent when TTL reaches 0
• Traceroute tool exploits this TTL behavior
[Figure: source sends probes with TTL=1 and TTL=2 toward the destination; routers reply “Time exceeded”]
Send packets with TTL=1, 2, 3, … and record source of “time exceeded” message
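The TTL mechanism above can be sketched as a small simulation. This is only an illustration under stated assumptions: the path is a plain list of hypothetical names (routers followed by the destination), no real packets are sent, and the function names `probe` and `traceroute` are invented for this example.

```python
from typing import List

def probe(path: List[str], ttl: int) -> str:
    """Return the node that answers a probe sent with the given TTL.

    Assumption: `path` lists the routers in order, with the destination last.
    Each router decrements the TTL; the router where it reaches 0 replies
    with "time exceeded". Otherwise the destination itself answers.
    """
    for router in path[:-1]:
        ttl -= 1
        if ttl == 0:
            return router      # this router sends "time exceeded"
    return path[-1]            # probe survived all routers

def traceroute(path: List[str], max_ttl: int = 30) -> List[str]:
    """Send probes with TTL=1, 2, 3, ... and record who answers each one."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder = probe(path, ttl)
        hops.append(responder)
        if responder == path[-1]:
            break              # destination reached, stop probing
    return hops

if __name__ == "__main__":
    print(traceroute(["r1", "r2", "r3", "dst"]))
```

A real tool additionally has to cope with missing replies and timeouts, which this sketch leaves out.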

Example Traceroute Output (Berkeley to CNN)
Hop number, IP address, DNS name
 1  169.229.62.1     inr-daedalus-0.CS.Berkeley.EDU
 2  169.229.59.225   soda-cr-1-1-soda-br-6-2
 3  128.32.255.169   vlan242.inr-202-doecev.Berkeley.EDU
 4  128.32.0.249     gigE6-0-0.inr-666-doecev.Berkeley.EDU
 5  128.32.0.66      qsv-juniper--ucb-gw.calren2.net
 6  209.247.159.109  POS1-0.hsipaccess1.SanJose1.Level3.net
 7  *                ?
 8  64.159.1.46      ?
 9  209.247.9.170    pos8-0.hsa2.Atlanta2.Level3.net
10  66.185.138.33    pop2-atm-P0-2.atdn.net
11  *                ?
12  66.185.136.17    pop1-atl-P4-0.atdn.net
13  64.236.16.52     www4.cnn.com
* = no response from router; ? = no name resolution

Example Troubleshooting Results
• No packets go beyond your gateway
  – Gateway’s connection to Internet is dead
• Traceroute stops at intermediate point
  – Perhaps a blackhole
• Traceroute path has a loop
  – Transient or persistent forwarding loop
• Traceroute shows a very long path
  – Routing anomaly, route hijacking, etc.
• Traceroute shows very long delays
  – Delay or congestion on forward or reverse path
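A rough sketch of how the first three symptoms might be recognized automatically. The input format (a list of responder addresses, with `None` for an unanswered probe), the function name `diagnose`, and the returned strings are all assumptions made for illustration; the rules are heuristics, not a definitive diagnosis.

```python
from typing import List, Optional

def diagnose(hops: List[Optional[str]], destination: str) -> str:
    """Classify a traceroute result using simple heuristics.

    hops: responder address per TTL, None when no reply arrived.
    """
    answered = [h for h in hops if h is not None]
    if len(set(answered)) < len(answered):
        # The same address answered at two different TTLs.
        return "possible forwarding loop"
    if destination in answered:
        return "destination reached"
    if len(answered) <= 1:
        # At most the gateway replied.
        return "nothing beyond the gateway"
    return "possible blackhole after " + answered[-1]

if __name__ == "__main__":
    print(diagnose(["gw", "r1", "r2", "r1"], "dst"))
    print(diagnose(["gw", "r1", None, None], "dst"))
```

Because routers and firewalls may silently drop probes or replies, a "blackhole" verdict from such a rule is only a hint.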

Problems with Traceroute
• Missing responses
  – Routers might not send “Time-Exceeded”
  – Firewalls may drop the probe packets
  – “Time-Exceeded” reply may be dropped
• Misleading responses
  – Probes taken while the path is changing
  – Name not in DNS, or DNS entry misconfigured
• Mapping IP addresses
  – Mapping interfaces to a common router
  – Mapping interface/router to Autonomous System

Map Traceroute Hops to ASes
Traceroute output: (hop number, IP), with the mapped AS
 1  169.229.62.1     AS 25     Berkeley
 2  169.229.59.225   AS 25     Berkeley
 3  128.32.255.169   AS 25     Berkeley
 4  128.32.0.249     AS 25     Berkeley
 5  128.32.0.66      AS 11423  Calren
 6  209.247.159.109  AS 3356   Level 3
 7  *                AS 3356   Level 3
 8  64.159.1.46      AS 3356   Level 3
 9  209.247.9.170    AS 3356   Level 3
10  66.185.138.33    AS 1668   AOL
11  *                AS 1668   AOL
12  66.185.136.17    AS 1668   AOL
13  64.236.16.52     AS 5662   CNN
Need accurate IP-to-AS mappings (for network equipment).
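One common way to map a hop address to an AS is a longest-prefix match against a prefix-to-AS table. The sketch below uses Python's standard `ipaddress` module; the function name `ip_to_as` and the prefixes in the example table are hypothetical (the slides give only addresses and AS numbers, not prefix lengths).

```python
import ipaddress
from typing import Dict, Optional

def ip_to_as(ip: str, table: Dict[str, int]) -> Optional[int]:
    """Return the AS of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best_len, best_as = -1, None
    for prefix, asn in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_as = net.prefixlen, asn
    return best_as

if __name__ == "__main__":
    # Hypothetical table: prefix lengths are assumptions for illustration.
    table = {
        "128.32.0.0/16": 25,
        "128.32.0.64/30": 11423,
        "209.247.0.0/16": 3356,
    }
    for hop in ["128.32.0.249", "128.32.0.66", "209.247.9.170", "10.0.0.1"]:
        print(hop, ip_to_as(hop, table))
```

A linear scan is fine for a handful of prefixes; a full BGP table would call for a trie or similar structure.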

Candidate Ways to Get IP-to-AS Mapping
• Routing address registry
  – Voluntary public registry such as whois.radb.net
  – Used by prtraceroute and “NANOG traceroute”
  – Incomplete and quite out-of-date
    • Mergers, acquisitions, delegation to customers
• Origin AS in BGP paths
  – Public BGP routing tables such as RouteViews
  – Used to translate traceroute data to an AS graph
  – Incomplete and inaccurate… but usually right
    • Multiple Origin ASes, no mapping, wrong mapping

Example: BGP Table (“show ip bgp” at RouteViews)
Network          Next Hop         Metric  LocPrf  Weight  Path
3.0.0.0/8        205.215.45.50                    0       4006 701 80 i
                 167.142.3.6                      0       5056 701 80 i
                 157.22.9.7                       0       715 1 701 80 i
                 195.219.96.239                   0       8297 6453 701 80 i
                 195.211.29.254                   0       5409 6667 6427 3356 701 80 i
                 12.127.0.249                     0       7018 701 80 i
                 213.200.87.254   929             0       3257 701 80 i
9.184.112.0/20   205.215.45.50                    0       4006 6461 3786 i
                 195.66.225.254                   0       5459 6461 3786 i
                 203.62.248.4                     0       1221 3786 i
                 167.142.3.6                      0       5056 6461 3786 i
                 195.219.96.239                   0       8297 6461 3786 i
                 195.211.29.254                   0       5409 6461 3786 i
AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T
AS 3786 is DACOM (Korea), AS 1221 is Telstra
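The origin AS is the last AS number on each path, so a prefix-to-origin map can be derived directly from such a table. A minimal sketch, assuming the routes have already been parsed into (prefix, AS-path string) pairs and that paths contain no AS sets; the function names are invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def origin_as_map(routes: Iterable[Tuple[str, str]]) -> Dict[str, Set[int]]:
    """Collect the set of origin ASes seen for each prefix."""
    origins: Dict[str, Set[int]] = defaultdict(set)
    for prefix, as_path in routes:
        asns = as_path.split()
        if asns:
            origins[prefix].add(int(asns[-1]))   # last AS is the origin
    return dict(origins)

def multiple_origin_prefixes(origins: Dict[str, Set[int]]) -> Set[str]:
    """Prefixes announced by more than one origin AS."""
    return {p for p, o in origins.items() if len(o) > 1}

if __name__ == "__main__":
    routes = [
        ("3.0.0.0/8", "4006 701 80"),
        ("3.0.0.0/8", "7018 701 80"),
        ("9.184.112.0/20", "1221 3786"),
    ]
    print(origin_as_map(routes))
```

Prefixes returned by `multiple_origin_prefixes` are the ones where a single-valued IP-to-AS mapping is ambiguous.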

Why Would IP-to-AS Mapping Be Wrong?
• IP addresses of equipment
  – Interfaces on the routers, not end hosts
  – Identifies equipment in routing protocols
  – Doesn’t need to be globally visible or consistent
• Three reasons the mappings may be “wrong”
  – Addresses of Internet Exchange Points
  – Sibling ASes that share address space
  – ASes that don’t announce their addresses
• Look at traceroute path vs. BGP AS path
  – Traceroute path after IP-to-AS mapping
  – BGP AS path taken from the BGP table

Extra AS due to Internet eXchange Points
• IXP: shared place where providers meet
  – E.g., Mae-East, Mae-West, PAIX
  – Large number of fan-in and fan-out ASes
[Figure: traceroute AS path vs. BGP AS path among ASes A through G]
Ignore extra traceroute AS hop with high fan-in and fan-out
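The fan-in/fan-out heuristic can be sketched as follows, assuming traceroute results have already been converted to AS-level paths. The threshold, the function names, and the AS labels in the example (including "X" for the exchange point) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set

def high_fan_ases(as_paths: List[List[Hashable]], threshold: int) -> Set[Hashable]:
    """ASes with at least `threshold` distinct predecessors AND successors."""
    preds: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    succs: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for path in as_paths:
        for prev, cur in zip(path, path[1:]):
            succs[prev].add(cur)
            preds[cur].add(prev)
    return {
        a for a in preds
        if len(preds[a]) >= threshold and len(succs.get(a, ())) >= threshold
    }

def drop_hops(path: List[Hashable], ignored: Set[Hashable]) -> List[Hashable]:
    """Remove the ignored AS hops from a traceroute AS path."""
    return [a for a in path if a not in ignored]

if __name__ == "__main__":
    paths = [["A", "X", "F"], ["B", "X", "G"], ["C", "X", "F"]]
    ixps = high_fan_ases(paths, threshold=2)
    print(ixps, drop_hops(paths[0], ixps))
```

Large transit providers can also show high fan-in and fan-out, so in practice the threshold and any additional checks need care.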

Extra AS due to Sibling ASes
• Sibling: organizations with multiple ASes
  – E.g., Sprint AS 1239 and AS 1791
  – One AS numbers its equipment with addresses of another
[Figure: traceroute AS path vs. BGP AS path among ASes A through H]
Merge sibling ASes that “belong together” and treat them as if they were one AS.
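Merging siblings amounts to rewriting each AS to a canonical representative and collapsing the repeats this creates. A minimal sketch; the `sibling_of` mapping is assumed to be given, and the example path is made up apart from the Sprint AS numbers mentioned above.

```python
from typing import Dict, List

def merge_siblings(path: List[int], sibling_of: Dict[int, int]) -> List[int]:
    """Rewrite sibling ASes to one canonical AS and drop adjacent duplicates."""
    merged: List[int] = []
    for asn in path:
        canonical = sibling_of.get(asn, asn)
        if not merged or merged[-1] != canonical:
            merged.append(canonical)
    return merged

if __name__ == "__main__":
    # Treat AS 1791 as a sibling of AS 1239 (hypothetical path).
    print(merge_siblings([701, 1791, 1239, 80], {1791: 1239}))
```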

Unannounced Infrastructure Addresses
[Figure: ASes A, B, and C, with prefix 12.0.0.0/8]
C does not announce part of its address space in BGP (e.g., 12.1.2.0/24)
Fix the IP-to-AS map to associate 12.1.2.0/24 with C

Refining Initial IP-to-AS Mapping
• Start with initial IP-to-AS mapping
  – Mapping from BGP tables is usually correct
  – Good starting point for computing the mapping
• Collect many BGP and traceroute paths
  – Signaling and forwarding AS path usually match
  – Good way to identify mistakes in IP-to-AS map
• Successively refine the IP-to-AS mapping
  – Find add/change/delete that makes big difference
  – Base these “edits” on operational realities
http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf
http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf

Research Areas
• Better version of traceroute
  – Router support for active measurement
  – IPPM (IP Performance Metrics)
  – http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html
• Peer-to-peer troubleshooting
[Figure labels: www.cnn.com, “Yes”, “No”]

Passive Monitoring

Limitations of Active Measurements
• Active measurements: traceroute-like tools
  – Can’t probe in the past
  – Shows the effect, not the cause
[Figure: topology of AS 1, AS 2, AS 3, and AS 4, with User (s) and Web Server (d)]

Appealing to Peek Inside
• Passive measurements: public BGP data
[Figure: BGP update feeds flow into data collection (RouteViews, RIPE), then data correlation, yielding the root cause]

Inspect BGP Routing Changes
• Changes in paths to reach destination d
  – AS 1: “1 3 4” → “1 2 4”
  – AS 2: “2 4” (no change)
  – AS 3: “3 4” → “3 1 2 4”
  – AS 4: “4” (no change)
[Figure: topology of AS 1, AS 2, AS 3, and AS 4, with User (s) and Web Server (d)]

Idea #1: ASes in Paths Undergoing Change
• Key assumption
  – “The AS responsible for the change appears in the old and/or the new AS path to the destination.”
• If an AS has a routing change
  – All ASes in old and new paths may be responsible
  – Call these ASes the “suspect set”
• Combining across vantage points
  – Consider all ASes that had a routing change
  – Perform the intersection across the suspect sets
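The suspect-set idea can be sketched in a few lines. AS paths are represented as lists of AS numbers; the function names and the example changes are illustrative assumptions, and the result is only as good as the key assumption itself.

```python
from typing import List, Set, Tuple

Path = List[int]

def suspect_set(old: Path, new: Path) -> Set[int]:
    """All ASes on the old or the new path."""
    return set(old) | set(new)

def intersect_suspects(changes: List[Tuple[Path, Path]]) -> Set[int]:
    """Intersect suspect sets over vantage points that saw a change."""
    sets = [suspect_set(old, new) for old, new in changes if old != new]
    return set.intersection(*sets) if sets else set()

if __name__ == "__main__":
    changes = [
        ([1, 2, 4], [1, 2, 3, 4]),   # seen from AS 1
        ([2, 4], [2, 3, 4]),         # seen from AS 2
    ]
    print(intersect_suspects(changes))
```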

Idea #2: Excluding ASes in Non-Changing Paths
• Key assumption
  – “If an AS has no routing change, the ASes in the path are not responsible and can be excluded.”
• Example
  – AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4}
  – AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4}
  – AS 3: “3 4” (no change): non-suspects {3, 4}
[Figure: topology of AS 1, AS 2, AS 3, and AS 4]
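Applying the exclusion rule literally to the example gives the sketch below: intersect the suspect sets of changed paths, then subtract every AS seen on an unchanged path. The function name and data layout are assumptions for illustration.

```python
from typing import List, Optional, Set, Tuple

Path = List[int]

def candidates(observations: List[Tuple[Path, Path]]) -> Set[int]:
    """Suspects from changed paths, minus ASes on unchanged paths."""
    suspects: Optional[Set[int]] = None
    cleared: Set[int] = set()
    for old, new in observations:
        if old == new:
            cleared |= set(old)                      # non-suspects
        else:
            s = set(old) | set(new)
            suspects = s if suspects is None else suspects & s
    return (suspects or set()) - cleared

if __name__ == "__main__":
    observations = [
        ([1, 2, 4], [1, 2, 3, 4]),   # AS 1
        ([2, 4], [2, 3, 4]),         # AS 2
        ([3, 4], [3, 4]),            # AS 3, no change
    ]
    print(candidates(observations))
```

On this input the remaining candidate set is {2}; whether that is the true culprit depends on the key assumption holding.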

Idea #3: Blaming the ASes in the Better Path
• Key assumption
  – “The better path is the one that contains the AS responsible for the change.”
• Example
  – “1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1, 2, 4} as the suspects (not AS 3)
• Heuristics for identifying the “better” path
  – E.g., the shorter AS path
[Figure: topology of AS 1, AS 2, AS 3, and AS 4]
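Using the shorter-AS-path heuristic as the definition of "better", the suspects are simply the ASes on the shorter of the two paths. A minimal sketch with an invented function name; ties are resolved in favor of the old path, which is an arbitrary choice made here.

```python
from typing import List, Set

def better_path_suspects(old: List[int], new: List[int]) -> Set[int]:
    """Blame the ASes on the shorter (assumed better) of the two paths."""
    better = min(old, new, key=len)   # on a tie, min() keeps `old`
    return set(better)

if __name__ == "__main__":
    print(better_path_suspects([1, 2, 4], [1, 2, 3, 4]))
```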

Idea #4: Combining Across Destinations
• Key assumption
  – “All destinations experiencing routing changes in a short period of time have a common cause.”
• Exploiting the observation
  – Form suspect sets for each destination
  – Perform intersections of the sets across the destinations
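One possible reading of "a short period of time" is a fixed window measured from the first change in a burst. The sketch below groups per-destination suspect sets that way and intersects them within each group. The event format, the window rule, and all names are assumptions, not something the slides specify.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[float, str, Set[int]]   # (timestamp, destination, suspect set)

def group_and_intersect(events: List[Event], window: float) -> List[Dict]:
    """Group events within `window` of a group's first event; intersect suspects."""
    groups: List[Dict] = []
    for ts, dest, suspects in sorted(events, key=lambda e: e[0]):
        if groups and ts - groups[-1]["start"] <= window:
            groups[-1]["dests"].add(dest)
            groups[-1]["suspects"] &= suspects
        else:
            groups.append({"start": ts, "dests": {dest}, "suspects": set(suspects)})
    return groups

if __name__ == "__main__":
    events = [
        (0.0, "d1", {1, 2, 4}),
        (5.0, "d2", {2, 4, 7}),
        (100.0, "d3", {3, 4}),
    ]
    for g in group_and_intersect(events, window=30.0):
        print(g["dests"], g["suspects"])
```

If two unrelated events happen to fall in the same window, the intersection can come out empty or misleading, which is one way the key assumption can fail.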

Difficulties With Root-Cause Analysis
• Misleading BGP routing changes
  – Responsible AS not on old or new path
  – Looking across destinations doesn’t resolve the cause
• Missing routing changes
  – Some routers in an AS don’t have a change
  – Some subnets are not visible in BGP
  – Some internal changes are not visible in BGP

Misleading BGP Changes
Myth: The AS responsible for the change appears in the old or the new AS path.
[Figure: AS topology with nodes 1 through 11 and a BGP data collection point; old path: 1, 2, 8, 9, 10; new path: 1, 4, 5, 6, 7, 10]

Misleading BGP Changes
Myth: Looking at routing changes across prefixes resolves causes
[Figure: AS 1, AS 2, and AS 3 with routers A, B, and C, destinations d1, d2, and d3, numeric labels 7, 10, and 12, and a BGP data collection point]
Changes for d2, but not for d1 and d3

Missing Routing Changes
Myth: The BGP updates from a single router accurately represent the AS
[Figure: AS 1 and AS 2 with routers A, B, C, and D, destination dst, numeric labels 7, 6, 12, and 10, and a BGP data collection point that sees no change]

Missing Routing Changes
Myth: BGP data from a router accurately represents changes on that router.
[Figure: router A and a BGP data collection point, with prefixes 12.1.1.0/24 and 12.1.0.0/16]

Missing Routing Changes
Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope.
[Figure: AS 1 and AS 2 with routers A, B, C, and D, destination dst, numeric labels 5, 6, 12, 10, and 7, and a BGP data collection point]

Hybrid of Active and Passive Monitoring
[Figure: topology of AS 1 through AS 4 with User (s), Web Server (d), and Omni 1 through Omni 4; annotations: (i, s, d, t): failure, link (3, 4); (j, s, d, t’): failure, link (3, 4)]

Research Questions
• Understanding if root-cause analysis can work
  – How many vantage points are needed?
  – Do the assumptions usually hold?
  – Can algorithms tolerate occasional violations?
  – Can some additional information help?
• Distributed algorithms for root-cause analysis
  – Can ASes cooperate in distributed fashion?
  – How to prevent or detect ASes that cheat?
  – Do all ASes have to participate?
  – Other hybrids of active and passive monitoring?

Conclusions
• Troubleshooting is important
  – Detect, diagnose, and fix problems
  – Accountability and service-level agreements
• Troubleshooting is hard
  – Active measurement (e.g., traceroute) not enough
  – Root-cause analysis techniques are not enough
• New innovation necessary
  – Hybrid active/passive approaches
  – Router support for active measurement
  – Routing protocol extensions for troubleshooting

For Next Time: From Inside an AS
• Two papers
  – “OSPF monitoring: Architecture, design, and deployment experience”
  – “Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network”
• Optional reading
  – Materials from Packet Design and Ipsum Networks
• Review only of first paper
  – Summary
  – Why accept
  – Why reject
  – Future work