Internet Routing (COS 598A)
Today: Root-Cause Analysis
Jennifer Rexford
http://www.cs.princeton.edu/~jrex/teaching/spring2005
Tuesdays/Thursdays 11:00am-12:20pm

Outline
• Network troubleshooting
  – Motivation for network troubleshooting
  – Investigating from the edge vs. inside
• Active probing
  – Traceroute
  – Mapping IP addresses to AS numbers
• Passive monitoring
  – Analyzing BGP update streams
  – Identifying location and cause of routing change
  – Limitations of the approach

Network Troubleshooting
“Why can’t I reach www.cnn.com?”
“Why is the performance bad?”
[Figure labels: Internet, www.cnn.com]

Reachability Problems: What Could be Wrong?
• End-host problem
  – Web server down
  – DNS server down, or misconfigured
• Forwarding-path problem
  – Packet filter or firewall restricting access
  – Mismatch in Maximum Transmission Unit (MTU)
• Routing problem
  – User or server disconnected from Internet
  – Blackhole dropping all packets
  – Persistent loop

Performance Problem: What Could be Wrong?
• End-host problem
  – Overloaded Web server
  – Overloaded DNS server
  – Overloaded user machine
• Forwarding-path problem
  – High round-trip time
  – Link congestion
• Routing problem
  – Long-term routing instability
  – Transient disruption during convergence

Motivation for Troubleshooting
• Improving performance
  – Detect, diagnose, and fix the problem
  – Pick a path through another provider
  – Pick a different path in an overlay network
• Establishing accountability
  – Enforce Service Level Agreements
  – Rate service providers
• Characterizing the Internet
  – Understand causes of performance problems
  – Understand challenges of troubleshooting

Troubleshooting Outside vs. Inside
• Outside: from the network edge (today’s topic)
  – Who: users and researchers, and operators troubleshooting problems outside their network
  – Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms
  – Challenges: inference from very limited data
• Inside: from inside the network
  – Who: operators running a network
  – Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files
  – Challenges: collecting and joining the data

Active Probing

Pros and Cons of Active Probing
• Advantages
  – Can run from any end system
  – Measure the actual forwarding path
    • See black-holes, loops, and delays directly
• Disadvantages
  – Effects of routing changes, not the cause
  – Current path, not the path used in the past
    • Requires frequent probes to observe the changes
  – Shows only properties of round-trip path
    • Hard to tell if problem is on forward vs. reverse

Traceroute: Measuring the Forwarding Path
• Time-To-Live field in IP packet header
  – Source sends a packet with a TTL of n
  – Each router along the path decrements the TTL
  – “TTL exceeded” sent when TTL reaches 0
• Traceroute tool exploits this TTL behavior
[Figure: source sends probes with TTL=1 and TTL=2 toward the destination; routers return “Time exceeded”]
Send packets with TTL=1, 2, 3, … and record source of “time exceeded” message
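The TTL mechanism can be illustrated with a small simulation. This is only a sketch, not how the real tool is implemented: it sends no packets, the router names are hypothetical, and `probe`, `traceroute`, and the `silent` parameter are names invented here.

```python
from typing import FrozenSet, List, Optional


def probe(path: List[str], ttl: int) -> Optional[str]:
    """Return the node at which a probe with this TTL expires.

    `path` lists the routers in order and ends with the destination
    (hypothetical names). Each node decrements the TTL; the node that
    brings it to 0 is the one that answers. None means the TTL is
    larger than the path length.
    """
    for node in path:
        ttl -= 1
        if ttl == 0:
            return node
    return None


def traceroute(path: List[str],
               silent: FrozenSet[str] = frozenset(),
               max_ttl: int = 30) -> List[str]:
    """Send probes with TTL=1, 2, 3, ... and record who answers.

    Nodes in `silent` never reply (assumption: modelled as "*").
    """
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder = probe(path, ttl)
        if responder is None:
            break
        hops.append("*" if responder in silent else responder)
        if responder == path[-1]:  # reached the destination
            break
    return hops


if __name__ == "__main__":
    print(traceroute(["r1", "r2", "r3", "dst"], silent=frozenset({"r2"})))
```

A silent router shows up as "*" while later hops still answer, which is the pattern seen when a router does not send “time exceeded” messages.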

Example Traceroute Output (Berkeley to CNN)
Hop  IP address        DNS name
 1   169.229.62.1      inr-daedalus-0.CS.Berkeley.EDU
 2   169.229.59.225    soda-cr-1-1-soda-br-6-2
 3   128.32.255.169    vlan242.inr-202-doecev.Berkeley.EDU
 4   128.32.0.249      gigE6-0-0.inr-666-doecev.Berkeley.EDU
 5   128.32.0.66       qsv-juniper--ucb-gw.calren2.net
 6   209.247.159.109   POS1-0.hsipaccess1.SanJose1.Level3.net
 7   *                 ?
 8   64.159.1.46       ?
 9   209.247.9.170     pos8-0.hsa2.Atlanta2.Level3.net
10   66.185.138.33     pop2-atm-P0-2.atdn.net
11   *                 ?
12   66.185.136.17     pop1-atl-P4-0.atdn.net
13   64.236.16.52      www4.cnn.com
(* = no response from router; ? = no name resolution)

Example Troubleshooting Results
• No packets go beyond your gateway
  – Gateway’s connection to Internet is dead
• Traceroute stops at intermediate point
  – Perhaps a blackhole
• Traceroute path has a loop
  – Transient or persistent forwarding loop
• Traceroute shows a very long path
  – Routing anomaly, route hijacking, etc.
• Traceroute shows very long delays
  – Delay or congestion on forward or reverse path
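One of these symptoms, a loop in the traceroute path, can be checked mechanically. The sketch below assumes the hops are given as a list of address strings with "*" for missing responses. The function name `find_loop` and the addresses (from the 192.0.2.0/24 documentation range) are illustrative only.

```python
from typing import List, Optional


def find_loop(hops: List[str]) -> Optional[str]:
    """Return the first address that appears twice, or None.

    "*" entries (no response) are ignored. A repeated address is only
    a hint of a forwarding loop; it does not say whether the loop is
    transient or persistent.
    """
    seen = set()
    for hop in hops:
        if hop == "*":
            continue
        if hop in seen:
            return hop
        seen.add(hop)
    return None


if __name__ == "__main__":
    # Hypothetical hop list: the probe revisits 192.0.2.2.
    hops = ["192.0.2.1", "192.0.2.2", "*", "192.0.2.3", "192.0.2.2"]
    print(find_loop(hops))
```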

Problems with Traceroute
• Missing responses
  – Routers might not send “Time-Exceeded”
  – Firewalls may drop the probe packets
  – “Time-Exceeded” reply may be dropped
• Misleading responses
  – Probes taken while the path is changing
  – Name not in DNS, or DNS entry misconfigured
• Mapping IP addresses
  – Mapping interfaces to a common router
  – Mapping interface/router to Autonomous System

Map Traceroute Hops to ASes
Traceroute output: (hop number, IP)
Hop  IP address        AS         Organization
 1   169.229.62.1      AS 25      Berkeley
 2   169.229.59.225    AS 25      Berkeley
 3   128.32.255.169    AS 25      Berkeley
 4   128.32.0.249      AS 25      Berkeley
 5   128.32.0.66       AS 11423   Calren
 6   209.247.159.109   AS 3356    Level 3
 7   *                 AS 3356    Level 3
 8   64.159.1.46       AS 3356    Level 3
 9   209.247.9.170     AS 3356    Level 3
10   66.185.138.33     AS 1668    AOL
11   *                 AS 1668    AOL
12   66.185.136.17     AS 1668    AOL
13   64.236.16.52      AS 5662    CNN
Need accurate IP-to-AS mappings (for network equipment).
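Once each hop carries an AS number, the AS-level path is the per-hop list with consecutive repeats collapsed. A minimal sketch using the per-hop AS numbers from the slide. The helper name `as_path` and the choice to skip unmapped hops (`None`) are assumptions made for this example.

```python
from itertools import groupby
from typing import List, Optional

# Per-hop AS numbers for hops 1..13, as listed on the slide.
HOP_AS = [25, 25, 25, 25, 11423, 3356, 3356, 3356, 3356,
          1668, 1668, 1668, 5662]


def as_path(hop_as: List[Optional[int]]) -> List[int]:
    """Collapse consecutive hops in the same AS into one AS-level hop.

    Hops with no mapping (None) are skipped (assumption), so they do
    not split a run of hops inside one AS.
    """
    mapped = [asn for asn in hop_as if asn is not None]
    return [asn for asn, _ in groupby(mapped)]


if __name__ == "__main__":
    print(as_path(HOP_AS))  # [25, 11423, 3356, 1668, 5662]
```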

Candidate Ways to Get IP-to-AS Mapping
• Routing address registry
  – Voluntary public registry such as whois.radb.net
  – Used by prtraceroute and “NANOG traceroute”
  – Incomplete and quite out-of-date
    • Mergers, acquisitions, delegation to customers
• Origin AS in BGP paths
  – Public BGP routing tables such as RouteViews
  – Used to translate traceroute data to an AS graph
  – Incomplete and inaccurate… but usually right
    • Multiple Origin ASes, no mapping, wrong mapping

Example: BGP Table (“show ip bgp” at RouteViews)
Network          Next Hop         Metric  LocPrf  Weight  Path
3.0.0.0/8        205.215.45.50                    0       4006 701 80 i
                 167.142.3.6                      0       5056 701 80 i
                 157.22.9.7                       0       715 1 701 80 i
                 195.219.96.239                   0       8297 6453 701 80 i
                 195.211.29.254                   0       5409 6667 6427 3356 701 80 i
                 12.127.0.249                     0       7018 701 80 i
                 213.200.87.254   929             0       3257 701 80 i
9.184.112.0/20   205.215.45.50                    0       4006 6461 3786 i
                 195.66.225.254                   0       5459 6461 3786 i
                 203.62.248.4                     0       1221 3786 i
                 167.142.3.6                      0       5056 6461 3786 i
                 195.219.96.239                   0       8297 6461 3786 i
                 195.211.29.254                   0       5409 6461 3786 i
[Status column as extracted: “* * * *> * *” beside 3.0.0.0/8, “* *> * * *” beside 9.184.112.0/20]
AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T
AS 3786 is DACOM (Korea), AS 1221 is Telstra
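The origin AS is the last AS number in the path, so a prefix-to-origin map can be built directly from table rows like these. A minimal sketch over a few of the routes above. `origin_as` and `build_origin_map` are names invented here, and the code assumes the path is a plain space-separated list of AS numbers (no AS sets) with the trailing origin code “i” already removed.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (prefix, AS path) pairs copied from the table above.
ROUTES: List[Tuple[str, str]] = [
    ("3.0.0.0/8", "4006 701 80"),
    ("3.0.0.0/8", "7018 701 80"),
    ("3.0.0.0/8", "5409 6667 6427 3356 701 80"),
    ("9.184.112.0/20", "1221 3786"),
    ("9.184.112.0/20", "5459 6461 3786"),
]


def origin_as(path: str) -> int:
    """Return the origin AS, i.e. the last AS number in the path."""
    return int(path.split()[-1])


def build_origin_map(routes: List[Tuple[str, str]]) -> Dict[str, Set[int]]:
    """Map each prefix to the set of origin ASes seen for it.

    A set with more than one element would indicate multiple origin
    ASes for the same prefix.
    """
    origins: Dict[str, Set[int]] = defaultdict(set)
    for prefix, path in routes:
        origins[prefix].add(origin_as(path))
    return dict(origins)


if __name__ == "__main__":
    for prefix, asns in build_origin_map(ROUTES).items():
        print(prefix, sorted(asns))
```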

Why Would IP-to-AS Mapping Be Wrong?
• IP addresses of equipment
  – Interfaces on the routers, not end hosts
  – Identifies equipment in routing protocols
  – Doesn’t need to be globally visible or consistent
• Three reasons the mappings may be “wrong”
  – Addresses of Internet Exchange Points
  – Sibling ASes that share address space
  – ASes that don’t announce their addresses
• Look at traceroute path vs. BGP AS path
  – Traceroute path after IP-to-AS mapping
  – BGP AS path taken from the BGP table

Extra AS due to Internet eXchange Points
• IXP: shared place where providers meet
  – E.g., Mae-East, Mae-West, PAIX
  – Large number of fan-in and fan-out ASes
[Figure: traceroute AS path vs. BGP AS path among ASes A to G]
Ignore extra traceroute AS hop with high fan-in and fan-out
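The fan-in/fan-out heuristic can be sketched by counting, for every AS, how many distinct ASes precede and follow it across a collection of traceroute-derived AS paths. The paths, the label "X" for the exchange point, the function names, and the threshold of 3 are all assumptions for illustration. The slide does not give a specific threshold.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fan_in_out(paths: List[List[str]]) -> Dict[str, Tuple[int, int]]:
    """Return {AS: (fan-in, fan-out)} over all adjacent pairs in the paths."""
    preds, succs = defaultdict(set), defaultdict(set)
    for path in paths:
        for prev, cur in zip(path, path[1:]):
            succs[prev].add(cur)
            preds[cur].add(prev)
    ases = set(preds) | set(succs)
    return {a: (len(preds[a]), len(succs[a])) for a in ases}


def ixp_candidates(paths: List[List[str]], min_fan: int = 3) -> List[str]:
    """ASes whose fan-in and fan-out both reach `min_fan` (assumed cutoff)."""
    return sorted(a for a, (fi, fo) in fan_in_out(paths).items()
                  if fi >= min_fan and fo >= min_fan)


if __name__ == "__main__":
    # Hypothetical traceroute AS paths; "X" plays the role of the IXP.
    paths = [["A", "X", "D"], ["B", "X", "E"], ["C", "X", "F"], ["A", "B", "E"]]
    print(ixp_candidates(paths))  # ['X']
```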

Extra AS due to Sibling ASes
• Sibling: organizations with multiple ASes
  – E.g., Sprint AS 1239 and AS 1791
  – One AS numbers its equipment with addresses of another
[Figure: traceroute AS path vs. BGP AS path among ASes A to H]
Merge sibling ASes: they “belong together” as if they were one AS.

Unannounced Infrastructure Addresses
• C does not announce part of its address space in BGP (e.g., 12.1.2.0/24)
• Fix the IP-to-AS map to associate 12.1.2.0/24 with C
[Figure: ASes A, B, and C and the prefix 12.0.0.0/8, with AS-path labels]
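The effect of this fix can be seen with a longest-prefix-match lookup built on Python's standard `ipaddress` module. This sketch assumes, for illustration only, that the covering prefix 12.0.0.0/8 maps to AS "A". The slide does not state which AS originates it. The function name `lookup` is likewise hypothetical.

```python
import ipaddress
from typing import Dict, Optional


def lookup(ip_to_as: Dict[str, str], address: str) -> Optional[str]:
    """Return the AS of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    best = None  # (network, AS) of the longest match so far
    for prefix, asn in ip_to_as.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None


if __name__ == "__main__":
    # Assumption: only the covering /8 is visible in BGP, mapped to "A".
    ip_to_as = {"12.0.0.0/8": "A"}
    print(lookup(ip_to_as, "12.1.2.9"))   # A: router interface blamed on the wrong AS

    # The refinement: associate the unannounced /24 with C.
    ip_to_as["12.1.2.0/24"] = "C"
    print(lookup(ip_to_as, "12.1.2.9"))   # C
    print(lookup(ip_to_as, "12.9.9.9"))   # A: other addresses are unaffected
```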

Refining Initial IP-to-AS Mapping
• Start with initial IP-to-AS mapping
  – Mapping from BGP tables is usually correct
  – Good starting point for computing the mapping
• Collect many BGP and traceroute paths
  – Signaling and forwarding AS path usually match
  – Good way to identify mistakes in IP-to-AS map
• Successively refine the IP-to-AS mapping
  – Find add/change/delete that makes big difference
  – Base these “edits” on operational realities
http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf
http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf

Research Areas
• Better version of traceroute
  – Router support for active measurement
  – IPPM (IP Performance Measurement)
  – http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html
• Peer-to-peer troubleshooting
[Figure: peers answering “Yes” or “No” about www.cnn.com]

Passive Monitoring

Limitations of Active Measurements
• Active measurements: traceroute-like tools
  – Can’t probe in the past
  – Shows the effect, not the cause
[Figure: User (s) and Web Server (d) connected through AS 1, AS 2, AS 3, and AS 4]

Appealing to Peek Inside
• Passive measurements: public BGP data
[Figure: BGP update feeds → Data Collection (RouteViews, RIPE) → Data Correlation → root cause]

Inspect BGP Routing Changes
• Changes in paths to reach destination d
  – AS 1: “1 3 4” → “1 2 4”
  – AS 2: “2 4” (no change)
  – AS 3: “3 4” → “3 1 2 4”
  – AS 4: “4” (no change)
[Figure: User (s) and Web Server (d) connected through AS 1, AS 2, AS 3, and AS 4]

Idea #1: ASes in Paths Undergoing Change
• Key assumption
  – “The AS responsible for the change appears in the old and/or the new AS path to the destination.”
• If an AS has a routing change
  – All ASes in old and new paths may be responsible
  – Call these ASes the “suspect set”
• Combining across vantage points
  – Consider all ASes that had a routing change
  – Perform the intersection across the suspect sets
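A minimal sketch of the suspect-set computation, using the two changing paths from the “Inspect BGP Routing Changes” slide (AS 1 moving from “1 3 4” to “1 2 4”, AS 3 from “3 4” to “3 1 2 4”). The function names are invented for this example.

```python
from typing import List, Set, Tuple

Path = List[int]


def suspect_set(old: Path, new: Path) -> Set[int]:
    """All ASes on the old or the new path may be responsible."""
    return set(old) | set(new)


def intersect_suspects(changes: List[Tuple[Path, Path]]) -> Set[int]:
    """Intersect the suspect sets of every vantage point that saw a change."""
    sets = [suspect_set(old, new) for old, new in changes]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    changes = [
        ([1, 3, 4], [1, 2, 4]),      # as seen from AS 1
        ([3, 4], [3, 1, 2, 4]),      # as seen from AS 3
    ]
    print(sorted(intersect_suspects(changes)))  # [1, 2, 3, 4]
```

In this small example both suspect sets contain all four ASes, so the intersection alone does not narrow down the cause.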

Idea #2: Excluding ASes in Non-Changing Paths
• Key assumption
  – “If an AS has no routing change, the ASes in the path are not responsible and can be excluded.”
• Example
  – AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4}
  – AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4}
  – AS 3: “3 4” (no change): non-suspects {3, 4}
[Figure: topology of AS 1, AS 2, AS 3, and AS 4]
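The example can be worked through in a few lines: intersect the suspect sets of the changing paths, then remove every AS that lies on a non-changing path. This is a sketch under the slide's key assumption, and the function name is hypothetical.

```python
from typing import List, Set, Tuple

Path = List[int]


def suspects_after_exclusion(changes: List[Tuple[Path, Path]],
                             unchanged: List[Path]) -> Set[int]:
    """Intersect suspect sets, then drop ASes seen on unchanged paths."""
    suspects = None
    for old, new in changes:
        current = set(old) | set(new)
        suspects = current if suspects is None else suspects & current
    suspects = suspects or set()
    for path in unchanged:
        suspects -= set(path)
    return suspects


if __name__ == "__main__":
    changes = [
        ([1, 2, 4], [1, 2, 3, 4]),   # AS 1: suspects {1, 2, 3, 4}
        ([2, 4], [2, 3, 4]),         # AS 2: suspects {2, 3, 4}
    ]
    unchanged = [[3, 4]]             # AS 3: non-suspects {3, 4}
    print(sorted(suspects_after_exclusion(changes, unchanged)))  # [2]
```

With these inputs the intersection is {2, 3, 4}, and removing the non-suspects {3, 4} leaves AS 2.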

Idea #3: Blaming the ASes in the Better Path
• Key assumption
  – “The better path is the one that contains the AS responsible for the change.”
• Example
  – “1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1, 2, 4} as the suspects (not AS 3)
• Heuristics for identifying the “better” path
  – E.g., the shorter AS path
[Figure: topology of AS 1, AS 2, AS 3, and AS 4]
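A sketch of this heuristic with “shorter AS path” standing in for “better”. How to treat two paths of equal length is not specified on the slide. Falling back to the union of both paths is an assumption made here, as is the function name.

```python
from typing import List, Set

Path = List[int]


def better_path_suspects(old: Path, new: Path) -> Set[int]:
    """Blame the ASes on the better (here: shorter) of the two paths."""
    if len(old) < len(new):
        return set(old)
    if len(new) < len(old):
        return set(new)
    # Tie: no better path by this heuristic, keep both (assumption).
    return set(old) | set(new)


if __name__ == "__main__":
    print(sorted(better_path_suspects([1, 2, 4], [1, 2, 3, 4])))  # [1, 2, 4]
```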

Idea #4: Combining Across Destinations
• Key assumption
  – “All destinations experiencing routing changes in a short period of time have a common cause.”
• Exploiting the observation
  – Form suspect sets for each destination
  – Perform intersections of the sets across the destinations
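One way to sketch this is to group routing changes that occur close together in time and intersect the per-destination suspect sets within each group. The event format, the gap-based grouping rule, the 30-second window, and all data below are assumptions for illustration. The slide does not define “a short period of time”.

```python
from typing import List, Set, Tuple

# (timestamp in seconds, destination, suspect set for that destination)
Event = Tuple[float, str, Set[int]]


def group_by_time(events: List[Event], window: float) -> List[List[Event]]:
    """Put an event in the current group if it follows the previous
    event by at most `window` seconds; otherwise start a new group."""
    groups: List[List[Event]] = []
    for ev in sorted(events, key=lambda e: e[0]):
        if groups and ev[0] - groups[-1][-1][0] <= window:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups


def common_suspects(group: List[Event]) -> Set[int]:
    """Intersect the suspect sets of all destinations in one group."""
    return set.intersection(*(set(s) for _, _, s in group))


if __name__ == "__main__":
    # Hypothetical events for three destinations.
    events = [(0, "d1", {1, 2, 4}), (5, "d2", {2, 3, 4}), (300, "d3", {5, 6})]
    for group in group_by_time(events, window=30):
        print([d for _, d, _ in group], sorted(common_suspects(group)))
```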

Difficulties With Root-Cause Analysis
• Misleading BGP routing changes
  – Responsible AS not on old or new path
  – Looking across destinations doesn’t resolve
• Missing routing changes
  – Some routers in an AS don’t have a change
  – Some subnets are not visible in BGP
  – Some internal changes are not visible in BGP

Misleading BGP Changes
Myth: The AS responsible for the change appears in the old or the new AS path.
old: 1, 2, 8, 9, 10
new: 1, 4, 5, 6, 7, 10
[Figure: topology of ASes 1 to 11 with the BGP data collection point]

Misleading BGP Changes
Myth: Looking at routing changes across prefixes resolves causes.
Changes for d2, but not for d1 and d3
[Figure: AS 1 with routers A, B, and C (numeric labels 7, 10, 12), AS 2, and AS 3; destinations d1, d2, d3; BGP data collection point]

Missing Routing Changes
Myth: The BGP updates from a single router accurately represent the AS.
[Figure: AS 1 with routers A, B, C, and D (numeric labels 7, 6, 12, 10), AS 2, and destination dst; the BGP data collection point sees “No change”]

Missing Routing Changes
Myth: BGP data from a router accurately represents changes on that router.
[Figure: router A with prefixes 12.1.1.0/24 and 12.1.0.0/16; BGP data collection point]

Missing Routing Changes
Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope.
[Figure: AS 1 with routers A, B, C, and D (numeric labels 5, 6, 12, 10, 7), AS 2, and destination dst; BGP data collection point]

Hybrid of Active and Passive Monitoring
[Figure: User (s) and Web Server (d) connected through AS 1, AS 2, AS 3, and AS 4, each AS with an “Omni” node (Omni 1 to Omni 4); reports from i and j: (i, s, d, t) failure link (3, 4); (j, s, d, t’) failure link (3, 4)]

Research Questions
• Understanding if root-cause analysis can work
  – How many vantage points are needed?
  – Do the assumptions usually hold?
  – Can algorithms tolerate occasional violations?
  – Can some additional information help?
• Distributed algorithms for root-cause analysis
  – Can ASes cooperate in distributed fashion?
  – How to prevent or detect ASes that cheat?
  – Do all ASes have to participate?
  – Other hybrids of active and passive monitoring?

Conclusions
• Troubleshooting is important
  – Detect, diagnose, and fix problems
  – Accountability and service-level agreements
• Troubleshooting is hard
  – Active measurement (e.g., traceroute) not enough
  – Root-cause analysis techniques are not enough
• New innovation necessary
  – Hybrid active/passive approaches
  – Router support for active measurement
  – Routing protocol extensions for troubleshooting

For Next Time: From Inside an AS
• Two papers
  – “OSPF monitoring: Architecture, design, and deployment experience”
  – “Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network”
• Optional reading
  – Materials from Packet Design and Ipsum Networks
• Review only of first paper
  – Summary
  – Why accept
  – Why reject
  – Future work