Towards an Internet that Never Fails Hari Balakrishnan
Towards an Internet that “Never Fails”
Hari Balakrishnan, MIT
Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru
What We Should Aim Toward
• Carrier airlines (2002 FAA Fact Book)
  - 41 accidents, 6.7 million flights (five “nines” availability)
• 911 phone service (1993 NRIC report)
  - 29 minutes downtime per year per line (four “nines” availability)
• Standard phone service (various sources)
  - 53 minutes downtime per year per line (four “nines” availability)
• The Internet?
  - One to two “nines”
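
The "nines" figures on this slide can be checked with a little arithmetic. The sketch below assumes that the number of nines is the negative base-10 logarithm of the unavailable fraction, and that a year has 365 days; the function and variable names are illustrative and not from the talk.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def nines(unavailability: float) -> float:
    """Return the number of 'nines' for a given unavailable fraction."""
    return -math.log10(unavailability)


# Unavailable fractions taken from the figures on the slide.
airline = 41 / 6.7e6             # accidents per flight
phone_911 = 29 / MINUTES_PER_YEAR   # downtime minutes per year per line
phone_std = 53 / MINUTES_PER_YEAR

for name, u in [("airlines", airline), ("911", phone_911), ("phone", phone_std)]:
    print(f"{name}: availability {1 - u:.4%}, roughly {nines(u):.1f} nines")
```

The nines counts are rounded: 99.99% availability allows about 52.6 minutes of downtime per year, so 53 minutes sits just under four nines.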
Example Catastrophic Failures
• “…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services... passed bad router information from one of its customers onto Sprint.” -- news.com, April 25, 1997
• “Microsoft's websites were offline for up to 23 hours... because of a [router] misconfiguration… it took nearly a day to determine what was wrong and undo the changes.” -- wired.com, January 25, 2001
• “WorldCom Inc… suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems… affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."” -- cnn.com, October 3, 2002
• “A number of Covad customers went out from 5 pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level 3 data center, which later was described as a route leak (misconfiguration).” -- dslreports.com, February 23, 2004
NANOG List Failure “Analysis”
• More than 70% of threads discussing failures relate to router configuration or route announcement problems
• Note: only includes problems openly discussed on this list
Faults and Failures
• Fault = underlying defect in a component that causes it to violate a specification
  - Latent or active (i.e., causing errors)
• Unmasked faults (errors) cause failures
  - Failure of a subsystem (spec violation) causes a fault in the system
• Internet faults occur for complex reasons
  - Hardware, software, protocol, design, implementation, operational faults: could be triggered by malice
• Internet failure: A cannot communicate with B
Three Directions
• Configuration as programming
  - Defines BGP behavior
  - Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
• End-to-end routing
  - Exposing multiple paths to end systems (and stubs)
Today: Reactive Operation
“What happens if I tweak this policy…?”
• Problems cause downtime
• Problems often not immediately apparent
Coping with Complexity
• View configuration as (distributed) programming
  - Large-scale: over 1M lines of code in some networks
• Programming tools to reduce fault frequency
  - Static analysis can detect many faults [rcc]
  - Sandboxing to overcome current “stimulus-response” reasoning [FR 03]
• Centralize configuration platform
  - More “intentional” config specs
  - Push configs to routers
  - Push routes to routers [RCP: F+04]
  - Use static analysis and sandboxing tools
Proactive Operation with rcc
http://nms.csail.mit.edu/rcc
[Figure: workflow Configure → Detect Faults → Deploy. rcc takes the distributed router configurations (single AS) and a correctness specification, builds a normalized representation and a set of constraints, and outputs faults.]
• Represent complex, distributed configuration
• Define a correctness specification
• Map specification to constraints
Correctness Specification
• Path Visibility: every destination with a usable path has a route advertisement
  - If there exists a path, then there exists a route
  - Example violation: signaling partition
• Route Validity: every route advertisement corresponds to a usable path
  - If there exists a route, then there exists a path
  - Example violation: routing loop
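
The two properties can be illustrated on a toy model. The sketch below is not how rcc works (rcc analyzes router configurations); it assumes a hypothetical directed topology `links` and a hypothetical table `routes` of the destinations each router has learned, and it reports pairs that violate each property. A route with no usable path stands in here for route-validity violations in general.

```python
from collections import deque


def reachable(links, src):
    """Return the set of nodes reachable from src over directed links (BFS)."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {src}


def check(links, routes):
    """Return (path-visibility violations, route-validity violations)."""
    visibility, validity = [], []
    for router in links:
        paths = reachable(links, router)
        learned = routes.get(router, set())
        # Path exists but no route was learned: path visibility violated.
        visibility += [(router, d) for d in sorted(paths - learned)]
        # Route was learned but no path exists: route validity violated.
        validity += [(router, d) for d in sorted(learned - paths)]
    return visibility, validity


# Hypothetical four-router example; D is disconnected.
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
routes = {"A": {"B"}, "B": {"A", "C"}, "C": {"A", "B", "D"}}

visibility, validity = check(links, routes)
print("path visibility violations:", visibility)
print("route validity violations:", validity)
```

In this example A can reach C but has no route to it, and C holds a route to D although no path exists.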
Results: Faults across 17 ASes
• Every AS had faults, regardless of network size
• Most faults can be attributed to distributed configuration
[Figure: faults per AS, grouped into Route Validity and Path Visibility]
Three Directions
• Configuration as programming
  - Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
• End-to-end routing
  - Exposing multiple paths to end systems
Prefixes are too coarse-grained
• Validity: if a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
• 70% of intra-AS failures are not visible in BGP [FABK 03]
…but they are also too fine-grained!
• ~70% of discontiguous prefix pairs from the same AS are announced from the same location
• Allocation explains about 60% of these cases:
  - Registries often allocate discontiguous address blocks to a single AS on the same day
• Routes for these prefixes will “flap” together
  - 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)
Route objects should correspond to an “atom” of hosts that share fate
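
A small check shows why the two example prefixes need separate routes today. The sketch uses Python's standard `ipaddress` module to confirm that the blocks neither overlap nor aggregate; the `atoms` table and the name `"atom-1"` are hypothetical, only there to show one route object standing for both.

```python
import ipaddress

# The two prefixes named on the slide.
agere = ipaddress.ip_network("135.36.0.0/16")
lucent = ipaddress.ip_network("135.12.0.0/14")

# collapse_addresses merges adjacent or overlapping networks where possible.
merged = list(ipaddress.collapse_addresses([agere, lucent]))

print("overlap:", agere.overlaps(lucent))
print("after aggregation:", [str(n) for n in merged])

# Hypothetical atom table: one routed object for prefixes that share fate.
atoms = {"atom-1": [agere, lucent]}
for name, prefixes in atoms.items():
    print(name, "covers", len(prefixes), "prefixes")
```

Because the blocks cannot be aggregated, prefix-based routing carries two routes that change together, where a single atom-level route would do.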
Proposal: Atomic Interdomain Protocol (AIP)
• Exterminate prefixes
• Name “atomic domains” (AD) directly
  - Addressing, forwarding and routing on ADs
  - Like current AS numbers, but finer-grained
  - Example: MIT, Microsoft Redmond, one PoP of a large ISP, …
• Flat AD IDs can carry cryptographic meaning
  - Self-certifying (hash of public key)
• End-system addresses have the form [AD : LocalID]
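
A minimal sketch of a self-certifying AD identifier follows. The slide says only that the ID is a hash of a public key; the choice of SHA-256, the truncation to 160 bits, the placeholder key bytes, and all function names are assumptions made for illustration.

```python
import hashlib


def ad_id(public_key: bytes) -> str:
    """Derive a flat AD identifier as a hash of the domain's public key.

    Assumption: SHA-256 truncated to 160 bits (40 hex digits).
    """
    return hashlib.sha256(public_key).hexdigest()[:40]


def make_address(public_key: bytes, local_id: str) -> str:
    """Format an end-system address in the [AD : LocalID] form."""
    return f"[{ad_id(public_key)} : {local_id}]"


def verify(claimed_ad: str, presented_key: bytes) -> bool:
    """Check that a presented public key really hashes to the claimed AD."""
    return ad_id(presented_key) == claimed_ad


# Placeholder key material; a real deployment would use actual public keys.
key = b"example public key bytes for one atomic domain"
address = make_address(key, "host-17")

print(address)
print("genuine key accepted:", verify(ad_id(key), key))
print("forged key accepted:", verify(ad_id(key), b"some other key"))
```

The point of the design is that anyone can check a domain's claim to an AD name from the name itself, without consulting an external registry.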
Summary
• It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability
• It’s possible to get four or five nines of Internet availability, if we:
  - Develop tools to cope with configuration complexity
  - Develop a failure-atomic routing system
  - Expose multiple IP-layer paths to higher layers