NetPilot: Automating Datacenter Network Failure Mitigation
Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang
Presented by: Chen Li

Failures are Common and Harmful
• Network failures are common
[Figure: datacenter network with 10,000+ switches]

Failures are Common and Harmful
• Network failures are common
• Failures cause long down times
– 25% of failures take 13+ hours to repair
[Figure: time from detection to repair (minutes), from six-month logs of production datacenters]

Failures are Common and Harmful
• Failures are common due to VERY large datacenters
• Failures cause long down times
• Long failure duration -> large revenue loss

How to Shorten Failure Recovery Time?

Previous Work
• Conventional failure recovery takes 3 steps: Detection, Diagnosis, Repair
• Failure localization/diagnosis
– [M. K. Aguilera, SOSP’03] [M. Y. Chen, NSDI’04] [R. R. Kompella, NSDI’05] [P. Bahl, SIGCOMM’07] [S. Kandula, SIGCOMM’09] …
[Figure labels: passive, ping, active]

Automating Failure Diagnosis is Challenging
• Root causes are deep in the network stack
• Diagnosis involves multiple parties

Category              Failure types                  Diagnosis & Repair     %
Software (21%)        Link layer loop                Find and fix bugs      19%
                      Imbalance-triggered overload   Find and fix bugs      2%
Hardware (18%)        FCS error                      Replace cable          13%
                      Unstable power                 Repair power           5%
Unknown (23%)         Switch stops forwarding        N/A                    9%
                      Imbalance-triggered overload   N/A                    7%
                      Lost configuration             N/A                    5%
                      High CPU utilization           N/A                    2%
Configuration (38%)   Errors on multiple switches    Update configuration   32%
                      Errors on one switch           Update configuration   6%

Failure Diagnosis Requires Human Intervention!
1. Root causes are deep in the network stack
2. Diagnosis involves multiple parties
• Six-month failure logs from several production DCNs

Can we do something other than failure diagnosis?

NetPilot: Mitigating rather than Diagnosing Failures
• Mitigate failure symptoms ASAP, at the cost of reduced capacity
[Diagram: Detection, Automated Mitigation, Diagnosis, Repair]

NetPilot Benefits
• Short recovery time
• Small network disruption
• Low operation cost
[Diagram: Detection, Automated Mitigation, Diagnosis, Repair]

Failure Mitigation is Effective
• Most failures can be mitigated by simple actions
• Mitigation is feasible due to redundancy

Category              Failure types                  Mitigation          Repair                 %
Software (21%)        Link layer loop                Deactivate port     Find and fix bugs      19%
                      Imbalance-triggered overload   Restart switch      Find and fix bugs      2%
Hardware (18%)        FCS error                      Deactivate port     Replace cable          13%
                      Unstable power                 Deactivate switch   Repair power           5%
Unknown (23%)         Switch stops forwarding        Restart switch      N/A                    9%
                      Imbalance-triggered overload   Restart switch      N/A                    7%
                      Lost configuration             Restart switch      N/A                    5%
                      High CPU utilization           Restart switch      N/A                    2%
Configuration (38%)   Errors on multiple switches    n/a                 Update configuration   32%
                      Errors on single switch        Deactivate switch   Update configuration   6%

68% of failures can be mitigated by simple actions

Mitigation Made Possible by Redundancy
• Because of redundancy, deactivation is unlikely to partition or overload the network
[Figure: topology with Internet, CORE, AGG, and ToR layers]

Outline
• Automating failure diagnosis is challenging
• Failure mitigation is effective
• How to automate mitigation?
• NetPilot evaluations
• Conclusion

A Strawman NetPilot: Trial-and-error
[Flowchart: Network failure -> Localization -> Execute an action -> Failure mitigated? Yes: End. No: Roll back if necessary.]
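The strawman loop can be sketched in a few lines of Python. This is only an illustration of the flowchart: the function name `trial_and_error`, the callback parameters, and the toy ports `p1` to `p3` are all hypothetical and do not come from the slides.

```python
from typing import Callable, Iterable, Optional


def trial_and_error(
    candidate_actions: Iterable[str],
    execute: Callable[[str], None],
    roll_back: Callable[[str], None],
    failure_mitigated: Callable[[], bool],
) -> Optional[str]:
    """Try candidate actions one at a time until the failure symptom is gone."""
    for action in candidate_actions:
        execute(action)
        if failure_mitigated():
            return action      # "Yes" branch: end
        roll_back(action)      # "No" branch: undo, then try the next candidate
    return None                # no candidate action mitigated the failure


# Toy scenario (hypothetical): the symptom disappears only when port "p2" is deactivated.
deactivated = set()
log = []


def execute(action):
    deactivated.add(action)
    log.append(("execute", action))


def roll_back(action):
    deactivated.discard(action)
    log.append(("roll_back", action))


def failure_mitigated():
    return "p2" in deactivated


chosen = trial_and_error(["p1", "p2", "p3"], execute, roll_back, failure_mitigated)
print(chosen)  # p2
print(log)
```

In this toy run the wrong candidate `p1` is executed and rolled back before `p2` succeeds. With many candidates, each failed trial costs time and briefly perturbs the network.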

NetPilot: Challenges & Solutions
1. Challenge: Blind trial-and-error takes a long time
– Solution: Failure specific localization
[Flowchart: Network failure -> Localization -> Execute an action -> Failure mitigated? Yes: End. No: Roll back if necessary.]

NetPilot: Challenges & Solutions
2. Challenge: Actions may partition/overload the network
– Solution: Impact estimation
[Flowchart: Network failure -> Localization -> Estimate impact -> Execute an action -> Failure mitigated? Yes: End. No: Roll back if necessary.]

NetPilot: Challenges & Solutions
3. Challenge: Different actions have different side-effects
– Solution: Rank actions based on impact
[Flowchart: Network failure -> Localization -> Estimate impact -> Rank actions -> Execute an action -> Failure mitigated? Yes: End. No: Roll back if necessary.]

Failure Specific Localization
• Limited # of failure types
• Domain knowledge improves accuracy
Failure types
1. Link layer loop
2. Imbalance-triggered overload
3. FCS error
4. Unstable power
5. Switch stops forwarding
6. Imbalance-triggered overload
7. Lost configuration
8. High CPU utilization
9. Errors on multiple switches
10. Errors on single switch

Example: Frame Check Sequence (FCS) Errors
• 13% of all the failures
• Cut-through switching
– Forward frames before checksums are verified
• Increase application latency

Localizing FCS Errors
[Equation relating error frames seen on L to frames corrupted by other links that traverse L]
• x_L: link corruption rate
• # of variables = # of equations = # of links
• Corrupted links: x_L > 0
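One way to read this slide is as a linear system with one equation per link. The sketch below assumes that the error frames seen on L equal the frames corrupted by L itself plus the frames corrupted by other links that also traverse L, i.e. e_L = p_L * x_L + sum over k != L of m_kL * x_k. The symbols e_L (error frames seen on L), p_L (frames traversing L) and m_kL (frames traversing both k and L), the function names, and the toy counters are assumptions for illustration; only x_L appears on the slide.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def localize_fcs(frames_on, frames_shared, errors_seen, threshold=1e-9):
    """Return {link: corruption rate} for links whose solved x_L exceeds threshold."""
    links = sorted(frames_on)
    # Row L: p_L on the diagonal, m_kL (frames shared with link k) elsewhere.
    a = [
        [frames_on[l] if l == k else frames_shared.get(frozenset((l, k)), 0) for k in links]
        for l in links
    ]
    b = [errors_seen[l] for l in links]
    rates = dict(zip(links, solve_linear_system(a, b)))
    return {l: r for l, r in rates.items() if r > threshold}


# Toy counters (hypothetical): link A corrupts 1% of its frames; half of A's
# frames also traverse B, so B reports errors although B itself is healthy.
frames_on = {"A": 1000, "B": 1000, "C": 1000}
frames_shared = {frozenset(("A", "B")): 500, frozenset(("B", "C")): 500}
errors_seen = {"A": 10, "B": 5, "C": 0}

corrupted = localize_fcs(frames_on, frames_shared, errors_seen)
print(corrupted)
```

The point of the toy data is that a naive rule ("deactivate every link that reports errors") would also blame B, whereas solving for x_L attributes all errors to A.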

NetPilot Overview
[Flowchart: Network failure -> Localization -> Estimate impact -> Rank actions -> Execute an action -> Failure mitigated? Yes: End. No: Roll back if necessary.]

Impact Metrics
• Derived from Service Level Agreement (SLA)
– Availability: online_server_ratio
– Packet loss: total_lost_pkt
– Latency: max_link_utilization
• Small link utilization -> small (queuing) delay
• total_lost_pkt & max_link_utilization derived from utilization of individual links
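A minimal sketch of how the three metrics could be computed from per-link numbers follows. The metric names come from the slide; the function `impact_metrics`, its parameters, and especially the loss model (traffic in excess of a link's capacity counts as lost) are assumptions made for illustration, not the authors' definitions.

```python
def impact_metrics(load, capacity, servers_online, servers_total):
    """Compute the three SLA-derived metrics from per-link load and capacity.

    Assumption: load and capacity are in the same unit (e.g. packets per
    interval), and any load above capacity on a link is counted as lost.
    """
    utilization = {link: load[link] / capacity[link] for link in capacity}
    return {
        "online_server_ratio": servers_online / servers_total,
        "total_lost_pkt": sum(max(0.0, load[link] - capacity[link]) for link in capacity),
        "max_link_utilization": max(utilization.values()),
    }


# Toy numbers (hypothetical): link l1 is overloaded by 2 units.
metrics = impact_metrics(
    load={"l1": 12.0, "l2": 5.0},
    capacity={"l1": 10.0, "l2": 10.0},
    servers_online=95,
    servers_total=100,
)
print(metrics)
```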

Estimating Link Utilization
[Diagram: Action, Traffic, Topology -> Impact Estimator -> Link utilization]
• # of flows >> redundant paths
– Traffic evenly distributed under ECMP
• Estimate the load contributed by each flow on each link
• Sum up the loads to compute utilization
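The estimation steps above can be sketched as follows, assuming each flow's demand is split evenly over its equal-cost paths and that an action is modelled as a set of deactivated links. The function name `estimate_utilization`, the data layout, and the toy topology are hypothetical.

```python
from collections import defaultdict


def estimate_utilization(flows, capacity, deactivated=frozenset()):
    """Estimate per-link utilization after deactivating a set of links.

    flows: list of (demand, paths), where paths is a list of equal-cost
           paths and each path is a list of link names.
    capacity: {link: capacity}, in the same unit as demand.
    deactivated: set of link names removed by the mitigation action.
    """
    load = defaultdict(float)
    for demand, paths in flows:
        usable = [p for p in paths if not deactivated.intersection(p)]
        if not usable:
            continue  # the flow has no remaining path and carries no load
        share = demand / len(usable)  # even split under ECMP
        for path in usable:
            for link in path:
                load[link] += share
    return {
        link: load[link] / capacity[link]
        for link in capacity
        if link not in deactivated
    }


# Toy topology (hypothetical): one ToR with two uplinks of capacity 10 each.
capacity = {"tor-agg1": 10.0, "tor-agg2": 10.0}
flows = [(8.0, [["tor-agg1"], ["tor-agg2"]])]

before = estimate_utilization(flows, capacity)
after = estimate_utilization(flows, capacity, deactivated={"tor-agg1"})
print(before)
print(after)
```

In the toy case, deactivating one uplink moves its share onto the remaining uplink, doubling that link's estimated utilization without overloading it.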

Link Utilization Estimation is Highly Accurate
• 1-month traffic from an 8000-server network
– Log socket events on each server
• Ground truth: SNMP counters

NetPilot Overview
• Choose the action with the least impact
[Flowchart: Network failure -> Localization -> Estimate impact -> Rank actions -> Execute an action -> Failure mitigated? Yes: End. No: Roll back if necessary.]
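Choosing the action with the least impact could look like the sketch below. The slides do not say how the three metrics are combined, so the lexicographic order used here (availability first, then packet loss, then maximum link utilization) is an assumption, as are the function name `rank_actions` and the example actions.

```python
def rank_actions(impacts):
    """Return action names ordered from least to most estimated impact.

    impacts: {action: {"online_server_ratio": float,
                       "total_lost_pkt": float,
                       "max_link_utilization": float}}
    Assumed ordering: higher availability, then fewer lost packets,
    then lower maximum link utilization.
    """
    def key(item):
        _, m = item
        return (-m["online_server_ratio"], m["total_lost_pkt"], m["max_link_utilization"])

    return [action for action, _ in sorted(impacts.items(), key=key)]


# Hypothetical candidate actions with already-estimated impact.
impacts = {
    "deactivate link core1-agg": {
        "online_server_ratio": 1.0, "total_lost_pkt": 0.0, "max_link_utilization": 0.9,
    },
    "deactivate link core2-agg": {
        "online_server_ratio": 1.0, "total_lost_pkt": 0.0, "max_link_utilization": 0.6,
    },
    "restart agg": {
        "online_server_ratio": 0.5, "total_lost_pkt": 0.0, "max_link_utilization": 0.3,
    },
}

ranked = rank_actions(impacts)
print(ranked[0])  # deactivate link core2-agg
```

The first entry of `ranked` is the action to execute first; if it does not mitigate the failure, the loop rolls back and moves to the next entry.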

Outline
• Automating failure diagnosis is challenging
• Failure mitigation is effective
• How to automate mitigation?
– Localization -> impact estimation -> ranking
• NetPilot evaluations
– Mitigating load imbalance
– Mitigating FCS errors
– Mitigating overload
• Conclusion

Load Imbalance
• agg_a stops receiving traffic
• Localize to 4 suspects
[Figure: core_a, core_b, agg_a, agg_b]

Mitigating Load Imbalance
[Figure: traffic (billions) versus time (minutes) on four links: core_a->agg_a, core_b->agg_a, core_a->agg_b, core_b->agg_b. Annotations: agg_a stops receiving traffic; detected & reboot core_b; reboot core_a; reboot agg_a; load evenly split; mitigation confirmed.]

Fast FCS Error Mitigation
• Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated
• NetPilot: deactivates 2 links in 1 trial within 15 minutes

Mitigating Link Overload
• Mitigate overload by deactivating healthy links
[Figure: example topology with core 1, core 2, and agg; link load labels not recoverable]

Mitigating Link Overload
• Mitigate overload by deactivating healthy links
– Many candidate links in production networks
– Choose the link(s) with the least impact
[Figure: example topology with core 1, core 2, and agg, comparing link loads and lost traffic for candidate links; values not recoverable]

Action Ranking Lowers Link Utilization
• Replay 97 overload incidents due to link failures

Conclusion
• Mitigation shortens failure recovery time
– Simple actions are effective
– Made possible by redundancy
• NetPilot: automating failure mitigation
– Recovery time: hours -> minutes
– Several mitigation scenarios deployed in Bing

Thank You!
[Diagram: Detection, NetPilot: Automated Mitigation, Diagnosis, Repair]
netpilot@microsoft.com