Understanding Network Failures in Data Centers
Michael Over
Questions to be Answered
• Which devices/links are most unreliable?
• What causes failures?
• How do failures impact network traffic?
• How effective is network redundancy?
• These questions are answered using multiple data sources commonly collected by network operators.
Purpose of Study
• Demand for dynamic scaling and the benefits of economies of scale are driving the creation of mega data centers.
• Data center networks need to be scalable, efficient, fault tolerant, and easy to manage.
• The issue of network reliability, however, has not been well studied.
• In this paper, reliability is studied “by analyzing network error logs collected over a year from thousands of network devices across tens of geographically distributed data centers.”
Goals of the Study
• Characterize network failure patterns in data centers and understand the overall reliability of the network
• Leverage lessons learned from this study to guide the design of future data centers
Network Reliability
• Network reliability is studied along three dimensions:
  ◦ Characterizing the most failure-prone network elements
    - Those that fail with high frequency or that incur high downtime
  ◦ Estimating the impact of failures
    - Correlate event logs with recent network traffic observed on links involved in the event
  ◦ Analyzing the effectiveness of network redundancy
    - Compare traffic on a per-link basis during failure events to traffic across all links in the network redundancy group where the failure occurred
Data Sources
• Multiple monitoring tools are put in place by network operators.
• Static view
  ◦ Router configuration files
  ◦ Device procurement data
• Dynamic view
  ◦ SNMP polling
  ◦ Syslog
  ◦ Trouble tickets
Difficulties with Data Sources
• Logs track low-level network events and do not necessarily imply application performance impact or service outage
• Failures that potentially impact network connectivity must be separated from high-volume, noisy network logs
• Analyzing the effectiveness of network redundancy requires correlating multiple data sources across redundant devices and links
Key Observations of Study
• Data center networks show high reliability
  ◦ More than four 9s of availability for 80% of the links and 60% of the devices (see the availability sketch after this list)
• Low-cost, commodity switches such as ToRs and AggS are highly reliable
  ◦ Top-of-Rack switches (ToRs) and aggregation switches (AggS) exhibit the highest reliability
• Load balancers dominate in terms of failure occurrences, with many short-lived, software-related faults
  ◦ 1 in 5 load balancers exhibited a failure
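A minimal Python sketch of the "four 9s" availability calculation, assuming per-link downtime totals over a one-year measurement window; the link names and downtime values are illustrative, not data from the study:

    # Compute per-link availability from total logged downtime over an
    # observation window, and count links meeting "four 9s" (99.99% uptime).
    OBSERVATION_SECONDS = 365 * 24 * 3600  # one-year window (assumption)

    def availability(downtime_seconds: float) -> float:
        """Fraction of the observation window the link was up."""
        return 1.0 - downtime_seconds / OBSERVATION_SECONDS

    # Hypothetical per-link downtime totals (seconds over the year).
    link_downtime = {"link-a": 120.0, "link-b": 4000.0, "link-c": 90000.0}

    four_nines = [l for l, d in link_downtime.items() if availability(d) >= 0.9999]
    print(f"{len(four_nines)}/{len(link_downtime)} links meet four 9s")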
Key Observations of Study
• Failures have the potential to cause loss of many small packets, such as keep-alive messages and ACKs
  ◦ Most failures lose a large number of packets relative to the number of lost bytes
• Network redundancy is only 40% effective in reducing the median impact of failures
  ◦ Ideally, network redundancy should completely mask all failures from applications
Limitations of Study
• Best effort: possible missed events or multiply-logged events
• Data were cleaned, but some events may still be lost due to software faults or disconnections
• Human bias may arise in failure annotations
• Network errors do not always impact network traffic or service availability
• Thus, the failure rates in this study should not be interpreted as all necessarily impacting applications
Background
Network Composition
• ToRs are the most prevalent device type in the network, comprising about 75% of devices
• Load balancers are the next most prevalent, at approximately 10% of devices
• The remaining 15% are AggS, Core, and AccR devices
• Despite ToRs being highly reliable, they account for a large amount of downtime
• LBs account for few devices but are extremely failure prone, making them a leading contributor of failures
Workload Characteristics
• A large volume of short-lived, latency-sensitive “mice” flows
• A few long-lived, throughput-sensitive “elephant” flows
• Utilization rates are higher at upper layers of the topology as a result of aggregation and high bandwidth oversubscription
Methodology & Data Sets
• Network event logs (SNMP/syslog)
  ◦ Operators filter the logs and produce a smaller set of actionable events, which are assigned to NOC tickets
• NOC tickets
  ◦ Operators employ a ticketing system to track the resolution of issues
• Network traffic data
  ◦ Five-minute averages of bytes/packets into and out of each network interface (see the rate sketch after this list)
• Network topology data
  ◦ Static snapshot of the network
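A minimal sketch of how five-minute average rates can be derived from cumulative SNMP interface counters; the sampling interval handling and input format are assumptions for illustration, not the study's tooling:

    # Convert consecutive samples of a cumulative byte counter into
    # five-minute average rates (bytes/second per interval).
    from typing import List, Tuple

    def five_minute_rates(samples: List[Tuple[float, int]]) -> List[float]:
        """samples: (unix_timestamp, cumulative_byte_counter) pairs, ~300 s apart."""
        rates = []
        for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
            if c1 < c0:      # counter wrapped or device rebooted; skip interval
                continue
            rates.append((c1 - c0) / (t1 - t0))
        return rates

    # Example: three samples five minutes apart.
    print(five_minute_rates([(0, 0), (300, 3_000_000), (600, 4_500_000)]))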
Defining and Identifying Failures
• Network devices can send multiple notifications even though a link is operational
• The authors monitor all logged “down” events for devices and links, leading to two types of failures:
  ◦ Link failures – the connection between two devices is down
  ◦ Device failures – the device is not functioning for routing/forwarding traffic
• Multiple components may send notifications related to a single high-level failure or correlated event
• Failure events are correlated with network traffic logs to filter for failures with impact that potentially result in loss of traffic
Cleaning the Data
• A single link or device may experience multiple “down” events simultaneously
  ◦ These are grouped together
• An element may experience another “down” event before the previous event has been resolved
  ◦ These are also grouped together (see the merge sketch after this list)
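A minimal sketch of this grouping step: overlapping or still-unresolved “down” events on the same element are merged into a single failure. The event representation is an assumption for illustration:

    # Merge overlapping "down" events for one link or device into failures.
    from typing import List, Tuple

    def merge_down_events(events: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
        """events: (start, end) times for a single element. Returns merged failures."""
        merged = []
        for start, end in sorted(events):
            if merged and start <= merged[-1][1]:   # overlaps the previous failure
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    # Two overlapping "down" events plus one later event -> two failures.
    print(merge_down_events([(0, 60), (30, 120), (500, 560)]))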
Identifying Failures with Impact
• Goal: identify failures with impact without access to application monitoring logs
• Application impact, such as throughput loss or increased response times, cannot be quantified exactly
  ◦ Therefore, the impact of failures on network traffic is estimated
• Each link failure is correlated with the traffic observed on the link in the recent past, before the time of the failure
  ◦ If traffic during the failure is lower than before it, the failure is considered to have impact (see the sketch below)
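A minimal sketch of that impact test, comparing median traffic on the failed link during the failure window with median traffic shortly before it; the lookback window and sample format are assumptions for illustration:

    # Flag a link failure as "with impact" if median traffic drops during it.
    from statistics import median
    from typing import List, Tuple

    def has_impact(traffic: List[Tuple[float, float]], fail_start: float,
                   fail_end: float, lookback: float = 3600.0) -> bool:
        """traffic: (timestamp, bytes_per_sec) samples for the link."""
        before = [v for t, v in traffic if fail_start - lookback <= t < fail_start]
        during = [v for t, v in traffic if fail_start <= t <= fail_end]
        if not before or not during:
            return False                  # not enough data to judge impact
        return median(during) < median(before)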
Identifying Failures with Impact
• For device failures, additional steps are taken to filter spurious messages
• If a device is down, neighboring devices connected to it will observe failures on the interconnecting links
• Verify that at least one link failure with impact has been noted for links incident on the device
• This significantly reduces the number of device failures observed
Link Failure Analysis – All Failures
Link Failure Analysis – Failures with Impact
Failure Analysis
• Links experience about an order of magnitude more failures than devices
• Link failures are variable and bursty
• Device failures are usually caused by maintenance
Probability of Failure
• Top-of-Rack switches (ToRs) have the lowest failure rates
• Load balancers (LBs) have the highest failure rates
Aggregate Impact of Failures – Devices
Properties of Failures
Grouping Link Failures
• To correlate multiple link failures (see the grouping sketch after this list):
  ◦ The link failures must occur in the same data center
  ◦ The failures must occur within some predefined time threshold
• Link failures were observed to be mostly isolated
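A minimal sketch of this correlation step: link failures in the same data center whose start times fall within a time threshold form one correlated group. The threshold value and record fields are assumptions for illustration:

    # Group link failures by data center and start-time proximity.
    from collections import defaultdict
    from typing import Dict, List, Tuple

    THRESHOLD = 60.0  # seconds; hypothetical grouping window

    def group_link_failures(failures: List[Tuple[str, str, float]]) -> List[List[str]]:
        """failures: (datacenter, link_id, start_time). Returns groups of link ids."""
        by_dc: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
        for dc, link, start in failures:
            by_dc[dc].append((start, link))

        groups = []
        for dc, events in by_dc.items():
            events.sort()
            current = [events[0][1]]
            for (prev_t, _), (t, link) in zip(events, events[1:]):
                if t - prev_t <= THRESHOLD:
                    current.append(link)
                else:
                    groups.append(current)
                    current = [link]
            groups.append(current)
        return groups

    print(group_link_failures([("dc1", "l1", 0), ("dc1", "l2", 30), ("dc1", "l3", 500)]))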
Root Causes of Failures
Estimating Failure Impact
• In the absence of application performance data, the authors estimate the amount of traffic that would have been routed on a failed link had it been available for the duration of the failure
• The amount of data potentially lost during a failure event is estimated as (see the sketch after this list):
  ◦ loss = (med_b − med_d) × duration, where med_b is the median traffic on the link before the failure and med_d is the median traffic during it
• Link failures incur the loss of many packets, but relatively few bytes
  ◦ This suggests the packets lost during failures are mostly keep-alive packets used by applications
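A minimal sketch of the loss estimate above; the lookback window, sample format, and units (bytes/second) are assumptions for illustration:

    # Estimate bytes that would have been carried on the failed link:
    # loss = (median traffic before failure - median traffic during failure) * duration
    from statistics import median
    from typing import List, Tuple

    def estimated_loss(traffic: List[Tuple[float, float]], fail_start: float,
                       fail_end: float, lookback: float = 3600.0) -> float:
        """traffic: (timestamp, bytes_per_sec) samples for the failed link."""
        before = [v for t, v in traffic if fail_start - lookback <= t < fail_start]
        during = [v for t, v in traffic if fail_start <= t <= fail_end]
        med_b = median(before) if before else 0.0
        med_d = median(during) if during else 0.0
        return max(0.0, (med_b - med_d) * (fail_end - fail_start))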
Is Redundancy Effective?
• There are several reasons why redundancy may not be 100% effective:
  ◦ Bugs in fail-over mechanisms can arise if there is uncertainty about which link or component is the backup
  ◦ If the redundant components are not configured correctly, they will not be able to re-route traffic away from the failed component
  ◦ Protocol issues such as TCP backoff, timeouts, and spanning tree reconfigurations may result in loss of traffic
Redundancy at Different Layers
• Links highest in the topology benefit most from redundancy
  ◦ A reliable network core is critical to traffic flow
  ◦ Redundancy is effective at reducing failure impact there
• Links from ToRs to aggregation switches benefit the least from redundancy
  ◦ However, on a per-link basis, these links do not experience significant impact from failures, so there is less room for redundancy to benefit them
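A minimal sketch of the normalized-traffic comparison described in the study's methodology: traffic during the failure divided by traffic before it, computed on the failed link alone and across all links in its redundancy group. The input format is an assumption for illustration:

    # Compare normalized traffic on the failed link vs. its redundancy group.
    from typing import Dict, Tuple

    def normalized_traffic(before: float, during: float) -> float:
        """Fraction of pre-failure traffic still carried during the failure."""
        return during / before if before > 0 else 0.0

    def redundancy_comparison(per_link: Tuple[float, float],
                              group: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
        """per_link: (before, during) on the failed link.
        group: {link_id: (before, during)} for every link in the redundancy group.
        Returns (normalized traffic on the failed link, normalized traffic across the group)."""
        link_ratio = normalized_traffic(*per_link)
        group_before = sum(b for b, _ in group.values())
        group_during = sum(d for _, d in group.values())
        return link_ratio, normalized_traffic(group_before, group_during)

    # If the group ratio stays near 1.0 while the failed link's ratio drops,
    # redundancy masked most of the failure's impact.
    print(redundancy_comparison((100.0, 10.0), {"l1": (100.0, 10.0), "l2": (100.0, 185.0)}))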
Discussion
• Low-end switches exhibit high reliability
• Improve the reliability of middleboxes
• Improve the effectiveness of network redundancy
Related Work
• Application failures
  ◦ NetMedic aims to diagnose application failures in enterprise networks
• Network failures
  ◦ Prior studies also observed that the majority of failures in data centers are isolated
• Failures in cloud computing
  ◦ Increased focus on understanding component failures
Conclusions
• Large-scale analysis of network failure events in data centers
• Characterize failures of network links and devices
• Estimate failure impact
• Analyze the effectiveness of network redundancy in masking failures
• Methodology: correlating network traffic logs with logs of actionable events to filter spurious notifications
Conclusions
• Commodity switches exhibit high reliability
• Middleboxes need to be better managed
• The effectiveness of redundancy at the network and application layers needs further investigation
Future Work
• This study considered the occurrence of interface-level failures, which is only one aspect of reliability in data center networks
• Future: correlate logs from application-level monitors
• Understand what fraction of application failures can be attributed to network failures
Questions?