Why does the Cloud stop computing? Lessons from hundreds of service outages
Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar
COS @ SoCC '16, Oct '16
Outages / Bugs
- 2 years ago @ SoCC '14: study of bugs in datacenter distributed systems (Hadoop, …
Public reports!
- Headline news and post-mortem reports
  - Providers’ transparency
  - Untapped information
- Pros/cons
  + Detailed root causes
  + Detailed chain of failures
  + Downtime durations
  + Zero false positive
  -- (Very) incomplete
COS: Cloud Outage Study
- 32 services
- 597 outages
  - Between 2009 and 2015
  - ~70% report downtimes
  - ~60% report root causes
Downtime/year
- On average
  - 6% of services do not reach 99% availability (>88 hours)
  - 78% do not reach 99.9% (>8.8 hours)
- Worst year
  - 31% do not reach 99%
  - 81% do not reach 99.9%
- 5-nine availability?
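
The hour thresholds on this slide follow from simple arithmetic on an availability target. A minimal sketch, assuming a non-leap year of 8,760 hours (the function name is illustrative, not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def max_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime budget, in hours, for an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.999):
        hours = max_downtime_hours(pct)
        print(f"{pct}% availability allows {hours:.3f} hours/year "
              f"({hours * 60:.1f} minutes)")
```

Under that assumption, 99% allows roughly 88 hours per year and 99.9% roughly 8.8 hours, matching the slide, while five nines leaves only a little over five minutes per year.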
Root causes (sorted by count)
Interesting Root Causes
- Upgrade
  - Involves multiple layers
    - “a code push behaved differently in widespread use than it had during testing”
  - To understand/reproduce, need the full ecosystem
Interesting Root Causes
- Human mistakes
  - Rare now (vs. 10 years ago)
- Config/Upgrade software bugs
  - Bugs in automation process
  - Similar issues?
    - But root cause origins are different
Config vs. Upgrade Research
- Upgrade #1, need more research?
- Paper count in last few years
- Challenges:
  - Multi-layer
    - Full ecosystem needed
    - Multi-year?
  - Reproducible bugs from industry (benchmarks)?

Conference   Config papers   Upgrade papers
ASPLOS       0               1
ATC          6               2
DSN          8               2
EuroSys      3               2
NSDI         3               0
OSDI         4               0
SOSP         3               1
…
Total        27              8
Interesting Root Causes
- Bugs
  - What types of bugs lead to outages? Why are they not masked?
  - (pls. see paper)
  - “Cascading” bugs
- “DynamoDB Storage servers query the metadata service for their membership”
- “But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
- “As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
[Diagram labels: Storage servers, Remove self, Timeout, Busy, Metadata service]
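
The cascade quoted on this slide can be sketched as a toy model. This is only an illustration under stated assumptions, not DynamoDB's implementation: the class, the `simulate` helper, and the 100 ms timeout are all hypothetical. It shows why a single slow dependency can take out every replica at once when they all share the same timeout.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    timeout_ms: int        # hypothetical retrieval time the server allows
    serving: bool = True

    def refresh_membership(self, response_ms: int) -> None:
        # If the metadata service answers too slowly, the server cannot
        # confirm its membership and removes itself from taking requests.
        if response_ms > self.timeout_ms:
            self.serving = False


def simulate(response_ms: int, n_servers: int = 5, timeout_ms: int = 100) -> int:
    """Return how many servers still serve after one membership refresh."""
    servers = [StorageServer(f"s{i}", timeout_ms) for i in range(n_servers)]
    for server in servers:
        server.refresh_membership(response_ms)
    return sum(server.serving for server in servers)


if __name__ == "__main__":
    print("healthy metadata service:", simulate(response_ms=50))
    print("busy metadata service:   ", simulate(response_ms=150))
```

In this model the redundancy among storage servers does not help, because the failure condition (a slow metadata response) is common to all of them.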
- “Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
- “data collection servers … had a failure”
- “this inability to contact a data collection server triggered a latent memory leak bug in the storage servers …”
- “EBS servers continued trying in a way that slowly consumed system memory”
[Diagram labels: EBS storage servers, Memory leak, Failure, Data collection servers]
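
The report does not say how the leak worked, so the following is a purely hypothetical sketch of the general pattern: a failure path that is rarely exercised holds on to resources that the success path releases. The `Reporter` class and the 1 KiB payload size are assumptions made for illustration.

```python
class Reporter:
    """Toy stand-in for a server that reports to a collection service."""

    def __init__(self) -> None:
        self.pending: list[bytearray] = []  # memory held across attempts

    def report(self, collector_up: bool) -> None:
        # Each attempt allocates a buffer for the report (hypothetical size).
        self.pending.append(bytearray(1024))
        if collector_up:
            # Success path: everything queued so far is sent and released.
            self.pending.clear()
        # Failure path: nothing is released, so repeated retries accumulate.

    def held_bytes(self) -> int:
        return sum(len(buf) for buf in self.pending)


if __name__ == "__main__":
    reporter = Reporter()
    for _ in range(1000):
        reporter.report(collector_up=False)
    print("bytes held after 1000 failed attempts:", reporter.held_bytes())
    reporter.report(collector_up=True)
    print("bytes held after one success:", reporter.held_bytes())
```

A bug of this shape stays latent while the dependency is healthy and only surfaces when the dependency fails, which is consistent with the "latent memory leak" wording in the quote.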
(more in the paper)
Where is the SPOF?
- Redundancies, redundancies!
- Yes, we did that
- So, why do outages still happen?
Failure recovery chain
- Failure Detection
- Failover
- Backups
Imperfect failure recovery chain
- Incomplete Failure Detection
- Failover that Fails
- Backups that also Fail
Imperfect failure recovery chain
- Incomplete error/failure detection
  - Undetected (specific type of) memory leaks
  - Load spikes of authentication requests
  - “an unexpected hardware behavior”
Imperfect failure recovery chain
- Failover/recovery that fails
  - Bad PLC fails to activate backup power generators
  - Failed network switch failover
  - DC failover fails due to cold cache problems
  - Recovery/re-mirroring storm
Imperfect failure recovery chain
- Multiple failures!
  - Double failures of power, network, storage or server components
  - Diverse failures: network+server; storage+fibre cut
- Cascading bugs …
  - … that caused many/all redundancies to fail
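
One way to read the recovery chain is that a failure is masked only if detection, failover, and the backup all work. A rough sketch of that reasoning, assuming the three stages succeed independently (an assumption that the cascading bugs above violate) and using made-up success rates rather than figures from the study:

```python
import math
from typing import Sequence


def masking_probability(stage_success: Sequence[float]) -> float:
    """Probability that every stage of the recovery chain works.

    Assumes the stages succeed independently of one another.
    """
    return math.prod(stage_success)


if __name__ == "__main__":
    # Hypothetical per-stage success rates: detection, failover, backup.
    chain = [0.99, 0.99, 0.99]
    p = masking_probability(chain)
    print(f"failure masked with probability {p:.6f}")
    print(f"outage despite redundancy with probability {1 - p:.6f}")
```

Even with optimistic per-stage numbers, the chain is weaker than any single stage, and correlated failures make the real picture worse than this independent model suggests.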
COS Database: ?
- Email us / Check our website
- More correlations between …
  - Root cause & downtime
  - Service maturity & downtime
  - Root cause & impacts
  - Root cause & fixes
  - Etc.
Conclusion
- Features and failures are racing with each other
- “Biggest/worst cloud outages of 20YY”: a new year’s tradition
- Hope COS tells the cause
  - Many more examples/details in the papers
Thank you! Questions?
ucare.cs.uchicago.edu
ceres.cs.uchicago.edu
EXTRA
- Manually extract outage “metadata”
- Classifications:
A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered an outage.
#Outages/year
- On average
  - 1/3 of the services: at least 3 unplanned outages per year
- Worst year (between ’09 and ’14)
  - ½ of the services: at least 4 unplanned outages per year
Downtime by root cause (sorted by median downtime)
Maturity helps?
- Does service maturity help?
- Based on outage count:
  - In 2014, 24 outages occurred from 9-yr old services
Maturity helps?
- Based on downtime:
  - In 2014, 267 hours of downtime from 17-yr old services
- More mature, more popular, more users, more complex
Interesting Root Causes
- Load
  - Spikes of non-monitored requests
    - User requests (monitored)
    - Database index accesses
    - Authentication requests (cryptographic consumption)
  - Misconfiguration
    - Ex: traffic redirection
    - Take-away: be careful with traffic-related code/configs
  - Recovery feedback loop
Interesting Root Causes
- Cross (dependencies)
  - Amazon Web Services
    - Airbnb, Bitbucket, Dropbox, Foursquare, Github, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
  - Azure
    - Xbox Live and “52 other services”
  - Google DC (co-location)
    - Google Gmail, Search, Drive, Youtube (40% drop of internet traffic for 5 mins)
Studies of failures, enough?
Studies of failures, enough?
- Most study only a few services (data behind company walls)
- Not all report “d”owntimes