Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory
Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, Onur Mutlu

Executive Summary
• Problem: Reliable memory hardware increases cost
• Our Goal: Reduce datacenter cost; meet availability target
• Observation: Data-intensive applications' data exhibit a diverse spectrum of tolerance to memory errors
  ‐ Across applications and within an application
  ‐ We characterized 3 modern data-intensive applications
• Our Proposal: Heterogeneous-reliability memory (HRM)
  ‐ Store error-tolerant data in less-reliable lower-cost memory
  ‐ Store error-vulnerable data in more-reliable memory
• Major results:
  ‐ Reduce server hardware cost by 4.7%
  ‐ Achieve single server availability target of 99.90%

Outline
• Motivation
• Characterizing application memory error tolerance
• Key observations
  ‐ Observation 1: Memory error tolerance varies across applications and within an application
  ‐ Observation 2: Data can be recovered by software
• Heterogeneous-Reliability Memory (HRM)
• Evaluation

Server Memory Cost is High
• Server hardware cost dominates datacenter Total Cost of Ownership (TCO) [Barroso '09]
• As server memory capacity grows, memory cost becomes the most important component of server hardware costs [Kozyrakis '10]
128 GB memory cost: ~$140 (per 16 GB) × 8 = ~$1120 *
2 CPUs cost: ~$500 (per CPU) × 2 = ~$1000 *
* Numbers in the year of 2014

Memory Reliability is Important
• System/app hang or slowdown
• System/app crash
• Silent data corruption or incorrect app output

Existing Error Mitigation Techniques (I)
• Quality assurance tests increase manufacturing cost
[Figure: testing cost/memory cost (%) versus DRAM chip capacity, from 1 Mb to 4 Gb [DocMemory '00]; the later points are predicted as a trend]
Memory testing cost can be a significant fraction of memory cost as memory capacity grows

Existing Error Mitigation Techniques (II)
• Error detection and correction increases system cost

Technique   Detection   Correction   Added capacity   Added logic
NoECC       N/A         N/A          0.00%            No
Parity      1 bit       N/A          1.56%            Low
SEC-DED     2 bit       1 bit        12.5%            Low
Chipkill    2 chip      1 chip       12.5%            High
(rows ordered by increasing strength)

Stronger error protection techniques have higher cost
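The added-capacity column can be reproduced with simple arithmetic. The sketch below assumes 64-bit data words with 1 parity bit (Parity) or 8 check bits (SEC-DED, a 72/64 layout); these word sizes are common conventions, not something the slide states, and `added_capacity` is a hypothetical helper name.

```python
def added_capacity(data_bits: int, check_bits: int) -> float:
    """Extra storage needed, as a percentage of the data capacity."""
    return 100.0 * check_bits / data_bits

# Assumed word layouts (not given on the slide): 64 data bits per word.
SCHEMES = {
    "NoECC": (64, 0),    # no check bits at all
    "Parity": (64, 1),   # one parity bit per 64-bit word
    "SEC-DED": (64, 8),  # eight check bits per 64-bit word
}

overheads = {name: added_capacity(d, c) for name, (d, c) in SCHEMES.items()}

for name, pct in overheads.items():
    print(f"{name:8s} {pct:.2f}%")
```

Under these assumptions the Parity and SEC-DED rows come out at 1/64 and 8/64 of the data capacity, which is consistent with the 1.56% and 12.5% shown in the table.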

Shortcomings of Existing Approaches
• Uniformly improve memory reliability
  ‐ Observation 1: Memory error tolerance varies across applications and within an application
• Rely on hardware-level techniques
  ‐ Observation 2: Once a memory error is detected, most corrupted data can be recovered by software
Goal: Design a new cost-efficient memory system that flexibly matches memory reliability with application memory error tolerance

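Characterization Goal
Quantify application memory error tolerance
[Figure: memory error outcomes]
• Masked by Overwrite: a store (x = …000…) overwrites the corrupted data before it is loaded
• Masked by Logic: the corrupted x (…110… becomes …010…) is loaded, but a check such as if (x != 0) still yields the correct result
• Consumed by Application: the corrupted value is used directly (return x; or *x;), leading to an incorrect result
Outcomes: Correct Result; Incorrect Response; System/App Crash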
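The three paths in the outcome diagram can be illustrated with a toy model. This is only a sketch: `flip_bit` and the three `outcome_*` functions are hypothetical names, and a Python integer stands in for a memory word.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model a single-bit memory error by inverting one bit of a word."""
    return value ^ (1 << bit)

def outcome_overwrite(x: int, bit: int) -> str:
    x = flip_bit(x, bit)  # the error lands in memory
    x = 0                 # a store overwrites the word before any load
    return "masked by overwrite" if x == 0 else "consumed"

def outcome_logic(x: int, bit: int) -> str:
    before = (x != 0)               # what the check sees without the error
    after = (flip_bit(x, bit) != 0) # what the check sees with the error
    return "masked by logic" if before == after else "consumed"

def outcome_return(x: int, bit: int) -> str:
    # The corrupted word itself is returned, so the caller sees a wrong value.
    return "consumed" if flip_bit(x, bit) != x else "masked"

# The slide's example: ...110... corrupted into ...010... is still non-zero.
print(outcome_logic(0b110, 2))
print(outcome_overwrite(0b110, 2))
print(outcome_return(0b110, 2))
```

The same flipped bit is harmless in the first two cases and harmful in the third, which is why tolerance has to be measured per application and per memory region rather than assumed.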

Characterization Methodology
• 3 modern data-intensive applications

Application   Memory footprint
WebSearch     46 GB
Memcached     35 GB
GraphLab      4 GB

• 3 dominant memory regions
  ‐ Heap: dynamically allocated data
  ‐ Stack: function parameters and local variables
  ‐ Private: private heap managed by user
• Injected a total of 23,718 memory errors using software debuggers (WinDbg and GDB)
• Examined correctness for over 4 billion queries

Observation 1: Memory Error Tolerance Varies Across Applications
[Figure: System/Application Crash. Probability of crash (%) for WebSearch, Memcached, and GraphLab; >10× difference]
Showing results for single-bit soft errors. Results for other memory error types can be found in the paper, with similar conclusions.

Observation 1: Memory Error Tolerance Varies Across Applications
[Figure: Incorrect Responses. # incorrect/billion queries (log scale) for WebSearch, Memcached, and GraphLab; >10^5× difference]
Showing results for single-bit soft errors. Results for other memory error types can be found in the paper.

Observation 1: Memory Error Tolerance Varies Across Applications and Within an Application
[Figure: System/Application Crash. Probability of crash (%) for the Private, Heap, and Stack regions; >4× difference]
Showing results for WebSearch. Results for other workloads can be found in the paper.

Observation 1: Memory Error Tolerance Varies Across Applications and Within an Application
[Figure: Incorrect Responses. # incorrect/billion queries (log scale) for the Private, Heap, and Stack regions; annotation: all averaged at a very low rate; value label shown: 15]
Showing results for WebSearch. Results for other workloads can be found in the paper.

Observation 2: Data Can be Recovered by Software Implicitly and Explicitly
• Implicitly recoverable: the application intrinsically has a clean copy of the data on disk
• Explicitly recoverable: the application can create a copy of the data at a low cost (if the data has very low write frequency)
[Figure: WebSearch recoverability. Percentage of implicitly recoverable and explicitly recoverable data for Private, Heap, Stack, and Overall; value labels shown: 88%, 82%, 63%, 59%, 56%, 28%, 1%, 16%]
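The two recoverability definitions can be read as a simple decision rule. The sketch below is illustrative only: the `DataRegion` fields, the example regions, and the `LOW_WRITE_THRESHOLD` value are assumptions, since the slides do not give a numeric cutoff for "very low write frequency".

```python
from dataclasses import dataclass

@dataclass
class DataRegion:
    name: str
    has_clean_copy_on_disk: bool  # e.g. data loaded from a file and never modified
    writes_per_minute: float      # measured write frequency (hypothetical metric)

# Hypothetical cutoff for "very low write frequency"; not taken from the slides.
LOW_WRITE_THRESHOLD = 1.0

def classify(region: DataRegion) -> str:
    if region.has_clean_copy_on_disk:
        return "implicitly recoverable"
    if region.writes_per_minute <= LOW_WRITE_THRESHOLD:
        return "explicitly recoverable"
    return "not recoverable by software"

# Made-up regions, for illustration only.
regions = [
    DataRegion("read-only index", True, 0.0),
    DataRegion("rarely updated table", False, 0.2),
    DataRegion("scratch buffer", False, 5000.0),
]
labels = {r.name: classify(r) for r in regions}
for name, label in labels.items():
    print(f"{name}: {label}")
```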

Exploiting Memory Error Tolerance
[Figure: App/Data A, App/Data B, and App/Data C placed along a memory error vulnerability axis, from vulnerable data to tolerant data]
Vulnerable data goes to reliable memory:
• ECC protected
• Well-tested chips
Tolerant data goes to low-cost memory:
• NoECC or Parity
• Less-tested chips
Together these form Heterogeneous-Reliability Memory

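Par+R: Parity Detection + Software Recovery
[Figure: two recovery paths]
• Implicit Recovery: Page A has an intrinsic copy on disk; after a memory error, the page is copied back from disk
• Explicit Recovery: Page B is write non-intensive; its writes are also saved to a copy on disk, and after a memory error the page is copied back from that copy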
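A minimal sketch of the Par+R idea follows, assuming one even-parity bit per word and a whole-page restore from a backing copy. `ParityProtectedPage` and its methods are hypothetical names, and a Python list stands in for both memory and the disk copy; the periodic flush is triggered by hand here rather than by a timer.

```python
def parity(word: int) -> int:
    """Even-parity bit of a word: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1

class ParityProtectedPage:
    def __init__(self, words, backing_copy):
        self.words = list(words)
        self.parity_bits = [parity(w) for w in self.words]
        self.backing_copy = list(backing_copy)  # stands in for the copy on disk

    def write(self, index: int, value: int) -> None:
        self.words[index] = value
        self.parity_bits[index] = parity(value)

    def flush(self) -> None:
        # Explicit recovery: refresh the copy (would run periodically in practice).
        self.backing_copy = list(self.words)

    def read(self, index: int) -> int:
        if parity(self.words[index]) != self.parity_bits[index]:
            # Parity mismatch: the error is detected, so restore the page in software.
            self.words = list(self.backing_copy)
            self.parity_bits = [parity(w) for w in self.words]
        return self.words[index]

page = ParityProtectedPage([5, 6, 7], backing_copy=[5, 6, 7])
page.words[1] ^= 1 << 3     # inject a single-bit error behind the page's back
recovered = page.read(1)    # detection triggers recovery from the copy
print(recovered)
```

A single parity bit only detects an odd number of flipped bits in a word, and writes made since the last flush are lost on recovery, which is why the explicit path suits write non-intensive data.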

Heterogeneous-Reliability Memory
Step 1: Characterize and classify application memory error tolerance
[Figure: App 1 data A, App 1 data B, App 2 data A, App 2 data B, App 3 data A, and App 3 data B are sorted from vulnerable to tolerant: App 1 data A, App 1 data B, App 3 data A, App 3 data B, App 2 data A, App 2 data B]
Step 2: Map application data to the HRM system enabled by SW/HW cooperative solutions
[Figure: memory tiers from reliable to unreliable: reliable memory; parity memory + software recovery (Par+R); low-cost memory]
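The two steps can be sketched as a classification followed by a mapping. Everything specific here is an assumption made for illustration: the `DataClass` fields, the crash-probability threshold, the example values, and the policy of sending vulnerable but recoverable data to Par+R are one plausible reading of the slide, not the authors' exact procedure.

```python
from dataclasses import dataclass

@dataclass
class DataClass:
    name: str
    crash_probability: float  # crashes per injected error, from characterization
    recoverable: bool         # can software restore it after detection?

def map_to_tier(d: DataClass, vulnerable_above: float = 0.01) -> str:
    """Step 2: pick a memory tier from the Step 1 characterization (hypothetical policy)."""
    if d.crash_probability <= vulnerable_above:
        return "low-cost memory"   # tolerant data
    if d.recoverable:
        return "Par+R"             # vulnerable, but detect-and-recover suffices
    return "reliable memory"       # vulnerable and not recoverable

# Made-up characterization results, for illustration only.
characterized = [
    DataClass("app1.dataA", 0.05, False),
    DataClass("app3.dataA", 0.03, True),
    DataClass("app2.dataB", 0.001, False),
]
placement = {d.name: map_to_tier(d) for d in characterized}
for name, tier in placement.items():
    print(f"{name} -> {tier}")
```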

Evaluated Systems
Mapping columns: Private (36 GB), Heap (9 GB), Stack (60 MB)
Mapping entries: ECC, NoECC, Par+R, NoECC; ECC, NoECC, Par+R; ECC, NoECC, NoECC, NoECC, NoECC
Baseline systems:
• Typical Server: Reliable but expensive
• Consumer PC: Low-cost but unreliable
HRM systems:
• HRM: Parity only
• Less-Tested (L): Least expensive and reliable
• HRM/L: Low-cost and reliable

Design Parameters

Parameter                                Value
DRAM/server HW cost [Kozyrakis '10]      30%
NoECC memory cost savings                11.1%
Parity memory cost savings               9.7%
Less-tested memory cost savings          18% ± 12%
Crash recovery time                      10 mins
Par+R flush threshold                    5 mins
Errors/server/month [Schroeder '09]      2000
Target single server availability        99.90%
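Two back-of-the-envelope calculations show how these parameters combine. This is a sketch under stated assumptions: a 30-day month, downtime caused only by crash recovery, and server savings computed as the DRAM cost share times the memory savings. The function names are hypothetical.

```python
DRAM_SHARE = 0.30            # DRAM / server HW cost
CRASH_RECOVERY_MIN = 10.0    # minutes of downtime per crash
TARGET_AVAILABILITY = 0.9990
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month

def server_hw_savings(memory_savings_pct: float) -> float:
    """Server HW cost savings (%) if all memory gets the given savings."""
    return DRAM_SHARE * memory_savings_pct

def availability(crashes_per_month: float) -> float:
    """Fraction of the month the server is up, counting only crash recovery."""
    return 1.0 - crashes_per_month * CRASH_RECOVERY_MIN / MINUTES_PER_MONTH

def max_crashes_per_month(target: float = TARGET_AVAILABILITY) -> float:
    """Largest crash rate that still meets the availability target."""
    return (1.0 - target) * MINUTES_PER_MONTH / CRASH_RECOVERY_MIN

noecc_savings = server_hw_savings(11.1)   # all memory NoECC
parity_savings = server_hw_savings(9.7)   # all memory parity
crash_budget = max_crashes_per_month()

print(f"NoECC everywhere:  {noecc_savings:.2f}% server HW savings")
print(f"Parity everywhere: {parity_savings:.2f}% server HW savings")
print(f"Crash budget: {crash_budget:.2f} crashes/server/month")
```

Under these assumptions the 99.90% target leaves 43.2 minutes of downtime per month, so a server can afford only a handful of crashes even though 2000 errors occur per month; most errors must be masked, corrected, or recovered.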

Evaluation Metrics
• Cost
  ‐ Memory cost savings
  ‐ Server HW cost savings (both compared with Typical Server)
• Reliability
  ‐ Crashes/server/month
  ‐ Single server availability
  ‐ # incorrect/million queries

Improving Server HW Cost Savings
[Figure: server HW cost savings (%) for Typical Server, Consumer PC, HRM, Less-Tested, and HRM/L; value labels shown: 8.1, 4.7, 3.3, 2.9]
Reducing the use of memory error mitigation techniques in part of the memory space can save a noticeable amount of server HW cost

Achieving Target Availability
[Figure: single server availability (%) for Typical Server, Consumer PC, HRM, Less-Tested, and HRM/L, with the single server availability target of 99.90% marked]
HRM systems are flexible to adjust and can achieve the availability target

Achieving Acceptable Correctness
[Figure: # incorrect/million queries (log scale) for Typical Server, Consumer PC, HRM, Less-Tested, and HRM/L; value labels shown: 163, 33, 12, 9]
HRM systems can achieve acceptable correctness

Evaluation Results
[Figure: comparison of Typical Server, Consumer PC, HRM, Less-Tested (L), and HRM/L across the evaluation metrics; inner is worse, outer is better, and a bigger area means a better tradeoff]

Other Results and Findings in the Paper
• Characterization of applications' reactions to memory errors
  ‐ Finding: Quick-to-crash vs. periodically incorrect behavior
• Characterization of most common types of memory errors, including single-bit soft/hard errors and multi-bit hard errors
  ‐ Finding: More severe errors mainly decrease correctness
• Characterization of how errors are masked
  ‐ Finding: Some memory regions are safer than others
• Discussion about heterogeneous reliability design dimensions, techniques, and their benefits and tradeoffs

Conclusion
• Our Goal: Reduce datacenter cost; meet availability target
• Characterized application-level memory error tolerance of 3 modern data-intensive workloads
• Proposed Heterogeneous-Reliability Memory (HRM)
  ‐ Store error-tolerant data in less-reliable lower-cost memory
  ‐ Store error-vulnerable data in more-reliable memory
• Evaluated example HRM systems
  ‐ Reduce server hardware cost by 4.7%
  ‐ Achieve single server availability target of 99.90%

Why use a software debugger?
• Speed
  ‐ Our workloads are relatively long running
    • WebSearch: 30 minutes
    • Memcached: 10 minutes
    • GraphLab: 10 minutes
  ‐ Our workloads have large memory footprints
    • WebSearch: 46 GB
    • Memcached: 35 GB
    • GraphLab: 4 GB

What are the workload properties?
• WebSearch
  ‐ Repeat a real-world trace of 200,000 queries, at 400 qps
  ‐ Correctness: Top 4 most relevant documents
    • Document id
    • Relevance and popularity
• Memcached
  ‐ 30 GB of Twitter dataset
  ‐ Synthetic client workload, at 5,000 rps
  ‐ 90% read requests and 10% write requests
• GraphLab
  ‐ 11 million Twitter users' following relations, 1.3 GB dataset
  ‐ TunkRank algorithm
  ‐ Correctness: 100 most influential users and their scores

How many errors are injected into each application and each memory region?
• WebSearch: 20,576
• Memcached: 983
• GraphLab: 2,159
• The number of errors injected into each memory region is proportional to the region's size

Region    WebSearch   Memcached   GraphLab
Private   36 GB       N/A         N/A
Heap      9 GB        35 GB       4 GB
Stack     60 MB       132 KB      132 KB
Total     46 GB       35 GB       4 GB

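Does HRM require HW changes?
[Figure: a CPU with three memory controllers (Ctrl 0, Ctrl 1, Ctrl 2), each driving its own channel (Channel 0, Channel 1, Channel 2) to a DIMM; Channel 0 carries an ECC DIMM, while Ctrl 1/Channel 1 and Ctrl 2/Channel 2 are marked *]
* Memory controller/channel without ECC support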

What is the injection/monitoring process?
Start, then repeat:
1. (Re)Start App
2. Inject Errors (Soft/Hard)
3. Run Client Workload
4. App Crash? If YES, repeat from step 1; if NO, continue to step 5
5. Compare Result with Expected Output
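The loop above can be sketched with a toy stand-in for the application. This is not the debugger-based setup used in the study: `ToyApp`, its crash rule, and the eight-address workload are invented for illustration, and only single-bit soft errors are modeled.

```python
import random

class ToyApp:
    """Hypothetical stand-in for the application under test."""
    def __init__(self):
        self.memory = bytearray(range(16))

    def query(self, address: int) -> int:
        value = self.memory[address]
        if value >= 128:  # toy rule: a set top bit acts like a wild pointer
            raise RuntimeError("application crash")
        return value

WORKLOAD = range(8)  # the client only ever touches the first 8 addresses

def inject_soft_error(memory: bytearray, rng: random.Random) -> None:
    """Flip one random bit of one random byte."""
    address = rng.randrange(len(memory))
    memory[address] ^= 1 << rng.randrange(8)

def run_trial(rng: random.Random) -> str:
    app = ToyApp()                                   # 1. (re)start app
    expected = [app.query(a) for a in WORKLOAD]      # error-free reference output
    inject_soft_error(app.memory, rng)               # 2. inject error
    try:
        results = [app.query(a) for a in WORKLOAD]   # 3. run client workload
    except RuntimeError:
        return "crash"                               # 4. app crashed
    return "correct" if results == expected else "incorrect"  # 5. compare

def run_campaign(trials: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    counts = {"crash": 0, "correct": 0, "incorrect": 0}
    for _ in range(trials):
        counts[run_trial(rng)] += 1
    return counts

counts = run_campaign(200, seed=1)
print(counts)
```

Errors that land in addresses the workload never loads show up as "correct", which mirrors how masking reduces the visible impact of injected errors.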

Comparison with previous works?
• Virtualized and flexible ECC [Yoon '10]
  ‐ Requires changes to the MMU in the processor
  ‐ Performance overhead ~10% over NoECC
• Our work: HRM
  ‐ Minimal changes to memory controller to enable different ECC on different channels
  ‐ Low performance overhead
  ‐ Enables the use of less-tested memory