
Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors
ISCA 2014
*Work done while at Carnegie Mellon University
Presented by Allan Benelli, ETH Zürich, 07 November 2018

Problem

- The continued scaling of DRAM process technology has enabled smaller cells to be placed closer to each other.
- This gives us:
  - More cells per unit area
  - Lower cost per bit of memory
- But also:
  - Reduced noise margin, making cells more vulnerable to data loss
  - Electromagnetic coupling effects between cells
  - Higher variation in process technology, which increases the number of outlier cells
- As a result, high-density DRAM is more likely to suffer from disturbance, a phenomenon in which different cells interfere with each other's operation.
- If a cell is disturbed beyond its noise margin, it malfunctions and experiences a disturbance error.

Background

DRAM Cell

DRAM Access & Refresh

- Open Row: raise the wordline, transfer the row's data into the row-buffer
- Read/Write: access the row-buffer's data
- Close Row: lower the wordline, clear the row-buffer
- Refresh: restore the charge in the cells (DDR3: roughly every 64 ms; in effect the same as opening a row)

Goal

- Expose the existence and the widespread nature of disturbance errors in commodity DRAM chips sold and used "today" (2014).

Novelty, Key Approach, and Ideas

Novelty

- It demonstrates the existence of DRAM disturbance errors on real DRAM devices from three major manufacturers and on real systems using such devices (with user-level assembly code)
  - Known as "RowHammer"
- It characterizes in detail the behavior and symptoms of DRAM disturbance errors using an FPGA-based DRAM testing platform
- It proposes and explores various solutions to prevent DRAM disturbance errors, and shows that a novel, low-cost system-level approach is a viable solution to the RowHammer problem

Key Ideas & Approach

- Causes of disturbance errors
  - Electromagnetic coupling
    - Toggling the wordline voltage briefly increases the voltage of adjacent wordlines, which slightly opens adjacent rows -> leakage of charge
  - Conductive bridges
  - Hot-carrier injection

Toggling the wordline

- Repeated toggling of the wordline causes the nearby cells to leak charge

[Figure: an aggressor row between victim rows; after repeated toggling, some 1s in the victim rows have become 0s]
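
The effect on this slide can be mimicked with a toy model. The sketch below is only an illustration, not the paper's methodology: the names `ToyBank` and `hammer`, the per-row disturbance counter and the `flip_threshold` value are all assumptions made for this example, and real charge leakage is an analog effect that differs from cell to cell.

```python
class ToyBank:
    """Toy DRAM bank: each row counts how often a neighbour was toggled."""

    def __init__(self, num_rows, flip_threshold):
        self.flip_threshold = flip_threshold  # hypothetical value, not measured
        self.disturbance = [0] * num_rows
        self.flipped = [False] * num_rows

    def activate(self, row):
        # Opening a row restores its own charge, like a refresh.
        self.disturbance[row] = 0
        for victim in (row - 1, row + 1):
            if 0 <= victim < len(self.disturbance):
                self.disturbance[victim] += 1
                if self.disturbance[victim] >= self.flip_threshold:
                    self.flipped[victim] = True  # charge lost, stays lost

    def refresh_all(self):
        # Restores charge; cells that already flipped keep the wrong value.
        self.disturbance = [0] * len(self.disturbance)


def hammer(bank, aggressor, activations, refresh_every=None):
    """Toggle one aggressor row repeatedly and list the rows that flipped."""
    for i in range(1, activations + 1):
        bank.activate(aggressor)
        if refresh_every and i % refresh_every == 0:
            bank.refresh_all()
    return [row for row, bad in enumerate(bank.flipped) if bad]
```

In this model, `hammer(ToyBank(8, 1000), 3, 1000)` reports rows 2 and 4 as flipped, while refreshing every 500 activations keeps every row intact.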

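Mechanisms

How to Induce Errors

Setup: x86 CPU, DDR3 DRAM module

    loop:
      mov (X), %eax     # read address X (opens X's row)
      mov (Y), %ebx     # read address Y (opens Y's row)
      clflush (X)       # evict X from the cache so the next read goes to DRAM
      clflush (Y)       # evict Y from the cache
      mfence            # wait until the flushes have completed
      jmp loop

[Figure: the rows holding X and Y are opened repeatedly; bits flip in other rows of the DRAM module]

Source: Y. Kim's talk on "Flipping Bits in Memory Without Accessing Them"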

Key Results: Methodology and Evaluation

Methodology

- 8 FPGA boards with a DDR3 DRAM memory controller
- Tested 129 DRAM modules from manufacturers A, B and C, with capacities from 512 MB to 2 GB and production years 2008 to 2014
- Access Interval (AI): time between two accesses
- Refresh Interval (RI): time between two refreshes
- Data Pattern (DP): data stored in DRAM, e.g. RowStripe (and its inverse ~RowStripe): alternating rows of 1s and 0s

Disturbance Errors are Widespread

- Most modules are at risk
  - Errors could be induced in 110 of the 129 tested modules
- The modules without errors were built before 2012 (except one)

Error = Charge Loss

- Two types of errors
  - '1' -> '0' and '0' -> '1'
- Two types of cells (chosen by the manufacturer)
  - True-cell: charged = 1 -> only '1' -> '0' errors
  - Anti-cell: charged = 0 -> only '0' -> '1' errors
- A given cell suffers only one type of error
- Errors are a loss of charge
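
The true-cell / anti-cell rule can be written down as a tiny lookup. This is a sketch for illustration only; the names `CHARGED_VALUE` and `disturbance_flip` are made up for this example.

```python
# Logical value that each cell type represents with a charged capacitor.
CHARGED_VALUE = {"true": 1, "anti": 0}


def disturbance_flip(cell_type, stored_bit):
    """Return (before, after) if charge loss flips the bit, else None."""
    charged = CHARGED_VALUE[cell_type]
    if stored_bit != charged:
        return None  # the cell holds no charge, so there is nothing to lose
    return (stored_bit, 1 - stored_bit)
```

A true-cell storing 1 can only go to 0, an anti-cell storing 0 can only go to 1, and the other two combinations cannot flip, which is why a given cell shows only one error direction.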

Address Correlation

- The difference between victim and aggressor row addresses peaks at +/- 1
- But why this distribution?
  - Physical address may differ from logical address
  - Faulty rows are often re-mapped to spare rows
  - An aggressor row can affect more than two rows

Sensitivity

- Shorter RI -> fewer errors
- To eliminate all disturbance errors, the refresh interval must be shortened by 7x for the worst module
- Longer AI -> fewer errors
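
Both trends follow from a simple count: the number of times an aggressor row can be opened before its neighbours are refreshed is at most RI / AI. A quick sketch of that arithmetic, where the function name and the sample values (RI = 64 ms, AI = 55 ns) are illustrative assumptions rather than figures taken from these slides:

```python
def max_activations(refresh_interval_ns, access_interval_ns):
    """Upper bound on row activations that fit between two refreshes."""
    return refresh_interval_ns // access_interval_ns


RI_NS = 64_000_000  # 64 ms, assumed for illustration
AI_NS = 55          # 55 ns, assumed for illustration

baseline = max_activations(RI_NS, AI_NS)
shorter_ri = max_activations(RI_NS // 7, AI_NS)  # refresh 7x more often
longer_ai = max_activations(RI_NS, AI_NS * 7)    # access 7x less often
print(baseline, shorter_ri, longer_ai)
```

Shortening RI or lengthening AI by the same factor shrinks the activation budget by that factor, so fewer victim cells are pushed past their noise margin.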

Sensitivity

- Errors also depend on the data stored in other cells
- Data patterns: Solid, RowStripe, ColStripe, Checkered
- RowStripe causes ~10x more errors than Solid
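
The four pattern names can be made concrete with a small generator. This is a sketch for illustration: the function names are made up, and which pattern starts with a 1 rather than a 0 is an assumption (the inverse patterns cover the other phase).

```python
def data_pattern(name, rows, cols):
    """Build a rows x cols grid of bits for one of the four named patterns."""
    cell = {
        "solid": lambda r, c: 1,                    # every cell holds 1
        "rowstripe": lambda r, c: (r + 1) % 2,      # rows alternate 1s and 0s
        "colstripe": lambda r, c: (c + 1) % 2,      # columns alternate 1s and 0s
        "checkered": lambda r, c: (r + c + 1) % 2,  # checkerboard
    }[name]
    return [[cell(r, c) for c in range(cols)] for r in range(rows)]


def invert(grid):
    """The ~ variant of a pattern: every bit flipped."""
    return [[1 - bit for bit in row] for row in grid]
```

For example, `data_pattern("rowstripe", 2, 4)` gives one row of 1s followed by one row of 0s, and `invert(...)` of it gives ~RowStripe.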

Error Correction Code - ECC

- Couldn't we just use a simple error correction code such as SECDED?
  - SECDED (Single Error Correction, Double Error Detection) detects up to two errors and can correct one error
- How many errors per row?
- SECDED is not safe!
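
Why more than two flips in one protected word defeat SECDED can be shown with a miniature code. The sketch below uses an extended Hamming (8,4) code as a stand-in for the wider SECDED codes used on ECC memory; the function names are made up for this example and it is not taken from the paper.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]  # Hamming positions 1..7
    overall = sum(word) % 2              # extra parity bit for double-error detection
    return word + [overall]


def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'detected'."""
    word = list(word)
    syndrome = 0
    for position in range(1, 8):
        if word[position - 1]:
            syndrome ^= position
    parity_odd = sum(word) % 2 == 1
    if parity_odd:
        # Odd number of flips: the decoder assumes there was exactly one.
        index = syndrome - 1 if syndrome else 7
        word[index] ^= 1
        status = "corrected"
    elif syndrome:
        status = "detected"  # even number of flips, cannot be corrected
    else:
        status = "ok"
    return status, [word[2], word[4], word[5], word[6]]


def flip(word, indices):
    """Return a copy of word with the bits at the given indices inverted."""
    return [bit ^ 1 if i in indices else bit for i, bit in enumerate(word)]
```

With `cw = encode([1, 0, 1, 1])`, one flip (`flip(cw, {4})`) is corrected and two flips (`flip(cw, {4, 5})`) are detected, but three flips (`flip(cw, {2, 4, 5})`) come back with status `"corrected"` and the wrong data, i.e. silent corruption.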

Other results

- Victim cells != weak cells
  - Weak cells := cells with the shortest retention times
- Errors are repeatable, but this needs a lot of testing time
- Errors are almost independent of temperature changes
- Some cells have two aggressors

Possible Solutions

- Make better chips
  - … depends on process technology
- Correct errors
  - … multibit errors and overhead
- Retire cells (end-user)
  - … end-user pays for identifying and remapping
- Retire cells (manufacturer)
  - … exhaustive search, many spare cells required
- Refresh all rows frequently
  - … shorten RI -> overhead
- Identify hot rows, refresh neighbours
  - … counters needed, complex
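
The last option, identifying hot rows, shows where the cost comes from: the controller has to keep a counter per tracked row. A minimal sketch under stated assumptions (the class name `HotRowTracker`, the `threshold` parameter and the simple row +/- 1 neighbour rule are hypothetical; a real design would also need the physical row mapping):

```python
from collections import Counter


class HotRowTracker:
    """Counts activations per row and flags neighbours of hot rows."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical activation limit
        self.counts = Counter()     # the per-row state that makes this costly

    def on_activate(self, row):
        """Count one activation; return the neighbour rows to refresh, if any."""
        self.counts[row] += 1
        if self.counts[row] < self.threshold:
            return []
        self.counts[row] = 0
        return [r for r in (row - 1, row + 1) if r >= 0]

    def on_refresh_window_end(self):
        # All rows were just refreshed, so every count can start over.
        self.counts.clear()
```

Every activation touches the counter table, which is the "counters needed, complex" drawback noted on the slide.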

Proposed Solution

- PARA (Probabilistic Adjacent Row Activation)
  - Idea:
    - When a row is opened/closed, an adjacent row is opened with a small probability
  - Mechanism:
    - When a row is closed, flip a biased coin (p << 1)
    - If heads, refresh one of the two adjacent rows
  - Problem:
    - Needs to know how the logical-to-physical row mapping is done by the manufacturer
  - Advantages:
    - Refreshes rows infrequently (low power and performance overhead)
    - Stateless (low cost and low complexity)
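
The mechanism is small enough to sketch in a few lines. The code below is an illustration under assumptions, not the paper's implementation: the function names, the use of Python's `random` module as the coin, and the row +/- 1 neighbour rule (which ignores the mapping problem listed above) are all choices made for this example. `miss_probability` assumes independent coin flips.

```python
import random


def para_on_row_close(row, p, rng, num_rows):
    """Return the adjacent row to refresh, or None if the coin shows tails."""
    if rng.random() >= p:
        return None  # tails: do nothing, which is the common case for p << 1
    neighbour = row + rng.choice((-1, 1))  # pick one of the two adjacent rows
    return neighbour if 0 <= neighbour < num_rows else None


def miss_probability(p, activations):
    """Chance that one given neighbour is never refreshed by PARA.

    Each row close refreshes a specific neighbour with probability p / 2.
    """
    return (1 - p / 2) ** activations
```

With an assumed p = 0.001 and 100,000 activations of the same aggressor, `miss_probability(0.001, 100_000)` is far below 1e-20, which illustrates how a stateless coin flip can still make an unrefreshed victim extremely unlikely.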

Summary

- Disturbance errors are widespread in commodity DRAM chips from recent years.
  - 110 of 129 modules are vulnerable
- When a row is opened repeatedly, adjacent rows leak charge
  - "It's like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after" (slides of O. Mutlu)
- The paper...
  - characterizes disturbance errors using FPGA-based testing
  - examines seven solutions to tolerate, prevent or mitigate these errors and proposes PARA as the most efficient, low-overhead solution

Strengths

- The first paper to expose the widespread existence of disturbance errors in DRAM chips
  - Is the basis for a lot of further work (321 citations)
- Identifies a new reliability problem and a security vulnerability, RowHammer, that affects an entire generation of computing systems being used today
  - RowHammer is still relevant today!
- Real-system approach, not only theoretical
- With PARA, a neat solution is provided
- Clearly structured paper with a good flow

Weaknesses

- Assumes the existence of security exploits, but only touches on the topic and doesn't provide a working example.
- The paper is limited to the x86 architecture.
- The paper relies on the memory controller flipping a coin. If the outcome of these coin flips could be predicted, an attacker might circumvent PARA. It is not explained how the coin could be implemented or how such problems would be avoided.
- It is not discussed why the Intel processors show more bit flips than the AMD processors.

Thoughts and Ideas

- What about RowHammer today?
  - Google Project Zero exploited the DRAM RowHammer bug to gain kernel privileges
  - Recent studies and reports also suggest that DDR4 RAM, mobile phones (ARM) and mobile-phone GPUs are vulnerable, and that RowHammer attacks are possible over the network
  - "Solutions": shorten RI to 32 ms, ECC, TRR (Target Row Refresh) and restricting clflush
- Worth reading if you want to understand further papers on RowHammer
- What about ARM / mobile platforms? What about SRAM, flash and hard disks?
  - ARM --> next presentation
  - NAND flash --> paper

Takeaways

Key Takeaways

- RowHammer is a real issue: disturbance errors are widespread!
- The shrinking of computer parts and the associated problems, including RowHammer, should receive much more attention than they currently do.
- Progress in manufacturing technology and the scaling down to smaller dimensions can produce unexpected errors that one wouldn't think of.

Questions/Open Discussion

Discussion

- Is it very likely for a normal application to hammer a row accidentally?
- Is shortening the refresh interval (and lengthening the access interval) a practical approach?
- Is PARA enough? Do you have other solutions in mind?
- How would you implement the coin flip used in PARA?
- Was this paper a roadmap for hackers?

Additional Slides

Additional papers and webpages

- Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript [D. Gruss et al., 2015]
- Throwhammer: Rowhammer Attacks over the Network and Defenses [A. Tatar et al., 2018]
- DDR4: http://www.thirdio.com/rowhammer.pdf
- Exploiting the DRAM rowhammer bug to gain kernel privileges [M. Seaborn et al., 2015]
- Read Disturb Errors in MLC NAND Flash Memory: … [Y. Cai, O. Mutlu, et al., 2015]
- ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks [Z. Aweke et al., 2016]
- Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU [P. Frigo et al., 2018]