Understanding the Robustness of SSDs under Power Fault
서동화 (dhdh 0113@gmail.com)
Embedded System Lab.

Contents
- Introduction
- Background
- Testing Framework
- Experimental Results

Introduction
- Flash-based solid state disks (SSDs)
  - A "truly revolutionary and disruptive" technology.
    - Greater performance.
    - Lower power draw.
  - The behavior of flash memory in adverse conditions has only been studied at a component level.
    - Given the opaque and confidential nature of FTLs, the behavior of full devices in unusual conditions is still a mystery to the public.
  - This paper considers the behavior of SSDs under power fault.
  - Although loss of power seems like an easy fault to prevent, recent experience shows that a simple loss of power is still a distressingly frequent occurrence.

Introduction
- Power fault cases
  - Amazon, May 2010: "Car Crash Triggers Amazon POWER OUTAGE…"
  - Amazon, Jun. 2012: "Amazon Data Center LOSES POWER During Storm…"
  - HOSTING, Jul. 2012: "… human error was responsible for a data center POWER OUTAGES…"
  - iWeb, 2010: "About 3,000 servers at Montreal web host iWeb experienced an OUTAGES…"
  - And so on…

Background
- NAND flash low-level details
  - The floating gate inside a NAND flash cell is susceptible to a variety of faults that may cause data corruption.
    - Write endurance
    - Program disturb
    - Read disturb
    - Aging

Background
- NAND flash low-level details
  [Figure: NAND flash cell operations, with panels for write, erase, and read]

Reference
- Write disturb
- Program disturb
  Source: "Characterizing Flash Memory: Anomalies, Observations, and Applications"

Reference
- Read disturb
  Source: "Characterizing Flash Memory: Anomalies, Observations, and Applications"

Background
- SSD high-level concerns
  - SSDs use firmware called the flash translation layer (FTL) to make the device appear as if it can do update-in-place.
    - The primary responsibility of an FTL is to maintain a mapping between logical and physical addresses.
    - Remapping tables are typically stored in a volatile write-back cache.
    - Due to cost considerations, manufacturers typically attempt to minimize the size of the write-back cache as well as the capacitor backing it.
    - Loss of power during program operations can make the flash cells more susceptible to other faults.
    - Erase operations are also susceptible to power loss, since they take much longer to complete than program operations.

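The interaction between out-of-place updates and a volatile remapping table can be illustrated with a small model. This is a minimal sketch, not how any real FTL is implemented: the class `ToyFTL` and its methods `write`, `flush`, and `power_fault` are hypothetical names, and garbage collection, wear leveling, and backup capacitors are all ignored.

```python
class ToyFTL:
    """Toy flash translation layer (hypothetical model, not a real FTL)."""

    def __init__(self, num_pages):
        self.flash = [None] * num_pages   # physical pages, survive power loss
        self.next_free = 0                # next unwritten physical page
        self.persisted_map = {}           # mapping table as last saved to flash
        self.cached_map = {}              # volatile write-back copy in RAM

    def write(self, lba, data):
        # Flash pages cannot be updated in place, so each write goes to a
        # fresh physical page and only the logical-to-physical map changes.
        ppa = self.next_free
        self.flash[ppa] = data
        self.next_free += 1
        self.cached_map[lba] = ppa

    def read(self, lba):
        ppa = self.cached_map.get(lba)
        return None if ppa is None else self.flash[ppa]

    def flush(self):
        # Persist the cached mapping table.
        self.persisted_map = dict(self.cached_map)

    def power_fault(self):
        # Volatile state is lost; the device restarts from the saved table.
        self.cached_map = dict(self.persisted_map)


ftl = ToyFTL(num_pages=8)
ftl.write(0, "v1")
ftl.flush()
ftl.write(0, "v2")        # remapped to a new page, mapping not yet flushed
before = ftl.read(0)
ftl.power_fault()
after = ftl.read(0)
print(before, after)
```

In this model the newer data is physically present on flash after the fault, but the device serves the older version because the mapping update was lost with the cache.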
Testing Framework
- Types of failures
  - Bit corruption
  - Metadata corruption
  - Dead device
  [Figure: device contents before power fault and after power fault]

Testing Framework
- Types of failures
  - Shorn writes
  - Flying writes
  [Figure: device contents before power fault and after power fault]

Testing Framework
- Types of failures
  - Bit corruption
    - Half-programmed flash cells are susceptible to bit errors.
  - Flying writes
    - Due to corruption and missing updates in the FTL's remapping tables.
  - Unserializable writes
    - Due to the high degree of parallelism inside an SSD.
  - Metadata corruption
    - Because an FTL is a complex piece of software and corruption of its internal state could be problematic.
  - Shorn writes
    - Because single operations may be internally remapped to multiple flash chips to improve throughput.

Testing Framework
- Types of failures
  - Local consistency
    - Most of the faults can be detected using local-only data.
    - Either a record is correct or it is not.
  - Global consistency
    - Unserializability is a more complex property.
    - Whether the result of a workload is serializable depends not only on individual records, but on how they fit into a total order of all the operations.

Testing Framework
- Detecting local failures
  - In order to detect local failures, we need to write records that can be checked for consistency.

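One way to make a record checkable on its own is to store a checksum and the intended block number inside it. The sketch below assumes a hypothetical layout (CRC32 checksum, timestamp, block number, worker id, zero padding to 4096 bytes); the field set and the helper names `make_record` and `check_record` are illustrative assumptions, not the format used in the paper.

```python
import struct
import zlib

BLOCK_SIZE = 4096          # assumed record size
FIELDS = "<QQI"            # timestamp_ns, block_number, worker_id (assumed)


def make_record(timestamp_ns, block_number, worker_id):
    """Build one self-describing record of exactly BLOCK_SIZE bytes."""
    body = struct.pack(FIELDS, timestamp_ns, block_number, worker_id)
    body = body.ljust(BLOCK_SIZE - 4, b"\0")
    checksum = zlib.crc32(body)
    return struct.pack("<I", checksum) + body


def check_record(raw, expected_block):
    """Classify a record read back from the device using local data only."""
    if len(raw) != BLOCK_SIZE:
        return "short"
    (checksum,) = struct.unpack_from("<I", raw, 0)
    if zlib.crc32(raw[4:]) != checksum:
        return "corrupt"       # bit corruption or a partially written record
    _, block_number, _ = struct.unpack_from(FIELDS, raw, 4)
    if block_number != expected_block:
        return "misplaced"     # a well-formed record in the wrong place
    return "ok"


record = make_record(timestamp_ns=1, block_number=7, worker_id=0)
damaged = bytearray(record)
damaged[100] ^= 0x01           # flip one bit
print(check_record(record, 7), check_record(record, 8),
      check_record(bytes(damaged), 7))
```

A checksum mismatch points at corrupted or partially written data, while a valid record carrying the wrong block number matches the flying-write pattern.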
Testing Framework
- Dealing with complex FTLs
  - Naive padding
  - Random number padding
  - Pad with copies of the header
  - Advanced FTL's compression
    - In order to avoid such compression, we further perform randomization on the regular record format.

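Padding with header copies and then randomizing the result can be sketched as follows. The masking scheme here (XOR with a pseudo-random byte stream whose seed is stored in the clear) is an assumption chosen for illustration; the slides do not say how the randomization is done, and `build_block` and `recover_header_copies` are hypothetical helpers.

```python
import random
import struct

BLOCK_SIZE = 512   # assumed block size for the example


def pad_with_header_copies(header, length):
    """Fill `length` bytes with repeated copies of the header."""
    copies = length // len(header) + 1
    return (header * copies)[:length]


def mask(data, seed):
    """XOR data with a seeded pseudo-random stream; applying it twice undoes it."""
    rng = random.Random(seed)
    stream = bytes(rng.getrandbits(8) for _ in range(len(data)))
    return bytes(a ^ b for a, b in zip(data, stream))


def build_block(header, seed):
    # The seed stays unmasked so that a checker can reverse the masking.
    body = mask(pad_with_header_copies(header, BLOCK_SIZE - 4), seed)
    return struct.pack("<I", seed) + body


def recover_header_copies(block, header_len):
    (seed,) = struct.unpack_from("<I", block, 0)
    body = mask(block[4:], seed)
    last_start = len(body) - header_len
    return [body[i:i + header_len] for i in range(0, last_start + 1, header_len)]


header = struct.pack("<QQ", 123, 42)     # stand-in header: timestamp, block number
block = build_block(header, seed=99)
copies = recover_header_copies(block, len(header))
print(len(block), len(copies), all(c == header for c in copies))
```

The redundant copies give the checker many chances to spot partial damage, while the masking keeps the on-device bytes from forming a repetitive pattern that a compressing FTL could shrink.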
Testing Framework
- Detecting global failures
  - Unserializability is not a property of a single record and thus cannot be tested with local information alone.
  - During a power fault, we expect that some FTLs may fail to persist outstanding writes to the flash, or may lose mapping table updates.
    - We call such misordered or missing operations unserializable writes.

Testing Framework
- Detecting global failures
  - To detect unserializability, we need information about the completion time of each write.
  - We make use of the time when the records were created.

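The timestamp comparison can be sketched as a simple check. This is an illustrative simplification, assuming the test host keeps a log of acknowledged writes and that each recovered record carries its creation time; the function `find_unserializable` and its argument shapes are hypothetical.

```python
def find_unserializable(acked_writes, recovered):
    """Return blocks whose recovered record is older than an acknowledged write.

    acked_writes: iterable of (block, created_at) pairs for writes that the
                  device reported as complete before the power fault.
    recovered:    dict mapping block -> created_at of the record found there
                  after the fault (block absent if no record was found).
    """
    newest_acked = {}
    for block, created_at in acked_writes:
        if created_at > newest_acked.get(block, float("-inf")):
            newest_acked[block] = created_at

    stale = []
    for block, expected in sorted(newest_acked.items()):
        found = recovered.get(block)
        # A missing record, or one created before the newest acknowledged
        # write to the same block, means a completed write did not survive.
        if found is None or found < expected:
            stale.append(block)
    return stale


acked = [(0, 10), (0, 20), (1, 15), (2, 30)]
found_after_fault = {0: 10, 1: 15}
print(find_unserializable(acked, found_after_fault))
```

In the example, block 0 holds the record created at time 10 although a later write was acknowledged, and block 2 holds nothing, so both are reported.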
Testing Framework
- Applying workloads
  - Random writes
  - Concurrent sequential writes
  - Single-threaded sequential writes
- Power fault injection
- Putting it together

Experimental Results
- Experimental environment
  - We selected fifteen representative SSDs from five different vendors.
    - For comparison purposes, we also evaluated two traditional hard drives.
  - The SSDs and the hard drives are used as raw devices.
    - No file system is created on the devices.
  - We use synchronized I/O.
    - This means each write operation does not return until its data is flushed to the device.
    - Bypass the buffer cache.
- Scenarios
  - Power fault during concurrent random writes.
  - Power fault during concurrent sequential writes.
  - Power fault during single-threaded sequential writes.

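Synchronized writes of this kind can be requested on POSIX systems with the `O_SYNC` open flag. The sketch below is an assumption about how such a write might look, not the test harness from the paper: it targets a temporary file as a stand-in for a raw device, and `sync_write` is a hypothetical helper. Bypassing the buffer cache on Linux would additionally need `O_DIRECT` with aligned buffers, which is left out here.

```python
import os
import tempfile


def sync_write(path, offset, data):
    """Write data at offset; with O_SYNC the call returns only after the
    data has been handed to the underlying storage."""
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        return os.pwrite(fd, data, offset)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "device.img")   # stand-in for a raw device path
    written = sync_write(target, 4096, b"x" * 512)
    size = os.path.getsize(target)

print(written, size)
```

Because the write is synchronous, a record that the host saw complete before the power cut is one the device had already accepted, which is what makes later checks on acknowledged writes meaningful.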
Experimental Results
- Overall results
  What the hell…
  - We found that 13 out of 15 devices exhibited failures.
  - In SSD#3, about one third of the data was lost because one third of the device became inaccessible.
  - In SSD#1, all of the data was lost.

Experimental Results
- Bit corruption
  - One common way to deal with bit errors is using ECC.
  - The number of chip-level bit errors under power failure could exceed the correction capability of ECC.
- Shorn writes
  - This shows that shorn writes are not a rare failure mode under power fault.
  - Subpage programming

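The limit on ECC correction capability can be seen with a small code. The sketch below uses Hamming(7,4), which corrects one flipped bit per 7-bit codeword, purely as a stand-in; real SSDs use stronger codes such as BCH or LDPC, and nothing here reflects the ECC of the tested devices.

```python
def hamming_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    c = [0] * 8                       # index 0 unused
    c[3], c[5], c[6], c[7] = data_bits
    c[1] = c[3] ^ c[5] ^ c[7]         # parity over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]         # parity over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]         # parity over positions 4, 5, 6, 7
    return c[1:]


def hamming_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = [0] + list(codeword)
    syndrome = 0
    for position in range(1, 8):
        if c[position]:
            syndrome ^= position      # XOR of the positions of all set bits
    if syndrome:
        c[syndrome] ^= 1              # assume exactly one error and fix it
    return [c[3], c[5], c[6], c[7]]


data = [1, 0, 1, 1]
codeword = hamming_encode(data)

one_error = list(codeword)
one_error[2] ^= 1                     # one flipped bit: corrected

two_errors = list(codeword)
two_errors[0] ^= 1
two_errors[1] ^= 1                    # two flipped bits: miscorrected

print(hamming_decode(one_error) == data, hamming_decode(two_errors) == data)
```

With two flipped bits the decoder "corrects" a third position and silently returns wrong data, which mirrors what happens when the number of bit errors exceeds what the ECC was designed for.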
Experimental Results
- Unserializable writes
  - No relationship between the number of serialization errors and an SSD's unit price stands out, except for the fact that the most expensive SLC…
  - Scenario
    1) uncompleted program
    2) FTL
    3) old record

Experimental Results
- Metadata corruption
  - After 8 injected power faults, only 69.5% of all the records can be retrieved from SSD#3.
  - This corruption makes 30.5% of the flash memory space unavailable.
  - We assume corruption of metadata.
- Dead device
  - After 136 injected power faults, SSD#1 became completely useless.
  - All of the data stored on it was lost.
    - Loss of metadata
    - Power spike during power loss