UCLA MEMRES: A Fast Memory System Reliability Simulator

Shaodi Wang (1), Henry (Chaohong) Hu (2), Hongzhong Zheng (2), and Puneet Gupta (1)
(1) University of California, Los Angeles
(2) Samsung Semiconductor, Milpitas, USA
NanoCAD Lab

Motivations
• Memory faults are becoming a critical reliability problem
– Technology scaling and emerging non-volatile memory
– Fault recovery increases cost
– Data-heavy applications increase the need for reliable memory
• Memory failure is caused by different types of faults
– Permanent fault (frequently produces errors)
– Transient fault (temporary error)
– Data-link fault (temporary error in the DDR bus)
shaodiwang@g.ucla.edu

Outline
• Overview of memory faults and repair techniques
• MEMRES
• Validation & Case Study
• Conclusion

Overview of Memory Faults
• Fault modes and types are studied and identified in a field study [1].
[Figure: example of a single-row fault (up to 30% of bits in a row are faulty) [1]]
[Figure: example of a single-bank fault (up to 30% of columns and 100% of rows are faulty in a bank) [1]]
[1] V. Sridharan and D. Liberty, "A study of DRAM failures in the field," SC12, 2012, pp. 1-11.

Overview of Memory Faults
[Figure: memory hierarchy (node, channel, rank, chip, bank, rows and columns) annotated with fault types and their causes]
• Channel fault, rank fault, multi-bank fault: logic circuit failure
• Single-bank fault: decoder or I/O fault
• Single-row fault: decoder stuck-at-0 or stuck-at-1 fault
• Single-column fault: sense amplifier failure
• Single-bit fault: cell defect and soft error
• Data-link fault: inter-symbol interference, jitter, voltage noise

Overview of Memory Fault Repair Techniques
• Error-correcting code (ECC):
– Adds extra bits to the original data bits to improve fault tolerance
– E.g. SECDED (64+8 bits): single-error correction and double-error detection
– E.g. Chipkill: multiple-error detection and correction
– In-memory ECC and in-controller ECC
[Diagram labels: Modeling; ECC algorithms (SECDED, Chipkill); ECC designs (in-memory ECC, in-controller ECC)]

Overview of Memory Fault Repair Techniques
• Memory reliability management:
– Hardware sparing: activate spare hardware to replace a failed unit
– Memory mirroring: use one memory space as a mirror of another
– Memory scrubbing: correct transient errors
– Memory page retirement: remove a page with correctable errors from use
• A memory fault simulator is required!
[Diagram labels: Simulation; Modeling; Application dependence; Memory reliability management (hardware sparing, memory scrubbing, memory mirroring, memory page retirement); ECC algorithms (SECDED, Chipkill); ECC designs (in-memory ECC, in-controller ECC)]
Prior Works
Table columns: prior works; time/cost; memory architecture; ECC designs; memory reliability management; application dependence.
• [1], [2] (in-field study): Years/high; Fixed; Fixed; Supported
• [3] (model): <Sec/low; Configurable memory; Fixed; Not supported
• [4-5] (architecture simulator): Infinity/low; Configurable memory; Configurable; Supported
• [6] (Faultsim): Hrs/low; Configurable memory; Configurable; Not supported; Not supported
• Our work: Hrs/low; Configurable memory; Configurable; Supported
[1] B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study," ACM SIGMETRICS Performance Evaluation Review, 2009.
[2] V. Sridharan and D. Liberty, "A study of DRAM failures in the field," SC12.
[3] X. Jian et al., "Analyzing reliability of memory sub-systems with double-chipkill detect correct," PRDC, Dec. 2013.
[4] C.-F. Wu, C.-T. Huang, and C.-W. Wu, "Ramses: a fast memory fault simulator," DFT'99, 1999.
[5] T. M. Niermann, W.-T. Cheng, and J. H. Patel, "Proofs: A fast, memory-efficient sequential circuit fault simulator," TCAD, 1992.
[6] D. Roberts and P. Nair, "Faultsim: A fast, configurable memory-resilience simulator," The Memory Forum: In conjunction with ISCA-41, Tech. Rep., 2014.

Data Structure
• Memory accessing model: Accessing Range (AR)
– Memory accesses in a period cover a range of addresses
– Application memory traces in an interval (hours) are mapped to accessing ranges (a data structure specifying coverage, location, and accessing rate)
• Fault modeling: Fault Range (FR)
– Faults are mapped to different fault types
– A fault is represented by a fault range (coverage, location, and number of faulty bits)
– Failure rate is represented by failure-in-time (FIT)
– Fault FIT varies with fault type, accessing range, and memory device model

Data Structure
[Figure: 8×8 memory array (rows and columns 000 to 111) with eight faulty bits numbered 1 to 8, grouped into ranges A, B, and C]
Traditional representation (Row_addr+Col_addr) for faults or addresses: 1: 001100 … 8: 001001
• A single-column fault may have over 100 faulty bits!
• Millions of addresses are accessed per second!
Our representation (fault range, access range):
Range | MASK (ROW+COL) | ADDRESS (ROW+COL) | Coverrate
A | 111000 | 000100 | 0.625
B | 000011 | 011000 | 0.5
C | 000000 | 001001 | 1
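
A minimal Python sketch of the (mask, address, coverrate) record may help make the representation concrete. The class name `Range`, its method names, and the convention that masked address bits are stored as 0 are assumptions made for illustration; the slides do not show MEMRES's actual data structure.

```python
from dataclasses import dataclass

ADDR_BITS = 6  # 3 row bits + 3 column bits, as in the 8x8 example


@dataclass(frozen=True)
class Range:
    """Hypothetical (mask, address, coverrate) record for a fault or access range."""
    mask: int     # bit = 1: this address bit may take either value
    addr: int     # values of the fixed bits; masked bits are stored as 0
    cover: float  # fraction of the covered addresses that are faulty/accessed

    def size(self) -> int:
        """Number of addresses covered: 2 ** (number of masked bits)."""
        return 1 << bin(self.mask).count("1")

    def contains(self, address: int) -> bool:
        """True if `address` matches the range on every fixed bit."""
        return (address & ~self.mask) == self.addr

    def expected_bits(self) -> float:
        """Expected number of faulty (or accessed) addresses in the range."""
        return self.cover * self.size()


# Ranges A, B, and C of the 8x8 example.
A = Range(mask=0b111000, addr=0b000100, cover=0.625)  # column 100, all rows
B = Range(mask=0b000011, addr=0b011000, cover=0.5)    # row 011, columns 000-011
C = Range(mask=0b000000, addr=0b001001, cover=1.0)    # a single bit

for name, r in (("A", A), ("B", B), ("C", C)):
    print(name, format(r.mask, f"0{ADDR_BITS}b"), format(r.addr, f"0{ADDR_BITS}b"),
          r.size(), r.expected_bits())
```

Three records stand in for the eight individual addresses of the traditional representation: A covers 8 addresses at coverrate 0.625, B covers 4 at 0.5, and C covers 1 at 1.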

MEMRES Overview
[Diagram: MEMRES flow]
• Inputs: application memory trace; memory device fault model; memory ECC design & algorithm; reliability management configuration; memory system architecture
• Pre-simulation processing: fault model (FIT); memory access model (AR)
• Monte-Carlo simulator: fault injection; ECC (detection & correction); memory mirroring; memory page retirement; hardware sparing; memory scrubbing
• Outputs: system failure Pareto; failure rate

Operations - INTERSECT
Usage:
1. Decide whether an access range (a series of addresses to be accessed) activates a fault
2. Decide whether two faults from different chips can be accessed in one word read (64 bits, 8 bits per chip)
Bit-wise operations (& on masks, + on addresses, * on coverrates):
A = (111000, 000100, 0.6)
B = (000111, 011000, 0.5)
C = (000000, 011100, 0.3)
[Figure: 8×8 array in which fault A and fault B intersect at C]
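
The bit-wise example above can be sketched in Python as follows. The `Range` class and the `intersect` function are hypothetical names, and the disjointness check (two ranges share no address when a bit that is fixed in both differs) is an assumption that the slide does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    mask: int     # bit = 1: address bit is a wildcard
    addr: int     # fixed-bit values; wildcard bits are stored as 0
    cover: float  # coverrate


def intersect(a: Range, b: Range) -> Optional[Range]:
    """Return the overlap of two ranges, or None if they share no address."""
    fixed_in_both = ~(a.mask | b.mask)
    # Assumption: the ranges are disjoint when a bit fixed in both differs.
    if (a.addr ^ b.addr) & fixed_in_both:
        return None
    return Range(
        mask=a.mask & b.mask,     # wildcard only if it is a wildcard in both
        addr=a.addr | b.addr,     # OR equals the slide's "+" when no bit is set in both
        cover=a.cover * b.cover,  # product of coverrates, as in the slide's example
    )


A = Range(mask=0b111000, addr=0b000100, cover=0.6)
B = Range(mask=0b000111, addr=0b011000, cover=0.5)
C = intersect(A, B)
print(format(C.mask, "06b"), format(C.addr, "06b"), C.cover)
```

Using OR instead of integer addition avoids a carry when both ranges fix the same bit to 1; for the slide's example the two give the same address.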

Operations - REMOVE
Usage:
1. Clear transient faults from memory (memory scrubbing)
2. Block access to certain addresses (memory page retirement)
Operations (remove B from A), iterated:
1) Remove the target
2) Combine the remaining parts into the fewest blocks whose sizes are powers of 2
[Figure: 8×8 array showing range A, target B, and the collection of blocks after removal]
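
One way to realize the two steps is to split A bit by bit around the target, which leaves one power-of-2 block per address bit that is a wildcard in A but fixed in the target. The sketch below is an illustration under that assumption, not the MEMRES implementation; `Range`, `remove`, and the choice to keep A's coverrate on the remaining blocks are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

ADDR_BITS = 6  # 3 row bits + 3 column bits


@dataclass(frozen=True)
class Range:
    mask: int     # bit = 1: address bit is a wildcard
    addr: int     # fixed-bit values; wildcard bits are stored as 0
    cover: float  # coverrate


def remove(a: Range, b: Range) -> List[Range]:
    """Return power-of-2 blocks covering every address of `a` that is not in `b`."""
    fixed_in_both = ~(a.mask | b.mask)
    if (a.addr ^ b.addr) & fixed_in_both:
        return [a]                    # no overlap: nothing to remove
    target_mask = a.mask & b.mask     # the part of `a` to remove is a AND b
    target_addr = a.addr | b.addr
    blocks = []
    mask, addr = a.mask, a.addr
    for bit in reversed(range(ADDR_BITS)):
        m = 1 << bit
        if (mask & m) and not (target_mask & m):
            mask &= ~m                # fix this bit, halving the current block
            keep = m & ~target_addr   # bit value of the half WITHOUT the target
            blocks.append(Range(mask, addr | keep, a.cover))
            addr |= m & target_addr   # continue in the half WITH the target
    return blocks


A = Range(mask=0b111000, addr=0b000100, cover=0.6)  # column 100, all rows
B = Range(mask=0b000111, addr=0b011000, cover=0.5)  # row 011, all columns
for blk in remove(A, B):
    print(format(blk.mask, "06b"), format(blk.addr, "06b"))
```

For this example the seven remaining addresses of A come back as three blocks of sizes 4, 2, and 1.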

Operations - MERGE
Usage:
1. Merge two faults to obtain the combined fault.
Operations (merge A and B, with coverrates 0.5 and 0.6):
1) Find the intersection of A and B and merge the coverrates of A (0.5) and B (0.6) over it.
2) Store the intersection C (0.7) in the collection.
3) Remove the intersection from A and B.
4) Store all remaining parts in the collection.
[Figure: 8×8 array before and after the merge; the collection after removal holds blocks with coverrates 0.7, 0.5, and 0.6]

Validation
• MEMRES matches the analytical model well
– Only SECDED is applied, because of the limitations of the analytical model
– Memory reliability management can only be validated against existing slow simulators or in-field experiments

Case Study – Impact of Data-link Error
[Figure: System failure rate vs. data-link error rate. Y-axis: normalized system failure rate (0 to 1). X-axis: data-link error rate (1E-15 to 1E-11). Series: in-controller SECDED.]
• Increasing the data-link error rate increases the overall memory failure rate
– Nearly all data-link errors are single-bit transient errors and can be fixed by SECDED.
– A correctable single-bit memory fault can collide with a data-link error to create an uncorrectable double-bit error.

Case Study – Impact of Fault Repair Techniques
128 GB memory architecture configuration
Col. | Row. | Mats | Banks | Drams | Ranks | Dimms | Channels | Node
8192 | 512 | 128 | 8 | 9 | 4 | 4 | 2 | 1
Memory reliability management configuration
Configurations | Config. 1 | Config. 2 | Config. 3
SECDED | √ | √ | √
Memory scrubbing | √ | √ | √
Rank sparing: introduced in Config. 2
Memory page retirement: introduced in Config. 3
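
As a quick consistency check, the architecture parameters can be multiplied out to recover the stated 128 GB capacity. The sketch below assumes that each row/column crossing stores one bit and that 8 of the 9 DRAMs per rank hold data while the ninth holds ECC bits (consistent with 64+8 SECDED); neither assumption is stated on the slide.

```python
# Architecture parameters from the configuration table.
config = {
    "cols": 8192, "rows": 512, "mats": 128, "banks": 8,
    "drams": 9, "ranks": 4, "dimms": 4, "channels": 2, "nodes": 1,
}

# Assumption: 8 of the 9 DRAM chips per rank store data, 1 stores ECC bits.
DATA_DRAMS = 8

# Assumption: one bit per (row, column) cell.
bits_per_chip = config["cols"] * config["rows"] * config["mats"] * config["banks"]
chips_factor = config["ranks"] * config["dimms"] * config["channels"] * config["nodes"]

data_bits = bits_per_chip * DATA_DRAMS * chips_factor
total_bits = bits_per_chip * config["drams"] * chips_factor

data_gib = data_bits / 8 / 2**30
total_gib = total_bits / 8 / 2**30
print(f"data capacity: {data_gib} GiB, including ECC chips: {total_gib} GiB")
```

Under these assumptions each chip holds 2^32 bits (4 Gb), and the data capacity works out to 128 GiB, matching the slide's heading.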

Case Study – Impact of Fault Repair Techniques
[Figure: Normalized memory failure rate (0 to 1) for Config. 1, Config. 2, and Config. 3]
[Figure: Histogram of system critical faults, as normalized failure probability (0% to 100%) for Config. 1, Config. 2, and Config. 3. Legend: Single-bank fault; Single-row fault; + Data-link error; Multi-banks fault; Single-lane fault; Single-word fault]
• Config. 1 → Config. 2: Rank sparing (>10% faulty bits in a rank trigger rank repair) fixes only big faults (multi-bank)
• Config. 2 → Config. 3: Memory page retirement fixes most faults except single-lane and multi-bank faults

Conclusions
• MEMRES is a simulation framework for evaluating the effectiveness of fault repair techniques. It can evaluate a memory system with
– Different application memory traces
– Different memory technologies
– Different memory system architectures
– Different memory reliability management schemes
• Run time and memory consumption scale as O(N*log(M)^3) and O(N*log(M)^2) respectively, where N is the number of faults and M is the memory size.
– The validation (100,000 runs of a 7-year simulation with SECDED) takes 26 minutes on a single eight-core machine.

Thank you for your attention

Monte-Carlo Simulator
• Monte-Carlo simulation
– Simulation time is divided into short intervals
– In each interval, at most one fault is injected for each fault type
  • Injection is randomly determined based on the fault FIT
  • A longer interval improves speed but loses accuracy
– Each fault is checked against ECC after injection
  • A detectable fault may trigger memory reliability management
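
The injection loop described above can be sketched as follows. FIT is taken in its standard sense of failures per 10^9 device-hours; the Poisson-arrival formula for the per-interval probability, the function names, and the FIT values are assumptions for illustration, and the ECC check and reliability management steps are omitted.

```python
import math
import random
from typing import Dict, List, Tuple


def fault_probability(fit: float, hours: float) -> float:
    """P(at least one fault in `hours`) for a rate of `fit` failures per 1e9 hours.

    Assumption: fault arrivals follow a Poisson process.
    """
    return 1.0 - math.exp(-fit * hours / 1e9)


def inject_faults(fit_by_type: Dict[str, float], total_hours: float,
                  interval_hours: float, rng: random.Random) -> List[Tuple[float, str]]:
    """One Monte-Carlo run: at most one fault per type is injected per interval."""
    events = []
    steps = int(total_hours // interval_hours)
    for step in range(steps):
        for fault_type, fit in fit_by_type.items():
            if rng.random() < fault_probability(fit, interval_hours):
                events.append((step * interval_hours, fault_type))
                # A full simulator would now check the fault against ECC and
                # possibly trigger reliability management (omitted here).
    return events


# Hypothetical FIT values, for illustration only.
fits = {"single-bit": 30.0, "single-row": 8.0}
run = inject_faults(fits, total_hours=7 * 365 * 24, interval_hours=24.0,
                    rng=random.Random(0))
print(len(run), "faults injected in one 7-year run")
```

The interval length is the speed/accuracy knob mentioned on the slide: a longer interval means fewer loop iterations, but a second fault of the same type within one interval is never injected.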