SOLVING THE DRAM SCALING CHALLENGE Samira Khan
- Slides: 42
MEMORY IN TODAY’S SYSTEM [Figure labels: Processor, DRAM Memory, Storage] DRAM is critical for performance. 2
TREND: DATA-INTENSIVE APPLICATIONS DNA/PROTEIN SYNTHESIS, IMAGE ANALYSIS, VIRTUAL REALITY, IN-MEMORY FRAMEWORKS. Increasing demand for high-capacity, high-performance, energy-efficient main memory. 3
DRAM SCALING TREND [Chart: megabits/chip (log scale, 1 to 10000) vs. year (1985 to 2015), with trend lines labeled 2X/1.5 years and 2X/3 years] DRAM scaling is getting difficult. Source: Flash Memory Summit 2013, Memcon 2014 4
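To make the gap between the two trend lines on this slide concrete, here is a minimal sketch of the arithmetic. The 12-year window and the function name `growth_factor` are assumptions chosen for illustration; only the two doubling periods come from the chart.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Capacity multiplier after `years` if capacity doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


# Assumed 12-year window, picked only because it divides evenly by both periods.
fast = growth_factor(12, 1.5)  # 2^8
slow = growth_factor(12, 3.0)  # 2^4

print(f"2X/1.5 years over 12 years: {fast:.0f}x")  # 256x
print(f"2X/3 years over 12 years:   {slow:.0f}x")  # 16x
```

Over the same window, halving the doubling rate changes the capacity gain from 256x to 16x.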
DRAM SCALING CHALLENGE [Figure labels: Technology Scaling, DRAM Cells] Manufacturing reliable cells at low cost is getting difficult. WHY? 5
WHY IS IT DIFFICULT TO SCALE? To answer this, we need to take a closer look at a DRAM cell. 6
DRAM CELL OPERATION 1. A DRAM cell stores data as charge. 2. A DRAM cell is refreshed every 64 ms. [Figure: a DRAM cell. Logical view: transistor, capacitor. Vertical cross section: bitline, contact, capacitor, transistor] 7
DRAM RETENTION FAILURE Retention time: the length of time for which a cell can still be accessed reliably. Cells need to be refreshed before that time elapses to avoid failure. [Figure: capacitor charge over time against the 64 ms refresh interval. No failure when retention time is greater than the refresh interval; failure when it is less than the refresh interval] Failure depends on the amount of charge. 8
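The failure condition on this slide can be written as a one-line check. This is a minimal sketch: the 64 ms interval comes from the slides, while the cell names and retention times are hypothetical values chosen for illustration.

```python
REFRESH_INTERVAL_MS = 64.0  # refresh interval stated on the slides


def fails_retention(retention_time_ms: float,
                    refresh_interval_ms: float = REFRESH_INTERVAL_MS) -> bool:
    """A cell loses data if it cannot hold its charge until the next refresh."""
    return retention_time_ms < refresh_interval_ms


# Hypothetical retention times in milliseconds, for illustration only.
cells = {"A": 2000.0, "B": 128.0, "C": 40.0}
failing = [name for name, t in cells.items() if fails_retention(t)]
print(failing)  # ['C']
```

The same check with a longer interval, for example `fails_retention(128.0, 256.0)`, shows why lowering the refresh rate exposes cells that were safe at 64 ms.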
SCALING CHALLENGE: CELL-TO-CELL INTERFERENCE Cell-to-cell interference affects the charge in neighboring cells. [Figure labels: Technology Scaling, Less Interference, Indirect path] More interference results in more failures. 9
IMPLICATION: DRAM ERRORS IN THE FIELD 1.52% of DRAM modules failed in Google servers. 1.6% of DRAM modules failed in LANL. 1.8X more failures in new-generation DRAMs in Facebook. SIGMETRICS’09, SC’12, DSN’15 10
GOAL Enable high-capacity, low-latency memory without sacrificing reliability 11
Traditional DRAM Scaling is Ending. Solution Space: MEMORY (difficult to scale), NON-VOLATILE MEMORY (highly scalable). MAKE DRAM SCALABLE: SIGMETRICS’14, DSN’15, HPCA’15, SIGMETRICS’16, DSN’16, CAL’16, SIGMETRICS’17, HPCA’17, MICRO’17. LEVERAGE NEW TECHNOLOGIES: WEED’13, MICRO’15, HPCA’18, ONGOING. TOLERATE FAILURES: WAX’18, ONGOING. 12
Traditional DRAM Scaling is Ending. Solution Space: MAKE DRAM SCALABLE: System-Level Detection and Mitigation of Failures. LEVERAGE NEW TECHNOLOGIES: Unifying Memory and Storage with NVM. TOLERATE FAILURES: Restricted Approximation. 13
TRADITIONAL APPROACH TO ENABLE DRAM SCALING [Figure labels: Make DRAM Reliable, Unreliable DRAM Cells, Reliable DRAM Cells, Manufacturing Time, Reliable System in the Field] DRAM has a strict reliability guarantee. 14
MY APPROACH [Figure labels: Make DRAM Reliable, Unreliable DRAM Cells, Reliable DRAM Cells, Manufacturing Time, Reliable System in the Field] Shift the responsibility to systems. 15
VISION: SYSTEM-LEVEL DETECTION AND MITIGATION 1. Modules are not fully tested at manufacture time (PASS/FAIL). 2. Ship modules with possible failures. 3. Detect and mitigate failures online. ONLINE PROFILING: detect and mitigate errors after the system has become operational. 16
BENEFITS OF ONLINE PROFILING 1. Improves yield, reduces cost, enables scaling: vendors can make cells smaller without a strong reliability guarantee. [Figure labels: Technology Scaling, Reliable DRAM Cells, Unreliable DRAM Cells] 17
BENEFITS OF ONLINE PROFILING 1. Improves yield, reduces cost, enables scaling: vendors can make cells smaller without a strong reliability guarantee. 2. Improves performance and energy efficiency: reduce refresh. [Figure labels: Reliable DRAM Cells, Unreliable DRAM Cells, LO-REF, HI-REF] 18
DRAM CELLS ARE NOT EQUAL Ideal: same size, same charge. Real: different size, different charge, from the smallest cell to the largest cell. Large variation in DRAM cells.
DRAM CELLS ARE NOT EQUAL [Figure labels: Ideal, Real, Smallest cell, FAST] Large variation in retention time. Most cells have a high retention time and can be refreshed at a lower rate without any failure. Smaller cells will fail to retain data at a lower refresh rate.
BENEFITS OF ONLINE PROFILING 1. Improves yield, reduces cost, enables scaling: vendors can make cells smaller without a strong reliability guarantee. 2. Improves performance and energy efficiency: reduce the refresh count by using a lower refresh rate (LO-REF) for most rows, and refresh faulty rows more frequently (HI-REF). 21
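The refresh-count saving from mixing LO-REF and HI-REF rows can be estimated with simple arithmetic. The sketch below uses hypothetical numbers throughout (row count, number of faulty rows, a 128 ms low-rate interval, a 1024 ms window); they are assumptions for illustration and are not the measurements reported in the talk.

```python
def refresh_ops(total_rows: int, faulty_rows: int, window_ms: int,
                hi_interval_ms: int, lo_interval_ms: int) -> int:
    """Row refreshes in `window_ms` when faulty rows use the high rate and the rest the low rate."""
    hi = faulty_rows * (window_ms // hi_interval_ms)
    lo = (total_rows - faulty_rows) * (window_ms // lo_interval_ms)
    return hi + lo


# All values below are hypothetical.
TOTAL_ROWS, FAULTY_ROWS, WINDOW_MS = 1000, 10, 1024
HI_INTERVAL_MS, LO_INTERVAL_MS = 64, 128

# Baseline: every row is treated as faulty and refreshed at the high rate.
baseline = refresh_ops(TOTAL_ROWS, TOTAL_ROWS, WINDOW_MS, HI_INTERVAL_MS, LO_INTERVAL_MS)
multirate = refresh_ops(TOTAL_ROWS, FAULTY_ROWS, WINDOW_MS, HI_INTERVAL_MS, LO_INTERVAL_MS)
reduction = 1 - multirate / baseline

print(baseline, multirate)  # 16000 8080
print(f"{reduction:.1%}")
```

The saving depends on how few rows need HI-REF, which is why accurate failure detection matters for this benefit.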
In order to enable these benefits, we need to detect the failures at the system level 22
CHALLENGE: INTERMITTENT FAILURES [Figure: Unreliable DRAM Cells, Detect and Mitigate, Reliable System] A reliable system depends on accurately detecting DRAM failures. If failures were permanent, a simple boot-up test would have worked, but there are intermittent failures. What are these intermittent failures? 23
CELL-TO-CELL INTERFERENCE: DATA-DEPENDENT FAILURES [Figure: neighboring content 1 1 1: NO FAILURE; neighboring content 0 1 0: FAILURE, through an indirect path] Some cells can fail depending on the data stored in neighboring cells. How can these failures be detected at the system level?
MAKE DRAM SCALABLE: System-Level Detection and Mitigation of Failures. CHALLENGE: Data-Dependent Failures. Efficacy of Testing Data-Dependent Failures (SIGMETRICS’14). MEMCON: DRAM-Internal Independent Detection (CAL’16, MICRO’17). 25
Experimental Methodology Custom FPGA-based infrastructure. [Figure: PC connected over PCIe to an FPGA, which is connected over DDR3 to the DIMM. C++ programs on the PC specify commands; the FPGA generates the command sequence] Tested more than a hundred chips from three different manufacturers. 26
DRAM Testing Infrastructure [Photo labels: Temperature Controller, Heater, FPGAs, PC] Open-source infrastructure to test real DRAM chips. Characterization data publicly available. HPCA’17, SIGMETRICS’14, DSN’15, HPCA’15, SIGMETRICS’16, DSN’16, CAL’16, SIGM
DETECT FAILURES WITH TESTING 1. Write some pattern in the module. 2. Wait until the refresh interval. 3. Read and verify. Repeat. Test with different data patterns. 28
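The write, wait, read-and-verify loop on this slide can be sketched against a toy model. `SimulatedModule` and its weak cells are hypothetical stand-ins, not the FPGA infrastructure used in the talk; the model assumes a weak cell loses a stored 1 during the wait.

```python
class SimulatedModule:
    """Toy stand-in for a DRAM module (hypothetical, not the real test infrastructure)."""

    def __init__(self, size, weak_cells):
        self.size = size
        self.weak_cells = set(weak_cells)  # assumed cells that cannot hold a 1
        self.data = [0] * size

    def write(self, pattern):
        self.data = list(pattern)

    def wait_refresh_interval(self):
        # Model charge leakage: weak cells storing 1 decay to 0 during the wait.
        for i in self.weak_cells:
            self.data[i] = 0

    def read(self):
        return list(self.data)


def run_tests(module, patterns):
    """Write, wait, then read and verify each pattern; return the failing cell indices."""
    failures = set()
    for pattern in patterns:
        module.write(pattern)
        module.wait_refresh_interval()
        observed = module.read()
        failures |= {i for i, (w, r) in enumerate(zip(pattern, observed)) if w != r}
    return failures


module = SimulatedModule(size=8, weak_cells=[2, 5])
zeroes, ones = [0] * 8, [1] * 8
print(sorted(run_tests(module, [zeroes])))        # []
print(sorted(run_tests(module, [zeroes, ones])))  # [2, 5]
```

The all-zeroes pattern alone finds nothing in this model, which illustrates why the slide calls for testing with different data patterns.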
DETECTING DATA-DEPENDENT FAILURES [Chart: cumulative number of failures (0 to 150000) vs. number of rounds of tests (1 to 1000, log scale) for the patterns ZEROES, ONES, TENS, FIVES, and RAND] Even after hundreds of rounds, a small number of new cells keep failing. Conclusion: tests with many rounds of random patterns cannot detect all failures. 29
WHY SO MANY ROUNDS OF TESTS? DATA-DEPENDENT FAILURE: a cell fails when a specific pattern is stored in the neighboring cells. [Figure: cell D with left neighbor L and right neighbor R holding the pattern 0 1 0. LINEAR MAPPING: addresses X-1, X, X+1. SCRAMBLED MAPPING, NOT EXPOSED TO THE SYSTEM: addresses X-4, X, X+2, shown alongside X-1 and X+1] Even many rounds of random patterns cannot detect all failures. 30
CHALLENGE IN DETECTION [Figure: SCRAMBLED MAPPING, pattern 0 1 0 at addresses X-?, X, X+?] How can we detect data-dependent failures when we do not even know which cells are neighbors? 31
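A small model shows why a test written against system addresses can miss a data-dependent failure. It assumes a victim cell that fails only when it stores 1 and both physical neighbors store 0 (the 0 1 0 pattern from the slides), and it reuses the X-4 and X+2 offsets from the scrambled-mapping figure as a hypothetical scrambling. The function names are illustrative.

```python
def physical_neighbors_linear(x):
    """Linear mapping: physical neighbors are the adjacent system addresses."""
    return (x - 1, x + 1)


def physical_neighbors_scrambled(x):
    """Hypothetical scrambling using the X-4 and X+2 offsets shown on the slide."""
    return (x - 4, x + 2)


def victim_fails(memory, x, neighbors_of):
    """Assumed failure model: the victim stores 1 and both physical neighbors store 0."""
    left, right = neighbors_of(x)
    return memory[x] == 1 and memory[left] == 0 and memory[right] == 0


def write_010_around(x, size):
    """Background of ones with 0 1 0 written at system addresses x-1, x, x+1."""
    memory = [1] * size
    memory[x - 1] = 0
    memory[x + 1] = 0
    return memory


X = 8
memory = write_010_around(X, 16)
print(victim_fails(memory, X, physical_neighbors_linear))     # True
print(victim_fails(memory, X, physical_neighbors_scrambled))  # False
```

With the scrambled mapping, the test pattern never reaches the true neighbors, so the weak cell goes undetected until some later content happens to place zeroes there.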
MAKE DRAM SCALABLE: System-Level Detection and Mitigation of Failures. CHALLENGE: Data-Dependent Failures. Efficacy of Testing Data-Dependent Failure (SIGMETRICS’14). MEMCON: DRAM-Internal Independent Detection (CAL’16, MICRO’17). 32
CURRENT DETECTION MECHANISM [Figure: Initial Failure Detection and Mitigation, followed by Execution of Applications] Detection is done with some initial testing, isolated from system execution. 1. Detect and mitigate all failures with every possible content. 2. Only after that, start program execution. 33
CURRENT DETECTION MECHANISM Detect every possible failure with all content before execution. [Figure: Unreliable DRAM Cells, Initial Failure Detection and Mitigation, List of Failures (all possible failing cells)] 34
CURRENT DETECTION MECHANISM Detect every possible failure with all content before execution. [Figure: Unreliable DRAM Cells, Initial Failure Detection and Mitigation] List of Failures (all possible failing cells): Pattern x, Cell A. 35
CURRENT DETECTION MECHANISM Detect every possible failure with all content before execution. [Figure: Unreliable DRAM Cells, Initial Failure Detection and Mitigation] List of Failures (all possible failing cells): Pattern x, Cell A; Pattern y, Cell B. 36
CURRENT DETECTION MECHANISM Detect every possible failure with all content before execution. [Figure: Unreliable DRAM Cells, Initial Failure Detection and Mitigation, then Execution of Applications] List of Failures (all possible failing cells): Pattern x, Cell A; Pattern y, Cell B; Pattern z, Cell C; … 37
CURRENT DETECTION MECHANISM Detect every possible failure with all content before execution. [Figure: Unreliable DRAM Cells, Initial Failure Detection and Mitigation, List of Failures (? ?), then Execution of Applications with No Reliability Guarantee] Online profiling cannot detect all failures, as the address mapping is not visible to the system. 38
MEMCON: MEMORY CONTENT-BASED DETECTION AND MITIGATION No need to detect every possible failure. Simultaneous detection and execution, based on the current memory content of running applications. [Figure: Unreliable DRAM Cells with Program Content; List of Failures: Current content, Cell A] Need to detect and mitigate only with the current content. 39
MEMCON: HIGH-LEVEL DESIGN Simultaneous detection and execution. 1. No initial detection and mitigation. 2. Start running the application with a high refresh rate. 3. Detect failures with the current memory content. • If no failure is found, use a low refresh rate. [Figure: Unreliable DRAM Cells with rows marked HI-REF or LO-REF; List of Failures: Current content, Cell A] 40
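The three steps of the high-level design can be sketched as a per-row policy. This is an illustrative model only, not the MEMCON implementation: `fails_with_content` is a hypothetical oracle standing in for testing a row with its current content, and it assumes a row fails when it contains the 0 1 0 pattern.

```python
HI_REF = "HI-REF"
LO_REF = "LO-REF"


def fails_with_content(row_content):
    """Hypothetical failure oracle: the row fails if it contains the pattern 0 1 0."""
    return any(row_content[i:i + 3] == [0, 1, 0] for i in range(len(row_content) - 2))


def choose_refresh_rates(rows):
    """Start every row at HI-REF, then lower rows whose current content shows no failure."""
    rates = {row_id: HI_REF for row_id in rows}  # step 2: start with a high refresh rate
    for row_id, content in rows.items():         # step 3: test with the current content
        if not fails_with_content(content):
            rates[row_id] = LO_REF                # no failure found: use a low refresh rate
    return rates


# Hypothetical current memory content of a running application, one list per row.
rows = {0: [1, 1, 1, 1], 1: [0, 1, 0, 0], 2: [0, 0, 0, 0]}
print(choose_refresh_rates(rows))  # {0: 'LO-REF', 1: 'HI-REF', 2: 'LO-REF'}
```

Starting every row at HI-REF keeps the application safe while detection runs, so no separate initial test phase is needed in this model.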
SUMMARY: ONLINE PROFILING [Figure: Unreliable DRAM Cells, Detect and Mitigate, Reliable System] Detection at the system level is challenging due to data-dependent failures. It is possible to detect and mitigate data-dependent failures simultaneously with program execution. 65%-74% reduction in refresh count. 40%-50% performance improvement. 41