SOLVING THE DRAM SCALING CHALLENGE
Samira Khan

MEMORY IN TODAY’S SYSTEM
[Figure: processor, DRAM memory, storage]
DRAM is critical for performance.

TREND: DATA-INTENSIVE APPLICATIONS
Examples: DNA/protein synthesis, image analysis, virtual reality, in-memory frameworks.
Increasing demand for high-capacity, high-performance, energy-efficient main memory.

DRAM SCALING TREND
[Figure: megabits/chip (log scale, 1 to 10000) versus year (1985 to 2015), with trend lines labeled 2X/1.5 years and 2X/3 years]
DRAM scaling is getting difficult.
Source: Flash Memory Summit 2013, Memcon 2014
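
To make the two trend-line labels concrete, the short Python sketch below compares how much density grows over one decade when it doubles every 1.5 years versus every 3 years. The ten-year window and the function name are illustrative choices, not values from the slide.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Density multiplier after `years` if density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


# Illustrative window: one decade (an assumption, not taken from the slide).
fast = growth_factor(10, 1.5)  # 2X every 1.5 years
slow = growth_factor(10, 3.0)  # 2X every 3 years

print(f"2X/1.5 years over 10 years: about {fast:.0f}x")
print(f"2X/3 years over 10 years:   about {slow:.0f}x")
```

Under these assumptions the faster trend yields roughly a hundredfold increase per decade and the slower one roughly tenfold.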

DRAM SCALING CHALLENGE
[Figure: DRAM cells under technology scaling]
Why? Manufacturing reliable cells at low cost is getting difficult.

WHY IS IT DIFFICULT TO SCALE?
To answer this, we need to take a closer look at a DRAM cell.

DRAM CELL OPERATION
1. A DRAM cell stores data as charge
2. A DRAM cell is refreshed every 64 ms
[Figure: a DRAM cell in logical view and in vertical cross section; labels: transistor, capacitor, bitline, contact]

DRAM RETENTION FAILURE
Retention time: the time for which a cell can still be accessed reliably. Cells need to be refreshed before that to avoid failure.
[Figure: capacitor charge over time, compared against the 64 ms refresh interval]
Retention time greater than the refresh interval: no failure. Retention time less than the refresh interval: failure.
Failure depends on the amount of charge.
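
The retention rule above can be written as a one-line check. The Python sketch below is a minimal illustration: the 64 ms interval comes from the slide, while the cell names and retention times are hypothetical values chosen only to show both outcomes.

```python
REFRESH_INTERVAL_MS = 64.0  # refresh interval stated on the slide


def fails_retention(retention_time_ms: float,
                    refresh_interval_ms: float = REFRESH_INTERVAL_MS) -> bool:
    """A cell fails if it loses its data before the next refresh arrives."""
    return retention_time_ms < refresh_interval_ms


# Hypothetical retention times in milliseconds (not measured data).
cells_ms = {"A": 2000.0, "B": 130.0, "C": 40.0}

failing = sorted(name for name, t in cells_ms.items() if fails_retention(t))
print(failing)
```

Only the cell whose retention time falls below the refresh interval is reported as failing.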

SCALING CHALLENGE: CELL-TO-CELL INTERFERENCE
Cell-to-cell interference affects the charge in neighboring cells.
[Figure: with technology scaling, cells go from less interference to more interference through an indirect path]
More interference results in more failures.

IMPLICATION: DRAM ERRORS IN THE FIELD
1.52% of DRAM modules failed in Google servers
1.6% of DRAM modules failed at LANL
1.8X more failures in new-generation DRAMs at Facebook
SIGMETRICS’09, SC’12, DSN’15

GOAL
Enable high-capacity, low-latency memory without sacrificing reliability.

Traditional DRAM Scaling is Ending
Solution space:
MEMORY (difficult to scale): MAKE DRAM SCALABLE
NON-VOLATILE MEMORY (highly scalable): LEVERAGE NEW TECHNOLOGIES
MEMORY (difficult to scale): TOLERATE FAILURES
Publications, grouped as on the slide: WEED’13, MICRO’15, HPCA’18, ongoing; WAX’18, ongoing; SIGMETRICS’14, DSN’15, HPCA’15, SIGMETRICS’16, DSN’16, CAL’16, SIGMETRICS’17, HPCA’17, MICRO’17

Traditional DRAM Scaling is Ending
Solution space:
MAKE DRAM SCALABLE: System-Level Detection and Mitigation of Failures
LEVERAGE NEW TECHNOLOGIES (non-volatile memory): Unifying Memory and Storage with NVM
TOLERATE FAILURES: Restricted Approximation

TRADITIONAL APPROACH TO ENABLE DRAM SCALING
[Figure: make DRAM reliable; unreliable DRAM cells become reliable DRAM cells at manufacturing time, giving a reliable system in the field]
DRAM has a strict reliability guarantee.

MY APPROACH
[Figure: make DRAM reliable; labels: unreliable DRAM cells, reliable DRAM cells, manufacturing time, reliable system in the field]
Shift the responsibility to systems.

VISION: SYSTEM-LEVEL DETECTION AND MITIGATION
1. Modules are not fully tested at manufacture time (PASS/FAIL)
2. Ship modules with possible failures
3. Detect and mitigate errors after the system has become operational
ONLINE PROFILING: detect and mitigate failures online.

BENEFITS OF ONLINE PROFILING
[Figure: technology scaling, from reliable DRAM cells to unreliable DRAM cells]
1. Improves yield, reduces cost, enables scaling
   Vendors can make cells smaller without a strong reliability guarantee.

BENEFITS OF ONLINE PROFILING
[Figure: reduce refresh; reliable DRAM cells at LO-REF, unreliable DRAM cells at HI-REF]
1. Improves yield, reduces cost, enables scaling
   Vendors can make cells smaller without a strong reliability guarantee.
2. Improves performance and energy efficiency

DRAM CELLS ARE NOT EQUAL
Ideal: same size, same charge.
Real: different size, different charge (from smallest cell to largest cell).
Large variation in DRAM cells.

DRAM CELLS ARE NOT EQUAL
[Figure: ideal versus real cells; labels: smallest cell, FAST]
Large variation in retention time.
Most cells have a high retention time and can be refreshed at a lower rate without any failure.
Smaller cells will fail to retain data at a lower refresh rate.

BENEFITS OF ONLINE PROFILING
[Figure: most cells at LO-REF, unreliable DRAM cells at HI-REF]
1. Improves yield, reduces cost, enables scaling
   Vendors can make cells smaller without a strong reliability guarantee.
2. Improves performance and energy efficiency
   Reduce the refresh count by using a lower refresh rate for most rows, and refresh only the faulty rows at the higher rate.
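
A back-of-the-envelope model shows why refreshing only faulty rows at the high rate reduces the refresh count. The Python sketch below assumes that a fraction of rows must stay at the high rate and the rest move to a longer interval; the 256 ms LO-REF interval and the 5% weak-row fraction are hypothetical inputs, not figures from the talk.

```python
def refresh_reduction(weak_fraction: float,
                      hi_ref_interval_ms: float,
                      lo_ref_interval_ms: float) -> float:
    """Fractional reduction in refresh operations versus refreshing every row at HI-REF."""
    baseline = 1.0 / hi_ref_interval_ms  # refreshes per ms per row, all rows at HI-REF
    multirate = (weak_fraction / hi_ref_interval_ms
                 + (1.0 - weak_fraction) / lo_ref_interval_ms)
    return 1.0 - multirate / baseline


# Hypothetical inputs: 5% weak rows, HI-REF = 64 ms, LO-REF = 256 ms.
reduction = refresh_reduction(0.05, 64.0, 256.0)
print(f"refresh count reduced by {reduction:.2%}")
```

The model ignores the cost of profiling itself, so it is an upper bound on the savings under the stated assumptions.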

To enable these benefits, we need to detect the failures at the system level.

CHALLENGE: INTERMITTENT FAILURES
[Figure: detect and mitigate; from unreliable DRAM cells to a reliable system]
A reliable system depends on accurately detecting DRAM failures.
If failures were permanent, a simple boot-up test would have worked, but there are intermittent failures.
What are these intermittent failures?

CELL-TO-CELL INTERFERENCE: DATA-DEPENDENT FAILURES
[Figure: cells storing 1 1 1: no failure; cells storing 0 1 0: failure through an indirect path]
Some cells can fail depending on the data stored in neighboring cells.
How can these failures be detected at the system level?
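
The two cases in the figure can be captured by a toy failure model. In the Python sketch below, a vulnerable cell fails only when it stores 1 and both physical neighbors store 0. This exact condition is a simplifying assumption made for illustration; real devices have more varied failure patterns.

```python
def cell_fails(row: list[int], i: int, vulnerable: set[int]) -> bool:
    """Toy model: a vulnerable cell storing 1 fails when both neighbors store 0."""
    if i not in vulnerable or i == 0 or i == len(row) - 1:
        return False  # not vulnerable, or no neighbor on one side
    return row[i] == 1 and row[i - 1] == 0 and row[i + 1] == 0


vulnerable = {1}  # hypothetical: the middle cell is the weak one
print(cell_fails([1, 1, 1], 1, vulnerable))  # neighbors store 1
print(cell_fails([0, 1, 0], 1, vulnerable))  # neighbors store 0
```

The same cell passes with one data pattern and fails with another, which is why a single test pattern is not enough.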

MAKE DRAM SCALABLE: System-Level Detection and Mitigation of Failures
CHALLENGE: Data-Dependent Failures
Efficacy of Testing Data-Dependent Failures (SIGMETRICS’14)
MEMCON: DRAM-Internal Independent Detection (CAL’16, MICRO’17)

Experimental Methodology
Custom FPGA-based infrastructure
[Figure: PC (C++ programs to specify commands), PCIe, FPGA (generate command sequence), DDR3, DIMM]
Tested more than a hundred chips from three different manufacturers.

DRAM Testing Infrastructure
[Figure: temperature controller, heater, FPGAs, PC]
Open-source infrastructure to test real DRAM chips.
Characterization data publicly available.
HPCA’17, SIGMETRICS’14, DSN’15, HPCA’15, SIGMETRICS’16, DSN’16, CAL’16, SIGM

DETECT FAILURES WITH TESTING
1. Write some pattern in the module
2. Wait until the refresh interval
3. Read and verify
Repeat. Test with different data patterns.
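
The write, wait, read-and-verify loop can be sketched against a simulated module. In the Python code below, `SimulatedModule` is a hypothetical stand-in for real hardware, its failure model (a weak cell storing 1 between two 0s loses its data) is an assumption, and the pattern names are illustrative.

```python
class SimulatedModule:
    """Toy one-row DRAM model; a hypothetical stand-in for a real module."""

    def __init__(self, size: int, weak_cells: set[int]):
        self.size = size
        self.weak_cells = set(weak_cells)
        self.cells = [0] * size

    def write(self, pattern: list[int]) -> None:
        self.cells = list(pattern)

    def wait_refresh_interval(self) -> None:
        # Assumed failure model: a weak cell storing 1 between two 0s flips to 0.
        before = list(self.cells)
        for i in self.weak_cells:
            if 0 < i < self.size - 1 and before[i] == 1 \
                    and before[i - 1] == 0 and before[i + 1] == 0:
                self.cells[i] = 0

    def read(self) -> list[int]:
        return list(self.cells)


def run_round(module: SimulatedModule, pattern: list[int]) -> set[int]:
    """One round: write the pattern, wait, then read back and compare."""
    module.write(pattern)
    module.wait_refresh_interval()
    observed = module.read()
    return {i for i, (w, r) in enumerate(zip(pattern, observed)) if w != r}


def profile(module: SimulatedModule, patterns) -> set[int]:
    """Repeat the round for each data pattern and accumulate failing cells."""
    failures: set[int] = set()
    for pattern in patterns:
        failures |= run_round(module, pattern)
    return failures


size = 8
module = SimulatedModule(size, weak_cells={3})
patterns = {
    "ZEROES": [0] * size,
    "ONES": [1] * size,
    "ALTERNATING": [i % 2 for i in range(size)],
}
uniform_only = profile(module, [patterns["ZEROES"], patterns["ONES"]])
all_patterns = profile(module, patterns.values())
print(uniform_only, all_patterns)
```

In this toy setup the uniform patterns miss the weak cell, and only the alternating pattern exposes it.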

DETECTING DATA-DEPENDENT FAILURES
[Figure: cumulative number of failures (0 to 150000) versus number of rounds of tests (1 to 1000) for the patterns ZEROES, ONES, TENS, FIVES, RAND]
Even after hundreds of rounds, a small number of new cells keep failing.
Conclusion: Tests with many rounds of random patterns cannot detect all failures.

WHY SO MANY ROUNDS OF TESTS?
DATA-DEPENDENT FAILURE: a cell fails when a specific pattern is stored in the neighboring cells.
[Figure: a cell storing 1 with left (L) and right (R) neighbors storing 0. LINEAR MAPPING: X-1, X, X+1. SCRAMBLED MAPPING: X-4, X, X+2, X+1. The scrambled mapping is NOT EXPOSED TO THE SYSTEM.]
Even many rounds of random patterns cannot detect all failures.
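
The effect of a scrambled mapping can be illustrated with a small permutation. In the Python sketch below, `mapping[p]` gives the system address stored at physical position `p`. The particular permutation is hypothetical; it is chosen so that the physical neighbors of address X are X-4 and X+2, echoing the figure.

```python
def physical_row(system_data: list[int], mapping: list[int]) -> list[int]:
    """Lay out system-addressed data in physical order; mapping[p] is a system address."""
    return [system_data[addr] for addr in mapping]


def victim_is_stressed(physical: list[int], pos: int) -> bool:
    """Assumed worst case: the victim stores 1 and both physical neighbors store 0."""
    return physical[pos] == 1 and physical[pos - 1] == 0 and physical[pos + 1] == 0


n, x = 8, 4                            # x is the system address of the victim cell
linear = list(range(n))                # physical order equals system order
scrambled = [1, 2, 0, 4, 6, 3, 5, 7]   # hypothetical: X sits between X-4 and X+2

# Pattern aimed at the system-address neighbors X-1 and X+1.
naive = [1] * n
naive[x - 1] = 0
naive[x + 1] = 0

# Pattern aimed at the true physical neighbors, which the system cannot see.
informed = [1] * n
informed[x - 4] = 0
informed[x + 2] = 0

naive_linear = victim_is_stressed(physical_row(naive, linear), linear.index(x))
naive_scrambled = victim_is_stressed(physical_row(naive, scrambled), scrambled.index(x))
informed_scrambled = victim_is_stressed(physical_row(informed, scrambled), scrambled.index(x))
print(naive_linear, naive_scrambled, informed_scrambled)
```

A pattern built from system-visible adjacency exercises the victim only under the linear mapping; under the scrambled mapping the test would need knowledge the system does not have.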

CHALLENGE IN DETECTION
[Figure: SCRAMBLED MAPPING; cells storing 0 1 0 at addresses X-?, X, X+?]
How can we detect data-dependent failures when we do not even know which cells are neighbors?

MAKE DRAM SCALABLE: System-Level Detection and Mitigation of Failures
CHALLENGE: Data-Dependent Failures
Efficacy of Testing Data-Dependent Failures (SIGMETRICS’14)
MEMCON: DRAM-Internal Independent Detection (CAL’16, MICRO’17)

CURRENT DETECTION MECHANISM
[Figure: initial failure detection and mitigation, followed by execution of applications]
Detection is done with some initial testing, isolated from system execution:
1. Detect and mitigate all failures with every possible content
2. Only after that, start program execution

CURRENT DETECTION MECHANISM
Detect every possible failure with all content before execution.
[Figure: unreliable DRAM cells (all possible failing cells) tested with successive data patterns during initial failure detection and mitigation]
List of failures: Pattern x, Cell A; Pattern y, Cell B; Pattern z, Cell C; …
Execution of applications starts only after initial failure detection and mitigation.

CURRENT DETECTION MECHANISM
Detect every possible failure with all content before execution.
[Figure: unreliable DRAM cells; the list of failures contains unknown entries (? ?) when applications start executing]
No reliability guarantee: online profiling cannot detect all failures because the address mapping is not visible to the system.

MEMCON: MEMORY CONTENT-BASED DETECTION AND MITIGATION
No need to detect every possible failure.
[Figure: unreliable DRAM cells with program content; list of failures: Current content, Cell A]
Simultaneous detection and execution, based on the current memory content of running applications.
Need to detect and mitigate only with the current content.

MEMCON: HIGH-LEVEL DESIGN
Simultaneous Detection and Execution
[Figure: unreliable DRAM cells with application content, switching between HI-REF and LO-REF; list of failures: Current content, Cell A]
1. No initial detection and mitigation
2. Start running the application with a high refresh rate
3. Detect failures with the current memory content
   • If no failure found, use a low refresh rate
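
The three steps above can be sketched as a small per-row state machine. The Python code below is an illustrative sketch, not the MEMCON implementation: the class name, the per-row granularity, the rule that a write returns a row to HI-REF until it is retested, and the test oracle are all assumptions.

```python
from typing import Callable, Optional

HI_REF, LO_REF = "HI-REF", "LO-REF"


class MemconSketch:
    """Toy content-based profiler that tracks a refresh rate per row."""

    def __init__(self, num_rows: int,
                 fails_with_content: Callable[[int, Optional[str]], bool]):
        # Hypothetical oracle standing in for an in-system test of one row.
        self.fails_with_content = fails_with_content
        self.content: dict[int, Optional[str]] = {r: None for r in range(num_rows)}
        # Steps 1 and 2: no initial testing; every row starts at the high rate.
        self.refresh = {r: HI_REF for r in range(num_rows)}

    def write(self, row: int, content: str) -> None:
        self.content[row] = content
        # Assumption: new content is untested, so fall back to the safe rate.
        self.refresh[row] = HI_REF

    def test_row(self, row: int) -> bool:
        # Step 3: test only with the content the application currently holds.
        failed = self.fails_with_content(row, self.content[row])
        self.refresh[row] = HI_REF if failed else LO_REF
        return failed


def oracle(row: int, content: Optional[str]) -> bool:
    """Hypothetical: row 1 fails only while it holds the pattern '010'."""
    return row == 1 and content == "010"


mem = MemconSketch(num_rows=2, fails_with_content=oracle)
mem.write(0, "010")
mem.write(1, "010")
mem.test_row(0)
mem.test_row(1)
states_after_first_test = dict(mem.refresh)

mem.write(1, "111")           # content changes, so the row is retested
mem.test_row(1)
states_after_rewrite = dict(mem.refresh)
print(states_after_first_test, states_after_rewrite)
```

In this sketch a row stays at HI-REF only while its current content actually triggers a failure, which is the point of testing against live content rather than every possible pattern.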

SUMMARY: ONLINE PROFILING
[Figure: detect and mitigate; from unreliable DRAM cells to a reliable system]
Detection at the system level is challenging due to data-dependent failures.
It is possible to detect and mitigate data-dependent failures simultaneously with program execution.
65%-74% reduction in refresh count
40%-50% performance improvement
