Efficient Scrub Mechanisms for Error-Prone Emerging Memories
Manu Awasthiǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran‡, Viji Srinivasan‡ (ǂMicron, ⁺University of Utah, ‡IBM Research)
Executive Summary
• Future memory hierarchies will likely include NVMs
– Focus of this work: Phase Change Memory (PCM)
• Multi-level cells in PCM appear imminent
• A number of proposals exist to handle hard errors and lifetime issues of PCM devices
• Resistance drift is a less explored phenomenon
– Will become the primary cause of "soft errors"
– Need to explore holistic solutions to counter drift
Phase Change Memory - MLC
• Chalcogenide material can exist in crystalline or amorphous states
• The material can also be programmed into intermediate states
– Leads to many intermediate states, paving the way for Multi-Level Cells (MLCs)
[Figure: resistance axis from crystalline to amorphous, showing the 2-bit cell states (11), (10), (01), (00) and the 3-bit cell states (111) through (000)]
What is Resistance Drift?
[Figure: resistance versus time for states 11, 10, 01, 00; a cell at point A at time T0 drifts to point B at time Tn, where it is read as an error]
Resistance Drift - Issues
• Programmed resistance drifts according to a power-law equation: $R_{\text{drift}}(t) = R_0 \times t^{\alpha}$
• $R_0$ and $\alpha$ usually follow a Gaussian distribution
• Time to drift (error) depends on
– Programmed resistance ($R_0$), and
– Drift coefficient ($\alpha$)
– Is highly unpredictable!
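The power law above can be inverted to estimate when a cell crosses into the next state. The following is a minimal sketch, not the authors' model: the function names, the boundary resistance `r_boundary`, and the example values are all hypothetical, and time is assumed to be measured in seconds with t >= 1.

```python
def drift_resistance(r0, alpha, t):
    """Resistance after t seconds under R_drift(t) = R0 * t**alpha."""
    return r0 * t ** alpha


def time_to_drift(r0, alpha, r_boundary):
    """Time at which the drifting resistance reaches r_boundary.

    Solving R0 * t**alpha = r_boundary gives t = (r_boundary / R0)**(1/alpha).
    r_boundary is a hypothetical threshold between two adjacent states.
    """
    if alpha <= 0 or r_boundary <= r0:
        raise ValueError("need alpha > 0 and r_boundary > r0")
    return (r_boundary / r0) ** (1.0 / alpha)


# Illustrative (made-up) values: a larger alpha shortens the time to error.
typical = time_to_drift(r0=1e4, alpha=0.05, r_boundary=2e4)
worse = time_to_drift(r0=1e4, alpha=0.1, r_boundary=2e4)
print(typical > worse)
```

Because the exponent is 1/α, small changes in α move the time to error by orders of magnitude, which is consistent with the slide's point that drift time is hard to predict.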
Resistance Drift - How it happens
[Figure: distribution of the number of cells over resistance for states 11, 10, 01, 00, with cells drifting from R0 to Rt and into the error region]
• Median case cell
– Typical R0
– Typical α
• Worst case cell
– High R0
– High α
• Scrub rate will be dictated by the worst-case R0 and worst-case α
Resistance Drift Data
Cell type               Drift time at room temperature (secs)
Median 11 cell          10^499
Worst-case 11 cell      10^15
Median 10 cell          10^24
Worst-case 10 cell      5.94
Median 01 cell          10^8
Worst-case 01 cell      1.81
Uncorrectable Errors
Problem Severity
Inadequacy of Device Level Solutions
• Precise writes (write-compare-write)
– Can be effective, but requires multiple iterations
– Increases write time and total write energy, decreases lifetime
• Correcting values during reads is not effective
• Coding techniques
– Can help, but cannot eliminate the problem by themselves
Naïve Solution: Scrubbing
• Drift resets with every cell reprogram (write)
• Leverage existing error correction mechanisms, e.g., ECC, which has its own drawbacks
• A full refresh/scrub (read-compare-write) is extremely costly in PCM
– Each PCM write takes 100-1000 ns
– Writing to a 2-bit cell may consume as much as 1.6 nJ
– Requires 600 refreshes in parallel
• Refresh should be reactionary, NOT precautionary!
Solution Outline
• A three-pronged approach
– Incorporate stronger ECC
– Incorporate low-cost error detection mechanisms
– Adopt scrub algorithms that are conservative and adaptive
Additional ECC support
• With stronger ECC support, scrub operations can be made infrequent
– If E errors can be corrected and E+1 detected, then the scrub operation can be postponed until the E-th error
– Can provide support for hard error tolerance as well
• We advocate ECC-8 (E = 8)
– Corrects eight errors, detects nine
– Leads to 14.25% storage overhead for 512 bits
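As a quick sanity check on the overhead figure, and assuming the 14.25% is measured as check bits relative to the 512 data bits (the slide does not state the base), the check storage per line works out to roughly:

```latex
\[
0.1425 \times 512 = 72.96 \approx 73 \ \text{check bits per 512-bit line}
\]
```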
LARDD
• Light Array Reads for Drift Detection
– Support for E-error-correcting, (E+1)-error-detecting codes is assumed
– Lines are read periodically and checked for correctness
– Scrubbing is performed only after the number of errors reaches a threshold
– Drift detection has to be simple and on-chip
[Flowchart: Read Line → Check for Errors → if Errors < E, read again after N cycles; otherwise Scrub Line]
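The LARDD flowchart can be sketched as a single decision step. This is an illustrative sketch only: the function names are hypothetical, and comparing the read cells against a known-good copy stands in for the error count that a real ECC decoder would report.

```python
def count_errors(read_cells, correct_cells):
    """Stand-in for the ECC decoder: number of cells that read back wrong."""
    return sum(1 for r, c in zip(read_cells, correct_cells) if r != c)


def lardd_step(read_cells, correct_cells, E):
    """One light array read: scrub only once the error count reaches E.

    Returns ("wait", errors) if the line can be re-checked after N cycles,
    or ("scrub", errors) if the line should be rewritten now.
    """
    errors = count_errors(read_cells, correct_cells)
    if errors < E:
        return "wait", errors
    return "scrub", errors


# Example: a 6-cell line with two drifted cells and E = 8 is left alone.
good = ["11", "10", "01", "00", "10", "01"]
drifted = ["11", "01", "00", "00", "10", "01"]
print(lardd_step(drifted, good, E=8))
```

The point of the design is that the cheap read happens often while the expensive write happens only when the correction budget is nearly used up.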
Read Pipeline with LARDD & Parity
• Simple error detection mechanism to bypass the complex BCH circuit
– 256-cell line divided into eight 32-cell fields
– Count the number of drift-prone states, store as parity information
– For every LARDD, check if the parity has changed
– If true, invoke the BCH circuitry
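A rough software sketch of the per-field check follows. Two things here are assumptions rather than statements from the slides: the set of drift-prone states is taken to be the intermediate states 10 and 01, and the full per-field count is stored where a hardware design might keep a more compact parity. All names are hypothetical.

```python
DRIFT_PRONE = {"10", "01"}  # assumption: intermediate states are drift prone
FIELD_SIZE = 32             # from the slide: eight 32-cell fields per line


def field_counts(cells):
    """Count drift-prone states in each 32-cell field of a line."""
    return [
        sum(1 for c in cells[i:i + FIELD_SIZE] if c in DRIFT_PRONE)
        for i in range(0, len(cells), FIELD_SIZE)
    ]


def needs_bch(cells_now, stored_counts):
    """True if any field's count changed since the line was written."""
    return field_counts(cells_now) != stored_counts


# Example: store counts at write time, then re-check after one cell drifts.
line = ["11"] * 256
line[5] = "01"
stored = field_counts(line)
line[5] = "00"  # the 01 cell drifts into the neighboring state
print(needs_bch(line, stored))
```

A count-based check like this misses a drift from one drift-prone state to another (10 to 01), so it is a filter in front of the BCH decoder, not a replacement for it.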
Headroom
• Headroom-h scheme: scrub is triggered if E-h errors are detected
– Decreases the probability of errors slipping through
– Increases the frequency of full scrubs and hence decreases lifetime
• Presents a trade-off between hard and soft errors
[Flowchart: Read Line → Check for Errors → if Errors < E-h, read again after N cycles; otherwise Scrub Line]
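The headroom rule amounts to lowering the scrub threshold from E to E-h. A minimal sketch, with a hypothetical function name and argument checks added for illustration:

```python
def headroom_decision(num_errors, E, h):
    """Scrub once E - h errors are seen, leaving h errors of slack."""
    if not 0 <= h < E:
        raise ValueError("headroom h must satisfy 0 <= h < E")
    return "scrub" if num_errors >= E - h else "wait"


# With ECC-8 and headroom 3, the line is scrubbed at the fifth error.
print(headroom_decision(4, E=8, h=3), headroom_decision(5, E=8, h=3))
```

Setting h to 0 recovers the plain LARDD threshold; a larger h trades extra scrub writes (wear) for a lower chance that errors accumulate past E between two reads.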
Headroom Gradual
• Start with a small LARDD frequency
• Increase the frequency as drift-based errors increase
– Decreases energy overhead
– Might cause a slight increase in error rate
[Flowchart: Read Line → Check for Errors → if Errors <= E-h-g, read again after N cycles; otherwise double the LARDD rate]
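The gradual flowchart can be read as an update rule for the LARDD interval. The sketch below implements only the branch shown on the slide; how it composes with the headroom scrub trigger is not specified there, and the function name and seconds-based interval are assumptions.

```python
def next_lardd_interval(num_errors, interval_s, E, h, g):
    """Keep the interval while errors <= E-h-g, otherwise double the rate.

    Doubling the LARDD rate is modeled as halving the interval between reads.
    """
    if num_errors <= E - h - g:
        return interval_s
    return interval_s / 2.0


# With E = 8, h = 3, g = 1 the rate doubles once a fifth error appears.
print(next_lardd_interval(4, 512.0, E=8, h=3, g=1))
print(next_lardd_interval(5, 512.0, E=8, h=3, g=1))
```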
Adaptive Scrub
• Dynamic events can affect reliability
– Temperature increases can increase α and decrease drift time
– Cell lifetime/wearout is also an issue
– Soft error rate depends on the prevalence of drift-prone states
– Hard errors show up with aging cells
• Start with a headroom-h policy, and drop the headroom once at least E-h-1 hard errors are present
• Can increase the LARDD frequency when the total number of hard errors exceeds a preset threshold
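One way to express the headroom adjustment in code is shown below. This is a sketch under assumptions: the function names are hypothetical, and hard errors are assumed to be counted per line, which the slide does not state.

```python
def effective_headroom(hard_errors, E, h):
    """Use headroom h until at least E-h-1 hard errors exist, then none."""
    return 0 if hard_errors >= E - h - 1 else h


def scrub_threshold(hard_errors, E, h):
    """Error count at which a scrub is triggered under the adaptive policy."""
    return E - effective_headroom(hard_errors, E, h)


# With E = 8 and h = 3, the threshold moves from 5 to 8 at four hard errors.
print(scrub_threshold(3, E=8, h=3), scrub_threshold(4, E=8, h=3))
```

A plausible reading of the rule is that permanent errors would otherwise keep the count at or near E-h, so a fixed headroom would trigger a scrub on almost every read without removing those errors.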
Results
• Stronger ECC + LARDD support leads to a decrease in unrecoverable errors
Results - Headroom Schemes
• Compared to ECC-1, for a 512 s LARDD interval
– ECC-8-headroom-3 leads to a 99.6% reduction in error rate and a 35.4% reduction in scrub energy
– ECC-8-headroom-3-gradual-1 leads to a 96.5% reduction in error rate and a 37.8% reduction in scrub energy
Device Level Solution - Non-Uniform Banding
[Figure: mean R0 of states 11, 10, 01, 00 along the resistance axis, before and after non-uniform banding]
Comparing with Device Level Solutions
[Figure: bar chart of the number of uncorrectable errors (y-axis, 0 to 2E+05) for configurations including Baseline, ECC-1 + Precise Write, ECC-1 + Nonuniform banding, and ECC-8-headroom-3; σ settings appear in the bar labels]
• ECC-8-headroom-3 gives the fewest uncorrectable errors
Conclusions
• Resistance drift will worsen with MLC scaling
• Naïve solutions based on ECC support are costly for PCM
– Increased write energy, decreased lifetimes
• Holistic solutions need to be explored to counter drift at the device, architectural, and system levels
– 39% reduction in energy, 4x fewer errors, 102x increase in lifetime
Thanks!! www.cs.utah.edu/arch-research
Backup Slides