CROW: A Low-Cost Substrate for Improving DRAM Performance
CROW: A Low-Cost Substrate for Improving DRAM Performance, Energy Efficiency, and Reliability
Hasan Hassan, Minesh Patel, Jeremie S. Kim, A. Giray Yaglikci, Nika Mansouri Ghiasi, Saugata Ghose, Nandita Vijaykumar, Onur Mutlu
Summary
Source code available in July: github.com/CMU-SAFARI/CROW
Challenges of DRAM scaling:
• High access latency → bottleneck for improving system performance/energy
• Refresh overhead → reduces performance and consumes high energy
• Exposure to vulnerabilities (e.g., RowHammer)
Copy-Row DRAM (CROW):
• Introduces copy rows into a subarray
• The benefits of a copy row:
  • Efficiently duplicating data from a regular row to a copy row
  • Quick access to a duplicated row
  • Remapping a regular row to a copy row
CROW is a flexible substrate with many use cases:
• CROW-cache & CROW-ref (20% speedup and 22% less DRAM energy)
• Mitigating RowHammer
• We hope CROW enables many other use cases going forward
Outline
1. DRAM Operation Basics
2. The CROW Substrate
   • CROW-cache: Reducing DRAM Latency
   • CROW-ref: Reducing DRAM Refresh
   • Mitigating RowHammer
3. Evaluation
4. Conclusion
DRAM Organization
[Figure labels: CPU, Memory Controller, Memory Bus, DRAM Subarray, DRAM Row, DRAM Cell, Sense Amplifier]
Accessing DRAM
[Figure labels: DRAM Subarray, DRAM Row, DRAM Cell, Sense Amplifier; commands shown: Activate, Read, Precharge]
Challenges of DRAM Scaling
1. Access latency
2. Refresh overhead
3. Exposure to vulnerabilities
Our Goal
We want a substrate that enables the duplication and remapping of data within a subarray
The Components of CROW
[Figure labels: DRAM Subarray, CROW-table, Memory Controller]
CROW Operation 1: Row Copy
[Figure: the Memory Controller issues ACT-c (copy) to the DRAM Subarray]
Row Copy: Steps
1. Activation of the source row
2. Charge sharing
3. Beginning of restoration
4. Activation of the destination row
5. Restoration of both rows to source data
[Figure labels: source row, destination row, Sense Amplifier]
Enables quickly copying a regular row into a copy row
CROW Operation 2: Two-Row Activation
[Figure: the Memory Controller issues ACT-t (two row) to the DRAM Subarray]
Two-Row Activation: Steps
1. Activation of two rows (both charged or both discharged)
2. Charge sharing (fast)
3. Restoration
[Figure label: Sense Amplifier]
Enables fast access to data that is duplicated across a regular row and a copy row
CROW-cache
Problem: High access latency
Key idea: Use copy rows to enable low-latency access to the most-recently-activated regular rows in a subarray
CROW-cache combines:
• row copy → copy a newly activated regular row into a copy row
• two-row activation → activate the regular row and copy row together on the next access
Reduces activation latency by 38%
CROW-cache Operation
Request queue: load row X, [bank conflict], load row X
First access to row X:
1. CROW-table miss
2. Allocate a copy row (CROW-table entry: copy row 0 ↔ row X)
3. Issue ACT-c (copy)
Second access to row X:
1. CROW-table hit
2. Issue ACT-t (two row)
Second activation of row X is faster
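The miss/hit flow above can be sketched as a small piece of memory-controller logic. This is only an illustrative sketch: the class name `CrowTable`, the LRU eviction policy, and the tuple returned by `activate` are assumptions made for the example, not details given in the slides, and the sketch ignores any timing or restoration constraints a real controller would have to respect.

```python
from collections import OrderedDict


class CrowTable:
    """Hypothetical per-subarray table mapping regular rows to copy rows."""

    def __init__(self, num_copy_rows=8):
        self.num_copy_rows = num_copy_rows
        # regular row -> copy row index, kept in least-recently-used order
        self.entries = OrderedDict()

    def activate(self, row):
        """Return (command, regular_row, copy_row) for an activation of `row`."""
        if row in self.entries:
            # CROW-table hit: activate the regular row and its copy together.
            self.entries.move_to_end(row)
            return ("ACT-t", row, self.entries[row])
        # CROW-table miss: allocate a copy row, evicting the LRU entry if full
        # (LRU is an assumed policy, chosen only for illustration).
        if len(self.entries) < self.num_copy_rows:
            copy_row = len(self.entries)
        else:
            _, copy_row = self.entries.popitem(last=False)
        self.entries[row] = copy_row
        return ("ACT-c", row, copy_row)
```

With a fresh table, the first `activate(X)` yields an `ACT-c` and the next `activate(X)` yields an `ACT-t`, matching the two accesses to row X in the request queue.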
CROW-ref
Problem: Refresh has high overheads. Weak rows lead to a high refresh rate
• weak row: at least one of the row's cells cannot retain data correctly when the refresh rate is decreased
Key idea: Safely reduce the refresh rate by remapping a weak regular row to a strong copy row
CROW-ref uses:
• row copy → copy a weak regular row to a strong copy row
CROW-ref eliminates more than half of the refresh requests
CROW-ref Operation
1. Perform retention time profiling
2. Remap weak rows to strong copy rows
3. On ACT, check the CROW-table
4. If remapped, activate a copy row
[Figure labels: Retention Time Profiler; rows marked strong or weak]
How many weak rows exist in a DRAM chip?
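A minimal sketch of steps 2 to 4, assuming the profiler has already produced a list of weak row indices for one subarray. The function names, the dictionary used as a stand-in for the CROW-table, and the "return None when there are too many weak rows" fallback are all assumptions made for illustration. The sketch also assumes every copy row is strong.

```python
def build_remap_table(weak_rows, num_copy_rows=8):
    """Map each weak regular row to a copy row index.

    Returns None when the subarray has more weak rows than copy rows, in
    which case the reduced refresh rate cannot be used safely (assumed
    fallback behaviour, not specified in the slides).
    """
    weak = sorted(set(weak_rows))
    if len(weak) > num_copy_rows:
        return None
    return {row: copy_row for copy_row, row in enumerate(weak)}


def resolve_activation(row, remap):
    """On ACT, check the table and redirect remapped rows to their copy row."""
    if remap and row in remap:
        return ("copy", remap[row])
    return ("regular", row)
```

For example, with weak rows 17 and 300, an activation of row 17 is redirected to copy row 0, while row 18 is activated as usual.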
Identifying Weak Rows
Weak cells are rare [Liu+, ISCA'13]
• weak cell: retention < 256 ms
• ~1000/2^38 (32 GiB) failing cells
DRAM Retention Time Profiler
• REAPER [Patel+, ISCA'17], PARBOR [Khan+, DSN'16], AVATAR [Qureshi+, DSN'15]
• At system boot or during runtime
[Plot: probability vs. number of weak rows in a subarray]
A few copy rows are sufficient to halve the refresh rate
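To see why a few copy rows can be enough, the following back-of-the-envelope sketch reads the slide's failing-cell rate as roughly 1000 weak cells per 2^38 bits (32 GiB) and estimates how often a 512-row subarray would have more than 8 weak rows. The 8 KiB row size and the assumption that cells fail independently are not from the slides. They are illustrative assumptions, so the result is an order-of-magnitude estimate only.

```python
import math

CELL_FAIL_PROB = 1000 / 2**38   # weak-cell rate read from the slide
CELLS_PER_ROW = 8 * 1024 * 8    # ASSUMPTION: 8 KiB row
ROWS_PER_SUBARRAY = 512         # from the evaluated configuration
COPY_ROWS = 8                   # from the evaluated configuration

# Probability that a row contains at least one weak cell,
# assuming cells fail independently (ASSUMPTION).
row_weak_prob = -math.expm1(CELLS_PER_ROW * math.log1p(-CELL_FAIL_PROB))

# Expected number of weak rows in one subarray.
expected_weak_rows = ROWS_PER_SUBARRAY * row_weak_prob

# Binomial tail: probability that a subarray has more weak rows than copy rows.
prob_too_many = sum(
    math.comb(ROWS_PER_SUBARRAY, k)
    * row_weak_prob**k
    * (1 - row_weak_prob) ** (ROWS_PER_SUBARRAY - k)
    for k in range(COPY_ROWS + 1, ROWS_PER_SUBARRAY + 1)
)

print(f"expected weak rows per subarray: {expected_weak_rows:.3f}")
print(f"P(more than {COPY_ROWS} weak rows): {prob_too_many:.2e}")
```

Under these assumptions the expected number of weak rows per subarray is well below one, and the chance of exceeding 8 weak rows in a subarray is vanishingly small.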
Mitigating RowHammer
[Figure labels: victim row, aggressor row, victim row; the aggressor row is repeatedly activated and precharged]
Key idea: remap victim rows to copy rows
Methodology
Source code available in July: github.com/CMU-SAFARI/CROW
• Simulator
  • DRAM Simulator (Ramulator [Kim+, CAL'15]): https://github.com/CMU-SAFARI/ramulator
• Workloads
  • 44 single-core workloads
    • SPEC CPU2006, TPC, STREAM, MediaBench
  • 160 multi-programmed four-core workloads
    • By randomly choosing from single-core workloads
  • Execute at least 200 million representative instructions per core
• System Parameters
  • 1- or 4-core system with 8 MiB LLC
  • LPDDR4 main memory
  • 8 copy rows per 512-row subarray
CROW-cache Results
[Charts: speedup and normalized DRAM energy for single-core and four-core workloads; labeled values: 7.5% and 7.1% (speedup), 8.2% and 6.9% (DRAM energy)]
CROW-cache improves single-/four-core performance and energy
* with 8 copy rows and a 64 Gb DRAM chip (sensitivity in paper)
CROW-ref Results
[Charts: speedup and normalized DRAM energy for single-core and four-core workloads; labeled values: 11.9% and 7.1% (speedup), 7.8% and 17.2% (DRAM energy)]
CROW-ref significantly reduces the performance and energy overhead of DRAM refresh
* with 8 copy rows and a 64 Gb DRAM chip (sensitivity in paper)
Combining CROW-cache and CROW-ref
[Charts: speedup and normalized DRAM energy of CROW-(cache+ref) vs. Ideal CROW-cache + no refresh; labeled speedups: 17% (single-core), 20% (four-core); labeled DRAM energy reductions: 23% (single-core), 22% (four-core)]
CROW-(cache+ref) provides more performance and DRAM energy benefits than each mechanism alone
Hardware Overhead
For 8 copy rows and 16 GiB DRAM:
• 0.5% DRAM chip area
• 1.6% DRAM capacity
• 11.3 KiB memory controller storage
CROW is a low-cost substrate
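As a quick sanity check on the capacity figure, one reading consistent with it is the ratio of copy rows to regular rows in the evaluated configuration (8 copy rows per 512-row subarray). This interpretation is an assumption, since the slides do not spell out how the 1.6% is derived. The variable names are illustrative.

```python
COPY_ROWS = 8        # copy rows per subarray (from the evaluated configuration)
REGULAR_ROWS = 512   # regular rows per subarray (from the evaluated configuration)

# ASSUMPTION: capacity overhead is measured relative to the regular rows.
capacity_overhead = COPY_ROWS / REGULAR_ROWS

print(f"capacity overhead: {capacity_overhead:.1%}")
```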
Other Results in the Paper
• Performance and energy sensitivity to:
  • Number of copy rows per subarray
  • DRAM chip density
  • Last-level cache capacity
• CROW-cache with prefetching
• CROW-cache compared to other in-DRAM caching mechanisms:
  • TL-DRAM [Lee+, HPCA'13]
  • SALP [Kim+, ISCA'12]
Conclusion
Source code available in July: github.com/CMU-SAFARI/CROW
Challenges of DRAM scaling:
• High access latency → bottleneck for improving system performance/energy
• Refresh overhead → reduces performance and consumes high energy
• Exposure to vulnerabilities (e.g., RowHammer)
Copy-Row DRAM (CROW):
• Introduces copy rows into a subarray
• The benefits of a copy row:
  • Efficiently duplicating data from a regular row to a copy row
  • Quick access to a duplicated row
  • Remapping a regular row to a copy row
CROW is a flexible substrate with many use cases:
• CROW-cache & CROW-ref (20% speedup and 22% less DRAM energy)
• Mitigating RowHammer
• We hope CROW enables many other use cases going forward
Backup Slides
Latency Reduction with MRA
CROW-cache Performance
[Chart: speedup of CROW-1, CROW-8, CROW-64, CROW-128, and Ideal CROW-cache (100% Hit Rate); single-core workloads (leslie3d, tpch2, zeus, lbm, mcf, stream-cp, libq, h264-dec, AVERAGE (1 core)) and four-core (HHHH); labeled values: 6.6%, 7.5%, 7.1%, 0.7%]
CROW-ref Performance
[Chart: speedup for 8 Gbit, 16 Gbit, 32 Gbit, and 64 Gbit chips; single-core workloads (including mcf, milc, lbm, zeus, libq, jp2enc, leslie3d, tpch17, cactus, stream_cp, AVERAGE) and four-core (HHHH); labeled values: 11.9%, 7.1%]
CROW-ref Energy Savings
[Chart: normalized DRAM energy for 8 Gbit, 16 Gbit, 32 Gbit, and 64 Gbit chips; single-core workloads (including mcf, milc, lbm, zeus, libq, jp2enc, leslie3d, tpch17, cactus, stream_cp, AVERAGE) and four-core (HHHH); labeled values: 7.8%, 17.2%]
Speedup - CROW-cache Single-core
Speedup - CROW-cache Four-core
Energy – CROW-cache
Comparison to TL-DRAM and SALP
Slide on RLTL
Speedup – CROW-ref
Energy – CROW-ref
CROW-cache + ref
CROW-table Organization
tRCD vs tRAS
MRA Area Overhead
DRAM Charge over Time
[Figure: charge level of the cell and the sense amplifier over time, for Data 1 and Data 0; phases: Sensing, Restore, Precharge; markers: Ready to Access, Ready to Precharge; commands: ACT, R/W, PRE; timing parameters: tRCD, tRAS]