ArchShield: Architectural Framework for Assisting DRAM Scaling

ArchShield: Architectural Framework for Assisting DRAM Scaling By Tolerating High Error-Rates Prashant Nair Dae-Hyun Kim Moinuddin K. Qureshi 1

Introduction • DRAM: basic building block for main memory for four decades • DRAM scaling provides higher memory capacity: moving to a smaller node provides ~2x capacity • Shrinking DRAM cells is becoming difficult, a threat to scaling [figure: scaling feasibility vs. technology node (smaller), with efficient error mitigation extending feasibility] • Efficient error handling can help DRAM technology scale 2

Why Is DRAM Scaling Difficult? • Scaling is difficult, more so for DRAM cells • The capacitor volume must remain constant to hold ~25 fF • Scaling: 0.7x dimension → 0.5x area → 2x height • With scaling, DRAM cells become not only narrower but also taller (see the check below) 3
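
A back-of-the-envelope check of that geometry (my derivation; it only assumes, as the slide does, that the capacitor volume must stay fixed to hold ~25 fF):

```latex
\[
V = A \, h, \qquad
A \;\to\; (0.7)^2 A \approx 0.5\,A
\;\;\Longrightarrow\;\;
h \;\to\; 2h
\quad \text{(so that } 0.5A \cdot 2h = A\,h = V \text{)}.
\]
```

Halving the footprint while doubling the height is why the cell aspect ratio climbs with every node.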

DRAM: Aspect Ratio Trend [figure: cell aspect ratio rising across technology nodes] • Narrow cylindrical cells are mechanically unstable and break 4

More Reasons for DRAM Faults • Unreliability of ultra-thin dielectric material • In addition, DRAM cell failures also come from: – Permanently leaky cells (capacitor charge leaks away) – Mechanically unstable cells (capacitor tilting towards ground) – Broken links in the DRAM array • Permanent faults in future DRAMs are expected to be much higher (we target an error rate as high as 100 ppm) 5

Outline Ø Introduction Ø Current Schemes Ø ArchShield Ø Evaluation Ø Summary 6

Row and Column Sparing • DRAM chips (organized into rows and columns) have spares [figure: a DRAM chip before and after row/column sparing; faulty rows and columns are deactivated and replaced by spare rows and columns] • Laser fuses enable the spare rows/columns • An entire row or column must be sacrificed for a few faulty cells • Row and column sparing schemes have large area overheads 7

Commodity ECC-DIMM • Commodity ECC DIMM with SECDED at 8-byte granularity (72, 64) • Mainly used for soft-error protection • For hard errors, there is a high chance of two errors in the same word (birthday paradox): for an 8 GB DIMM with N = 1 billion words, the expected number of errors until the first double-error word is ~1.25*sqrt(N) ≈ 40K errors, i.e., a bit error rate of only ~0.5 ppm • SECDED is not enough for high error rates (and soft-error protection is lost) 8
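
The 40K figure can be sanity-checked with the standard birthday-problem approximation (a minimal sketch; this closed form is my own check, not code from the talk):

```python
import math

WORDS = 2**30      # ~1 billion 8 B words in an 8 GB DIMM
BITS = WORDS * 64  # data bits in the DIMM

# Birthday-problem approximation: the expected number of randomly
# placed bit faults before two land in the same word is
# sqrt(pi/2 * N) ~= 1.25 * sqrt(N).
faults = math.sqrt(math.pi / 2 * WORDS)

print(f"expected faults before a double-error word: {faults:,.0f}")  # ~41,000
print(f"as a bit error rate: {faults / BITS * 1e6:.2f} ppm")         # ~0.6 ppm
```

So SECDED already faces an uncorrectable word when only about half a ppm of bits have failed, far below the 100 ppm target.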

Strong ECC Codes • Strong ECC (BCH) codes are robust, but complex and costly [figure: a strong ECC encoder/decoder sits between the memory controller and the DRAM memory system, in the path of every memory request] • Each memory reference incurs encoding/decoding latency • For a BER of 100 ppm, we need ECC-4, a ~50% storage overhead • Strong ECC codes provide an inefficient solution for tolerating errors 9

Dissecting Fault Probabilities At a bit error rate of 10^-4 (100 ppm), for an 8 GB DIMM (1 billion words):

Faulty bits per word (8 B) | Probability | Words in 8 GB
0 | 99.3% | 0.99 billion
1 | 0.007 | 7.7 million
2 | 26 x 10^-6 | 28K
3 | 62 x 10^-9 | 67
4 | 10^-10 | 0.1

Most faulty words have a 1-bit error. This skew in fault probability can be leveraged to develop low-cost error resilience. Goal: tolerate high error rates with a commodity ECC DIMM while retaining soft-error resilience 10
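
The table's rows follow from a simple binomial model (a sketch; I assume 72 bits per word, i.e., the 8 B data word plus SECDED check bits, since that reproduces the slide's numbers):

```python
from math import comb

BER = 1e-4           # bit error rate: 100 ppm
BITS_PER_WORD = 72   # assumed: 64 data bits + 8 SECDED check bits
WORDS = 2**30        # ~1 billion words in an 8 GB DIMM

for k in range(5):
    # Probability that exactly k bits of a word are (hard-)faulty
    p = comb(BITS_PER_WORD, k) * BER**k * (1 - BER)**(BITS_PER_WORD - k)
    print(f"{k} faulty bits: p = {p:.3g}, words = {p * WORDS:,.1f}")
```

Running this gives ~99.3%, 7.7 million, 28K, ~64, and 0.1, matching the table to within rounding and the exact word count assumed.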

Outline Ø Introduction Ø Current Schemes Ø ArchShield Ø Evaluation Ø Summary 11

ArchShield: Overview • Inspired by Solid State Drives (SSDs), which tolerate high bit-error rates • Expose faulty-cell information to the architecture layer via runtime testing [figure: main memory contains a Fault Map and a Replication Area; the Fault Map is also cached on chip] • Most words will be error-free • A 1-bit error is handled with SECDED • A multi-bit error is handled with replication • ArchShield stores the error-mitigation information in memory itself 12

ArchShield: Runtime Testing When a DIMM is configured, runtime testing is performed, and each 8 B word is classified into one of three types (a sketch of this classification follows below):

No error | Replication not needed | SECDED can correct a soft error
1-bit error | SECDED can correct the hard error | Replication needed to cover a soft error
Multi-bit error | Word gets decommissioned | Only the replica is used

(Information about faulty cells can be stored on the hard drive for future use.) Runtime testing identifies the faulty cells so the right correction can be chosen 13
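
A minimal sketch of that classification (the type names and function are mine; the talk only defines the three categories):

```python
from enum import Enum

class WordType(Enum):
    NO_ERROR = "no_error"          # no replication; SECDED covers soft errors
    ONE_BIT_ERROR = "one_bit"      # SECDED corrects the hard error;
                                   # replicate so soft errors stay covered
    MULTI_BIT_ERROR = "multi_bit"  # decommission; only the replica is used

def classify_word(hard_faulty_bits: int) -> WordType:
    """Classify an 8 B word by the number of hard-faulty bits found
    during runtime testing."""
    if hard_faulty_bits == 0:
        return WordType.NO_ERROR
    if hard_faulty_bits == 1:
        return WordType.ONE_BIT_ERROR
    return WordType.MULTI_BIT_ERROR
```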

Architecting the Fault Map • The Fault Map (FM) stores information about faulty cells • A per-word FM is expensive (for an 8 B word: 2 bits, or 4 bits with redundancy), so keep one FM entry per line (4 bits per 64 B) • FM access method: table lookup indexed by line address • Avoid a dual memory access by: – Caching FM entries in the on-chip LLC – Packing 128 FM entries into each 64-byte line – Exploiting spatial locality [figure: main memory holding faulty words, the Fault Map, and replicated words in the Replication Area] • The Fault Map is organized at line granularity and is also cacheable; a line-level Fault Map plus caching provides low storage and low latency 14
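
The indexing arithmetic is simple (a sketch with constants from the slide; the base address and the names are hypothetical):

```python
LINE_SIZE = 64          # bytes per cache line
ENTRY_BITS = 4          # one Fault Map entry per 64 B line
ENTRIES_PER_LINE = LINE_SIZE * 8 // ENTRY_BITS   # 128 entries per FM line

FM_BASE = 0x1_F000_0000  # hypothetical base address of the Fault Map region

def fault_map_location(phys_addr: int) -> tuple[int, int]:
    """Return (address of the FM line, entry index within that line)
    for a given physical address."""
    line_addr = phys_addr // LINE_SIZE
    fm_line_addr = FM_BASE + (line_addr // ENTRIES_PER_LINE) * LINE_SIZE
    entry_index = line_addr % ENTRIES_PER_LINE
    return fm_line_addr, entry_index
```

Because 128 consecutive data lines share one FM line, a single cached FM line covers 8 KB of data, which is what makes the LLC hit rate for FM entries high.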

Architecting the Replication Area • Faulty words are replicated at word granularity in the Replication Area • A fully associative Replication Area? Prohibitive lookup latency • A set-associative Replication Area? Sets can overflow [figure: fully associative vs. set-associative organizations of the Replication Area alongside the Fault Map in main memory; in the set-associative structure there is a chance of a set overflowing] 15

Overflow of a Set-Associative RA • There are tens to hundreds of thousands of sets, and any set could overflow • How many entries are used before one set overflows? A buckets-and-balls analysis (simulated below) shows a 6-way table is only 8% full when the first set overflows, so it would need ~12x the entries
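
A minimal buckets-and-balls simulation of that result (my sketch; the set count is an assumption chosen for illustration, not the paper's exact configuration):

```python
import random

def occupancy_at_first_overflow(num_sets: int = 1_000_000, ways: int = 6) -> float:
    """Throw balls (replicated words) into uniformly random sets until
    some set receives more than `ways` entries; return table occupancy."""
    counts = [0] * num_sets
    placed = 0
    while True:
        s = random.randrange(num_sets)
        if counts[s] == ways:          # the (ways+1)-th ball: set overflows
            return placed / (num_sets * ways)
        counts[s] += 1
        placed += 1

trials = [occupancy_at_first_overflow() for _ in range(10)]
print(f"occupancy at first overflow: {sum(trials) / len(trials):.1%}")  # typically ~8%
```

The table overflows long before it is full because random placement leaves some sets much fuller than others, which motivates the overflow sets on the next slide.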

Scalable Structure for the RA [figure: Replication Area entries carry an overflow bit (OFB) and a pointer (PTR); when a set overflows, OFB is set and PTR redirects to one of the overflow sets, which may already be "taken by some other set"; the diagram shows 16 sets and 16 overflow sets] • With overflow sets, the Replication Area can handle non-uniformity (a lookup sketch follows below)
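
A sketch of a lookup with overflow chasing (class and field names are mine; the slide only names OFB and PTR):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RASet:
    entries: dict = field(default_factory=dict)   # faulty word addr -> replica
    ofb: bool = False                             # overflow bit
    ptr: Optional["RASet"] = None                 # pointer to an overflow set

def ra_lookup(home: RASet, word_addr: int):
    """Check the home set first; if it has overflowed, follow PTR to the
    overflow set it was assigned."""
    if word_addr in home.entries:
        return home.entries[word_addr]
    if home.ofb and home.ptr is not None:
        return home.ptr.entries.get(word_addr)
    return None
```

Only overflowed sets pay the extra indirection, so the common case remains a single set probe.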

ArchShield: Operation [figure: main memory is divided into OS-usable memory (7.7 GB), the Fault Map (64 MB), and the Replication Area (256 MB)] • Write request: check the line's R-bit; if the R-bit is set, write to two locations (the original word and its replica in the Replication Area), else write to one • Read request: on a last-level cache miss, query the Fault Map entry (fetching the Fault Map line on an FM miss); if the entry indicates no faulty words, the read transaction completes normally, otherwise the replicated word is also read from the Replication Area, and the R-bit is set accordingly 18
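
Roughly, in code form (a simplified, self-contained sketch; all names are mine, and the real mechanism operates on cache lines and memory transactions rather than Python dicts):

```python
class ReplicationArea:
    def __init__(self):
        self.replicas = {}                     # faulty word addr -> replica data

    def patch(self, addr, data):
        return self.replicas.get(addr, data)   # prefer the replica if present

    def writeback(self, addr, data):
        if addr in self.replicas:
            self.replicas[addr] = data         # the second write location

def read_miss(addr, fault_map, memory, ra):
    """LLC miss: consult the (cached) Fault Map, patch from the
    Replication Area if the line has faulty words; return data + R-bit."""
    has_faulty = fault_map.get(addr // 64, False)   # one FM entry per 64 B line
    data = memory.get(addr, 0)
    if has_faulty:
        data = ra.patch(addr, data)
    return data, has_faulty

def write(addr, data, r_bit, memory, ra):
    """R-bit set means the line holds replicated words: write two locations."""
    memory[addr] = data
    if r_bit:
        ra.writeback(addr, data)
```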

Outline Ø Introduction Ø Current Schemes Ø ArchShield Ø Evaluation Ø Summary 19

Experimental Evaluation • Configuration: 8-core CMP with an 8 MB shared LLC; 8 GB DIMM, two channels of DDR3-1600 • Workloads: SPEC CPU2006 suite in rate mode • Assumptions: bit error rate of 100 ppm (random faults) • Performance metric: execution time normalized to a fault-free baseline 20

Execution Time Two sources of slowdown: Fault Map access and Replication Area access [figure: execution time normalized to the fault-free baseline (0.96 to 1.08) across SPEC CPU2006 benchmarks, high-MPKI workloads on the left and low-MPKI on the right, for ArchShield (No FM Traffic), ArchShield (No Replication Area), and full ArchShield] On average, ArchShield causes a 1% slowdown 21

Fault Map Hit-Rate [figure: Fault Map hit rate in the LLC (0.0 to 1.0) across SPEC CPU2006 benchmarks, split into high-MPKI and low-MPKI groups] The hit rate of the Fault Map in the LLC is high: 95% on average 22

Analysis of Memory Operations

Transaction | 1 Access (%) | 2 Accesses (%) | 3 Accesses (%)
Reads | 72.1 | 0.02 | ~0
Writes | 22.1 | 1.2 | 0.05
Fault Map | 4.5 | N/A | N/A
Overall | 98.75 | 1.2 | 0.05

1. Only 1.2% of all accesses use the Replication Area
2. Fault Map traffic accounts for <5% of all traffic 23

Comparison With Other Schemes [figure: execution time normalized to the fault-free baseline (0.90 to 1.30) across SPEC CPU2006 benchmarks, high-MPKI and low-MPKI groups, for FREE-p, ECC-4, and ArchShield] 1. The read-before-write of FREE-p plus its strong ECC incurs high latency 2. ECC-4 incurs decoding delay 3. ArchShield has the smallest impact on execution time 24

Outline Ø Introduction Ø Current Schemes Ø ArchShield Ø Evaluation Ø Summary 25

Summary • DRAM scaling challenge: high fault rates, and current schemes are limited • We propose ArchShield, which exposes DRAM errors to the architecture layer • ArchShield uses an efficient Fault Map and selective word replication • ArchShield handles a bit error rate of 100 ppm with less than 4% storage overhead and a 1% slowdown • ArchShield can also be used to relax DRAM refresh by 16x (to a 1-second refresh interval) 26

Questions 27

Monte Carlo Simulation [figure: probability that a structure is unable to handle a given number of errors, in millions] We recommend the structure with 16 overflow sets, which tolerates 7.74 million errors in a DIMM 28

ArchShield Flowchart (entries in bold are frequently accessed) 29

Memory System with ArchShield 30

Memory Footprint 31

ArchShield compared to RAIDR 32