Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

Executive Summary
• Problem: DRAM latency is a critical performance bottleneck
• Our Goal: Reduce DRAM latency with low area cost
• Observation: Long bitlines in DRAM are the dominant source of DRAM latency
• Key Idea: Divide long bitlines into two shorter segments
– Fast and slow segments
• Tiered-latency DRAM: Enables latency heterogeneity in DRAM
– Can leverage this in many ways to improve performance and reduce power consumption
• Results: When the fast segment is used as a cache to the slow segment
– Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)

Outline
• Motivation & Key Idea
• Tiered-Latency DRAM
• Leveraging Tiered-Latency DRAM
• Evaluation Results

Historical DRAM Trend
[Figure: DRAM capacity (Gb) and latency (tRC, ns) by year, 2003 to 2011. Annotations: capacity 16X, latency -20%.]
DRAM latency continues to be a critical bottleneck

What Causes the Long Latency?
[Figure: a DRAM chip made up of subarrays (cell arrays) and I/O, connected to the channel]
DRAM Latency = Subarray Latency + I/O Latency (subarray latency is dominant)

Why is the Subarray So Slow?
[Figure: a subarray with row decoder, sense amplifiers, and cells; each cell has an access transistor and a capacitor, connected to a wordline and a bitline]
• Bitline: 512 cells
• Sense amplifier: extremely large (≈100X the cell size)
• Long bitline: Amortizes the sense amplifier → Small area
• Long bitline: Large bitline capacitance → High latency

Trade-Off: Area (Die Size) vs. Latency
• Long bitline: Smaller
• Short bitline: Faster

Trade-Off: Area (Die Size) vs. Latency
[Figure: normalized DRAM area (lower is cheaper) vs. latency in ns (lower is faster) for 32, 64, 128, 256, and 512 cells/bitline. Commodity DRAM uses long bitlines (512 cells/bitline); "fancy" DRAM uses short bitlines. The goal is the low-area, low-latency corner.]

Approximating the Best of Both Worlds
• Long bitline: Small area, high latency
• Short bitline: Large area, low latency
• Our proposal (Tiered-Latency DRAM): Add isolation transistors to a long bitline
– Small area using long bitline
– Low latency from the resulting short bitline

Tiered-Latency DRAM
• Divide a bitline into two segments with an isolation transistor
[Figure: far segment, isolation transistor, near segment, sense amplifier]

Near Segment Access
• Turn off the isolation transistor
• Reduced bitline length
• Reduced bitline capacitance
→ Low latency & low power
[Figure: far segment, isolation transistor (off), near segment, sense amplifier]

Far Segment Access
• Turn on the isolation transistor
• Long bitline length
• Large bitline capacitance
• Additional resistance of isolation transistor
→ High latency & high power
[Figure: far segment, isolation transistor (on), near segment, sense amplifier]
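The two access modes above can be summarized in a small model. This is only an illustrative sketch: the class and method names are hypothetical, and it assumes the near segment occupies the rows closest to the sense amplifier (rows 0 to 31 for a 32-cell near segment).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TieredBitline:
    """Toy model of one TL-DRAM bitline (hypothetical names, not from the slides)."""

    total_cells: int = 512  # cells per bitline
    near_cells: int = 32    # cells in the near segment (assumed to be rows 0..31)

    def _check(self, row: int) -> None:
        if not 0 <= row < self.total_cells:
            raise ValueError(f"row {row} is outside 0..{self.total_cells - 1}")

    def isolation_on(self, row: int) -> bool:
        # Only far-segment rows need the isolation transistor turned on.
        self._check(row)
        return row >= self.near_cells

    def cells_on_bitline(self, row: int) -> int:
        # Near access: the far segment is disconnected, so the sense amplifier
        # sees only the near cells. Far access: it sees the whole bitline.
        return self.total_cells if self.isolation_on(row) else self.near_cells


bitline = TieredBitline()
print(bitline.cells_on_bitline(5))    # near-segment row
print(bitline.cells_on_bitline(400))  # far-segment row
```

The model captures only the electrical load seen by the sense amplifier; it does not model the extra resistance of the isolation transistor on far accesses.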

Latency, Power, and Area Evaluation
• Commodity DRAM: 512 cells/bitline
• TL-DRAM: 512 cells/bitline
– Near segment: 32 cells
– Far segment: 480 cells
• Latency Evaluation
– SPICE simulation using circuit-level DRAM model
• Power and Area Evaluation
– DRAM area/power simulator from Rambus
– DDR3 energy calculator from Micron

Commodity DRAM vs. TL-DRAM
• DRAM Latency (tRC), relative to commodity DRAM (100%, 52.5 ns)
– Near segment: -56%
– Far segment: +23%
• DRAM Power, relative to commodity DRAM (100%)
– Near segment: -51%
– Far segment: +49%
• DRAM Area Overhead ~3%: mainly due to the isolation transistors
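As a worked example, the relative latency figures can be turned into approximate absolute tRC values. The sketch below assumes the 52.5 ns label is the commodity DRAM tRC and that the percentages are rounded, so the results are approximate; the helper name `scaled` is hypothetical.

```python
COMMODITY_TRC_NS = 52.5  # commodity DRAM tRC, taken from the slide


def scaled(base: float, percent_change: float) -> float:
    """Apply a relative change given in percent (e.g. -56 means 56% lower)."""
    return base * (1 + percent_change / 100)


near_trc_ns = scaled(COMMODITY_TRC_NS, -56)  # near segment: 56% lower
far_trc_ns = scaled(COMMODITY_TRC_NS, +23)   # far segment: 23% higher

print(f"near segment tRC ~ {near_trc_ns:.1f} ns")
print(f"far segment tRC  ~ {far_trc_ns:.1f} ns")
```

Under these assumptions the near segment comes out at roughly 23 ns and the far segment at roughly 65 ns.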

Latency vs. Near Segment Length
[Figure: near segment and far segment latency (ns, 0 to 80) vs. near segment length (1 to 512 cells), with commodity DRAM as the reference. Far segment length = 512 - near segment length.]
• Longer near segment length leads to higher near segment latency
• Far segment latency is higher than commodity DRAM latency

Trade-Off: Area (Die Area) vs. Latency
[Figure: normalized DRAM area (lower is cheaper) vs. latency in ns (lower is faster) for 32, 64, 128, 256, and 512 cells/bitline, with the TL-DRAM near segment and far segment added. The goal is the low-area, low-latency corner.]

Leveraging Tiered-Latency DRAM
• TL-DRAM is a substrate that can be leveraged by the hardware and/or software
• Many potential uses
1. Use near segment as hardware-managed inclusive cache to far segment
2. Use near segment as hardware-managed exclusive cache to far segment
3. Profile-based page mapping by operating system
4. Simply replace DRAM with TL-DRAM

Near Segment as Hardware-Managed Cache
[Figure: TL-DRAM subarray with the far segment as main memory and the near segment as cache, connected through the sense amplifier and I/O to the channel]
• Challenge 1: How to efficiently migrate a row between segments?
• Challenge 2: How to efficiently manage the cache?

Inter-Segment Migration
• Goal: Migrate source row into destination row
• Naïve way: Memory controller reads the source row byte by byte and writes to destination row byte by byte → High latency
[Figure: source row in the far segment, destination row in the near segment, isolation transistor, sense amplifier]

Inter-Segment Migration
• Our way:
– Source and destination cells share bitlines
– Transfer data from source to destination across shared bitlines concurrently
• Step 1: Activate source row
• Step 2: Activate destination row to connect cell and bitline
• Migration is overlapped with source row access
– Additional ~4 ns over row access latency
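The two-step migration can be illustrated with a purely functional toy model. This is a sketch under simplifying assumptions, not the circuit-level mechanism: the class `Subarray` and its methods are hypothetical, timing is not modeled, and the sense amplifier is treated as a simple latch shared by all rows.

```python
class Subarray:
    """Functional toy model of a subarray whose rows share bitlines."""

    def __init__(self, rows: int = 512, row_bits: int = 8):
        self.cells = [[0] * row_bits for _ in range(rows)]
        self.sense_amp = None  # None means the bitlines are precharged

    def activate(self, row: int) -> None:
        if self.sense_amp is None:
            # First activation: the row's cells drive the bitlines and the
            # sense amplifier latches the values.
            self.sense_amp = list(self.cells[row])
        else:
            # Bitlines are already driven: connecting another row's cells
            # to them overwrites those cells with the latched values.
            self.cells[row] = list(self.sense_amp)

    def precharge(self) -> None:
        self.sense_amp = None

    def migrate(self, src: int, dst: int) -> None:
        self.activate(src)  # Step 1: activate source row
        self.activate(dst)  # Step 2: activate destination row
        self.precharge()


subarray = Subarray()
subarray.cells[100] = [1, 0, 1, 1, 0, 0, 1, 0]  # a far-segment row
subarray.migrate(src=100, dst=5)                # copy into a near-segment row
print(subarray.cells[5])
```

All bits of the row move at once over the shared bitlines, which is why no data crosses the memory channel in this scheme.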

Three Caching Mechanisms
1. SC (Simple Caching)
– Classic LRU cache
– Benefit: Reduced reuse latency
Is there another benefit of caching?
2. WMC (Wait-Minimized Caching)
– Identify and cache only wait-inducing rows
– Benefit: Reduced wait
3. BBC (Benefit-Based Caching)
– BBC ≈ SC + WMC
– Benefit: Reduced reuse latency & reduced wait
[Figure: request timelines. Baseline: the request for Row 2 waits until the request for Row 1 finishes, so Row 1 is the wait-inducing row. Caching: Row 1 is a cached row, so the request for Row 2 sees a reduced wait.]
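Simple Caching (SC) is described only as a classic LRU cache, so the following is a minimal sketch of that policy rather than the presented implementation. It assumes every miss allocates a near-segment row and that the cache tracks far-segment row IDs; the class name `SimpleCache` is hypothetical.

```python
from collections import OrderedDict


class SimpleCache:
    """LRU cache over near-segment rows (illustrative sketch of SC)."""

    def __init__(self, capacity: int = 32):
        self.capacity = capacity   # number of near-segment rows
        self.rows = OrderedDict()  # far-row ID -> None, oldest entry first

    def access(self, far_row: int) -> bool:
        """Return True on a near-segment hit, False on a miss."""
        if far_row in self.rows:
            self.rows.move_to_end(far_row)  # mark as most recently used
            return True
        if len(self.rows) >= self.capacity:
            self.rows.popitem(last=False)   # evict the least recently used row
        self.rows[far_row] = None           # allocate on miss (assumption)
        return False


cache = SimpleCache(capacity=2)
print([cache.access(row) for row in [1, 2, 1, 3, 2, 3]])
# → [False, False, True, False, False, True]
```

WMC and BBC would change the allocation decision (which rows are worth caching), not the hit path shown here.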

Evaluation Methodology
• System simulator
– CPU: Instruction-trace-based x86 simulator
– Memory: Cycle-accurate DDR3 DRAM simulator
• Workloads
– 32 benchmarks from TPC, STREAM, SPEC CPU2006
• Metrics
– Single-core: Instructions-Per-Cycle
– Multi-core: Weighted speedup
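The slides name weighted speedup without defining it. The sketch below uses the commonly used definition, which is assumed here rather than taken from the slides: the sum over cores of each application's IPC when sharing the system divided by its IPC when running alone. The function name is hypothetical.

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Sum of per-application IPC_shared / IPC_alone (assumed definition)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


# Two applications, each running at half its standalone IPC when sharing.
print(weighted_speedup([0.5, 1.0], [1.0, 2.0]))  # → 1.0
```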

Configurations
• System configuration
– CPU: 5.3 GHz
– LLC: 512 kB private per core
– Memory: DDR3-1066
  • 1-2 channels, 1 rank/channel
  • 8 banks, 32 subarrays/bank, 512 cells/bitline
  • Row-interleaved mapping & closed-row policy
• TL-DRAM configuration
– Total bitline length: 512 cells/bitline
– Near segment length: 1-256 cells

Single-Core: Performance & Power
[Figure: IPC improvement (0% to 15%) and normalized power (75% to 100%) for SC, WMC, and BBC. Labeled values: 12.7% IPC improvement and -23% power.]
Using near segment as a cache improves performance and reduces power consumption

Single-Core: Varying Near Segment Length
[Figure: IPC improvement (0% to 15%) for SC, WMC, and BBC vs. near segment length (1 to 256 cells). Annotations: maximum IPC improvement; larger cache capacity; higher caching latency.]
By adjusting the near segment length, we can trade off cache capacity for cache latency

Dual-Core Evaluation
• We categorize single-core benchmarks into two categories
1. Sens: benchmarks whose performance is sensitive to near segment capacity
2. Insens: benchmarks whose performance is insensitive to near segment capacity
• Dual-core workload categorization
1. Sens/Sens
2. Sens/Insens
3. Insens/Insens

Dual-Core: Sens/Sens
[Figure: performance improvement (0% to 20%) for SC, WMC, and BBC vs. near segment length (16, 32, 64, 128 cells)]
• Larger near segment capacity leads to higher performance improvement in sensitive workloads
• BBC/WMC show more performance improvement

Dual-Core: Sens/Insens & Insens/Insens
[Figure: performance improvement (0% to 20%) for SC, WMC, and BBC vs. near segment length (16, 32, 64, 128 cells)]
Using near segment as a cache provides high performance improvement regardless of near segment capacity

Other Mechanisms & Results in Paper
• More mechanisms for leveraging TL-DRAM
– Hardware-managed exclusive caching mechanism
– Profile-based page mapping to near segment
– TL-DRAM improves performance and reduces power consumption with other mechanisms
• More than two tiers
– Latency evaluation for three-tier TL-DRAM
• Detailed circuit evaluation for DRAM latency and power consumption
– Examination of tRC and tRCD
• Implementation details and storage cost analysis in memory controller

Conclusion
• Problem: DRAM latency is a critical performance bottleneck
• Our Goal: Reduce DRAM latency with low area cost
• Observation: Long bitlines in DRAM are the dominant source of DRAM latency
• Key Idea: Divide long bitlines into two shorter segments
– Fast and slow segments
• Tiered-latency DRAM: Enables latency heterogeneity in DRAM
– Can leverage this in many ways to improve performance and reduce power consumption
• Results: When the fast segment is used as a cache to the slow segment
– Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)

Thank You

Backup Slides

Storage Cost in Memory Controller
• Organization
– Bitline length: 512 cells/bitline
– Near segment length: 32 cells
– Far segment length: 480 cells
– Inclusive caching
• Simple caching and wait-minimized caching
– Tag storage: 9 KB
– Replacement information: 5 KB
• Benefit-based caching
– Tag storage: 9 KB
– Replacement information: 8 KB (8-bit benefit field per near segment row)
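The slide gives the totals without a derivation. The arithmetic below is one reading that happens to reproduce them, and it rests on several assumptions not stated on this slide: a single rank with 8 banks and 32 subarrays per bank (from the Configurations slide), one near-segment row per near-segment cell position, a tag wide enough to name any of the 480 far rows in a subarray, and LRU state wide enough to rank the 32 near rows.

```python
BANKS = 8
SUBARRAYS_PER_BANK = 32
NEAR_ROWS_PER_SUBARRAY = 32   # near segment length in cells
FAR_ROWS_PER_SUBARRAY = 480   # far segment length in cells

near_rows_total = BANKS * SUBARRAYS_PER_BANK * NEAR_ROWS_PER_SUBARRAY


def bits_to_index(n: int) -> int:
    """Smallest number of bits that can distinguish n items."""
    return (n - 1).bit_length()


def total_kib(bits_per_near_row: int) -> float:
    """Storage in KiB when every near-segment row carries this many bits."""
    return near_rows_total * bits_per_near_row / 8 / 1024


tag_bits = bits_to_index(FAR_ROWS_PER_SUBARRAY)   # 9 bits
lru_bits = bits_to_index(NEAR_ROWS_PER_SUBARRAY)  # 5 bits
benefit_bits = 8                                  # from the slide

print(near_rows_total)          # → 8192
print(total_kib(tag_bits))      # → 9.0
print(total_kib(lru_bits))      # → 5.0
print(total_kib(benefit_bits))  # → 8.0
```

Only the 8-bit benefit field is stated on the slide; the 9-bit tag and 5-bit LRU widths are inferred from the totals.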

Hardware-managed Exclusive Cache
• Near and far segment: Main memory
• Caching: Swapping near and far segment rows
– Need one dummy row to swap
[Figure: performance improvement and power reduction (0% to 15%) vs. core count (number of memory channels): 1 (1-ch), 2 (2-ch), 4 (4-ch). Bar labels, in extracted order: 11.4%, 9.4%, 8.9%, 7.2%, 9.9%, 14.3%.]
Performance improvement is lower than with inclusive caching due to high swapping latency

Profile-Based Page Mapping
• Operating system profiles applications and maps frequently accessed rows to the near segment
[Figure: performance improvement and power reduction (0% to 30%) vs. core count (number of memory channels): 1 (1-ch), 2 (2-ch), 4 (4-ch). Bar labels, in extracted order: 24.8%, 19%, 8.9%, 11.6%, 21.5%, 7.2%.]
Allocating frequently accessed rows in the near segment provides performance improvement
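The profiling step can be sketched as a frequency count over a row-access trace. This is a minimal illustration, not the operating-system mechanism evaluated in the slides: the function name, the trace format (a list of row IDs), and the top-k selection policy are all assumptions.

```python
from collections import Counter
from typing import Iterable, Set


def pick_near_rows(access_trace: Iterable[int], near_capacity: int) -> Set[int]:
    """Choose the most frequently accessed rows for the near segment."""
    counts = Counter(access_trace)
    return {row for row, _ in counts.most_common(near_capacity)}


# Rows 7 and 3 are accessed most often, so they are mapped to the near segment.
trace = [7, 7, 7, 3, 3, 9]
print(pick_near_rows(trace, near_capacity=2))
```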

Three-Tier Analysis
• Three tiers
– Add two isolation transistors
– Near/Mid/Far segment length: 32/224/256 cells
[Figure: latency of three-tier TL-DRAM relative to commodity DRAM: near -56%, mid -23%, far +57%]
More tiers enable finer-grained caching and partitioning mechanisms