Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture
- Slides: 44
Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu
Executive Summary
• Problem: DRAM latency is a critical performance bottleneck
• Our Goal: Reduce DRAM latency with low area cost
• Observation: Long bitlines in DRAM are the dominant source of DRAM latency
• Key Idea: Divide long bitlines into two shorter segments
  – Fast and slow segments
• Tiered-Latency DRAM: Enables latency heterogeneity in DRAM
  – Can leverage this in many ways to improve performance and reduce power consumption
• Results: When the fast segment is used as a cache to the slow segment
  – Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)
Outline
• Motivation & Key Idea
• Tiered-Latency DRAM
• Leveraging Tiered-Latency DRAM
• Evaluation Results
Historical DRAM Trend
[Figure: DRAM capacity (Gb) and latency (tRC, ns) by year, 2003 to 2011. Annotations: capacity 16X, latency -20%.]
• DRAM latency continues to be a critical bottleneck
What Causes the Long Latency?
[Figure: a DRAM chip made of subarrays (cell arrays) and I/O, connected to the channel.]
• DRAM Latency = Subarray Latency + I/O Latency
• Subarray latency is the dominant component
Why is the Subarray So Slow?
[Figure: a subarray with row decoder, wordlines, bitlines, and sense amplifiers. Each cell consists of an access transistor and a capacitor. Each bitline connects 512 cells to a sense amplifier.]
• Sense amplifier is extremely large (≈100X the cell size)
• Long bitline: Amortizes sense amplifier cost → Small area
• Long bitline: Large bitline capacitance → High latency
Trade-Off: Area (Die Size) vs. Latency
[Figure: a long bitline next to a short bitline.]
• Long bitline: Smaller
• Short bitline: Faster
• Trade-off: Area vs. Latency
Trade-Off: Area (Die Size) vs. Latency
[Figure: normalized DRAM area (lower is cheaper) vs. latency in ns (lower is faster) for 32, 64, 128, 256, and 512 cells/bitline. Fancy DRAM uses short bitlines; commodity DRAM uses long bitlines (512 cells/bitline). The point marked GOAL combines low area with low latency.]
Approximating the Best of Both Worlds
• Long bitline: Small area, high latency
• Short bitline: Large area, low latency
• Our proposal: Need isolation → Add isolation transistors
  – Short bitline → Fast
Approximating the Best of Both Worlds
• Long bitline: Small area, high latency
• Short bitline: Large area, low latency
• Our proposal (Tiered-Latency DRAM): Small area using long bitline, low latency
Tiered-Latency DRAM
• Divide a bitline into two segments with an isolation transistor
[Figure: one bitline with far segment, isolation transistor, near segment, and sense amplifier.]
Near Segment Access
• Turn off the isolation transistor
  – Reduced bitline length
  – Reduced bitline capacitance
  → Low latency & low power
[Figure: far segment, isolation transistor (off), near segment, sense amplifier.]
Far Segment Access
• Turn on the isolation transistor
  – Long bitline length
  – Large bitline capacitance
  – Additional resistance of isolation transistor
  → High latency & high power
[Figure: far segment, isolation transistor (on), near segment, sense amplifier.]
Latency, Power, and Area Evaluation
• Commodity DRAM: 512 cells/bitline
• TL-DRAM: 512 cells/bitline
  – Near segment: 32 cells
  – Far segment: 480 cells
• Latency Evaluation
  – SPICE simulation using circuit-level DRAM model
• Power and Area Evaluation
  – DRAM area/power simulator from Rambus
  – DDR3 energy calculator from Micron
Commodity DRAM vs. TL-DRAM
• DRAM Latency (tRC), relative to commodity DRAM (100%, 52.5 ns)
  – TL-DRAM near segment: -56%
  – TL-DRAM far segment: +23%
• DRAM Power, relative to commodity DRAM (100%)
  – TL-DRAM near segment: -51%
  – TL-DRAM far segment: +49%
• DRAM Area Overhead ~3%: mainly due to the isolation transistors
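To make the percentages on this slide concrete, the sketch below converts them into absolute tRC values and estimates what fraction of accesses must hit the near segment before the average tRC drops below commodity DRAM. The function names are illustrative assumptions, the slide percentages are rounded, and migration cost is ignored, so the results are only approximate.

```python
# Figures from the slide; names and structure are illustrative assumptions.
COMMODITY_TRC_NS = 52.5
RELATIVE_CHANGE = {"near": -0.56, "far": 0.23}  # change vs. commodity tRC


def segment_trc_ns(segment):
    """Absolute tRC of a TL-DRAM segment implied by the slide's percentages."""
    return COMMODITY_TRC_NS * (1.0 + RELATIVE_CHANGE[segment])


def average_trc_ns(near_fraction):
    """Average tRC when `near_fraction` of accesses hit the near segment.

    Simplification: ignores the cost of migrating rows between segments.
    """
    near = segment_trc_ns("near")
    far = segment_trc_ns("far")
    return near_fraction * near + (1.0 - near_fraction) * far


def break_even_near_fraction():
    """Near-segment access fraction at which average tRC equals commodity tRC."""
    near = segment_trc_ns("near")
    far = segment_trc_ns("far")
    return (far - COMMODITY_TRC_NS) / (far - near)


if __name__ == "__main__":
    print(f"near tRC: {segment_trc_ns('near'):.1f} ns")
    print(f"far tRC: {segment_trc_ns('far'):.1f} ns")
    print(f"break-even near fraction: {break_even_near_fraction():.2f}")
```

Under these assumptions the near segment comes out at roughly 23 ns, the far segment at roughly 65 ns, and the break-even point at about 29% of accesses served by the near segment.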
Latency vs. Near Segment Length
[Figure: near segment and far segment latency (ns, 0 to 80) vs. near segment length (1 to 512 cells), with commodity DRAM latency as the reference (Ref.).]
• Longer near segment length leads to higher near segment latency
Latency vs. Near Segment Length
[Figure: near segment and far segment latency (ns, 0 to 80) vs. near segment length (1 to 512 cells), with commodity DRAM latency as the reference (Ref.).]
• Far Segment Length = 512 - Near Segment Length
• Far segment latency is higher than commodity DRAM latency
Trade-Off: Area (Die-Area) vs. Latency
[Figure: normalized DRAM area (lower is cheaper) vs. latency in ns (lower is faster) for 32, 64, 128, 256, and 512 cells/bitline, with the TL-DRAM near segment and far segment plotted alongside and the GOAL point marked.]
Leveraging Tiered-Latency DRAM
• TL-DRAM is a substrate that can be leveraged by the hardware and/or software
• Many potential uses
  1. Use near segment as hardware-managed inclusive cache to far segment
  2. Use near segment as hardware-managed exclusive cache to far segment
  3. Profile-based page mapping by operating system
  4. Simply replace DRAM with TL-DRAM
Near Segment as Hardware-Managed Cache
[Figure: TL-DRAM subarray in which the far segment serves as main memory and the near segment serves as a cache, connected through the sense amplifier and I/O to the channel.]
• Challenge 1: How to efficiently migrate a row between segments?
• Challenge 2: How to efficiently manage the cache?
Inter-Segment Migration
• Goal: Migrate source row into destination row
• Naïve way: Memory controller reads the source row byte by byte and writes to destination row byte by byte → High latency
[Figure: source and destination rows on a bitline with far segment, isolation transistor, near segment, and sense amplifier.]
Inter-Segment Migration
• Our way:
  – Source and destination cells share bitlines
  – Transfer data from source to destination across shared bitlines concurrently
• Step 1: Activate source row
• Step 2: Activate destination row to connect cell and bitline
• Migration is overlapped with source row access
  – Additional ~4 ns over row access latency
[Figure: source and destination rows on a bitline with far segment, isolation transistor, near segment, and sense amplifier.]
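The two-step migration can be illustrated with a toy logical model. This is a sketch under strong assumptions: the `Subarray` class and its methods are hypothetical, each row is a list of bits, and all electrical behavior (charge sharing, timing, the isolation transistor) is abstracted away. It only shows why a second activation copies the whole row at once through the shared bitlines.

```python
class Subarray:
    """Toy logical model of rows sharing bitlines and sense amplifiers.

    Hypothetical names; no timing or electrical effects are modeled.
    """

    def __init__(self, num_rows, num_cols):
        self.cells = [[0] * num_cols for _ in range(num_rows)]
        self.sense_amps = None  # None means the bitlines are precharged

    def activate(self, row):
        if self.sense_amps is None:
            # First activation: the row's cells drive the bitlines and the
            # sense amplifiers latch the values.
            self.sense_amps = list(self.cells[row])
        else:
            # Bitlines are already driven by the sense amplifiers, so cells
            # connected now take on the latched values (all columns at once).
            self.cells[row] = list(self.sense_amps)

    def precharge(self):
        self.sense_amps = None


def migrate(subarray, source_row, destination_row):
    """Step 1: activate the source row. Step 2: activate the destination row."""
    subarray.activate(source_row)
    subarray.activate(destination_row)
    subarray.precharge()


if __name__ == "__main__":
    sub = Subarray(num_rows=512, num_cols=4)
    sub.cells[100] = [1, 0, 1, 1]  # a row assumed to sit in the far segment
    migrate(sub, source_row=100, destination_row=3)
    print(sub.cells[3])
```

In this model no data crosses the channel: the copy happens inside the subarray, which is the property the slide contrasts with the naïve byte-by-byte approach.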
Three Caching Mechanisms
1. SC (Simple Caching)
  – Classic LRU cache
  – Benefit: Reduced reuse latency
Is there another benefit of caching?
2. WMC (Wait-Minimized Caching)
  – Identify and cache only wait-inducing rows
  – Benefit: Reduced wait
3. BBC (Benefit-Based Caching)
  – BBC ≈ SC + WMC
  – Benefit: Reduced reuse latency & reduced wait
[Figure: request timelines. Baseline: the request for Row 2 waits until the request for Row 1 finishes, so Row 1 is a wait-inducing row. With caching: Row 1 is a cached row, which reduces the wait for Row 2.]
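As an illustration of SC only, the sketch below tracks which far-segment rows are cached in one subarray's near segment using a classic LRU policy. The class name, the default capacity of 32 rows, and the hit/miss interface are assumptions made for this example; WMC and BBC would need additional bookkeeping that the slide does not spell out.

```python
from collections import OrderedDict


class NearSegmentLRU:
    """Classic LRU bookkeeping for rows cached in a near segment.

    Hypothetical sketch: tracks tags only, not data or timing.
    """

    def __init__(self, capacity_rows=32):
        self.capacity_rows = capacity_rows
        self.cached_rows = OrderedDict()  # far-row id -> None, oldest first

    def access(self, far_row):
        """Return True on a near-segment hit, False on a miss.

        On a miss the row is cached, evicting the least recently used row
        if the near segment is full.
        """
        if far_row in self.cached_rows:
            self.cached_rows.move_to_end(far_row)  # mark as most recent
            return True
        if len(self.cached_rows) >= self.capacity_rows:
            self.cached_rows.popitem(last=False)  # evict least recent
        self.cached_rows[far_row] = None
        return False


if __name__ == "__main__":
    cache = NearSegmentLRU(capacity_rows=2)
    print([cache.access(row) for row in [1, 2, 1, 3, 1, 2]])
```

With a capacity of two rows, the trace 1, 2, 1, 3, 1, 2 hits only on the two repeated accesses to row 1, because caching row 3 evicts row 2.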
Evaluation Methodology
• System simulator
  – CPU: Instruction-trace-based x86 simulator
  – Memory: Cycle-accurate DDR3 DRAM simulator
• Workloads
  – 32 benchmarks from TPC, STREAM, SPEC CPU2006
• Metrics
  – Single-core: Instructions-Per-Cycle
  – Multi-core: Weighted speedup
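The slide names weighted speedup without defining it. The sketch below assumes the commonly used definition, the sum over cores of each application's IPC when sharing the system divided by its IPC when running alone. The function and parameter names are illustrative.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application IPC ratios (shared run / alone run).

    Assumes the commonly used definition of weighted speedup; the slide
    itself does not give a formula.
    """
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-run IPC per shared-run IPC")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


if __name__ == "__main__":
    # Two applications, each running at half of its alone-run IPC.
    print(weighted_speedup([0.5, 1.0], [1.0, 2.0]))
```

A system where every application runs as fast as it does alone scores the core count, so a dual-core workload is bounded by 2 under this definition.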
Configurations
• System configuration
  – CPU: 5.3 GHz
  – LLC: 512 kB private per core
  – Memory: DDR3-1066
    • 1-2 channels, 1 rank/channel
    • 8 banks, 32 subarrays/bank, 512 cells/bitline
    • Row-interleaved mapping & closed-row policy
• TL-DRAM configuration
  – Total bitline length: 512 cells/bitline
  – Near segment length: 1-256 cells
Single-Core: Performance & Power
[Figure: IPC improvement (0% to 15%) and normalized power (75% to 100%) for SC, WMC, and BBC. Annotated values: 12.7% IPC improvement and -23% power.]
• Using near segment as a cache improves performance and reduces power consumption
Single-Core: Varying Near Segment Length
[Figure: IPC improvement (0% to 15%) of SC, WMC, and BBC vs. near segment length (1 to 256 cells), with the maximum IPC improvement marked. Annotations: a longer near segment gives larger cache capacity but higher caching latency.]
• By adjusting the near segment length, we can trade off cache capacity for cache latency
Dual-Core Evaluation
• We categorize single-core benchmarks into two categories
  1. Sens: benchmarks whose performance is sensitive to near segment capacity
  2. Insens: benchmarks whose performance is insensitive to near segment capacity
• Dual-core workload categorization
  1. Sens/Sens
  2. Sens/Insens
  3. Insens/Insens
Dual-Core: Sens/Sens
[Figure: performance improvement (0% to 20%) of SC, WMC, and BBC vs. near segment length (16, 32, 64, 128 cells).]
• Larger near segment capacity leads to higher performance improvement in sensitive workloads
• BBC/WMC show more performance improvement
Dual-Core: Sens/Insens & Insens/Insens
[Figure: performance improvement (0% to 20%) of SC, WMC, and BBC vs. near segment length (16, 32, 64, 128 cells).]
• Using near segment as a cache provides high performance improvement regardless of near segment capacity
Other Mechanisms & Results in Paper
• More mechanisms for leveraging TL-DRAM
  – Hardware-managed exclusive caching mechanism
  – Profile-based page mapping to near segment
  – TL-DRAM improves performance and reduces power consumption with other mechanisms
• More than two tiers
  – Latency evaluation for three-tier TL-DRAM
• Detailed circuit evaluation for DRAM latency and power consumption
  – Examination of tRC and tRCD
• Implementation details and storage cost analysis in memory controller
Conclusion
• Problem: DRAM latency is a critical performance bottleneck
• Our Goal: Reduce DRAM latency with low area cost
• Observation: Long bitlines in DRAM are the dominant source of DRAM latency
• Key Idea: Divide long bitlines into two shorter segments
  – Fast and slow segments
• Tiered-Latency DRAM: Enables latency heterogeneity in DRAM
  – Can leverage this in many ways to improve performance and reduce power consumption
• Results: When the fast segment is used as a cache to the slow segment
  – Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)
Thank You
Backup Slides
Storage Cost in Memory Controller
• Organization
  – Bitline length: 512 cells/bitline
  – Near segment length: 32 cells
  – Far segment length: 480 cells
  – Inclusive caching
• Simple caching and wait-minimized caching
  – Tag storage: 9 KB
  – Replacement information: 5 KB
• Benefit-based caching
  – Tag storage: 9 KB
  – Replacement information: 8 KB (8-bit benefit field per near segment row)
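The storage figures can be reproduced with a short calculation. The sketch assumes one rank with 8 banks and 32 subarrays per bank (taken from the Configurations slide), one near segment of 32 rows per subarray, a tag wide enough to name any of the 480 far-segment rows, and 5 bits of LRU state per row. Only the 8-bit benefit field is stated on the slide; the other bit widths are inferences that happen to match the quoted totals.

```python
BANKS = 8                # from the Configurations slide
SUBARRAYS_PER_BANK = 32  # from the Configurations slide
NEAR_ROWS = 32           # near segment length in cells (rows per subarray)
FAR_ROWS = 480           # far segment length in cells (rows per subarray)


def bits_to_index(count):
    """Minimum number of bits needed to name one of `count` items."""
    return (count - 1).bit_length()


def storage_kb(bits_per_near_row):
    """Total storage in KB for a per-near-row field, assuming a single rank."""
    near_rows_total = BANKS * SUBARRAYS_PER_BANK * NEAR_ROWS
    return near_rows_total * bits_per_near_row / 8 / 1024


TAG_BITS = bits_to_index(FAR_ROWS)   # assumed: tag names a far-segment row
LRU_BITS = bits_to_index(NEAR_ROWS)  # assumed: LRU rank among near rows
BENEFIT_BITS = 8                     # stated on the slide

if __name__ == "__main__":
    print("tag storage (KB):", storage_kb(TAG_BITS))
    print("SC/WMC replacement info (KB):", storage_kb(LRU_BITS))
    print("BBC replacement info (KB):", storage_kb(BENEFIT_BITS))
```

Under these assumptions the three totals come out to 9 KB, 5 KB, and 8 KB, matching the slide.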
Hardware-managed Exclusive Cache
• Near and far segment: Main memory
• Caching: Swapping near and far segment row
  – Need one dummy row to swap
[Figure: performance improvement and power reduction (0% to 15%) by core count (# of memory channels): 1 (1-ch), 2 (2-ch), 4 (4-ch). Bar labels as extracted (pairing with bars not recoverable from the text): 11.4%, 9.4%, 8.9%, 7.2%, 9.9%, 14.3%.]
• Performance improvement is lower than inclusive caching due to high swapping latency
Profile-Based Page Mapping
• Operating system profiles applications and maps frequently accessed rows to the near segment
[Figure: performance improvement and power reduction (0% to 30%) by core count (# of memory channels): 1 (1-ch), 2 (2-ch), 4 (4-ch). Bar labels as extracted (pairing with bars not recoverable from the text): 24.8%, 19%, 8.9%, 11.6%, 21.5%, 7.2%.]
• Allocating frequently accessed rows in the near segment provides performance improvement
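A minimal sketch of the idea, assuming the profile is simply a trace of accessed row identifiers and that the mapping policy picks the most frequently accessed rows until the near segment is full. The function name and trace format are hypothetical; the slide does not describe the actual profiling or mapping mechanism.

```python
from collections import Counter


def choose_near_segment_rows(access_trace, near_capacity_rows):
    """Return the set of rows to place in the near segment.

    Hypothetical policy: rank rows by access count in the profiled trace
    and keep the top `near_capacity_rows`.
    """
    counts = Counter(access_trace)
    hottest = counts.most_common(near_capacity_rows)
    return {row for row, _count in hottest}


if __name__ == "__main__":
    trace = ["a", "b", "a", "c", "a", "b"]
    print(sorted(choose_near_segment_rows(trace, near_capacity_rows=2)))
```

Because the placement is decided from a profile rather than at run time, this approach needs no tag storage or row migration hardware in the memory controller, at the cost of depending on how representative the profile is.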
Three-Tier Analysis
• Three tiers
  – Add two isolation transistors
  – Near/Mid/Far segment length: 32/224/256 cells
[Figure: latency of the near, mid, and far segments of three-tier TL-DRAM relative to commodity DRAM. Labels: -56% (near), -23% (mid), 57% higher (far).]
• More tiers enable finer-grained caching and partitioning mechanisms