Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch*

Executive Summary
• Off-chip memory latency is high
– Large caches can help, but at significant cost
• Compressing data in cache enables larger cache at low cost
• Problem: Decompression is on the execution critical path
• Goal: Design a new compression scheme that has
1. low decompression latency,
2. low cost,
3. high compression ratio
• Observation: Many cache lines have low dynamic range data
• Key Idea: Encode cache lines as a base + multiple differences
• Solution: Base-Delta-Immediate compression with low decompression latency and high compression ratio
– Outperforms three state-of-the-art compression mechanisms

Motivation for Cache Compression
Significant redundancy in data: 0x0000000B 0x00000003 0x00000004 …
How can we exploit this redundancy?
– Cache compression helps
– Provides effect of a larger cache without making it physically larger

Background on Cache Compression
[Diagram: CPU, L1 Cache (uncompressed), L2 Cache (compressed); data is decompressed on an L2 hit]
• Key requirements:
– Fast (low decompression latency)
– Simple (avoid complex hardware changes)
– Effective (good compression ratio)

Shortcomings of Prior Work
[Table: compression mechanisms (Zero, Frequent Value, Frequent Pattern, and our proposal: BΔI) rated on decompression latency, complexity, and compression ratio; the per-cell ratings were graphical]

Outline
• Motivation & Background
• Key Idea & Our Mechanism
• Evaluation
• Conclusion

Key Data Patterns in Real Applications
• Zero Values: initialization, sparse matrices, NULL pointers
0x00000000 0x00000000 0x00000000 …
• Repeated Values: common initial values, adjacent pixels
0x000000FF 0x000000FF 0x000000FF …
• Narrow Values: small values stored in a big data type
0x0000000B 0x00000003 0x00000004 …
• Other Patterns: pointers to the same memory region
0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 …

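The four patterns above can be illustrated with a small classifier. This is only a sketch: the function name `classify_line`, the order of the checks, and the definition of "narrow" (every value fits in one byte) are assumptions made for illustration, not part of the slides.

```python
def classify_line(values, narrow_bytes=1):
    """Label a cache line (a list of unsigned ints) by the first matching pattern.

    Assumption: a value is "narrow" if it fits in `narrow_bytes` bytes.
    """
    if all(v == 0 for v in values):
        return "zero"
    if all(v == values[0] for v in values):
        return "repeated"
    if all(v < (1 << (8 * narrow_bytes)) for v in values):
        return "narrow"
    return "other"


# Example lines taken from the slide.
print(classify_line([0x00000000] * 4))
print(classify_line([0x000000FF] * 4))
print(classify_line([0x0000000B, 0x00000003, 0x00000004]))
print(classify_line([0xC04039C0, 0xC04039C8, 0xC04039D0, 0xC04039D8]))
```

Checking "repeated" before "narrow" means a line of identical small values is reported as repeated; a different order would be equally defensible.
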
How Common Are These Patterns?
[Chart: Cache Coverage (%) per benchmark and on average, split into Zero, Repeated Values, and Other Patterns]
SPEC 2006, databases, web workloads, 2 MB L2 cache
“Other Patterns” include Narrow Values
43% of the cache lines belong to key patterns

Key Data Patterns in Real Applications
• Zero Values: initialization, sparse matrices, NULL pointers
• Repeated Values: common initial values, adjacent pixels
• Narrow Values: small values stored in a big data type
• Other Patterns: pointers to the same memory region
Low Dynamic Range: Differences between values are significantly smaller than the values themselves

Key Idea: Base+Delta (B+Δ) Encoding
32-byte Uncompressed Cache Line (4-byte values): 0xC04039C0 0xC04039C8 0xC04039D0 … 0xC04039F8
12-byte Compressed Cache Line: Base 0xC04039C0 (4 bytes), followed by 1-byte deltas 0x00 0x08 0x10 … 0x38
20 bytes saved
• Fast Decompression: vector addition
• Simple Hardware: arithmetic and comparison
• Effective: good compression ratio

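The encoding on this slide can be reproduced in a few lines. The sketch below assumes the base is the first value in the line and that deltas are stored as signed integers of `delta_bytes` bytes; the function name `base_delta_encode` is hypothetical.

```python
def base_delta_encode(values, delta_bytes=1):
    """Return (base, deltas) if every delta fits in `delta_bytes`, else None.

    Assumptions: the base is the first value; deltas are signed.
    """
    base = values[0]
    deltas = [v - base for v in values]
    half = 1 << (8 * delta_bytes - 1)
    if all(-half <= d < half for d in deltas):
        return base, deltas
    return None


# The slide's line: eight 4-byte pointers spaced 8 bytes apart.
line = [0xC04039C0 + 8 * i for i in range(8)]
base, deltas = base_delta_encode(line, delta_bytes=1)

uncompressed_size = 4 * len(line)       # 32 bytes
compressed_size = 4 + 1 * len(deltas)   # 4-byte base + eight 1-byte deltas
print(hex(base), [hex(d) for d in deltas])
print(uncompressed_size - compressed_size, "bytes saved")
```

The size arithmetic matches the slide: a 4-byte base plus eight 1-byte deltas gives 12 bytes, so 20 of the original 32 bytes are saved.
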
Can We Do Better?
• Uncompressible cache line (with a single base):
0x00000000 0x09A40178 0x0000000B 0x09A4A838 …
• Key idea: Use more bases, e.g., two instead of one
• Pro:
– More cache lines can be compressed
• Cons:
– Unclear how to find these bases efficiently
– Higher overhead (due to additional bases)

B+Δ with Multiple Arbitrary Bases
[Chart: Compression Ratio (geometric mean) for 1, 2, 3, 4, 8, 10, and 16 bases]
2 bases: the best option based on evaluations

How to Find Two Bases Efficiently?
1. First base: first element in the cache line (Base+Delta part)
2. Second base: implicit base of 0 (Immediate part)
Advantages over 2 arbitrary bases:
– Better compression ratio
– Simpler compression logic
Base-Delta-Immediate (BΔI) Compression

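A minimal sketch of the two-base idea follows. It makes several assumptions that are not stated on the slide: deltas are signed, a per-value mask bit records which base was used, and the non-zero base is taken as the first value that does not already fit as an immediate (the slide says simply "first element"), so that a line starting with a small value still compresses. The names `bdi_encode` and `bdi_decode` are hypothetical, and the example line is illustrative, loosely modeled on the mixed line from the "Can We Do Better?" slide.

```python
def fits(delta, nbytes):
    """True if `delta` is representable as a signed integer of `nbytes` bytes."""
    half = 1 << (8 * nbytes - 1)
    return -half <= delta < half


def bdi_encode(values, delta_bytes):
    """Return (base, mask, deltas), or None if the line does not compress.

    mask[i] == 0: value i is an immediate (delta from the implicit base 0).
    mask[i] == 1: value i is a delta from the non-zero base.
    """
    base, mask, deltas = None, [], []
    for v in values:
        if fits(v, delta_bytes):
            mask.append(0)
            deltas.append(v)
            continue
        if base is None:
            base = v  # assumption: first value that is not an immediate
        if not fits(v - base, delta_bytes):
            return None
        mask.append(1)
        deltas.append(v - base)
    return base, mask, deltas


def bdi_decode(base, mask, deltas):
    """Invert bdi_encode: add each delta to either the base or zero."""
    return [(base if m else 0) + d for m, d in zip(mask, deltas)]


# Illustrative line mixing small values with pointers into one region.
line = [0x00000000, 0x09A40178, 0x0000000B, 0x09A40190]
encoded = bdi_encode(line, delta_bytes=1)
print(encoded)
print(bdi_decode(*encoded) == line)
```

With a single base this line would not compress using 1-byte deltas, because the small values and the pointers are far apart; the implicit zero base absorbs the small values without storing a second base.
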
B+Δ (with two arbitrary bases) vs. BΔI
[Chart: Compression Ratio per benchmark and geometric mean, B+Δ (2 bases) vs. BΔI]
Average compression ratio is close, but BΔI is simpler

BΔI Implementation
• Decompressor Design
– Low latency
• Compressor Design
– Low cost and complexity
• BΔI Cache Organization
– Modest complexity

BΔI Decompressor Design
Compressed Cache Line: B0 Δ0 Δ1 Δ2 Δ3
Vector addition: V0 = B0 + Δ0, V1 = B0 + Δ1, V2 = B0 + Δ2, V3 = B0 + Δ3
Uncompressed Cache Line: V0 V1 V2 V3

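In software terms, the decompressor is one addition per value. The sketch below assumes 8-byte values and signed deltas; the hardware on the slide performs the additions in parallel, which the list comprehension here only imitates sequentially. The name `decompress` is hypothetical.

```python
MASK64 = (1 << 64) - 1  # assumption: values are 8 bytes wide


def decompress(base, deltas):
    """Rebuild each value as base + delta, wrapped to 64 bits."""
    return [(base + d) & MASK64 for d in deltas]


values = decompress(0xC04039C0, [0x00, 0x08, 0x10, 0x18])
print([hex(v) for v in values])
```
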
BΔI Compressor Design
32-byte Uncompressed Cache Line, fed to eight compression units (CUs):
• 8-byte B0, 1-byte Δ CU
• 8-byte B0, 2-byte Δ CU
• 8-byte B0, 4-byte Δ CU
• 4-byte B0, 1-byte Δ CU
• 4-byte B0, 2-byte Δ CU
• 2-byte B0, 1-byte Δ CU
• Zero CU
• Rep. Values CU
Each CU outputs a compression flag (CFlag) and a compressed cache line (CCL)
Compression Selection Logic (based on compr. size)
Output: Compression Flag & Compressed Cache Line

BΔI Compression Unit: 8-byte B0, 1-byte Δ
32-byte Uncompressed Cache Line, split into 8-byte values: V0 V1 V2 V3
B0 = V0
Δ0 = V0 - B0, Δ1 = V1 - B0, Δ2 = V2 - B0, Δ3 = V3 - B0
Is every element within 1-byte range?
– Yes: B0 Δ0 Δ1 Δ2 Δ3
– No

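The compression unit and the selection logic can be sketched together: each unit reports a compressed size if it can encode the line, and the selection step keeps the smallest. Several details here are assumptions for illustration only: little-endian byte order, signed deltas, a 1-byte size for an all-zero line, an 8-byte size for a repeated-value line, and the names `unit_size` and `select`. The units below use a single base, as on the compression unit slide.

```python
# (base bytes, delta bytes) for the six base+delta units listed on the slide.
CONFIGS = [(8, 1), (8, 2), (8, 4), (4, 1), (4, 2), (2, 1)]


def split(line, size):
    """Cut a bytes object into unsigned ints of `size` bytes (little-endian assumed)."""
    return [int.from_bytes(line[i:i + size], "little")
            for i in range(0, len(line), size)]


def unit_size(line, base_bytes, delta_bytes):
    """Compressed size in bytes if this unit can encode the line, else None."""
    values = split(line, base_bytes)
    base = values[0]
    half = 1 << (8 * delta_bytes - 1)
    if all(-half <= v - base < half for v in values):
        return base_bytes + delta_bytes * len(values)
    return None


def select(line):
    """Run every unit and return (name, size) of the smallest encoding."""
    candidates = {}
    if all(b == 0 for b in line):
        candidates["zero"] = 1          # assumed size
    words = split(line, 8)
    if all(w == words[0] for w in words):
        candidates["repeated"] = 8      # assumed size
    for base_bytes, delta_bytes in CONFIGS:
        size = unit_size(line, base_bytes, delta_bytes)
        if size is not None:
            candidates[f"base{base_bytes}-delta{delta_bytes}"] = size
    if not candidates:
        return "uncompressed", len(line)
    best = min(candidates, key=candidates.get)
    return best, candidates[best]


# Eight 4-byte pointers spaced 8 bytes apart, as in the B+Δ example slide.
pointers = b"".join((0xC04039C0 + 8 * i).to_bytes(4, "little") for i in range(8))
print(select(pointers))
print(select(bytes(32)))
```

In hardware the units would run side by side; the loop here only stands in for that parallel evaluation.
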
BΔI Cache Organization
Conventional 2-way cache with 32-byte cache lines:
– Tag Storage: Tag0, Tag1 per set (Way 0, Way 1)
– Data Storage: Data0, Data1 per set (Way 0, Way 1), 32 bytes each
BΔI: 4-way cache with 8-byte segmented data:
– Tag Storage: Tag0, Tag1, Tag2, Tag3 per set (Way 0 … Way 3), each with C (compr. encoding bits)
– Data Storage: segments S0 … S7 per set, 8 bytes each
Twice as many tags
Tags map to multiple adjacent segments
2.3% overhead for 2 MB cache

Qualitative Comparison with Prior Work
• Zero-based designs
– ZCA [Dusser+, ICS’09]: zero-content augmented cache
– ZVC [Islam+, PACT’09]: zero-value cancelling
– Limited applicability (only zero values)
• FVC [Yang+, MICRO’00]: frequent value compression
– High decompression latency and complexity
• Pattern-based compression designs
– FPC [Alameldeen+, ISCA’04]: frequent pattern compression
  • High decompression latency (5 cycles) and complexity
– C-pack [Chen+, T-VLSI Systems’10]: practical implementation of FPC-like algorithm
  • High decompression latency (8 cycles)

Outline
• Motivation & Background
• Key Idea & Our Mechanism
• Evaluation
• Conclusion

Methodology
• Simulator
– x86 event-driven simulator based on Simics [Magnusson+, Computer’02]
• Workloads
– SPEC 2006 benchmarks, TPC, Apache web server
– 1-4 core simulations for 1 billion representative instructions
• System Parameters
– L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]
– 4 GHz, x86 in-order core, 512 kB - 16 MB L2, simple memory model (300-cycle latency for row-misses)

Compression Ratio: BΔI vs. Prior Work
[Chart: Compression Ratio per benchmark and geometric mean for ZCA, FVC, FPC, and BΔI; BΔI geometric mean: 1.53]
SPEC 2006, databases, web workloads, 2 MB L2
BΔI achieves the highest compression ratio

Single-Core: IPC and MPKI
[Two charts: Normalized IPC and Normalized MPKI vs. L2 cache size, Baseline (no compr.) vs. BΔI]
L2 cache size: 512 kB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB
Normalized IPC, data labels for BΔI: 8.1%, 4.9%, 5.1%, 5.2%, 3.6%, 5.6%
Normalized MPKI, data labels for BΔI: 16%, 24%, 21%, 13%, 19%, 14%
BΔI achieves the performance of a 2X-size cache
Performance improves due to the decrease in MPKI

Multi-Core Workloads
• Application classification based on
– Compressibility: effective cache size increase (Low Compr. (LC) < 1.40, High Compr. (HC) >= 1.40)
– Sensitivity: performance gain with more cache (Low Sens. (LS) < 1.10, High Sens. (HS) >= 1.10; 512 kB -> 2 MB)
• Three classes of applications:
– LCLS, HCLS, HCHS; no LCHS applications
• For 2-core: random mixes of each possible class pair (20 each, 120 total workloads)

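The thresholds and the workload count above can be checked with a short sketch. It assumes the three classes are LCLS, HCLS, and HCHS (the combinations that remain once LCHS is excluded) and that a class may be paired with itself; the function name `classify` is hypothetical.

```python
from itertools import combinations_with_replacement


def classify(compr_ratio, sensitivity):
    """Map (effective cache size increase, performance gain) to a class label."""
    compr = "HC" if compr_ratio >= 1.40 else "LC"
    sens = "HS" if sensitivity >= 1.10 else "LS"
    return compr + sens


# Assumption: these are the three classes that actually occur.
classes = ["LCLS", "HCLS", "HCHS"]

# Unordered pairs, allowing both cores to run applications of the same class.
pairs = list(combinations_with_replacement(classes, 2))
total_workloads = 20 * len(pairs)

print(classify(1.53, 1.25))
print(len(pairs), total_workloads)
```

Three classes give six unordered pairs, and 20 random mixes per pair accounts for the 120 workloads stated on the slide.
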
Multi-Core: Weighted Speedup
BΔI performance improvement is the highest (9.5%)
If at least one application is sensitive, then the performance improves

Other Results in Paper
• IPC comparison against upper bounds
– BΔI almost achieves performance of the 2X-size cache
• Sensitivity study of having more than 2X tags
– Up to 1.98 average compression ratio
• Effect on bandwidth consumption
– 2.31X decrease on average
• Detailed quantitative comparison with prior work
• Cost analysis of the proposed changes
– 2.3% L2 cache area increase

Conclusion
• A new Base-Delta-Immediate compression mechanism
• Key insight: many cache lines can be efficiently represented using base + delta encoding
• Key properties:
– Low latency decompression
– Simple hardware implementation
– High compression ratio with high coverage
• Improves cache hit ratio and performance of both single-core and multi-core workloads
– Outperforms state-of-the-art cache compression techniques: FVC and FPC

Backup Slides

B+Δ: Compression Ratio
[Chart: Compression Ratio per benchmark and geometric mean]
SPEC 2006, databases, web workloads, 2 MB L2 cache
Good average compression ratio (1.40)
But some benchmarks have low compression ratio

Single-Core: Effect on Cache Capacity
[Chart: Normalized IPC per benchmark and geometric mean, fixed L2 cache latency; configurations: 512kB-2way, 512kB-4way-BΔI, 1MB-4way, 1MB-8way-BΔI, 2MB-8way, 2MB-16way-BΔI, 4MB-16way; data labels: 2.3%, 1.7%, 1.3%]
BΔI achieves performance close to the upper bound

Multiprogrammed Workloads - I

Cache Compression Flow
[Diagram: CPU, L1 Data Cache (uncompressed), L2 Cache (compressed), Memory (uncompressed); edges labeled Hit, Miss, and Writeback, with Compress and Decompress steps between the L1 and L2 caches and between the L2 cache and memory]

Example of Base+Delta Compression
• Narrow values (taken from h264ref):