Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch*
Motivation: “Memory Wall”
Source: Computer Architecture: From Microprocessors to Supercomputers, 2011
Main memory latency has a significant effect on performance (e.g., 300+ cycles)
Motivation: Caching
Intel Core i7, IBM Power7+
Cache capacity is hard to increase due to area and power limitations
Motivation for Data Compression
Significant redundancy in in-cache data: 0x0000000B 0x00000003 0x00000004 …
How can we exploit this redundancy?
  – Data compression helps
  – Provides the effect of a larger cache without making it physically larger
Background on Cache Compression
[Figure: CPU, L1 cache (uncompressed data, hit in 1-2 cycles), and L2 cache (compressed data, hit in ~15 cycles), with decompression between L2 and L1]
• Key requirements:
  – Fast (low decompression latency)
  – Simple (avoid complex hardware changes)
  – Effective (good compression ratio, e.g., 1.5 for a 2 MB cache means an effective size of 3 MB)
Outline
• Motivation & Background
• Shortcomings of Prior Work
• Base-Delta-Immediate: Key Idea
• Base-Delta-Immediate: Implementation
• Evaluation
• Conclusion
• Future Work
Why Not Software Compression?
• Lossless data compression
  – Lempel-Ziv
    • DEFLATE (gzip, png)
    • LZW (gif)
  – Huffman coding
• Lossy data compression
  – Multimedia (audio, video)
• Effective, but slow and complex in hardware
  – 100+ cycles vs. 10-15 cycles for L2/L3 cache hits
Shortcomings of Prior Work
[Table: compression mechanisms (Zero, Frequent Value, Frequent Pattern, our proposal: Base-Delta-Immediate) compared on decompression latency, complexity, and compression ratio]
Zero Value Compression
• Idea: one bit to encode zero value cache lines
• Advantages:
  – Low decompression latency
  – Low complexity
• Disadvantages:
  – Low average compression ratio
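A minimal sketch of the one-bit idea, assuming 32-byte lines as used elsewhere in the deck. The names `zero_compress` and `zero_decompress` are hypothetical and only illustrate the encoding, not the hardware design.

```python
LINE_SIZE = 32  # bytes; assumed to match the 32-byte lines shown in later slides


def zero_compress(line: bytes):
    """Return (zero_bit, payload). An all-zero line keeps only the bit."""
    if len(line) != LINE_SIZE:
        raise ValueError("expected a full cache line")
    if not any(line):
        return 1, b""      # whole line represented by a single bit
    return 0, line         # any other line is stored uncompressed


def zero_decompress(zero_bit: int, payload: bytes) -> bytes:
    """Rebuild the original line from the bit and the (possibly empty) payload."""
    return bytes(LINE_SIZE) if zero_bit else payload
```

Only lines that are entirely zero benefit, which is why the average compression ratio stays low.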
Frequent Value Compression
Idea: encode cache lines based on frequently occurring values
Frequent Values:
  00 - 0x00000000
  01 - 0x00000001
  10 - 0x000000FF
  11 - not frequent
Example: 0x00000001 0x00000000 0x000000FF 0xABCDEFFF is encoded as 01, 00, 10, 11 0xABCDEFFF
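The table lookup can be sketched as follows. This is an illustration only: `fvc_encode` and `fvc_size_bits` are hypothetical names, and treating code 00 as the all-zero 32-bit word is an assumption.

```python
# 2-bit codes for frequent 32-bit values, following the slide's table.
# Assumption: code 00 stands for the all-zero word.
FREQUENT_VALUES = {0x00000000: 0b00, 0x00000001: 0b01, 0x000000FF: 0b10}
NOT_FREQUENT = 0b11


def fvc_encode(words):
    """Encode each word as (code, literal); literal is None for frequent values."""
    encoded = []
    for w in words:
        if w in FREQUENT_VALUES:
            encoded.append((FREQUENT_VALUES[w], None))
        else:
            encoded.append((NOT_FREQUENT, w))  # keep the full 32-bit word
    return encoded


def fvc_size_bits(encoded):
    """2 bits per word, plus 32 bits for every word stored literally."""
    return sum(2 + (32 if literal is not None else 0) for _, literal in encoded)
```

For the four-word example on the slide, three words shrink to their 2-bit codes and only 0xABCDEFFF is stored in full.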
Frequent Pattern Compression
Idea: encode cache lines based on frequently occurring patterns, e.g., first half of a word is zero
Frequent Patterns:
  000 – All zeros
  001 – First half zeros
  010 – Second half zeros
  011 – Repeated bytes
  100 – All ones
  …
  111 – Not a frequent pattern
Example: 0x00000001 0x00000000 0xFFFFFFFF 0xABCDEFFF is encoded as 001 0x0001, 000, 011 0xFF, 111 0xABCDEFFF
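A rough sketch of per-word pattern matching, covering only the patterns named on the slide. `fpc_encode_word` is a hypothetical helper. The elided patterns and code 100 are left out, and the check order (repeated bytes before anything else that would match 0xFFFFFFFF) is an assumption chosen to reproduce the slide's example.

```python
def fpc_encode_word(w: int):
    """Return (3-bit code, payload, payload_bits) for one 32-bit word."""
    hi, lo = w >> 16, w & 0xFFFF
    low_byte = w & 0xFF
    if w == 0:
        return 0b000, None, 0          # all zeros: prefix only
    if hi == 0:
        return 0b001, lo, 16           # first half zeros: keep the low half
    if lo == 0:
        return 0b010, hi, 16           # second half zeros: keep the high half
    if w == low_byte * 0x01010101:
        return 0b011, low_byte, 8      # repeated bytes: keep one byte
    return 0b111, w, 32                # not a frequent pattern: keep the word
```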
Key Data Patterns in Real Applications
Zero Values: initialization, sparse matrices, NULL pointers
  0x00000000 …
Repeated Values: common initial values, adjacent pixels
  0x000000C0 …
Narrow Values: small values stored in a big data type
  0x000000C0 0x000000C8 0x000000D0 0x000000D8 …
Other Patterns: pointers to the same memory region
  0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 …
How Common Are These Patterns?
SPEC2006, databases, web workloads, 2 MB L2 cache
“Other Patterns” include Narrow Values
[Figure: cache coverage (%) per benchmark and arithmetic mean, broken down into Zero, Repeated Values, and Other Patterns]
43% of the cache lines belong to key patterns
Key Data Patterns in Real Applications
Low Dynamic Range: Differences between values are significantly smaller than the values themselves
Key Idea: Base+Delta (B+Δ) Encoding
32-byte Uncompressed Cache Line (4-byte values): 0xC04039C0 0xC04039C8 0xC04039D0 … 0xC04039F8
12-byte Compressed Cache Line: Base 0xC04039C0 (4 bytes), deltas 0x00 0x08 0x10 … 0x38 (1 byte each)
20 bytes saved
Fast Decompression: vector addition
Simple Hardware: arithmetic and comparison
Effective: good compression ratio
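The arithmetic on this slide can be checked with a short sketch. It assumes the base is the first value in the line and that deltas are stored as signed integers. Both function names are hypothetical.

```python
def base_delta_compress(values, delta_bytes=1):
    """Return (base, deltas) if every delta fits in delta_bytes, else None."""
    base = values[0]                        # assumption: first value is the base
    half = 1 << (8 * delta_bytes - 1)       # signed range is [-half, half)
    deltas = [v - base for v in values]
    if all(-half <= d < half for d in deltas):
        return base, deltas
    return None


def compressed_size_bytes(n_values, base_bytes, delta_bytes):
    """One base plus one delta per value."""
    return base_bytes + n_values * delta_bytes
```

With eight 4-byte pointers spaced 8 bytes apart, the deltas run from 0x00 to 0x38, so the line shrinks from 32 bytes to 4 + 8 × 1 = 12 bytes.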
Can We Do Better?
• Uncompressible cache line (with a single base): 0x09A40178 0x0000 0x09A4A838 0x0000000B …
• Key idea: Use more bases, e.g., two instead of one
• Pro:
  – More cache lines can be compressed
• Cons:
  – Unclear how to find these bases efficiently
  – Higher overhead (due to additional bases)
B+Δ with Multiple Arbitrary Bases
[Figure: compression ratio (1 to 1.8) versus # of bases (1, 2, 3, 4, 8, 10, 16); # of bases is fixed]
2 bases – the best option based on evaluations
How to Find Two Bases Efficiently?
1. First base - first element in the cache line (Base+Delta part)
2. Second base - implicit base of 0 (Immediate part)
Advantages over 2 arbitrary bases:
  – Better compression ratio
  – Simpler compression logic
Base-Delta-Immediate (BΔI) Compression
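A minimal sketch of the two-base scheme, following the slide's rule that the first base is the first element and the second base is an implicit zero. The per-value mask, the signed delta range, and the name `bdi_compress` are assumptions made for illustration. The sample line mixes values borrowed from earlier slides and is not taken from a real workload.

```python
def fits(delta: int, nbytes: int) -> bool:
    """True if delta is representable as a signed nbytes-wide integer."""
    half = 1 << (8 * nbytes - 1)
    return -half <= delta < half


def bdi_compress(values, delta_bytes=1):
    """Return (base, mask, deltas) or None.

    mask[i] == 0: value i is an immediate (delta from the implicit base 0).
    mask[i] == 1: value i is stored as a delta from the first element.
    """
    base = values[0]
    mask, deltas = [], []
    for v in values:
        if fits(v, delta_bytes):             # immediate part
            mask.append(0)
            deltas.append(v)
        elif fits(v - base, delta_bytes):    # base+delta part
            mask.append(1)
            deltas.append(v - base)
        else:
            return None                      # not compressible at this delta width
    return base, mask, deltas
```

A line that interleaves pointers with small integers defeats a single base but compresses here, because the small integers fall back to the zero base.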
B+Δ (with two arbitrary bases) vs. BΔI
[Figure: compression ratio per benchmark and geometric mean, B+Δ (2 bases) versus BΔI]
Average compression ratio is close, but BΔI is simpler
BΔI Implementation
• Decompressor Design
  – Low latency
• Compressor Design
  – Low cost and complexity
• BΔI Cache Organization
  – Modest complexity
BΔI Decompressor Design
[Figure: the compressed cache line (B0, Δ0, Δ1, Δ2, Δ3) is expanded by vector addition, Vi = B0 + Δi, into the uncompressed cache line (V0, V1, V2, V3)]
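Decompression is one addition per value, which hardware can perform for all values at once. The sketch below models that step in software. The `mask` argument (0 selects the implicit zero base, 1 selects B0) and the function name are assumptions and do not appear on the slide.

```python
def bdi_decompress(base, mask, deltas, value_bytes=4):
    """Rebuild values as (selected base + delta), wrapped to the value width."""
    wrap = (1 << (8 * value_bytes)) - 1
    return [((base if m else 0) + d) & wrap for m, d in zip(mask, deltas)]
```

Each output depends only on the base and its own delta, so the additions are independent of one another.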
BΔI Compressor Design
[Figure: the 32-byte uncompressed cache line feeds eight compression units (CUs), each producing a compression flag and compressed cache line (CFlag & CCL):
  – 8-byte B0, 1-byte Δ
  – 8-byte B0, 2-byte Δ
  – 8-byte B0, 4-byte Δ
  – 4-byte B0, 1-byte Δ
  – 4-byte B0, 2-byte Δ
  – 2-byte B0, 1-byte Δ
  – Zero
  – Rep. Values
Compression selection logic (based on compr. size) outputs the final Compression Flag & Compressed Cache Line]
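The selection step can be sketched as "try every unit, keep the smallest result". The code assumes little-endian values, signed deltas, a 1-byte encoding for zero lines, and an 8-byte encoding for repeated values. These sizes and all names are illustrative assumptions, and the per-value mask bits are not counted.

```python
LINE = 32  # bytes per uncompressed cache line
CONFIGS = [(8, 1), (8, 2), (8, 4), (4, 1), (4, 2), (2, 1)]  # (base bytes, delta bytes)


def _fits(d, n):
    half = 1 << (8 * n - 1)
    return -half <= d < half


def compressed_size(line, base_bytes, delta_bytes):
    """Size in bytes if this unit can compress the line, else None."""
    vals = [int.from_bytes(line[i:i + base_bytes], "little")
            for i in range(0, LINE, base_bytes)]
    base = vals[0]
    ok = all(_fits(v, delta_bytes) or _fits(v - base, delta_bytes) for v in vals)
    return base_bytes + delta_bytes * len(vals) if ok else None


def select_encoding(line):
    """Return (encoding name, size in bytes) of the smallest encoding."""
    if not any(line):
        return "zero", 1
    if line == line[:8] * 4:
        return "repeated", 8
    best = ("uncompressed", LINE)
    for b, d in CONFIGS:
        size = compressed_size(line, b, d)
        if size is not None and size < best[1]:
            best = (f"base{b}-delta{d}", size)
    return best
```

In software the units run one after another. The figure shows them side by side, each with its own output.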
BΔI Compression Unit: 8-byte B0 1-byte Δ
[Figure: the 32-byte uncompressed cache line is split into 8-byte values V0, V1, V2, V3; B0 = V0; Δi = Vi - B0; each Δi is checked against the 1-byte range. Is every element within 1-byte range? Yes: output B0, Δ0, Δ1, Δ2, Δ3. No: no compressed output from this unit]
BΔI Cache Organization
Conventional 2-way cache with 32-byte cache lines:
  – Tag Storage: Tag0, Tag1 per set (Way 0, Way 1)
  – Data Storage: Data0, Data1 per set, 32 bytes each
BΔI: 4-way cache with 8-byte segmented data:
  – Tag Storage: Tag0 to Tag3 per set (Way 0 to Way 3), each with C bits (C - Compr. encoding)
  – Data Storage: segments S0 to S7 per set, 8 bytes each
Twice as many tags
Tags map to multiple adjacent segments
2.3% overhead for 2 MB cache
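To make the segment bookkeeping concrete, the sketch below counts how many 8-byte segments a compressed line occupies and whether a group of lines fits in one set. It assumes the figure's layout of four tags and eight segments per set. The function names are hypothetical, and the replacement policy is not modeled.

```python
import math

SEGMENT_BYTES = 8      # segment size from the slide
SEGMENTS_PER_SET = 8   # S0..S7: same data capacity as two 32-byte lines
TAGS_PER_SET = 4       # twice as many tags as the 2-way baseline


def segments_needed(compressed_size: int) -> int:
    """A line occupies a whole number of adjacent segments."""
    return math.ceil(compressed_size / SEGMENT_BYTES)


def fits_in_set(compressed_sizes) -> bool:
    """True if the lines fit both the tag budget and the segment budget."""
    used = sum(segments_needed(s) for s in compressed_sizes)
    return len(compressed_sizes) <= TAGS_PER_SET and used <= SEGMENTS_PER_SET
```

Four 12-byte lines use two segments each and fill the set exactly. Two uncompressed 32-byte lines also fill it, which matches the conventional 2-way case.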
Methodology
• Simulator
  – x86 event-driven simulator based on Simics [Magnusson+, Computer’02]
• Workloads
  – SPEC2006 benchmarks, TPC, Apache web server
  – 1-4 core simulations for 1 billion representative instructions
• System Parameters
  – L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]
  – 4 GHz, x86 in-order core, cache size (512 kB - 16 MB), simple memory model (300-cycle latency for row misses)
Key Metrics
• IPC – instructions per cycle
• MPKI – misses per kilo instruction
• Compression ratio – effective cache size increase, e.g., 1.5 for a 2 MB cache means an effective size of 3 MB
• Weighted Speedup
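The metrics above can be written out as small helpers. The weighted-speedup formula did not survive in the slide text, so the version below uses the commonly used definition (sum over applications of shared-run IPC divided by alone-run IPC) as an assumption. All function names are hypothetical.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand instructions."""
    return misses / (instructions / 1000)


def effective_cache_size(physical_mb: float, compression_ratio: float) -> float:
    """Effective capacity implied by a given compression ratio."""
    return physical_mb * compression_ratio


def weighted_speedup(ipc_shared, ipc_alone) -> float:
    """Assumed definition: sum of IPC_shared[i] / IPC_alone[i] over all applications."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))
```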
Compression Ratio: BΔI vs. Prior Work
SPEC2006, databases, web workloads, 2 MB L2
[Figure: compression ratio per benchmark and geometric mean for ZCA, FVC, FPC, and BΔI; BΔI: 1.53]
BΔI achieves the highest compression ratio
Single-Core: IPC and MPKI
[Figure: normalized IPC and normalized MPKI versus L2 cache size (512 kB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB), Baseline (no compr.) versus BΔI; data labels on the IPC panel: 3.6%, 5.6%, 4.9%, 5.1%, 5.2%, 8.1%; data labels on the MPKI panel: 16%, 24%, 21%, 13%, 19%, 14%]
BΔI achieves the performance of a 2X-size cache
Performance improves due to the decrease in MPKI
Multi-Core Workloads
• Application classification based on:
  – Compressibility: effective cache size increase (Low Compr. (LC) < 1.40, High Compr. (HC) >= 1.40)
  – Sensitivity: performance (IPC) gain with more cache (Low Sens. (LS) < 1.10, High Sens. (HS) >= 1.10; 512 kB -> 2 MB)
• Three classes of applications:
  – LCLS, HCLS, HCHS; no LCHS applications
• For 2-core: random mixes of each possible class pair (20 each, 120 total workloads)
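The thresholds and the workload count can be reproduced with a short sketch. It assumes the three observed classes are LCLS, HCLS, and HCHS, which is what makes six class pairs and 120 workloads add up. The function name and the sample inputs are hypothetical.

```python
from itertools import combinations_with_replacement


def classify(compression_ratio: float, ipc_gain: float) -> str:
    """Label an application using the slide's thresholds (1.40 and 1.10)."""
    compr = "HC" if compression_ratio >= 1.40 else "LC"
    sens = "HS" if ipc_gain >= 1.10 else "LS"
    return compr + sens


CLASSES = ["LCLS", "HCLS", "HCHS"]            # assumed set of observed classes
PAIRS = list(combinations_with_replacement(CLASSES, 2))
TOTAL_WORKLOADS = len(PAIRS) * 20             # 20 random mixes per class pair
```

Three classes give six unordered pairs (including same-class pairs), and 6 × 20 = 120 matches the total on the slide.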
Multi-Core: Weighted Speedup
BΔI performance improvement is the highest (9.5%)
If at least one application is sensitive, then the performance improves
Future Work
• New compression-aware replacement policies
  – Minimal-LRU
  – Cost analysis based on compressed size
• Hardware evaluation of the proposed changes
  – Verilog model
• Main memory (DRAM) compression
  – Increases capacity
  – Improves performance
  – Decreases bandwidth and energy consumption
Conclusion
• Base-Delta-Immediate compression - a new compression mechanism
• Key insight: many cache lines can be efficiently represented using base + delta encoding
• Key properties:
  – Low latency decompression
  – Simple hardware implementation
  – High compression ratio with high coverage
• Improves cache hit ratio and performance of both single-core and multi-core workloads
  – Outperforms state-of-the-art cache compression techniques: FVC and FPC
Backup Slides
Single-Core: Effect on Cache Capacity
Fixed L2 cache latency
[Figure: normalized IPC per benchmark and geometric mean for 512kB-2way, 512kB-4way-BΔI, 1MB-4way, 1MB-8way-BΔI, 2MB-8way, 2MB-16way-BΔI, and 4MB-16way; data labels: 2.3%, 1.7%, 1.3%]
BΔI achieves performance close to the upper bound
B+Δ: Compression Ratio
SPEC2006, databases, web workloads, 2 MB L2 cache
[Figure: compression ratio per benchmark and geometric mean]
Good average compression ratio (1.40)
But some benchmarks have low compression ratio
Multiprogrammed Workloads - I
Cache Compression Flow
[Figure: CPU, L1 data cache (uncompressed), L2 cache (compressed), and memory (uncompressed), connected by hit, miss, and writeback paths, with compress and decompress steps on the paths into and out of L2]
Example of Base+Delta Compression
• Narrow values (taken from h264ref):