BaseDeltaImmediate Compression Practical Data Compression for OnChip Caches

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches Gennady Pekhimenko§ Vivek Seshadri § Onur Mutlu § Michael A. Kozuch † Phillip B. Gibbons † Todd C. Mowry § § Carnegie Mellon University † Intel Labs Pittsburgh Published at PACT 2012 Presented by Marc-Philippe Bartholomä

Problem & Goal 2

Large Cache Improves Performance n n Larger capacity ⇒ fewer misses ⇒ better performance Larger capacity ⇒ fewer off-chip cache misses q q Avoids memory bandwidth bottleneck Especially important for multi-core with shared memory Idea: Compress the data in But increasing capacity by scaling the conventional caches to save on hardware costsdesign: n n n Slower caches More power consumption More area required 3

Goals of Cache Compression n Compression/decompression need to be very fast q n n Decompression is on the critical path Simple compression logic avoids large power and area costs Must compress the data effectively q Otherwise there isn’t much gain in capacity 4

Background 5

Data Patterns in Applications: Zeroes 0 x 00000000 16 -byte cache line 6

Data Pattern: Repeated Values 0 x. CAFE 4 A 11 7

Data Pattern: Narrow Values have more storage allocated than necessary 0 x 000000 CA 0 x 000000 FE 0 x 0000004 A 0 x 00000011 8

Data Patterns are Frequent 43% of application cache lines can be compressed on average Narrow Values are included in Other Patterns 9

Data Patterns: Low Dynamic Range The values are larger than the difference between them 0 x 4100004 0 x 41000108 0 x 4100004 C 0 x 41000130 0 x 000000 CA 0 x 000000 FE 0 x 0000004 A 0 x 00000011 0 x. CAFE 4 A 11 0 x 00000000 10

Base+Delta Encoding 0 x 4100004 0 x 000000 CA 0 x 00000011 0 x. CAFE 4 A 11 0 x 00000000 0 x 41000108 +0 x 0 +0 x 104 0 x 000000 FE +0 x. B 9 +0 x. ED 0 x. CAFE 4 A 11 +0 x 0000 +0 x 0 0 x 4100004 C +0 x 48 +0 x 12 C 0 x 000004 A +0 x 39 0 x. CAFE 4 A 11 +0 x 0 0 x 00000011 +0 x 0 0 x. CAFE 4 A 11 +0 x 0 0 x 41000130 +0 x 0000 11

Novelty 12

Novelty n Compress on cache line granularity q Previous approaches work on individual words n View data patterns as Low Dynamic Range n Apply Base+Delta compression to caches q q Instead of general purpose compression Instead of special case handling for some patterns 13

Key Approach and Ideas 14

Base+Delta Compression 0 x. C 04039 C 0 + 0 x 38 = 0 x. C 04039 F 8 n Fast decompression (vector addition) n Simple hardware (addition/subtraction and comparison) n Effectively compresses observed patterns 15

Room for Improvement n Multiple bases allow compression of more cache lines n Need to encode multiple bases in compressed line 16

Finding Bases? 0 x 0000 0 x. A 478 0 x 000 B 0 x 0001 0 x. A 438 0 x 000 A 0 x 000 B 0 x. A 438 0 x 0000 0 x 000 B 0 x. A 478 0 x 0001 0 x. A 438 0 x 000 A 0 x 000 B 0 x. A 438 0 x 0000 0 x 000 B 0 x 0001 0 x 000 A 0 x. A 438 0 x. A 478 0 x. A 438 n Gets more difficult with more bases 17

Base Delta Immediate (BΔI) n 2 bases q q q n 1 is always 0 x 0000 ⇒ no need to save 1 is arbitrary Values with respect to the zero base are the “immediates” Slightly better than Base+Delta with 2 arbitrary bases q Which in turn compresses better than Base+Delta with other number of bases 18

Mechanism 19

Finding base for BΔI 0 x 0000 0 x 000 B 0 x. A 438 0 x 0001 0 x. A 470 0 x 000 A 0 x 000 B 0 x. A 478 Try compression with base 0 +0 x 0 B 0 x. A 438 +0 x 01 0 x. A 470 +0 x 0 A +0 x 0 B Choose first non-compressible Compress the rest +0 x 00 +0 x 32 +0 x 0 B +0 x 00 +0 x 01 +0 x 0 A +0 x 0 B 0 x. A 478 +0 x 40 Add nontrivial base 0 x. A 438 +0 x 00 +0 x 0 B +0 x 00 +0 x 01 +0 x 32 +0 x 0 A +0 x 0 B +0 x 40 20

Attribute Deltas to Bases 0 x. A 438 +0 x 00 +0 x 0 B +0 x 00 +0 x 01 +0 x 32 +0 x 0 A +0 x 0 B +0 x 40 Generate and save a bitmap 0 0 1 Note: Decompression becomes masked vector addition 21

Determining Base and Delta Sizes A 438 00 0 B 00 4 bytes 32 0 A 0 B 40 FC 5 A 03 7 A 44 AB 0 C 82 1 byte 2 bytes 09 A 40178 01 0000 000 B 0001 A 6 C 0 000 A 000 B C 178 2 byte 22

Cases Determining Base and Delta Sizes 0 R E P E A T E D U N C O M P R E S S E D Byte Usage ■ Base n n 8 ■/■ Delta 16 ■ Free 32 24 ■ Special Zero line and repeated values are special cases Everything is attempted in parallel and shortest is chosen 23

Changes in Cache Organization n Double the amount of cache tags Add encoding bits for cases and bitmask for base determination Segment the cache lines and add segment pointers to the tags 24

Key Results: Methodology and Evaluation 25

Methodology n x 86 -based Simulation n 1 -4 cores n SPEC 2006, TPC-H and Apache web server workloads n L 1/L 2/L 3 cache latencies from CACTI [Thoziyoor+, ISCA’ 08] 26

BΔI vs Baseline ⇒ Capacity Nearly Doubled Instructions per Cycle Misses per Kilo Instruction 27

BΔI vs Other Approaches ⇒ Best Comp. Ratio n n n ZCA (Zero-Content Augmented cache): exploits only zeroes FVC (Frequent Value Compression): zeroes and common words FPC (Frequent Pattern Compression): patterns including repeated values and narrow values 28

Multi-Core Profits Even More LC/HC: low/high compressibility, LS/HS low/high cache size sensitivity (uses 2 cores, 2 MB L 2 cache) Missing: LCHS due to absence in sample workloads 29

Summary 30

Summary n n Goal: Increase cache capacity using data compression at lower cost Key Insight: A significant fraction (43%) of real-world cache lines can be compressed Key Mechanism: Base+Delta encoding fits well to exploit low dynamic range patterns Key Results: BΔI yields nearly the performance gain of a cache with double capacity without the same costs in area and power q q 5. 1% avg. performance increase on single-core over baseline 9. 5% avg. performance increase on dual-core over baseline 31

Strengths 32

Strengths of the Paper n n Novel approach leading to significant improvement Thorough analysis and evaluation of patterns, previous approaches and variants n Elegant solution and principled design n Easy-to-understand well-structured paper n Transparent to the OS and applications n Compression mechanism is predictable for the user 33

Weaknesses 34

Weaknesses/Limitations of the Paper n Requires double amount of cache tags q n n n Potential bottleneck Adds the possibility of eviction when writing with cache-hit Because real capacity is unknown, it is harder to optimize applications Missing category for multi-core workload Analysis of cache size only for Base+Delta (no bitmap) Compressed data patterns don’t capture floating point values Too much latency for L 1 cache 35

Thoughts and Ideas 36

Extensions n n n Special case with only the 0 base Base Finding approach generalizes to 2 arbitrary bases q Analyze the benefit of switching between BΔI and Base+Delta with 1 or 2 bases To save on cache tags you could load 2 contiguous cache lines Include base bitmap in the deltas For repeated values of size up to 4 bytes, you could save them using the bitmask for base attribution 37

Takeaways 38

Key Takeaways n Paper is a prime example of principled design q q q n Carefully examines the potential Thoroughly analyzes the tradeoffs Picks the best variant Data compression is viable for on-chip caches 39

Questions 40

Discussion n Cache Replacement Policy q q Paper uses slightly modified LRU and leaves detailed study for future work For uncompressed caches: theoretical optimal cache replacement policy (adapted from Computer Systems 2018): For eviction: Choose entry that will not be referenced again for the longest period of time. q q q Is the shown CRP also optimal for caches with compression? Why or why not? What aspects need to be considered to adapt it? Ideas about what an actual cache replacement policy should do? 41

Discussion n Patterns in floating point values? Exploitable with BΔI? Systems Programming and Computer Architecture 2017 1 8 bits 23 bits 42

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches Gennady Pekhimenko§ Vivek Seshadri § Onur Mutlu § Michael A. Kozuch † Phillip B. Gibbons † Todd C. Mowry § § Carnegie Mellon University † Intel Labs Pittsburgh Published at PACT 2012 Presented by Marc-Philippe Bartholomä