Thesaurus: Efficient Cache Compression via Dynamic Clustering Amin Ghasemazar, Prashant Nair, Mieszko Lis The University of British Columbia ASPLOS 2020

Executive summary
How can we improve cache effectiveness by avoiding redundancy across nearly identical cachelines?
Observation: the working set contains clusters of nearly identical cachelines.
Problem: how can this be used to compress the cache?
Key idea:
- group nearly identical cachelines via hardware-level dynamic clustering
- store each cluster as one clusteroid line plus smaller deltas
Results: higher compression ratio (up to 9.9×, 2.25× Gmean)

Limitations of existing methods
[Figure: data array holding cachelines A, B, and C. Cacheline A contains the words 0x3FE00000, 0x3FE000FF, 0x3FE00003, 0x3FE000D8, … and is stored compressed as the base 0x3FE00000 plus the deltas 00, FF, 03, D8, … (labeled Δ2, Δ3, …).]

Motivation: compressed caches
Prior work: compression within cachelines
[Figure: data array holding cachelines B, C, D, and E. One cacheline compresses well within the line, as 0x3FE00000 plus the deltas 00, FF, 03, D8, …. Cacheline D holds 0x3FE00000 0xC04000FF 0xBF350003 0x3F490000 … and cacheline E holds 0x3FE00000 0xC04000FF 0xBF3500FF 0x3F490000 …. D and E differ in a single byte, yet within-line compression cannot exploit this: a missed compression OPPORTUNITY.]
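The within-cacheline scheme on these slides can be sketched roughly as follows. This is a minimal illustration, not the exact encoding of any published compressor: the function names, the choice of the first word as the base, and the one-byte delta limit are all assumptions made for the example.

```python
def base_delta_compress(words, delta_bytes=1):
    """Toy within-line base+delta: the first word is the base (an assumption),
    and every word must be representable as base + a small unsigned delta."""
    base = words[0]
    limit = 1 << (8 * delta_bytes)
    deltas = [w - base for w in words]
    if all(0 <= d < limit for d in deltas):
        return base, deltas
    return None  # the line does not fit this format


def base_delta_decompress(base, deltas):
    """Rebuild the original words from the base and the deltas."""
    return [base + d for d in deltas]


# Words taken from the slides: line A compresses, line D does not.
line_a = [0x3FE00000, 0x3FE000FF, 0x3FE00003, 0x3FE000D8]
line_d = [0x3FE00000, 0xC04000FF, 0xBF350003, 0x3F490000]

compressed_a = base_delta_compress(line_a)
compressed_d = base_delta_compress(line_d)
```

Here `compressed_a` holds the base 0x3FE00000 with the deltas 00, FF, 03, D8, while `compressed_d` is `None`: the words inside line D are too far apart, even though another line in the cache may be almost identical to D.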

Goal: delta compression in vertical fashion
New method: delta across cachelines
[Figure: cacheline D holds 0x3FE00000 0xC04000FF 0xBF350003 0x3F490000 …. The nearly identical cacheline E (0x3FE00000 0xC04000FF 0xBF3500FF 0x3F490000 …) is encoded as metadata plus the differing byte FF. Another cacheline stays compressed within the line as 0x3FE00000 plus the deltas 00, FF, 03, D8, ….]
How to SEARCH for near cachelines? Exhaustive search? Deduplication hashing?

Goal: delta compression in vertical fashion
Prior work: deduplication across cachelines
[Figure: cacheline C holds 0x3FE00000 0xC04000FF 0xBF350003 0x3F490000 …. Cacheline D is identical, so h(C) = h(D) and deduplication hashing finds the match. Cacheline E is nearly identical (0xBF3500FF in place of 0xBF350003), so h(C) ≠ h(E) and deduplication hashing misses it.]
How to SEARCH for near cachelines? Exhaustive search or deduplication hashing?
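A short sketch of why deduplication hashing finds identical lines but not nearly identical ones. The slides do not say which hash function h is; SHA-256 is used here purely as a stand-in, and the helper names are hypothetical.

```python
import hashlib


def line_bytes(words):
    """Pack 32-bit words into a byte string (big-endian, for readability)."""
    return b"".join(w.to_bytes(4, "big") for w in words)


def dedup_hash(line):
    """Stand-in for the deduplication hash h(); the real hash is not specified."""
    return hashlib.sha256(line).hexdigest()


c = line_bytes([0x3FE00000, 0xC04000FF, 0xBF350003, 0x3F490000])
d = line_bytes([0x3FE00000, 0xC04000FF, 0xBF350003, 0x3F490000])  # identical to C
e = line_bytes([0x3FE00000, 0xC04000FF, 0xBF3500FF, 0x3F490000])  # one byte differs

differing_bytes = sum(x != y for x, y in zip(c, e))
```

With a conventional hash, a single differing byte is enough to produce an unrelated digest, so C and E never meet in the same bucket even though they differ in one byte out of sixteen.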

Thesaurus

Thesaurus
• Efficient search for near cachelines
• How to encode a compressed cacheline
• Low HW overhead

Idea: delta compression across cachelines
[Figure: cacheline A (3FE00000 C04000FF BF350003 3F490000) is stored. Cacheline B (3FE00000 C04000FF BF350003 3F490000) is identical to A: don't store it. Cacheline C (3FA10000 3F4000FF BF3500FF 3F4900B1) is nearly identical to A: store only the deltas. C is stored compressed as a bitmap (shown as 0 1 0 0 0 1) plus the differing bytes A1 3F FF B1.]
• Q1: Are there enough nearly identical cachelines?
• Q2: How to find nearly identical cachelines quickly?
• Q3: Do we need an online and adaptive mechanism?
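The bitmap-plus-deltas encoding on this slide can be sketched as below. This is an illustrative model only: it assumes one bitmap bit per byte and stores the differing bytes verbatim, which matches the bytes shown on the slide (A1 3F FF B1) but is not claimed to be the exact hardware format.

```python
def words_to_bytes(words):
    """Pack 32-bit words into a byte string (big-endian, for readability)."""
    return b"".join(w.to_bytes(4, "big") for w in words)


def encode_delta(base, line):
    """Return (bitmap, deltas): one bit per byte marking where line differs
    from base, plus the differing bytes of line in order."""
    bitmap = [int(b != l) for b, l in zip(base, line)]
    deltas = bytes(l for b, l in zip(base, line) if b != l)
    return bitmap, deltas


def decode_delta(base, bitmap, deltas):
    """Rebuild the line: take the next delta byte where the bitmap is set,
    otherwise copy the base byte."""
    pending = iter(deltas)
    return bytes(next(pending) if bit else b for b, bit in zip(base, bitmap))


a = words_to_bytes([0x3FE00000, 0xC04000FF, 0xBF350003, 0x3F490000])
c = words_to_bytes([0x3FA10000, 0x3F4000FF, 0xBF3500FF, 0x3F4900B1])

bitmap, deltas = encode_delta(a, c)
```

For the 16 bytes shown, four bytes differ, so under these assumptions C costs a 16-bit bitmap plus 4 delta bytes instead of 16 data bytes.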

Opportunity: exactly identical
[Figure: effective LLC capacity per benchmark (bwaves, cactuBSSN, cam4, deepsjeng, exchange2, fotonik3d, gcc, imagick, lbm, leela, mcf, nab, namd, omnetpp, parest, perlbench, povray, roms, wrf, x264, xalancbmk, xz, and Gmean), comparing the baseline with storing exactly identical cachelines once.]
Exactly identical: only 1.3×.
Results reported for SPEC CPU'17.

Opportunity: nearly identical
[Figure: effective LLC capacity per SPEC CPU'17 benchmark (bwaves through xz, plus Gmean), comparing baseline, exactly identical, and nearly identical.]
Exactly identical: only 1.3×. Nearly identical: 2.5×.
Enabling near matches: a big compression OPPORTUNITY.

Opportunity: clustering for quick search
[Figure: number of clusters and number of cluster members found by DBSCAN, per SPEC CPU'17 benchmark (bwaves through xz).]
Cluster counts and sizes VARY PER WORKLOAD: hardcoding is impractical.
Results reported for SPEC CPU'17.

Opportunity: in-HW clustering challenges
• Cluster counts and sizes VARY PER WORKLOAD: needs ADAPTIVE cluster creation
• Cache content is INPUT DEPENDENT: clustering must happen at RUNTIME
• Clustering in HW must be quick and inexpensive: cannot use DBSCAN/k-means …
In HW: DYNAMIC and ADAPTIVE cluster creation without SCANNING

Thesaurus Clustering

Clustering based on locality-sensitive hashing
Locality-sensitive hashing (LSH): similar blocks get the same hash and therefore fall into the same cluster; dissimilar blocks get different hashes.

Locality-sensitive hashing: overview
1: Random bit sampling
[Figure: with a short hash, all three lines collide.
B = 3FE00000 C04000FF BF350003 3F490000, h(B) = 0F0
A = 3FE00000 C04000FF BF3500FF 3F490000, h(A) = 0F0
C = BFA900D8 3ED457FF 3E745003 C0400000, h(C) = 0F0]
The hash wrongly assumes that C is also similar to A. Bad hashes: increase the number of bits.

Locality-sensitive hashing: overview
1: Random bit sampling
[Figure: with a longer hash, only the similar lines collide.
B = 3FE00000 C04000FF BF350003 3F490000, h(B) = 00FF30
A = 3FE00000 C04000FF BF3500FF 3F490000, h(A) = 00FF30
C = BFA900D8 3ED457FF 3E745003 C0400000, h(C) = 08FEC0]
Better hash quality, but big hashes are not efficient.
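Random bit sampling can be sketched in a few lines. This is a minimal model under stated assumptions: a 64-byte (512-bit) line, a hypothetical `make_bit_sampler` helper, a fixed seed so the sampled positions are reproducible, and a 12-bit hash chosen only for the example.

```python
import random

LINE_BITS = 512  # 64-byte cacheline


def make_bit_sampler(num_bits, seed=0):
    """Pick num_bits random bit positions once; the hash of a line is simply
    the values of those bits, concatenated."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(LINE_BITS), num_bits))

    def h(line):
        value = int.from_bytes(line, "big")
        out = 0
        for p in positions:
            out = (out << 1) | ((value >> p) & 1)
        return out

    return positions, h


def flip_bit(line, position):
    """Return a copy of line with one bit inverted."""
    value = int.from_bytes(line, "big") ^ (1 << position)
    return value.to_bytes(len(line), "big")


positions, h = make_bit_sampler(12)
line = bytes(range(64))

# A change in an unsampled bit is invisible to the hash ...
unsampled = next(p for p in range(LINE_BITS) if p not in positions)
near = flip_bit(line, unsampled)
# ... while a change in a sampled bit always alters it.
far = flip_bit(line, positions[0])
```

The trade-off from the slides shows up directly: few sampled bits let unrelated lines collide by chance, while many sampled bits make the hash large and make it more likely that a small difference lands on a sampled bit.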

Locality-sensitive hashing: overview
2: Random projection
More efficient. Apply a random matrix ~ N(μ, σ): hash = projection matrix × cacheline. Distances are approximately preserved w.h.p. (with high probability).
Lots of storage and multiplication, big hashes: BAD for HW.
Johnson & Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Contemporary Mathematics 26: 189–206, 1984.
Frankl & Maehara, The Johnson-Lindenstrauss lemma and the sphericity of some graphs, J. Combinatorial Theory B, 44: 355–362, 1988.
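For reference, the distance-preservation property cited on this slide is usually stated as follows. This is the standard textbook form of the Johnson-Lindenstrauss lemma with a zero-mean, unit-variance Gaussian matrix; the symbols ε, n, d, k, and R are notation introduced here, not taken from the slides.

```latex
% Johnson-Lindenstrauss lemma (standard form).
% For any 0 < \varepsilon < 1 and any set X of n points in \mathbb{R}^d,
% there is a linear map f : \mathbb{R}^d \to \mathbb{R}^k with
% k = O(\varepsilon^{-2} \log n) such that, for all u, v \in X,
\[
  (1 - \varepsilon)\,\lVert u - v \rVert^2
  \;\le\; \lVert f(u) - f(v) \rVert^2
  \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2 .
\]
% A random map of the form
\[
  f(x) = \frac{1}{\sqrt{k}}\, R\, x, \qquad R_{ij} \sim \mathcal{N}(0, 1)
  \text{ i.i.d.},
\]
% satisfies the inequality with high probability.
```

The cost noted on the slide follows from the form of f: R has k × d real-valued entries to store, and every hash requires k × d multiplications.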

Locality-sensitive hashing in Thesaurus: novel HW-efficient random projection
Uses statistical computing techniques*: a sparse projection matrix with entries in {1, 0, −1}; hash = projection matrix × cacheline.
Cheap operations, small storage.
*Ping Li, Very Sparse Random Projections, 2006; Sean Fox, Random Projections for Scaling Machine Learning on FPGAs, 2016

Locality-sensitive hashing in Thesaurus: novel HW-efficient random projection
[Figure: efficient HW implementation of the projection; the only surviving labels are "*" and "> 1".]
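A sparse {1, 0, −1} projection needs only additions and subtractions, which is what makes it attractive in hardware. The sketch below illustrates that idea and is not the Thesaurus circuit: the 12-bit fingerprint, the fraction of non-zero entries, the byte-wise treatment of the line, and the reduction of each projected value to one bit by testing whether it is positive are all assumptions made for the example.

```python
import random


def make_projection(hash_bits, line_bytes=64, nonzero_prob=1 / 3, seed=0):
    """Build a hash_bits x line_bytes matrix with entries in {1, 0, -1}.
    +1 and -1 are equally likely; the rest of the entries are 0."""
    rng = random.Random(seed)

    def entry():
        r = rng.random()
        if r < nonzero_prob / 2:
            return 1
        if r < nonzero_prob:
            return -1
        return 0

    return [[entry() for _ in range(line_bytes)] for _ in range(hash_bits)]


def lsh(matrix, line):
    """Project the line with adds/subtracts only, then keep one bit per row:
    1 if the row sum is positive, 0 otherwise (an assumed quantization)."""
    fingerprint = 0
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, line):
            if coeff == 1:
                acc += byte
            elif coeff == -1:
                acc -= byte
        fingerprint = (fingerprint << 1) | (1 if acc > 0 else 0)
    return fingerprint


matrix = make_projection(hash_bits=12)
line = bytes(range(64))
fingerprint = lsh(matrix, line)
```

Because each output bit depends only on the sign of a sum over many bytes, a line that differs from another in a few bytes tends to keep most or all of its fingerprint bits, though a collision is never guaranteed.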

Thesaurus Architecture

Thesaurus: architecture
[Figure: a 64-byte memory block arrives at the cache, which consists of a tag array and a data array (decoupled arrays) plus a base cache.]

Thesaurus: insertion operation
[Figure, step 1: the 64-byte memory block A is hashed by the LSH to 0x02f0. Tag A is allocated in the tag array, and cacheline A appears as a base cacheline in the base cache.]
[Figure, step 2: block B hashes to 0x04e1. Tag B is allocated, the data array holds cacheline A, and cacheline B appears as a base cacheline in the base cache.]
[Figure, step 3: block C also hashes to 0x04e1, which hits in the base cache. C is diffed against the base, and only the Δs for C are stored in the data array; tag C records 0x04e1.]

Thesaurus: read operation
[Figure, step 1: the tag array holds tags A, B, and C together with the LSH values 0x02f0 and 0x04e1; the data array holds cacheline A, cacheline B, and the Δs for C; the base cache holds the bases.]
[Figure, step 2: to read C, the base for 0x04e1 is fetched from the base cache and the Δs are combined with the relevant bytes of the base (the slide labels the operation "xor") to reconstruct cacheline C.]
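The insertion and read flows on these slides can be modeled with a small software toy. It is a sketch under several assumptions, not the actual design: the class and field names are hypothetical, the LSH fingerprint is supplied by the caller, the first line seen for a fingerprint is kept raw and installed as the base, deltas are stored as (position, xor byte) pairs, and eviction and base replacement are ignored.

```python
class ToyThesaurus:
    """Software toy of a tag array, data array, and base cache."""

    def __init__(self):
        self.tag_array = {}    # tag -> LSH fingerprint
        self.data_array = {}   # tag -> ("raw", line) or ("delta", [(pos, xor)])
        self.base_cache = {}   # LSH fingerprint -> base line

    def insert(self, tag, line, fingerprint):
        base = self.base_cache.get(fingerprint)
        if base is None:
            # Base-cache miss: keep the line raw and make it the cluster base.
            self.base_cache[fingerprint] = line
            self.data_array[tag] = ("raw", line)
        else:
            # Base-cache hit: store only the bytes that differ from the base.
            deltas = [(i, b ^ l) for i, (b, l) in enumerate(zip(base, line)) if b != l]
            self.data_array[tag] = ("delta", deltas)
        self.tag_array[tag] = fingerprint

    def read(self, tag):
        fingerprint = self.tag_array[tag]
        kind, payload = self.data_array[tag]
        if kind == "raw":
            return payload
        # Rebuild the line by xor-ing the deltas into the relevant base bytes.
        line = bytearray(self.base_cache[fingerprint])
        for position, xor_byte in payload:
            line[position] ^= xor_byte
        return bytes(line)


cache = ToyThesaurus()
line_a = bytes(64)
line_b = bytes(range(64))
line_c = line_b[:11] + b"\xff" + line_b[12:]   # B with one byte changed

cache.insert("A", line_a, 0x02F0)
cache.insert("B", line_b, 0x04E1)
cache.insert("C", line_c, 0x04E1)   # same fingerprint as B: stored as deltas
```

In this toy, C occupies a single (position, xor) pair instead of 64 bytes, and `cache.read("C")` returns the original line by patching one byte of the base.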

Thesaurus Results

Results: compression and performance
[Figure, compression panel: average working set size (compressed size, lower is better) for Baseline, BΔI, Dedup, Thesaurus, and Ideal; annotations 1.28× and 2.25×.]
[Figure, miss rate panel: MPKI normalized to the baseline for Baseline, BΔI, Dedup, Thesaurus, Ideal, and Baseline 2×.]
Compression: 2.25×. Iso-silicon, 1 MB.
Results reported for SPEC CPU'17.

Summary
• Demonstrated significant similarity in the data values of memory blocks across different cachelines
• Proposed an efficient LLC compression scheme based on clustering nearly identical cachelines using locality-sensitive hashing
• Showed practical, dynamic, and hardware-friendly clustering
• Achieved a higher compression ratio and a lower miss rate

Questions?

Thesaurus cache: operation example

Results: cluster delta sizes
Average byte difference size for lines with the same LSH (looking at the data block only)
[Figure: cacheline in baseline (64 bytes) vs. cacheline in Thesaurus; annotations 1.2×, 3.2×, 8×.]
>2/3 size reduction

Results: Total Power
Difference in total power consumption: Thesaurus vs. baseline
[Figure: axis running from less consumption to more consumption.]

Results: Power
Dynamic read energy & leakage power: scaled to the same silicon area = 5.56/2.82 mm² at 45/32 nm
[Figure: dynamic energy and leakage power.]
Area, latency, and power overheads of the added logic: 64 B line, 2.66 GHz, FreePDK 45 nm

IDEA: inter-line deduplication + delta
• Q1: Are there enough similar cache lines?
• Q2: How to find similar lines quickly?
• Q3: How to protect the base from eviction?

Thesaurus cache: Avoiding fragmentation

Results: cluster sizes (same LSH)

Results: compression formats

Thesaurus cache: operation diagram