Thesaurus: Efficient Cache Compression via Dynamic Clustering
Amin Ghasemazar, Prashant Nair, Mieszko Lis
The University of British Columbia
ASPLOS 2020
Executive summary
• Question: how can we improve cache effectiveness by avoiding redundancy across nearly identical cachelines?
• Observation: the working set contains clusters of nearly identical cachelines
• Problem: how can this be used to compress the cache?
• Key idea:
  - group nearly identical cachelines via hardware-level dynamic clustering
  - store each cluster as one clusteroid line plus smaller deltas
• Results: higher compression ratio (up to 9.9x; 2.25x geometric mean)
Limitations of existing methods
[Figure: compression within a cacheline in the data array. Cacheline A, with words 0x3FE00000 0x3FE000FF 0x3FE00003 0x3FE000D8 …, is stored compressed as base 0x3FE00000 plus deltas 00 FF 03 D8 …]
Motivation: compressed caches
Prior work: within cachelines
[Figure: data array. A line of the form 0x3FE00000 00 FF 03 D8 … is compressed within the line. Cacheline D (0x3FE00000 0xC04000FF 0xBF350003 0x3F490000 …) and cacheline E (0x3FE00000 0xC04000FF 0xBF3500FF 0x3F490000 …) are nearly identical to each other, a compression opportunity that within-line schemes miss.]
Goal: delta compression in vertical fashion
New method: delta across cachelines
[Figure: data array with metadata. Cacheline C (0x3FE00000 0xC04000FF 0xBF350003 0x3F490000 …) is stored; the nearly identical cacheline E (0x3FE00000 0xC04000FF 0xBF3500FF 0x3F490000 …) is encoded as a delta (FF) against it.]
How to search for near cachelines? Exhaustive search? Deduplication hashing?
Prior work: across cachelines (deduplication)
• Identical lines: h(C) = h(D), so deduplication hashing finds them
• Nearly identical lines: h(C) ≠ h(E), so deduplication hashing misses them
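The limitation of deduplication hashing can be illustrated with a minimal Python sketch. It uses the 16-byte example words from the slides (real cachelines are 64 bytes), and SHA-256 is only an assumed stand-in for whatever hash a deduplicating cache would use; the helper names are hypothetical.

```python
import hashlib
import struct

def line_bytes(words):
    """Pack four 32-bit words into a 16-byte toy cacheline (big-endian)."""
    return struct.pack(">4I", *words)

def dedup_hash(words):
    """Exact-match hash; SHA-256 is an assumption made for illustration."""
    return hashlib.sha256(line_bytes(words)).hexdigest()

def differing_bytes(a, b):
    """Count byte positions at which two toy cachelines differ."""
    return sum(x != y for x, y in zip(line_bytes(a), line_bytes(b)))

# Example words taken from the slides.
C = [0x3FE00000, 0xC04000FF, 0xBF350003, 0x3F490000]
D = list(C)                                            # identical to C
E = [0x3FE00000, 0xC04000FF, 0xBF3500FF, 0x3F490000]   # one byte differs

print(dedup_hash(C) == dedup_hash(D))   # identical lines collide
print(dedup_hash(C) == dedup_hash(E))   # near-identical lines do not
print(differing_bytes(C, E))
```

An exact hash gives no signal that E is close to C, which is why the deck turns to locality-sensitive hashing.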
Thesaurus
Thesaurus
• Efficient near-cacheline search
• How to encode a compressed cacheline
• Low HW overhead
Idea: delta compression across cachelines
[Figure: three cachelines.
A: 3FE00000 C04000FF BF350003 3F490000 (stored)
B: 3FE00000 C04000FF BF350003 3F490000 (identical to A: don't store)
C: 3FA10000 3F4000FF BF3500FF 3F4900B1 (nearly identical to A: store deltas)
C is stored compressed as a bitmap (shown as 0 1 0 0 0 1) plus the deltas A1 3F FF B1.]
• Q1: Are there enough nearly identical cachelines?
• Q2: How to find nearly identical cachelines quickly?
• Q3: Do we need an on-line and adaptive mechanism?
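The bitmap-plus-deltas encoding in the figure can be sketched as follows. This is a minimal illustration using the 16-byte example values from the slide, assuming the deltas are the differing bytes of the new line itself (as the slide's A1 3F FF B1 suggests); the function names are hypothetical and the actual hardware format may differ.

```python
def encode(base: bytes, line: bytes):
    """Return (bitmap, deltas): one bit per byte, plus the bytes that differ."""
    bitmap = [int(b != l) for b, l in zip(base, line)]
    deltas = bytes(l for b, l in zip(base, line) if b != l)
    return bitmap, deltas

def decode(base: bytes, bitmap, deltas: bytes) -> bytes:
    """Rebuild the line by taking a delta byte wherever the bitmap is set."""
    it = iter(deltas)
    return bytes(next(it) if bit else b for b, bit in zip(base, bitmap))

# Cachelines A and C from the slide.
base = bytes.fromhex("3FE00000C04000FFBF3500033F490000")
line = bytes.fromhex("3FA100003F4000FFBF3500FF3F4900B1")

bitmap, deltas = encode(base, line)
print(deltas.hex())                          # the stored delta bytes
print(decode(base, bitmap, deltas) == line)  # round-trips to the original
```

In this toy setting, four delta bytes plus a 16-bit bitmap replace a 16-byte line.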
Opportunity: exactly identical
[Figure: effective LLC capacity per SPEC CPU'17 benchmark (bwaves, cactubssn, cam4, deepsjeng, exchange2, fotonik3d, gcc, imagick, lbm, leela, mcf, nab, namd, omnetpp, parest, perlbench, povray, roms, wrf, x264, xalancbmk, xz, Gmean); series: baseline, exactly identical.]
Exploiting exactly identical cachelines yields only 1.3x.
Results reported for SPEC CPU'17
Opportunity: nearly identical
[Figure: effective LLC capacity per SPEC CPU'17 benchmark (same benchmarks, plus Gmean); series: baseline, exactly identical, nearly identical.]
Exactly identical: only 1.3x. Enabling near matches: 2.5x, a big compression opportunity.
Opportunity: clustering for quick search
[Figure: number of clusters and number of members per cluster (DBSCAN) for each SPEC CPU'17 benchmark.]
Cluster counts and sizes vary per workload, so hardcoding is impractical.
Results reported for SPEC CPU'17
Opportunity: challenges of clustering in HW
• Cluster counts and sizes vary per workload: cluster creation must be adaptive
• Cache content is input dependent: clustering must happen at runtime
• Clustering in HW must be quick and inexpensive: cannot use DBSCAN/k-means …
Needed: dynamic and adaptive cluster creation in HW, without scanning
Thesaurus Clustering
Clustering based on locality-sensitive hashing
Locality-sensitive hashing:
• Same hash for similar blocks: same cluster
• Different hashes for dissimilar blocks
Locality-sensitive hashing: overview
1: Random bit sampling
[Figure, short hash:
B = 3FE00000 C04000FF BF350003 3F490000, h(B) = 0F0
A = 3FE00000 C04000FF BF3500FF 3F490000, h(A) = 0F0
C = BFA900D8 3ED457FF 3E745003 C0400000, h(C) = 0F0]
The short hash assumes C is also similar to A. Bad hashes: increase the number of bits.
[Figure, longer hash for the same lines: h(B) = 00FF30, h(A) = 00FF30, h(C) = 08FEC0]
Better hash quality, but big hashes are not efficient.
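Random bit sampling can be sketched in a few lines of Python. This is a toy illustration on 16-byte lines, not the slides' exact hash: the helper names, the seed, and the bit positions in the usage example are all assumptions chosen for clarity.

```python
import random

def make_positions(line_bits: int, n_bits: int, seed: int = 0):
    """Pick n_bits distinct bit positions once; every line reuses them."""
    return random.Random(seed).sample(range(line_bits), n_bits)

def sample_bits(line: bytes, positions) -> int:
    """Hash = concatenation of the line's bits at the sampled positions."""
    value = int.from_bytes(line, "big")   # bit 0 is the LSB of the last byte
    out = 0
    for p in positions:
        out = (out << 1) | ((value >> p) & 1)
    return out

# Hand-picked positions so that the behaviour is easy to follow.
positions = [0, 8, 127]
a = bytes(16)                         # all-zero toy line
b = bytes(15) + bytes([0x08])         # differs from a only in bit 3
c = bytes(14) + bytes([0x01, 0x00])   # differs from a in bit 8 (sampled)

print(sample_bits(a, positions) == sample_bits(b, positions))  # same hash
print(sample_bits(a, positions) == sample_bits(c, positions))  # different hash
```

Two lines share a hash exactly when they agree on every sampled bit, so a short hash merges dissimilar lines and a long hash costs more storage, which is the trade-off the two figures show.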
Locality-sensitive hashing: overview
2: Random projection
• More efficient
• Apply a random matrix with entries ~ N(μ, σ): hash = projection matrix × cacheline
• Distance between hashes ≈ distance between cachelines, w.h.p.
• Lots of storage and multiplication, big hashes: bad for HW
Johnson & Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Contemporary Mathematics 26:189–206, 1984.
Frankl & Maehara, The Johnson-Lindenstrauss lemma and the sphericity of some graphs, J. Combinatorial Theory B, 44:355–362, 1988.
Locality-sensitive hashing in Thesaurus: novel HW-efficient random projection
• Borrows statistical computing techniques
• hash = sparse {1, 0, -1} projection matrix* × cacheline
• Cheap operations, small storage
*Ping Li, Very Sparse Random Projections, 2006; Sean Fox, Random Projections for Scaling Machine Learning on FPGAs, 2016
Locality-sensitive hashing in Thesaurus: novel HW-efficient random projection
[Figure: efficient HW implementation of the sparse projection.]
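A sparse random projection with entries in {1, 0, -1} needs only additions and subtractions, which is what makes it attractive in hardware. The sketch below is a software illustration under stated assumptions: the 12-row matrix, the 1/3 density, and keeping one bit per row by testing whether the sum is positive are choices made here for the example, not details taken from the slides.

```python
import random

def make_sparse_matrix(n_rows: int, n_cols: int, density: float = 1 / 3, seed: int = 0):
    """Random matrix with entries in {1, 0, -1}; `density` is the nonzero fraction."""
    rng = random.Random(seed)
    half = density / 2
    matrix = []
    for _ in range(n_rows):
        row = []
        for _ in range(n_cols):
            r = rng.random()
            row.append(1 if r < half else -1 if r < density else 0)
        matrix.append(row)
    return matrix

def lsh_fingerprint(line: bytes, matrix) -> int:
    """Project the line's bytes and keep one bit per row (assumed: sum > 0)."""
    bits = 0
    for row in matrix:
        # Only adds and subtracts: no multiplier is needed for {1, 0, -1}.
        acc = sum(b if w == 1 else -b for w, b in zip(row, line) if w)
        bits = (bits << 1) | int(acc > 0)
    return bits

# Tiny hand-written matrix to make the arithmetic visible.
toy = [[1, -1], [-1, 1]]
print(lsh_fingerprint(bytes([5, 3]), toy))   # rows give +2 and -2
print(lsh_fingerprint(bytes([5, 4]), toy))   # nearby line, same bits
print(lsh_fingerprint(bytes([3, 5]), toy))   # dissimilar line, different bits

# A 12-bit fingerprint for a 64-byte line.
matrix = make_sparse_matrix(12, 64)
fingerprint = lsh_fingerprint(bytes(range(64)), matrix)
```

Lines that differ in a few bytes shift each row sum only slightly, so most or all fingerprint bits stay the same and the lines fall into the same cluster.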
Thesaurus Architecture
Thesaurus: architecture
[Figure: decoupled tag array and data array, plus a base cache; the input is a 64-byte memory block.]
Thesaurus: insertion operation
[Figure, step 1: the 64-byte memory block for cacheline A is hashed by the LSH to 0x02f0. Tag A records 0x02f0, and A becomes the base cacheline for that hash in the base cache.]
[Figure, step 2: cacheline B hashes to 0x04e1. Tag B records 0x04e1, and B becomes the base cacheline for that hash.]
[Figure, step 3: cacheline C hashes to 0x04e1 and hits in the base cache. C is diffed against the base, and only the Δs are stored in the data array; tag C records 0x04e1.]
Thesaurus: read operation
[Figure: a read of cacheline C finds tag C, which records LSH 0x04e1 and points to C's Δs in the data array. The base for 0x04e1 is read from the base cache, and its relevant bytes are combined (xor) with the Δs to reconstruct cacheline C.]
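The insertion and read flows from the figures can be modelled in software as a rough sketch. The class below is hypothetical: it uses dictionaries in place of the tag array and base cache, ignores eviction and capacity, takes the LSH as a parameter, and stores XOR differences because the read figure labels the combining step "xor".

```python
def xor_diff(base: bytes, line: bytes):
    """List of (byte index, base XOR line) for the positions that differ."""
    return [(i, b ^ l) for i, (b, l) in enumerate(zip(base, line)) if b != l]

class ThesaurusModel:
    """Toy model: no eviction, no capacity limits, no raw-line fallback."""

    def __init__(self, lsh):
        self.lsh = lsh          # function: line bytes -> fingerprint
        self.base_cache = {}    # fingerprint -> base (clusteroid) line
        self.tag_array = {}     # address -> (fingerprint, deltas)

    def insert(self, addr, line: bytes):
        fp = self.lsh(line)
        # On a base-cache miss the incoming line becomes the base.
        base = self.base_cache.setdefault(fp, line)
        self.tag_array[addr] = (fp, xor_diff(base, line))

    def read(self, addr) -> bytes:
        fp, deltas = self.tag_array[addr]
        out = bytearray(self.base_cache[fp])
        for i, d in deltas:
            out[i] ^= d         # apply the stored XOR difference
        return bytes(out)

# Stand-in LSH for the demo only: the first byte of the line.
model = ThesaurusModel(lsh=lambda line: line[0])
A, B, C = bytes([1, 2, 3, 4]), bytes([7, 7, 7, 7]), bytes([1, 2, 9, 4])
for addr, line in ((0xA, A), (0xB, B), (0xC, C)):
    model.insert(addr, line)

print(model.tag_array[0xC])     # fingerprint of A's cluster plus one delta
print(model.read(0xC) == C)     # reconstructs the original line
```

Here C shares A's fingerprint, so only a single delta is kept for it, while A and B each become the base of their own cluster.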
Thesaurus Results
Results: compression and performance
[Figure: bar charts comparing Baseline, B∆I, Dedup, Thesaurus, and Ideal. Compression panel: average working set size, compressed size normalized to baseline (lower is better). Miss rate panel: MPKI normalized to baseline. Iso-silicon comparison at 1 MB. Chart labels include 1.28x, 2x, and 2.25x.]
Compression: 2.25x
Results reported for SPEC CPU'17
Summary
• Demonstrated significant similarity in data values across different cachelines
• Proposed an efficient LLC compression scheme based on clustering nearly identical cachelines using locality-sensitive hashing
• Showed practical, dynamic, and hardware-friendly clustering
• Achieved a higher compression ratio and a lower miss rate
Questions?
Thesaurus cache: operation example
Results: cluster delta sizes
[Figure: average byte-difference size between lines with the same LSH (looking at the data block only); chart labels 1.2x, 3.2x, 8x. A 64-byte cacheline in the baseline vs. the same cacheline in Thesaurus: more than 2/3 size reduction.]
Results: total power
[Figure: difference in total power consumption, Thesaurus vs. baseline, with regions marked "less consumption" and "more consumption".]
Results: power
[Figure: dynamic read energy and leakage power, scaled to the same silicon area (5.56/2.82 mm², 45/32 nm).]
Area, latency, and power overheads of the added logic: 64 B line, 2.66 GHz, FreePDK 45 nm
IDEA: inter-line deduplication + delta
• Q1: Are there enough similar cache lines?
• Q2: How to find similar lines quickly?
• Q3: How to protect the base from eviction?
Thesaurus cache: avoiding fragmentation
Results: cluster sizes (same LSH)
Results: compression formats
Thesaurus cache: operation diagram