Computer Architecture Lab Modeling Cache Performance Beyond LRU

Computer Architecture Lab.
Modeling Cache Performance Beyond LRU (HPCA '16)
Nathan Beckmann and Daniel Sanchez (MIT)
2016/08/31 SIGARCH, presented by Changdae Kim

Understanding the LLC is Critical
• Consumes large area: >50% of Westmere
• Off-chip misses are very expensive
• Enables many optimizations
  – Single-threaded performance
  – Shared-cache performance
  – Fairness, quality of service
  – Via cache partitioning, replacement policy, scheduling, etc.
• Unfortunately, LLC behavior depends on many things
  – Cache size and associativity
  – Access patterns of applications
  – Replacement policy: LLCs no longer use (pseudo-)LRU
• A new model is needed!

The LLC Behaves as a Random Process
• Observation 1: private caches filter correlated accesses
  – LLC accesses are free of short-term temporal correlations
• Observation 2: address hashing removes hot sets
  – [Skew-associative cache], [zcache]
  – LLC accesses have near-uniform behavior across sets
  – The LLC has high effective associativity
    • This makes conflict misses a second-order concern
• Our approach
  – Focus on individual lines, not sets
  – Treat the replacement policy as a probabilistic process on lines

Outline
• Motivation and key approach
• Cache modeling
  – Terms and notations
  – Cache model for LRU
  – Extend to other replacement policies
• Model solution and implementation
  – Algorithm to solve the model
  – Optimization: increased step size
  – Implementation: how to profile reuse distance
• Model validation
• Case study: apply to cache partitioning
• Conclusion

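Terms and Notations (1/2)
• Age: how old a line is, measured in number of accesses
• A line's lifetime ends with either a hit or an eviction
[Figure: timeline of accesses to a line, with live and dead periods marked]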

Terms and Notations (2/2)
• Age: how old a line is, measured in number of accesses
• A lifetime ends with a hit or an eviction; the hit rate follows from how lifetimes end
• Legend: D: reuse Distance / A: Age / H: Hit / E: Eviction
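Because every lifetime ends in exactly one hit or one eviction, the legend above implies a simple normalization. A short sketch, assuming P_H(a) and P_E(a) (names chosen here for illustration) denote the probability that a lifetime ends with a hit or an eviction at age a, and assuming a full cache in steady state so that each miss causes one eviction:

```latex
\sum_{a=1}^{\infty} \bigl( P_H(a) + P_E(a) \bigr) = 1,
\qquad
P[\text{hit}] = \sum_{a=1}^{\infty} P_H(a),
\qquad
P[\text{miss}] = \sum_{a=1}^{\infty} P_E(a) = 1 - P[\text{hit}]
```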


Cache Model Overview
• Three interdependent probability distributions: age, hit, and eviction
  – Input: the reuse distance distribution
• Solved by iterating to a fixed point
• When the distributions converge, the model is accurate
  – Converges in practice, without proof (future work)

Model Assumptions
• Reuse distances d are independent and identically distributed according to P_D(d)
  – Good approximation for large caches [Talus]
  – Private caches capture the correlated accesses
• Replacement candidates are drawn at random from cached lines
  – Good approximation for hashed, set-associative caches with many ways
    • Skew-associative cache, zcache
• The victim's age is independent of whether the cache hits or misses
  – A large cache is typically unaffected by a single candidate

Age Distribution
[Equation slide; the equation is not preserved in this transcript]

Hit Distribution
[Equation slide; the equation is not preserved in this transcript]

Eviction Distribution in LRU
[Equation slide; the equation is not preserved in this transcript]

Cache Model Overview in LRU
[Figure: how the reuse distance (D), age (A), hit (H), and eviction (E) distributions feed each other under LRU]


Ranking Functions
• PDP: protects lines up to an age dp
  – Prefers to evict lines older than dp
  – If there are no such lines, prefers to evict the youngest line
• IRGD: based on {reuse distance – current age}
  – The expected reuse time from now on
  – Uses a weighted harmonic mean
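The PDP rule above can be sketched as a ranking function over candidate ages. This is a minimal illustration, not PDP's actual implementation: the names `pdp_rank` and `choose_victim` are hypothetical, and the tie-break toward the oldest line among unprotected candidates is an assumption the slide does not specify.

```python
def pdp_rank(age, dp):
    """Eviction priority for one line; a larger tuple means evict sooner."""
    if age > dp:
        # Unprotected lines outrank every protected line.
        # Assumption: among unprotected lines, prefer the oldest.
        return (1, age)
    # All candidates protected: prefer the youngest line.
    return (0, -age)


def choose_victim(candidate_ages, dp):
    """Return the index of the candidate with the highest eviction priority."""
    return max(range(len(candidate_ages)),
               key=lambda i: pdp_rank(candidate_ages[i], dp))


# One candidate is older than dp=16, so it is evicted.
print(choose_victim([3, 40, 12], dp=16))
# No candidate is older than dp=16, so the youngest is evicted.
print(choose_victim([3, 10, 12], dp=16))
```

Expressing the policy as a rank over ages is what lets a single eviction model cover several policies: only the ranking function changes.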

Generalized Eviction Distribution
[Equation slide; the equation is not preserved in this transcript. Surviving annotation: under LRU, a line of age a is evicted when it is the oldest candidate]

Cache Model Overview
[Figure: the generalized model relating the reuse distance (D), age (A), hit (H), and eviction (E) distributions]



Model Solution
• Each iteration:
  – [1] Solve based on the old P[hit] and the inputs
  – [2] Calculate a new P[hit] based on the current solution
  – [3] Take a moving average of the solutions
• Converged when changes stay below 0.001 for ten iterations
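The iteration loop and stopping rule above can be sketched generically. This is not the paper's algorithm: `solve_fixed_point` is a hypothetical helper, the exponential moving average with weight 0.5 is an assumption (the slide only says "moving average"), and `math.cos` is a toy stand-in for the real model update that would map an old P[hit] to a new one.

```python
import math


def solve_fixed_point(update, x0, tol=1e-3, stable_iters=10,
                      weight=0.5, max_iters=10_000):
    """Iterate x <- average(x, update(x)) until the change stays below
    tol for stable_iters consecutive iterations."""
    x = x0
    stable = 0
    for _ in range(max_iters):
        candidate = update(x)                          # steps [1] and [2]
        new_x = weight * x + (1 - weight) * candidate  # step [3]
        stable = stable + 1 if abs(new_x - x) < tol else 0
        x = new_x
        if stable >= stable_iters:
            return x
    raise RuntimeError("fixed-point iteration did not converge")


# Toy stand-in for the model update: the fixed point of cos(x).
solution = solve_fixed_point(math.cos, x0=0.5)
print(round(solution, 3))
```

Averaging successive solutions damps oscillation between iterations, and requiring ten consecutive small changes guards against stopping on a single coincidentally small step.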


Increased Step Size (1/2)
• The model works on coarsened regions of ages
• How to determine the regions
  – Divide all ages evenly into N/2 regions
    • 0, 8, 16, …, 256 with 8-bit ages
  – Then halve a region N/2 times to equalize the probability of hits and evictions in each region
    • Each time, halve the region with the highest probability
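The region selection above can be sketched as follows. This is an illustration under stated assumptions: `coarsen_regions` is a hypothetical name, `prob[a]` is assumed to hold the probability of a hit or eviction at age a, "halving" is taken to mean splitting a region at its age midpoint, and regions one age wide are skipped because they cannot be split.

```python
def coarsen_regions(prob, n_regions):
    """Return n_regions half-open age ranges (lo, hi) covering len(prob) ages."""
    max_age = len(prob)
    half = n_regions // 2
    # Step 1: divide all ages evenly into N/2 regions.
    bounds = [i * max_age // half for i in range(half + 1)]
    regions = [(bounds[i], bounds[i + 1]) for i in range(half)]
    # Step 2: N/2 times, halve the most probable region that can be split.
    for _ in range(half):
        splittable = [r for r in regions if r[1] - r[0] > 1]
        if not splittable:
            break
        lo, hi = max(splittable, key=lambda r: sum(prob[r[0]:r[1]]))
        mid = (lo + hi) // 2
        regions.remove((lo, hi))
        regions += [(lo, mid), (mid, hi)]
    return sorted(regions)


# Example: 8-bit ages with probability concentrated at small ages.
prob = [0.5 ** (a + 1) for a in range(256)]
regions = coarsen_regions(prob, n_regions=64)
print(regions[:6])
```

With this input the initial regions are 8 ages wide, matching the 0, 8, 16, …, 256 example, and the refinement concentrates narrow regions where hits and evictions are most likely.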

Increased Step Size (2/2)
• Comparison: the solution with 32 regions vs. the solution for all ages
[Figure not preserved in this transcript]


How to Profile Reuse Distance
• Our implementation: add a small, tagged, LRU array
  – Accesses are hashed and sampled (1%)
  – Tglobal is incremented for each sample (1024 in the example)
  – Each line holds a partial tag (16 b) and a coarse timestamp, Tentry (= Tglobal/32)
  – On a hit, the reuse distance is the coarse time elapsed, e.g. (1024/32) – (30) = 2
  – Each measured distance increments a counter in the reuse distance distribution
  – Presenter's note: I don't know why 1% and 32 were chosen
• Other schemes can be used
  – Hardware based, software based, …
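A software sketch of such a sampled monitor is shown below. It is a simplified illustration, not the hardware design: the class name, the CRC32 hash, sampling by hash modulo, and the use of `OrderedDict` for LRU order are all assumptions made here. Only the 16-bit partial tag, the per-sample global counter, and the coarse timestamp (Tglobal divided by a granularity) come from the slide.

```python
import zlib
from collections import Counter, OrderedDict


def crc_hash(line_addr):
    """Assumed hash function: CRC32 of the 8-byte line address."""
    return zlib.crc32(line_addr.to_bytes(8, "little"))


class ReuseDistanceMonitor:
    def __init__(self, entries=4, granularity=32, sample_mod=100,
                 hash_fn=crc_hash):
        self.entries = entries          # lines in the small tagged array
        self.granularity = granularity  # Tentry = Tglobal // granularity
        self.sample_mod = sample_mod    # 100 samples about 1% of addresses
        self.hash_fn = hash_fn
        self.t_global = 0
        self.array = OrderedDict()      # partial tag -> Tentry, in LRU order
        self.histogram = Counter()      # coarse reuse distance -> count

    def access(self, line_addr):
        h = self.hash_fn(line_addr)
        if h % self.sample_mod != 0:
            return                      # address is not sampled
        self.t_global += 1              # incremented for each sample
        tag = h & 0xFFFF                # 16-bit partial tag
        now = self.t_global // self.granularity
        if tag in self.array:
            self.histogram[now - self.array[tag]] += 1
            self.array.move_to_end(tag)
        elif len(self.array) >= self.entries:
            self.array.popitem(last=False)  # evict the LRU entry
        self.array[tag] = now


# Usage with every address sampled and an identity hash, for clarity.
monitor = ReuseDistanceMonitor(granularity=1, sample_mod=1,
                               hash_fn=lambda a: a)
for addr in [1, 2, 3, 1, 2]:
    monitor.access(addr)
print(dict(monitor.histogram))
```

Sampling by address hash rather than by access keeps every access to a sampled line visible, so its reuses can still be matched in the array.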


Synthetic Simulation
• Small cache
  – But replacement candidates are selected at random, matching our assumption
• Generated access traces
[Figure not preserved in this transcript]
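A minimal sketch of this kind of synthetic experiment follows. It is not the authors' simulator: `simulate` is a hypothetical function, the cache ranks candidates by LRU only, and the trace is a uniformly random stream over a fixed set of addresses, chosen here because its reuse distances are independent of each other.

```python
import random


def simulate(trace, size, candidates, seed=0):
    """Hit rate of a cache that evicts the oldest of `candidates`
    lines drawn at random from all cached lines."""
    rng = random.Random(seed)
    last_use = {}  # address -> time of last access
    hits = 0
    for now, addr in enumerate(trace):
        if addr in last_use:
            hits += 1
        elif len(last_use) >= size:
            pool = rng.sample(list(last_use), min(candidates, len(last_use)))
            victim = min(pool, key=last_use.__getitem__)  # oldest candidate
            del last_use[victim]
        last_use[addr] = now
    return hits / len(trace)


trace_rng = random.Random(1)
trace = [trace_rng.randrange(256) for _ in range(20_000)]
hit_rate = simulate(trace, size=64, candidates=16)
print(f"hit rate: {hit_rate:.3f}")
```

For a uniformly random trace the hit rate should sit near size divided by the number of distinct addresses (64/256 here), which gives a quick sanity check before feeding more interesting generated traces.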


Execution-driven: Methodology
• Simulator: zsim
  – L3 cache size: 128 KB ~ 128 MB
• Benchmark: SPEC CPU2006
  – Fast-forward: 10 B instructions
  – Run: 10 B instructions
• Solve the model every 250 K accesses
  – >400 K solutions in all

Model Accuracy
• For >400 K solutions with N=128
  – Median error: 0.1% ~ 0.6%
  – Mean error: 2.2% ~ 3.7%
  – 90th percentile error: 6.1% ~ 9.9%

Sensitivity to Step Size
• N < 128 noticeably degrades accuracy
  – Skew-associative LLCs improve model accuracy even further
• N=∞ means the age-by-age solution

Sources of Model Error (1/2)
• Unstable access patterns degrade accuracy
[Figure: simulation, model, and error for the first 25 M LLC accesses on a 1 MB LRU LLC]

Sources of Model Error (2/2)
• Long runs average out the error
[Figure: simulation, model, and error for 10 B accesses on LRU LLCs of various sizes]
• mcf, lbm, and Gems on an LRU or PDP LLC show large error
  – Breaking accesses into multiple classes reduces this error


Apply to Way-based Cache Partitioning
• Speedup for 100 random mixes
  – 4 MB LLC on 4 cores; each mix runs at least 1 B instructions
  – 16 solutions for 16-way partitions: 0.6% runtime overhead
  – UCP + model with IRGD improves weighted speedup by 10.2%
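To illustrate how per-partition model solutions could drive way allocation, the sketch below assigns ways greedily from miss curves. This is a deliberately simple stand-in and not UCP's own allocation algorithm: `greedy_partition` is a hypothetical name, and `miss_curves[i][w]` is assumed to be the predicted misses of application i with w ways (one model solution per way count).

```python
def greedy_partition(miss_curves, total_ways):
    """Hand out ways one at a time to the application whose predicted
    misses drop the most from one extra way."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        gains = [curve[ways] - curve[ways + 1]
                 for curve, ways in zip(miss_curves, alloc)]
        winner = max(range(len(gains)), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc


# Two applications sharing 4 ways; the numbers are made up for illustration.
curves = [
    [100, 50, 35, 33, 32],  # benefits sharply from its first two ways
    [100, 90, 80, 70, 60],  # benefits steadily from every way
]
print(greedy_partition(curves, total_ways=4))
```

A greedy pass like this can be misled by miss curves with plateaus followed by cliffs, which is one reason real partitioning schemes look further ahead along the curve.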


Conclusions
• Understanding the LLC is critical
  – Performance, fairness, etc.
• Model modern LLCs as a random process
  – Can model replacement policies beyond LRU
• Propose an iterative algorithm to solve the model
  – Shows high accuracy
• Apply to way-based cache partitioning
  – Outperforms state-of-the-art mechanisms