High Performance Cache Replacement Using Re-Reference Interval Prediction

High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
Aamer Jaleel, Kevin Theobald, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD
International Symposium on Computer Architecture (ISCA 2010)

Motivation
• Factors making caching important:
  • Increasing ratio of CPU speed to memory speed
  • Multi-core poses challenges for better shared cache management
• LRU has been the standard replacement policy at the LLC
• However, LRU has problems!

Problems with LRU Replacement
• A working set larger than the cache (Wsize > LLCsize) causes thrashing: every reference misses
• References to non-temporal data (scans) discard the frequently referenced working set, turning its hits into misses
• Our studies show that scans occur frequently in many commercial workloads

Desired Behavior from Cache Replacement
• Working set larger than the cache (Wsize > LLCsize): preserve some of the working set in the cache
• Recurring scans: preserve the frequently referenced working set in the cache so it still hits after each scan

Prior Solutions to Enhance Cache Replacement
• Working set larger than the cache → preserve some of the working set in the cache
  • Dynamic Insertion Policy (DIP): thrash-resistance with minimal changes to HW
• Recurring scans → preserve the frequently referenced working set in the cache
  • Least Frequently Used (LFU) addresses scans
  • But LFU adds complexity and also performs badly for recency-friendly workloads
• GOAL: design a high-performing, scan-resistant policy that requires minimal changes to HW

Belady’s Optimal (OPT) Replacement Policy
• Replacement decisions use perfect knowledge of the future reference order
• Victim Selection Policy: replace the block that will be re-referenced furthest in the future

  Physical Way #:             0   1   2   3   4   5   6   7
  Cache Tag:                  a   c   b   h   f   d   g   e
  "Time" of next reference:   4  13  11   5   3   6   9   1

• In this example, block 'c' (next referenced at time 13) is the victim
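
The victim-selection rule above can be sketched in a few lines (a toy model; the `belady_victim` helper and the trace representation are illustrative, not from the paper):

```python
def belady_victim(cache, trace, now):
    """Pick the victim under Belady's OPT: the resident block whose next
    reference lies furthest in the future (or never occurs again)."""
    def next_use(block):
        for t in range(now, len(trace)):
            if trace[t] == block:
                return t
        return float("inf")  # never referenced again: the ideal victim
    return max(cache, key=next_use)

# The slide's example: ways hold tags a..h with known next-reference times.
cache = ["a", "c", "b", "h", "f", "d", "g", "e"]
next_ref = {"a": 4, "c": 13, "b": 11, "h": 5, "f": 3, "d": 6, "g": 9, "e": 1}
victim = max(cache, key=lambda blk: next_ref[blk])
print(victim)  # 'c' is re-referenced furthest in the future (time 13)
```

OPT is unrealizable in hardware (it needs the whole future trace), which is exactly why the next slide recasts replacement as a prediction problem.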

Practical Cache Replacement Policies
• Replacement decisions are made by predicting the future reference order
• Victim Selection Policy: replace the block predicted to be re-referenced furthest in the future
• Continually update predictions of the future reference order
  • Natural update opportunities are cache fills and cache hits

  Physical Way #:                       0    1    2    3    4    5    6    7
  Cache Tag:                            a    c    b    h    f    d    g    e
  "Predicted time" of next reference:  ~4  ~13  ~11   ~5   ~3   ~6   ~9   ~1

LRU Replacement in the Prediction Framework
• The "LRU chain" maintains the re-reference prediction
  • Head of chain (i.e., MRU position): predicted to be re-referenced soon
  • Tail of chain (i.e., LRU position): predicted to be re-referenced far in the future
• LRU predicts that blocks are re-referenced in reverse order of reference
• Rename the "LRU chain" to the "Re-Reference Prediction (RRP) chain"
  • Rename the "MRU position" to RRP Head and the "LRU position" to RRP Tail
• The chain position (0–7 from RRP Head to RRP Tail) is stored with each cache block

Practicality of Chain-Based Replacement
• Problem: chain-based replacement is too expensive!
  • log2(associativity) bits required per cache block (16-way requires 4 bits/block)
• Solution: LRU chain positions can be quantized into different buckets
  • Each bucket corresponds to a predicted re-reference interval
  • The value of the bucket is called the Re-Reference Prediction Value (RRPV)
  • Hardware cost: n bits per block [ideally you would like n < log2(associativity)]

Representation of Quantized Replacement (n = 2)
• RRPV 0 = 'near-immediate', 1 = 'intermediate', 2 = 'far', 3 = 'distant' (RRP Head to RRP Tail)

  Physical Way #:   0   1   2   3   4   5   6   7
  Cache Tag:        a   c   b   h   f   d   g   e
  RRPV:             3   2   3   0   1   1   0   1

Emulating LRU with Quantized Buckets (n = 2)
• Victim Selection Policy: evict a block with the distant RRPV (i.e., 2^n − 1 = 3)
  • If no distant RRPV (i.e., 3) is found, increment all RRPVs and repeat the search
  • If multiple are found, a tie-breaker is needed: always start the search from physical way 0
• Insertion Policy: insert new blocks with RRPV = 0
• Update Policy: cache hits set the block's RRPV = 0
• But we want to do BETTER than LRU!
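
The three rules above can be sketched as a tiny single-set model (a sketch, not the paper's hardware: the `RRIPSet` class and its interface are illustrative):

```python
class RRIPSet:
    """One cache set under quantized n-bit RRPV replacement.
    With insert_rrpv=0 this emulates LRU as on this slide; SRRIP later
    simply supplies a non-zero insertion RRPV."""

    def __init__(self, ways=8, n=2, insert_rrpv=0):
        self.max_rrpv = (1 << n) - 1          # 'distant' = 2^n - 1
        self.insert_rrpv = insert_rrpv
        self.tags = [None] * ways
        self.rrpv = [self.max_rrpv] * ways    # empty ways look distant

    def access(self, tag):
        """Returns True on a hit, False on a miss (with fill)."""
        if tag in self.tags:
            self.rrpv[self.tags.index(tag)] = 0   # hit update: RRPV = 0
            return True
        # Victim selection: find a 'distant' block, aging all until one appears.
        while self.max_rrpv not in self.rrpv:
            self.rrpv = [v + 1 for v in self.rrpv]
        way = self.rrpv.index(self.max_rrpv)      # tie-break from way 0
        self.tags[way] = tag
        self.rrpv[way] = self.insert_rrpv         # insertion prediction
        return False
```

Note the aging loop is what makes the quantized scheme cheap: it never tracks a full ordering, only the per-block bucket, which is why n bits per block suffice.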

Re-Reference Interval Prediction (RRIP)
• The framework enables re-reference predictions to be tuned at insertion/update:
  • Unlike LRU, can use a non-zero RRPV on insertion
  • Unlike LRU, can use a non-zero RRPV on cache hits
• Static Re-Reference Interval Prediction (SRRIP)
  • Determine the best insertion/update prediction using profiling [and apply it to all apps]
• Dynamic Re-Reference Interval Prediction (DRRIP)
  • Dynamically determine the best re-reference prediction at insertion

Static RRIP Insertion Policy – Learn the Block's Re-Reference Interval
• Key idea: do not give new blocks too much (or too little) time in the cache
  • Predict that a new cache block will not be re-referenced soon
  • Insert new blocks with some RRPV other than 0
• Similar to inserting in the "middle" of the RRP chain
  • However, it is NOT identical to a fixed insertion position on the RRP chain (see paper)

Static RRIP Update Policy on Cache Hits
• Hit Priority (HP)
  • Like LRU, always update RRPV = 0 on cache hits
  • Intuition: predicts that blocks receiving hits after insertion will be re-referenced soon
• An alternative update scheme is also described in the paper

SRRIP Hit Priority: Sensitivity to Cache Insertion Prediction at the LLC
[Figure: % fewer cache misses relative to LRU vs. the RRIP value at insertion, for n = 1, averaged across PC games, multimedia, server, and SPEC CPU2006 workloads on a 16-way 2MB LLC]
• n = 1 is in fact the NRU replacement policy commonly used in commercial processors

SRRIP Hit Priority: Sensitivity to Cache Insertion Prediction at the LLC
[Figure: the same study for n = 1 through n = 5]
• Regardless of n, Static RRIP performs best when RRPV_insertion = 2^n − 2
• Regardless of n, Static RRIP performs worst when RRPV_insertion = 2^n − 1

Why Does RRPV_insertion = 2^n − 2 Work Best for SRRIP?
• Recall, NRU (n = 1) is not scan-resistant
• For scan resistance, RRPV_insertion MUST be different from the RRPV of the working-set blocks
  • Before the scan, the re-reference prediction of the active working set is 0
  • A larger insertion RRPV tolerates larger scans
  • The maximum insertion prediction (i.e., 2^n − 2) works best!
• In general, re-references after a scan of length Slen hit if
  Slen < (RRPV_insertion − Starting-RRPV_workingset) × (LLCsize − Wsize)
• So SRRIP is scan-resistant for Slen < RRPV_insertion × (LLCsize − Wsize)
• For n > 1, Static RRIP is scan-resistant! What about thrash resistance?
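
A toy single-set simulation shows the effect (a sketch: the `run` helper, the trace, and the sizes are illustrative, not the paper's methodology). With a 2-block working set in a 4-way set, a 4-block scan evicts the working set under LRU-like insertion (RRPV = 0) but not under SRRIP's insertion at 2^n − 2 = 2:

```python
def run(trace, ways=8, n=2, insert_rrpv=0):
    """Count hits for one cache set under n-bit RRPV replacement with a
    configurable insertion prediction (Hit Priority update on hits)."""
    max_rrpv = (1 << n) - 1
    tags, rrpv, hits = [None] * ways, [max_rrpv] * ways, 0
    for tag in trace:
        if tag in tags:
            rrpv[tags.index(tag)] = 0         # Hit Priority: RRPV = 0
            hits += 1
            continue
        while max_rrpv not in rrpv:           # age until a distant block exists
            rrpv = [v + 1 for v in rrpv]
        way = rrpv.index(max_rrpv)
        tags[way], rrpv[way] = tag, insert_rrpv
    return hits

# Working set of 2 blocks, touched twice, then a 4-block scan, then re-referenced.
trace = ["w0", "w1"] * 2 + ["s0", "s1", "s2", "s3"] + ["w0", "w1"]

lru_like = run(trace, ways=4, insert_rrpv=0)   # 2 hits: scan evicts w0/w1
srrip = run(trace, ways=4, insert_rrpv=2)      # 4 hits: w0/w1 survive the scan
```

The scan blocks enter at RRPV 2 ('far') and age out before the re-referenced working set (promoted to RRPV 0 by earlier hits), which is exactly the inequality above in miniature.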

DRRIP: Extending Scan-Resistant SRRIP to Be Thrash-Resistant
• Always using the same prediction for all insertions thrashes the cache
• Like DIP, need to preserve some fraction of the working set in the cache
  • Extend DIP to SRRIP to provide thrash resistance
• Dynamic Re-Reference Interval Prediction:
  • Dynamically select between inserting blocks with 2^n − 1 and 2^n − 2 using Set Dueling
  • Inserting blocks with 2^n − 1 is the same as "no update on insertion"
• DRRIP provides both scan-resistance and thrash-resistance
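
Set Dueling can be sketched as follows (hedged: the leader-set mapping, the 10-bit PSEL width, and the 1/32 bimodal probability are illustrative constants chosen for this sketch, not taken from the paper):

```python
import random

class DRRIPSelector:
    """Set-dueling sketch for DRRIP. A few 'leader' sets are hard-wired to
    SRRIP or to bimodal (BRRIP-style) insertion, a saturating PSEL counter
    tracks which group of leaders misses less, and all other (follower)
    sets copy the current winner."""

    def __init__(self, n=2, num_sets=1024, psel_bits=10, epsilon=1.0 / 32):
        self.distant = (1 << n) - 1     # 2^n - 1: "no update on insertion"
        self.far = (1 << n) - 2         # 2^n - 2: SRRIP's insertion RRPV
        self.psel_max = (1 << psel_bits) - 1
        self.psel = self.psel_max // 2  # start undecided
        self.epsilon = epsilon
        # Illustrative leader mapping: two sets out of every 64.
        self.srrip_leader = {s for s in range(num_sets) if s % 64 == 0}
        self.brrip_leader = {s for s in range(num_sets) if s % 64 == 1}

    def on_miss(self, set_index):
        """Leader-set misses steer PSEL toward the *other* policy."""
        if set_index in self.srrip_leader:
            self.psel = min(self.psel + 1, self.psel_max)
        elif set_index in self.brrip_leader:
            self.psel = max(self.psel - 1, 0)

    def insertion_rrpv(self, set_index):
        """Insertion RRPV for a fill into this set."""
        if set_index in self.srrip_leader:
            use_srrip = True
        elif set_index in self.brrip_leader:
            use_srrip = False
        else:
            use_srrip = self.psel <= self.psel_max // 2
        if use_srrip:
            return self.far
        # Bimodal insertion: mostly 'distant', occasionally 'far' (BIP-style)
        return self.far if random.random() < self.epsilon else self.distant
```

The attraction for hardware is that only the PSEL counter is global state; the per-block storage is unchanged from SRRIP.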

Performance Comparison of Replacement Policies
[Figure: % performance improvement over LRU for NRU, DIP, SRRIP, and DRRIP on a 16-way 2MB LLC, across GAMES, MULTIMEDIA, SERVER, SPEC06, and ALL]
• Static RRIP always outperforms LRU replacement
• Dynamic RRIP further improves the performance of Static RRIP

Cache Replacement Competition (CRC) Results
[Figure: % performance improvement over LRU for NRU, 3-bit SRRIP, 3-bit DRRIP, and Dueling Segmented-LRU (the CRC winner), averaged across PC games, multimedia, enterprise server, and SPEC CPU2006 workloads: 65 single-threaded workloads on a 16-way 1MB private cache, and 165 4-core workloads on a 16-way 4MB shared cache]
• Un-tuned DRRIP would be ranked 2nd and is within 1% of the CRC winner
• Unlike the CRC winner, DRRIP does not require any changes to the cache structure

Total Storage Overhead (16-Way Set-Associative Cache)
• LRU: 4 bits / cache block
• NRU: 1 bit / cache block
• DRRIP-3: 3 bits / cache block
• CRC winner: ~8 bits / cache block
• DRRIP outperforms LRU with less storage than LRU
• NRU can be easily extended to realize DRRIP!

Summary
• Scan-resistance is an important problem in commercial workloads
  • State-of-the-art policies do not address scan-resistance
• We propose a simple and practical replacement policy:
  • Static RRIP (SRRIP) for scan-resistance
  • Dynamic RRIP (DRRIP) for thrash-resistance and scan-resistance
• DRRIP requires ONLY 3 bits per block
  • In fact, it incurs less storage than LRU
• Un-tuned DRRIP would take 2nd place in the CRC championship
  • DRRIP requires significantly less storage than the CRC winner

Q&A

Static RRIP with n = 1
• Static RRIP with n = 1 is the commonly used NRU policy (polarity inverted)
  • Victim Selection Policy: evict a block with RRPV = 1
  • Insertion Policy: insert new blocks with RRPV = 0
  • Update Policy: cache hits set the block's RRPV = 0
• But NRU is not scan-resistant

SRRIP Update Policy on Cache Hits – Frequency Priority
• Frequency Priority (FP):
  • Improve the re-reference prediction to be shorter than before on hits (i.e., RRPV--)
  • Intuition: like LFU, predicts that frequently referenced blocks should have higher priority to stay in the cache
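
The FP update differs from Hit Priority in a single step; a minimal sketch (the `on_hit_fp` helper name is illustrative):

```python
def on_hit_fp(rrpv, way):
    """Frequency Priority: on a hit, step the block one bucket 'nearer'
    (RRPV--), rather than jumping straight to RRPV = 0 as Hit Priority does."""
    rrpv[way] = max(rrpv[way] - 1, 0)   # saturate at 'near-immediate'
    return rrpv

print(on_hit_fp([2, 3, 0, 1], 0))  # [1, 3, 0, 1]: way 0 moves one step nearer
```

A block thus needs several hits to reach RRPV 0, which is how FP approximates frequency counting within the RRPV bits.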

SRRIP-HP and SRRIP-FP Cache Performance
• SRRIP-HP has 2× better cache performance relative to LRU than SRRIP-FP
• We do not need to precisely detect frequently referenced blocks
• We need to preserve blocks that receive hits

Common Access Patterns in Workloads (Games, Multimedia, Enterprise Server, Mixed)
• Stack access pattern: (a1, a2, …, ak, …, a2, a1)^A
  • Solution: for any k, LRU performs well for such access patterns
• Streaming access pattern: (a1, a2, …, ak) for k >> assoc
  • No solution: cache replacement cannot solve this problem
• Thrashing access pattern: (a1, a2, …, ak)^A, for k > assoc
  • LRU receives no cache hits due to cache thrashing
  • Solution: preserve some fraction of the working set in the cache (e.g., use BIP)
    • BIP does NOT update replacement state for the majority of cache insertions
• Mixed access pattern: [(a1, a2, …, ak, …, a2, a1)^A (b1, b2, …, bm)]^N, for m > assoc − k
  • (b1, b2, …, bm) is commonly referred to as a scan in the literature
  • In the absence of the scan, LRU performs well; with the scan, LRU always misses on the frequently referenced (a1, a2, …, ak, …, a2, a1)^A
  • Solution: preserve the frequently referenced working set in the cache (e.g., use LFU)
    • LFU replaces infrequently referenced blocks in the presence of frequently referenced blocks

Performance of Hybrid Replacement Policies at the LLC
[Figure: results for PC games/multimedia, server, SPEC CPU2006, and average; 4-way OoO processor, 32KB L1, 256KB L2, 2MB LLC]
• DIP addresses SPEC workloads but NOT PC games & multimedia workloads
• Real-world workloads prefer scan-resistance over thrash-resistance

Understanding LRU Enhancements in the Prediction Framework
• Recent policies, e.g., DIP, say "insert new blocks at the 'LRU position'"
  • What does it mean to insert an MRU line in the LRU position?
  • It is a prediction that the new block will be re-referenced later than the existing blocks in the cache
  • What DIP really means is "insert new blocks at the 'RRP Tail'"
• Other policies, e.g., PIPP, say "insert new blocks in the 'middle of the LRU chain'"
  • A prediction that the new block will be re-referenced at an intermediate time
• The re-reference prediction framework helps describe the intuitions behind existing replacement-policy enhancements

Performance Comparison of Replacement Policies
[Figure: % performance improvement over LRU for NRU, DIP, SRRIP, Best RRP Chain Insertion, and DRRIP on a 16-way 2MB LLC, across GAMES, MULTIMEDIA, SERVER, SPEC06, and ALL]
• Static RRIP always outperforms LRU replacement
• Dynamic RRIP further improves the performance of Static RRIP