COMPUTER ARCHITECTURE CS 6354 Caches Samira Khan University of Virginia April 14, 2016 The content and concept of this course are adapted from CMU ECE 740

AGENDA • Logistics • Review from last lecture • Caches

LOGISTICS • Milestone II Meetings – Today • Review class – Problem solving class – Tuesday April 19

TWO-LEVEL PREDICTOR • Intuition behind two-level predictors • Realization 1: A branch’s outcome can be correlated with other branches’ outcomes – Global branch correlation • Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch the last time it was executed) – Local branch correlation

ONE-LEVEL BRANCH PREDICTOR [Diagram: the Program Counter (address of the current instruction) indexes a direction predictor (2-bit counters, “taken?”) and a cache of target addresses (BTB: Branch Target Buffer, “hit?”); the next fetch address is either PC + inst size or the predicted target address]

TWO-LEVEL GLOBAL HISTORY PREDICTOR [Diagram: a global branch history register (which direction earlier branches went) indexes the direction predictor (2-bit counters, “taken?”); the Program Counter still indexes the BTB (Branch Target Buffer, “hit?”); the next fetch address is PC + inst size or the target address]

TWO-LEVEL GSHARE PREDICTOR [Diagram: the global branch history (which direction earlier branches went) is XORed with the Program Counter to index the direction predictor (2-bit counters, “taken?”); the BTB (Branch Target Buffer) is indexed by the address of the current instruction as before; the next fetch address is PC + inst size or the target address]
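
For concreteness, here is a minimal Python sketch of how a gshare index might be formed; the table size, history length, and 4-byte instruction alignment are illustrative assumptions, not parameters from the slides.

HIST_BITS = 12                      # assumed global history length
TABLE_SIZE = 1 << HIST_BITS         # 4096 two-bit counters (assumed)

def gshare_index(pc, global_history):
    pc_bits = (pc >> 2) & (TABLE_SIZE - 1)        # drop byte-offset bits of the PC (assumed 4-byte instructions)
    hist_bits = global_history & (TABLE_SIZE - 1)
    return pc_bits ^ hist_bits                    # XOR folds recent outcomes into the branch's table index

def update_global_history(global_history, taken):
    # Shift in the newest outcome (1 = taken); keep only HIST_BITS bits.
    return ((global_history << 1) | int(taken)) & (TABLE_SIZE - 1)

The XOR lets two executions of the same branch with different recent histories map to different 2-bit counters, which is the point of combining global history with the branch address.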

TWO-LEVEL LOCAL HISTORY PREDICTOR [Diagram: the Program Counter selects a per-branch (local) history of which directions earlier instances of *this branch* went; that history indexes the direction predictor (2-bit counters, “taken?”); the BTB (Branch Target Buffer) is indexed by the address of the current instruction; the next fetch address is PC + inst size or the target address]

SOME OTHER BRANCH PREDICTOR TYPES • Loop branch detector and predictor – Loop iteration count detector/predictor – Works well for loops, where iteration count is predictable – Used in Intel Pentium M • Perceptron branch predictor – Learns the direction correlations between individual branches – Assigns weights to correlations – Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” HPCA 2001. • Hybrid history length based predictor – Uses different tables with different history lengths – Seznec, “Analysis of the O-Geometric History Length branch predictor,” ISCA 2005.

STATE OF THE ART IN BRANCH PREDICTION • See the Branch Prediction Championship – http://www.jilp.org/cbp2014/program.html • Andre Seznec, “TAGE-SC-L branch predictors,” CBP 2014.

CACHING BASICS

REVIEW: DIRECT-MAPPED CACHE STRUCTURE • Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks • Assume cache: 64 bytes, 8 blocks – Direct-mapped: A block can go to only one location – Addresses with the same index contend for the same location → causes conflict misses [Diagram: the address is split into tag (2 bits), index (3 bits), and byte in block (3 bits); the index selects one (V, tag) entry in the tag store and one block in the data store; a tag comparison (=?) produces Hit? and a MUX selects the byte in block]
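
A small Python sketch of the address split in this example (3 offset bits for 8-byte blocks, 3 index bits for 8 direct-mapped blocks, 2 tag bits for a 256-byte memory); the code is illustrative only.

OFFSET_BITS = 3   # 8-byte blocks
INDEX_BITS = 3    # 8 blocks, direct-mapped

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Addresses 0 and 64 share index 0 but differ in tag, so they contend for the same location:
print(split_address(0), split_address(64))   # (0, 0, 0) and (1, 0, 0)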

REVIEW: PROBLEM WITH DIRECT-MAPPED • Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time – One index → one entry • Can lead to 0% hit rate if two or more blocks that map to the same index are accessed in an interleaved manner – Assume addresses A and B have the same index bits but different tag bits – A, B, … conflict in the cache index – All accesses are conflict misses

REVIEW: SET ASSOCIATIVITY • Addresses 0 and 8 always conflict in direct mapped cache • Instead of having one column of 8, have 2 columns of 4 blocks [Diagram: each set holds two (V, tag) entries in the tag store and two blocks in the data store; the address is split into tag (3 bits), index (2 bits), and byte in block (3 bits); two tag comparisons (=?) feed logic that produces Hit? and drives the MUXes selecting the way and the byte in block] • Associative memory within the set -- More complex, slower access, larger tag store + Accommodates conflicts better (fewer conflict misses)

REVIEW: HIGHER ASSOCIATIVITY • 4-way [Diagram: four (V, tag) entries per set in the tag store, four comparators (=?) feeding hit logic, and a wider data-store MUX selecting the byte in block] -- More tag comparators and wider data mux; larger tags + Likelihood of conflict misses even lower

REVIEW: FULL ASSOCIATIVITY • Fully associative cache – A block can be placed in any cache location [Diagram: the tag store has one comparator (=?) per block, all feeding the hit logic; the data store MUX selects the byte in block]

REVIEW: APPROXIMATIONS OF LRU • Most modern processors do not implement “true LRU” in highly-associative caches • Why? – True LRU is complex – LRU is an approximation to predict locality anyway (i.e., not the best possible replacement policy) • Examples: – Not MRU (not most recently used) – Hierarchical LRU: divide the 4-way set into 2-way “groups”, track the MRU group and the MRU way in each group – Victim-NextVictim Replacement: Only keep track of the victim and the next victim

REPLACEMENT POLICY • LRU vs. Random – Set thrashing: When the “program working set” in a set is larger than set associativity – 4-way: Cyclic references to A, B, C, D, E • 0% hit rate with LRU policy – Random replacement policy is better when thrashing occurs • In practice: – Depends on workload – Average hit rate of LRU and Random are similar • Hybrid of LRU and Random – How to choose between the two? Set sampling • See Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.
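
A small simulation sketch of the set-thrashing point above: cyclic references to five blocks in a single 4-way set give a 0% hit rate under LRU, while random replacement keeps some blocks resident. The code is an illustration, not the experiment behind the slide.

import random

def hit_rate(policy, refs, ways=4, seed=1):
    rng = random.Random(seed)
    resident, hits = [], 0            # resident[0] is the LRU block, resident[-1] the MRU
    for blk in refs:
        if blk in resident:
            hits += 1
            resident.remove(blk)
            resident.append(blk)      # promote to MRU on a hit
        else:
            if len(resident) == ways:
                victim = 0 if policy == "lru" else rng.randrange(ways)
                resident.pop(victim)
            resident.append(blk)
    return hits / len(refs)

refs = list("ABCDE") * 100            # cyclic references A, B, C, D, E in one 4-way set
print("LRU:   ", hit_rate("lru", refs))      # 0.0: every access evicts the block needed next
print("Random:", hit_rate("random", refs))   # noticeably above zero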

OPTIMAL REPLACEMENT POLICY? • Belady’s OPT – Replace the block that is going to be referenced furthest in the future by the program – Belady, “A study of replacement algorithms for a virtual-storage computer,” IBM Systems Journal, 1966. – How do we implement this? Simulate? • Is this optimal for minimizing miss rate? • Is this optimal for minimizing execution time? – No. Cache miss latency/cost varies from block to block! – Two reasons: Remote vs. local caches and miss overlapping – Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.
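
Belady's OPT needs future knowledge, so it can only be evaluated offline on a recorded reference trace. A hedged simulation sketch (fully associative cache, illustrative only):

def opt_misses(trace, capacity):
    cache, misses = set(), 0
    for i, blk in enumerate(trace):
        if blk in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            # Evict the resident block whose next use is furthest in the future
            # (or that is never referenced again).
            def next_use(b):
                try:
                    return trace.index(b, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(blk)
    return misses

print(opt_misses(list("ABCABDABCD"), capacity=3))   # 5 misses on this toy trace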

HANDLING WRITES (STORES) • When do we write the modified data in a cache to the next level? – Write through: At the time the write happens – Write back: When the block is evicted • Write-back + Can consolidate multiple writes to the same block before eviction – Potentially saves bandwidth between cache levels + saves energy -- Need a bit in the tag store indicating the block is “modified” • Write-through + Simpler + All levels are up to date. Consistency: Simpler cache coherence because no need to check lower-level caches -- More bandwidth intensive; no coalescing of writes

HANDLING WRITES (STORES) • Do we allocate a cache block on a write miss? – Allocate on write miss: Yes – No-allocate on write miss: No • Allocate on write miss + Can consolidate writes instead of writing each of them individually to next level + Simpler because write misses can be treated the same way as read misses -- Requires (?) transfer of the whole cache block • No-allocate + Conserves cache space if locality of writes is low (potentially better cache hit rate)

CACHE PARAMETERS VS. MISS RATE • Cache size • Block size • Associativity • Replacement policy • Insertion/Placement policy

CACHE SIZE • Cache size: total data (not including tag) capacity – bigger can exploit temporal locality better – not ALWAYS better • Too large a cache adversely affects hit and miss latency – smaller is faster => bigger is slower – access time may degrade critical path • Too small a cache – doesn’t exploit temporal locality well – useful data replaced often • Working set: the whole set of data the executing application references – Within a time interval [Plot: hit rate vs. cache size; hit rate flattens once the cache size exceeds the “working set” size]

BLOCK SIZE • Block size is the data that is associated with an address tag – not necessarily the unit of transfer between hierarchies • Sub-blocking: A block divided into multiple pieces (each with V bit) – Can improve “write” performance • Too small blocks – don’t exploit spatial locality well – have larger tag overhead • Too large blocks – too few total # of blocks • likely-useless data transferred • Extra bandwidth/energy consumed [Plot: hit rate vs. block size, peaking at an intermediate block size]

LARGE BLOCKS: CRITICAL-WORD AND SUBBLOCKING • Large cache blocks can take a long time to fill into the cache – fill cache line critical word first – restart cache access before complete fill • Large cache blocks can waste bus bandwidth – divide a block into subblocks – associate separate valid bits for each subblock – When is this useful? [Diagram: a block with a tag and per-subblock valid (v) and dirty (d) bits]

ASSOCIATIVITY • How many blocks can map to the same index (or set)? • Larger associativity – lower miss rate, less variation among programs – diminishing returns, higher hit latency • Smaller associativity – lower cost – lower hit latency • Especially important for L1 caches • Power of 2 associativity? [Plot: hit rate vs. associativity, with diminishing returns]

CLASSIFICATION OF CACHE MISSES • Compulsory miss – first reference to an address (block) always results in a miss – subsequent references should hit unless the cache block is displaced for the reasons below – dominates when locality is poor • Capacity miss – cache is too small to hold everything needed – defined as the misses that would occur even in a fully-associative cache (with optimal replacement) of the same capacity • Conflict miss – defined as any miss that is neither a compulsory nor a capacity miss

HOW TO REDUCE EACH MISS TYPE • Compulsory – Caching cannot help – Prefetching • Conflict – More associativity – Other ways to get more associativity without making the cache associative • Victim cache • Hashing • Software hints? • Capacity – Utilize cache space better: keep blocks that will be referenced – Software management: divide working set such that each “phase” fits in cache

IMPROVING CACHE “PERFORMANCE” • Remember – Average memory access time (AMAT) = (hit-rate * hit-latency) + (miss-rate * miss-latency) • Reducing miss rate – Caveat: reducing miss rate can reduce performance if more costly-to-refetch blocks are evicted • Reducing miss latency/cost • Reducing hit latency
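
As a worked example with the formula above and illustrative numbers (not from the slides): with a 90% hit rate, 2-cycle hits, and 100-cycle misses, AMAT = 0.9 * 2 + 0.1 * 100 = 1.8 + 10 = 11.8 cycles; halving the miss rate to 5% gives 0.95 * 2 + 0.05 * 100 = 1.9 + 5 = 6.9 cycles. When misses are expensive, small changes in miss rate dominate AMAT, which is why the techniques below target miss rate and miss cost.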

IMPROVING BASIC CACHE PERFORMANCE • Reducing miss rate – More associativity – Alternatives/enhancements to associativity • Victim caches, hashing, pseudo-associativity, skewed associativity – Better replacement/insertion policies – Software approaches • Reducing miss latency/cost – Multi-level caches – Critical word first – Subblocking/sectoring – Better replacement/insertion policies – Non-blocking caches (multiple cache misses in parallel) – Multiple accesses per cycle – Software approaches

VICTIM CACHE: REDUCING CONFLICT MISSES [Diagram: a small victim cache sits between the direct mapped cache and the next level cache] • Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA 1990. • Idea: Use a small fully associative buffer (victim cache) to store evicted blocks + Can avoid ping-ponging of cache blocks mapped to the same set (if two cache blocks continuously accessed in nearby time conflict with each other) -- Increases miss latency if accessed serially with L2
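
A rough Python sketch of the victim-cache lookup path described above; the cache geometry, the 4-entry victim buffer, and the FIFO handling of old victims are assumptions for illustration, not Jouppi's exact design.

def access(l1, victim, addr, index_bits=6, offset_bits=6, victim_entries=4):
    # l1: dict index -> tag (direct-mapped); victim: list of (tag, index) pairs, fully associative.
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    if l1.get(index) == tag:
        return "L1 hit"
    if (tag, index) in victim:                 # probe the victim buffer on an L1 miss
        victim.remove((tag, index))
        displaced = l1.get(index)
        l1[index] = tag                        # swap: the requested block moves back into L1 ...
        if displaced is not None:
            victim.append((displaced, index))  # ... and the displaced L1 block becomes the new victim
        return "victim-cache hit"
    displaced = l1.get(index)                  # true miss: fetch from the next level
    if displaced is not None:
        if len(victim) >= victim_entries:
            victim.pop(0)                      # oldest victim is dropped (assumed FIFO)
        victim.append((displaced, index))
    l1[index] = tag
    return "miss"

Two blocks that ping-pong on the same L1 index keep finding each other in the victim buffer instead of going to the next level.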

HASHING AND PSEUDO-ASSOCIATIVITY • Hashing: Better “randomizing” index functions + can reduce conflict misses • by distributing the accessed memory blocks more evenly to sets – Example: strided accesses where the stride value equals the cache size -- More complex to implement: can lengthen critical path • Pseudo-associativity (Poor Man’s associative cache) + can reduce conflict misses – Serial lookup: On a miss, use a different index function and access cache again +/- Introduces heterogeneity in hit latency

SKEWED ASSOCIATIVE CACHES (I) • Basic 2-way associative cache structure [Diagram: Way 0 and Way 1 use the same index function; the address is split into tag, index, and byte in block, and each way’s tag is compared (=?)]

SKEWED ASSOCIATIVE CACHES (II) • Skewed associative caches – Each bank has a different index function [Diagram: in a conventional cache, blocks with the same index map to the same set in both ways; with a skewing function f0 on one way, blocks with the same conventional index are redistributed to different sets; the address is split into tag, index, and byte in block, with a comparator (=?) per way]

SKEWED ASSOCIATIVE CACHES (III) • Idea: Reduce conflict misses by using different index functions for each cache way • Benefit: indices are randomized – Less likely two blocks have same index • Reduced conflict misses – May be able to reduce associativity • Cost: additional latency of hash function • Seznec, “A Case for Two-Way Skewed-Associative Caches,” ISCA 1993.
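
A hedged sketch of per-way index functions for a 2-way skewed cache; the XOR-based hash below is an arbitrary illustration, not Seznec's actual skewing function, and the geometry is assumed.

NUM_SETS = 64          # assumed sets per way
OFFSET_BITS = 6        # assumed 64-byte blocks

def index_way0(addr):
    return (addr >> OFFSET_BITS) % NUM_SETS            # conventional index bits

def index_way1(addr):
    block = addr >> OFFSET_BITS
    return (block ^ (block >> 6)) % NUM_SETS           # mix in higher address bits so the ways disagree

# Two addresses that collide in way 0 usually land in different sets of way 1:
a, b = 0x0000, 0x1000
print(index_way0(a) == index_way0(b), index_way1(a) == index_way1(b))   # True False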

IMPROVING BASIC CACHE PERFORMANCE • Reducing miss rate – More associativity – Alternatives/enhancements to associativity • Victim caches, hashing, pseudo-associativity, skewed associativity – Better replacement/insertion policies – Software approaches • Reducing miss latency/cost – Multi-level caches – Critical word first – Subblocking/sectoring – Better replacement/insertion policies – Non-blocking caches (multiple cache misses in parallel) – Multiple accesses per cycle – Software approaches

MLP-AWARE CACHE REPLACEMENT Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt, "A Case for MLP-Aware Cache Replacement," Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), pages 167-177, Boston, MA, June 2006.

Memory Level Parallelism (MLP) [Timeline: parallel misses A and B overlap; miss C occurs alone as an isolated miss] • Memory Level Parallelism (MLP) means generating and servicing multiple memory accesses in parallel [Glew ’98] • Several techniques to improve MLP (e.g., out-of-order execution, runahead execution) • MLP varies. Some misses are isolated and some parallel • How does this affect cache replacement?

Traditional Cache Replacement Policies • Traditional cache replacement policies try to reduce miss count • Implicit assumption: Reducing miss count reduces memory-related stall time • Misses with varying cost/MLP break this assumption! • Eliminating an isolated miss helps performance more than eliminating a parallel miss • Eliminating a higher-latency miss could help performance more than eliminating a lower-latency miss

An Example [Figure: a reference stream P4, P3, P2, P1, P2, P3, P4, S1, S2, S3] • Misses to blocks P1, P2, P3, P4 can be parallel • Misses to blocks S1, S2, and S3 are isolated • Two replacement algorithms: 1. Minimizes miss count (Belady’s OPT) 2. Reduces isolated misses (MLP-Aware) • For a fully associative cache containing 4 blocks

Fewest Misses = Best Performance [Figure: the example reference stream run on the 4-block fully associative cache under both policies; Belady’s OPT replacement has Misses = 4 and Stalls = 4, while MLP-Aware replacement has Misses = 6 but only Stalls = 2, saving cycles overall, so fewer misses does not imply better performance]

Motivation • MLP varies. Some misses more costly than others • MLP-aware replacement can improve performance by reducing costly misses

Outline • Introduction • MLP-Aware Cache Replacement – Model for Computing Cost – Repeatability of Cost – A Cost-Sensitive Replacement Policy • Practical Hybrid Replacement – Tournament Selection – Dynamic Set Sampling – Sampling Based Adaptive Replacement • Summary

Computing MLP-Based Cost • Cost of miss is number of cycles the miss stalls the processor • Easy to compute for isolated miss • Divide each stall cycle equally among all parallel misses [Timeline t0..t5: while misses A and B overlap, each accrues ½ per cycle; while a miss is outstanding alone (e.g., the isolated miss C), it accrues 1 per cycle]

A First-Order Model • Miss Status Holding Register (MSHR) tracks all in-flight misses • Add a field mlp-cost to each MSHR entry • Every cycle, for each demand entry in the MSHR: mlp-cost += (1/N), where N = number of demand misses in the MSHR
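
A per-cycle sketch of the cost accounting described above; the MSHR representation and field names are simplifications for illustration.

class MSHREntry:
    def __init__(self, block_addr):
        self.block_addr = block_addr
        self.mlp_cost = 0.0     # accumulates this miss's share of stall cycles

def tick(outstanding_demand_misses):
    # Called once per cycle: each in-flight demand miss absorbs 1/N of the cycle.
    n = len(outstanding_demand_misses)
    for entry in outstanding_demand_misses:
        entry.mlp_cost += 1.0 / n

An isolated miss therefore accrues about 1 per stall cycle, while two fully overlapped misses accrue about 0.5 each, matching the timeline on the previous slide.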

Machine Configuration • Processor – aggressive, out-of-order, 128-entry instruction window • L2 Cache – 1 MB, 16-way, LRU replacement, 32-entry MSHR • Memory – 400 cycle bank access, 32 banks • Bus – Roundtrip delay of 11 bus cycles (44 processor cycles)

Distribution of MLP-Based Cost [Chart: % of all L2 misses vs. MLP-based cost, per benchmark] • MLP-Based Cost varies. Does it repeat for a given cache block?

Repeatability of Cost • An isolated miss can be a parallel miss next time • Can current cost be used to estimate future cost? • Let d = difference in cost for successive misses to a block – Small d → cost repeats – Large d → cost varies significantly

Repeatability of Cost [Chart: for each benchmark, the fraction of misses with d < 60, with 59 < d < 120, and with d > 120] • In general d is small → repeatable cost • When d is large (e.g., parser, mgrid) → performance loss

The Framework [Diagram: the L2 cache sits between the ICACHE/DCACHE/PROCESSOR and MEMORY; the MSHR feeds Cost Calculation Logic (CCL), which feeds the Cost-Aware Replacement Engine (CARE) in the L2 cache] • Computed mlp-based cost is quantized to a 3-bit value

Design of MLP-Aware Replacement policy • LRU considers only recency and no cost: Victim-LRU = min { Recency(i) } • Decisions based only on cost and no recency hurt performance: the cache stores useless high cost blocks • A Linear (LIN) function that considers recency and cost: Victim-LIN = min { Recency(i) + S*cost(i) } – S = significance of cost – Recency(i) = position in LRU stack – cost(i) = quantized cost
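
A minimal sketch of the two victim-selection rules above, following the slide's convention that Recency(i) is the position in the LRU stack (0 = least recently used) and cost(i) is the 3-bit quantized cost; the value of S and the data layout are assumptions.

def select_victim_lru(ways):
    # ways: list of dicts with 'recency' and 'qcost'
    return min(ways, key=lambda b: b['recency'])

def select_victim_lin(ways, S=4):
    return min(ways, key=lambda b: b['recency'] + S * b['qcost'])

ways = [{'recency': 0, 'qcost': 7},   # least recently used, but a high-cost (isolated) miss to refetch
        {'recency': 1, 'qcost': 0},
        {'recency': 2, 'qcost': 0},
        {'recency': 3, 'qcost': 1}]
print(select_victim_lru(ways))   # evicts the recency-0 block
print(select_victim_lin(ways))   # with S=4, spares it and evicts the cheap recency-1 block instead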

Results for the LIN policy • Performance loss for parser and mgrid due to large d.

Effect of LIN policy on Cost [Chart: change in misses and IPC across benchmark groups: Miss +4%, IPC +4%; Miss -11%, IPC +22%; Miss +30%, IPC -33%]

Outline • Introduction • MLP-Aware Cache Replacement – Model for Computing Cost – Repeatability of Cost – A Cost-Sensitive Replacement Policy • Practical Hybrid Replacement – Tournament Selection – Dynamic Set Sampling – Sampling Based Adaptive Replacement • Summary

Tournament Selection (TSEL) of Replacement Policies for a Single Set [Diagram: for set A, an ATD-LIN and an ATD-LRU (auxiliary tag directories) are consulted alongside the MTD (main tag directory); a saturating counter (SCTR) accumulates their relative benefit] • If the MSB of SCTR is 1, MTD uses LIN; else MTD uses LRU • SCTR update: ATD-LIN hit and ATD-LRU hit → unchanged; ATD-LIN miss and ATD-LRU miss → unchanged; ATD-LIN hit and ATD-LRU miss → SCTR += cost of miss in ATD-LRU; ATD-LIN miss and ATD-LRU hit → SCTR -= cost of miss in ATD-LIN
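
A sketch of the saturating-counter update described above; the counter width and saturation bounds are assumptions, and only a single set is modeled.

SCTR_BITS = 10                      # assumed counter width
SCTR_MAX = (1 << SCTR_BITS) - 1

def update_sctr(sctr, lin_hit, lru_hit, cost_lru_miss=0, cost_lin_miss=0):
    # Both hit or both miss: no information, counter unchanged.
    if lin_hit and not lru_hit:
        sctr += cost_lru_miss       # LIN avoided a miss that LRU took
    elif lru_hit and not lin_hit:
        sctr -= cost_lin_miss       # LRU avoided a miss that LIN took
    return max(0, min(SCTR_MAX, sctr))

def mtd_policy(sctr):
    # MSB of SCTR set -> the main tag directory uses LIN; otherwise LRU.
    return "LIN" if (sctr >> (SCTR_BITS - 1)) & 1 else "LRU"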

Extending TSEL to All Sets • Implementing TSEL on a per-set basis is expensive • Counter overhead can be reduced by using a global counter [Diagram: ATD-LIN and ATD-LRU track all sets A through H; a single global SCTR chooses the policy for all sets in the MTD]

Dynamic Set Sampling • Not all sets are required to decide the best policy • Have the ATD entries only for a few sets [Diagram: ATD-LIN and ATD-LRU contain entries only for sets B, E, and G; the global SCTR chooses the policy for all sets in the MTD] • Sets that have ATD entries (B, E, G) are called leader sets

Dynamic Set Sampling • How many sets are required to choose the best performing policy? – Bounds using analytical model and simulation (in paper) – DSS with 32 leader sets performs similar to having all sets – Last-level cache typically contains 1000s of sets, thus ATD entries are required for only 2%-3% of the sets • ATD overhead can further be reduced by using MTD to always simulate one of the policies (say LIN)
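
A rough sketch of how leader sets might be spaced and how follower sets inherit the winning policy; the set count, the even-spacing scheme, and running LIN in the leader sets' MTD are assumptions in the spirit of the slide, not the paper's exact mechanism.

NUM_SETS = 1024                       # assumed number of last-level-cache sets
LEADER_STRIDE = NUM_SETS // 32        # 32 leader sets, evenly spaced (selection scheme assumed)

def is_leader(set_index):
    return set_index % LEADER_STRIDE == 0

def policy_for_set(set_index, sctr, sctr_bits=10):
    if is_leader(set_index):
        return "LIN"                  # MTD always simulates one policy (say LIN) in the leader sets
    # Follower sets use whichever policy the global counter currently favors.
    return "LIN" if (sctr >> (sctr_bits - 1)) & 1 else "LRU"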

Sampling Based Adaptive Replacement (SBAR) [Diagram: the MTD holds all sets A through H; an ATD-LRU holds only the leader sets (B, E, G); the SCTR decides the policy only for the follower sets] • The storage overhead of SBAR is less than 2 KB (0.2% of the baseline 1 MB cache)

Results for SBAR

SBAR adaptation to phases [Chart: over the phases of ammp, regions where LIN is better alternate with regions where LRU is better] • SBAR selects the best policy for each phase of ammp

Outline • Introduction • MLP-Aware Cache Replacement – Model for Computing Cost – Repeatability of Cost – A Cost-Sensitive Replacement Policy • Practical Hybrid Replacement – Tournament Selection – Dynamic Set Sampling – Sampling Based Adaptive Replacement • Summary

Summary • MLP varies. Some misses are more costly than others • MLP-aware cache replacement can reduce costly misses • Proposed a runtime mechanism to compute MLP-Based cost and the LIN policy for MLP-aware cache replacement • SBAR allows dynamic selection between LIN and LRU with low hardware overhead • Dynamic set sampling used in SBAR also enables other cache related optimizations

COMPUTER ARCHITECTURE CS 6354 Caches Samira Khan University of Virginia Apr 14, 2016 The content and concept of this course are adapted from CMU ECE 740