Memory Hierarchies
Adapted from slides by Sally McKee, Cornell University
Copyright Gary S. Tyson 2003, Copyright Sally A. McKee 2005

SRAM vs. DRAM

- SRAM (static random access memory)
  - Faster than DRAM
  - Each storage cell is larger, so smaller capacity for same area
  - 2-10 ns access time
- DRAM (dynamic random access memory)
  - Each storage cell tiny (capacitance on wire)
  - Can get 2 Gb chips today
  - 50-70 ns access time
  - Leaky: need to periodically refresh data
  - What happens on a read?
- CPU clock rates ~0.2 ns-2 ns (5 GHz-500 MHz)
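To put the access times next to the clock rates, here is a minimal sketch that converts an access time into clock cycles. The function name `access_time_in_cycles` is illustrative and not from the slides; the numbers are the ones quoted above.

```python
def access_time_in_cycles(access_ns, cycle_ns):
    """Clock cycles that elapse during one memory access."""
    return access_ns / cycle_ns

# Access times and cycle times quoted on the slide, in nanoseconds.
for name, access_ns in [("SRAM", 2), ("SRAM", 10), ("DRAM", 50), ("DRAM", 70)]:
    fast_clock = access_time_in_cycles(access_ns, 0.2)  # 0.2 ns cycle = 5 GHz
    slow_clock = access_time_in_cycles(access_ns, 2.0)  # 2 ns cycle = 500 MHz
    print(f"{name} {access_ns} ns: {fast_clock:.0f} cycles at 5 GHz, "
          f"{slow_clock:.0f} cycles at 500 MHz")
```

With these figures, a 50 ns DRAM access costs roughly 250 cycles on a 5 GHz core, which is why the gap between processor and DRAM speed matters.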

Terminology

- Temporal locality:
  - If memory location X is accessed, then it is more likely to be reaccessed in the near future than some random location Y
  - Caches exploit temporal locality by placing a memory element that has been referenced into the cache
- Spatial locality:
  - If memory location X is accessed, then locations near X are more likely to be accessed in the near future than some random location Y
  - Caches exploit spatial locality by allocating a cache line of data (including data near the referenced location)

Cache Design 101

Memory pyramid:
- Reg (100s of bytes): part of pipeline
- L1 Cache (several KB): 1-3 cycle access
- L2 Cache (½-32 MB): 6-15 cycle access
- L3 becoming more common (sometimes VERY LARGE)
- Memory (128 MB to a few GB): 50-300 cycle access
- Disk (many GB): millions of cycles per access!

These are rough numbers: mileage may vary for latest/greatest.
Caches USUALLY made of SRAM.

Cache design issues

- Block placement: where can a block be placed in the higher memory level?
  - Fully-associative: anywhere
  - Direct-mapped: exactly one place
  - Set-associative: some small number of places
- Block identification: how does the processor find the block if it is there at the higher memory level?
- Block replacement: which block should be replaced from the higher level to make room for a new block?
- Write strategy: are lower levels updated when a block in the higher level is written?
  - Write-through: yes
  - Write-back: no, update lower level only when block is evicted from higher level

A Simple Fully Associative Cache

- Cache: 2 cache lines, 3-bit tag field, 2-byte block; each line holds a valid bit (V), a tag, and data
- Memory: 16 bytes at addresses 0-15 holding 100, 110, 120, ..., 250
- Registers: R0-R3
- How many address bits?
- Instruction sequence: Ld R1 ← M[1]; Ld R2 ← M[5]; Ld R3 ← M[1]; Ld R3 ← M[4]; Ld R2 ← M[0]

Access  Instruction   Addr  Result  Cache lines after access (tag: data)  Loaded    Misses  Hits
start   -             -     -       both lines invalid (V = 0)            -         0       0
1st     Ld R1 ← M[1]  0001  miss    0: 100 110                            R1 = 110  1       0
2nd     Ld R2 ← M[5]  0101  miss    0: 100 110; 2: 140 150                R2 = 150  2       0
3rd     Ld R3 ← M[1]  0001  hit     0: 100 110; 2: 140 150                R3 = 110  2       1
4th     Ld R3 ← M[4]  0100  hit     0: 100 110; 2: 140 150                R3 = 140  2       2
5th     Ld R2 ← M[0]  0000  hit     0: 100 110; 2: 140 150                R2 = 100  2       3

In each address, the low bit is the block offset and the upper three bits are the tag. Replacement is LRU.
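The five-load walkthrough above can be replayed with a small simulator. This is a minimal sketch assuming LRU replacement, 2 lines, and 2-byte blocks as in the example; the function name `simulate_fully_associative` is illustrative.

```python
from collections import OrderedDict

def simulate_fully_associative(addresses, num_lines=2, block_size=2):
    """Return (hits, misses) for a fully associative LRU cache serving loads."""
    cache = OrderedDict()  # keys are block tags, least recently used first
    hits = misses = 0
    for addr in addresses:
        tag = addr // block_size       # everything above the block offset
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)     # mark as most recently used
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[tag] = True
    return hits, misses

# Addresses of the five loads: M[1], M[5], M[1], M[4], M[0]
print(simulate_fully_associative([1, 5, 1, 4, 0]))  # (3, 2)
```

The result matches the final counters of the walkthrough: 2 misses and 3 hits.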

Block size

- Decide on the block size
  - How? Simulate lots of different block sizes and see which one gives the best performance
  - Most systems use a block size between 32 bytes and 128 bytes
  - Longer sizes reduce the overhead by
    - Reducing the number of tags
    - Reducing the size of each tag
  - But beyond some block size, you bring in too much data that you do not use: cache pollution
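A rough way to see the tag-overhead argument is to count tag bits for a few block sizes. The sketch below assumes a fully associative cache, 32-bit addresses, and a 32 KB capacity; all three are assumptions chosen for illustration, as is the name `tag_overhead_bits`.

```python
def tag_overhead_bits(capacity_bytes, block_bytes, address_bits=32):
    """Return (number of tags, bits per tag, total tag bits).

    Assumes a fully associative cache and power-of-two sizes, so the tag is
    the address minus the block-offset bits.
    """
    lines = capacity_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1  # log2 for powers of two
    tag_bits = address_bits - offset_bits
    return lines, tag_bits, lines * tag_bits

for block in (32, 64, 128):
    print(block, tag_overhead_bits(32 * 1024, block))
```

Going from 32-byte to 128-byte blocks cuts the number of tags by four and shortens each tag by two bits under these assumptions.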

Write strategy

- Where should you write the result of a store?
- If that memory location is in the cache?
  - Send it to the cache
  - Should we also send it to memory right away? (write-through policy)
  - Wait until we kick the block out (write-back policy)
- If it is not in the cache?
  - Allocate the line (put it in the cache)? (write allocate policy)
  - Write it directly to memory without allocation? (no write allocate policy)

Handling Stores (Write-Through)

- Assume write-allocate policy
- Cache: 2 lines, 2-byte blocks, LRU replacement; each line holds V, tag, and data
- Memory (addresses 0-15): 78 29 120 123 71 150 162 173 18 21 33 28 19 200 210 225
- Instruction sequence: Ld R1 ← M[1]; Ld R2 ← M[7]; St R2 → M[0]; St R1 → M[5]; Ld R2 ← M[10]

Ref    Instruction     Result  Cache lines after (tag: data)  Memory update  Misses  Hits
start  -               -       both lines invalid (V = 0)     -              0       0
REF 1  Ld R1 ← M[1]    miss    0: 78 29                       -              1       0
REF 2  Ld R2 ← M[7]    miss    0: 78 29; 3: 162 173           -              2       0
REF 3  St R2 → M[0]    hit     0: 173 29; 3: 162 173          M[0] = 173     2       1
REF 4  St R1 → M[5]    miss    0: 173 29; 2: 71 29            M[5] = 29      3       1
REF 5  Ld R2 ← M[10]   miss    5: 33 28; 2: 71 29             -              4       1

Registers: R1 = 29 after REF 1; R2 = 173 after REF 2 and R2 = 33 after REF 5. Every store updates memory immediately, so evicted lines are simply dropped.

How Many Memory References?

- Each miss reads a block (only two bytes in this cache)
- Each store writes a byte
- Total reads: eight bytes
- Total writes: two bytes
- But caches generally miss less than 20% of the time, and usually have much lower miss rates... but it depends on both cache and application!
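The byte counts above can be checked by replaying the write-through example. This is a minimal sketch assuming write-allocate, LRU replacement, 2 lines, and 2-byte blocks; the function name `run_write_through` and the tuple encoding of operations are illustrative, and store values are passed directly instead of modeling registers.

```python
from collections import OrderedDict

def run_write_through(ops, memory, num_lines=2, block_size=2):
    """Replay ("ld", addr) / ("st", addr, value) ops; return traffic statistics."""
    cache = OrderedDict()  # tag -> list of block bytes, least recently used first
    stats = {"hits": 0, "misses": 0, "bytes_read": 0, "bytes_written": 0}
    for op in ops:
        kind, addr = op[0], op[1]
        tag, offset = divmod(addr, block_size)
        if tag in cache:
            stats["hits"] += 1
            cache.move_to_end(tag)
        else:
            stats["misses"] += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)  # memory is already up to date
            base = tag * block_size
            cache[tag] = memory[base:base + block_size]  # fetch the whole block
            stats["bytes_read"] += block_size
        if kind == "st":
            cache[tag][offset] = op[2]   # update the cached copy
            memory[addr] = op[2]         # and write through to memory
            stats["bytes_written"] += 1
    return stats

memory = [78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225]
ops = [("ld", 1), ("ld", 7), ("st", 0, 173), ("st", 5, 29), ("ld", 10)]
print(run_write_through(ops, memory))
```

For this sequence the sketch reports 4 misses, 1 hit, 8 bytes read, and 2 bytes written, in line with the totals on the slide.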

Write-Through vs. Write-Back

- Can we also design the cache NOT to write all stores immediately to memory?
  - Keep the most current copy in cache, and update memory when that data is evicted (write-back policy)
- Do we need to write back all evicted lines?
  - No, only blocks that have been stored into (written)
  - Keep a "dirty bit": reset when the line is allocated, set when the block is written
  - If a block is "dirty" when evicted, write its data back to memory

Handling Stores (Write-Back)

- Cache: 2 lines, 2-byte blocks, LRU replacement; each line holds V, a dirty bit (d), tag, and data
- Memory (addresses 0-15): 78 29 120 123 71 150 162 173 18 21 33 28 19 200 210 225
- Instruction sequence: Ld R1 ← M[1]; Ld R2 ← M[7]; St R2 → M[0]; St R1 → M[5]; Ld R2 ← M[10]

Ref    Instruction     Result  Cache lines after (tag, d: data)        Memory update                         Misses  Hits
start  -               -       both lines invalid (V = 0)              -                                     0       0
REF 1  Ld R1 ← M[1]    miss    0, d=0: 78 29                           -                                     1       0
REF 2  Ld R2 ← M[7]    miss    0, d=0: 78 29; 3, d=0: 162 173          -                                     2       0
REF 3  St R2 → M[0]    hit     0, d=1: 173 29; 3, d=0: 162 173         none (line marked dirty)              2       1
REF 4  St R1 → M[5]    miss    0, d=1: 173 29; 2, d=1: 71 29           none (evicted line 3 was clean)       3       1
REF 5  Ld R2 ← M[10]   miss    5, d=0: 33 28; 2, d=1: 71 29            M[0] = 173 (dirty line 0 written back) 4       1

Registers: R1 = 29 after REF 1; R2 = 173 after REF 2 and R2 = 33 after REF 5. The block allocated by REF 4 holds addresses 4-5, so its tag is 2. Memory location 5 still holds 150 at the end of the sequence; it is updated only when the dirty line with tag 2 is eventually evicted.

How many memory references?

- Each miss reads a block (two bytes in this cache)
- Each evicted dirty cache line writes a block
- Total reads: eight bytes
- Total writes: four bytes (after final eviction)
- Choose write-back or write-through?
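The write-back totals can be reproduced the same way. The sketch below assumes write-allocate, LRU replacement, 2 lines, 2-byte blocks, and a final flush of all dirty lines (the "final eviction" above); `run_write_back` and the operation encoding are illustrative names.

```python
from collections import OrderedDict

def run_write_back(ops, memory, num_lines=2, block_size=2):
    """Replay ("ld", addr) / ("st", addr, value) ops with a dirty bit per line."""
    cache = OrderedDict()  # tag -> {"data": [...], "dirty": bool}, LRU first
    stats = {"hits": 0, "misses": 0, "bytes_read": 0, "bytes_written": 0}

    def evict(tag, line):
        # Only dirty lines cost memory traffic when they leave the cache.
        if line["dirty"]:
            base = tag * block_size
            memory[base:base + block_size] = line["data"]
            stats["bytes_written"] += block_size

    for op in ops:
        kind, addr = op[0], op[1]
        tag, offset = divmod(addr, block_size)
        if tag in cache:
            stats["hits"] += 1
            cache.move_to_end(tag)
        else:
            stats["misses"] += 1
            if len(cache) == num_lines:
                evict(*cache.popitem(last=False))
            base = tag * block_size
            cache[tag] = {"data": memory[base:base + block_size], "dirty": False}
            stats["bytes_read"] += block_size
        if kind == "st":
            cache[tag]["data"][offset] = op[2]
            cache[tag]["dirty"] = True  # memory is now stale for this block
    while cache:  # final eviction of everything still cached
        evict(*cache.popitem(last=False))
    return stats

memory = [78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225]
ops = [("ld", 1), ("ld", 7), ("st", 0, 173), ("st", 5, 29), ("ld", 10)]
print(run_write_back(ops, memory))
```

Here the two stores cost four bytes of write traffic (two dirty blocks) instead of two bytes under write-through; write-back pays off when a block is stored into many times before it is evicted.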

Direct-Mapped Cache

- Address fields (5-bit address, example 01011): Tag (2-bit), Line Index (2-bit), Block Offset (1-bit)
- Cache line fields: V, d, tag, data
- Memory (addresses 00000-11111): 78 29 120 123 71 150 162 173 18 21 33 28 19 200 210 225 23 218 10 44 16 141 28 214 33 98 181 129 119 42 66 74
- Compulsory Miss: First reference to memory block
- Capacity Miss: Working set doesn't fit in cache
- Conflict Miss: Working set maps to same cache line
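The address fields of the direct-mapped example can be extracted with integer arithmetic. A minimal sketch, using the field widths from the slide (1-bit offset, 2-bit index, 2-bit tag); the function name `split_address` is illustrative.

```python
def split_address(addr, offset_bits=1, index_bits=2):
    """Split an address into (tag, line index, block offset) fields."""
    offset = addr % (2 ** offset_bits)
    index = (addr // 2 ** offset_bits) % (2 ** index_bits)
    tag = addr // 2 ** (offset_bits + index_bits)
    return tag, index, offset

# Example address from the slide: 01011 -> tag 01, index 01, offset 1
print(split_address(0b01011))  # (1, 1, 1)
```

The line index selects the single line where the block may live; the stored tag is then compared with the address tag to decide hit or miss.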

Two-Way Set Associative Cache

- Address fields (5-bit address, example 01101): Larger (3-bit) Tag, 1-bit Set Index, Block Offset (unchanged)
- Cache line fields: V, d, tag, data
- Memory (addresses 00000-11111): 78 29 120 123 71 150 162 173 18 21 33 28 19 200 210 225 23 218 10 44 16 141 28 214 33 98 181 129 119 42 66 74
- Rule of thumb: Increasing associativity decreases conflict misses. A 2-way associative cache has about the same hit rate as a direct mapped cache twice the size.
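To see conflict misses disappear with associativity, the following sketch compares a direct-mapped and a two-way cache of the same total size (4 lines of 2 bytes) on a trace that alternates between two blocks mapping to the same line. The trace and the name `count_misses` are assumptions made for illustration; replacement within a set is LRU.

```python
def count_misses(addresses, num_sets, ways, block_size=2):
    """Count misses in a set-associative cache with LRU replacement per set."""
    sets = [[] for _ in range(num_sets)]  # each set lists its tags, LRU first
    misses = 0
    for addr in addresses:
        block = addr // block_size
        index, tag = block % num_sets, block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.remove(tag)      # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)       # evict the least recently used way
        lines.append(tag)
    return misses

trace = [0, 8] * 4  # addresses 00000 and 01000 share line index 00
print(count_misses(trace, num_sets=4, ways=1))  # direct-mapped: 8
print(count_misses(trace, num_sets=2, ways=2))  # two-way: 2
```

In the direct-mapped cache the two blocks keep evicting each other, while the two-way cache holds both in one set and only takes the two compulsory misses.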

Sources of cache misses

- Cold misses:
  - The first time the processor accesses a line, there will be a cache miss
  - Also known as compulsory misses
- Capacity misses:
  - If the number of distinct cache lines accessed between two references to the same line is greater than the capacity of the cache, and the second reference is a miss, it is called a capacity miss
- Conflict misses:
  - Misses caused by evictions of a line because of associativity conflicts
  - Cannot occur in fully associative caches

3 Cs Absolute Miss Rate

[Figure: absolute miss rate broken down by the 3 Cs, with the conflict-miss component labeled. Credit: Siddhartha Chatterjee, Fall 2000]

Effects of Varying Cache Parameters

Total cache size = block size × # sets × associativity
- Positives:
  - Should decrease miss rate
- Negatives:
  - May increase hit time
  - Probably increases area requirements (how are these related?)
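The size relation can be written as a one-line function. A small sketch; the function names are illustrative, and the example values are the toy caches used earlier in these slides.

```python
def total_cache_bytes(block_size, num_sets, associativity):
    """Data capacity of a cache: block size x number of sets x ways per set."""
    return block_size * num_sets * associativity

def num_sets_for(total_bytes, block_size, associativity):
    """Number of sets implied by a fixed capacity, block size, and associativity."""
    return total_bytes // (block_size * associativity)

print(total_cache_bytes(2, 4, 1))  # direct-mapped example: 8 bytes
print(total_cache_bytes(2, 2, 2))  # two-way example: 8 bytes
print(num_sets_for(8, 2, 2))       # 2 sets
```

Holding capacity fixed, raising associativity or block size necessarily lowers the number of sets, which is why the three parameters cannot be tuned independently.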

Effects of Varying Cache Parameters

Bigger block size
- Positives:
  - Exploits spatial locality; reduces compulsory misses
  - Reduces tag overhead (bits)
  - Reduces transfer overhead (address, burst data mode)
- Negatives:
  - Fewer blocks for a given size; increases conflict misses
  - Increases miss transfer time (multi-cycle transfers)
  - Wastes bandwidth for non-spatial data

Effects of Varying Cache Parameters

Increasing associativity
- Positives:
  - Reduces conflict misses
  - Low-associative caches can have pathological behavior (very high miss rates)
- Negatives:
  - Increased hit time
  - More hardware requirements (comparators, muxes, bigger tags)
  - Diminishing improvements past 4- or 8-way
  - Belady's anomaly (eventually more associativity = lower performance!)

Effects of Varying Cache Parameters

Replacement strategy (for associative caches): how is the evicted line chosen?
1. LRU: intuitive; difficult to implement with high associativity; worst-case performance can occur (N+1 element array)
2. Random: pseudo-random is easy to implement; performance close to LRU for high associativity; usually avoids pathological behavior
3. Optimal (Belady): replace the block whose next reference is farthest in the future
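The "N+1 element array" worst case for LRU is easy to reproduce: loop repeatedly over N+1 blocks with an N-line fully associative cache. The sketch below compares LRU with random replacement on that trace; the function name `hit_rate`, the fixed seed, and the trace length are assumptions for illustration.

```python
import random

def hit_rate(blocks, num_lines, policy, seed=0):
    """Hit rate of a fully associative cache under "lru" or "random" replacement."""
    rng = random.Random(seed)
    lines = []  # resident blocks; for LRU, least recently used first
    hits = 0
    for block in blocks:
        if block in lines:
            hits += 1
            if policy == "lru":
                lines.remove(block)
                lines.append(block)  # move to most recently used position
        else:
            if len(lines) == num_lines:
                victim = 0 if policy == "lru" else rng.randrange(num_lines)
                lines.pop(victim)
            lines.append(block)
    return hits / len(blocks)

trace = list(range(5)) * 100  # cycle over N+1 = 5 blocks with N = 4 lines
print("LRU:   ", hit_rate(trace, 4, "lru"))     # 0.0: every access misses
print("Random:", hit_rate(trace, 4, "random"))  # some hits survive
```

LRU always evicts exactly the block that will be needed next on this trace, so it never hits; random replacement keeps some of the useful blocks by chance.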

Other Cache Design Decisions

Write Policy: how to deal with write misses?
- Write-through / no-allocate
  - Total traffic? Read misses × block size + writes
  - Common for L1 caches backed by L2 (especially on-chip)
- Write-back / write-allocate
  - Needs a dirty bit to determine whether cache data differs
  - Total traffic? (read misses + write misses) × block size + dirty-block-evictions × block size
  - Common for L2 caches (memory bandwidth limited)
- Variation: Write validate
  - Write-allocate without fetch-on-write
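The two traffic formulas can be written out directly. In this sketch the function names are illustrative, and `write_size` (bytes per store, defaulting to 1) is an added assumption so that the write-through result is in bytes.

```python
def write_through_traffic(read_misses, writes, block_size, write_size=1):
    """Bytes moved for write-through / no-allocate."""
    return read_misses * block_size + writes * write_size

def write_back_traffic(read_misses, write_misses, dirty_evictions, block_size):
    """Bytes moved for write-back / write-allocate."""
    return (read_misses + write_misses) * block_size + dirty_evictions * block_size

# Write-back example from earlier in the slides: 3 load misses, 1 store miss,
# 2 dirty blocks eventually evicted, 2-byte blocks.
print(write_back_traffic(read_misses=3, write_misses=1, dirty_evictions=2, block_size=2))  # 12
```

The 12 bytes for the write-back example are the 8 bytes read plus the 4 bytes written that were counted in the walkthrough.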

Other Cache Design Decisions

- Write Buffering
  - Delay writes until bandwidth is available
    - Put them in a FIFO buffer
    - Only stall on a write if the buffer is full
    - Use bandwidth for reads first (since they have latency problems)
  - Important for write-through caches, where write traffic is frequent
- Write-Back buffer
  - Holds evicted (dirty) lines for Write-Back caches
  - Gives reads priority on the L2 or memory bus
  - Usually only needs a small buffer

Prefetching

- Already done: loading an entire line assumes spatial locality
- Extend this… Next Line Prefetch
  - Bring in the next block in memory as well on a miss
  - Very good for Icache (why?)
- Software prefetch
  - Loads to R0 have no data dependency
- Aggressive/speculative prefetch useful for L2
- Speculative prefetch problematic for L1
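Next-line prefetch can be illustrated on a purely sequential access stream, which is roughly what straight-line instruction fetch looks like. The sketch assumes a cache large enough that nothing is evicted and 2-byte blocks; the function name `misses_with_prefetch` and the trace are illustrative.

```python
def misses_with_prefetch(addresses, block_size=2, prefetch_next=False):
    """Count demand misses, optionally fetching the following block on each miss."""
    resident = set()  # assumption: no evictions, the cache never fills up
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block not in resident:
            misses += 1
            resident.add(block)
            if prefetch_next:
                resident.add(block + 1)  # bring in the next block as well
    return misses

sequential = list(range(16))
print(misses_with_prefetch(sequential))                      # 8
print(misses_with_prefetch(sequential, prefetch_next=True))  # 4
```

On this stream every prefetched block is used, so the miss count halves; on a stream without spatial locality the extra blocks would only add traffic.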

Calculating the Effects of Latency

Does a cache miss reduce performance?
- It depends on whether critical instructions are waiting for the result

Calculating the Effects of Latency

It depends on whether critical resources are held up
- Blocking: When a miss occurs, all later references to the cache must wait. This is a resource conflict.
- Non-blocking: Allows later references to access the cache while the miss is being processed. Generally there is some limit to how many outstanding misses can be bypassed.

Programming for caches

- How do we reduce the number of cache misses?
  - How do we reduce cold misses?
  - How do we reduce capacity misses?
  - How do we reduce conflict misses?
- How do we reduce the impact of cache misses on overall performance?

Reduce Misses by Compiler Optimizations

- Instructions
  - Reorder procedures in memory so as to reduce misses
  - Profiling to look at conflicts
  - McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks
- Data
  - Merging Arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  - Loop Interchange: change nesting of loops to access data in the order stored in memory
  - Loop Fusion: combine two independent loops that have the same looping and some overlapping variables
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
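The effect of loop interchange can be estimated by counting misses for the two loop orders over an array stored row by row. This is a sketch under assumed parameters (16 x 16 array, 4 elements per block, a 4-line fully associative LRU cache); the function name `traversal_misses` is illustrative.

```python
def traversal_misses(rows, cols, row_major, block_elems=4, num_lines=4):
    """Count cache misses while visiting every element of a row-major array."""
    lines = []  # resident blocks, least recently used first
    misses = 0
    outer, inner = (rows, cols) if row_major else (cols, rows)
    for a in range(outer):
        for b in range(inner):
            i, j = (a, b) if row_major else (b, a)
            block = (i * cols + j) // block_elems  # block holding element [i][j]
            if block in lines:
                lines.remove(block)
            else:
                misses += 1
                if len(lines) == num_lines:
                    lines.pop(0)
            lines.append(block)
    return misses

print(traversal_misses(16, 16, row_major=True))   # 64: one miss per block
print(traversal_misses(16, 16, row_major=False))  # 256: every access misses
```

Walking down columns touches 16 different blocks before returning to any of them, which exceeds the 4-line cache, so interchanging the loops to follow the storage order removes three quarters of the misses in this setup.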