- Slides: 72
Cache Memory
Caches work on the principle of locality of program behaviour. This principle states that programs access a relatively small portion of their address space at any instant of time. There are three different types of locality.
10/20/2021
Temporal Locality
References tend to repeat: if an item is referenced, it will tend to be referenced again soon. If a sequence of references X1, X2, X3, X4 has recently been made, then it is likely that the next reference will be one of X1, X2, X3, X4.

Spatial Locality
If an item is referenced, there is a high probability that items whose addresses are close by will be referenced soon. References tend to cluster into distinct regions (working sets).

Sequentiality
Sequentiality is a restricted type of spatial locality and can be regarded as a subset of it. It states that given that a reference has been made to a particular location S, there is a high probability that within the next several references a reference to location S+1 will be made.

Locality in Programs
• Locality in programs arises from simple and natural program structures.
• For example, loops, where instructions and data are normally accessed sequentially.
• Instructions in general are normally accessed sequentially.
• Some data accesses, such as the elements of an array, show a high degree of spatial locality.
Memory Hierarchy
• Taking advantage of this principle of locality of program behaviour, the memory of a computer is implemented as a memory hierarchy.
• It consists of multiple levels of memory with different access speeds and sizes.
• The faster memory levels have a higher cost per bit of storage, so they tend to be smaller in size.

Memory Hierarchy
• This implementation creates the illusion for the user that they can access as much memory as is available in the cheapest technology, while getting the access times of the faster memory.

Basic Notions
Hit: Processor references that are found in the cache are called cache hits.
Cache Miss: Processor references not found in the cache are called cache misses. On a cache miss the cache control mechanism must fetch the missing data from main memory and place it in the cache. Usually the cache fetches a spatial locality (a set of contiguous words) called a line from memory.

Basic Notions
Hit Rate: Fraction of memory references found in the cache, i.e. references found in cache / total memory references.
Miss Rate: (1 - hit rate), the fraction of memory references not found in the cache.
Hit Time: Time to service a memory reference found in the cache (including the time to determine hit or miss).

Basic Notions
Miss Penalty: Time required to fetch a block into a level of the memory hierarchy from a lower level. This includes the time to access the block, transmit it to the higher level, and insert it in the level that experienced the miss.
The primary measure of cache performance is the miss rate. In most processor designs the CPU stalls, or ceases activity, on a cache miss.
Processor-Cache Interface
The interface can be characterized by a number of parameters.
• Access time for a reference found in the cache (a hit): depends on cache size and organisation.
• Access time for a reference not found in the cache (a miss): depends on memory organisation.

Processor-Cache Interface
• Time to compute a real address from a virtual address (not-in-TLB time): depends on the address-translation facility.
From the cache's point of view, the processor behaviour that affects the design is:
1. The number of requests or references per cycle.
2. The physical word size, i.e. the transfer unit between CPU and cache.

Cache Organization
• The cache is organized as a directory, to locate a data item, and a data array, to hold the data items.
• A cache can be organized to fetch on demand or to prefetch data.
• Fetch on demand (the most common) brings a new line into the cache only when a processor reference is not found in the current cache contents (a cache miss).

Cache Organization
There are three basic types of cache organisation:
1. Direct mapped
2. Set-associative mapped
3. Fully associative mapped

Direct Mapped Cache
• In this organization each memory location is mapped to exactly one location in the cache.
• The cache directory consists of a number of lines (entries), with each line containing a number of contiguous words.
• The cache directory is indexed by the lower-order address bits, and the higher-order bits are stored as tag bits.
Direct Mapped Cache
The 24-bit real address is partitioned as follows for cache usage:
Tag (10 bits) | Index (8 bits) | W/L (3 bits) | B/W (3 bits)
• A 16 KB cache with a line of 8 words of 8 bytes each (64 B lines).
• Total 256 (16 KB / 64 B) lines or entries in the cache directory, so 8 index bits.
• Total 10 tag bits (the higher-order bits) to differentiate the various addresses mapping to the same line number.
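The partitioning above can be sketched in a few lines. A minimal sketch, assuming the slide's field widths (10 tag | 8 index | 3 W/L | 3 B/W) and treating the address as a plain 24-bit integer; the function name is illustrative:

```python
# Minimal sketch of the slide's 24-bit address split for the 16 KB
# direct-mapped cache (10 tag | 8 index | 3 W/L | 3 B/W bits).
def split_address(addr: int) -> dict:
    return {
        "byte": addr & 0x7,           # B/W: byte within the 8 B word
        "word": (addr >> 3) & 0x7,    # W/L: word within the 64 B line
        "index": (addr >> 6) & 0xFF,  # selects one of 256 directory lines
        "tag": (addr >> 14) & 0x3FF,  # 10 bits kept in the directory entry
    }

fields = split_address(0xABCDEF)
print(fields)  # {'byte': 7, 'word': 5, 'index': 55, 'tag': 687}
```

On a lookup, the index field selects the directory entry and the stored 10-bit tag is compared against the address tag to decide hit or miss.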
[Figure: Direct-mapped cache. The TLB-translated real address splits into 10 tag bits, 8 index bits, 3 W/L bits and 3 B/W bits; each directory entry carries tag, valid, dirty and reference bits; the data array is 2K × 8 B; a comparator matches the stored tag against the address tag and gates the selected word to the processor.]
Set Associative Cache
• The set-associative cache operates in a fashion similar to the direct-mapped cache.
• Here we have more than one choice of location for a line.
• If there are n such locations, the cache is said to be n-way set associative.
• Each line in memory maps to a unique set in the cache and can be placed in any element of that set.

Set Associative Cache
• This improves the hit rate, since a line may now lie in more than one location. Going from one-way to two-way decreases the miss rate by about 15%.
• The reference address bits are compared with all the entries in the set to find a match.
• If there is a hit, the matching sub-cache array is selected and gated out to the processor.

Set Associative Cache
Disadvantages:
1. Requires more comparators and stores more tag bits per block.
2. The additional compares and multiplexing increase the cache access time.
[Figure: Two-way set-associative cache. The address splits into 11 tag bits, 7 index bits, 3 W/L bits and 3 B/W bits; each of the two ways (labelled Set 1 and Set 2) has its own 1K × 8 B data array with valid, dirty and reference bits; the matching way's output is selected through a multiplexer.]
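The two-way lookup described above can be sketched as follows. This is a hedged illustration, not the slide's design: the 7-bit index and 64 B line follow the slide's figure, but the FIFO eviction within a set is a simplification chosen purely for brevity.

```python
# Sketch: two-way set-associative lookup (11 tag / 7 index bits as on
# the slide, 64 B lines). Eviction within a set is FIFO here purely
# for brevity; the slide does not fix an eviction policy.
NUM_SETS = 128          # 7 index bits
WAYS = 2

cache = [[] for _ in range(NUM_SETS)]   # each set: list of (tag, data)

def _split(addr):
    return (addr >> 13), (addr >> 6) & 0x7F   # tag, index

def lookup(addr):
    tag, index = _split(addr)
    for entry_tag, data in cache[index]:
        if entry_tag == tag:          # compare against every way in the set
            return data               # hit: gate this way's data out
    return None                       # miss

def fill(addr, data):
    tag, index = _split(addr)
    ways = cache[index]
    if len(ways) == WAYS:
        ways.pop(0)                   # evict one way (FIFO simplification)
    ways.append((tag, data))
```

Two lines whose addresses differ only above bit 13 land in the same set and coexist in its two ways, which is exactly the conflict case a direct-mapped cache cannot hold.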
Fully Associative Cache
• This is the extreme case of set-associative mapping.
• In this mapping a line can be stored in any of the directory entries.
• The referenced address is compared against all the entries in the directory (high hardware cost).
• If a match is found, the corresponding location is fetched and returned to the processor.
• Suitable for small caches only.
[Figure: Fully associative mapped cache. The address splits into 18 tag bits, 3 W/L bits and 3 B/W bits; the 18-bit tag is compared against every directory entry (each with valid, dirty and reference bits); the 2K × 8 B data array supplies the matching word to the processor.]
Write Policies
There are two strategies for updating memory on a write:
1. The write-through cache stores into both the cache and main memory on each write.
2. In a copy-back cache the write is done in the cache only and the dirty bit is set. The entire line is stored back to main memory on replacement (if the dirty bit is set).

Write Through
• A write is directed at both the cache and main memory on every CPU store.
• Advantage: a consistent image is maintained in main memory.
• Disadvantage: memory traffic increases, especially in the case of large caches.

Copy Back
• A write is directed only at the cache on a CPU store, and the dirty bit is set.
• The entire line is written back to main memory only when the line is replaced by another line.
• When a read miss occurs in the cache, the old line is simply discarded if its dirty bit is not set; otherwise the old line is first written out, and then the new line is accessed and written into the cache.

Write Allocate
• If a cache miss occurs on a store (write), the new line can be allocated in the cache and the store can then be performed in the cache.
• This "write allocate" policy is generally used with copy-back caches.
• Copy-back caches result in lower memory traffic with large caches.

No Write Allocate
• If a cache miss occurs on a store, the cache may be bypassed and the write performed in main memory only.
• This "no write allocate" policy is generally used with write-through caches.
So we have two common cache types:
CBWA – copy back, write allocate
WTNWA – write through, no write allocate
Line Replacement Strategies
• If the reference is not found in the directory, a cache miss occurs.
• This requires two actions to be taken promptly:
1. The line containing the missed reference must be fetched from main memory.
2. One of the current lines must be designated for replacement by the new line.

Fetching a Line
• In a write-through cache, the only concern is accessing the new line; the replaced line is simply discarded (written over).
• In a copy-back cache, we must first determine whether the line to be replaced is dirty. If the line is clean it can be written over; otherwise it must first be written back to memory. A write buffer can speed up this process.

Fetching a Line
• The access of the line can begin at the start of the line or at the faulted word. The second approach is called fetch bypass or wraparound load.
• This can minimize the miss-time penalty, as the CPU can resume processing while the rest of the line is being loaded into the cache. It can, however, cause contention for the cache, since both the CPU and memory may need the cache simultaneously.

Fetching a Line
• The potentially fastest approach is the non-blocking or prefetching cache. This cache has additional control hardware to handle a cache miss while the processor continues to execute.
• This strategy works when the miss is for data not immediately required by the processor, so it works well with compilers that provide adequate prefetching of lines in anticipation of use.
Line Replacement
There are three replacement policies that determine which line to replace on a cache miss:
1. Least Recently Used (LRU)
2. First In First Out (FIFO)
3. Random Replacement (Rand)
LRU is generally regarded as the ideal policy, as it corresponds closely to the concept of temporal locality, but it involves additional hardware control and complexity.
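The LRU policy can be sketched for a single cache set. A minimal illustration using an `OrderedDict` as the recency list (the class name and interface are illustrative, not from the slide; real hardware uses recency bits, not a dictionary):

```python
# Sketch: LRU replacement within one cache set. The OrderedDict keeps
# lines ordered by recency, most recently used at the end.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()          # tag -> line data

    def access(self, tag, fetch_line):
        if tag in self.lines:               # hit: refresh recency
            self.lines.move_to_end(tag)
            return self.lines[tag]
        if len(self.lines) == self.ways:    # miss in a full set:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = fetch_line()      # fetch the missing line
        return self.lines[tag]
```

For a 2-way set, accessing tags 1, 2, 1, 3 in order evicts tag 2: the re-reference to tag 1 made tag 2 the least recently used, which is exactly the temporal-locality behaviour the slide describes.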
Effects of Writes on Memory Traffic
An integrated cache with two references per instruction (one I-reference, one D-reference) has the following characteristics:
• Data references divided 68% reads, 32% writes
• 30% dirty lines
• 8 B physical word
• 64 B line
• 5% read miss rate
Compute the memory traffic at a 5% miss rate for both types of caches (WTNWA and CBWA).
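One plausible reading of these parameters gives the sketch below: the 5% miss rate is applied per reference, and traffic is counted in 8 B physical words. The exact conventions may differ from the lecturer's intended solution, so treat the numbers as illustrative.

```python
# Sketch of the traffic exercise, under stated assumptions:
# miss rate applies per reference; traffic counted in 8 B words.
refs_per_instr = 2.0            # 1 instruction ref + 1 data ref
data_frac_write = 0.32          # of the single data reference
miss_rate = 0.05
words_per_line = 64 // 8        # 64 B line / 8 B physical word = 8
dirty_frac = 0.30

# WTNWA: only reads (I-refs + D-reads) miss and fetch a line;
# every write goes through to memory as one word.
read_refs = 1.0 + (1.0 - data_frac_write)       # 1.68 reads/instruction
wtnwa = miss_rate * read_refs * words_per_line \
        + data_frac_write * 1                   # 0.672 + 0.32

# CBWA: every miss (read or write) allocates a line; 30% of the
# replaced lines are dirty and are written back whole.
cbwa = miss_rate * refs_per_instr * words_per_line * (1 + dirty_frac)

print(f"WTNWA: {wtnwa:.3f} words/instr, CBWA: {cbwa:.2f} words/instr")
```

Under these assumptions the two policies come out close (about 0.99 vs 1.04 words per instruction); the copy-back advantage grows as miss rates fall, because write-through traffic is tied to the write rate, not the miss rate.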
Warm Caches
• In a multiprogrammed environment, control passes back to a task that once resided in the cache.
• If a cache retains a significant portion of the working set from a previous execution, it is said to be a warm cache.
• A cache that has no history from prior executions is called a cold cache.

Common Types of Cache
• Integrated or unified cache
• Split I and D caches
• Sectored cache
• Two-level cache
• Write assembly cache

Split I & D Caches
• Separate instruction and data caches offer the possibility of significantly increased cache bandwidth (almost twice as much).
• This comes at the cost of a somewhat increased miss rate compared with a unified cache of the same size.
• The caches are not split equally: I-caches are not required to manage processor stores.
• Spatial locality is much higher in I-caches, so larger lines are more effective in I-caches than in D-caches.
[Figure: Split I and D caches. The processor sends I-reads to the I-cache and D-reads and D-writes to the D-cache; on a D-write the corresponding line is invalidated in the I-cache if found there.]
Split I and D Caches
• In some program environments, data parameters are placed directly in the program.
• A program location fetched into the I-cache may therefore also bring data along with it.
• When the operand parameter is then fetched into the D-cache, a duplicate line entry occurs.
• Two split-cache policies are possible for dealing with duplicate lines.

Duplicate Lines
• If a miss occurs on an I-reference, the line goes to the I-cache.
• If a miss occurs on a D-reference, the line goes to the D-cache.
• On a CPU store reference, check both directories:
  - use the write policy in the D-cache;
  - invalidate the line in the I-cache.

No Duplicate Lines
• If a miss occurs on an I-reference, the line goes to the I-cache; also check the D-cache and invalidate the line if present.
• If a miss occurs on a D-reference, the line goes to the D-cache; also check the I-cache and invalidate the line if present.
• On a CPU store there is no difference: check both directories as before.

On-Chip Caches
On-chip caches have two notable considerations:
• Due to pin limitations, the transfer path to and from memory is usually limited.
• The cache organisation must be optimized to make the best use of area. The area of the directory should be small, allowing maximum area for the data array. This implies a large block size (fewer entries) and a simply organised cache with fewer bits per directory entry.
On-Chip Caches
• The directory overhead falls as the line size grows and rises with the number of bits in each entry:
Cache utilization = b / (b + v/8)
where b = number of data bytes per line and v = line overhead (number of bits in the directory entry).
• Larger line sizes make more of the area available for the data array.
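The formula is easy to evaluate. A small sketch, using an illustrative 20-bit directory entry (the entry width is an assumption, not a figure from the slide):

```python
# Cache utilization = b / (b + v/8): the fraction of cache area that
# holds data rather than directory overhead. v/8 converts bits to bytes.
def utilization(b: int, v: int) -> float:
    return b / (b + v / 8)

# Illustrative 20-bit directory entry (assumed, not from the slide):
print(round(utilization(16, 20), 3))   # 16 B lines -> 0.865
print(round(utilization(64, 20), 3))   # 64 B lines -> 0.962
```

Quadrupling the line size here lifts utilization from about 86% to about 96%, which is the slide's point about larger lines favouring the data array.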
Sectored Cache
• Using large blocks, especially in small caches, increases the miss rate and especially increases the miss-time penalty (due to the large access time for large blocks).
• The solution is a sectored cache.
• In a sectored cache each line is broken into transfer units (one access between cache and memory).
• The directory is organised around the line size as usual.

Sectored Cache
• On a cache miss, the missed line is entered in the directory (address tags, etc.), but only the transfer unit required by the processor is brought into the data array.
• A valid bit indicates the status of each sub-line.
• If a subsequent access touches another sub-line of the newly loaded line, that sub-line is then brought into the data array.
• This, while maintaining temporal locality, greatly reduces the size of the directory.
Two Level Caches
• A first-level on-chip cache is supported by a larger (off- or on-chip) second-level cache.
• The two-level cache improves performance by effectively lowering the first-level cache's access time and miss penalty.
• A two-level cache system is termed inclusive if all the contents of the lower-level cache (L1) are also contained in the higher-level cache (L2).

Two Level Caches
• Second-level cache analysis is done using the principle of inclusion: a large second-level cache includes everything in the first-level cache. For the purpose of evaluating performance, the first-level cache can therefore be presumed not to exist, assuming the processor makes all its requests to the second-level cache.
• The line size, and indeed the overall size, of the second-level cache must be significantly larger than that of the first-level cache.

Two Level Caches
For a two-level cache system the following miss rates are defined:
1. Local miss rate: the number of misses experienced by the cache divided by the number of incoming references.
2. Global miss rate: the number of L2 misses divided by the number of references made by the processor.
3. Solo miss rate: the miss rate the cache would have if it were the only cache in the system.
The principle of inclusion implies that the global miss rate will be essentially the same as the solo miss rate.

Two Level Caches
True (logical) inclusion, where all the contents of L1 also reside in L2, has a number of requirements:
1. The L1 cache must be write through (L2 need not be).
2. Number of L2 sets >= number of L1 sets.
3. L2 associativity >= L1 associativity.
Cache size = line size × associativity × number of sets.

Two Level Caches
Example: a certain processor has a two-level cache. L1 is 4 KB, direct mapped, WTNWA. L2 is 8 KB, direct mapped, CBWA. Both have 16-byte lines with LRU replacement.
1. Is it always true that L2 includes all lines of L1?
2. If L2 is instead 8 KB, 4-way set associative, does L2 include all lines of L1?
3. If L1 is 4-way set associative (CBWA) and L2 is direct mapped, does L2 include all lines of L1?
Two Level Caches
1. L2 sets = 8 KB / (1 × 16) = 512; L1 sets = 4 KB / (1 × 16) = 256. So L2 sets > L1 sets and L2 associativity = L1 associativity; also L1 is WTNWA, so the answer is YES.
2. L2 sets = 8 KB / (4 × 16) = 128 < L1 sets (256), so the answer is NO.
3. L2 associativity < L1 associativity and L1 is CBWA, so the answer is NO.
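The set arithmetic in this answer follows directly from the relation cache size = line size × associativity × number of sets. A quick sketch:

```python
# Sets = cache_size / (associativity * line_size), per the slide's relation.
def num_sets(cache_bytes: int, assoc: int, line_bytes: int) -> int:
    return cache_bytes // (assoc * line_bytes)

l1_sets = num_sets(4 * 1024, 1, 16)      # 256
l2_direct = num_sets(8 * 1024, 1, 16)    # 512: >= 256, inclusion can hold
l2_4way = num_sets(8 * 1024, 4, 16)      # 128: <  256, inclusion fails
```

The 4-way L2 fails the "L2 sets >= L1 sets" requirement even though it is twice the capacity of L1: associativity reduced its set count below L1's.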
Two Level Caches
Example: Suppose we have a two-level cache with miss rates of 4% (L1) and 1% (L2), both counted per processor reference. The miss penalties are:
Miss in L1, hit in L2: 2 cycles.
Miss in L1, miss in L2: 7 cycles.
If the processor makes 1.5 references per instruction, compute the excess CPI due to cache misses.

Two Level Caches
Excess CPI due to L1 misses = 1.5 ref/inst × 0.04 miss/ref × 2 cycles/miss = 0.12 CPI.
Excess CPI due to L2 misses = 1.5 ref/inst × 0.01 miss/ref × 5 cycles/miss = 0.075 CPI.
(Note: the L2 miss penalty used is 5 cycles, not 7, since the 1% of references that miss in L2 have already been charged 2 cycles in the excess L1 CPI.)
Two Level Caches
Total effect = excess L1 CPI + excess L2 CPI = 0.12 + 0.075 = 0.195 CPI.
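The arithmetic above, restated as a short sketch:

```python
# Excess CPI from the two-level cache example on the slides.
refs_per_instr = 1.5
l1_miss_rate, l2_miss_rate = 0.04, 0.01    # per processor reference
l1_penalty = 2                             # miss L1, hit L2
l2_extra_penalty = 7 - l1_penalty          # 5: 2 cycles already charged

excess_l1 = refs_per_instr * l1_miss_rate * l1_penalty        # 0.12
excess_l2 = refs_per_instr * l2_miss_rate * l2_extra_penalty  # 0.075
print(round(excess_l1 + excess_l2, 3))     # 0.195
```

Charging each L2 miss only the incremental 5 cycles avoids double-counting the 2 cycles already included in the L1 term; the alternative bookkeeping (charging 7 cycles per L2 miss and 2 cycles only for L1-miss/L2-hit references) gives the same total.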
Write Assembly Cache • Write Assembly caches centralize pending memory writes in a single buffer, reducing resulting bus traffic. • The goal of write assembly cache is to assemble writes so that they can be transmitted to memory in an orderly way. • If a synchronizing event occurs as in case of multiple shared memory processors, the entire WAC should be transferred to memory to ensure consistency. • Temporal locality seems to play a more important role in case of write traffic than spatial locality Thus its advantageous to have more smaller lines. 10/20/2021 55
Virtual to Real Translation • Cache is accessed with real memory address obtained from TLB. • There at least three performance aspect that directly relate to V-R Translation. 1. TLB must be properly organised and sufficiently sized to reduce not in TLB faults which add extra cycles in program execution. 10/20/2021 56
Virtual to Real Translation 2. The TLB access must occur before the cache access, this extends cache access time. 3. Some addresses which are independent in virtual address space may collide in real address space, when they draw pages whose lower page address bits and upper cache address bits are identical. 10/20/2021 57
[Figure: TLB with two-way set associativity. The virtual address supplies a tag and an index; the index selects a set with two entries (R1/V1 and R2/V2); the virtual tag is compared against both, a not-in-TLB fault is raised if neither matches, and a multiplexer delivers the real high-order bits, which join the untranslated low-order bits.]
Overlapping the T Cycle in V-R Translation
To reduce the translation delay, the translation must be performed simultaneously with the data access in the cache array. This can be achieved by:
1. Using a high degree of set associativity, so that the directory index bits are not affected by the translation.
2. Using a virtual cache.
3. Using perfectly colored pages.
[Figure: Address translation and access. Sequential access: the virtual page number is translated first, then the cache tag compare and data access follow. Parallel access: the cache is indexed from the untranslated page offset while translation proceeds, and the tag compare uses the translated bits.]
Parallel Translation and Access
• In the virtual address, the lower 12 bits specify the page offset and the upper 20 bits specify the segment/page number.
• Only the upper bits need translation; the lower bits remain unchanged, as real and virtual pages are the same size (4 KB).
• In the case of an instruction fetch, the chances are high that the new address lies in the same page, so the new real address can be obtained by simply incrementing the old address.
[Figure: Instruction address translation. For the untranslated low bits, V = R; the new real address is the old real address plus the increment; on overflow out of the page, the TLB must be used.]
Parallel Translation and Access
• In the case of a data fetch, the chances of the new address lying outside the previous page are very high, so address translation is required.
• If the bits needed to address the cache line (the index bits) lie within the lower 12 bits (which do not require translation), the cache data can be accessed at the same time as the TLB access.
• The upper 12 bits from the TLB are used only in the last stage, for the tag compare.

Parallel Translation and Access
• Since increasing the set associativity decreases the number of sets, and hence the index bits needed to access the cache, some processors use a high degree of set associativity to allow parallel access to the cache and TLB.
• This approach works for small caches, but as caches get larger it rapidly becomes impractical.
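The constraint can be checked mechanically: parallel access works when the line-offset bits plus the index bits fit within the 12-bit page offset. A sketch with illustrative cache parameters (the specific sizes are assumptions, not from the slides):

```python
import math

# Parallel cache/TLB access is possible when every index bit falls in
# the untranslated page offset: line_bits + index_bits <= log2(page).
def index_fits_in_offset(cache_bytes, assoc, line_bytes, page_bytes=4096):
    sets = cache_bytes // (assoc * line_bytes)
    line_bits = int(math.log2(line_bytes))   # bits addressing within a line
    index_bits = int(math.log2(sets))        # bits selecting the set
    return line_bits + index_bits <= int(math.log2(page_bytes))

print(index_fits_in_offset(4 * 1024, 1, 16))    # small direct mapped: True
print(index_fits_in_offset(64 * 1024, 1, 16))   # large direct mapped: False
print(index_fits_in_offset(64 * 1024, 16, 16))  # 16-way, same size: True
```

This is the slide's trade-off in miniature: growing the cache direct mapped pushes index bits into the translated field, while raising associativity pulls them back under the page offset, at the hardware cost noted earlier.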
Virtual Caches
• A virtual cache stores addresses in virtual form.
• Only when a miss occurs and access to main memory is required is the concerned virtual address translated into a real address (using the TLB).
• Since different processes can use the same virtual address, it is important to prohibit multiple processes from occupying the same virtual cache at the same time.

Virtual Caches
There are two basic control strategies here:
1. Prohibit any process (apart from the OS) from cohabiting the cache with another user process.
2. Require a process ID number to be associated with each line in the cache directory. This uniquely determines the mapping from a virtual address to a real address.

Virtual Caches
• The first approach requires the cache to be purged when a new process enters. This is called flushing, and it is suited only to small caches.
• The other approach increases the size of the cache directory but otherwise gives reasonable performance.
Physically Addressed Caches Using Colored Pages
[Address layout: Tag | color bits (the upper part of the index) | remaining index bits | line offset]
• Part of the index bits may fall in the upper address bits, which require translation.
• We call these bits the color bits.

Physically Addressed Caches Using Colored Pages
• While loading a virtual page into a real page in memory, the memory manager uses a list of available pages (the free list).
• By maintaining several free lists, one for each color-bit combination, we can force the address bits required for cache access to have the same real bit values as the virtual address.
• The remaining upper virtual address bits must still be translated using the TLB, but this can be concurrent with the cache access.
An Example
A 128 KB cache has 64 B lines, an 8 B physical word, 4 KB pages, and is 4-way set associative. It uses CBWA and LRU replacement. The processor creates 30-bit (byte-addressed) virtual addresses that are translated into 24-bit (byte-addressed) real addresses (labelled A0–A23).
a) Which address bits are unaffected by translation?
b) Which address bits are used to address the cache directories?
c) Which address bits are compared to entries in the cache directory?
d) Which address bits are appended to the address bits in (b) to address the cache array?
An Example
a) The 12 lower-order bits, i.e. A0 to A11.
b) A 128 KB cache with 64 B lines has 2K lines. Since the cache is 4-way set associative, the effective number of sets is 2K/4, i.e. 512, and we require 9 bits to index 512 sets. So the address bits used to access the cache directory are A6 to A14 (bits A0 to A5 are used for the byte-in-word and word-in-line offsets).
c) The remaining upper address bits, i.e. A15 to A23.
d) Bits A3 to A5, giving the word offset within the line.
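The field widths in this answer can be checked with the same arithmetic used throughout the deck:

```python
import math

# Verify the worked example: 128 KB, 4-way, 64 B lines, 8 B words,
# 24-bit real address.
cache_bytes, line_bytes, assoc, word_bytes = 128 * 1024, 64, 4, 8

sets = cache_bytes // (assoc * line_bytes)                   # 512
index_bits = int(math.log2(sets))                            # 9 -> A6..A14
line_offset_bits = int(math.log2(line_bytes))                # 6 -> A0..A5
word_offset_bits = int(math.log2(line_bytes // word_bytes))  # 3 -> A3..A5
tag_bits = 24 - index_bits - line_offset_bits                # 9 -> A15..A23
```

Note also that with a 4 KB page the index reaches up to A14, above the 12-bit page offset: bits A12–A14 are the color bits of the preceding slides, so this cache needs either page coloring or sequential TLB-then-cache access.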