Memory Hierarchy Design

Memory Hierarchy Design
• Motivated by the programmer's desire for unlimited fast memory combined with economic considerations, and based on:
  – the principle of locality, and
  – the cost/performance ratio of memory technologies (fast memories are small; large memories are slow),
  the goal is a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
• Hierarchy: CPU/register file (RF), cache (C), main memory (MM), disk memory/I/O devices (DM)
• Speed in descending order: RF > C > MM > DM
• Capacity in ascending order: RF < C < MM < DM
Slide 1

Memory Hierarchy Design
• The gaps in speed and capacity between the levels keep widening:

Level / name                1 / RF                                  2 / C                          3 / MM            4 / DM
Typical size                < 1 KB                                  < 16 MB                        < 16 GB           > 100 GB
Implementation technology   Custom memory w/ multiple ports, CMOS   On-chip or off-chip CMOS SRAM  CMOS DRAM         Magnetic disk
Access time (ns)            0.25-0.5                                0.5-25                         80-250            5,000,000
Bandwidth (MB/s)            20,000-100,000                          5,000-10,000                   1,000-5,000       20-150
Managed by                  Compiler                                Hardware                       Operating system  OS/operator
Backed by                   Cache                                   Main memory                    Disk              CD or tape
Slide 2

Memory Hierarchy Design
• Cache performance review:
    Memory stall cycles = Number_of_misses * Miss_penalty
                        = IC * Misses_per_instr * Miss_penalty
                        = IC * MAPI * Miss_rate * Miss_penalty
  where IC is the instruction count and MAPI stands for memory accesses per instruction.
• Four Fundamental Memory Hierarchy Design Issues:
  1. Block placement: where can a block, the atomic unit of cache-memory transactions, be placed in the upper level?
  2. Block identification: how is a block found if it is in the upper level?
  3. Block replacement: which block should be replaced on a miss?
  4. Write strategy: what happens on a write?
Slide 3
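The stall-cycle formula from the cache performance review can be sketched in a few lines of Python. The function name and the workload numbers are illustrative assumptions, not values from the slides.

```python
def memory_stall_cycles(instruction_count, accesses_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles = IC * MAPI * Miss_rate * Miss_penalty."""
    misses_per_instr = accesses_per_instr * miss_rate
    return instruction_count * misses_per_instr * miss_penalty


# Hypothetical workload: 1M instructions, 1.5 memory accesses per instruction,
# 2% miss rate, 100-cycle miss penalty.
stalls = memory_stall_cycles(1_000_000, 1.5, 0.02, 100)
print(f"memory stall cycles: {stalls:,.0f}")
```

Halving either the miss rate or the miss penalty halves the result, which is why the later slides treat the two as separate optimization targets.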

Memory Hierarchy Design
1. Placement: three approaches
  1) Fully associative: any block in main memory can be placed in any block frame. This is flexible but expensive, because of the associative search hardware it requires.
  2) Direct mapped: each block in memory is placed in one fixed block frame, given by the mapping function
       (Block address) MOD (Number of blocks in cache)
  3) Set associative: a compromise between fully associative and direct mapped. The cache is divided into sets of block frames; each memory block is first mapped to a fixed set, within which it can be placed in any block frame. Mapping to a set, called bit selection, follows the function
       (Block address) MOD (Number of sets in cache)
Slide 4
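The three placement schemes differ only in how many sets the cache is divided into, which the following sketch makes explicit. It assumes, purely for illustration, that the frames of a set are numbered contiguously; `candidate_frames` is a hypothetical helper name.

```python
def candidate_frames(block_address, num_frames, associativity):
    """Return the block frames in which a memory block may be placed.

    associativity == 1          -> direct mapped
    associativity == num_frames -> fully associative
    anything in between         -> set associative
    """
    if associativity < 1 or num_frames % associativity != 0:
        raise ValueError("associativity must divide the number of frames")
    num_sets = num_frames // associativity
    set_index = block_address % num_sets  # bit selection
    first_frame = set_index * associativity  # assumed contiguous frame numbering
    return list(range(first_frame, first_frame + associativity))


# Memory block 12 in a cache with 8 block frames.
direct = candidate_frames(12, num_frames=8, associativity=1)
two_way = candidate_frames(12, num_frames=8, associativity=2)
fully = candidate_frames(12, num_frames=8, associativity=8)
print(direct, two_way, fully)
```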

Memory Hierarchy Design
2. Identification:
– Each block frame in the cache has an address tag indicating the block's address in memory
– All candidate tags are searched in parallel
– A valid bit is attached to the tag to indicate whether the block contains valid information
– An address A for a datum from the CPU is divided into a block address field and a block offset field:
  * block address = floor(A / block size)
  * block offset = A MOD (block size)
– The block address is further divided into tag and index:
  * the index selects the set in which the block may reside
  * the tag is compared to determine a hit or a miss
Slide 5
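The address decomposition above can be sketched as follows. The helper name and the sample address are assumptions chosen for illustration.

```python
def split_address(address, block_size, num_sets):
    """Split a byte address into (tag, index, block offset)."""
    block_address = address // block_size
    block_offset = address % block_size
    index = block_address % num_sets   # selects the set
    tag = block_address // num_sets    # compared against the stored tags
    return tag, index, block_offset


# Hypothetical address in a cache with 64-byte blocks and 512 sets.
tag, index, offset = split_address(0x12345, block_size=64, num_sets=512)
print(tag, index, offset)
```

When the block size and the number of sets are powers of two, the divisions and remainders reduce to selecting bit fields of the address.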

Memory Hierarchy Design
3. Replacement on a cache miss:
– The more choices there are for replacement, the more expensive the hardware; direct mapping is the simplest, since there is no choice to make
– Random vs. least-recently used (LRU): the former spreads allocation uniformly and is simple to build, while the latter can take advantage of temporal locality but can be expensive to implement (why?). First in, first out (FIFO) approximates LRU and is simpler than LRU

Data cache misses per 1000 instructions:

            Two-way                 Four-way                Eight-way
Size     LRU    Random  FIFO     LRU    Random  FIFO     LRU    Random  FIFO
16 KB    114.1  117.3   115.5    111.7  115.1   113.3    109.0  111.8   110.4
64 KB    103.4  104.3   103.9    102.4  102.3   103.1     99.7  100.5   100.3
256 KB    92.2   92.1    92.5
Slide 6
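To see how LRU and FIFO can diverge, here is a minimal sketch that counts misses within a single set. The trace is a made-up example, not the workload behind the table above.

```python
from collections import OrderedDict, deque


def count_misses_lru(trace, ways):
    """Misses in one set under least-recently-used replacement."""
    resident = OrderedDict()  # ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in resident:
            resident.move_to_end(block)       # refresh recency on a hit
        else:
            misses += 1
            if len(resident) == ways:
                resident.popitem(last=False)  # evict the least recently used block
            resident[block] = True
    return misses


def count_misses_fifo(trace, ways):
    """Misses in one set under first-in, first-out replacement."""
    resident = deque()  # ordered from oldest to newest arrival
    misses = 0
    for block in trace:
        if block not in resident:
            misses += 1
            if len(resident) == ways:
                resident.popleft()            # evict the oldest arrival, hits ignored
            resident.append(block)
    return misses


trace = [1, 2, 1, 3, 1, 4]  # block 1 is reused often (temporal locality)
lru_misses = count_misses_lru(trace, ways=2)
fifo_misses = count_misses_fifo(trace, ways=2)
print(lru_misses, fifo_misses)
```

FIFO evicts block 1 even though it was just used, because it tracks only arrival order; that is the sense in which it merely approximates LRU.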

Memory Hierarchy Design
4. Write strategies:
– Most cache accesses are reads: with 10% stores, 37% loads, and 100% instruction fetches, only about 7% of all memory accesses are writes (10 / 147).
– Optimize reads to make the common case fast, observing that the CPU does not have to wait for writes but must wait for reads. Fortunately, reads are easy in a direct-mapped cache: reading the block and comparing the tag can be done in parallel (what about associative mapping?). Writes are harder:
  a) tag checking and block writing cannot be overlapped, because writing is destructive;
  b) the CPU specifies the write size: only 1 to 8 bytes.
  Thus write strategies often distinguish cache designs.
– On a write hit:
  i. write through (or store through):
     * ensures consistency at the cost of memory and bus bandwidth
     * write stalls may be alleviated by using write buffers
  ii. write back (or store in):
     * minimizes memory and bus traffic at the cost of weakened consistency
     * uses a dirty bit to indicate modification
     * read misses may result in writes (why?)
– On a write miss:
  a) write allocate (fetch on write)
  b) no-write allocate (write around)
Slide 7
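The traffic difference between the two write-hit policies, and the reason a read miss can trigger a write, can be illustrated with a deliberately tiny model: a cache with a single block frame and write allocate on misses. Both simplifications are assumptions made for this sketch.

```python
def memory_writes(trace, policy):
    """Count writes sent to memory by a one-frame cache with write allocate.

    trace  -- list of (op, block) pairs, op is "R" or "W"
    policy -- "write-through" or "write-back"
    """
    if policy not in ("write-through", "write-back"):
        raise ValueError("unknown policy")
    cached_block = None
    dirty = False
    writes_to_memory = 0
    for op, block in trace:
        if block != cached_block:  # miss: the resident block is replaced
            if policy == "write-back" and dirty:
                writes_to_memory += 1  # dirty victim is written back, even on a read miss
            cached_block, dirty = block, False
        if op == "W":
            if policy == "write-through":
                writes_to_memory += 1  # every store goes to memory
            else:
                dirty = True           # only mark the block as modified
    return writes_to_memory


trace = [("W", 0), ("W", 0), ("W", 0), ("R", 1)]
through = memory_writes(trace, "write-through")
back = memory_writes(trace, "write-back")
print(through, back)
```

Under write back, the three stores to block 0 cost a single memory write, and that write happens when the read of block 1 evicts the dirty block.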

Memory Hierarchy Design
• An Example: The Alpha 21264 Data Cache
  – Cache size = 64 KB, block size = 64 B, two-way set associative, write back, write allocate on a write miss.
  – What is the index size? 2^index = 64 KB / (64 B * 2) = 2^16 / 2^(6+1) = 2^9, so the index is 9 bits.
Slide 8
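The index-size calculation generalizes to all three address fields. In the sketch below, the 44-bit address width passed in the usage line is an assumption for illustration; the slide states only the cache size, block size, and associativity.

```python
def field_widths(cache_bytes, block_bytes, associativity, address_bits):
    """Return (tag bits, index bits, block-offset bits) for a cache geometry."""
    num_sets = cache_bytes // (block_bytes * associativity)
    for value in (block_bytes, num_sets):
        if value < 1 or value & (value - 1):
            raise ValueError("block size and set count must be powers of two")
    offset_bits = block_bytes.bit_length() - 1  # log2(block size)
    index_bits = num_sets.bit_length() - 1      # log2(number of sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 64 KB, 64-byte blocks, two-way set associative; address width is assumed.
tag_bits, index_bits, offset_bits = field_widths(64 * 1024, 64, 2, address_bits=44)
print(tag_bits, index_bits, offset_bits)
```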

Memory Hierarchy Design
• Cache Performance
  – Memory access time is an indirect measure of performance; it is not a substitute for execution time.
Slide 9

Memory Hierarchy Design
• Example 1: How much does a cache help performance?
Slide 10

Memory Hierarchy Design
• Example 2: What is the relationship between AMAT and CPU time?
Slide 11

Memory Hierarchy Design
• Improving Cache Performance
  – The average memory access time (AMAT = hit time + miss rate * miss penalty) can be improved by reducing any of its three parameters:
    1. reducing the miss rate;
    2. reducing the miss penalty;
    3. reducing the hit time.
  – Four categories of cache organizations help reduce these parameters:
    1. Organizations that help reduce miss rate: larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations;
    2. Organizations that help reduce miss penalty: multilevel caches, critical word first, giving read misses priority over write misses, merging write buffers, and victim caches;
    3. Organizations that help reduce miss penalty or miss rate via parallelism: nonblocking caches, hardware prefetching, and compiler prefetching;
    4. Organizations that help reduce hit time: small and simple caches, avoiding address translation, pipelined cache access, and trace caches.
Slide 12
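A short sketch of the standard average-memory-access-time formula, AMAT = hit time + miss rate * miss penalty, shows how each of the three parameters contributes. The baseline numbers are hypothetical.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical baseline: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
baseline = amat(1, 0.05, 100)
lower_miss_rate = amat(1, 0.025, 100)
lower_miss_penalty = amat(1, 0.05, 50)
print(baseline, lower_miss_rate, lower_miss_penalty)
```

With these numbers the miss term dominates, so halving the miss rate or the miss penalty helps far more than shaving the hit time; with a much lower miss rate the balance shifts toward hit time.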

Memory Hierarchy Design
• Reducing Miss Rate
  – There are three kinds of cache misses, depending on the cause:
    1. Compulsory: the very first access to a block cannot be a hit, since the block must first be brought in from main memory. Also called cold-start misses;
    2. Capacity: the cache lacks the space to hold all the blocks needed during execution, so blocks are discarded and later retrieved;
    3. Conflict: the mapping confines blocks to a restricted area of the cache (e.g., direct mapped, set associative). Also called collision misses or interference misses.
  – While the 3-C characterization gives insight into causes, it is at times too simplistic (and the three categories are interdependent). For example, it ignores replacement policies.
Slide 13
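One common way to apply the 3-C model, sketched here under stated assumptions, is to run the same trace through the cache under study (direct mapped in this sketch) and through a fully associative LRU cache of equal size: a first reference is compulsory, a miss that the fully associative cache also takes is capacity, and the remainder are conflict misses.

```python
from collections import OrderedDict


def classify_misses(trace, num_frames):
    """Classify the misses of a direct-mapped cache using the 3-C model."""
    seen = set()          # blocks referenced at least once
    direct = {}           # frame index -> resident block
    full = OrderedDict()  # fully associative LRU cache with the same capacity
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        # Reference the fully associative LRU cache.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) == num_frames:
                full.popitem(last=False)
            full[block] = True
        # Reference the direct-mapped cache and classify any miss.
        frame = block % num_frames
        if direct.get(frame) != block:
            if block not in seen:
                counts["compulsory"] += 1
            elif not full_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
            direct[frame] = block
        seen.add(block)
    return counts


ping_pong = classify_misses([0, 4, 0, 4], num_frames=4)
overflow = classify_misses([0, 1, 2, 3, 4, 0], num_frames=4)
print(ping_pong, overflow)
```

The classification depends on the LRU policy chosen for the reference cache, which is one concrete instance of the slide's remark that the model ignores replacement policies.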

Memory Hierarchy Design
• Roles of the 3 Cs
Slide 14

Memory Hierarchy Design
• Roles of the 3 Cs
Slide 15

Memory Hierarchy Design
• First Miss Rate Reduction Technique: Larger Block Size
  – Takes advantage of spatial locality and so reduces compulsory misses
  – Increases miss penalty (it takes longer to fetch a larger block)
  – May increase conflict misses and/or capacity misses, since fewer blocks fit in the cache
  – Choosing a block size means striking a delicate balance between miss penalty (MP) and miss rate (MR) so as to minimize AMAT
Slide 16

Memory Hierarchy Design
• First Miss Rate Reduction Technique: Larger Block Size
  – Example: Find the optimal block size in terms of AMAT, given a miss penalty of 40 cycles of overhead plus 2 cycles per 16 bytes and the miss rates in the table below.
  – Solution:
  – High latency and high bandwidth encourage a large block size
  – Low latency and low bandwidth encourage a small block size
Slide 17
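The miss-penalty model in the example (40 cycles of overhead plus 2 cycles per 16 bytes) can be evaluated directly. The miss rates passed to `best_block_size` below are placeholders invented for this sketch, since the slide's table is not reproduced here, and the 1-cycle hit time is likewise an assumption.

```python
def miss_penalty(block_bytes, overhead_cycles=40, cycles_per_chunk=2, chunk_bytes=16):
    """Miss penalty in cycles: fixed overhead plus transfer time per 16-byte chunk."""
    chunks = -(-block_bytes // chunk_bytes)  # ceiling division
    return overhead_cycles + cycles_per_chunk * chunks


def best_block_size(miss_rates, hit_time=1):
    """Pick the block size with the lowest AMAT.

    miss_rates -- dict mapping block size in bytes to miss rate (caller-supplied)
    """
    amat = {size: hit_time + rate * miss_penalty(size)
            for size, rate in miss_rates.items()}
    return min(amat, key=amat.get), amat


penalties = {size: miss_penalty(size) for size in (16, 32, 64, 128, 256)}
print(penalties)

# Placeholder miss rates, NOT the values from the slide's table.
hypothetical_rates = {16: 0.10, 64: 0.05, 256: 0.06}
best, amat_by_size = best_block_size(hypothetical_rates)
print(best, amat_by_size)
```

The penalty grows slowly at first because the 40-cycle overhead dominates, which is why moderately larger blocks tend to pay off until the miss rate stops falling.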

Memory Hierarchy Design
• Second Miss Rate Reduction Technique: Larger Caches
  – An obvious way to reduce capacity misses.
  – Drawback: longer hit time and higher cost.
  – Popular for off-chip caches (second- and third-level caches).
• Third Miss Rate Reduction Technique: Higher Associativity
  – Miss-rate rules of thumb:
    i. 8-way associativity is almost as effective as full associativity;
    ii. the miss rate of a direct-mapped (1-way) cache of size N is almost equal to the miss rate of a 2-way cache of size 0.5 N;
    iii. the higher the associativity, the longer the hit time (why?)
  – A higher miss rate rewards higher associativity.
Slide 18

Memory Hierarchy Design
• Fourth Miss Rate Reduction Technique: Way Prediction and Pseudoassociative Caches
  – Way prediction selects one block among those in a set ahead of time, thus requiring only one tag comparison (on a correct prediction).
    * Preserves the advantages of direct mapping (why?);
    * On a misprediction, the other block(s) in the set are checked.
  – Pseudoassociative (also called column associative) caches
    * Operate exactly as direct-mapped caches on a hit, thus again preserving the advantages of direct mapping;
    * On a miss, another block is checked (as if in a set-associative cache), by simply inverting the most significant bit of the index field to find the other block in the “pseudoset”;
    * A hit in the other block (a pseudo-hit) takes longer than a regular hit: pseudo-hit time > regular hit time;
    * Too many pseudo-hits would defeat the purpose.
Slide 19
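The pseudoset lookup described above amounts to a second probe at the index with its most significant bit inverted. In this sketch each frame stores the full block address rather than a tag, a simplification assumed here so that the two probes cannot be confused.

```python
def pseudo_index(index, index_bits):
    """Index of the other frame in the pseudoset: invert the index's top bit."""
    return index ^ (1 << (index_bits - 1))


def lookup(frames, block_address, index_bits):
    """Return "hit", "pseudo-hit", or "miss" for a pseudoassociative cache.

    frames -- list of resident block addresses (or None), one per frame
    """
    index = block_address % (1 << index_bits)
    if frames[index] == block_address:
        return "hit"         # fast path, same as a direct-mapped cache
    if frames[pseudo_index(index, index_bits)] == block_address:
        return "pseudo-hit"  # found on the slower second probe
    return "miss"


INDEX_BITS = 3                # hypothetical 8-frame cache
frames = [None] * 8
frames[1] = 1                 # block 1 sits in its direct-mapped frame
frames[pseudo_index(1, INDEX_BITS)] = 9  # block 9 also maps to index 1 and sits in frame 5
results = [lookup(frames, b, INDEX_BITS) for b in (1, 9, 17)]
print(results)
```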

Memory Hierarchy Design
• Fifth Miss Rate Reduction Technique: Compiler Optimizations
Slide 20

Memory Hierarchy Design
• Fifth Miss Rate Reduction Technique: Compiler Optimizations
Slide 21

Memory Hierarchy Design
• Fifth Miss Rate Reduction Technique: Compiler Optimizations
Slide 22

Memory Hierarchy Design
• Fifth Miss Rate Reduction Technique: Compiler Optimizations
  IV. Blocking: improves temporal and spatial locality
    a) Multiple arrays are accessed in both ways (i.e., row-major and column-major); these orthogonal accesses cannot be helped by the earlier methods.
    b) Instead, concentrate on submatrices, or blocks.
    c) All N*N elements of Y and Z are accessed N times and each element of X is accessed once. Thus there are N^3 operations and 2N^3 + N^2 reads! Capacity misses are a function of N and the cache size in this case.
Slide 23

Memory Hierarchy Design
• Fifth Miss Rate Reduction Technique: Compiler Optimizations
    a) To ensure that the elements being accessed fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor.
    b) The total number of memory words accessed is 2N^3/B + N^2.
    c) Blocking exploits a combination of spatial (Y) and temporal (Z) locality.
Slide 24
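The loop structure of blocking can be sketched for the product X = Y * Z discussed above. Python lists will not reproduce cache behavior, so this is meant only to show how the loops are restructured around a blocking factor B; the function names are illustrative.

```python
def matmul_naive(y, z, n):
    """Reference N x N matrix product: every row of Y meets every column of Z."""
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = 0
            for k in range(n):
                r += y[i][k] * z[k][j]
            x[i][j] = r
    return x


def matmul_blocked(y, z, n, b):
    """Same product, computed one B x B submatrix of Z at a time."""
    x = [[0] * n for _ in range(n)]
    for jj in range(0, n, b):          # block column of Z and X
        for kk in range(0, n, b):      # block row of Z
            for i in range(n):
                for j in range(jj, min(jj + b, n)):
                    r = 0
                    for k in range(kk, min(kk + b, n)):
                        r += y[i][k] * z[k][j]
                    x[i][j] += r       # accumulate partial sums across kk blocks
    return x


n = 5
y = [[i + j for j in range(n)] for i in range(n)]
z = [[i * j + 1 for j in range(n)] for i in range(n)]
blocked = matmul_blocked(y, z, n, b=2)
naive = matmul_naive(y, z, n)
print(blocked == naive)
```

The inner two loops touch only a B x B piece of Z, which is reused for every row i before the computation moves on; that reuse is the temporal locality the slide attributes to Z.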

Memory Hierarchy Design
• First Miss Penalty Reduction Technique: Multilevel Caches
  a) To keep up with the widening gap between CPU and main memory, try to:
    i. make the cache faster, and
    ii. make the cache larger, by adding another, larger but slower cache between the cache and main memory.
Slide 25

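Memory Hierarchy Design
• First Miss Penalty Reduction Technique: Multilevel Caches
  b) Local miss rate vs. global miss rate:
    i. Local miss rate is defined as the number of misses in a cache divided by the number of memory accesses made to that cache.
    ii. Global miss rate is defined as the number of misses in a cache divided by the total number of memory accesses generated by the CPU.
Slide 26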
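The distinction between local and global miss rate is easiest to see with numbers. The sketch below uses the usual definitions (local: misses divided by accesses reaching that cache; global: misses divided by all CPU accesses); the access counts are hypothetical.

```python
def two_level_miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 miss rate, L2 local miss rate, L2 global miss rate)."""
    l1_rate = l1_misses / cpu_accesses    # L1 local and global rates coincide
    l2_local = l2_misses / l1_misses      # only L1 misses ever reach L2
    l2_global = l2_misses / cpu_accesses  # fraction of all accesses that go to memory
    return l1_rate, l2_local, l2_global


# Hypothetical counts: 1000 CPU accesses, 40 miss in L1, 20 of those also miss in L2.
l1_rate, l2_local, l2_global = two_level_miss_rates(1000, 40, 20)
print(l1_rate, l2_local, l2_global)
```

The L2 local miss rate looks alarming at 50%, but L2 sees only the accesses L1 could not serve; the global rate of 2% is the figure that determines how often main memory is reached.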

Memory Hierarchy Design
• Second Miss Penalty Reduction Technique: Critical Word First and Early Restart
  – The CPU needs just one word of the block at a time:
    * critical word first: fetch the required word first, and
    * early restart: as soon as the required word arrives, send it to the CPU.
• Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Write Misses
  – Serve reads before earlier writes have completed:
    * while write buffers improve write-through performance, they complicate memory accesses by potentially delaying updates to memory;
    * instead of waiting for the write buffer to empty before processing a read miss, the write buffer is checked for content that might satisfy the read miss;
    * in a write-back scheme, the dirty block being replaced is first copied to the write buffer instead of to memory, thus improving performance.
Slide 27
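Checking the write buffer on a read miss, rather than draining it first, can be sketched as follows. The buffer is modeled as a plain list of (address, value) pairs and memory as a dictionary; both are simplifications assumed for illustration.

```python
def load_on_read_miss(address, write_buffer, memory):
    """Serve a read miss, preferring pending writes over stale memory contents."""
    # Search newest entries first so the most recent pending store wins.
    for pending_address, value in reversed(write_buffer):
        if pending_address == address:
            return value
    return memory[address]  # no pending write: memory is up to date for this address


memory = {512: "stale", 1024: "other"}
write_buffer = [(512, "fresh")]  # a store to 512 has not reached memory yet
from_buffer = load_on_read_miss(512, write_buffer, memory)
from_memory = load_on_read_miss(1024, write_buffer, memory)
print(from_buffer, from_memory)
```

Without the check, the read of address 512 would either return the stale value or have to wait until the buffer drained.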

Memory Hierarchy Design
• Fourth Miss Penalty Reduction Technique: Merging Write Buffer
  – Improves the efficiency of write buffers, which are used by both write-through and write-back caches:
    1. Multiple single-word writes are combined into a single write buffer entry, which would otherwise be used only for a multi-word write.
    2. Reduces stalls due to the write buffer being full.
Slide 28
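The saving from write merging can be counted with a small sketch. It assumes, for illustration only, that each buffer entry holds four aligned 8-byte words and that nothing drains while the writes arrive.

```python
def buffer_entries_needed(write_addresses, merging, words_per_entry=4, word_bytes=8):
    """Number of write-buffer entries occupied by a burst of single-word writes."""
    if not merging:
        return len(write_addresses)  # one entry per write, mostly empty
    entry_bytes = words_per_entry * word_bytes
    # Writes that fall into the same aligned entry-sized region share one entry.
    return len({address // entry_bytes for address in write_addresses})


writes = [100, 108, 116, 124]  # four sequential 8-byte writes
without_merging = buffer_entries_needed(writes, merging=False)
with_merging = buffer_entries_needed(writes, merging=True)
print(without_merging, with_merging)
```

A four-entry buffer would be full after this burst without merging, but would still have three free entries with it.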

Memory Hierarchy Design
• Fifth Miss Penalty Reduction Technique: Victim Cache
  – Victim caches attempt to avoid the miss penalty on a miss by adding a small fully associative cache that holds recently discarded blocks (victims).
  – They have proven effective, especially for small direct-mapped (1-way) caches; e.g., a 4-entry victim cache removes 20%!
Slide 29
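A victim cache can be modeled as a small queue of discarded blocks that is probed before going to the next level. The FIFO victim replacement and the swap on a victim hit are assumptions of this sketch.

```python
from collections import deque


def misses_to_next_level(trace, num_frames, victim_entries):
    """Count fetches from the next level for a direct-mapped cache plus victim cache."""
    frames = {}                             # frame index -> resident block
    victims = deque(maxlen=victim_entries)  # oldest victim is dropped when full
    misses = 0
    for block in trace:
        frame = block % num_frames
        resident = frames.get(frame)
        if resident == block:
            continue                        # ordinary hit
        if block in victims:
            victims.remove(block)           # victim-cache hit: swap the two blocks
        else:
            misses += 1                     # block must come from the next level
        if resident is not None:
            victims.append(resident)        # the discarded block becomes a victim
        frames[frame] = block
    return misses


trace = [0, 4, 0, 4, 0, 4]                  # two blocks fighting over one frame
without_victim = misses_to_next_level(trace, num_frames=4, victim_entries=0)
with_victim = misses_to_next_level(trace, num_frames=4, victim_entries=1)
print(without_victim, with_victim)
```

Even a single victim entry absorbs the ping-pong conflict in this trace, leaving only the two compulsory misses.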

Memory Hierarchy Design
• Reducing Cache Miss Penalty or Miss Rate via Parallelism
  – Nonblocking caches (lockup-free caches)
  – Hardware prefetching of instructions and data
Slide 30

Memory Hierarchy Design
• Reducing Cache Miss Penalty or Miss Rate via Parallelism
  – Compiler-controlled prefetching: the compiler inserts prefetch instructions
Slide 31

Memory Hierarchy Design
• Reducing Cache Miss Penalty or Miss Rate via Parallelism
  – Compiler-Controlled Prefetching: An Example

    for (i := 0; i < 3; i := i+1)
      for (j := 0; j < 100; j := j+1)
        a[i][j] := b[j][0] * b[j+1][0];

    * 16-byte blocks, 8 KB direct-mapped (1-way) write-back cache, 8-byte elements. What kind of locality, if any, exists for a and b?
      a. Array a has 3 rows and 100 columns. It has spatial locality: even-indexed elements miss and odd-indexed elements hit, leading to 3*100/2 = 150 misses.
      b. Array b has 101 rows and 3 columns. It has no spatial locality, but it has temporal locality: the same element is used in consecutive j iterations, and the same elements are accessed again in each i iteration. There are 100 misses on b[j+1][0] for i = 0 and 1 miss on b[0][0] for j = 0, for a total of 101 misses.
    * Assume a large miss penalty (50 cycles), so that prefetches must be issued at least 7 iterations ahead. Splitting the loop into two, we have
Slide 32
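The 150 and 101 miss counts can be checked by counting the distinct cache blocks each array touches. This sketch assumes row-major layout, that each array starts on a block boundary, and that there are no conflict or capacity misses, so every block misses exactly once.

```python
BLOCK_BYTES = 16
ELEMENT_BYTES = 8


def blocks_touched(byte_offsets):
    """Number of distinct cache blocks covered by the given byte offsets."""
    return len({offset // BLOCK_BYTES for offset in byte_offsets})


# a is 3 x 100: every element a[i][j] is written once.
a_offsets = [(i * 100 + j) * ELEMENT_BYTES for i in range(3) for j in range(100)]

# b is 101 x 3: each j iteration reads b[j][0] and b[j+1][0], for every i.
b_offsets = [(row * 3) * ELEMENT_BYTES
             for _ in range(3) for j in range(100) for row in (j, j + 1)]

a_misses = blocks_touched(a_offsets)
b_misses = blocks_touched(b_offsets)
print(a_misses, b_misses, a_misses + b_misses)
```

Two elements of a share each 16-byte block, while consecutive rows of b are 24 bytes apart, so no two of the referenced b elements share a block.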

Memory Hierarchy Design
• Reducing Cache Miss Penalty or Miss Rate via Parallelism
  – Compiler-Controlled Prefetching: An Example (continued)

    for (j := 0; j < 100; j := j+1) {
      prefetch(b[j+7][0]);
      prefetch(a[0][j+7]);
      a[0][j] := b[j][0] * b[j+1][0];
    };
    for (i := 1; i < 3; i := i+1)
      for (j := 0; j < 100; j := j+1) {
        prefetch(a[i][j+7]);
        a[i][j] := b[j][0] * b[j+1][0];
      }

    * Assume that each iteration of the pre-split loop takes 7 cycles and that there are no conflict or capacity misses. The original loop then takes 7*300 + 251*50 = 14650 cycles (total iteration cycles plus total cache miss cycles), whereas the split loops take (1+1+7)*100 + (4+7)*50 + (1+7)*200 + (1+7)*50 = 3450 cycles. The four terms are: iteration cycles of the first loop, miss cycles of the first loop (11 misses), iteration cycles of the second loop, and miss cycles of the second loop (8 misses).
Slide 33

Memory Hierarchy Design
• Reducing Cache Miss Penalty or Miss Rate via Parallelism
  – Compiler-Controlled Prefetching: An Example (continued)
    * the first loop takes 9 cycles per iteration (because of the two prefetch instructions);
    * the second loop takes 8 cycles per iteration (because of the single prefetch instruction);
    * during the first 7 iterations of the first loop, array a incurs 4 cache misses
    * and array b incurs 7 cache misses;
    * during the first 7 iterations of the second loop, for i = 1 and i = 2, array a incurs 4 cache misses each;
    * array b does not incur any cache misses in the second loop!
Slide 34
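The cycle counts of the prefetching example can be re-derived from the per-iteration costs and miss counts listed above. The variable names are illustrative; the numbers are the ones given in the example.

```python
MISS_PENALTY = 50  # cycles per cache miss, as assumed in the example

# Original loop: 300 iterations at 7 cycles each, plus 251 misses (150 for a, 101 for b).
original_cycles = 7 * 300 + 251 * MISS_PENALTY

# First split loop: 100 iterations at 7 + 2 prefetch cycles; 4 misses for a and 7 for b.
first_loop_cycles = (1 + 1 + 7) * 100 + (4 + 7) * MISS_PENALTY

# Second split loop: 200 iterations at 7 + 1 prefetch cycles; 4 misses for a in each of i = 1, 2.
second_loop_cycles = (1 + 7) * 200 + (4 + 4) * MISS_PENALTY

split_cycles = first_loop_cycles + second_loop_cycles
speedup = original_cycles / split_cycles
print(original_cycles, split_cycles, round(speedup, 2))
```

The prefetch instructions add 500 cycles of issue overhead in total, which is small compared with the miss cycles they remove.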

Memory Hierarchy Design
• First Hit Time Reduction Technique: Small and simple caches
  – Smaller is faster:
    * small index, less address translation time
    * a small cache can fit on the same chip as the CPU
    * low associativity: in addition to a simpler and shorter tag check, a direct-mapped (1-way) cache allows the tag check to overlap with the transmission of data, which is not possible with any higher associativity!
• Second Hit Time Reduction Technique: Avoid address translation during indexing
  – Make the common case fast:
    * use virtual addresses for the cache, because most memory accesses (more than 90%) take place in the cache; the result is a virtual cache
    * there are at least three important performance aspects that directly relate to virtual-to-physical translation:
      + improperly organized or insufficiently sized TLBs may create excess not-in-TLB faults, adding to program execution time
Slide 35

Memory Hierarchy Design
• Second Hit Time Reduction Technique: Avoid address translation during indexing
  – Make the common case fast:
    * there are at least three important performance aspects that directly relate to virtual-to-physical translation:
      1) improperly organized or insufficiently sized TLBs may create excess not-in-TLB faults, adding to program execution time
      2) for a physical cache, the TLB access must occur before the cache access, extending the cache access time
      3) two line addresses (e.g., an I-line and a D-line address) may be independent of each other in the virtual address space yet collide in the real address space, when they draw pages whose lower page address bits (and upper cache address bits) are identical
    * problems with virtual caches:
      1) when a process is switched in or out, the entire cache has to be flushed, i.e., the problem of context switching (solution: a process identifier tag, PID)
      2) different virtual addresses may refer to the same physical address, i.e., the problem of synonyms/aliases (solution: hardware or software forcing uniqueness of addresses)
Slide 36

Memory Hierarchy Design
• Third Hit Time Reduction Technique: Pipelined cache writes
  – The solution is to overlap tag checking with the writing of data
• Fourth Hit Time Reduction Technique: Trace caches
  – Find a dynamic sequence of instructions, including taken branches, to load into a cache block:
    * put traces of the executed instructions into cache blocks, as determined by the CPU
    * branch prediction is folded into the cache and must be validated along with the addresses to have a valid fetch
    * disadvantage: the same instructions may be stored multiple times
Slide 37

Memory Hierarchy Design
• Main Memory and Organizations for Improving Performance
Slide 38

Memory Hierarchy Design
• Main Memory and Organizations for Improving Performance
Slide 39

Memory Hierarchy Design
• Main Memory and Organizations for Improving Performance
Slide 40

Memory Hierarchy Design
• Virtual Memory
Slide 41

Memory Hierarchy Design
• Virtual Memory
Slide 42

Memory Hierarchy Design
• Virtual Memory
Slide 43

Memory Hierarchy Design
• Virtual Memory
Slide 44

Memory Hierarchy Design
• Virtual Memory
Slide 45