Memory Hierarchy Reasons Virtual Memory Cache Memory Translation


Memory Hierarchy
– Reasons
– Virtual Memory
Cache Memory
Translation Lookaside Buffer
– Address translation
– Demand paging
Datorteknik Memory. Acceleration bild 1

Why Care About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Figure: performance (log scale, 1 to 1000) versus time, 1980 to 2000]
– µProc ("Moore's Law"): 60%/yr. (2X/1.5 yr)
– DRAM: 9%/yr. (2X/10 yrs)
– Processor-Memory Performance Gap: grows 50% / year

DRAMs over Time
(from Kazuhiro Sakashita, Mitsubishi)

DRAM Generation
1st Gen. Sample          '84     '87     '90     '93     '96     '99
Memory Size              1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
Die Size (mm²)           55      85      130     200     300     450
Memory Area (mm²)        30      47      72      110     165     250
Memory Cell Area (µm²)   28.84   11.1    4.26    1.64    0.61    0.23

Recap: Two Different Types of Locality
– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense:
– Good choice for presenting the user with a BIG memory system
SRAM is fast but expensive and not very dense:
– Good choice for providing the user FAST access time.

Memory Hierarchy of a Modern Computer
By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.

Level                                       Speed     Size (bytes)
Registers (processor control, datapath)     1 ns      100
On-Chip Cache                               X ns      64 K
Second Level Cache (SRAM)                   10 ns     K..M
Main Memory (DRAM)                          100 ns    M
Secondary Storage (Disk)                    10 ms     G
Tertiary Storage (Disk/Tape)                10 sec    T

Levels of the Memory Hierarchy
Upper levels are faster; lower levels are larger.

Level       Staging/Xfer unit   Managed by       Unit size
Registers   Instr. operands     prog             1-8 bytes
Cache       Blocks              cache cntl       8-128 bytes
Memory      Pages               OS               512-4K bytes
Disk        Files               user/operator    Mbytes
Tape

The Art of Memory System Design
Workload or benchmark programs run on the processor and produce a reference stream:
<op, addr>, ...    op: i-fetch, read, write
[Figure: Processor -> $ (cache) -> MEM (memory)]
Optimize the memory system organization to minimize the average memory access time for typical workloads.

Virtual Memory System Design
[Figure: reg -> cache -> mem -> disk; pages move between mem and disk]
– Size of information blocks that are transferred from secondary to main storage (M)
– If a block of information is brought into M and M is full, then some region of M must be released to make room for the new block --> replacement policy
– Which region of M is to hold the new block --> placement policy
– Missing item fetched from secondary memory only on the occurrence of a fault --> demand load policy
Paging Organization
– Virtual and physical address space partitioned into blocks of equal size (pages)

Address Map
V = {0, 1, ..., n - 1}   virtual address space
M = {0, 1, ..., m - 1}   physical address space
n > m
MAP: V --> M U {0}   address mapping function
MAP(a) = a'  if data at virtual address a is present at physical address a', with a' in M
MAP(a) = 0   if data at virtual address a is not present in M (missing item fault)
[Figure: the processor issues address a from name space V to the address translation mechanism. On success, physical address a' goes to main memory. On a result of 0, a missing item fault invokes the fault handler, and the OS performs the transfer from secondary memory to main memory.]

Paging Organization
A page is the unit of mapping and also the unit of transfer from virtual to physical memory.
[Figure: virtual memory is 32 pages of 1 K each (pages 0 to 31 at V.A. 0, 1024, ..., 31744); physical memory is 8 frames of 1 K each (frames 0 to 7 at P.A. 0, 1024, ..., 7168); Addr Trans MAP maps pages to frames.]
Address Mapping
– VA = page no. + 10-bit disp
– Page Table Base Reg + page no. (index into page table) selects a page table entry; the table is located in physical memory
– Page table entry: V, Access Rights, PA
– PA from the entry + disp gives the physical memory address (actually, concatenation is more likely)
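The address mapping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `page_table` contents and the name `translate` are hypothetical and not from the slides; only the 1 K page size and the 10-bit displacement come from the slide.

```python
PAGE_BITS = 10                  # 1 K pages, so the displacement is 10 bits
PAGE_SIZE = 1 << PAGE_BITS

# Hypothetical page table: key = virtual page number,
# value = (valid bit, physical frame number).
page_table = {0: (1, 7), 1: (0, None), 31: (1, 0)}

def translate(va):
    """Return the physical address for va, or None on a missing item fault."""
    page_no = va >> PAGE_BITS           # upper bits index the page table
    disp = va & (PAGE_SIZE - 1)         # lower 10 bits pass through unchanged
    valid, frame = page_table.get(page_no, (0, None))
    if not valid:
        return None                     # the OS fault handler would take over here
    return (frame << PAGE_BITS) | disp  # concatenation rather than addition
```

Because frames are aligned to the page size, concatenating the frame number with the displacement gives the same result as adding them, which is why hardware can skip the adder.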

Address Mapping
[Figure: the MIPS PIPELINE issues 32-bit virtual addresses for Instr and Data; CP0 maps each to a 24-bit physical address into User Memory. Kernel Memory holds Page Table 1, Page Table 2, ..., Page Table n.]
– User process 2 running: here we need page table 2 for address mapping

Translation Lookaside Buffer (TLB)
[Figure: MIPS PIPELINE -> 32-bit Virtual Address -> TLB in CP0 (entry fields: D, R, Physical Addr [23:10]) -> 24-bit physical address -> User Memory. Kernel Memory holds Page Table 1, Page Table 2, ..., Page Table n.]
– On TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware
– We never call the Kernel!

So Far, NO GOOD
[Figure: MIPS pipeline (IM, DE, EX, DM) with CP0 and TLB; 32-bit addresses in, 24-bit physical address out to RAM. Kernel Memory holds Page Table 1, Page Table 2, ..., Page Table n.]
– MIPS pipe is clocked at 50 MHz: critical path 20 ns
– TLB: 5 ns
– RAM: 60 ns
– But RAM needs 3 cycles to read/write: STALLS the pipe

Let's put in a Cache
[Figure: MIPS pipeline (IM, DE, EX, DM) with CP0, TLB and Cache in front of the RAM. Kernel Memory holds Page Table 1, Page Table 2, ..., Page Table n.]
– MIPS pipe is clocked at 50 MHz: critical path 20 ns
– TLB: 5 ns
– Cache: 15 ns (TLB + cache = 20 ns, which fits in one cycle)
– RAM: 60 ns
– A cache Hit never STALLS the pipe

Fully Associative Cache
24-bit PA: Tag = PA[23:2], byte within the data word = PA[1:0]
– Check all cache lines (all 2^16 lines)
– Cache Hit if PA[23:2] = TAG
– Each line holds a Tag (PA[23:2]) and a Data Word
– Size: 2^16 * 4 = 256 KB

Fully Associative Cache
Very good hit ratio (nr hits / nr accesses)
But!
– Too expensive checking all 2^16 cache lines concurrently
– A comparator for each line! A lot of hardware

Direct Mapped Cache
24-bit PA: Tag = PA[23:18], line index = PA[17:2], byte within the data word = PA[1:0]
– PA[17:2] selects ONE cache line (1 line checked)
– Cache Hit if PA[23:18] = TAG
– Each line holds a Tag (PA[23:18]) and a Data Word
– Size: 2^16 * 4 = 256 KB

Direct Mapped Cache
Not so good hit ratio
– Each line can hold only certain addresses, less freedom
But! Much cheaper to implement, only one line checked
– Only one comparator

Set Associative Cache
24-bit PA: Tag = PA[23:18-z], set index = PA[17-z:2], byte within the data word = PA[1:0]
– The set index selects ONE set of lines, size 2^z
– Cache Hit if PA[23:18-z] = TAG for some line in the set
– Each line holds a Tag (PA[23:18-z]) and a Data Word
– 2^z lines per set: 2^z-way set associative
– Size: 2^16 * 4 = 256 KB

Set Associative Cache
Quite good hit ratio
– The number (set) of different addresses for each line is greater than that of a direct mapped cache
The larger z, the better the hit ratio, but more expensive
– 2^z comparators
– Cost-performance tradeoff
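The three organizations differ only in where the tag/index boundary falls. A minimal Python sketch, assuming the 24-bit PA, 4-byte data word and 2^16 lines used in these slides (the function name `split_pa` is an illustrative choice, not from the deck):

```python
BYTE_BITS = 2      # PA[1:0] selects the byte within a 4-byte data word
LINE_BITS = 16     # 2**16 lines in total

def split_pa(pa, z):
    """Split a 24-bit PA into (tag, set index, byte) for a 2**z-way cache."""
    index_bits = LINE_BITS - z                         # fewer sets as z grows
    byte = pa & ((1 << BYTE_BITS) - 1)                 # PA[1:0]
    set_index = (pa >> BYTE_BITS) & ((1 << index_bits) - 1)  # PA[17-z:2]
    tag = pa >> (BYTE_BITS + index_bits)               # PA[23:18-z]
    return tag, set_index, byte
```

With z = 0 this is the direct mapped split (6-bit tag, 16-bit index); with z = 16 the index disappears and the whole of PA[23:2] is the tag, which is the fully associative case.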

Cache Miss
A Cache Miss should be handled by the hardware
– If handled by the OS it would be very slow (>> 60 ns)
On a Cache Miss
– Stall the pipe
– Read in new data to cache
– Release the pipe, now we get a Cache Hit

A Summary on Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block
– "Cold" fact of life: not a whole lot you can do about it
– Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant
Conflict (collision):
– Multiple memory locations mapped to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
Invalidation: other process (e.g., I/O) updates memory

Example: 1 KB Direct Mapped Cache with 32 Byte Blocks
For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)

Address field   Bits     Example
Cache Tag       31-10    0x50
Cache Index     9-5      0x01
Byte Select     4-0      0x00

Index   Valid Bit   Cache Tag   Cache Data
0                               Byte 31 ... Byte 0
1                   0x50        Byte 63 ... Byte 32
2
3
:       :           :           :
31                              Byte 1023 ... Byte 992

The Valid Bit and Cache Tag are stored as part of the cache "state".
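The field split for this example can be checked with a short Python sketch. The function name `split_address` and the sample address 0x14020 are illustrative assumptions; the address was chosen so that it decodes to the example values on the slide (tag 0x50, index 0x01, byte select 0x00).

```python
def split_address(addr, cache_bits=10, block_bits=5):
    """Split a 32-bit address for a 2**cache_bits byte direct mapped cache
    with 2**block_bits byte blocks. Returns (tag, index, byte_select)."""
    index_bits = cache_bits - block_bits
    tag = addr >> cache_bits                            # uppermost 32 - N bits
    index = (addr >> block_bits) & ((1 << index_bits) - 1)
    byte_select = addr & ((1 << block_bits) - 1)        # lowest M bits
    return tag, index, byte_select

tag, index, byte_select = split_address(0x14020)
```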

Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT:
– Larger block size means larger miss penalty: it takes longer to fill up the block
– If block size is too big relative to cache size, miss rate will go up: too few cache blocks
In general, Average Access Time:
Time_av = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: three plots against Block Size. Miss Penalty rises with block size. Miss Rate first falls (exploits spatial locality), then rises (fewer blocks: compromises temporal locality). Average Access Time falls, then rises (increased miss penalty & miss rate).]
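The average access time formula is easy to evaluate directly. In the sketch below, the 20 ns hit time and 60 ns miss penalty are borrowed from the pipeline and RAM figures earlier in the deck, while the 5% miss rate is purely an assumed value for illustration.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Time_av = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate."""
    return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate

# Assumed numbers: 20 ns hit, 60 ns miss penalty, 5% miss rate.
example_ns = average_access_time(hit_time=20.0, miss_rate=0.05, miss_penalty=60.0)
```

With these assumed numbers the result is about 22 ns, so even a small miss rate adds a visible fraction to the 20 ns hit time.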

Extreme Example: single big line
[Figure: one cache entry (index 0) with Valid Bit, Cache Tag and Cache Data: Byte 3, Byte 2, Byte 1, Byte 0]
Cache Size = 4 bytes, Block Size = 4 bytes
– Only ONE entry in the cache
If an item is accessed, it is likely that it will be accessed again soon
– But it is unlikely that it will be accessed again immediately!!!
– The next access will likely be a miss again
Continually loading data into the cache but discarding (forcing out) them before they are used again
– Worst nightmare of a cache designer: Ping Pong Effect
Conflict Misses are misses caused by:
– Different memory locations mapped to the same cache index
Solution 1: make the cache size bigger
Solution 2: multiple entries for the same Cache Index
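The ping pong effect and "Solution 2" can be reproduced with a tiny cache model. This is a sketch, not a model of any real cache: the function `count_misses`, the LRU policy inside each set, and the example trace are all assumptions made for illustration.

```python
from collections import OrderedDict

def count_misses(addresses, num_sets, ways, block_size=4):
    """Count misses for a cache with num_sets sets of `ways` lines each,
    using least-recently-used replacement within a set."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_size
        lines = sets[block % num_sets]      # cache index selects the set
        tag = block // num_sets
        if tag in lines:
            lines.move_to_end(tag)          # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used line
            lines[tag] = True
    return misses

# Two addresses that map to the same cache index, accessed alternately.
trace = [0, 64] * 3
direct_mapped_misses = count_misses(trace, num_sets=16, ways=1)
two_way_misses = count_misses(trace, num_sets=8, ways=2)
```

Both configurations hold 16 lines in total. In the direct mapped one every access evicts the other address, so all six accesses miss; with two entries per index only the two compulsory misses remain.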

Hierarchy
Small, fast and expensive VS slow, big and inexpensive
– Cache (I and D): 256 KB
– RAM: 16 MB
– HD: 2 GB
The cache contains copies
– What if copies are changed? INCONSISTENCY!

Cache Miss, Write Through/Back
To avoid INCONSISTENCY we can:
Write Through
– Always write data to RAM
– Not so good performance (write 60 ns)
– Therefore, WT is always combined with write buffers so that we don't wait for lower level memory
Write Back
– Write data to memory only when the cache line is replaced
– We need a Dirty bit (D) for each cache line
– D-bit set by hardware on write operation
– Much better performance, but more complex hardware

Write Buffer for Write Through
[Figure: Processor -> Cache; Processor -> Write Buffer -> DRAM]
A Write Buffer is needed between the Cache and Memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes contents of the buffer to memory
Write buffer is just a FIFO:
– Typical number of entries: 4
– Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
– Store frequency (w.r.t. time) -> 1 / DRAM write cycle
– Write buffer saturation

Write Buffer Saturation
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
– If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
  Store buffer will overflow no matter how big you make it
  The CPU Cycle Time <= DRAM Write Cycle Time
Solution for write buffer saturation:
– Use a write back cache
– Install a second level (L2) cache:
[Figure: Processor -> Cache -> L2 Cache -> DRAM, with the Write Buffer between the Cache and the L2 Cache]
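Saturation can be seen in a toy cycle-by-cycle simulation. Everything here is an assumption for illustration: the function `count_store_stalls`, the timing model (one store issued every `store_interval` cycles, one DRAM write taking `dram_write_cycles` cycles), and the simplification that a store finding the buffer full is only counted, not retried. The default depth of 4 follows the "typical number of entries" on the previous slide.

```python
from collections import deque

def count_store_stalls(store_interval, dram_write_cycles, depth=4, total_cycles=1000):
    """Count stores that find the FIFO write buffer full."""
    buffer = deque()   # pending stores, oldest first
    busy = 0           # cycles left on the DRAM write in progress
    stalls = 0
    for cycle in range(total_cycles):
        # Memory controller: finish the current write, then start the next one.
        if busy > 0:
            busy -= 1
            if busy == 0:
                buffer.popleft()
        if busy == 0 and buffer:
            busy = dram_write_cycles
        # Processor: issue one store every store_interval cycles.
        if cycle % store_interval == 0:
            if len(buffer) < depth:
                buffer.append(cycle)
            else:
                stalls += 1      # buffer saturated: the pipe would stall here
    return stalls

rare_stores = count_store_stalls(store_interval=10, dram_write_cycles=3)
back_to_back = count_store_stalls(store_interval=1, dram_write_cycles=3)
deeper_buffer = count_store_stalls(store_interval=1, dram_write_cycles=3, depth=64)
```

When stores are rare the buffer drains between them and nothing stalls. When stores arrive faster than DRAM can retire them, a deeper buffer only delays the first stall, matching the slide's "no matter how big you make it".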

Replacement Strategy in Hardware
A direct mapped cache selects ONE cache line
– No replacement strategy
A set/fully associative cache selects a set of lines. Strategy to select one cache line:
– Random, Round Robin: not so good, spoils the idea of an associative cache
– Least Recently Used (move-to-top strategy): good, but complex and costly for large z
– We could use an approximation (heuristic): Not Recently Used (replace if not used for a certain time)

Sequential RAM
Accessing sequential words from RAM is faster than accessing RAM randomly
– Only lower address bits will change
How could we exploit this?
– Let each Cache Line hold an array of data words
– Give the base address and array size
– "Burst Read" the array from RAM to Cache
– "Burst Write" the array from Cache to RAM

System Startup, RESET
Random cache contents
– We might read incorrect values from the Cache
We need to know if the contents are valid: a V-bit for each cache line
– Let the hardware clear all V-bits on RESET
– Set the V-bit and clear the D-bit for the line copied from RAM to Cache

Final Cache Model
24-bit PA: Tag = PA[23:18-z], set index = PA[17-z:2+j], position within the line = PA[1+j:0]
– The set index selects ONE set of lines, size 2^z
– Cache Hit if (PA[23:18-z] = TAG) and V, for some line in the set
– Set D bit if Write
– Each line holds V, D, Tag (PA[23:18-z]) and Data Words ...
– 2^z lines per set

Translation Lookaside Buffer (TLB)
[Figure: MIPS PIPELINE -> 32-bit Virtual Address -> TLB in CP0 (entry fields: D, R, Physical Addr [23:10]) -> 24-bit physical address -> User Memory. Kernel Memory holds Page Table 1, Page Table 2, ..., Page Table n.]
– On TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware
– We never call the Kernel!

Virtual Address and a Cache
[Figure: CPU -> (VA) -> Translation -> (PA) -> Cache; on a hit, data returns to the CPU; on a miss, the access goes to Main Memory]
– It takes an extra memory access to translate VA to PA
– This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible
ASIDE: Why access cache with PA at all? VA caches have a problem!
– Synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
– For update: must update all cache entries with the same physical address, or memory becomes inconsistent
– Determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or
– Software-enforced alias boundary: same lsb of VA & PA > cache size

Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
TLBs are usually small, typically not more than 128 - 256 entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
Translation with a TLB
[Figure: CPU -> (VA) -> TLB Lookup; on a TLB hit, the PA goes to the Cache; on a TLB miss, Translation uses the OS page table. On a cache hit, data returns to the CPU; on a cache miss, the access goes to Main Memory.]

Reducing Translation Time
Machines with TLBs go one step further to reduce cycles/cache access
– They overlap the cache access with the TLB access
– Works because high order bits of the VA are used to look in the TLB while low order bits are used as index into the cache

Overlapped Cache & TLB Access
[Figure: 32-bit VA = 20-bit page # + 12-bit disp. The page # goes to the TLB (assoc lookup), giving PA and Hit/Miss. The disp supplies a 10-bit cache index plus 2 byte-offset bits (00) to a cache of 1 K lines of 4 bytes, giving PA tag, Data and Hit/Miss. The cache tag is compared with the PA from the TLB.]
IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache miss OR (cache tag ≠ PA)] AND TLB hit THEN access memory with the PA from the TLB
ELSE do standard VA translation

Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation.
This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache.
Example: suppose everything is the same except that the cache is increased to 8 K bytes instead of 4 K:
[Figure: VA = 20-bit virt page # + 12-bit disp; the cache index is now 11 bits plus 2 byte-offset bits (00). The top index bit is changed by VA translation, but is needed for cache lookup.]
Solutions:
– go to 8 K byte page sizes;
– go to a 2-way set associative cache (two banks of 1 K lines of 4 bytes, 10-bit index); or
– SW guarantee VA[13]=PA[13]
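The constraint can be stated as a one-line check: the bits used to index the cache (index plus byte offset) must all fall inside the page displacement, which is not translated. A minimal sketch, where the function name `overlap_ok` is an illustrative assumption and the sizes come from the example above:

```python
def overlap_ok(cache_bytes, ways, page_bytes):
    """True if the cache can be indexed using only untranslated
    page-displacement bits, so cache and TLB lookups can overlap."""
    bytes_per_way = cache_bytes // ways   # address range covered by the index bits
    return bytes_per_way <= page_bytes

original = overlap_ok(4 * 1024, ways=1, page_bytes=4 * 1024)      # 4 K cache, 4 K pages
bigger_cache = overlap_ok(8 * 1024, ways=1, page_bytes=4 * 1024)  # 8 K cache breaks it
bigger_pages = overlap_ok(8 * 1024, ways=1, page_bytes=8 * 1024)  # solution: 8 K pages
two_way = overlap_ok(8 * 1024, ways=2, page_bytes=4 * 1024)       # solution: 2-way
```

The third solution on the slide (software guaranteeing that the extra bit is the same in VA and PA) keeps the hardware unchanged and is therefore not captured by this check.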

Startup a User process
Allocate Stack pages, make a Page Table (in Kernel Memory):
– Set Instruction (I), Global Data (D) and Stack pages (S)
– Clear Resident (R) and Dirty (D) bits
Clear V-bits in TLB
Place pages on Hard Disk
[Figure: TLB with all V-bits cleared; Page Table with D and R bits cleared for the I, D and S pages; pages stored on the hard disk]

Demand Paging
IM Stage: We get a TLB Miss and Page Fault (page 0 not resident)
– Page Table (Kernel memory) holds HD address for page 0 (P0)
– Read page to RAM page X, update PA[23:10] in Page Table
– Update TLB: set V, clear D, Page #, PA[23:10]
Restart failing instruction: TLB hit!
[Figure: TLB entry fields are V, 22-bit Page #, D, Physical Addr PA[23:10]. Entry: V = 1, Page # = 00...0, D = 0, PA[23:10] = XX...X (I). RAM holds Page 0 (I) at address XX...X00..0; P0 is on the HD.]

Demand Paging
DM Stage: We get a TLB Miss and Page Fault (page 3 not resident)
– Page Table (Kernel memory) holds HD address for page 3 (P3)
– Read page to RAM page Y, update PA[23:10] in Page Table
– Update TLB: set V, clear D, Page #, PA[23:10]
Restart failing instruction: TLB hit!
[Figure: TLB entry fields are V, 22-bit Page #, D, Physical Addr PA[23:10]. Entry 1: V = 1, Page # = 00...0, D = 0, PA[23:10] = XX...X (I). Entry 2: V = 1, Page # = 00...11, D = 0, PA[23:10] = YY...Y (D). RAM holds Page 0 (I) and Page 3 (D), the latter at address YY...Y00..0; P0, P1, P2 and P3 are on the HD.]
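The two walkthroughs follow the same steps, which can be sketched as below. This is a simplified model under stated assumptions: the names `Process` and `access` are hypothetical, the disk read is reduced to taking a frame number from a free list, the TLB is unbounded, and page replacement when RAM is full is not modelled.

```python
PAGE_BITS = 10   # 1 K pages, so PA[23:10] is the frame number

class Process:
    def __init__(self, num_pages):
        # At startup the Resident (R) and Dirty (D) bits are cleared.
        self.page_table = [{"R": 0, "D": 0, "frame": None} for _ in range(num_pages)]
        self.tlb = {}            # page number -> frame; only valid entries are stored
        self.page_faults = 0

def access(proc, va, free_frames):
    """Translate va, handling a TLB miss and a page fault if needed."""
    page_no = va >> PAGE_BITS
    disp = va & ((1 << PAGE_BITS) - 1)
    if page_no not in proc.tlb:                  # TLB miss
        entry = proc.page_table[page_no]
        if not entry["R"]:                       # page fault: page not resident
            proc.page_faults += 1
            entry["frame"] = free_frames.pop(0)  # stands in for reading the page from HD
            entry["R"] = 1
        proc.tlb[page_no] = entry["frame"]       # update TLB (set V, clear D)
    # Restarting the failing access now gives a TLB hit.
    return (proc.tlb[page_no] << PAGE_BITS) | disp

proc = Process(num_pages=4)
frames = [9, 12]                                  # hypothetical free RAM pages X and Y
first_fetch = access(proc, 0, frames)             # page 0: fault, loaded into frame 9
next_fetch = access(proc, 4, frames)              # same page: TLB hit, no fault
data_access = access(proc, 3 * 1024 + 8, frames)  # page 3: fault, loaded into frame 12
```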

Spatial and Temporal Locality
Spatial Locality
– Now the TLB holds the page translation; 1024 bytes, 256 instructions
– The next instruction (PC+4) will cause a TLB Hit
– Access a data array, e.g., 0($t0), 4($t0) etc.
Temporal Locality
– The TLB holds the translation
– Branch within the same page, access the same instruction address
– Access the array again, e.g., 0($t0), 4($t0) etc.
THIS IS THE ONLY REASON A SMALL TLB WORKS

Replacement Strategy
If the TLB is full, the OS selects the TLB line to replace
– Any line will do, they are the same and concurrently checked
Strategy to select one:
– Random: not so good
– Round Robin: not so good, about the same as random
– Least Recently Used (move-to-top strategy): much better (the best we can do without knowing or predicting page access). Based on temporal locality
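A move-to-top LRU policy maps naturally onto an ordered dictionary. The class below is a sketch, not the deck's implementation: the name `TLB` and its methods are hypothetical, and the default of 64 entries is taken from the "TLB 64 Lines" figure on a later slide.

```python
from collections import OrderedDict

class TLB:
    def __init__(self, entries=64):
        self.entries = entries
        self.lines = OrderedDict()   # page number -> frame, least recently used first

    def lookup(self, page_no):
        """Return the frame on a hit (and move the line to the top), else None."""
        if page_no not in self.lines:
            return None
        self.lines.move_to_end(page_no)
        return self.lines[page_no]

    def insert(self, page_no, frame):
        """Add a translation, replacing the least recently used line if full."""
        if page_no in self.lines:
            self.lines.move_to_end(page_no)
        elif len(self.lines) >= self.entries:
            self.lines.popitem(last=False)
        self.lines[page_no] = frame

tlb = TLB(entries=2)
tlb.insert(0, 5)
tlb.insert(3, 6)
tlb.lookup(0)        # page 0 becomes the most recently used
tlb.insert(1, 7)     # TLB full: page 3 is the least recently used and is replaced
```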

Hierarchy
Small, fast and expensive VS slow, big and inexpensive
– TLB: 64 lines
– Page Table (Kernel Memory): >> 64 entries
– RAM: 256 MB
– HD: 32 GB
TLB/RAM contains copies
– What if copies are changed? INCONSISTENCY!

Inconsistency
Replace a TLB entry, caused by TLB Miss
– If the old TLB entry is dirty (D-bit), we update the Page Table (Kernel memory)
Replace a page in RAM (swapping), caused by Page Fault
If the old page is in the TLB
– Check the old page's TLB D-bit; if dirty, write the page to HD
– Clear the TLB V-bit and the Page Table R-bit (now not resident)
If the old page is not in the TLB
– Check the old page's Page Table D-bit; if dirty, write the page to HD
– Clear the Page Table R-bit (page not resident any more)

Current Working Set
If RAM is full, the OS selects a page to replace on a Page Fault
– Note: the RAM is shared by many user processes
– Least Recently Used (move-to-top strategy): much better (the best we can do without knowing or predicting page access)
Swapping is VERY expensive (maybe > 100 ms)
– Why not try harder to keep the pages needed (the working set) in RAM, using advanced memory paging algorithms?
Current working set of process P: {p0, p3, ...}, the set of pages used during the interval Δt ending at t_now
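One simple way to picture the working set is as the distinct pages in the last few references of a process. The sketch below measures the interval in number of references rather than in time, which is an assumption made for simplicity; the function name `working_set` and the trace are hypothetical.

```python
def working_set(page_trace, window):
    """Return the set of pages referenced in the last `window` references."""
    start = max(0, len(page_trace) - window)
    return set(page_trace[start:])

# Hypothetical page reference trace for process P, oldest reference first.
trace = [0, 3, 1, 0, 3, 0, 3, 3]
recent = working_set(trace, window=4)
whole = working_set(trace, window=8)
```

A paging algorithm that tries to keep `recent` resident would hold on to pages 0 and 3 here and could let page 1 go.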

Thrashing
[Figure: probability of page fault (0 to 1) plotted against the fraction of the working set that is not resident (0 to 1); the region where the probability approaches 1 is labelled Thrashing]
– Thrashing: no useful work done!
– This we want to avoid

Summary: Cache, TLB, Virtual Memory
Caches, TLBs and Virtual Memory are all understood by examining how they deal with 4 questions:
– Where can a page be placed?
– How is a page found?
– What page is replaced on miss?
– How are writes handled?
Page tables map virtual address to physical address
TLBs are important for fast translation
TLB misses are significant in processor performance:
– (some systems can't access all of 2nd level cache without TLB misses!)

Summary: Memory Hierarchy
Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
– 1000X DRAM growth removed the controversy
Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection is more important than memory space increase
Today CPU time is a function of (ops, cache misses) vs. just of (ops):
– What does this mean to Compilers, Data structures, Algorithms?
– VTune performance analyzer, cache misses.