Virtual Memory

Topics:
• Virtual memory: access, page table, TLB
• Programming for locality
• Memory Mountain revisited

Memory Hierarchy
• From smaller, faster, and costlier per byte (top) to larger, slower, and cheaper per byte (bottom):
  - registers
  - on-chip L1 cache (SRAM)
  - on-chip L2 cache (SRAM)
  - main memory (DRAM)
  - local secondary storage (local disks)
  - remote secondary storage (tapes, distributed file systems, Web servers)

Why Caches Work
• Temporal locality: recently referenced items are likely to be referenced again in the near future.
• Spatial locality: items with nearby addresses tend to be referenced close together in time; caches exploit this by fetching whole blocks.

Cache (L1 and L2) Performance Metrics
• Miss rate: fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate.
  - Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit time: time to deliver a block in the cache to the processor, including the time to determine whether the line is in the cache.
  - Typical numbers: 1-3 clock cycles for L1; 5-20 clock cycles for L2.
• Miss penalty: additional time required because of a miss.
  - Typically 50-400 cycles for main memory.

Let's Think About Those Numbers
• Huge difference between a hit and a miss: could be 100x with just L1 and main memory.
• Would you believe 99% hits is twice as good as 97%? Consider a cache hit time of 1 cycle and a miss penalty of 100 cycles. Average access time:
  - 97% hits: 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
  - 99% hits: 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles

Types of Cache Misses
• Cold (compulsory) miss: occurs on the first access to a block.
  - Spatial locality of access helps (also prefetching; more later).
• Conflict miss: multiple data objects all map to the same slot (as in hashing).
  - E.g., block i must be placed in cache entry/slot i mod 8, replacing the block already in that slot; referencing blocks 0, 8, 0, 8, ... would miss every time.
  - Conflict misses are less of a problem these days: set-associative caches with 8 or 16 lines per set help.
• Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache.
  - This is where to focus nowadays.

What About Writes?
• Multiple copies of data exist: L1, L2, main memory, disk.
• What to do on a write-hit?
  - Write-back: defer the write to memory until the line is replaced. Needs a dirty bit (is the line different from memory or not?).
  - Write-through: write immediately to memory (rare; usually for I/O).
• What to do on a write-miss?
  - Write-allocate: load the block into the cache, then update the line in the cache.
• Typical combination: write-back + write-allocate.

Main Memory Is Something Like a Cache (for Disk)
• Driven by an enormous miss penalty: disk is about 10,000x slower than DRAM.
• Design consequence: large page (block) size, typically 4 KB.

Virtual Memory
• Programs refer to virtual memory addresses:
  - Conceptually a very large array of bytes (4 GB for IA32, 16 exabytes for 64 bits).
  - Each byte has its own address.
  - The system provides an address space private to each process.
• Allocation: compiler and run-time system.
  - All allocation happens within a single virtual address space.

Virtual Addressing
• [Figure: the CPU issues a virtual address (VA); the MMU on the CPU chip translates it to a physical address (PA), which selects the data word in main memory.]
• MMU = Memory Management Unit.
• The MMU keeps the mapping of VAs to PAs in a "page table".

MMU Needs a Table of Translations
• [Figure: as before, but the MMU consults a page table to translate each VA into a PA.]
• The MMU keeps the mapping of VAs to PAs in a "page table".

Where Is the Page Table Kept?
• In main memory; it can be cached, e.g., in L2 (like data).
• [Figure: as before, with the page table residing in main memory rather than inside the MMU.]

Speeding Up Translation with a TLB
• Translation Lookaside Buffer (TLB): a small hardware cache for the page table, inside the MMU.
• Caches page table entries for a number of pages (e.g., 256 entries).

TLB Hit
• [Figure: (1) the CPU sends a VA to the MMU; (2) the MMU looks up the VA in the TLB; (3) the TLB returns the PTE; (4) the MMU sends the PA to memory; (5) memory returns the data.]
• A TLB hit saves you from accessing memory for the page table.

TLB Miss
• [Figure: (1) the CPU sends a VA to the MMU; (2) the TLB lookup misses; (3) the MMU sends a PTE request to the page table in memory; (4) memory returns the PTE; (5) the MMU sends the PA to memory; (6) memory returns the data.]
• A TLB miss incurs an additional memory access (to the page table).

How to Program for Virtual Memory
• At any point in time, programs tend to access a set of active virtual pages called the working set.
  - Programs with better temporal locality have smaller working sets.
• If working-set size > main memory size:
  - Thrashing: a performance meltdown where pages are swapped (copied) in and out continuously.
• If # working-set pages > # TLB entries:
  - The program will suffer TLB misses. Not as bad as page thrashing, but still worth avoiding.

More on TLBs
• Assume a 256-entry TLB and 4 KB pages: we can only have TLB hits for 1 MB of data (256 * 4 KB = 1 MB).
  - This is called the "TLB reach": the amount of memory the TLB can cover.
• A typical L2 cache is 6 MB; hence, should you consider TLB reach before L2 size when tiling?
• Real CPUs have second-level TLBs (like an L2 for the TLB).
  - This is getting complicated to reason about!
  - You likely have to experiment to find the best tile size.

Memory Optimization: Summary
• Caches:
  - Conflict misses: not much of a concern (set-associative caches).
  - Cache capacity: keep the working set within on-chip cache capacity; fit in L1 or L2 depending on working-set size.
• Virtual memory:
  - Page misses: keep the page-level working set within main memory capacity.
  - TLB misses: may want to keep the working set's # pages < the TLB's # entries.

IA32 Linux Memory Layout
• Stack: runtime stack (8 MB limit).
• Data: statically allocated data, e.g., arrays and strings declared in code.
• Heap: dynamically allocated storage, obtained from calls to malloc(), calloc(), or new.
• Text: executable machine instructions; read-only.