The Buffer Cache Jeff Chase Duke University
The kernel
[Figure: kernel entry and exit paths. Labels: syscall trap/return; fault/return; interrupt/return; system call layer: file API; fault entry: VM page faults; memory management: block/page cache; policy; I/O completions; timer ticks.]
DeFiler interfaces: overview
DFS: create, destroy, read, write a dfile; list dfiles
DBufferCache: DBuffer dbuf = getBlock(blockID); releaseBlock(dbuf)
DBuffer: read(), write(); startFetch(), startPush(); waitValid(), waitClean(); ioComplete()
VirtualDisk: startRequest(dbuf, r/w)
Memory Allocation
How should an OS allocate its memory resources among contending demands?
– Virtual address spaces: fork, exec, sbrk, page fault.
– The kernel controls how many machine memory frames back the pages of each virtual address space.
– The kernel can take memory away from a VAS at any time.
– The kernel always gets control if a VAS (or rather a thread running within a VAS) asks for more.
– The kernel controls how much machine memory to use as a cache for data blocks whose home is on slow storage.
– Policy choices: which pages or blocks to keep in memory? And which ones to evict from memory to make room for others?
Memory/storage hierarchy
Terms to know:
– cache index/directory
– cache line/entry, associativity
– cache hit/miss, hit ratio
– spatial locality of reference
– temporal locality of reference
– eviction / replacement
– write-through / writeback
– dirty/clean
The hierarchy, from small and fast (ns) to big and slow (ms):
– registers
– caches: L1/L2
– off-core: L3
– off-chip: main memory (RAM)
– off-module: disk, other storage, network RAM
• In general, each layer is a cache over the layer below.
– inclusion property
• Technology trends: rapid change
• The triangle is expanding vertically: bigger gaps, more levels
Memory as a cache
Processes access external storage objects through file APIs and the VM abstraction. The OS kernel manages caching of pages/blocks in main memory.
[Figure: virtual address spaces and data (files and filesystems, databases, other storage objects) make page/block read/write accesses to memory (frames); backing storage volumes (pages and blocks) live on disk and other storage, or network RAM.]
The Buffer Cache
[Figure: Proc, Memory, File cache]
Ritchie and Thompson, The UNIX Time-Sharing System, 1974
Editing Ritchie/Thompson
The system maintains a buffer cache (block cache, file cache) to reduce the number of I/O operations.
Suppose a process makes a system call to access a single byte of a file. UNIX determines the affected disk block and looks for it in the cache. If the block is not resident, UNIX allocates a cache buffer and reads the block into the buffer from the disk.
If the op is a write, it replaces the affected byte in the buffer. A buffer with modified data is marked dirty: an entry is made in a list of blocks to be written. The write call may then return; the actual write to disk may not complete until later.
If the op is a read, it picks the requested byte out of the buffer and returns it, leaving the block in the cache.
[Figure: Proc, Memory, File cache]
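The single-byte read and write paths described above can be sketched in a few lines of Java. This is only an illustration, not the UNIX or DeFiler code: the class name `TinyBufferCache`, the 8-byte block size, and the in-memory `disk` array are all assumptions, and the cache is unbounded (no eviction).

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of byte access through a write-back block cache.
class TinyBufferCache {
    static final int BLOCK_SIZE = 8;                       // tiny, for illustration only
    final byte[][] disk;                                   // stands in for the disk volume
    final Map<Integer, byte[]> cache = new HashMap<>();    // block number -> cached buffer
    final Set<Integer> dirty = new LinkedHashSet<>();      // blocks waiting to be written
    int diskReads = 0, diskWrites = 0;

    TinyBufferCache(int numBlocks) { disk = new byte[numBlocks][BLOCK_SIZE]; }

    // Find the block in the cache, or allocate a buffer and read it from disk.
    private byte[] lookup(int block) {
        byte[] buf = cache.get(block);
        if (buf == null) {
            buf = disk[block].clone();
            diskReads++;
            cache.put(block, buf);
        }
        return buf;
    }

    // Read: pick the requested byte out of the buffer; the block stays cached.
    byte readByte(int offset) {
        return lookup(offset / BLOCK_SIZE)[offset % BLOCK_SIZE];
    }

    // Write: replace the byte in the buffer and mark the block dirty, then return.
    void writeByte(int offset, byte value) {
        int block = offset / BLOCK_SIZE;
        lookup(block)[offset % BLOCK_SIZE] = value;
        dirty.add(block);
    }

    // The actual disk writes happen later, when the dirty list is flushed.
    void flush() {
        for (int block : dirty) {
            disk[block] = cache.get(block).clone();
            diskWrites++;
        }
        dirty.clear();
    }
}
```

Note that after `writeByte` returns, the disk copy is still stale until `flush()` runs; that gap is what "the actual write may not be completed until a later time" means.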
The DeFiler buffer cache
File abstraction implemented in upper DFS layer. All knowledge of how files are laid out on disk is at this layer. Access the underlying disk volume through the buffer cache API: obtain buffers (dbufs), write/read to/from buffers, orchestrate I/O.
DBufferCache: DBuffer dbuf = getBlock(blockID); releaseBlock(dbuf)
DBuffer: read(), write(); startFetch(), startPush(); waitValid(), waitClean()
Device I/O interface: asynchronous I/O to/from buffers; block read and write; blocks numbered by blockIDs
Page/block cache internals
HASH(blockID)
Each frame/buffer of memory is described by a meta-object (header).
Resident pages or blocks are accessible through a global hash table.
An ordered list of eviction candidates winds through the hash chains.
Some frames/buffers are free (no valid data). These are on a free list.
DBufferCache internals
HASH(blockID)
Any given block (blockID) is either resident or not. If resident, then it has exactly one copy (dbuf) in the cache.
If it is resident, then getBlock finds the dbuf (cache hit). This requires some kind of cache index, e.g., a hash table.
DBuffer dbuf = getBlock(blockID)
[Figure: DBufferCache with DBuffer headers and I/O cache buffers; each buffer is byte[blocksize].]
There is a one-to-one correspondence of dbufs to buffers.
DBufferCache internals
HASH(blockID)
If the requested block is not resident, then getBlock allocates a dbuf for the block and places the correct block contents in its buffer (cache miss).
If there are no free dbufs in the cache, then we must evict some other block from the cache and reuse its dbuf.
DBuffer dbuf = getBlock(blockID)
[Figure: DBufferCache with DBuffer headers and I/O cache buffers; each buffer is byte[blocksize].]
There is a one-to-one correspondence of dbufs to buffers.
Page/block cache internals
HASH(blockID) cache directory
List(s) of free buffers (bufs) or eviction candidates. These dbufs might be listed in the cache directory if they contain useful data, or not, if they are truly free.
To replace a dbuf:
– Remove from free/eviction list.
– Remove from cache directory.
– Change dbuf blockID and status.
– Enter in directory w/ new blockID.
– Re-register on eviction list.
Beware of concurrent accesses.
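One way to picture the hit path, the miss path, and the five replacement steps is the sketch below. The names `SketchDBuffer` and `SketchDBufferCache` are hypothetical, LRU order is an assumed eviction policy (the slides do not pick one), and the sketch ignores busy dbufs and leaves the new buffer invalid rather than fetching its contents. A single lock on the cache object is the simplest answer to "beware of concurrent accesses". It assumes at least one buffer.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical dbuf header: describes one buffer of the cache.
class SketchDBuffer {
    int blockID = -1;            // -1 means truly free (not in the directory)
    boolean valid = false;
    final byte[] buffer;
    SketchDBuffer(int blockSize) { buffer = new byte[blockSize]; }
}

class SketchDBufferCache {
    // Cache directory: blockID -> the one resident dbuf for that block.
    private final Map<Integer, SketchDBuffer> directory = new HashMap<>();
    // Free/eviction list: front is the next victim (free or least recently used).
    private final ArrayDeque<SketchDBuffer> evictionList = new ArrayDeque<>();
    int hits = 0, misses = 0;

    SketchDBufferCache(int numBufs, int blockSize) {
        for (int i = 0; i < numBufs; i++) {
            evictionList.addLast(new SketchDBuffer(blockSize));
        }
    }

    synchronized SketchDBuffer getBlock(int blockID) {
        SketchDBuffer dbuf = directory.get(blockID);
        if (dbuf != null) {                       // cache hit
            hits++;
            evictionList.remove(dbuf);            // move to the most-recent end
            evictionList.addLast(dbuf);
            return dbuf;
        }
        misses++;                                 // cache miss: reuse a dbuf
        SketchDBuffer victim = evictionList.pollFirst();   // 1. remove from free/eviction list
        if (victim.blockID >= 0) {
            directory.remove(victim.blockID);              // 2. remove from cache directory
        }
        victim.blockID = blockID;                          // 3. change blockID and status
        victim.valid = false;
        directory.put(blockID, victim);                    // 4. enter in directory w/ new blockID
        evictionList.addLast(victim);                      // 5. re-register on eviction list
        return victim;
    }

    synchronized boolean isResident(int blockID) {
        return directory.containsKey(blockID);
    }
}
```

A real cache would also skip victims that are held or pinned, and would push a dirty victim to disk before reusing it.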
DBuffer (dbuf) states
A DBuffer dbuf returned by getBlock is always associated with exactly one block in the disk volume. But it might or might not be “in sync” with the underlying disk contents.
A dbuf is valid iff it has the “correct” copy of the data.
A dbuf is dirty iff it is valid and has an update (a write) that has not yet been written to disk.
A valid dbuf is clean if it is not dirty.
Your DeFiler should return only valid data to a client. That may require you to zero the dbuf or fetch data from the disk.
Your DeFiler should ensure that all dirty data is eventually pushed to disk.
DFS uses DBuffer: read(…), write(…); startFetch(), startPush(); waitValid(), waitClean()
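The valid/dirty/clean definitions amount to two flags with the invariant that dirty implies valid. A minimal single-threaded sketch follows; the class `DBufState` and its method names are hypothetical, and treating a freshly zeroed block as dirty is an assumption rather than something the slides state.

```java
import java.util.Arrays;

// Hypothetical sketch of dbuf state tracking (no locking, no real I/O).
class DBufState {
    private final byte[] buffer;
    private boolean valid = false;   // buffer holds the "correct" copy of the block
    private boolean dirty = false;   // buffer has an update not yet written to disk

    DBufState(int blockSize) { buffer = new byte[blockSize]; }

    boolean isValid() { return valid; }
    boolean isDirty() { return dirty; }
    boolean isClean() { return valid && !dirty; }

    // A fetch from disk completed: the buffer now matches the disk.
    void fetchDone(byte[] fromDisk) {
        System.arraycopy(fromDisk, 0, buffer, 0, buffer.length);
        valid = true;
        dirty = false;
    }

    // A brand-new block has no useful disk contents: zero it instead of fetching.
    void zero() {
        Arrays.fill(buffer, (byte) 0);
        valid = true;
        dirty = true;   // assumption: the zeroed contents still need to reach disk
    }

    // Only valid data may be returned to a client.
    byte read(int offset) {
        if (!valid) throw new IllegalStateException("read of invalid dbuf");
        return buffer[offset];
    }

    // A partial write needs the rest of the block to be correct first.
    void write(int offset, byte value) {
        if (!valid) throw new IllegalStateException("fetch or zero the dbuf before writing");
        buffer[offset] = value;
        dirty = true;
    }

    // A push to disk completed: the disk now matches the buffer.
    void pushDone() { dirty = false; }
}
```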
Asynchronous I/O on dbufs
Start I/O on a dbuf by posting it to a producer/consumer queue for service by a device thread: startFetch(), startPush().
Client threads may wait on the dbuf for asynchronous I/O to complete: waitValid(), waitClean().
The device thread upcalls the dbuf's ioComplete() when the I/O operation is done.
DBuffer: startFetch(), startPush(); waitValid(), waitClean(); ioComplete()
Device I/O interface (async I/O on dbufs): startRequest(dbuf, r/w)
VirtualDisk: device threads
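The fetch half of this protocol can be sketched with a blocking queue and a monitor. Everything here is illustrative: `AsyncDBuffer` and `AsyncDisk` are hypothetical names, the "disk" is an in-memory array, and only reads are modeled (a push would follow the same pattern with a clean flag and `waitClean()`).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical dbuf that clients can wait on while a fetch is in flight.
class AsyncDBuffer {
    final int blockID;
    final byte[] buffer;
    private boolean valid = false;

    AsyncDBuffer(int blockID, int blockSize) {
        this.blockID = blockID;
        this.buffer = new byte[blockSize];
    }

    // Producer side: post the dbuf to the device queue and return at once.
    void startFetch(AsyncDisk disk) { disk.startRequest(this); }

    // Client side: block until the device thread reports completion.
    synchronized void waitValid() {
        boolean interrupted = false;
        while (!valid) {
            try { wait(); } catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();
    }

    // Upcall from the device thread when the I/O operation is done.
    synchronized void ioComplete() {
        valid = true;
        notifyAll();
    }

    synchronized boolean isValid() { return valid; }
}

// Hypothetical device: one consumer thread servicing the request queue.
class AsyncDisk implements Runnable {
    private final BlockingQueue<AsyncDBuffer> queue = new LinkedBlockingQueue<>();
    private final byte[][] blocks;   // stands in for the disk contents

    AsyncDisk(byte[][] blocks) { this.blocks = blocks; }

    void startRequest(AsyncDBuffer dbuf) { queue.add(dbuf); }

    @Override
    public void run() {
        try {
            while (true) {
                AsyncDBuffer dbuf = queue.take();          // wait for work
                System.arraycopy(blocks[dbuf.blockID], 0, dbuf.buffer, 0, dbuf.buffer.length);
                dbuf.ioComplete();                         // wake any waiters
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();            // shut down on interrupt
        }
    }
}
```

The `while (!valid)` loop around `wait()` guards against spurious wakeups, and because `ioComplete()` and `waitValid()` synchronize on the same dbuf, the waiter also sees the buffer contents the device thread copied in.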
More dbuf states
Do not evict a dbuf that is in active use (busy)!
A dbuf is pinned if I/O is in progress, i.e., a disk request has started but not yet completed.
A dbuf is held if DFS obtained a reference to the dbuf from getBlock but has not yet released the dbuf.
DFS and DBufferCache: dbuf = getBlock(blockID); releaseBlock(dbuf)
DBuffer and VirtualDisk: startRequest(dbuf, r/w); ioComplete()
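A small sketch of the busy check an eviction routine might consult. The class `BusyDBuffer` is hypothetical, and counting holders (so that several getBlock callers can hold the same dbuf) is an assumption; the slides only say that a dbuf is held between getBlock and releaseBlock.

```java
// Hypothetical bookkeeping for the held and pinned states of one dbuf.
class BusyDBuffer {
    private int holders = 0;               // getBlock calls not yet matched by releaseBlock
    private boolean ioInProgress = false;  // true between startRequest and ioComplete

    synchronized void hold() { holders++; }            // called from getBlock

    synchronized void release() {                      // called from releaseBlock
        if (holders == 0) throw new IllegalStateException("release without hold");
        holders--;
    }

    synchronized void startRequest() { ioInProgress = true; }
    synchronized void ioComplete() { ioInProgress = false; }

    synchronized boolean isHeld() { return holders > 0; }
    synchronized boolean isPinned() { return ioInProgress; }

    // A dbuf is busy, and must not be evicted, while it is held or pinned.
    synchronized boolean isEvictable() { return holders == 0 && !ioInProgress; }
}
```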
File system layer (DFS)
create, destroy, read, write a dfile; list dfiles
Allocate blocks to files and file metadata.
Allocate DFileIDs to files.
Track which blockIDs and DFileIDs are free and which are in use.
Maintain a block map “inode” for each file, as metadata stored on disk.
DBufferCache: DBuffer dbuf = getBlock(blockID); releaseBlock(dbuf); sync()
DBuffer: read(), write(); startFetch(), startPush(); waitValid(), waitClean()
A Filesystem On Disk
[Figure: toy disk layout. Sector 0 holds the allocation bitmap file and sector 1 the directory file, with entries wind: 18, snow: 62, rain: 32, hail: 48. Other sectors hold bitmap bits (111000101101 10111101 ...) and file contents (“once upon a time/nin a l”, “and far away, lived th”). Highlighted: data.]
A Filesystem On Disk
[Figure: the same toy disk layout, with the allocation bitmap file in sector 0 and the directory file in sector 1 (wind: 18, snow: 62, rain: 32, hail: 48). Highlighted: metadata.]
Managing files
create, destroy, read, write a dfile; list dfiles
Each file has a size: it is the first byte offset in the file that has never been written. Never return data past a file’s size.
Fetch blocks for data and metadata (or zero fresh ones), read and write in place, and push dirty blocks back to the disk.
Serialize DFS read/write on each inode.
DBufferCache: DBuffer dbuf = getBlock(blockID); releaseBlock(dbuf); sync()
DBuffer: read(), write(); startFetch(), startPush(); waitValid(), waitClean()
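The size rule reduces to two small calculations, sketched here. The helper class `DFileSize` and both method names are hypothetical, and the sketch assumes writes never leave holes (a write may start at or before the current size, not beyond it), which is what makes "first byte offset never written" a well-defined size.

```java
// Hypothetical helpers for enforcing a file's size on reads and writes.
class DFileSize {
    // How many of the requested bytes may be returned without reading past the size.
    static int clampRead(int size, int offset, int count) {
        if (offset < 0 || count < 0) throw new IllegalArgumentException("negative offset or count");
        if (offset >= size) return 0;             // nothing has been written here yet
        return Math.min(count, size - offset);
    }

    // New size after writing count bytes at offset (assumes no holes).
    static int sizeAfterWrite(int size, int offset, int count) {
        if (offset < 0 || count < 0) throw new IllegalArgumentException("negative offset or count");
        if (offset > size) throw new IllegalArgumentException("write would leave a hole");
        return Math.max(size, offset + count);
    }
}
```

For example, with a 10-byte file, a request for 5 bytes at offset 8 may return only 2 bytes, and a 6-byte write at offset 10 grows the file to 16 bytes.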
Representing a File On Disk
The “inode” holds file attributes (e.g., size) and a block map. The block map is indexed by logical block number and maps to a blockID.
Access blocks through the block cache with getBlock, startFetch, waitValid, read, releaseBlock.
[Figure: an inode whose block map points to logical block 0 (“once upon a time/nin a l”), logical block 1 (“and far away, /nlived t”), and logical block 2 (“he wise and sage wizard.”).]
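Translating a file offset into a disk block is then plain integer arithmetic over the block map. In this sketch the class `SketchInode`, its fields, and the sample blockIDs are all made up for illustration; the on-disk inode format is not specified here.

```java
// Hypothetical in-memory view of an inode: file size plus a block map.
class SketchInode {
    final int blockSize;
    final int size;          // file size in bytes
    final int[] blockMap;    // index: logical block number, value: blockID on disk

    SketchInode(int blockSize, int size, int[] blockMap) {
        this.blockSize = blockSize;
        this.size = size;
        this.blockMap = blockMap;
    }

    // Which logical block of the file contains this byte offset?
    int logicalBlock(int offset) { return offset / blockSize; }

    // Where inside that block does the byte sit?
    int offsetInBlock(int offset) { return offset % blockSize; }

    // The blockID to pass to getBlock for this byte offset.
    int blockIDFor(int offset) {
        if (offset < 0 || offset >= size) {
            throw new IndexOutOfBoundsException("offset " + offset + " is outside the file");
        }
        return blockMap[logicalBlock(offset)];
    }
}
```

With 8-byte blocks and a block map of {17, 4, 9}, byte 9 of the file lives at position 1 of logical block 1, which is disk block 4.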
Filesystem layout on disk
Fixed locations on disk: inode 0 is the bitmap file; inode 1 is the root directory.
[Figure: the allocation bitmap file blocks (111000101101 10111101 ...), root directory entries (wind: 18, snow: 62, rain: 32, hail: 48), an inode, and file blocks (“once upon a time/nin a l”, “and far away, lived th”).]
This is a toy example (Nachos).
Filesystem layout on disk
Your DeFiler volume is small. You can keep the free block/inode maps in memory. You don’t need metadata structures on disk for that. But you have to scan the disk to rebuild the in-memory structures on initialization.
DeFiler must be able to find all valid inodes on disk.
DeFiler has no directories. You just need to keep track of which DFileIDs are currently valid, and return a list.
[Figure: the toy layout with the bitmap file (inode 0) and root directory (inode 1) crossed out, leaving an inode and its file blocks.]
Disk layout: the easy way
DeFiler must be able to find all valid inodes on disk. Given a list of valid inodes, you can determine which inodes and blocks are free and which are in use.
[Figure: an inode pointing to its file blocks.]
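Rebuilding the in-memory maps from a scan might look like the sketch below. It assumes the scan has already decoded each inode slot into a hypothetical `ScannedInode` record (DFileID, a valid flag, and the blockIDs in its block map); how inodes are located and encoded on disk is left open. Any inode or block not marked used is free.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical result of decoding one inode slot during the startup scan.
class ScannedInode {
    final int dFileID;
    final boolean valid;
    final int[] blockIDs;

    ScannedInode(int dFileID, boolean valid, int[] blockIDs) {
        this.dFileID = dFileID;
        this.valid = valid;
        this.blockIDs = blockIDs;
    }
}

// In-memory maps of what is in use; anything not set is free.
class FreeMaps {
    final BitSet usedInodes = new BitSet();
    final BitSet usedBlocks = new BitSet();

    static FreeMaps rebuild(List<ScannedInode> scanned) {
        FreeMaps maps = new FreeMaps();
        for (ScannedInode inode : scanned) {
            if (!inode.valid) continue;               // free inode: its blocks are free too
            maps.usedInodes.set(inode.dFileID);
            for (int blockID : inode.blockIDs) {
                if (maps.usedBlocks.get(blockID)) {   // two files claiming one block: corrupt
                    throw new IllegalStateException("block " + blockID + " is claimed twice");
                }
                maps.usedBlocks.set(blockID);
            }
        }
        return maps;
    }
}
```

The set bits of `usedInodes` also give the list of currently valid DFileIDs to return when listing dfiles.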