Local file systems
Landon Cox
March 26, 2018

Block-oriented vs byte-oriented
• Disks are accessed in terms of blocks
  • Also called sectors
  • Similar idea to memory pages
  • E.g., 4 KB chunks of data
• First problem: programs deal with bytes
  • E.g., want to change ‘J’ in “Jello world” to ‘H’
• Disks only support block-sized, bulk accesses

Block-oriented vs byte-oriented
• To read less than a block
  • Read entire block
  • Return the right portion
• How to write less than a block?
  • Read entire block
  • Modify the right portion
  • Write out entire block
• Nothing analogous to byte-grained load/store
  • Byte-grained access can only be accomplished via mmap
  • Flash devices are even more complicated
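The read and write steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `BlockDevice`, `read_bytes`, and `write_bytes` are hypothetical names, the "disk" is an in-memory list, and the sketch handles accesses that fit inside a single block.

```python
BLOCK_SIZE = 4096  # 4 KB blocks, as in the slides


class BlockDevice:
    """Toy in-memory disk (hypothetical) that only moves whole blocks."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def read_block(self, n):
        return self.blocks[n]

    def write_block(self, n, data):
        assert len(data) == BLOCK_SIZE, "disk accepts only full blocks"
        self.blocks[n] = bytes(data)


def read_bytes(dev, offset, length):
    """Read less than a block: fetch the whole block, return a slice."""
    block_no, start = divmod(offset, BLOCK_SIZE)
    assert start + length <= BLOCK_SIZE, "sketch handles one block only"
    return dev.read_block(block_no)[start:start + length]


def write_bytes(dev, offset, data):
    """Write less than a block: read, modify in memory, write back."""
    block_no, start = divmod(offset, BLOCK_SIZE)
    assert start + len(data) <= BLOCK_SIZE, "sketch handles one block only"
    buf = bytearray(dev.read_block(block_no))  # 1. read entire block
    buf[start:start + len(data)] = data        # 2. modify the right portion
    dev.write_block(block_no, buf)             # 3. write out entire block


dev = BlockDevice(num_blocks=4)
write_bytes(dev, 0, b"Jello world")
write_bytes(dev, 0, b"H")       # change 'J' to 'H'
print(read_bytes(dev, 0, 11))   # b'Hello world'
```

Changing one byte still costs one full block read and one full block write, which is the point of the slide.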

Disk drives over the years

Disk geometry

Surface organized into tracks

Parallel tracks form cylinders

Tracks broken up into sectors

Disk head position

Rotation is counter-clockwise

About to read a sector

After reading blue sector

Red request scheduled next
[Figure: head position after BLUE read]

Seek to red’s track
[Figure: after BLUE read, SEEK to RED’s track]

Wait for red sector to reach head
[Figure: after the seek for RED, ROTATE (rotational latency)]

Read red sector
[Figure: head position after RED read, following seek and rotational latency]

To access a disk
1. Queue (wait for disk to be free)
  • 0 to infinity ms
2. Position disk head and arm
  • Seek + rotation
  • 0-10 ms
  • Pure overhead: minimize time spent doing this
3. Access disk data
  • Size/transfer rate (e.g., 1 MB/s)
  • Useful work: maximize time spent doing this
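The three components above add up to the total access time. A rough sketch of the arithmetic, assuming the slide's example numbers (10 ms worst-case positioning, 1 MB/s transfer rate, taken here as 1,000,000 bytes per second); the function names are illustrative only:

```python
def transfer_ms(size_bytes, rate_bytes_per_s):
    """Step 3: size / transfer rate, in milliseconds."""
    return size_bytes / rate_bytes_per_s * 1000.0


def access_ms(queue_ms, position_ms, size_bytes, rate_bytes_per_s):
    """Total time = queueing + positioning (seek + rotation) + transfer."""
    return queue_ms + position_ms + transfer_ms(size_bytes, rate_bytes_per_s)


def useful_fraction(position_ms, size_bytes, rate_bytes_per_s):
    """Share of disk time spent transferring data rather than positioning."""
    t = transfer_ms(size_bytes, rate_bytes_per_s)
    return t / (position_ms + t)


RATE = 1_000_000  # assumed: 1 MB/s
small = useful_fraction(10, 4096, RATE)       # one 4 KB block
large = useful_fraction(10, 1_000_000, RATE)  # a 1 MB sequential transfer
print(small < large)  # True: bigger transfers amortize the positioning cost
```

The comparison suggests why file systems try to lay out data so that one positioning step is followed by a long transfer.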

File systems
• What is a file system?
  • OS abstraction that makes disks easy to use
  • Place to put persistent data
• File system issues
  1. How to map file space onto disk space
    • Structure and allocation: like memory management
  2. How to use names instead of sectors
    • Naming and directories: not like memory
    • But very similar to DNS

Intro to file system structure
• Overall question:
  • How do we organize things on disk?
  • Really a data structure question
  • What data structures do we use on disk?

Intro to file system structure
• Need an initial object to get things going
  • In file systems, this is a file header
  • Unix: this is called an inode (indexed node)
• inode contains info about the file
  • Size, owner, access permissions
  • Last modification date
• Many ways to organize inodes on disk
  • Use actual usage patterns to make good decisions

File system usage patterns
1. 80% of file accesses are reads (20% writes)
  • Ok to save on reads, even if it hurts writes
2. Most file accesses are sequential and full
  • Form of spatial locality
  • Put sequential blocks next to each other
  • Can pre-fetch blocks next to each other
3. Most files are small
4. Most bytes are consumed by large files

1) Contiguous allocation
• Store a file in one contiguous segment
  • Sometimes called an “extent”
• Reserve space in advance of writing it
  • User could declare in advance
  • If it grows larger, move it to a place that fits

1) Contiguous allocation
• File header contains
  • Starting location (block #) of file
  • File size (# of blocks)
  • Other info (modification times, permissions)
• Exactly like base and bounds memory
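A minimal sketch of the base-and-bounds analogy, assuming a hypothetical `ContiguousHeader` type that holds only the two fields needed for address translation:

```python
from dataclasses import dataclass


@dataclass
class ContiguousHeader:
    start_block: int  # starting location (block #) of the file: the "base"
    num_blocks: int   # file size in blocks: the "bounds"

    def disk_block(self, file_block):
        """Translate a file block number into a disk block number."""
        if not 0 <= file_block < self.num_blocks:
            raise IndexError("file block out of bounds")
        return self.start_block + file_block


header = ContiguousHeader(start_block=100, num_blocks=3)
print(header.disk_block(2))  # 102
```

Translation is one bounds check and one addition, which is why random access is cheap under this scheme.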

1) Contiguous allocation
• Pros?
  • Fast sequential access
  • Easy random access
• Cons?
  • External/internal fragmentation
  • Hard to grow files
[Figure: disk layout of Header, B0, B1, B2, Reserved]

2) Indexed files
• File header

  File block #   Disk block #
  0              18
  1              50
  2              8
  3              15

• Looks a lot like a page table
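The table above maps directly onto an array lookup. The sketch below uses the slide's four entries; `disk_block` and `nonadjacent_jumps` are illustrative helper names, and counting non-adjacent neighbors is only a crude stand-in for seeks.

```python
# Index from the slide: position = file block #, value = disk block #.
index = [18, 50, 8, 15]


def disk_block(index, file_block):
    """Random access is a single array lookup, like a page table."""
    return index[file_block]


def nonadjacent_jumps(index):
    """Count consecutive file blocks that are not neighbors on disk."""
    return sum(1 for a, b in zip(index, index[1:]) if b != a + 1)


print(disk_block(index, 1))      # 50
print(nonadjacent_jumps(index))  # 3
```

With this layout, reading the file front to back moves the head after every block, whereas a contiguous layout such as `[18, 19, 20, 21]` would need no jumps.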

2) Indexed files
• Pros
  • Easy to grow (don’t have to reserve in advance)
  • Easy random access
• Cons
  • How to grow beyond index size?
  • Sequential access may be slow. Why?
    • May have to seek after each block read
• Why isn’t sequential access a problem with page tables?
  • Memory doesn’t have seek times.

2) Indexed files
• Pros
  • Easy to grow (don’t have to reserve in advance)
  • Easy random access
• Cons
  • How to grow beyond index size?
  • Potential for lots of seeks for sequential access
• How to reduce seeks for sequential access?
  • Don’t want to pre-allocate blocks.

What about large files?
• Could just assume it will be really large
• Problem?
  • Wastes space in header if file is small
  • Max file size is 4 GB, file block is 4 KB: 1 million pointers
  • 4 MB header for 4-byte pointers
• Remember most files are small
  • 10,000 small files: 40 GB of headers
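The header-size figures above follow from simple arithmetic. A quick check, assuming binary units (1 KB = 2^10 bytes and so on):

```python
KB, MB, GB = 2**10, 2**20, 2**30

max_file_size = 4 * GB
block_size = 4 * KB
pointer_size = 4  # bytes per block pointer

pointers = max_file_size // block_size  # one pointer per file block
header_bytes = pointers * pointer_size  # size of a worst-case header
total_bytes = 10_000 * header_bytes     # 10,000 files, each with a full header

print(pointers)            # 1048576 (about 1 million)
print(header_bytes // MB)  # 4
print(total_bytes / GB)    # 39.0625 (roughly 40 GB)
```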

What about large files?
• Could use a larger block size
• Problem?
  • Internal fragmentation (most files are small)
• Solution
  • Use a more sophisticated data structure

3) Multi-level indexed files
• Think of indexed files as a shallow tree
• Instead could have a multi-level tree
  • Level 1 points to level 2 nodes
  • Level 2 points to level 3 nodes
• Gives us big files without wasteful headers

3) Multi-level indexed files
[Figure: tree with Level 1, Level 2, Level 3 (data)]
• How many accesses to read one block of data?
  • 3 (one for each level)

3) Multi-level indexed files
[Figure: tree with Level 1, Level 2, Level 3 (data)]
• How to improve performance?
  • Caching.

3) Multi-level indexed files
• To reduce number of disk accesses
  • Cache level 1 and level 2 nodes
• Often a useful combination
  • Indirection for flexibility
  • Caching to speed up indirection
  • Can cache lots of small pointers
• Where else do we see this strategy?
  • TLB, DNS
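A small sketch of a three-level lookup, with and without a cache for the index nodes. Everything here is an assumption made for illustration: the tiny `FANOUT`, the block numbers, and the names `Disk` and `read_file_block`.

```python
FANOUT = 4  # pointers per index node (tiny, for illustration)


class Disk:
    """Toy disk that counts how many block reads it serves."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.reads = 0

    def read(self, block_no):
        self.reads += 1
        return self.blocks[block_no]


def read_file_block(disk, root, file_block, cache=None):
    """Walk level 1 -> level 2 -> data, consulting the cache for index nodes."""
    def load_index(block_no):
        if cache is not None and block_no in cache:
            return cache[block_no]
        node = disk.read(block_no)
        if cache is not None:
            cache[block_no] = node
        return node

    level1 = load_index(root)
    level2 = load_index(level1[file_block // FANOUT])
    return disk.read(level2[file_block % FANOUT])


# Level 1 node at block 0, level 2 nodes at 10 and 11, data at 100..107.
blocks = {0: [10, 11], 10: [100, 101, 102, 103], 11: [104, 105, 106, 107]}
blocks.update({100 + i: f"data{i}" for i in range(8)})

cold = Disk(blocks)
read_file_block(cold, 0, 5)
print(cold.reads)  # 3: one access per level

warm = Disk(blocks)
cache = {}
read_file_block(warm, 0, 5, cache)
read_file_block(warm, 0, 6, cache)
print(warm.reads)  # 4: the second read hits cached index nodes
```

Only index nodes are cached here, mirroring the slide's point that many small pointers fit in a modest cache.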

3) Multi-level indexed files
[Figure: tree with Level 1, Level 2, Level 3]
• What about small files (i.e., most files)?

3) Multi-level indexed files
• Use a non-uniform tree
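One common shape for a non-uniform tree is a header with a few direct pointers, then an indirect block, then a double indirect block. The sketch below assumes 12 direct pointers (a hypothetical choice, not stated in the slides) and 1024 pointers per index block (4 KB blocks with 4-byte pointers, matching the earlier example).

```python
NUM_DIRECT = 12        # assumption: direct pointers stored in the header
PTRS_PER_BLOCK = 1024  # 4 KB block / 4-byte pointers


def indirection_depth(file_block):
    """Number of index blocks to read before reaching the data block."""
    if file_block < NUM_DIRECT:
        return 0  # small files: pointer lives in the header itself
    file_block -= NUM_DIRECT
    if file_block < PTRS_PER_BLOCK:
        return 1  # single indirect block
    file_block -= PTRS_PER_BLOCK
    if file_block < PTRS_PER_BLOCK ** 2:
        return 2  # double indirect block
    raise ValueError("file block beyond what this sketch supports")


print(indirection_depth(3))       # 0
print(indirection_depth(500))     # 1
print(indirection_depth(50_000))  # 2
```

Under these assumptions a file of up to 12 blocks pays no indirection cost at all, and only large files walk the deeper levels.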

3) Multi-level indexed files
• Pros
  • Simple
  • Files can easily expand
  • Small files don’t pay the full overhead
• Cons
  • Large files need lots of indirect blocks (slow)
  • Could have lots of seeks for sequential access

Multiple updates and reliability
• Reliability is only an issue in file systems
  • Don’t care about losing address space after crash
  • Your files shouldn’t disappear after a crash
  • Files should be permanent
• Multi-step updates cause problems
  • Can crash in the middle

Multi-step updates
• Transfer $100 from Melissa’s account to mine
  1. Deduct $100 from Melissa’s account
  2. Add $100 to my account
• Crash between 1 and 2, we lose $100

Multi-step updates
• Same for directories
  • “mv /tmp/foo.txt /home/”
  • foo.txt removed from /tmp, added to /home
• Acceptable outcomes if crash in middle?
  • foo.txt in /tmp and /home
  • foo.txt in /tmp, not in /home
• Unacceptable outcome?
  • foo.txt not in /tmp or in /home

Multiple updates and reliability
• This is a well-known, undergrad OS-level problem
• No modern OS would make this mistake, right?
• Video evidence suggests otherwise
  • Directory with 3 files
  • Want to move them to external drive
  • Drive “fails” during move
  • Don’t want to lose data due to failure
• Roll film …

Bug in OS X Leopard

Multi-step updates
• Move file from one directory to another
  1. Delete from old directory
  2. Add to new directory
• Crash between 1 and 2, we lose a file
[Figure: “/home/lpc/names” and “/home/chase/names”]

Multi-step updates
• Create an empty new file
  1. Point directory to new file header
  2. Initialize new file header
• What happens if we crash between 1 and 2?
  • Directory will point to uninitialized header
  • Kernel will crash if you try to access it
• How do we fix this?
  • Re-order the writes

Multi-step updates
• Create an empty new file
  1. Initialize new file header
  2. Point directory to new file header
• What happens if we crash between 1 and 2?
  • File doesn’t exist
  • File system won’t point to garbage
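The effect of write ordering can be simulated with a toy file system that "crashes" after the first step. `ToyFS`, `create`, and `dangling_names` are hypothetical names used only for this sketch, and header number 9 is arbitrary.

```python
class Crash(Exception):
    """Simulated power failure."""


class ToyFS:
    def __init__(self):
        self.headers = {}    # header # -> initialized header contents
        self.directory = {}  # file name -> header #


def create(fs, name, header_no, order, crash_after=None):
    """Run the two create steps in the given order, optionally crashing."""
    steps = {
        "init": lambda: fs.headers.update({header_no: {"size": 0}}),
        "link": lambda: fs.directory.update({name: header_no}),
    }
    for done, step in enumerate(order, start=1):
        steps[step]()
        if crash_after == done:
            raise Crash(f"crashed after step {done}")


def dangling_names(fs):
    """Directory entries that point at an uninitialized header."""
    return [n for n, h in fs.directory.items() if h not in fs.headers]


def crash_test(order):
    fs = ToyFS()
    try:
        create(fs, "new", 9, order, crash_after=1)
    except Crash:
        pass
    return fs


bad = crash_test(["link", "init"])   # point directory first
good = crash_test(["init", "link"])  # initialize header first
print(dangling_names(bad))   # ['new']: directory points to garbage
print(dangling_names(good))  # []: the file simply doesn't exist
```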

Multi-step updates
• What if we also have to update a map of free blocks?
  1. Initialize new file header
  2. Point directory to new file header
  3. Update the free block map
• Does this work?
  • Bad if crash between 2 and 3
  • Free block map will still think new file header is free

Multi-step updates
• What if we also have to update a map of free blocks?
  1. Initialize new file header
  2. Update the free block map
  3. Point directory to new file header
• Does this work?
  • Better, but still bad if crash between 2 and 3
  • Leads to a disk block leak
  • Could scan the disk after a crash to recompute free map
  • Older versions of Unix and Windows do this (now we have journaling file systems …)
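The "scan the disk to recompute the free map" idea can be sketched as a reachability pass: anything not reachable from a directory is free. The data layout below (a single flat directory, headers as lists of data block numbers) is a simplifying assumption, and `recompute_free_map` is an illustrative name.

```python
def recompute_free_map(num_blocks, directory, headers):
    """Rebuild the free map: a block is in use only if a directory entry
    reaches it, either as a file header or as one of that file's data blocks."""
    used = set()
    for header_no in directory.values():
        used.add(header_no)
        used.update(headers[header_no])
    return [block not in used for block in range(num_blocks)]


# Crash between steps 2 and 3: header 3 was initialized and marked used,
# but the directory never pointed to it, so block 3 has leaked.
directory = {"a": 1}
headers = {1: [2], 3: []}
stale_free_map = [True, False, False, False, True, True]  # True = free

fixed = recompute_free_map(6, directory, headers)
print(fixed)  # [True, False, False, True, True, True]
```

The scan reclaims the leaked block, but it has to touch every directory and header, which is part of why it is slow on large disks.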

inode table
• inode table pre-allocated in well-known place.
• Each file has an inode (dirs are special files).
[Figure: inode table with entries 1, 2, …, n-1, n; each inode holds meta-data plus pointers to a direct block, an indirect block, and a double indirect block]

Write order and corruption
Rule 1: Don’t point to uninitialized data
• Create foo/bar/new
  1) assign new inode for new
  2) point bar’s block to new’s inode
  3) crash before inode is initialized
[Figure: Dir: foo, baz inode, File: baz, Data, bar inode, Dir: bar, and a “?” where new’s inode should be]

Write order and corruption
Rule 2: Don’t re-use before nullifying existing pointers
• Delete foo/baz + write foo/bar
  1) update free map: baz’s data block free
  2) allocate baz’s data block to bar
  3) point bar at baz’s data block
  4) crash
[Figure: Dir: foo, Free Map, baz inode, File: baz, Data, bar inode, File: bar]

Write order and corruption
Rule 3: Set new pointer before resetting old one
• Mv foo/baz foo/bar/
  1) remove foo’s pointer to baz
  2) crash
[Figure: Dir: foo, baz inode, File: baz, Data, bar inode, Dir: bar]

Ideal file system
• Apps never wait for disk writes
• Minimize number of disk writes
• Minimize memory used for caching
• Maximize disk scheduler flexibility
• Two approaches
  • Journaling (apply to log then FS)
  • Soft updates (maintain dependency info)

Journaling
• Write to journal, then write to file system
• Example: Mv foo/baz foo/bar/
[Figure: journal holding the records “foo ! baz inode” and “bar baz inode”, alongside Dir: foo, baz inode, File: baz, Data, bar inode, Dir: bar]

Journaling
• Write to journal, then write to file system
• Do we need begin/end transaction?
  • No, ordering ensures consistency
[Figure: same journal and directory diagram]

Journaling
• Write to journal, then write to file system
• Can we reverse the order of operations?
  • No, could crash during replay
[Figure: same journal and directory diagram]

Journaling
• Write to journal, then write to file system
• Why faster than sync, ordered FS updates?
  • Synchronous FS updates may require seeks
  • Writing to log is sequential
  • Can apply updates to in-memory cache
  • Can flush blocks at leisure
[Figure: same journal and directory diagram]
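A toy version of "write to journal, then write to file system" for the mv example. The record format, the `replay` function, and inode number 7 are all assumptions made for this sketch; real journals also deal with checksums, block images, and log truncation.

```python
def replay(journal, dirs):
    """Apply journal records in order. Records are absolute ("set"/"clear"),
    so replaying a record that was already applied changes nothing."""
    for op, directory, name, inode in journal:
        if op == "set":
            dirs[directory][name] = inode
        else:  # "clear"
            dirs[directory].pop(name, None)


# On-disk directories before "mv foo/baz foo/bar/" (inode 7 is hypothetical).
dirs = {"foo": {"baz": 7}, "foo/bar": {}}
journal = []

# 1. Append both records to the journal (a sequential write). The new
#    pointer is logged before the old one is cleared.
journal.append(("set", "foo/bar", "baz", 7))
journal.append(("clear", "foo", "baz", None))

# 2. Start applying to the file system, then "crash" after the first record.
replay(journal[:1], dirs)

# 3. Recovery: replay the whole journal from the start.
replay(journal, dirs)
print(dirs)  # {'foo': {}, 'foo/bar': {'baz': 7}}
```

Because the log is written first, recovery can always finish the move, and replaying from the start is safe even if some records already reached the file system.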

Soft updates
• Maintain dependency information
  • Only write blocks after those they depend on
  • Don’t have to write anything synchronously
• Example: create file A, remove file B
[Figure: inode block holding Inode #4 to Inode #7; dir block holding <-, #0>, <B, #5>, <C, #7>]

Soft updates
• Maintain dependency information
  • Only write blocks after those they depend on
  • Don’t have to write anything synchronously
• Example: create file A, remove file B
[Figure: inode block holding Inode #4 to Inode #7; dir block holding <A, #4>, <B, #5>, <C, #7>]
• Need to write inode block before directory block.

Soft updates
• Maintain dependency information
  • Only write blocks after those they depend on
  • Don’t have to write anything synchronously
• Example: create file A, remove file B
[Figure: inode block holding Inode #4 to Inode #7; dir block holding <A, #4>, <-, #0>, <C, #7>]
• Need to write directory block before inode block.

Soft updates
• Maintain dependency information
  • Only write blocks after those they depend on
  • Don’t have to write anything synchronously
• Example: create file A, remove file B
[Figure: inode block holding Inode #4 to Inode #7; dir block holding <A, #4>, <-, #0>, <C, #7>]
• Oh no, a cyclic dependency!

Soft updates
• Solution
  • Finer-grained dependencies (not coarse, block-based dependencies)
  • Maintain per-field and per-pointer
  • May have to redo/undo updates to fields/pointers
• Consider the previous example

Example: create file A, remove file B
• What happens on recovery?
• What is odd about the state of this block?
[Figure: memory and disk copies of the inode block (Inode #4 to Inode #7) and directory block. Memory directory block: <A, #4>, <-, #0>, <C, #7>. Disk directory block at the starting point: <-, #0>, <B, #5>, <C, #7>]
• Step 1: safe version of directory block written
[Figure: disk directory block becomes <-, #0>, <-, #0>, <C, #7>]

Example: create file A, remove file B
• What happens on recovery?
[Figure: same memory and disk diagram; disk directory block is still <-, #0>, <-, #0>, <C, #7>]
• Step 2: inode block written

Example: create file A, remove file B
• What happens on recovery?
[Figure: same memory and disk diagram; disk directory block becomes <A, #4>, <-, #0>, <C, #7>]
• Step 3: directory block written
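The three steps can be sketched with per-entry dependencies on the directory block. This is a loose illustration, not the actual soft updates implementation: `safe_dir_image` and `can_clear_inode` are hypothetical helpers, and the set of inodes initialized on disk at the start (#5 and #7) is an assumption.

```python
EMPTY = ("-", 0)  # unused directory entry, written <-, #0> in the slides


def safe_dir_image(mem_dir, inodes_on_disk):
    """Version of the directory block that is safe to write right now:
    entries whose inode is not yet initialized on disk are rolled back."""
    return [e if e == EMPTY or e[1] in inodes_on_disk else EMPTY
            for e in mem_dir]


def can_clear_inode(disk_dir, inode_no):
    """An inode may be cleared on disk only once no on-disk entry names it."""
    return all(ino != inode_no for _, ino in disk_dir)


mem_dir = [("A", 4), EMPTY, ("C", 7)]    # in memory: A created, B removed
disk_dir = [EMPTY, ("B", 5), ("C", 7)]   # on disk: starting point
inodes_on_disk = {5, 7}                  # assumed initialized on disk

# Step 1: write the safe directory block (A rolled back, B's removal kept).
disk_dir = safe_dir_image(mem_dir, inodes_on_disk)
print(disk_dir)  # [('-', 0), ('-', 0), ('C', 7)]

# Step 2: write the inode block (#4 initialized, #5 cleared).
assert can_clear_inode(disk_dir, 5)
inodes_on_disk = {4, 7}

# Step 3: write the directory block again, now with A rolled forward.
disk_dir = safe_dir_image(mem_dir, inodes_on_disk)
print(disk_dir)  # [('A', 4), ('-', 0), ('C', 7)]
```

After step 1 the on-disk directory names neither A nor B, a state that never existed in memory but is consistent, and the directory block ends up being written twice.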

Soft updates
• What is guaranteed about disk state?
  • Will always be consistent on recovery
  • May have orphaned inodes and blocks
• Do I need to do anything on recovery?
  • Don’t need to check consistency
  • Can check for orphaned inodes/blocks async

Soft updates
• How are soft updates good for the disk scheduler?
  • Disk scheduler can schedule blocks “arbitrarily”
  • Can optimize for lowest seek time, etc.
  • Just has to be careful about state of blocks that it writes
• What info does the disk scheduler need?
  • Needs to know dependencies
  • Needs to be able to undo updates
• What is the potential downside of soft updates?
  • Can cause extra writes
  • Have to write rolled-back and rolled-forward block versions

Next few lectures: storage
• Can we hide storage latency w/ speculative execution?
  • How often will our speculations be correct?
  • How costly are the mis-predictions?
• How do we use persistent DRAM?
  • Byte addressable but costly ($)
  • File system interface? Mmap?