File Systems: Design and Implementation Operating Systems Fall 2002 OS Fall’ 02

What is it all about?
§ A file system is a service that supports an abstract representation of the secondary storage
  Supported by the OS
§ Why is a file system needed?
  What is so special about the secondary storage (as opposed to the main memory)?

[Figure: Memory hierarchy]

Main memory vs. Secondary storage
§ Main memory
  ✗ Small (MB/GB)
  ✗ Expensive
  ✓ Fast (10^-6/10^-7 sec)
  ✗ Volatile
  ✓ Directly accessible by CPU
  Interface: (virtual) memory address
§ Secondary storage
  ✓ Large (GB/TB)
  ✓ Cheap
  ✗ Slow (10^-2/10^-3 sec)
  ✓ Persistent
  ✗ Cannot be directly accessed by CPU
  Data should be first brought into the main memory

Some numbers…
§ 1 GB = 2^30 ~ 10^9 bytes
§ 1 TB = 2^40 ~ 10^12 bytes (terabyte)
§ 1 PB = 2^50 ~ 10^15 bytes (petabyte)
§ 1 EB = 2^60 ~ 10^18 bytes (exabyte)
§ 2^32 ~ 4 x 10^9: Genome base pairs
§ 2^64 ~ 16 x 10^18: Brain electrons
§ 2^256 ~ 65,536 x 10^72: Particles in Universe

Secondary storage structure
§ A number of disks directly attached to the computer
§ Network-attached disks accessible through a fast network
  Storage Area Network (SAN)
§ Simple disks
§ Smart disks

[Figure: Internal disk structure]

Data Access
§ The sector is the minimum read/write unit of data (usually 1 KB)
  Access: (#surface, #track, #sector)
§ Smart disk drives hide the internal disk layout
  Access: (#sector)
§ Moving the arm assembly (seek) is expensive
  Sequential access is about 100 times faster than random access

Overview
§ File system services
  File system interface
§ File system implementation
  Finding files and their data
  Reading and writing
  Other issues
§ Performance is the paramount issue for the file system implementation

File System services
§ The file system is a layer between the secondary storage and the application
§ Presents the secondary storage as a collection of persistent objects with unique names, called files
§ Provides mechanisms for mapping the data between the secondary storage and the main memory

What is a file?
§ A file is a named persistent collection of data
§ Unstructured, sequential (UNIX)
  Data is accessed by specifying the offset
§ Collection of records (database systems)
  Supports associative access
    e.g., give me all records with "Name=Yossi"
§ Attributes: owner, permissions, modification time, size, etc.

File system interface
§ File data access
  READ: Bring a specified chunk of data from the file into the process virtual address space
  WRITE: Write a specified chunk of data from the process virtual address space to the file
§ CREATE, DELETE, SEEK, TRUNCATE
§ open, close, set_attributes

Accessing File Data: File Control Block
§ A control structure, the File Control Block (FCB), is associated with each file in the file system
  Each FCB has a unique identifier (FCB ID)
  UNIX: i-node, identified by i-node number
§ FCB structure:
  File attributes
  A data structure for accessing the file's data

Accessing File Data
§ Given the file name
§ Get to the file's FCB using the file system catalog
§ Use the FCB to get to the desired offset within the file data

Accessing File Data: Catalog
§ The catalog maps a file name to the FCB
  Checks permissions
§ This could be done on each file data access
  Inefficient: do it once, when the file is first referenced
§ file_handle=open(file_name): search the catalog and bring the FCB into memory
  UNIX: the in-memory FCB is the in-core i-node
§ close(file_handle): release the FCB from memory

The Catalog Organization
§ FCBs are stored in predefined locations on the disk
  UNIX: i-node list
§ Hierarchical structure:
  Some FCBs are just a list of pointers to other FCBs
    Directories
    UNIX: a directory is a file whose data is an array of (file_name, i-node#) pairs
  Recursive mapping

Searching the UNIX catalog
§ /a/b/c => i-node of /a/b/c
§ Get the root i-node: the i-node number of '/' is pre-defined (2)
§ Use the root i-node to get to the '/' data
§ Search for (a, i-node#) in the root's data
§ Get a's i-node
§ Get to a's data and search for (b, i-node#)
§ Get b's i-node
§ Etc.
§ Permissions are checked all along the way
  Each directory in the path must be (at least) executable
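
The lookup steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `inodes` table, its field names, and the `lookup` function are hypothetical, and a real UNIX kernel keeps i-nodes on disk and checks permissions per user rather than with a single flag.

```python
ROOT_INO = 2  # the slides state that the i-node number of '/' is pre-defined (2)

# Hypothetical i-node table: i-node number -> i-node.
# A directory's data is a list of (file_name, i-node#) pairs.
# "x" in perm stands for the execute (search) permission on a directory.
inodes = {
    2: {"type": "dir", "perm": "x", "data": [("a", 5)]},
    5: {"type": "dir", "perm": "x", "data": [("b", 7)]},
    7: {"type": "dir", "perm": "x", "data": [("c", 9)]},
    9: {"type": "file", "perm": "", "data": b"hello"},
}


def lookup(path):
    """Resolve an absolute path to an i-node number, one component at a time."""
    ino = ROOT_INO
    for name in [part for part in path.split("/") if part]:
        node = inodes[ino]
        if node["type"] != "dir":
            raise NotADirectoryError(name)
        if "x" not in node["perm"]:
            raise PermissionError(name)   # every directory on the path must be executable
        entries = dict(node["data"])      # search the directory's data for the name
        if name not in entries:
            raise FileNotFoundError(path)
        ino = entries[name]               # get the next i-node
    return ino
```

For example, `lookup("/a/b/c")` walks the i-nodes 2, 5, 7 and returns 9 in this toy table.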

Allocating disk blocks to file data
§ Assume unstructured files
  Array of bytes
§ Efficient offset -> disk block mapping
§ Efficient disk access for both sequential and random patterns
  Minimizing the number of seeks
§ Efficient space utilization
  Minimizing external/internal fragmentation

Static and Contiguous Allocation
§ Allocate each file a fixed number of blocks at creation time
§ Efficient offset lookup
  Only the block # of offset 0 is needed
§ Efficient disk access
§ Inefficient space utilization
  Internal, external fragmentation
§ No support for dynamic extension

[Figure: Static and contiguous allocation (catalog)]

Extent-based allocation
§ A file gets blocks in contiguous chunks called extents
  Multiple contiguous allocations
§ For large files, a B-tree is used for efficient offset lookup

[Figure: Extent-based allocation]

Extent-based allocation
§ Efficient offset lookup and disk access
§ Support for dynamic growth/shrink
§ Dynamic memory allocation techniques are used (e.g., first-fit)
§ Suffers from external fragmentation
  Use compaction
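
A minimal sketch of the offset lookup over an extent list, assuming a 1 KB block and a small sorted list searched with binary search instead of the B-tree mentioned above. The extent tuples, `BLOCK_SIZE`, and `offset_to_disk_block` are hypothetical names chosen for illustration.

```python
from bisect import bisect_right

BLOCK_SIZE = 1024  # assumed block size

# Hypothetical extent list of one file, sorted by first_file_block:
# (first_file_block, first_disk_block, length_in_blocks)
extents = [(0, 100, 4), (4, 500, 2), (6, 40, 8)]
starts = [e[0] for e in extents]


def offset_to_disk_block(offset):
    """Map a byte offset within the file to the disk block that holds it."""
    if offset < 0:
        raise ValueError("negative offset")
    file_block = offset // BLOCK_SIZE
    # Find the last extent that starts at or before file_block.
    i = bisect_right(starts, file_block) - 1
    first_file_block, first_disk_block, length = extents[i]
    if file_block >= first_file_block + length:
        raise ValueError("offset beyond end of file")
    return first_disk_block + (file_block - first_file_block)
```

Within an extent the mapping is a single addition, which is why sequential access stays cheap as long as extents are long.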

Single-block allocation
§ Extent-based allocation with a fixed extent size of one disk block
  File blocks are scattered anywhere on the disk
    Inefficient sequential access
§ UNIX block allocation
§ Linked allocation
  MS-DOS File Allocation Table (FAT)

Block Allocation in UNIX
§ 10 direct pointers
§ 1 single indirect pointer: points to a block of N pointers to blocks
§ 1 double indirect pointer: points to a block of N pointers, each of which points to a block of N pointers to blocks
§ 1 triple indirect pointer…
§ Overall, addresses 10 + N + N^2 + N^3 disk blocks

[Figure: Block allocation in UNIX]

Block Allocation in UNIX
§ Optimized for small files
  Outdated empirical studies indicate that 98% of all files are under 80 KB
§ Poor performance for random access of large files
§ No external fragmentation
§ Wasted space in pointer blocks for large sparse files
§ Modern UNIX implementations use extent-based allocation
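
The 10 + N + N^2 + N^3 count and the choice of pointer level for a given file block can be worked through in a short sketch. The block size and pointer size below are assumptions (they give N = 256), and `max_blocks` and `classify` are hypothetical helper names.

```python
BLOCK_SIZE = 1024                 # assumed disk block size in bytes
POINTER_SIZE = 4                  # assumed size of one block pointer in bytes
N = BLOCK_SIZE // POINTER_SIZE    # pointers per indirect block
NUM_DIRECT = 10                   # direct pointers in the i-node, as on the slide


def max_blocks():
    """Total number of data blocks addressable from one i-node."""
    return NUM_DIRECT + N + N**2 + N**3


def classify(file_block):
    """Return (level, indices): which pointer chain reaches this file block.

    indices lists the slot to follow at each level of the chain.
    """
    if file_block < 0:
        raise ValueError("negative block number")
    if file_block < NUM_DIRECT:
        return ("direct", [file_block])
    b = file_block - NUM_DIRECT
    if b < N:
        return ("single", [b])
    b -= N
    if b < N * N:
        return ("double", [b // N, b % N])
    b -= N * N
    if b < N**3:
        return ("triple", [b // (N * N), (b // N) % N, b % N])
    raise ValueError("file block beyond maximum file size")
```

With these assumed sizes, a file block past the first 10 + 256 needs two extra pointer-block reads before the data block itself, which is the random-access cost noted above for large files.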

Linked Allocation
§ Each file is a linked list of disk blocks
§ Offset lookup:
  Efficient for sequential access
  Inefficient for random access
§ Access to large files may be inefficient as the blocks are scattered
  Solution: block clustering
§ No fragmentation, but space is wasted on pointers in each block

[Figure: Linked allocation (catalog)]

File Allocation Table (FAT)
§ A section at the beginning of the disk is set aside to contain the table
  Indexed by the block numbers on disk
  An entry for each disk block (or for a cluster thereof)
§ Blocks belonging to the same file are chained
  The last file block, unused blocks and bad blocks have special markings

[Figure: FAT (catalog entry)]

FAT Pros and Cons
§ Improved random access
  Just search a small table instead of the whole disk
§ Inefficient sequential access
  Seek back to the table and forth to the block for each file block!
§ Block allocation is easy
  Just find the first block marked 0
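
A toy model of the table may make the chaining concrete. The marker values `FREE`, `EOF` and `BAD`, the 8-entry table, and the function names are assumptions for illustration; real FAT variants use specific reserved values that are not reproduced here.

```python
FREE, EOF, BAD = 0, -1, -2   # hypothetical special markings

# fat[i] holds the number of the block that follows block i in its file.
# One file starts at block 1 and is chained as 1 -> 4 -> 7 -> 2.
fat = [BAD, 4, EOF, FREE, 7, FREE, FREE, 2]


def file_blocks(first_block):
    """Follow the chain from the first block (stored in the catalog entry)."""
    blocks = []
    b = first_block
    while b != EOF:
        blocks.append(b)
        b = fat[b]
    return blocks


def block_for_offset(first_block, offset, block_size=1024):
    """Random access: walk the table only, without touching data blocks."""
    b = first_block
    for _ in range(offset // block_size):
        b = fat[b]
        if b == EOF:
            raise ValueError("offset beyond end of file")
    return b


def allocate():
    """Block allocation: take the first entry marked FREE."""
    b = fat.index(FREE)      # raises ValueError when no free block is left
    fat[b] = EOF
    return b
```

The random-access walk is still linear in the number of blocks, but it touches only the table, which is small enough to cache.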

Free space management
§ Disk bitmap: represent the disk block allocation as an array of bits
  A bit for each disk block: 1 - non-allocated block, 0 - allocated block
  Simple and efficient in finding free blocks
  Wastes space on disk
§ Linked list of free blocks (UNIX)
  Efficient for finding a single free block
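
The bitmap variant can be sketched as follows, using the slide's convention that 1 means non-allocated and 0 means allocated. The class `DiskBitmap` and its method names are hypothetical, and the linear scan is the simplest possible search rather than an optimized one.

```python
class DiskBitmap:
    """One bit per disk block: 1 = non-allocated (free), 0 = allocated."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        # Start with every block free (all bits set to 1).
        self.bits = bytearray([0xFF] * ((num_blocks + 7) // 8))

    def is_free(self, block):
        return bool((self.bits[block // 8] >> (block % 8)) & 1)

    def allocate(self):
        """Find the first free block, mark it allocated, and return its number."""
        for block in range(self.num_blocks):
            if self.is_free(block):
                self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF
                return block
        raise OSError("no free blocks")

    def free(self, block):
        """Mark a block as non-allocated again."""
        self.bits[block // 8] |= 1 << (block % 8)
```

At one bit per 1 KB block (an assumed block size), the bitmap costs 1/8192 of the disk.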

Next: File System continued
§ File I/O
  Organization, performance
§ Atomicity and consistency
§ Etc.

File I/O
§ The CPU cannot access the file data directly
§ It must first be brought into the main memory
  How to do this efficiently?
§ Read/Write mapping using the buffer cache
§ Memory mapped files

Read/Write Mapping
§ File data is made available to applications via a pre-allocated main memory region
  Buffer cache
§ The file system transfers data between the buffer cache and the disk at the granularity of disk blocks
§ The data is explicitly copied between the buffer cache and the application address space

[Figure: Read/Write mapping]

[Figure: Reading data (disk block = 1 K)]

[Figure: Writing data (disk block = 1 K)]

Buffer Cache management
§ All disk I/O goes through the buffer cache
  Both user data and control data (e.g., i-node) are cached
§ LRU replacement
§ Dirty (modified) marker to indicate whether write-back is needed
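
A minimal sketch of an LRU buffer cache with a dirty marker, assuming the disk can be modelled as a dictionary from block number to bytes. The class `BufferCache` and its methods are hypothetical and ignore concurrency, block sizes, and partial-block writes.

```python
from collections import OrderedDict


class BufferCache:
    """LRU cache of disk blocks; dirty blocks are written back on eviction."""

    def __init__(self, disk, capacity):
        self.disk = disk              # stand-in for the disk: block# -> bytes
        self.capacity = capacity      # number of blocks the cache can hold
        self.cache = OrderedDict()    # block# -> [data, dirty]; most recent last

    def _get(self, block_no):
        if block_no in self.cache:
            self.cache.move_to_end(block_no)          # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                victim, (data, dirty) = self.cache.popitem(last=False)  # evict LRU
                if dirty:
                    self.disk[victim] = data          # write-back only if modified
            self.cache[block_no] = [self.disk.get(block_no, b""), False]
        return self.cache[block_no]

    def read(self, block_no):
        return self._get(block_no)[0]

    def write(self, block_no, data):
        entry = self._get(block_no)
        entry[0] = data
        entry[1] = True                               # dirty: disk copy is stale

    def sync(self):
        """Force all dirty blocks to disk."""
        for block_no, entry in self.cache.items():
            if entry[1]:
                self.disk[block_no] = entry[0]
                entry[1] = False
```

Note that a `write` returns before the disk is updated; the block reaches the disk only on eviction or `sync`, which is the source of the reliability questions discussed later in the slides.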

Advantages
§ Strict separation of concerns
  Hiding disk access peculiarities from the user
    Block size, memory alignment, memory allocation in multiples of the block size, etc.
§ Disk blocks are cached
  Aggregation for small transfers (locality)
  Block re-use across processes
  Transient data might never be written to disk

Disadvantages
§ Extra copying
  Disk -> buffer cache -> user space
§ Vulnerability to failures
  The file system does not care much about the user data blocks
  The control data blocks (metadata) are the real problem
    E.g., i-nodes and pointer blocks can be in the cache when a failure occurs
    As a result, the file system's internal state might be corrupted

[Figure: A complete UNIX example]

Memory mapped files
§ A file (or a portion thereof) is mapped into a contiguous region of the process virtual memory
  UNIX: mmap system call
§ The mapping operation is very efficient: just marking
§ Access to the file is governed by the virtual memory subsystem

Mmapped files: Pros and Cons
§ Advantages:
  Reduced copying
  No need for a pre-allocated buffer cache in the main memory
§ Disadvantages:
  Less or no control over the actual disk writing: the file data becomes volatile
  A mapped area must fit the virtual address space
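
As a small illustration of the mmap interface, the sketch below uses Python's standard `mmap` module, which wraps the UNIX system call. The function name `uppercase_in_place` is hypothetical, and the file is assumed to be non-empty, since an empty file cannot be mapped.

```python
import mmap


def uppercase_in_place(path):
    """Modify a file through a memory mapping instead of read()/write() calls."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file into the process address space.
        with mmap.mmap(f.fileno(), 0) as m:
            m[:] = m[:].upper()   # ordinary memory access; no explicit write() call
            m.flush()             # ask the OS to write the modified pages back
```

Without the explicit `flush`, the moment at which modified pages reach the disk is left to the virtual memory subsystem, which is the loss of control listed among the disadvantages.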

Reliability and Recovery
§ File system data consists of
  Control data (metadata), user data
§ Failures can cause data loss and corruption
  Cached data
  A power failure during a sector write may physically corrupt the data stored in the sector

Metadata vs. User data
§ Loss or corruption of the metadata might lead to a massive user data loss
  File systems must care about the metadata
  File systems usually do not care much about the user data
    Operation semantics?
    Users must care about their data themselves (e.g., backups)

Reliability and caching
§ Caching affects the WRITE semantics
  The write operation returns
  Is it guaranteed that the requested data is indeed written on disk?
  What if some data blocks in the cache are metadata blocks?
§ Solutions
  write-through: writes go to disk immediately (bypass the cache)
  write-back: dirty blocks are written asynchronously

User data reliability in UNIX
§ Based on the write-back policy
  User data is written back to disk periodically
  POSIX compatible semantics
  Commands and calls such as sync and fsync are used to force the write of dirty blocks
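
A short sketch of forcing a write under the write-back policy, using Python's standard `os.fsync`, which wraps the fsync call. The function name `durable_write` is hypothetical.

```python
import os


def durable_write(path, data):
    """Write data and return only after the OS has been asked to put it on disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # move data from the user-space buffer to the OS cache
        os.fsync(f.fileno())    # force the dirty blocks of this file to disk
```

Without the `fsync` call the function would return as soon as the data is in the cache, and a crash before the periodic write-back could lose it.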

Metadata reliability
§ Based on the write-through policy
  Updates are written to disk immediately
§ Some data is not written in-place
  Can go back to the last consistent version
§ Some data is replicated
    UNIX superblock
§ The file system goes through a consistency check/repair cycle at boot time
    fsck, ScanDisk

Metadata reliability using logging
§ Write-through negatively affects performance
  Think about random access
§ Solution: maintain a sequential log of metadata updates: the journal
  IBM's Journal File System (JFS)

Journal File System (JFS)
§ Operations logged (journaled): create, link, mkdir, truncate, allocating write, …
  Each operation may involve several metadata updates (a transaction)
§ Once an operation is logged, it returns
  Write-ahead logging
§ The disk writes are performed asynchronously
  Aggregation is possible

JFS: Journal maintenance
§ A cursor (pointer) is maintained
§ The cursor is advanced once the updated blocks associated with the transaction are written to disk (hardened)
  Hardened transaction records can be deleted from the journal
§ Upon recovery:
  Re-do all the operations starting from the last cursor position
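
The cursor mechanism can be illustrated with a toy model. This is a sketch of write-ahead logging with redo recovery in general, not of the actual JFS on-disk format; the class `JournaledMetadata`, its fields, and the key/value form of the records are all assumptions.

```python
class JournaledMetadata:
    """Toy write-ahead journal with a cursor and redo recovery."""

    def __init__(self):
        self.journal = []   # sequential on-disk log of (key, value) update records
        self.cursor = 0     # records before the cursor are hardened
        self.disk = {}      # metadata blocks on disk
        self.cache = {}     # in-memory metadata, lost on a crash

    def log_update(self, key, value):
        """Log the update first, then apply it in memory and return."""
        self.journal.append((key, value))
        self.cache[key] = value

    def harden(self):
        """Asynchronous step: write logged updates in place and advance the cursor."""
        for key, value in self.journal[self.cursor:]:
            self.disk[key] = value
        self.cursor = len(self.journal)

    def crash(self):
        self.cache = {}     # everything held only in memory disappears

    def recover(self):
        """Re-do all operations starting from the last cursor position."""
        for key, value in self.journal[self.cursor:]:
            self.disk[key] = value
        self.cursor = len(self.journal)
        self.cache = dict(self.disk)
```

Redo is safe to repeat here because each record overwrites a value rather than incrementing it, so replaying a record that was already partly hardened gives the same result.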

JFS: Pros and Cons
§ Advantages:
  Asynchronous metadata write
  Fast recovery: depends on the journal size and not on the file-system size
§ Disadvantages:
  An extra write
  Space wasted by the journal (insignificant)

Log Structured File System
§ Rosenblum & Ousterhout (1992), following Ousterhout & Douglis (1989)
§ Caching is enough for good read performance
§ Writes are the real performance bottleneck
  Writing back cached user blocks may require many random disk accesses
  Write-through for reliability denies optimizations
    Logging solves the problem for metadata

Log Structured File System
§ The idea: everything is a log
§ Each write (both data and control) is appended to the sequential log
§ The problem: how to locate files and data efficiently for random access by reads
§ The solution: use a floating file map
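
A deliberately simplified sketch of the idea: every write is appended to a log, and a map records where the latest version of each block lives. The class `LogFS` is hypothetical, and it keeps the map only in memory, whereas in a log-structured file system the map itself is also written to the log; cleaning of overwritten log entries is not modelled.

```python
class LogFS:
    """Toy log-structured store: writes are appends, reads go through a map."""

    def __init__(self):
        self.log = []        # append-only sequential log of (name, block_index, data)
        self.file_map = {}   # (name, block_index) -> position of latest version in log

    def write(self, name, block_index, data):
        """Append the new block version and repoint the map at it."""
        self.log.append((name, block_index, data))
        self.file_map[(name, block_index)] = len(self.log) - 1

    def read(self, name, block_index):
        """Random access: one map lookup, then one log access."""
        return self.log[self.file_map[(name, block_index)]][2]
```

Changing a block never overwrites the old copy; the map simply moves to the new log position, which is why the map is described as floating.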

[Figure: Log-structured file system. Panels show the supermap before a change, after a block change, and after a block addition]

Next:
§ Networking and distributed systems
§ Last: New storage architectures
  Storage Area Networks, Network Attached Storage, Object Disks, file systems, etc.