Lecture Note 10 File System Basic May 2016

Lecture Note 10. File System Basic
May 2016
Jongmoo Choi
Dept. of Software, Dankook University
http://embedded.dankook.ac.kr/~choijm
J. Choi, DKU

Contents
From Chap. 39~42 of OSTEP
- Chap. 39. Interlude: Files and Directories
    - APIs for file, directory and file system
- Chap. 40. File System Implementation
    - Layout: superblock, bitmap, inode, data blocks
    - Access method: open, read, write
- Chap. 41. Locality and the Fast File System
    - Performance requirement
    - Storage-aware performance enhancement
- Chap. 42. Crash Consistency: FSCK and Journaling
    - Consistency requirement
    - Journaling mechanism

Chap. 39 Interlude: Files and Directories
Computer system
- Four key abstractions: process, virtual memory, lock, and file
- Files are in storage: storage vs. memory
    - Persistence: store information permanently (or at least for a long time)
    - The OS must take special care for persistence -> issues in this lecture note
- Several objects for storage: file, directory, inode, FAT, superblock, ...
    - APIs for these objects -> issues in this chapter
(Source: Special Thanks to Juhyoung Son @ DKU)

39.1 Files and Directories
File
- Definition: a linear array of bytes, stored persistently
    - Each file has its own internal structure (record, text, multimedia, C code, ...)
    - But the OS does not care about the content; it just treats a file as a stream of bytes
- Each file has a name (absolute path, relative path)
- It also has a low-level name in the OS (e.g. inode number)
    - Like each process has a unique PID (program name, pid)
Directory
- A special file that constructs a directory hierarchy (file hierarchy)
    - Root directory
    - Home directory
    - Working directory
- Contains <file name, inode> pairs
    - or a low-level name or first disk block
Others are also treated as a file
- Device, pipe, socket and even process

39.2 File System Interfaces
APIs
- Creating, accessing, and deleting files (including directories)
- Some are straightforward while others are mysterious

39.3 Creating Files / 39.4 Reading and Writing Files
Create API
- open() with the create flag
    - Arguments: 1) name, 2) flags, 3) permissions
    - Return: fd (file descriptor)
- creat(): less used (but famous from Ken Thompson's remark that, if he were redesigning UNIX, he would spell creat with an "e")
Read/Write API
- read_size = read(fd, buf, request_size);
- written_size = write(fd, buf, request_size);
    - Arguments: 1) fd, 2) buffer pointing to the memory space for the data, 3) request size
    - Return: number of bytes actually read or written
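The create, write and read calls above can be tried from Python, whose os module is a thin wrapper over the same system calls. This is a minimal sketch assuming a POSIX system; the scratch file name "foo" is hypothetical and chosen only for illustration.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "foo")

# open() with the create flag: 1) name, 2) flags, 3) permissions -> fd
fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
written_size = os.write(fd, b"hello\n")  # returns the number of bytes written
os.close(fd)

# Re-open read-only and request up to 4096 bytes.
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 4096)                 # may return fewer bytes than requested
read_size = len(data)
os.close(fd)
```

Note that the request size is only an upper bound: the return value tells the caller how much was actually transferred.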

39.4 Reading and Writing Files
Read and write example
- Command line viewpoint
- System call viewpoint (using strace)

39.5 Reading and Writing, But Not Sequentially
Conventional access mechanism for a file
- Sequential
- From the beginning, increasing the offset while reading or writing
[Figure: a file as an array of bytes, marked with start, current offset (position), and end (size)]
How to access a random position (not sequentially)?
- lseek()
    - Arguments: 1) fd, 2) offset relative to whence, 3) whence (reference point)
    - Explicitly updates the current offset (cf. read/write: implicit update)
    - Do not confuse lseek() with a disk seek :-)
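The difference between the implicit offset update of read/write and the explicit update of lseek() can be seen in a small experiment. This is a sketch assuming a POSIX system; the file name "seekdemo" is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seekdemo")  # hypothetical name
fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)

os.write(fd, b"0123456789")                # implicit update: offset becomes 10
end_offset = os.lseek(fd, 0, os.SEEK_CUR)  # moving by 0 just reports the offset

os.lseek(fd, 4, os.SEEK_SET)               # explicit: 4 bytes from the start
first = os.read(fd, 2)                     # reads bytes 4..5, offset becomes 6

os.lseek(fd, -3, os.SEEK_END)              # explicit: 3 bytes before the end
second = os.read(fd, 3)                    # reads bytes 7..9
os.close(fd)
```

lseek() only changes a variable kept by the OS for this open file; by itself it causes no disk I/O.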

39.6 Writing Immediately with fsync()
Performance consideration for write
- Write to DRAM vs. disk: 100 ns vs. 10,000,000 ns (10 ms)
- Delayed write
    - Write data into DRAM (called the buffer or page cache) and mark it dirty
    - Later write all dirty data to disk in a clustered fashion (periodically, every 5 or 30 seconds)
    - Write grouping and write reordering indeed enhance performance
Concern of delayed write
- Durability
    - Users think their data is permanent, but in actuality it may not be yet
- How to guarantee durability
    - fsync() system call
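A short sketch of forcing a delayed write to disk follows. It assumes a POSIX system (fsync on a directory descriptor is known to work on Linux); the file name "record.log" is hypothetical. Syncing the containing directory as well is a common companion step for newly created files, since the directory entry is also buffered.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "record.log")  # hypothetical name

fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
os.write(fd, b"important record\n")  # at this point the data may only be in the page cache
os.fsync(fd)                         # returns only after the file's dirty data reaches the device
os.close(fd)

# For a newly created file, also sync the directory that holds its entry.
dir_fd = os.open(workdir, os.O_RDONLY)
os.fsync(dir_fd)
os.close(dir_fd)

with open(path, "rb") as f:
    contents = f.read()
```

The price of durability is latency: each fsync() waits for the slow device, which is exactly what delayed write tries to avoid.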

39.7 Renaming Files / 39.9 Removing Files
Change a file name
- Command line viewpoint
- System call viewpoint (editor example)
    - rename(old_name, new_name)
    - Conducted atomically
Remove a file
- API
    - unlink(file_name)
Why unlink() rather than remove() or delete()? Then, what is link()?
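The editor example relies on rename() being atomic: the new contents are written to a temporary file, forced to disk, and then renamed over the old file, so a reader sees either the old or the new version and never a mix. The sketch below assumes a POSIX system; the names "foo.txt" and "foo.txt.tmp" are hypothetical.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "foo.txt")    # hypothetical file being edited
tmp = os.path.join(workdir, "foo.txt.tmp")   # hypothetical temporary copy

with open(target, "w") as f:
    f.write("old contents\n")

# 1) Write the new version to a temporary file and force it to disk.
fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"new contents\n")
os.fsync(fd)
os.close(fd)

# 2) Atomically replace the old name with the new file.
os.rename(tmp, target)
with open(target) as f:
    result = f.read()

# 3) Removing a file is unlink(): it removes a name, not necessarily the data.
os.unlink(target)
```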

39.8 Getting Information about Files
Contents in a file system
- User data (or just data): data written by users
- Metadata: data written by a file system for managing files (in inode) and the file system itself (in superblock)
- API to see the metadata for a certain file
    - stat(file_name, struct stat *)
    - fstat(fd, struct stat *)
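As an illustration, the per-file metadata kept in the inode can be read back with stat() and fstat(). This sketch assumes a POSIX system; the file name "meta.txt" is hypothetical.

```python
import os
import stat
import tempfile

path = os.path.join(tempfile.mkdtemp(), "meta.txt")  # hypothetical name
with open(path, "wb") as f:
    f.write(b"hello")

info = os.stat(path)            # look up by file name
fd = os.open(path, os.O_RDONLY)
info_fd = os.fstat(fd)          # look up by file descriptor
os.close(fd)

size = info.st_size             # file size in bytes
links = info.st_nlink           # link count
inode_number = info.st_ino      # low-level name of the file
is_regular = stat.S_ISREG(info.st_mode)  # file type is encoded in the mode
```

Both calls return the same inode information; they differ only in how the file is named.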

39.10 Making Directories / 39.12 Deleting Directories
API for making a directory
- mkdir(name, permission)
- After making
    - Two entries exist: "." (the directory itself) and ".." (its parent directory)
API for deleting a directory
- rmdir(file_name)
- We need to use it carefully

39.11 Reading Directories
APIs for reading a directory
- dp = opendir(name), readdir(dp), closedir(dp)
- "ls": like the example below (cf. "ls -l": readdir() + stat())
Why is there no writedir()?
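A rough Python counterpart of the opendir/readdir/closedir loop is os.scandir(), shown here combined with a stat call per entry in the spirit of "ls -l". This is a sketch; the directory contents ("a.c", "b.c", "sub") are hypothetical, and unlike readdir(), scandir() does not report "." and "..".

```python
import os
import tempfile

dirpath = tempfile.mkdtemp()
for name in ("a.c", "b.c"):                  # hypothetical files
    open(os.path.join(dirpath, name), "w").close()
os.mkdir(os.path.join(dirpath, "sub"))       # hypothetical subdirectory

listing = []
with os.scandir(dirpath) as entries:         # open the directory, close it on exit
    for entry in entries:                    # one directory entry per iteration
        info = entry.stat()                  # extra metadata, as "ls -l" needs
        listing.append((entry.name, entry.is_dir(), info.st_size))
listing.sort()                               # directory order is not guaranteed
```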

39.11 Reading Directories
Directory name convention

39.13 Hard Links
Link
- Makes another file name for accessing an existing file
    - Connects a file name with an inode
- Command line viewpoint
    - Access through either file or file2
- API
    - link(old_name, new_name)
- After removing one of them
    - Use unlink()
    - The data still remains
- Link count
    - The data is deleted when the link count reaches 0
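The link count behaviour described above can be observed directly. This is a sketch assuming a POSIX file system that supports hard links; the names "file" and "file2" follow the slide.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
file1 = os.path.join(workdir, "file")
file2 = os.path.join(workdir, "file2")

with open(file1, "w") as f:
    f.write("hello\n")

os.link(file1, file2)                         # second name for the same inode
same_inode = os.stat(file1).st_ino == os.stat(file2).st_ino
links_before = os.stat(file1).st_nlink        # two names point at the inode

os.unlink(file1)                              # remove one name only
links_after = os.stat(file2).st_nlink         # one name left, data still there
with open(file2) as f:
    remaining = f.read()
```

This also answers why the removal call is named unlink(): it drops one name, and the data goes away only when the last link is gone.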

39.14 Symbolic Links
Link
- Hard link: shares the inode number
    - Creates a new file name and shares the existing inode
- Symbolic link (soft link): different inode number, and its data is the name of the linked file
    - Creates not only a new file name but also a new inode (marked as a symbolic link)
    - Can link across different file systems and can link to a directory
- A symbolic link can become a dangling reference
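A small sketch contrasting a symbolic link with its target, and showing how a dangling reference arises. It assumes a POSIX system; the names "target.txt" and "alias" are hypothetical.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "target.txt")  # hypothetical names
link = os.path.join(workdir, "alias")

with open(target, "w") as f:
    f.write("hello\n")

os.symlink(target, link)
# lstat() examines the link itself, stat() follows it to the target.
different_inode = os.lstat(link).st_ino != os.stat(target).st_ino
stored_name = os.readlink(link)               # the link's data is a path name

os.unlink(target)                             # remove the file the link names
# The link still exists, but following it now fails: a dangling reference.
dangling = os.path.lexists(link) and not os.path.exists(link)
```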

39.15 Making and Mounting a File System
File system
- Make a file system
    - Assemble directories and files
    - Related metadata: superblock, bitmap, ...
    - Command: mkfs
        - Makes an empty file system (only a root directory) in a disk partition
- Mount
    - Makes a file system visible to users
    - Connects multiple file systems within a uniform directory tree
        - mount point -> root of the mounted FS

39.15 Making and Mounting a File System
Multiple file systems in a system
- Examples: Ext2/3/4, proc, sysfs, AFS, ...
- Mount example
    $ mount -t ext3 /dev/sda4 /mnt
[Figure: directory tree before mount and after mount]

Chap. 40 File System Implementation
Objective of this chapter
- Make a new file system called VSFS (Very Simple File System)
    - Simplified version of UFS (Unix File System)
    - On-disk structures
    - Access method
    - Various policies
- More complex file systems
    - FFS, Ext2/3/4, JFS, LFS, NTFS, F2FS, FUSE, RAMFS, NFS, AFS, ZFS, ... -> next chapters

40.1 The Way to Think
Two key aspects for understanding a file system
- Disk layout
    - File system structure
    - Metadata format
- Access method
    - open(), creat(), mount(), ...
    - read(), write(), lseek(), stat(), ...
Importance of a mental model for OS study

40.2 Overall Organization
Disk
- Consists of partitions
- A file system is created in each partition
Partition
- Consists of disk blocks
- User data is stored in disk blocks (usually the same size as a page)
- Assume a partition with 64 disk blocks (or simply blocks)
Now consider: what data structures are required for making a FS?

40.2 Overall Organization
Layout of a file system
- User data: blocks 8~63 (can be dynamically adjusted)
    - Data written by users
- Inode: blocks 3~7
    - Metadata for managing files (one per file)
    - Inode size = 256 B -> 16 inodes per block -> 5 blocks for inodes -> 80 files can be created in total
- Bitmap: blocks 1~2
    - Metadata for managing free space (allocation structure)
    - Two bitmaps: one for data blocks and the other for inodes
- Superblock: block 0
    - Metadata for managing a file system (one per file system)
        - Information: how many data blocks and inodes, where they begin, ...
    - Used during the mount operation
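The arithmetic behind this layout can be checked in a few lines. The 4 KB block size is taken from the later inode-lookup slide (it is also what makes 16 inodes of 256 B fit in a block); the constant names are hypothetical.

```python
BLOCK_SIZE = 4096          # bytes per block (4 KB)
INODE_SIZE = 256           # bytes per inode
TOTAL_BLOCKS = 64          # size of the example partition

SUPERBLOCK = 0             # block 0
BITMAP_BLOCKS = (1, 2)     # one block per bitmap
INODE_TABLE_START = 3      # inode table occupies blocks 3..7
INODE_TABLE_BLOCKS = 5
DATA_START = 8             # user data occupies blocks 8..63

inodes_per_block = BLOCK_SIZE // INODE_SIZE
max_files = inodes_per_block * INODE_TABLE_BLOCKS
data_blocks = TOTAL_BLOCKS - DATA_START
data_capacity_kb = data_blocks * BLOCK_SIZE // 1024
```

With these numbers the partition holds at most 80 files and 56 data blocks (224 KB of user data).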

40.3 File Organization: The inode
How to manage metadata for a file
- inode (index node)
    - File information such as mode, uid, size, time, link count, blocks, ...
        - Can be accessed using stat()
    - Locations of user data blocks -> multi-level index (or imbalanced tree)
        - Direct block pointers (15), single/double/triple indirect block pointers (1/1/1)
        - Benefit: fast for a short file, and large sizes supported for a big file
- Other approaches: FAT (link-based), extent-based, log-based, ...
How large a file can be supported by the direct block pointers alone? How about with an indirect pointer?
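One way to answer the closing question is to compute the reach of each index level. The sketch assumes 4 KB blocks and the slide's 15 direct pointers; the 4-byte pointer size is an assumption not stated on the slide (it is the value OSTEP uses), and it determines how many pointers fit in one indirect block.

```python
BLOCK_SIZE = 4096        # bytes per block
DIRECT_POINTERS = 15     # from the slide
POINTER_SIZE = 4         # assumption: 4-byte block pointers

pointers_per_block = BLOCK_SIZE // POINTER_SIZE        # entries in one indirect block

direct_bytes = DIRECT_POINTERS * BLOCK_SIZE            # reachable via direct pointers
single_bytes = pointers_per_block * BLOCK_SIZE         # via the single indirect pointer
double_bytes = pointers_per_block ** 2 * BLOCK_SIZE    # via the double indirect pointer
triple_bytes = pointers_per_block ** 3 * BLOCK_SIZE    # via the triple indirect pointer

max_file_bytes = direct_bytes + single_bytes + double_bytes + triple_bytes
```

Under these assumptions the direct pointers cover 60 KB, one indirect block adds 4 MB, and the double and triple levels add 4 GB and 4 TB respectively.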

40.3 File Organization: The inode
Inode manipulation example
- What happens when we create a new file (hello.c, 7 KB) in the root directory?
- Then, what happens when we compile it (a.out, 70 KB)?
[Figure: inode table and data blocks]
- inode for /: times ..., locations: 8
- inode for hello.c: times ..., locations: 9, 10
- inode for a.out: times ..., locations: 11, 12, 13, 14, ..., 25, 26
- Block contents shown in the figure:
    - directory entries: ".": 0, "hello.c": 1, "a.out": 2
    - source text: #include <stdio.h> int main() ...
    - binary: 457f 464c 0102 0001 0000 ...
    - block numbers: 27, 28, 29
Which block do we access when we want to read the file "a.out" with current_offset = 10000?
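The closing question can be worked out with integer division. The sketch below rests on one reading of the figure, which is an assumption: blocks 11..25 are the 15 direct pointers of a.out, block 26 is its single indirect block, and that block holds the pointers 27, 28, 29. The function name locate is hypothetical.

```python
BLOCK_SIZE = 4096

# Assumed reading of the figure for a.out's inode.
direct = list(range(11, 26))        # 15 direct pointers: blocks 11..25
indirect_block = 26                 # block that stores further pointers
indirect_contents = [27, 28, 29]    # pointers stored inside block 26

def locate(offset):
    """Return (disk block number, offset inside that block) for a file offset."""
    index, within = divmod(offset, BLOCK_SIZE)
    if index < len(direct):
        return direct[index], within
    # Beyond the direct pointers: one extra read of the indirect block is needed.
    return indirect_contents[index - len(direct)], within

answer = locate(10000)
```

Offset 10000 falls in the third 4 KB chunk of the file (10000 = 2 x 4096 + 1808), so the third direct pointer is used: block 13, byte 1808.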

40.3 File Organization: The inode
Find a location: inode and data
- How to find the location of an inode?
    - Directory entry: <file name, i_number>
    - i_number: index into the inode region (inode table)
        - e.g. i_number = 33: 33 / (inodes per block) = 33 / 16 = 2 remainder 1
          -> inode table start + 4 KB x 2 = 12 KB + 8 KB = 20 KB
          -> read the block starting at 20 KB
          -> go to the offset inode_size x 1 = 256 B within it
- How to find the location of user data?
    - 1) Find the inode, 2) current_offset / disk block size = quotient and remainder, 3) the quotient selects a pointer in the inode (multi-level index), 4) the remainder is the offset within the disk block
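The inode lookup above is a quotient/remainder computation, sketched here with the slide's numbers (4 KB blocks, 256 B inodes, inode table starting at 12 KB). The function name inode_location is hypothetical.

```python
BLOCK_SIZE = 4096
INODE_SIZE = 256
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE   # 16
INODE_TABLE_START = 3 * BLOCK_SIZE            # inode table begins at 12 KB

def inode_location(i_number):
    """Return (byte address of the block to read, byte offset of the inode in it)."""
    block_index, slot = divmod(i_number, INODES_PER_BLOCK)
    block_address = INODE_TABLE_START + block_index * BLOCK_SIZE
    return block_address, slot * INODE_SIZE

block_address, offset_in_block = inode_location(33)
```

For i_number 33 this gives the block at 20 KB and an offset of 256 B inside it, matching the slide.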

40.4 Directory Organization / 40.5 Free Space Mgmt.
Directory
- Basically, a list of <file name, inode number> pairs
- For fast search, add the file name length and the record length (total bytes including leftover space)
- Can use a more complex structure for directories (e.g. B-tree in XFS)
Free space
- Bitmap: one bit per disk block (or inode), indicating whether it is free or used
- Alternative approaches: free list, tree, ...
- Pre-allocation: allocate free disk blocks in batches -> less overhead, contiguous allocation, ...
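A bitmap allocator fits in a few lines. This is an illustrative sketch, not the on-disk format of any particular file system; the class name Bitmap and the first-fit search are assumptions made for the example.

```python
class Bitmap:
    """One bit per block (or inode): 0 = free, 1 = used."""

    def __init__(self, nbits):
        self.nbits = nbits
        self.bits = bytearray((nbits + 7) // 8)

    def is_used(self, i):
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def set_used(self, i):
        self.bits[i // 8] |= 1 << (i % 8)

    def set_free(self, i):
        self.bits[i // 8] &= ~(1 << (i % 8)) & 0xFF

    def allocate(self):
        """First-fit: return the lowest free index and mark it used, or None."""
        for i in range(self.nbits):
            if not self.is_used(i):
                self.set_used(i)
                return i
        return None

# Data bitmap for the 64-block example; blocks 0..7 hold metadata.
d_bitmap = Bitmap(64)
for block in range(8):
    d_bitmap.set_used(block)

first = d_bitmap.allocate()
second = d_bitmap.allocate()
d_bitmap.set_free(first)       # freeing makes the block available again
third = d_bitmap.allocate()
```

The linear scan is simple but slow for large disks, which is one motivation for the alternative structures and for pre-allocation.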

40.6 Access Paths: Reading and Writing
Reading a file from disk
- Open a file "/foo/bar" whose size is 12 KB, read its data and close it
- Timeline
    - Open: traverse the directory tree -> connect the fd to the inode
    - Read: current_offset -> find the disk block location using the inode and read it -> update the last access time in the inode
    - Close: deallocate the fd and related data structures in the OS; no disk actions
    - Note: bar's inode is read repeatedly -> how about caching it?

40.6 Access Paths: Reading and Writing
Writing a file to disk
- Create a file "/foo/bar", write data and close it
- Timeline
    - Open: 1) create a new inode for bar and update the i-bitmap, 2) insert a new entry into foo's data block (10 I/Os just for creating a file)
    - Write: 5 I/Os per write (d-bitmap read/update, inode read/update, actual user data write)

40.7 Caching and Buffering
Issues and solutions
- Caching
    - Cache directories (e.g. / inode, / data, current directory, ...) in DRAM
    - Cache the inodes and data of recently used files in DRAM
    - Management: LRU replacement policy, dynamic cache size management
- Write buffering (delayed write)
    - Consolidate several writes into a single one: e.g. d-bitmap
    - Schedule multiple writes so that they incur less seek overhead: e.g. bar data
    - Avoid writes altogether: e.g. a temporary file (created and deleted immediately)
    - Concern: data loss due to power fault or crash -> fsync() or direct I/O
(Source: http://www.atmarkit.co.jp/ait/articles/0810/01/news134_2.html)

Chap. 41 Locality and The Fast File System
UFS (Unix File System)
- Layout
    - Superblock: how big the FS is, how many inodes there are, where the inodes are, where the root inode is, ...
    - (bitmap) + inode + user data
    - Simple and easy to use
- Access method
    - Inode access and data access, alternately
    - Concerns: 1) long seek time, 2) consistency
        - Performance issue -> this chapter
        - Consistency issue -> next chapter

41.1 Poor Performance / 41.2 FFS: Disk Awareness
UFS: poor performance
- 1) Inodes and user data are located in different tracks, 2) a file becomes fragmented as time goes on (external fragmentation) -> long seeks
New proposal: FFS
- Place inodes and user data blocks as close together as possible
    - (This idea is also used in Ext2)
- Disk awareness
    - Disk structure: head, arm, platter (surface), track, sector, cylinder
    - Disk access time: seek time + rotational latency + transmission time
    - Data in the same cylinder -> no seek distance (or a closer cylinder -> less seek distance)
        - A cylinder is the set of tracks on different surfaces that are the same distance from the center

41.3 Organizing Structure: The Cylinder Group
FFS in detail
- Disk: divided into a number of cylinder groups
- Cylinder group
    - N consecutive cylinders
    - Structure of each cylinder group
        - Superblock (duplicated for reliability)
        - Per-group bitmap, inodes and data blocks
    - Management
        - Allocate an inode and its data in the same group -> small seek distance
        - Ext2: similar approach, called block group
- Feature of FFS: different internal implementation, but the same external interfaces

41.4 Policies: How to Allocate Files and Directories
Allocation in FFS
- Idea: keep related stuff together
    - Data and its inode, a file and its directory, ...
- Allocation rules
    - Rule 1. Directory: place it in a cylinder group with a low number of allocated directories and a high number of free inodes
        - To balance directories across groups
        - To be able to allocate a bunch of files subsequently
    - Rule 2. File: 1) put files in the cylinder group of the directory they are in, 2) allocate the data blocks of a file in the same group as its inode
        - E.g. create four files: /a/c, /a/d, /a/e, /b/f
        - FFS would place the first three files near one another (same group) and the fourth far away (in some other group)
Assumption: 1) group: 10 inodes and data blocks, 2) directory: 1 block, 3) file: 2 blocks
[Figure: FFS allocation vs. even allocation]

41.5 Measuring File Locality
FFS relies on Common Sense (what CS stands for ^^)
- Files in a directory are often accessed together (namespace locality)
- Measurement: Fig. 41.1
    - Uses real traces called the SEER traces
    - Path difference: how far up the directory tree you have to travel to find the common ancestor between consecutive opens in the trace
        - E.g. same file: 0; /a/b and /a/c: 1; /a/b/e and /a/d/f: 2; ...
    - Observation: 60% of the opens in the trace have a path difference of less than 2
        - E.g. OSproject/src/a.c, OSproject/include/a.h, OSproject/obj/a.o, ...

41.6 The Large-File Exception
How to handle a large file for allocation in FFS?
- A large file fills up a cylinder group with its own data -> undesirable considering namespace locality
- Rule 3. For a large file
    - Allocate some number of blocks in one group, then go to another group and allocate some number of blocks there, then move on to another one, ...
        - Some number: the direct blocks, then the blocks pointed to by each index block
    - Assumption: 1) group: 10 inodes, 40 data blocks, 2) file A: 30 blocks
        - Pros) locality among files, Cons) locality within a file
        - The seek cost can be amortized with a large chunk size between seeks (Fig. 41.2; seek = 10 ms, bandwidth = 40 MB/s: with 4 MB per group, 100 ms of transmission -> 90% of the bandwidth; with 400 KB per group, 10 ms of transmission -> 50% of the bandwidth)
[Figure: FFS allocation vs. allocation without Rule 3]
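The amortization numbers follow from comparing transfer time with seek time per chunk. The sketch uses the slide's parameters (10 ms seek, 40 MB/s) with decimal units (1 MB = 10^6 bytes), which is an assumption; the function name bandwidth_fraction is hypothetical.

```python
def bandwidth_fraction(chunk_bytes, seek_ms=10.0, bandwidth_bytes_per_s=40e6):
    """Fraction of peak bandwidth achieved when each chunk costs one seek."""
    transfer_ms = chunk_bytes / bandwidth_bytes_per_s * 1000.0
    return transfer_ms / (seek_ms + transfer_ms)

large = bandwidth_fraction(4e6)    # 4 MB per group: 100 ms transfer per 10 ms seek
small = bandwidth_fraction(400e3)  # 400 KB per group: 10 ms transfer per 10 ms seek
```

With 4 MB chunks about 91% of the time is spent transferring (the slide rounds to 90%), while 400 KB chunks spend half of the time seeking.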

41.7 A Few Other Things about FFS
Promising features in FFS
- Larger disk block size: 512 B in UFS -> 4 KB in FFS
    - Pros) larger size -> fewer seeks and more transfer -> higher bandwidth usage on disk
    - Cons) internal fragmentation
        - Sub-block (fragment) allocation
- Parameterization
    - Block requests: 1, 2, 3, .... But when request 2 arrives at the disk, the head has already passed the location of block 2 under the standard placement -> solution: parameterized placement
    - cf. modern disks use a track buffer
- Others: long file names, symbolic links, atomic rename(), ...

Chap. 42 Crash Consistency: FSCK and Journaling
Non-volatility: no free lunch
- Can retain data while powered off
- But requires maintaining file system consistency
Consistency
- Changes in a file system are guaranteed to move it from one valid state to another valid state
    - E.g. inconsistent state: the bitmap says that a block is free even though it is used by a file
- What happens if, right in the middle of creating a file, the system loses power? -> crash-consistency problem
Solutions
- FSCK (File System Check)
- Journaling: employed by many file systems such as Ext3/4, JFS, ...
- Others: soft updates, COW, integrity checking, optimistic, ...

42.1 A Detailed Example
- Simple FS: 8 inodes, 8 disk blocks, i-bitmap, d-bitmap
- One file: size = 4 KB, owner = Remzi
- Modify the file: appending, size = 8 KB
    - Note that we need to change three locations -> need three writes

42.1 A Detailed Example
Crash scenario
- Three writes: Db, I[v2], B[v2]
- Delayed write using the cache
- Unexpected power loss or system crash -> some writes may be done while others are not
    - Only Db is written to disk: no problem
    - Only B[v2] is written to disk: space leak
    - Only I[v2] is written to disk: 1) garbage read, 2) inconsistency: inode vs. bitmap
    - Db and B[v2] are written to disk (but not I[v2]): inconsistency
    - Db and I[v2] are written to disk (but not B[v2]): inconsistency
    - I[v2] and B[v2] are written to disk (but not Db): garbage read
- Key problem: lack of atomicity
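The six cases can be derived from two simple rules: the inode and the bitmap must agree, and the inode must not point at a block whose data never arrived. The sketch below encodes those rules; the function name crash_outcome and the outcome labels are illustrative. It reports "space leak" together with "inconsistency" whenever the bitmap is updated without the inode, which is slightly more detailed than the slide's one-word summaries.

```python
def crash_outcome(written):
    """Classify the on-disk state given the set of writes that completed."""
    data = "Db" in written
    inode = "I[v2]" in written
    bitmap = "B[v2]" in written

    problems = set()
    if inode and not data:
        problems.add("garbage read")    # inode points at a block with old contents
    if inode != bitmap:
        problems.add("inconsistency")   # inode and bitmap disagree about the block
    if bitmap and not inode:
        problems.add("space leak")      # block marked used but no file owns it
    return problems

only_data = crash_outcome({"Db"})
only_bitmap = crash_outcome({"B[v2]"})
only_inode = crash_outcome({"I[v2]"})
no_data = crash_outcome({"I[v2]", "B[v2]"})
```

An empty result means the file system is still consistent, although in the Db-only case the appended data is simply lost.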

42.2 Solution #1: The File System Checker
Traditional solution: fsck (file system checker)
- Consists of several passes
    - Superblock: metadata for the FS, usually a sanity check
    - Free blocks: check all inodes and the blocks they use; if the bitmaps are inconsistent with them, correct the bitmaps (usually trusting the inode information)
    - Inode state: validity check of each inode; reclaim invalid inodes
    - Inode links: check link counts by scanning the entire directory tree; move lost files (an inode exists but no directory entry points to it) into the lost+found directory
    - Duplicates: find blocks that are pointed to by two or more inodes
    - Bad blocks: pointers that point outside the valid range
    - Directory checks: directory checks based on FS-specific knowledge (e.g. "." and ".." are the first entries)
- Issue: too slow
    - Remzi's analogy: fsck is like dropping a key in your bedroom and then running a search-the-entire-house-for-key algorithm, scanning from the basement through the kitchen and every room

42.3 Solution #2: Journaling (or WAL)
Journaling
- A kind of WAL (write-ahead logging)
- Key idea: when updating the disk, before overwriting the structures in place, first write down a little note at a well-known location, describing what you are about to do
- If a crash occurs -> the note says what you intended -> redo or undo
Journaling FS
- Linux Ext3/4, IBM JFS, SGI XFS, NTFS, ReiserFS, ...
- Features of the Ext3 file system
    - Integrates journaling into the ext2 file system
    - Three modes: journal (data journaling), ordered (metadata journaling, ordered, default), writeback (metadata journaling, non-ordered)
[Figure: Ext2 disk layout vs. Ext3 disk layout]

42.3 Solution #2: Journaling (or WAL)
Data journaling
- Assume we want to do three writes (I[v2], B[v2], and Db)
- Before writing them to their final locations, we first write them to the log (journal)
    - TxB: transaction begin; includes the Tid and information about the writes
    - Log
        - Physical logging: the same contents as at the final locations
        - Logical logging: the intent (saves space, but more complex)
    - TxE: transaction end, with the Tid
- After making this transaction safe on disk, we are ready to update the original data -> checkpointing
- On a failure between journaling and checkpointing, we can replay the journal (redo)
- On a failure between TxB and TxE, we can discard the incomplete journal entry (undo)
- Conclusion -> 2 steps: 1) journal write, and 2) checkpoint

42.3 Solution #2: Journaling (or WAL)
How to reduce journaling overhead?
- For journaling, we need to write a set of blocks
    - e.g. TxB, I[v2], B[v2], Db, TxE
- Approach 1: issue one request at a time, waiting for each to complete before issuing the next
    - Too slow
- Approach 2: issue all writes at once
    - Unsafe: some requests might be lost
    - The transaction can still look valid (it has a begin and an end), so replaying the journal writes wrong data
- Approach 3: issue all writes at once and apply a checksum over all contents of the journal
- Approach 4: employ a commit
    - Separate TxE from all other writes
    - Recovery: 1) not committed -> skip it, 2) committed, but not yet in the original locations -> redo logging
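The recovery rule of the commit approach (skip uncommitted transactions, redo committed ones) can be sketched with an in-memory toy journal. The record format (tuples tagged "TxB", "write", "TxE"), the dictionary standing in for the disk, and the function name recover are all hypothetical, chosen only to illustrate the rule with physical logging.

```python
def recover(journal, disk):
    """Redo every write that belongs to a committed transaction, in log order."""
    committed = {tid for kind, tid, *_ in journal if kind == "TxE"}
    for kind, tid, *payload in journal:
        if kind == "write" and tid in committed:
            location, contents = payload
            disk[location] = contents      # physical logging: copy contents in place
    return disk

journal = [
    ("TxB", 1),
    ("write", 1, "inode", "I[v2]"),
    ("write", 1, "bitmap", "B[v2]"),
    ("write", 1, "data", "Db"),
    ("TxE", 1),                            # transaction 1 committed
    ("TxB", 2),
    ("write", 2, "inode", "I[v3]"),        # crash before TxE: never committed
]

# State of the final locations at crash time: checkpointing had not started.
disk = {"inode": "I[v1]", "bitmap": "B[v1]", "data": None}
recovered = recover(journal, dict(disk))
```

Because redo simply rewrites the logged contents, replaying a transaction that was already partly checkpointed is harmless.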

42.3 Solution #2: Journaling (or WAL)
Metadata journaling
- Data journaling writes data twice, which increases I/O traffic (reducing performance); this is especially painful for sequential writes
- Metadata journaling
    - User data is not written to the journal (metadata only)
- Question
    - Does the write order between user data and the journal matter? -> Yes, writing the journal before the user data causes problems (pointers to garbage)
- Conclusion: metadata journaling
    - Data write -> Journal metadata write -> Journal commit -> Checkpoint -> Free
- Real world
    - Ext3: supports both ordered and non-ordered (writeback) modes
    - Windows NTFS and SGI's XFS use non-ordered metadata journaling

42.3 Solution #2: Journaling (or WAL)
Revoke record in the journal: for handling block reuse
- Scenario: 1) there is a directory called foo, 2) a user adds an entry to foo (creates a file), 3) foo's contents are written to block 1000, 4) the log looks like the following figure (note that a directory is metadata, so it is also logged)
- 5) The user deletes foo (and the files in it), 6) the user creates another file (say foobar), which reuses block 1000, 7) the writes for foobar are logged (note that the file contents themselves are not logged)
- 8) At this point, a crash occurs, 9) recovery performs "redo" from the beginning of the log, 10) which overwrites the user data of foobar with the old directory contents
- Solution
    - Ext3 adds a new type of record, a revoke record, for the deleted file or directory; during replay, revoked records are not redone

42.3 Solution #2: Journaling (or WAL)
Timeline
- Data journaling vs. metadata journaling
    - The horizontal dashed line is a "write barrier"
    - Note that, in this figure, the order between Data and Journaling is not guaranteed in the metadata journaling timeline (writeback mode in ext3)

42.4 Solution #3: Other Approaches (Optional)
Alternatives
- fsck: a lazy approach
- Journaling: an active approach
    - Ext3, ReiserFS, IBM's JFS, ...
- Soft updates
    - Suggested by G. Ganger and Y. Patt
    - Carefully order all writes so that on-disk structures are never left in an inconsistent state (e.g. a data block is always written before its inode)
    - Soft updates are not easy to implement since they require intricate knowledge about the file system (in contrast, journaling can be implemented with relatively little knowledge about the FS)
- COW
    - Used in Btrfs and Sun's ZFS
    - LFS can be considered an early example of COW
- Optimistic crash consistency
    - Enhances performance by issuing as many writes to disk as possible
    - Exploits checksums as well as a few other techniques

Summary
File basics
- Layout: superblock, bitmap, inode, data blocks
- Access methods: open(), read(), write(), ...
FFS
- A watershed moment in file system research
- Storage awareness; simple but effective techniques
Journaling
- Consistency: change from one valid state to another valid state
- Journaling: performance and reliability tradeoff
Homework 5: Make a simple file system (like VSFS or UFS) on a RAM disk using FUSE
- Requirement: 1) team (3 persons), 2) report with your discussion (and the role of each team member), 3) source code, 4) result snapshot
- Environment -> see Lab. 3 (bmap.c, file.c, dir.c, inode.c)
- Due: the same day of the week, two weeks later
- Bonus: your own format and mount program

Appendix
Double-edged sword of powerful commands