Membrane: Operating System Support for Restartable File Systems
  Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Michael M. Swift
  Membrane: a layer of material which serves as a selective barrier between two phases and remains impermeable to specific particles, molecules, or substances when exposed to the action of a driving force.
  [Figure labels: File System, Membrane, Kernel, Bug]
Bugs in File-system Code
  Bugs are common in any large software
    File systems contain 1,000 – 100,000 LOC
  Recent work has uncovered 100s of bugs [Engler OSDI '00, Musuvathi OSDI '02, Prabhakaran SOSP '03, Yang OSDI '04, Gunawi FAST '08, Rubio-González PLDI '09]
    Error handling code, recovery code, etc.
  File systems are part of the core kernel
    A single bug could make the kernel unusable
  2/25/2021 Membrane: Operating System Support for Restartable File Systems (FAST '10)
Bug Detection in File Systems
  FS developers are good at detecting bugs
    “Paranoid” about failures
    Lots of checks all over the file system code!
  Number of calls to assert, BUG, and panic in Linux 2.6.27:
    File System   assert()   BUG()   panic()
    xfs           2119       18      43
    ubifs         369        36      2
    ocfs2         261        531     8
    gfs2          156        60      0
    afs           106        38      0
    ext4          42         182     12
    reiserfs      1          109     93
    ntfs          0          288     2
  Detection is easy but recovery is hard
Why is Recovery Hard?
  Processes could potentially use corrupt in-memory file-system objects
  No fault isolation
  Process killed on crash
  File systems manage their own in-memory objects
  Inconsistent kernel state
  Hard to free FS objects
  Common solution: crash file system and hope problem goes away after OS reboot
  [Figure: App, VFS, and File System layers at the time of a crash; shared in-memory objects include an inode (i_count) and an address mapping.]
Why not Fix Source Code?
  To develop perfect file systems
    Tools do not uncover all file system bugs
    Bugs are still fixed manually
    Code constantly modified due to new features
  Make file systems handle all error cases
    Interacts with many external components
      ▪ VFS, memory mgmt., network, page cache, and I/O
  Cope with bugs rather than hope to avoid them
Restartable File Systems
  Membrane: OS framework to support lightweight, stateful recovery from FS crashes
    Upon failure, transparently restart the FS
    Restore state and allow pending application requests to be serviced
    Applications oblivious to crashes
  A generic solution to handle all FS crashes
    Last resort before file systems decide to give up
Results
  Implemented Membrane in Linux 2.6.15
  Evaluated with ext2, VFAT, and ext3
  Evaluation
    Transparency: hide failures (~50 faults) from applications
    Performance: < 3% overhead for micro and macro benchmarks
    Recovery time: < 30 milliseconds to restart FS
    Generality: < 5 lines of code for each FS
Outline
  Motivation
  Restartable file systems
  Evaluation
  Conclusions
Components of Membrane
  Fault Detection: helps detect faults quickly
  Fault Anticipation: records file-system state
  Fault Recovery: executes recovery protocol to clean up and restart the failed file system
Fault Detection
  Correct recovery requires early detection
    Membrane best handles “fail-stop” failures
  Both hardware and software-based detection
    H/W: null pointer, general protection error, ...
    S/W: assert(), BUG_ON(), panic()
  Assume transient faults during recovery
    Non-transient faults: return error to that process
Fault Anticipation
  Additional work done in anticipation of a failure
  Issue: where to restart the file system from?
    File systems are constantly updated by applications
  Possible solutions:
    Make each operation atomic
    Leverage built-in crash consistency mechanism
      Not all file systems have a crash consistency mechanism
  Generic mechanism to checkpoint FS state
Checkpoint File-system State
  Checkpoint: consistent state of the file system that can be safely rolled back to in the event of a crash
  All requests enter via VFS layer
  File systems write to disk through page cache
  Control requests to FS & dirty pages to disk
  [Figure: Apps, VFS, File System (ext3, VFAT), Page Cache, Disk]
Generic COW based Checkpoint
  [Figure: three panels (Regular, During Checkpoint, After Checkpoint), each showing App, VFS, File System, Page Cache, and Disk, with Consistent Images #1, #2, and #3 and STOP markers on blocked paths.]
  Copy-on-Write
  Consistent image: can be written back to disk
  On crash, roll back to last consistent image
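The copy-on-write checkpoint idea on this slide can be sketched with a toy page cache. This is a minimal illustration under stated assumptions, not Membrane's kernel implementation: the class `CowPageCache` and all of its method names are hypothetical, pages are plain byte strings, and "disk" is a dictionary.

```python
class CowPageCache:
    """Toy page cache with copy-on-write checkpoints (illustrative only)."""

    def __init__(self):
        self.disk = {}     # page number -> contents that reached disk
        self.frozen = {}   # dirty pages belonging to the last checkpoint
        self.dirty = {}    # dirty pages written after the last checkpoint

    def write(self, page_no, data):
        # A write after a checkpoint never modifies a frozen page: the new
        # version lives in the current epoch (the copy-on-write step).
        self.dirty[page_no] = data

    def read(self, page_no):
        # Newest version wins: current epoch, then checkpoint, then disk.
        for layer in (self.dirty, self.frozen, self.disk):
            if page_no in layer:
                return layer[page_no]
        return None

    def flush_frozen(self):
        # Pages of a completed checkpoint form a consistent image and
        # can be written back to disk.
        self.disk.update(self.frozen)
        self.frozen = {}

    def checkpoint(self):
        # Write out the previous checkpoint, then freeze the current epoch.
        self.flush_frozen()
        self.frozen, self.dirty = self.dirty, {}

    def crash_rollback(self):
        # On crash: flush the last checkpoint, discard everything newer.
        self.flush_frozen()
        self.dirty = {}
```

After `crash_rollback()`, reads return the contents as of the last checkpoint, which is the state a restarted file system would be remounted from.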
State after checkpoint?
  On crash:
    Flush dirty pages of last checkpoint
    Throw away the in-memory state
    Remount from the last checkpoint
      Consistent file-system image on disk
  Issue: state after checkpoint would be lost
    Operations completed after checkpoint have already returned to applications
  Need to recreate state after checkpoint
  [Figure panels: On Crash, After Recovery]
Operation-level Logging
  Log operations along with their return value
    Replay completed operations after checkpoint
  Operations are logged at the VFS layer
    File-system independent approach
  Logs are maintained in-memory and not on disk
  How long should we keep the log records?
    Log thrown away at checkpoint completion
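A small sketch may help make the logging rules above concrete. It assumes a toy file system that only tracks names; `ToyFS`, `OpLog`, the `-1` error code, and every method name are hypothetical stand-ins, not Membrane's VFS-level code.

```python
class ToyFS:
    """Stand-in for a file system; it only tracks which names exist."""

    def __init__(self):
        self.names = set()

    def create(self, name):
        if name in self.names:
            return -1              # hypothetical error code
        self.names.add(name)
        return 0

    def unlink(self, name):
        if name not in self.names:
            return -1
        self.names.remove(name)
        return 0


class OpLog:
    """In-memory log of completed operations and their return values."""

    def __init__(self):
        self.records = []

    def call(self, fs, op, *args):
        # Log above the file system, so the log is file-system independent.
        retval = getattr(fs, op)(*args)
        self.records.append((op, args, retval))
        return retval

    def checkpoint_completed(self):
        # Everything logged so far is covered by the checkpoint.
        self.records.clear()

    def replay(self, fs):
        # Re-issue each completed operation and check that the restarted
        # file system gives the answer the application already saw.
        for op, args, expected in self.records:
            got = getattr(fs, op)(*args)
            if got != expected:
                raise RuntimeError(
                    f"replay of {op}{args} returned {got}, expected {expected}")
```

Recording the return value is what lets replay confirm that the restarted file system reports the same results the application observed before the crash.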
Fault Recovery
  Important steps in recovery:
    1. Clean up state of partially completed operations
    2. Clean up in-memory state of file system
    3. Remount file system from last checkpoint
    4. Replay completed operations after checkpoint
    5. Re-execute partially completed operations
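The ordering of the five steps can be sketched on toy data structures. This is only an illustration of the sequence, assuming the file-system state is a dictionary and each operation is a `(name, data)` pair; the function `recover` and its parameters are hypothetical.

```python
def recover(fs_state, checkpoint, completed_log, in_flight):
    """Run the five recovery steps in order; return a trace of the steps."""
    trace = []

    # 1. Unwind partially completed operations, remembering them for step 5.
    pending = list(in_flight)
    in_flight.clear()
    trace.append("unwind")

    # 2. Discard the (possibly corrupt) in-memory file-system state.
    fs_state.clear()
    trace.append("cleanup")

    # 3. Remount from the last checkpoint.
    fs_state.update(checkpoint)
    trace.append("remount")

    # 4. Replay operations that completed after the checkpoint.
    for name, data in completed_log:
        fs_state[name] = data
    trace.append("replay")

    # 5. Re-execute the operations that were in flight at the crash.
    for name, data in pending:
        fs_state[name] = data
    trace.append("re-execute")

    return trace
```

Replay comes before re-execution because in-flight operations started after the completed ones and may depend on their effects.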
Partially completed Operations
  Multiple threads inside file system
    Intertwined execution
  FS code should not be trusted after crash
  Application threads killed?
    - application state will be lost
  Processes cannot be killed after crash
  Clean way to undo incomplete operations
  [Figure: App (user) above VFS, File System, and Page Cache (kernel) at the time of a crash]
A Skip/Trust Unwind Protocol
  Skip: file-system code
  Trust: kernel code (VFS, memory mgmt., …)
    - Cleanup state on error from file systems
  How to prevent execution of FS code?
    Control capture mechanism: marks file-system code pages as non-executable
  Unwind Stack: stores return address (of last kernel function) along with expected error value
Skip/Trust Unwind Protocol in Action
  E.g., create code path in ext2:
    sys_open() → do_sys_open() → filp_open() → open_namei() → vfs_create() → ext2_create() → ext2_addlink() → ext2_prepare_write() → block_prepare_write() → ext2_get_block() → Crash
  Unwind stack entries (each with fn, rval, and saved regs: rax, rbp, rsi, rdi, rbx, rcx, rdx, r8, …):
    fn vfs_create, rval -ENOMEM
    fn blk..._write, rval -EIO
  File-system code is non-executable: entering it raises a membrane fault
  Cleanup performed by kernel code:
    Clear buffer, zero page, mark not dirty
    Release namei data
    Release fd
  Kernel is restored to a consistent state
Putting All Pieces Together
  1. Periodically create checkpoints
  2. File system crash
  3. Unwind in-flight processes
  4. Move to recent checkpoint
  5. Replay completed operations
  6. Re-execute unwound process
  [Figure: timeline T0, T1, T2 across Application, VFS, and File System layers, showing a checkpoint and the operations open(“file”), write(), read(), write(), link(), close(). Legend: Completed, In-progress, Crash.]
Evaluation
  Questions that we want to answer:
    Can Membrane hide failures from applications?
    What is the overhead during user workloads?
    How portable are existing file systems to Membrane?
    How much time does it take to recover the FS?
  Setup:
    2.2 GHz Opteron processor & 2 GB RAM
    Two 80 GB Western Digital disks
    Linux 2.6.15 64-bit kernel, 5.5K LOC were added
    File systems: ext2, VFAT, ext3
How Transparent are Failures?
  Columns per configuration: Detected?, Application?, FS Consistent?, FS Usable?
    Ext3_Function    Fault            Ext3 + Native   Ext3 + Membrane
    create           null-pointer     o ✗ ✗ ✗         d ✓ ✓ ✓
    get_blk_handle   bh_result        o ✗ ✗ ✗         d ✓ ✓ ✓
    follow_link      nd_set_link      o ✗ ✗ ✓         d ✓ ✓ ✓
    mkdir            d_instantiate    o ✗ ✗ ✗         d ✓ ✓ ✓
    free_inode       clear_inode      o ✗ ✗ ✗         d ✓ ✓ ✓
    read_blk_bmap    sb_bread         o ✗ ✓ ✗         d ✓ ✓ ✓
    readdir          null-pointer     o ✗ ✗ ✗         d ✓ ✓ ✓
    file_write       file_aio_write   G ✗ ✓ ✓         d ✓ ✓ ✓
  Legend: o – oops, G – prot. fault, d – detected, – cannot unmount, ✗ – no, ✓ – yes
  Membrane successfully hides faults
Overheads during User Workloads?
  Workload: Copy, untar, make of OpenSSH 4.51
  [Bar chart, time in seconds; the bar labels did not survive extraction]
    28.5 vs. 28.9 (1.4%)
    30.1 vs. 30.8 (2.3%)
    28.7 vs. 29.1 (1.4%)
  Reliability almost comes for free
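The percentages on this slide follow from the timings by simple arithmetic. The check below assumes that the first number of each pair is the baseline and the second is the run with Membrane; that pairing is inferred from the order of the numbers, and the helper `overhead_percent` is a hypothetical name.

```python
def overhead_percent(baseline_s, with_membrane_s):
    """Relative slowdown in percent, rounded to one decimal place."""
    return round(100.0 * (with_membrane_s - baseline_s) / baseline_s, 1)


# (baseline, with Membrane) timings in seconds, as read off the slide.
pairs = [(28.5, 28.9), (30.1, 30.8), (28.7, 29.1)]
overheads = [overhead_percent(base, memb) for base, memb in pairs]
```

Under that assumption the computed overheads are 1.4, 2.3, and 1.4 percent, matching the labels on the chart.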
Generality of Membrane?
  Individual file system changes:
    File System   Added   Modified   Deleted
    Ext2          4       0          0
    VFAT          5       0          0
    Ext3          1       0          0
    JBD           4       0          0
  No crash-consistency
  Existing code remains unchanged
  Additions: track allocations and write super block
  Minimal changes to port existing FS to Membrane
Conclusions
  Failures are inevitable in file systems
    Learn to cope and not hope to avoid them
  Membrane: generic recovery mechanism
    Users: build trust in new file systems (e.g., btrfs)
    Developers: quick-fix bug patching
  Encourage more integrity checks in FS code
    Detection is easy but recovery is hard
Thank You! Questions?
  Advanced Systems Lab (ADSL)
  University of Wisconsin-Madison
  http://www.cs.wisc.edu/adsl
Are Failures Always Transparent?
  Files may be recreated during recovery
    Inode numbers could change after restart
  [Figure: before the crash (Epoch 0), the application issues create(“file1”), stat(“file1”), write(“file1”, 4k) and file1 has inode# 12; after crash recovery, create(“file1”), write(“file1”, 4k), stat(“file1”) run against the restarted file system and file1 has inode# 15, an inode# mismatch.]
  Solution: make create() part of a checkpoint
Postmark Benchmark
  3000 files (sizes 4K to 4MB), 60K transactions
  [Bar chart, time in seconds; the bar labels did not survive extraction]
    478.2 vs. 484.1 (1.2%)
    46.9 vs. 47.2 (0.6%)
    43.1 vs. 43.8 (1.6%)
Recovery Time
  Recovery time is a function of: dirty blocks, open sessions, and log records
  We varied each of them individually
    Data (MB)       Recovery Time (ms)
    10              12.9
    20              13.2
    40              16.1
    Open Sessions   Recovery Time (ms)
    200             11.4
    400             14.6
    800             22.0
    Log Records     Recovery Time (ms)
    1K              15.3
    10K             16.8
    100K            25.2
  Recovery time is on the order of a few milliseconds
Recovery Time (Cont.)
  Restart ext2 during random-read benchmark
Generality and Code Complexity
  Individual file system changes:
    File System   Added   Modified
    Ext2          4       0
    VFAT          5       0
    Ext3          1       0
    JBD           4       0
  Kernel changes:
                  No Checkpoint        With Checkpoint
    Components    Added   Modified     Added   Modified
    FS            1929    30           2979    64
    MM            779     5            867     15
    Arch          0       0            733     4
    Headers       522     6            552     6
    Module        238     0            238     0
    Total         3468    41           5369    89
Interaction with Modern FSes
  Have built-in crash consistency mechanisms
    Journaling or snapshotting
  Seamlessly integrate with these mechanisms
    Need FSes to indicate the beginning and end of a transaction
    Works for data and ordered journaling modes
    Need to combine writeback mode with COW
Page Stealing Mechanism
  Goal: reduce the overhead of logging writes
  Solution: grab data from page cache during recovery
  [Figure: write(fd, buf, offset, count) passing through VFS, File System, and Page Cache; panels: Before Crash, During Recovery, After Recovery]
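One way to picture page stealing is a write log that keeps only the position of each write, with the data fetched from the surviving page cache at replay time. The sketch below assumes the page cache is a dictionary keyed by `(file_id, page_number)`; `steal`, `replay_writes`, and `write_fn` are hypothetical names, not Membrane's interfaces.

```python
PAGE_SIZE = 4096


def steal(page_cache, file_id, offset, count, page_size=PAGE_SIZE):
    """Collect `count` bytes starting at `offset` from cached pages."""
    out = bytearray()
    pos, end = offset, offset + count
    while pos < end:
        page_no, in_page = divmod(pos, page_size)
        page = page_cache[(file_id, page_no)]
        n = min(page_size - in_page, end - pos)
        out += page[in_page:in_page + n]
        pos += n
    return bytes(out)


def replay_writes(log, page_cache, write_fn, page_size=PAGE_SIZE):
    """Replay logged writes; each record is (file_id, offset, count) only."""
    for file_id, offset, count in log:
        # The log holds no payload: the latest version of the data is
        # taken ("stolen") from the page cache during recovery.
        data = steal(page_cache, file_id, offset, count, page_size)
        write_fn(file_id, offset, data)
```

Because the log records omit the payload, logging a write costs a few integers regardless of how much data the write carried.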
Handling Non-Determinism
  During log replay, could data be written in a different order?
    Log entries need not represent the actual order
  Not a problem for meta-data updates
    Only one of them succeeds and is recorded in the log
  Deterministic data-block updates with page stealing mechanism
    Latest version of the page is used during replay
Possible Solutions
  1. Code to recover from all failures
    Not feasible in reality
  2. Restart on failure
    Previous work has taken this approach
    [Figure: design space along two axes, Stateful vs. Stateless and Heavyweight vs. Lightweight, populated with CuriOS, EROS, SafeDrive, Singularity, Nooks/Shadow, Xen, Minix, L4, Nexus]
  FS need: stateful & lightweight recovery