Conquest: Preparing for Life After Disks
October 2, 2003
An-I Andy Wang
Conquest Overview
- File systems are optimized for disks
  - Performance problem
  - Complexity
- Now we have tons of inexpensive RAM
  - What can we do with that RAM?
Conquest Approach
- Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way
- Simplification
  - At least 20% smaller code base than ext2, reiserfs, and SGI XFS
- Performance (under popular benchmarks)
  - 24% to 1900% faster than LRU disk caching
  - Best performance boost since Berkeley FFS
Performance Problem of Disks
[Figure: accesses per second (log scale, 1 KHz to 1 GHz) versus year, 1990 to 2000. CPU and memory improve about 50%/yr; disk improves about 15%/yr. The gap between them is annotated as 1 sec : 6 days in 1990 and 1 sec : 3 months in 2000, with markers at 10^5 and 10^6.]
Genesis • Conquest Design • Performance Evaluation • Conclusion
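The two ratios on this slide are roughly consistent with the quoted growth rates. The check below is a minimal sketch, assuming the gap compounds yearly at the ratio of the two rates and starting from the 1990 figure of 6 days. The function name and this reading of the chart are assumptions, not taken from the slides.

```python
# Assumed reading of the chart: CPU/memory access rates grow about 50%/yr,
# disk access rates about 15%/yr, and the 1990 gap is "1 sec : 6 days".
CPU_GROWTH = 1.50
DISK_GROWTH = 1.15


def gap_after(years: int, initial_gap_days: float) -> float:
    """Scale the initial gap by the compounded ratio of the two growth rates."""
    return initial_gap_days * (CPU_GROWTH / DISK_GROWTH) ** years


gap_2000_days = gap_after(10, 6.0)
print(f"gap in 2000: about {gap_2000_days:.0f} days")  # on the order of three months
```

Ten years of compounding widens the gap by a factor of about 14, which turns 6 days into a little under three months.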
Inside Pandora's Box
- Disk arm
- Disk platters
- Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time
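The access-time formula can be worked through with numbers. The sketch below assumes a hypothetical drive (8 ms average seek, 7200 RPM, 40 MB/s transfer rate) and takes the average rotational delay to be half a revolution, a common approximation; none of these values come from the slides.

```python
def access_time_ms(seek_ms: float, rpm: float,
                   transfer_mb_s: float, request_kb: float) -> float:
    """Access time = seek time + rotational delay + transfer time, in ms."""
    # Average rotational delay approximated as half a revolution.
    rotational_ms = 0.5 * 60_000.0 / rpm
    # Time to move the requested bytes once the head is in position.
    transfer_ms = (request_kb / 1024.0) / transfer_mb_s * 1000.0
    return seek_ms + rotational_ms + transfer_ms


# Hypothetical drive and a 4 KB request.
t = access_time_ms(seek_ms=8.0, rpm=7200, transfer_mb_s=40.0, request_kb=4.0)
print(f"{t:.2f} ms")
```

For a small request the mechanical terms (seek and rotation) dominate and the transfer term is almost negligible, which is why the optimizations on the following slide focus on avoiding arm and platter movement.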
Disk Optimization Methods
- Disk arm scheduling
- Group information on disk
- Disk readahead
- Buffered writes
- Disk caching
- Data mirroring
- Hardware parallelism
Complexity
[Figure: mechanisms layered around the stored bytes: predictive readahead, synchronization, cache replacement, elevator algorithm, data consistency, asynchronous write, data clustering]
Storage Media Alternatives
[Figure: cost in $/MB (log scale) versus accesses/sec (log scale, 10^-3 to 10^6). Labels: tape, disk, (write once), flash memory, battery-backed DRAM, persistent RAM, magnetic RAM?]
[Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Quantum 2000; Micron Semiconductor Products 2002]
The Genesis of Conquest
- Idea: persistent-RAM-only file system
  - Improved performance
  - Remove disk-related complexity
The Genesis of Conquest (2)
- Problem: wrong growth curves
  - Disk prices dropping faster than RAM prices
  - Disks will stay around
[Figure: $/MB (log scale, 10^-2 to 10^2) versus year, 1995 to 2005, for persistent RAM, 1" HDD, and 3.5" HDD; annotation: booming of digital photography]
[Grochowski 2002]
The Genesis of Conquest (3)
- New idea: hybrid system for transition
  - Takes advantage of RAM speed
  - Still simplifies code
[Figure: $/MB (log scale, 10^-2 to 10^2) versus year, 1995 to 2005, for paper/film, persistent RAM, 1" HDD, and 3.5" HDD; annotations: booming of digital photography; 4 to 10 GB of persistent RAM]
[Grochowski 2002]
Conquest Design Questions
- How to make effective use of RAM?
  - Common usage patterns
  - Physical characteristics of RAM storage
- Where and how to reduce complexity?
  - Data paths
  - Data structures and associated management
  - Shutdown/boot sequence
- How to assure the integrity of file system components that reside in BB-DRAM?
User Access Patterns
- Small files
  - Take little space (10%)
  - Represent most accesses (90%)
- Large files
  - Take most space
  - Mostly sequential accesses
- Not characteristic of database applications
[Ousterhout 1985; Baker et al., 1991; Irlam 1993; Douceur and Bolosky 1999; Roselli et al., 2000; Evans and Kuenning 2002]
Characteristics of Storage Media
- RAM
  - Fast random accesses
  - Cost-effective in performance
- Disk
  - Fast sequential accesses
  - Cost-effective in storage
The Design of Conquest
- Deliver all file system services from memory, with the exception of high-capacity storage
- Persistent RAM
  - Data content of small files (smaller than 1 MB)
  - Metadata (file descriptions for large and small files, directories, and data structures)
- Disk
  - Data content of large files
- Two separate data paths to memory and disk
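The placement rule on this slide can be sketched as a single function. The function name, the returned dictionary, and the medium labels are illustrative assumptions; only the 1 MB threshold and the RAM/disk split come from the slide.

```python
LARGE_FILE_THRESHOLD = 1 << 20  # 1 MB, the small/large boundary on the slide


def placement(size_bytes: int) -> dict:
    """Return where a file's metadata and data content would live."""
    # Data content goes to disk only when the file is large.
    data_medium = "persistent_ram" if size_bytes < LARGE_FILE_THRESHOLD else "disk"
    # Metadata stays in persistent RAM for both small and large files.
    return {"metadata": "persistent_ram", "data": data_medium}


print(placement(4 * 1024))          # small file: everything in RAM
print(placement(100 * (1 << 20)))   # large file: only the data on disk
```

The sketch leaves open what happens when a file grows across the threshold; the slides do not describe that transition.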
Conquest Alternatives
- Disk caching
  - Assumption of scarce memory
  - Use disk as the final storage destination
  - Complex mechanisms to maintain consistency
- RAM drives and RAM file systems
  - Not meant to be persistent
  - Use disk-related mechanisms
  - Limitations on storage capacity
[McKusick et al., 1990; Ganger et al., 2000; Roselli et al., 2000; Seltzer et al., 2000]
Simplification of Data Paths
Content of Persistent RAM
- Data content of small files (< 1 MB)
  - No seek time or rotational delays
  - Fast byte-level accesses
  - Virtual contiguous allocation
- Metadata (e.g., directories, file system states)
  - Fast synchronous update
  - No dual representations
  - For both large and small files
Memory Data Path of Conquest
[Figure: two data paths side by side]
- Conventional file systems: storage requests → I/O buffer management → I/O buffer, persistence support → disk management → disk
- Conquest memory data path: storage requests → persistence support → battery-backed RAM (small file and metadata storage)
Large-File-Only Disk Storage
- Only store the data content of large files
- Allocate in big chunks
  - Lower access overhead
  - Reduced management overhead
- No fragmentation management
- No tricks for small files
  - Storing data in metadata
- No elaborate data structures
  - Wrapping a balanced tree onto disk cylinders
[Namesys 2002]
Sequential-Access Large Files
- Sequential disk accesses
  - Near-raw bandwidth
- Well-defined readahead semantics
- Read-mostly
  - Little synchronization overhead (between memory and disk)
Disk Data Path of Conquest
[Figure: two data paths side by side]
- Conventional file systems: storage requests → I/O buffer management → I/O buffer, persistence support → disk management → disk
- Conquest disk data path: storage requests → I/O buffer management → I/O buffer, battery-backed RAM (small file and metadata storage) → disk management → disk (large-file-only file system)
Random-Access Large Files
- Random access?
  - Common definition: nonsequential access
  - A typical movie has 150 scene changes
  - MP3 stores the title at the end of the files
- Near sequential access?
  - Simplifies large-file metadata representation significantly
[Baker et al., 1991; Vogels 1999; Roselli et al., 2000]
Simplification of Data Structures
Logical File Representation
[Figure: the parts of a file]
- Name(s)
- i-node
  - File attributes
- Data
Physical File Representation
[Figure: the parts of a file]
- Name(s)
- i-node
  - File attributes
  - Data locations
- Data blocks
Ext2 Data Representation
[Figure: an i-node (stored on disk) holding data block locations (labeled 10) and index block locations]
Disadvantages with Ext2 Design
- Optimization for small files makes things complex
- Designed for disk storage
- Random-access data structure for large files that are accessed mostly sequentially
- Data access time dependent on the byte position in a file
- Maximum file size is limited
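To illustrate why access time depends on byte position in an index-block scheme like the one pictured, the sketch below counts how many index blocks must be traversed to reach a given offset. The 4 KB block size and 1,024 pointers per index block are assumed values, `n_direct=10` follows the label in the figure, and the function is a simplified generalization that does not cap the number of indirection levels.

```python
def indirection_levels(offset: int, block_size: int,
                       n_direct: int, ptrs_per_block: int) -> int:
    """Number of index blocks traversed to reach the block holding `offset`."""
    block_no = offset // block_size
    if block_no < n_direct:
        return 0  # location is stored directly in the i-node
    block_no -= n_direct
    span = ptrs_per_block  # blocks reachable through one level of index blocks
    level = 1
    while block_no >= span:
        block_no -= span
        span *= ptrs_per_block  # each extra level multiplies the reach
        level += 1
    return level


BLOCK = 4096   # assumed block size
PTRS = 1024    # assumed pointers per index block
for offset in (0, 10 * BLOCK, (10 + PTRS) * BLOCK):
    print(offset, indirection_levels(offset, BLOCK, 10, PTRS))
```

Bytes near the start of a file need no extra lookups, while bytes further in need one, two, or more, each of which can mean another disk access. The fixed number of levels is also what bounds the maximum file size.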
Conquest Representation
- Persistent RAM
  - Single-level dynamically allocated index array
  - Fast data access for files stored in RAM
[Figure: i-node (stored in RAM) → index array location → data block locations]
Conquest Representation (2)
- Disk
  - Worst case: sequential memory search for random disk locations
  - Maximum file size limited by physical storage
[Figure: i-node (stored in RAM) → segment list location → list of (begin block location, end block location) pairs, each describing a run of blocks stored on disk]
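A segment list of this kind can be searched with a simple linear walk, which is the worst case the slide mentions. The sketch below is an assumption about how such a lookup might look; the tuple layout, inclusive end blocks, and function name are illustrative and not taken from the Conquest source.

```python
from typing import List, Optional, Tuple

# (begin_block, end_block) on disk, both inclusive; the list itself lives in RAM.
Segment = Tuple[int, int]


def locate_block(segments: List[Segment], file_block: int) -> Optional[int]:
    """Map a file-relative block number to a disk block by walking the list."""
    remaining = file_block
    for begin, end in segments:
        length = end - begin + 1
        if remaining < length:
            return begin + remaining
        remaining -= length  # skip this segment and keep walking
    return None  # the requested block is past the end of the file


segments = [(100, 199), (500, 549)]  # a 150-block file stored in two runs
print(locate_block(segments, 0), locate_block(segments, 120))
```

Because the walk happens entirely in memory and large files are allocated in big chunks, the list stays short and the search cost is small next to a disk access. Sequential readers could also remember the last segment visited, though the slides do not say whether Conquest does so.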
Conquest Directories
- Per-directory hash tables stored in memory
  - Collisions resolved by rehashing
- Hard links: multiple names point to same data
- Problem: dynamic resizing of directories
  - Need to handle the current file position
  - Important for rm -fr
The Difficulty With Shrinking
- rm -fr
[Figure: directory i-node (stored in RAM) → hash table location → four-slot hash table with entries for hash codes 1000, 1001, and 0110, each holding a file i-node location, plus one deleted slot]
The Difficulty With Shrinking
- rm -fr
[Figure: the same table after the 1000 entry is removed: a deleted slot, 1001, 0110, and a second deleted slot]
The Difficulty With Shrinking
- rm -fr
[Figure: the table shrunk to two slots, holding the 0110 and 1001 entries]
The Difficulty With Shrinking
- rm -fr
[Figure: the two-slot table with the 0110 entry and one empty slot]
- Quick fixes
  - Never shrink hash tables (for rm -fr)
  - No promises for ls while adding files
Extensible Hash Tables
- Use top, not bottom, bits of hash code
[Figure: two-slot hash table holding the 0110 entry, then the 1001 entry]
[Fagin et al., 1979]
Extensible Hash Tables
- Preserve ordering of entries when resizing
[Figure: the table doubled to four slots: empty, 0110, 1001, empty]
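The reason for indexing with the top bits can be checked on the two 4-bit hash codes shown in the figures. The sketch below compares top-bit and bottom-bit indexing as the table doubles from two slots to four; the function names are illustrative assumptions.

```python
HASH_BITS = 4  # the figures use 4-bit hash codes such as 0110 and 1001


def top_bits_index(hash_code: int, depth: int) -> int:
    """Slot index taken from the top `depth` bits of the hash code."""
    return hash_code >> (HASH_BITS - depth)


def bottom_bits_index(hash_code: int, depth: int) -> int:
    """Slot index taken from the bottom `depth` bits of the hash code."""
    return hash_code & ((1 << depth) - 1)


def scan_order(hash_codes, depth, index_fn):
    """Order in which a slot-by-slot table scan would visit the entries."""
    return sorted(hash_codes, key=lambda h: index_fn(h, depth))


codes = [0b0110, 0b1001]
for depth in (1, 2):  # a 2-slot table, then a 4-slot table
    print(depth,
          [format(h, "04b") for h in scan_order(codes, depth, top_bits_index)],
          [format(h, "04b") for h in scan_order(codes, depth, bottom_bits_index)])
```

With top bits, 0110 lands in slot 0 and then slot 1, and 1001 in slot 1 and then slot 2, so a scan always meets 0110 first. With bottom bits the two entries swap places after doubling, which would invalidate a directory cursor held by a running `ls` or `rm -fr`.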
Additional Engineering Details
- Dynamic file positioning
- Need to handle collisions
- Memory overhead and complexity tradeoffs
Simplification of Metadata Management
Metadata Allocation
- Requirements
  - Keep track of usage status of metadata entries
  - Avoid duplicate allocation with unique IDs
  - Fast retrieval of metadata with a given ID
[Figure: metadata entries by ID and status: 30 free, 81 in use, 58 free, 16 free, 89 in use, 88 free]
Existing Memory Allocation
- Services
  - Keep track of unallocated memory
  - No duplicate allocation of physical addresses
  - Hmm…
[Figure: memory by address and status: 0xe000000 free, 0xe000038 in use, 0xe000070 free, 0xe0000A8 free, 0xe0000E0 free, 0xe000118 in use]
Conquest Metadata Management
- Metadata = memory allocated by memory manager
- Metadata ID = physical address of metadata
- Unique IDs and fast retrieval
[Figure: the ID table (30 free, 81 in use, 58 free, 16 free, 89 in use, 88 free) set beside the address table (0xe000000 free, 0xe000038 in use, 0xe000070 free, 0xe0000A8 free, 0xe0000E0 free, 0xe000118 in use), both tracking usage status]
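The idea that an address can double as a metadata ID can be sketched in a few lines. Python cannot expose physical addresses, so this toy arena simulates them as a base address plus an offset into a `bytearray`; the class, its method names, and the 0x38 entry size (the spacing between the addresses in the figure) are illustrative assumptions.

```python
ENTRY_SIZE = 0x38  # spacing between the addresses shown in the figure


class MetadataArena:
    """Toy allocator whose returned addresses also serve as metadata IDs."""

    def __init__(self, base: int, n_entries: int):
        self.base = base
        self.memory = bytearray(n_entries * ENTRY_SIZE)
        # Free list ordered so that pop() hands out the lowest address first.
        self.free = [base + i * ENTRY_SIZE for i in reversed(range(n_entries))]
        self.in_use = set()

    def allocate(self) -> int:
        addr = self.free.pop()
        self.in_use.add(addr)
        return addr  # unique while allocated, so usable as the ID

    def release(self, addr: int) -> None:
        self.in_use.remove(addr)
        self.free.append(addr)

    def entry(self, metadata_id: int) -> memoryview:
        """Retrieve an entry by ID with plain address arithmetic."""
        if metadata_id not in self.in_use:
            raise KeyError(hex(metadata_id))
        offset = metadata_id - self.base
        return memoryview(self.memory)[offset:offset + ENTRY_SIZE]


arena = MetadataArena(0xE000000, 6)
first_id = arena.allocate()
print(hex(first_id))
```

No separate ID table or ID-to-location map is needed: the allocator's own bookkeeping already tracks usage status and prevents duplicate allocation, which is the point of the slide.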
Simplification of Shutdown/Boot Sequence
Persistence Support
- Restore file system states after a reboot
  - Data
  - Metadata
- Memory manager
  - Keep track of metadata allocation
  - Reinitialized at boot time
  - No knowledge of persistently allocated data
Linux Memory Manager
- Page allocator maintains individual pages
[Figure: page allocator]
Linux Memory Manager (2)
- Zone allocator allocates memory in power-of-two sizes
[Figure: zone allocator on top of the page allocator]
Linux Memory Manager (3)
- Slab allocator groups allocations by sizes to reduce internal memory fragmentation
[Figure: slab allocator on top of the zone allocator on top of the page allocator]
Memory Allocation Example
- Allocate a 455-byte data structure
  - Slab allocator: one page of data structures
  - Zone allocator: one page from DMA zone
  - Page allocator: page address 0x0000d000
Linux Memory Manager (4)
- Difficult to restore the persistent states
  - Three layers of pointer-rich mappings
  - Mixing of persistent and temporary allocations
[Figure: slab allocator, zone allocator, page allocator]
Conquest Persistence
- Create memory zones with own instantiations of memory managers
[Figure: slab allocator, zone allocator, page allocator]
Conquest Persistence
- Reuse existing memory manager code
- Encapsulate all pointers within each zone
  - Pointers can survive reboots
  - No serialization and deserialization
- Swapping and paging
  - Disabled for Conquest memory zones
  - Enabled for non-Conquest zones
Integrity of Content in RAM
- User-level program crashes
  - Same file system interface as others
  - Access control
  - Memory protection
- Operating system crashes
  - 1.5% of crashes lead to memory corruption
  - Lose about one data block a decade
[Ng et al., 1996]
Other Reliability Mechanisms
- Instantaneous metadata commit
- Daily backups
- Pointer-switch commit semantics
[Figure: pointer]
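The slides give only the name of pointer-switch commit semantics. One common interpretation is that a new version is prepared off to the side and then published by changing a single pointer, so a failure before the switch leaves the old version intact. The sketch below shows that interpretation; the class and function names are hypothetical, and this is not a description of Conquest's actual code.

```python
class Cell:
    """Holds the one reference that readers follow to the committed version."""

    def __init__(self, value: dict):
        self.current = value


def commit_update(cell: Cell, mutate) -> None:
    """Apply `mutate` to a private copy, then publish it with one assignment."""
    new_version = dict(cell.current)  # build the new version off to the side
    mutate(new_version)               # if this raises, nothing was published
    cell.current = new_version        # the pointer switch is the commit point


inode = Cell({"size": 1})
commit_update(inode, lambda d: d.update(size=2))
print(inode.current)
```

Readers see either the old version or the new one, never a half-updated mix, because the only shared mutation is the final reference assignment.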
Implementation Status
- Kernel module under Linux 2.4.2
- Operational and POSIX compliant
- Modified memory manager to support Conquest persistence
- Need to overcome BIOS limitations for distribution
Performance Evaluation
- Architectural simplification
  - Feature count
- Performance improvement
  - Memory-only workloads
  - Memory-and-disk workloads
Conventional Data Path
[Figure: conventional file systems: storage requests → I/O buffer management → I/O buffer, persistence support → disk management → disk]
Features listed: buffer allocation management; buffer garbage collection; data caching; metadata caching; predictive readahead; write behind; cache replacement; metadata allocation; metadata placement; metadata translation; disk layout; fragmentation management
Memory Path of Conquest
[Figure: memory data path: storage requests → persistence support → battery-backed RAM (small file and metadata storage)]
Features listed: memory manager encapsulation; buffer allocation management; buffer garbage collection; data caching; metadata caching; predictive readahead; write behind; cache replacement; metadata allocation; metadata placement; metadata translation; disk layout; fragmentation management
Disk Path of Conquest
[Figure: disk data path: storage requests → I/O buffer management → I/O buffer, battery-backed RAM (small file and metadata storage) → disk management → disk (large-file-only file system)]
Features listed: buffer allocation management; buffer garbage collection; data caching; metadata caching; predictive readahead; write behind; cache replacement; metadata allocation; metadata placement; metadata translation; disk layout; fragmentation management
PostMark Benchmark (1)
- ISP workload (emails, web-based transactions)
- Conquest is comparable to ramfs
- At least 24% faster than the LRU disk cache
[Figure: 40 to 250 MB working set with 2 GB physical RAM]
[Card et al., 1994; Sweeney et al., 1996; Katcher 1997; Namesys 2002]
PostMark Benchmark (2)
- When both memory and disk components are exercised, Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS
[Figure: 10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM; regions marked working set <= RAM and working set > RAM]
PostMark Benchmark (3)
- When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS
[Figure: 10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM]
Sprite LFS Microbenchmarks
- Small-file benchmark
  - Operates on 10,000 1-KB files in three phases
[Rosenblum and Ousterhout 1991]
Sprite LFS Microbenchmarks (2)
- Modified large-file microbenchmark: ten 1-MB files (Conquest in-core files)
Sprite LFS Microbenchmarks (3)
- Modified large-file microbenchmark: ten 1.01-MB files (Conquest on-disk files)
Sprite LFS Microbenchmarks (4)
- Large-file microbenchmark: forty 100-MB files (Conquest on-disk files)
History's Mystery
Puzzling Microbenchmark Numbers…
Geoff Kuenning: "If Conquest is slower than ext2fs, I will toss you off of the balcony…"
With me hanging off a balcony…
- Original large-file microbenchmark: one 1-MB file (Conquest in-core file)
Odd Microbenchmark Numbers
- Why are random reads slower than sequential reads?
Odd Microbenchmark Numbers
- Why are RAM-based file systems slower than disk-based file systems?
A Series of Hypotheses
- Warm-up effect?
  - Maybe
  - Why do RAM-based systems warm up slower?
- Bad initial states?
  - No
- Pentium III streaming I/O option?
  - No
[Keshava and Penkovski 1999; Torvalds 2001; Abraham 2002]
Effects of L2 Cache Footprints
[Figure: two panels, large L2 cache footprint and small L2 cache footprint. In each panel a file is written sequentially, leaving the cache footprint at the file end, and the same file is then read sequentially; the read position, the footprint, the flushed region, and the file end are marked.]
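One reading of this diagram is that after a sequential write, the cache holds only the tail of the file, so a sequential read that starts from the head finds nothing useful and keeps evicting what it is about to need. The toy simulation below illustrates that effect with a strict LRU cache of fixed capacity. Real L2 caches are set-associative and not strictly LRU, so this is an assumption-laden illustration rather than a model of the Pentium III used in the experiments.

```python
from collections import OrderedDict


def sequential_write_then_read(file_lines: int, cache_lines: int) -> float:
    """Hit rate of a sequential read that follows a sequential write."""
    cache = OrderedDict()  # keys are cache-line numbers, oldest first

    def touch(line: int) -> bool:
        hit = line in cache
        if hit:
            cache.move_to_end(line)        # refresh recency on a hit
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
        return hit

    for line in range(file_lines):         # write pass leaves the tail cached
        touch(line)
    hits = sum(touch(line) for line in range(file_lines))  # read pass
    return hits / file_lines


print(sequential_write_then_read(file_lines=4, cache_lines=8))   # file fits
print(sequential_write_then_read(file_lines=16, cache_lines=8))  # file does not fit
```

A file that fits in the cache is read back entirely from cache, while one that exceeds it gets no benefit at all under this model. That kind of cliff is invisible when disk latency dominates but becomes prominent once the file system itself runs at memory speed.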
LFS Sprite Microbenchmarks
- Modified large-file microbenchmark: ten 1-MB files (Conquest in-core files)
Related Work
- Main-memory databases
  - Memory-based data structures and query mechanisms
- File-system applications of persistent RAM
  - Write buffers
  - Flash-memory-based file systems
  - Disk emulators
  - Rio file cache
  - MRAM-enabled storage
[Baker et al., 1992; Garcia-Molina and Salem 1992; Wu and Zwaenepoel 1994; Chen et al., 1996; Riedel 1998; Quantum 2000; Miller et al., 2001]
Related Work (2)
- PDA operating systems
  - Designed with severe memory constraints
- Slice
  - Distributed storage system
  - Dedicated servers for metadata, small files, and large files
[Anderson et al., 2000; Palm 2000; IBM 2002; Microsoft 2002]
Lessons Learned
- Faster than LRU caching, unexpected
  - Heavyweight disk handling
  - Severe penalty for accessing memory content
- Matching user access patterns to storage media offers considerable simplification and better performance
  - Not an automatic result
  - Need careful design
More Lessons Learned
- Effects of L2 caching become highly visible in memory workloads (modern workloads)
- Cannot blindly apply existing disk-based microbenchmarks to measure memory performance of file systems
- Need to consider states of L2 cache and memory behaviors at each stage of microbenchmarking
Additional Lessons Learned
- Don't discuss your performance numbers next to a balcony…unless…
Going Beyond Conquest
- Matching usage patterns with heterogeneous machines in the distributed domain
  - Specialized tasks for machines within a cluster
  - Preferably self-organizing and self-evolving
- State-rich computing
  - Caching of runtime data structures
  - Similar to specialized temporary file system
Going Beyond Conquest (2)
- Separate storage of metadata from data
  - Opportunity for hierarchical replication across devices with different calibers
- Benchmarking memory performance of file systems
  - Developing new memory benchmarks
- Why are modern operating systems so complicated?
  - More places to expand Conquest approach
Contributions
- Demonstrated the feasibility of disk-memory hybrid file systems
- Showed performance does not preclude simplicity
- Pinpointed cache-related problems with modern benchmarks
- Opened doors to many exciting areas of research
Conclusion
- Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements
- Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of OS as well.
Questions...
Conquest: http://www.cs.fsu.edu/~awang/conquest
Andy Wang: awang@cs.fsu.edu