IMPROVING THE RELIABILITY OF COMMODITY OPERATING SYSTEMS
MICHAEL M. SWIFT, BRIAN N. BERSHAD, HENRY M. LEVY
Presenter: Shyam Sunder Santoshi Visamsetty
E-Mail: vsss@cs.vt.edu
11/13/2007
Outline
• Introduction
• Motivation
• Previous Work
• Nooks
  – Architecture
  – Implementation
• Performance
• Conclusion
Features of a Good Operating System
• High Performance
• High Scalability
• High Reliability
Reliability Problems in Operating Systems
• Crashes caused by:
  – Device Drivers
  – Other Extensions such as File Systems, Virus Detectors, Network Protocols, etc.
Causes of System Crashes in Windows NT
Source: http://www.dependability.org/wg10.4/meeting38/07-Murph.pdf, June 2000
Crashes in Windows XP
Source: http://msdn2.microsoft.com/en-us/library/ms838661.aspx, Jan 2003
“The most notable reality is that the Windows operating system is not responsible for a majority of PC crashes in our data set. Poorly-written device drivers contribute most of the crashes in our data.”
-- Windows XP Kernel Crash Analysis by Archana Ganapathi, Viji Ganapathi and David Patterson, University of California, Berkeley, 2006
http://www.cs.berkeley.edu/~archanag/publications/lisa.pdf
Why Device Drivers?
• Device drivers access system memory and hardware directly.
• Device drivers and other extensions account for 70% of the code in the Linux 2.4.1 release.
• Faulty driver code can crash the whole system.
Motivation
• Reliability remains a crucial but unsolved problem.
• Rising costs of failures
• Increasing prevalence of OS extensions
• Extensions are the leading cause of OS failure
Previous Approaches to Enhance Reliability
• Microkernels
• Type-Safe Languages
• New Hardware: Ring and Segment Architectures
• Transaction-based Systems
Nooks Approach
• Conventional Processor Architecture
• Conventional Programming Language
• Conventional OS Architecture
• Existing Extensions
• Nooks virtualizes only the interface between the kernel and the extension.
Goals
• Isolation
• Recovery
• Backward Compatibility
Nooks Architecture
• Two Core Principles:
  – Design for fault resistance, not fault tolerance.
  – Design for mistakes, not abuse.
Nooks: Implementation
• Implemented on the Linux 2.4.18 kernel.
• Isolated kernel extensions are wrapped by Nooks wrapper stubs.
• All extensions execute at ring 0.
• Nooks does not use the Intel x86 protection rings or memory segmentation mechanisms.
[Figure: Nooks Layered Architecture]
Functions of Nooks
Isolation
• Prevents extension errors from damaging the kernel.
• Every extension executes within its own lightweight kernel protection domain.
• Tasks:
  – Protection-Domain Management
  – Inter-Domain Control Transfer
Isolation (Contd.)
• Extension Procedure Call (XPC)
• XPC is a control-transfer mechanism for isolating extensions within the kernel.
• XPC occurs between asymmetrically trusted domains.
Isolation: Implementation
• Two Parts:
  – Memory Management
  – Extension Procedure Call
[Figure: Protection of Kernel Address Space]
Isolation (Contd.)
• Extension Procedure Call (XPC):
  – Transfers control between extension and kernel domains.
  – Two Functions:
    • nooks_driver_call
    • nooks_kernel_call
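The control transfer above can be pictured with a toy model. This is only a sketch under stated assumptions: the slides name nooks_driver_call and nooks_kernel_call but not their signatures, so the arguments here are hypothetical, as are the Machine class, the page names, and the rule that an extension may read kernel memory but write only its own pages.

```python
from contextlib import contextmanager

KERNEL = "kernel"  # name of the fully trusted domain (hypothetical)


class DomainError(Exception):
    """A domain tried to write memory it may only read."""


class Machine:
    """Toy model of memory pages owned by protection domains."""

    def __init__(self):
        self.current = KERNEL   # domain currently executing
        self.pages = {}         # page name -> [owner domain, value]

    def alloc(self, page, owner, value=0):
        self.pages[page] = [owner, value]

    def read(self, page):
        # Assumption: every domain may read every page in this model.
        return self.pages[page][1]

    def write(self, page, value):
        owner = self.pages[page][0]
        # The kernel writes anywhere; an extension only to its own pages.
        if self.current != KERNEL and owner != self.current:
            raise DomainError(f"{self.current} may not write {page}")
        self.pages[page][1] = value

    @contextmanager
    def _enter(self, domain):
        saved, self.current = self.current, domain
        try:
            yield
        finally:
            self.current = saved   # always return to the caller's domain

    def nooks_driver_call(self, domain, func, *args):
        """XPC from the kernel into an extension's domain."""
        with self._enter(domain):
            return func(self, *args)

    def nooks_kernel_call(self, func, *args):
        """XPC from an extension back into the kernel domain."""
        with self._enter(KERNEL):
            return func(self, *args)
```

In this model a buggy driver that writes a kernel page directly raises DomainError, while the same write routed through nooks_kernel_call succeeds, which is the asymmetry the XPC slides describe.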
Isolation (Contd.)
• Deferred Call Mechanism
• Maintains two queues:
  – Extension-domain queue
  – Kernel-domain queue
• Changes to the Linux kernel:
  – Maintain coherency between the kernel and extension page tables.
  – Handle exceptions.
  – Handle co-location of the task structure.
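A minimal sketch of the deferred-call idea follows. The slides only state that two queues exist; the assumption made here is that calls which do not need to run immediately are queued and then flushed on the next regular domain crossing, so several calls share the cost of one XPC. The class and method names are hypothetical.

```python
from collections import deque


class DeferredCalls:
    """Toy model: batch non-urgent cross-domain calls into one crossing."""

    def __init__(self):
        # Calls queued on the extension side, waiting to run in the kernel.
        self.extension_domain_queue = deque()
        # Calls queued on the kernel side, waiting to run in the extension.
        self.kernel_domain_queue = deque()
        self.crossings = 0  # number of real domain switches performed

    def defer_to_kernel(self, func, *args):
        self.extension_domain_queue.append((func, args))

    def defer_to_extension(self, func, *args):
        self.kernel_domain_queue.append((func, args))

    def _flush(self, queue):
        while queue:
            func, args = queue.popleft()
            func(*args)

    def xpc_to_kernel(self, func, *args):
        """Regular XPC into the kernel; runs queued kernel calls first."""
        self.crossings += 1
        self._flush(self.extension_domain_queue)
        return func(*args)

    def xpc_to_extension(self, func, *args):
        """Regular XPC into the extension; runs queued extension calls first."""
        self.crossings += 1
        self._flush(self.kernel_domain_queue)
        return func(*args)
```

Two deferred calls followed by one regular XPC cost a single crossing in this model instead of three.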
Interposition
• Integrates existing extensions into the Nooks environment.
• Tasks:
  – All extension-to-kernel and kernel-to-extension control flow goes through the XPC mechanism.
  – All data transfer between the kernel and the extension is viewed and managed by Nooks' object-tracking mechanism.
Interposition (Contd.)
• Wrapper Stubs:
  – Interface between the extension, the Nooks Isolation Manager (NIM) and the kernel.
  – The kernel views the stub as an extension's function entry point.
  – Extensions view the stub as the kernel's extension API.
Interposition: Implementation
• Interposes wrapper stubs between extensions and the kernel.
• Wrappers provide transparency and protect control and data transfers in both directions.
• Changes to the Kernel:
  – Standard Module Loader
  – Module Initialization Code
  – Protection of Data Transfers
Wrappers
• Two types of wrappers:
  – Kernel Wrappers
  – Extension Wrappers
• Each wrapper performs three tasks:
  – Checks parameters for validity by verifying with the object tracker and memory manager that pointers are valid.
  – Object-tracking code creates a copy of kernel objects on the local heap or stack within the extension's protection domain.
  – Performs an XPC into the kernel or extension to execute the desired function.
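The three wrapper tasks can be sketched as a function that wraps an extension entry point. This is an illustration only: the names (ObjectTracker, make_extension_wrapper, direct_xpc) are hypothetical, kernel objects are modelled as dictionaries, the XPC is a plain function call, and copying the result back to the kernel object afterwards is an assumption the slides do not state.

```python
import copy


class InvalidPointer(Exception):
    """A wrapper was handed an object the tracker does not know."""


class ObjectTracker:
    def __init__(self):
        self._known = set()   # ids of kernel objects the extension may use

    def register(self, obj):
        self._known.add(id(obj))

    def is_valid(self, obj):
        return id(obj) in self._known


def direct_xpc(func, *args):
    # Stand-in for the real control transfer: just call the function.
    return func(*args)


def make_extension_wrapper(tracker, ext_func, xpc=direct_xpc):
    """Build a stub the kernel can call in place of ext_func."""
    def wrapper(kernel_obj):
        # Task 1: check the parameter against the object tracker.
        if not tracker.is_valid(kernel_obj):
            raise InvalidPointer("untracked kernel object")
        # Task 2: give the extension a private copy of the kernel object.
        local_copy = copy.deepcopy(kernel_obj)
        # Task 3: transfer control to the extension via XPC.
        result = xpc(ext_func, local_copy)
        # Assumed step: propagate the extension's changes back.
        kernel_obj.update(local_copy)
        return result
    return wrapper
```

The extension never receives the kernel's own object in this sketch, so a stray write inside the extension damages only its copy.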
[Figure: Control Flow of Extension and Kernel Wrappers]
Wrappers (Contd.)
• Wrapper Code Sharing:
  – 248 wrappers were implemented to isolate 463 imported and exported functions.
  – Wrapper code is therefore shared among multiple drivers.
[Figure: Code Sharing among Wrappers]
Object Tracking
• Tasks:
  – Maintains a list of kernel data structures that are manipulated by an extension.
  – Controls all modifications to those structures.
  – Provides object information for clean-up when an extension fails.
Object Tracking: Implementation
• Manages manipulation of kernel objects by extensions.
• Records all kernel objects and types in use by extensions.
• Performs two tasks:
  – Records the addresses of all objects in use by an extension.
  – Records an association between the kernel and extension versions of the object.
• Garbage Collection
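The two recording tasks map naturally onto two lookup tables. The sketch below is a simplified illustration, not the Nooks data structure: the class, its method names, and the example addresses and type names are all hypothetical.

```python
class ObjectTracker:
    """Toy tracker: which objects an extension uses, and where its copies live."""

    def __init__(self):
        self._in_use = {}         # extension -> {kernel address: type name}
        self._kernel_to_ext = {}  # (extension, kernel address) -> extension address

    def record(self, extension, address, type_name):
        """Task 1: remember that the extension uses the object at this address."""
        self._in_use.setdefault(extension, {})[address] = type_name

    def associate(self, extension, kernel_addr, ext_addr):
        """Task 2: link the kernel version of an object to the extension's copy."""
        self._kernel_to_ext[(extension, kernel_addr)] = ext_addr

    def extension_copy(self, extension, kernel_addr):
        return self._kernel_to_ext.get((extension, kernel_addr))

    def release(self, extension, address):
        """Forget an object, for example when it is freed or garbage collected."""
        self._in_use.get(extension, {}).pop(address, None)
        self._kernel_to_ext.pop((extension, address), None)

    def objects_of(self, extension):
        """Everything the extension still holds; useful for clean-up on failure."""
        return dict(self._in_use.get(extension, {}))
```

After a failure, objects_of gives the clean-up code a complete list of what the failed extension was holding.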
Recovery
• Software Faults:
  – Occur when an extension invokes a kernel service improperly.
  – The recovery policy determines whether Nooks triggers recovery or, when possible, returns control to the extension with an error code.
• Hardware Faults:
  – Occur when an extension attempts to read unmapped memory.
  – Always trigger recovery.
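The fault handling described above amounts to a small decision function. The sketch below is one possible reading, with hypothetical names throughout; in particular, modelling the recovery policy as a single boolean is a simplification.

```python
from enum import Enum


class Fault(Enum):
    SOFTWARE = "extension invoked a kernel service improperly"
    HARDWARE = "extension touched unmapped memory"


class Action(Enum):
    RETURN_ERROR = "return an error code to the extension"
    RECOVER = "trigger recovery"


def decide(fault, error_return_possible, policy_allows_error_return=True):
    """Pick the response to a detected fault."""
    if fault is Fault.HARDWARE:
        # Hardware faults always trigger recovery.
        return Action.RECOVER
    # Software faults: the policy may hand an error back when that is possible.
    if error_return_possible and policy_allows_error_return:
        return Action.RETURN_ERROR
    return Action.RECOVER
```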
Recovery: Implementation
• Two parts:
  – Release of resources by the recovery manager.
  – Coordination of recovery through the user-mode agent.
• The Nooks recovery manager is tasked with returning the system to a clean state from which it can continue.
• The user-mode recovery agent facilitates flexible recovery.
Recovery: Implementation (Contd.)
• The recovery manager walks the list of objects known to the object tracker and releases, frees or unregisters all objects that will not be accessed by external devices.
• It uses a recovery function which releases the objects to the kernel and removes all references from the kernel into the extension.
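The walk over tracked objects can be sketched as follows. All names are hypothetical: TrackedObject, the device_visible flag (standing in for "may still be accessed by an external device") and the per-type release functions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrackedObject:
    name: str
    kind: str                     # selects the recovery function to use
    device_visible: bool = False  # may an external device still access it?


def recover(tracked, release_funcs):
    """Release every tracked object that no external device can still touch.

    Returns (released names, names kept back for later handling).
    """
    released, kept = [], []
    for obj in tracked:
        if obj.device_visible:
            # Not safe to free yet: a device may still read or write it.
            kept.append(obj.name)
            continue
        # The recovery function returns the object to the kernel.
        release_funcs[obj.kind](obj)
        released.append(obj.name)
    return released, kept
```

A caller would pass the object tracker's list for the failed extension together with one release function per object type.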
Implementation Limitations
• Complete isolation or fault tolerance is not achieved.
• Nooks runs extensions in kernel mode, so it cannot prevent extensions from deliberately executing privileged instructions.
• Limited to drivers that can be killed and restarted safely.
• As a result of these limitations, crashes may still occur.
Reliability Test
• Test Methodology: synthetic fault injection
• Extensions Isolated:
Test Environment
• Four Programs:
  – Sound Drivers: play a short MP3 file.
  – Network Drivers: ICMP ping and TCP streaming tests.
  – VFAT: untars and compiles a number of files.
  – kHTTPd: Web Load Generator.
• VMware Virtual Machine
• 400 trials were run for each extension in both Native and Nooks modes.
Test Results
• System Crashes:
  – Native mode: 317 crashes (400 trials per extension).
  – Nooks: eliminated 313 (99%); the remaining 4 resulted in deadlock.
  – e1000 and pcnet32 are interrupt-oriented.
  – VFAT, sb and kHTTPd are process-oriented.
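The 99% figure follows directly from the two crash counts on this slide. A short check, using only those two numbers:

```python
native_crashes = 317   # crashes without Nooks (from the slide)
eliminated = 313       # crashes Nooks recovered from (from the slide)

remaining = native_crashes - eliminated
recovery_rate = eliminated / native_crashes

print(f"remaining failures: {remaining}")
print(f"recovery rate: {recovery_rate:.1%}")
```

The rate is 98.7%, which the slide rounds to 99%; the remaining failures are the 4 deadlocks.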
Test Results (Contd.)
Test Results (Contd.)
• Non-Fatal Extension Failures:
  – For e1000 and pcnet32, failures that left the device in a non-functional state were not detected by Nooks.
  – For VFAT and sb, Nooks reduced the number of non-fatal extension failures.
  – For kHTTPd, only a small number of injected faults were caught by Nooks.
Test Results (Contd.)
Recovery Errors
• For the network, sb and kHTTPd extensions, recovery is straightforward.
• For VFAT, 90% of the cases resulted in on-disk corruption.
• Reason: fault injection occurs after files and directories are created, and an abrupt shutdown and restart of the file system leaves it in a corrupted state.
Recovery Errors (Contd.)
• Solution:
  – Synchronize the disks with the in-memory disk cache before releasing resources on a VFAT recovery.
• Result:
  – Corruption cases dropped from 90% to 10%.
Other Tests
• For manually injected errors, such as improper initializations and removed NULL checks, Nooks automatically detected and recovered from all such failures.
• Latent Bugs:
  – Nooks revealed several latent bugs in existing kernel extensions such as kHTTPd and the 3Com 3c90x Ethernet driver.
Summary of Reliability Tests
• 99% of the system crashes were detected and recovered.
• Nearly 60% of non-fatal extension failures were recovered.
Performance: Benchmarks
Benchmark               Extension  XPC Rate   Nooks Relative  Native CPU  Nooks CPU
                                   (per sec)  Performance     Util. (%)   Util. (%)
Play-mp3 (128 Kbps)     sb         150        1               4.8         4.6
Receive-stream          e1000      8,923      0.92            -           15.5
Send-stream             e1000      60,352     0.91            21.4        39.3
Compile-local           VFAT       22,653     0.78            97.5        96.8
Serve-simple-web-page   kHTTPd     61,183     0.44            96.6        96.8
Serve-complex-web-page  e1000      1,960      0.97            90.5        92.6
[Figure: Comparative Time Chart for Compilation Benchmark]
Summary of Benchmark Results
• Nooks provides a substantial reliability improvement at a cost that depends on the extension being isolated.
• Moreover, the performance impact depends on the CPU utilization imposed by the workload.
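Both points can be read off the benchmark table. The sketch below uses the XPC rates and relative-performance values from that table and converts relative performance into a percentage loss; the variable names and the conversion itself are illustrative additions, not part of the original evaluation.

```python
# (benchmark, extension, XPC calls per second, Nooks relative performance)
benchmarks = [
    ("Play-mp3", "sb", 150, 1.00),
    ("Receive-stream", "e1000", 8923, 0.92),
    ("Send-stream", "e1000", 60352, 0.91),
    ("Compile-local", "VFAT", 22653, 0.78),
    ("Serve-simple-web-page", "kHTTPd", 61183, 0.44),
    ("Serve-complex-web-page", "e1000", 1960, 0.97),
]

# Performance lost under Nooks, in whole percent, per benchmark.
loss_percent = {
    name: round((1 - relative) * 100)
    for name, _ext, _rate, relative in benchmarks
}

# List benchmarks from lowest to highest XPC rate.
for name, ext, rate, _relative in sorted(benchmarks, key=lambda b: b[2]):
    print(f"{name:<24}{ext:<8}{rate:>7} XPC/s  loss {loss_percent[name]:>2}%")
```

Send-stream and Serve-simple-web-page have nearly the same XPC rate, yet lose 9% and 56% respectively. The web server already runs at about 97% CPU utilization natively, so the XPC overhead cannot be absorbed by idle cycles, which matches the second bullet above.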
Conclusion
• Nooks can be implemented with modest engineering effort.
• Extensions can be isolated without any change to extension code.
• Isolation and recovery dramatically improve system reliability.
• However, for extensions with a high XPC frequency where performance matters, isolation may not be appropriate.
QUESTIONS AND COMMENTS
THANK YOU