Recovering Device Drivers Michael M Swift, Muthukaruppan Annamalai, Brian N Bershad and Henry Levy

Introduction ► Device drivers fail more than anything else § XP: 85% of all crashes § Linux: 7x the bug rate of the mainline kernel ► Existing work protects the kernel ► Applications left to fend for themselves

Principles ► Device driver failures should be concealed from the driver’s clients ► Recovery logic should be centralized in a single subsystem ► Driver recovery logic should be generic ► Recovery services should have low overhead when not needed

Shadow Drivers ► Conceals driver failure from application ► Logs driver activity § Driver state (ioctls) § IO requests/calls ► On failure § Intercepts IO requests § Resets driver state by replaying log ► Model is abstract enough to be implemented for wide range of drivers

Why programs crash ► “Most drivers fail due to bugs that result from unexpected inputs or events [34]” § [34] V. Orgovan, Systems Crash Analyst, Windows Core OS Group, Microsoft Corp. private communication, 2004 § Do we really need a reference for this? § What sort of reference is that anyway?

Driver Faults ► Deterministic § Set sequence of repeatable configuration or IO requests § Unrecoverable with generic tools ► Transient § Infrequent inputs or environment settings ► Fail-stop § Kernel is protected from failing drivers § Faults are detected before collateral damage occurs ► Shadow drivers require transient and fail-stop behavior

Nooks ► Earlier work in kernel protection ► Provides fail-stop facilities § Detects memory violations § Excessive CPU usage § Bad kernel parameters § 75% success rate ► Simply reboots the driver after a fault

Shadow Driver Operation ► Passive Mode § Normal operation § Monitors all explicit communication ► Replicated procedure calls ► Not DMA § Logs driver configuration ► Active Mode § Recovery operation § Reinitializes driver to known state § Impersonates driver to the kernel

Taps ► Mechanism allowing replication and redirection of communication channels ► Passive Operation § Calls driver function then shadow function ► Active mode § Redirects all calls to shadow driver
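
The two tap modes can be illustrated with a small user-space sketch. This is not the Nooks or shadow-driver code; `Tap`, `ToyDriver`, and `ToyShadow` are hypothetical names, and the sketch assumes a tap is simply a dispatcher that either replicates a call (passive) or redirects it (active).

```python
class Tap:
    """Hypothetical tap sitting between the kernel and a driver."""

    def __init__(self, driver, shadow):
        self.driver = driver
        self.shadow = shadow
        self.active = False  # False = passive mode, True = active mode

    def call(self, name, *args):
        if self.active:
            # Active mode: the driver is bypassed and the shadow answers.
            return getattr(self.shadow, name)(*args)
        # Passive mode: call the driver function, then the shadow function.
        result = getattr(self.driver, name)(*args)
        getattr(self.shadow, name)(*args)
        return result


class ToyDriver:
    def write(self, data):
        return len(data)


class ToyShadow:
    def __init__(self):
        self.seen = []

    def write(self, data):
        self.seen.append(data)
        return 0  # assumed placeholder reply while the real driver is down


tap = Tap(ToyDriver(), ToyShadow())
passive_result = tap.call("write", b"abc")  # driver handles it, shadow observes
tap.active = True                           # simulate the switch to redirection
active_result = tap.call("write", b"de")    # shadow handles it alone
```

In passive mode the caller still sees the driver's return value; the shadow's return value is discarded, which matches the idea that the shadow only observes during normal operation.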

Passive Taps

Active Taps

Shadow Manager ► Controls all shadow drivers ► Manages recovery operations ► Controls Tap insertion ► Monitors device failures

General Infrastructure ► Nooks § Isolation service § Redirection mechanism § Object tracking service ► Shadow Manager § Installs shadow drivers

Architecture

Passive Monitoring ► Tracks IO requests § Connection-oriented: offset/positioning § Request-oriented: pending request log ► Logs configuration commands § Only information stored in a persistent log § Does not replicate driver state ► Tracks kernel objects obtained § Prevents memory leaks ► Many of the replicated calls are no-ops § Read/write to sound device by driver
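
The bookkeeping described on this slide can be sketched as a small tracker. The class and method names (`PassiveMonitor`, `on_submit`, and so on) and the sample command `set_volume` are hypothetical; the sketch only assumes that the shadow keeps a position for connection-oriented devices, a pending-request table for request-oriented devices, an ordered configuration log, and a set of kernel objects the driver currently holds.

```python
class PassiveMonitor:
    """Hypothetical passive-mode bookkeeping for one shadow driver."""

    def __init__(self):
        self.offset = 0             # connection-oriented: current position
        self.pending = {}           # request-oriented: request id -> payload
        self.config_log = []        # configuration commands, in arrival order
        self.kernel_objects = set() # objects the driver currently holds

    def on_write(self, nbytes):
        self.offset += nbytes

    def on_submit(self, req_id, payload):
        self.pending[req_id] = payload

    def on_complete(self, req_id):
        # A completed request no longer needs to be remembered.
        self.pending.pop(req_id, None)

    def on_config(self, cmd, arg):
        self.config_log.append((cmd, arg))

    def on_alloc(self, obj):
        self.kernel_objects.add(obj)

    def on_free(self, obj):
        self.kernel_objects.discard(obj)


m = PassiveMonitor()
m.on_config("set_volume", 5)
m.on_write(100)
m.on_write(28)
m.on_submit(1, b"pkt")
m.on_submit(2, b"pkt2")
m.on_complete(1)
m.on_alloc("irq")
m.on_alloc("buf")
m.on_free("buf")
```

After this sequence only request 2 is still pending and only `"irq"` is still held, which is the set a recovery step would need to resubmit and reclaim, respectively.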

Active Mode Recovery ► Impersonates driver to kernel and applications ► Recovers driver § Stops failed driver § Reinitializes driver § Transfers state back into driver

Stopping the Failed Driver ► Shadow manager § Signals shadow driver of failure § Switches taps to redirection ► Shadow Driver § Disables hardware device § Garbage collects unnecessary resources

Reinitializing the Driver ► Shadow driver uses cached data section ► Initializes driver ► Reattaches driver to kernel resources ► Reenables hardware resources

Transferring Driver State ► Shadow Driver resubmits any outstanding IO requests § Possible replication of IO § If device cannot handle duplicate IO, request is canceled ► Replays logged configuration commands ► Shadow Driver signals Shadow Manager ► Taps set back to passive mode
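
A minimal sketch of the state-transfer step follows. All names (`transfer_state`, `RestartedDriver`, `set_rate`) are hypothetical. The slide does not fix the relative order of replay and resubmission; replaying configuration before resubmitting requests is an assumption of this sketch, as is treating "device cannot handle duplicate IO" as a single boolean flag.

```python
def transfer_state(driver, config_log, pending, tolerates_duplicates):
    """Replay configuration, then resubmit or cancel outstanding requests.

    Returns (resubmitted_ids, cancelled_ids).
    """
    # Replay logged configuration commands in their original order.
    for cmd, arg in config_log:
        driver.configure(cmd, arg)

    resubmitted, cancelled = [], []
    for req_id, payload in pending.items():
        if tolerates_duplicates:
            # The request may already have reached the device once.
            driver.submit(req_id, payload)
            resubmitted.append(req_id)
        else:
            # Duplicates are unsafe for this device, so cancel instead.
            cancelled.append(req_id)
    return resubmitted, cancelled


class RestartedDriver:
    """Toy stand-in for a freshly reinitialized driver."""

    def __init__(self):
        self.config = {}
        self.queue = []

    def configure(self, cmd, arg):
        self.config[cmd] = arg

    def submit(self, req_id, payload):
        self.queue.append(req_id)


log = [("set_rate", 44100), ("set_rate", 48000)]
outstanding = {7: b"x", 8: b"y"}

drv = RestartedDriver()
done, dropped = transfer_state(drv, log, outstanding, True)

strict = RestartedDriver()
done2, dropped2 = transfer_state(strict, log, outstanding, False)
```

Replaying the whole log rather than only the last value per command keeps the sketch simple; the later `set_rate` overrides the earlier one in both runs.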

Proxying of Requests ► Depends on driver mechanics and interface ► Possible actions § Respond with recorded information § Silently drop request § Queue request for later § Block request § Report driver busy
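
The five possible actions can be sketched as a per-request-type policy table. The request kinds (`get_status`, `write`, and so on), the string return markers, and the choice of "busy" as the default are all hypothetical; a real shadow driver would suspend the caller for the block action, which this user-space sketch only marks.

```python
from collections import deque
from enum import Enum


class Action(Enum):
    RESPOND_RECORDED = 1  # answer from information recorded in passive mode
    DROP = 2              # silently drop the request
    QUEUE = 3             # hold the request for later delivery
    BLOCK = 4             # make the caller wait until recovery completes
    BUSY = 5              # report that the driver is busy


class ProxyingShadow:
    """Hypothetical active-mode request handler."""

    def __init__(self, policy, recorded):
        self.policy = policy      # request kind -> Action
        self.recorded = recorded  # request kind -> recorded reply
        self.queued = deque()

    def handle(self, kind, payload=None):
        action = self.policy.get(kind, Action.BUSY)  # assumed default
        if action is Action.RESPOND_RECORDED:
            return self.recorded[kind]
        if action is Action.DROP:
            return None
        if action is Action.QUEUE:
            self.queued.append((kind, payload))
            return "queued"
        if action is Action.BLOCK:
            return "would-block"  # marker only; no real blocking here
        return "busy"


shadow = ProxyingShadow(
    policy={
        "get_status": Action.RESPOND_RECORDED,
        "play_sample": Action.DROP,
        "write": Action.QUEUE,
        "read": Action.BLOCK,
    },
    recorded={"get_status": "link-up"},
)

replies = [
    shadow.handle("get_status"),
    shadow.handle("play_sample", b"\x00"),
    shadow.handle("write", b"data"),
    shadow.handle("read"),
    shadow.handle("reset"),
]
```

Keeping the policy in a table rather than in code reflects the slide's point that the right action depends on the driver's mechanics and interface.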

Limitations ► Requires dynamic loading and unloading ► Requires explicit communication channels § DMA doesn’t work ► Assumes driver failure has no external effects ► Requires effective isolation and protection service ► Cannot make real-time guarantees

Evaluation ► Performance § Overhead during passive mode ► Fault-Tolerance § Does it work ► Limitations § How many failures can be dealt with ► Code Size § Amount of kernel modification needed ► Either the advisor is a jerk or the grad students need a social life

Tested Drivers

Tested Applications

Performance ► Three configurations § Linux-Native: Stock kernel § Linux-Nooks: kernel protection § Linux-SD: Shadow driver implementation ► No additional penalty vs Linux-Nooks ► Only 1-3% performance hit vs Linux-Native

Relative Performance

CPU Utilization

Fault Tolerance ► Bugs culled from bug-fixes posted to the linux-kernel mailing list ► Bugs were replicated inside each driver ► Placed bugs in rarely taken paths § Unusual hardware conditions ► Forced driver to take unusual path ► What is the difference between that and adding a faulting ioctl?

Fault Tolerance

Recovery Behavior ► Not completely seamless ► Noticeable gap during recovery ► Possible temporary data loss

Limitations ► How do shadow drivers perform with non-fail-stop errors? ► Large scale fault injection experiments ► Cases § Failure detected ► Recovery hidden from application? § Failure not detected

What would you do for a Ph.D.? ► “In total, we ran 2100 trials across the three drivers and six applications. Between trials, we reset the system and reloaded the driver. For each trial, we injected five random errors into the driver while the application was using it. We ensured the errors were transient by removing them during recovery. After injection, we visually observed the impact on the application and the system to determine whether a failure or recovery had occurred.”

Undetected Failures ► 3 Cases § IO requests that never complete § Driver <-> Device interaction § Certain bad parameters/return codes ► Need better understanding of driver semantics

Fault Outcomes

Code Size
