Tolerating Hardware Device Failures in Software
Asim Kadav, Matthew J. Renzelmann, Michael M. Swift
University of Wisconsin-Madison

Current state of OS-hardware interaction
• Many Linux device drivers assume device perfection
  » Common Linux network driver: 3c59x.c

    while (ioread16(ioaddr + Wn7_MasterStatus) & 0x8000)
        ;    /* HANG! */

Hardware dependence bug: Device malfunction can crash the system
9/18/2020 Tolerating Hardware Device Failures in Software

Current state of OS-hardware interaction
• Hardware dependence bugs present across driver classes

    void hptiop_iop_request_callback(...) {
        arg = readl(...);
        ...
        if (readl(&req->result) == IOP_SUCCESS) {
            arg->result = HPT_IOCTL_OK;
        }
    }

Highpoint SCSI driver (hptiop.c)
*Code simplified for presentation purposes

How do the hardware bugs manifest?
• Drivers often trust hardware to work correctly
  » Drivers use device data in critical control and data paths
  » Drivers do not report device malfunctions to system log
  » Drivers do not detect or recover from device failures

Carburizer
• Goal: Tolerate hardware device failures in software through hardware failure detection and recovery
• Static analysis tool: analyze and insert code to:
  » Detect and fix hardware dependence bugs
  » Detect and generate missing error reporting information
• Runtime
  » Handle interrupt failures
  » Transparently recover from failures

Outline
• Background
• Hardening drivers
• Reporting errors
• Conclusion

Hardware unreliability
• Sources of hardware misbehavior:
  » Device wear-out, insufficient burn-in
  » Bridging faults
  » Electromagnetic radiation
  » Firmware bugs
• Result of misbehavior:
  » Corrupted/stuck-at inputs
  » Timing errors/unpredictable DMA
  » Interrupt storms/missing interrupts

Vendor recommendations for driver developers

Category     Recommendation summary
Validation   Input validation
             Read once & CRC data
             DMA protection
Timing       Infinite polling
             Stuck interrupt
             Lost request
             Avoid excess delay in OS
             Unexpected events
Reporting    Report all failures
Recovery     Handle all failures
             Cleanup correctly
             Do not crash on failure
             Wrap I/O memory access

Recommended by: Intel, Sun, MS, Linux (per-vendor check marks not preserved)

Goal: Automatically implement as many recommendations as possible in commodity drivers

Carburizer architecture
[Figure: Driver source → Carburizer (hardware dependence bug detection; outputs a list of bugs) → Compiler → Hardened Driver Binary. The Carburizer Runtime (recovery and detection of interrupt issues) sits at the OS kernel interface; the hardened driver talks to Faulty Hardware.]

Outline
• Background
• Hardening drivers
  » Finding sensitive code
  » Repairing code
• Reporting errors
• Conclusion

Hardening drivers
• Goal: Remove hardware dependence bugs
  » Find driver code that uses data from device
  » Ensure driver performs validity checks
• Carburizer detects and fixes hardware bugs from
  » Infinite polling
  » Unsafe static/dynamic array reference
  » Unsafe pointer dereferences
  » System panic calls on non-debug path

Hardening drivers
• Finding sensitive code
  » First pass: Identify variables that contain data from the device
  » We call these tainted variables

Finding sensitive code
First pass: Identify tainted variables

    int test() {
        a = readl();
        b = inb();
        c = b;
        d = c + 2;
        return d;
    }

    int set() {
        e = test();
    }

Tainted variables: a, b, c, d, test(), e
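The first pass on this slide can be pictured as a fixed-point propagation over assignments. The sketch below is only an illustration under assumptions, not Carburizer's implementation: the `struct assignment` model, the `SRC_DEVICE` marker, and the variable ids are all hypothetical, and each statement is reduced to "destination reads from one source".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ids for the variables on the slide. */
enum { VAR_A, VAR_B, VAR_C, VAR_D, VAR_TEST_RET, VAR_E, VAR_COUNT };

/* Marker for a value that comes straight from the device (readl(), inb()). */
#define SRC_DEVICE (-1)

/* One simplified statement: dst is computed from src. */
struct assignment {
    int dst;
    int src;
};

/* The slide's example program, encoded as data. */
static const struct assignment slide_example[] = {
    { VAR_A, SRC_DEVICE },     /* a = readl();  */
    { VAR_B, SRC_DEVICE },     /* b = inb();    */
    { VAR_C, VAR_B },          /* c = b;        */
    { VAR_D, VAR_C },          /* d = c + 2;    */
    { VAR_TEST_RET, VAR_D },   /* return d;     */
    { VAR_E, VAR_TEST_RET },   /* e = test();   */
};

/* Mark every variable that can hold device data; repeat until nothing changes. */
void propagate_taint(const struct assignment *stmts, size_t n,
                     bool tainted[VAR_COUNT])
{
    bool changed = true;

    for (int i = 0; i < VAR_COUNT; i++)
        tainted[i] = false;

    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            bool src_tainted = stmts[i].src == SRC_DEVICE ||
                               tainted[stmts[i].src];
            if (src_tainted && !tainted[stmts[i].dst]) {
                tainted[stmts[i].dst] = true;
                changed = true;
            }
        }
    }
}
```

Running `propagate_taint` over `slide_example` marks all six entries, which matches the tainted set listed on the slide. The loop repeats so that the result does not depend on statement order.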

Detecting risky uses of tainted variables
• Second pass: Finding hardware dependence bugs
  » Identify risky uses of tainted variables
• Example: Infinite polling
  » Driver waiting for device to enter particular state
  » Solution: Detect loops where all terminating conditions depend on tainted variables
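The loop rule on this slide can be written down as a small predicate. This is a hedged sketch rather than the tool's code: the `struct exit_condition` representation and the function names are assumptions made for illustration, and a real analysis would work on the compiler's intermediate form.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_OPERANDS 4

/* Hypothetical model of one loop exit condition: the variable ids it reads. */
struct exit_condition {
    int operands[MAX_OPERANDS];
    size_t n_operands;
};

/* A condition is tainted if any variable it reads is tainted. */
static bool condition_is_tainted(const struct exit_condition *c,
                                 const bool *tainted)
{
    for (size_t i = 0; i < c->n_operands; i++)
        if (tainted[c->operands[i]])
            return true;
    return false;
}

/* Flag the loop only when every way out depends on device data. */
bool loop_is_risky(const struct exit_condition *conds, size_t n,
                   const bool *tainted)
{
    if (n == 0)
        return false;   /* no exit conditions to inspect in this model */

    for (size_t i = 0; i < n; i++)
        if (!condition_is_tainted(&conds[i], tainted))
            return false;   /* at least one exit is device-independent */
    return true;
}
```

Under this model, a loop whose only exit tests a device register is flagged, while the same loop with an additional exit on an untainted counter is not.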

Example: Infinite polling
Tainted variables used for critical timing decisions

    static int amd8111e_read_phy(...) {
        ...
        reg_val = readl(mmio + PHY_ACCESS);
        while (reg_val & PHY_CMD_ACTIVE)
            reg_val = readl(mmio + PHY_ACCESS);
    }

AMD 8111e network driver (amd8111e.c)
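One way to harden a polling loop like this is to bound the number of reads. The following user-space sketch assumes a simple retry counter; `POLL_LIMIT`, the `PHY_CMD_ACTIVE` value, the `reg_read_fn` callback, and the two stub devices are hypothetical stand-ins so the example compiles outside the kernel, and they are not taken from the amd8111e driver.

```c
#include <stddef.h>
#include <stdint.h>

#define PHY_CMD_ACTIVE 0x80000000u  /* illustrative bit, not the driver's value */
#define POLL_LIMIT     1000         /* hypothetical retry budget */

/* Stand-in for readl(): reads the status register of some device. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/* Returns 0 once the busy bit clears, -1 if it is still set after POLL_LIMIT reads. */
int poll_until_idle(reg_read_fn read_reg, void *ctx)
{
    for (int tries = 0; tries < POLL_LIMIT; tries++) {
        if (!(read_reg(ctx) & PHY_CMD_ACTIVE))
            return 0;
    }
    return -1;  /* device never left the busy state: give up instead of hanging */
}

/* Stub device that stays busy for *ctx reads, then goes idle. */
uint32_t healthy_device(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return PHY_CMD_ACTIVE;
    }
    return 0;
}

/* Stub device with a stuck-at busy bit. */
uint32_t stuck_device(void *ctx)
{
    (void)ctx;
    return PHY_CMD_ACTIVE;
}
```

With `healthy_device` the call returns 0 after a few reads; with `stuck_device` it returns -1 after the budget is spent, so the caller can report the failure and start recovery. A kernel driver would more likely bound the loop by elapsed time than by iteration count.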

Not all bugs are obvious

    while (DAC960_PD_StatusAvailableP(ControllerBaseAddress)) {
        DAC960_V1_CommandIdentifier_T CommandIdentifier =
            DAC960_PD_ReadStatusCommandIdentifier(ControllerBaseAddress);
        DAC960_Command_T *Command = Controller->Commands[CommandIdentifier-1];
        DAC960_V1_CommandMailbox_T *CommandMailbox = &Command->V1.CommandMailbox;
        DAC960_V1_CommandOpcode_T CommandOpcode = CommandMailbox->Common.CommandOpcode;
        Command->V1.CommandStatus = DAC960_PD_ReadStatusRegister(ControllerBaseAddress);
        DAC960_PD_AcknowledgeInterrupt(ControllerBaseAddress);
        DAC960_PD_AcknowledgeStatus(ControllerBaseAddress);
        switch (CommandOpcode) {
        case DAC960_V1_Enquiry_Old:
            DAC960_P_To_PD_TranslateReadWriteCommand(CommandMailbox);
            ...
        }
    }

DAC960 RAID controller (DAC960.c)

Detecting risky uses of tainted variables
• Example II: Unsafe array accesses
  » Tainted variables used as array index into static or dynamic arrays
  » Tainted variables used as pointers

Example: Unsafe array accesses
Tainted variables used to index kernel memory without checks

    static void __init attach_pas_card(...) {
        if ((pas_model = pas_read(0xFF88))) {
            ...
            sprintf(temp, "%s rev %d",
                    pas_model_names[(int) pas_model], pas_read(0x2789));
            ...
        }
    }

Pro Audio Sound driver (pas2_card.c)
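The missing check here is a range test on the device-supplied index before it touches the table. The sketch below shows that idea in isolation; `model_names`, its contents, and `lookup_model_name` are hypothetical and do not reproduce the table in pas2_card.c.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical name table standing in for the driver's static array. */
static const char *const model_names[] = {
    "unknown", "model A", "model B", "model C",
};

/*
 * Look up a name for an id read from the device.
 * Returns NULL when the id falls outside the table, so a corrupted
 * register value cannot index arbitrary kernel memory.
 */
const char *lookup_model_name(int device_model)
{
    if (device_model < 0 || (size_t)device_model >= ARRAY_SIZE(model_names))
        return NULL;
    return model_names[device_model];
}
```

A caller would treat a NULL result as a device failure, for example by logging it and aborting the probe, instead of formatting the string.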

Analysis results over the Linux kernel
• Analyzed drivers in 2.6.18.8 Linux kernel
  » 6300 driver source files
  » 2.8 million lines of code
  » 37 minutes to analyze and compile code
• Additional analyses to detect existing validation code
• Re-ran analysis for 2.6.37.6 Linux kernel

Analysis results over Linux 2.6.18.8

Driver class   Infinite polling   Static array   Dynamic array   Panic calls
net                  117                2              21              2
scsi                 298               31              22            121
sound                 64                1               0              2
video                174                0              22             22
other                381                9              57             32
Total                860               43              89            179
2.6.37.6            1164               55             156            214

• Found 992 bugs in driver code with 7.4% false positive rate (manual sampling of 190 bugs)

Repairing drivers
• Hardware dependence bugs difficult to test
• Carburizer automatically generates repair code
  » Inserts timeout code for infinite loops
  » Inserts checks for unsafe array/pointer references
  » Replaces calls to panic() with recovery service
  » Triggers generic recovery service on device failure
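The panic() replacement listed on this slide can be illustrated with a small before/after pattern. This is a sketch under assumptions: `request_recovery`, `struct device_ctx`, and `STATUS_FATAL` are hypothetical names, and the real recovery service is not described on the slides beyond its existence.

```c
#include <stdint.h>

#define STATUS_FATAL 0x1u   /* hypothetical "device reported fatal error" bit */

enum device_state { DEVICE_OK, DEVICE_NEEDS_RECOVERY };

struct device_ctx {
    enum device_state state;
};

/* Hypothetical stand-in for a generic recovery service: just record the request. */
void request_recovery(struct device_ctx *dev)
{
    dev->state = DEVICE_NEEDS_RECOVERY;
}

/* Returns 0 when the status is fine, -1 when recovery was requested. */
int handle_status(struct device_ctx *dev, uint32_t status)
{
    if (status & STATUS_FATAL) {
        /* Before hardening, this branch would have called panic(). */
        request_recovery(dev);
        return -1;
    }
    return 0;
}
```

The design point is that a bad value from one device becomes an error return plus a recovery request for that device, so the rest of the system keeps running.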

Outline
• Background
• Hardening drivers
• Reporting errors
• Conclusion

Reporting errors
• Drivers often fail silently and fail to report device errors
  » Drivers should proactively report device failures
  » Fault management systems require these inputs
• Driver already detects failures but does not report them
• Carburizer analysis performs two functions
  » Detect when there is a device failure
  » Report unless the driver is already reporting the failure

Detecting driver-detected device failures
• Detect code that depends on tainted variables
  » Performs unreported loop timeouts
  » Returns negative error constants
  » Jumps to common cleanup code

    while (ioread16(regA) == 0x0f) {
        if (timeout++ == 200) {
            sys_report("Device timed out %s.\n", mod_name);  /* reporting code added by Carburizer */
            return (-1);
        }
    }

Detecting existing reporting code
• Carburizer detects function calls with string arguments

    static u16 gm_phy_read(...) {
        ...
        if (__gm_phy_read(...))
            printk(KERN_WARNING "%s: ...\n", ...);  /* Carburizer detects existing reporting code */
        ...
    }

SysKonnect network driver (skge.c)

Evaluation
• Fixed 1135 cases of unreported timeouts and 467 cases of unreported device failures in Linux drivers
• Evaluation: Manual analysis of drivers of different classes

Driver     Class     Carburizer reported / Driver detected device failures
bnx2       network   17/24
mptbase    scsi      17/28
ens1371    sound     9/10

• No false positives

Carburizer automatically improves the fault diagnosis capabilities of the system

Conclusion

Category     Recommendation summary
Validation   Input validation
             Read once & CRC data
             DMA protection
Timing       Infinite polling
             Stuck interrupt
             Lost request
             Avoid excess delay in OS
             Unexpected events
Reporting    Report all failures
Recovery     Handle all failures
             Cleanup correctly
             Do not crash on failure
             Wrap I/O memory access

Columns: Recommended by Intel, Sun, MS, Linux; Carburizer ensures (check marks not preserved)

Carburizer improves system reliability by automatically ensuring that hardware failures are tolerated in software

Thank You
• Contact for driver verification/tool access
  » kadav@cs.wisc.edu
• Details on Carburizer
  » http://cs.wisc.edu/~kadav/carb/
[Figure: Carburizer architecture diagram]