
Triage: Diagnosing Production Run Failures at the User's Site
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou
University of Illinois at Urbana-Champaign

Motivation
• Software failures are a major contributor to system downtime.
• Security holes.
• Software has grown in size, complexity and cost.
• Software testing has become more difficult.
• Software packages inevitably contain bugs (even production ones).

Motivation
• Result: software failures during production runs at the user's site.
• One solution: offsite software diagnosis. Its problems:
  • Difficult to reproduce failure-triggering conditions.
  • Cannot provide timely online recovery (e.g. from fast Internet worms).
  • Programmers cannot be provided to every site.
  • Privacy concerns.

Goal: automatically diagnose software failures occurring during production runs at end-user sites.
• Understand a failure that has happened.
• Find the root causes.
• Minimize manual debugging.

Current state of the art
Offsite diagnosis:
  • Interactive debuggers.
  • Program slicing.
  • Deterministic replay tools.
  • Core dump analysis (partial execution path construction).
Primitive onsite diagnosis:
  • Unprocessed failure information collection.
Large overhead makes these impractical for production sites. All require manual analysis. Privacy concerns.

Onsite Diagnosis
• Efficiently reproduce the occurred failure (i.e. fast and automatically).
• Impose little overhead during normal execution.
• Require no human involvement.
• Require no prior knowledge.

Triage
• Capturing the failure point and conducting just-in-time failure diagnosis with checkpoint-reexecution.
• Delta generation and delta analysis.
• Automated, top-down, human-like software failure diagnosis protocol.
• Reports:
  • Failure nature and type.
  • Failure-triggering conditions.
  • Failure-related code/variables and the fault propagation chain.

Triage Architecture
Three groups of components:
1. Runtime group.
2. Control group.
3. Analysis group.

Checkpoint & Reexecution
• Uses Rx (previous work by the authors).
• Rx checkpointing:
  • Uses fork()-like operations.
  • Keeps a copy of accessed files and file pointers.
  • Records messages using a network proxy.
• The replay may be deliberately modified (re-execution need not be faithful).
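
To make the fork()-based idea concrete, here is a minimal Python sketch of checkpoint and rollback in the spirit of Rx. It is an illustrative assumption, not Rx's actual implementation: Rx additionally snapshots file state and buffers network messages in a proxy, which this sketch omits.

```python
import os
import signal

# Block SIGUSR1 up front; the mask is inherited by snapshot children, so a
# wake-up signal can never kill a snapshot before it starts waiting.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})

checkpoints = []  # PIDs of suspended copy-on-write snapshot processes

def checkpoint():
    """Take a fork()-based checkpoint; returns True when re-executing."""
    pid = os.fork()
    if pid == 0:                          # child: a frozen snapshot of memory
        signal.sigwait({signal.SIGUSR1})  # sleep until a rollback wakes us
        return True                       # now re-executing from this point
    checkpoints.append(pid)               # parent: remember the snapshot
    return False

def rollback():
    """Abandon the current (failed) execution and resume the last snapshot."""
    os.kill(checkpoints.pop(), signal.SIGUSR1)
    os._exit(1)
```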

Lightweight Monitoring for Detecting Failures
• Must not impose high overhead.
• Cheapest way: catch fault traps:
  • Assertions.
  • Access violations.
  • Divide by zero.
  • More…
• Extensions: branch histories, system call traces…
• Triage only uses exceptions and assertions.
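
As an illustration of how cheap fault-trap monitoring can be, Python's standard faulthandler module installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS, and SIGILL and dumps a traceback when one fires; assertions mark failure points the same way. Triage itself hooks the corresponding OS traps natively; this is just the idea in miniature.

```python
import faulthandler

# One call installs the fault-trap handlers; until a trap fires, the cost
# during normal execution is essentially zero.
faulthandler.enable()

def withdraw(balance, amount):
    # Assertions are the other failure signal Triage consumes: a firing
    # assert pinpoints the failure site much like a hardware trap does.
    assert amount <= balance, "failure point: overdraw"
    return balance - amount
```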

Control Layer
• Implements the Triage diagnosis protocol.
• Controls reexecutions with different inputs based on past results.
• Chooses the analysis technique.
• Collects results and sends them to off-site programmers.

Analysis Layer Techniques:

TDP: Triage Diagnosis Protocol
Example flow:
1. Simple replay: deterministic bug.
2. Coredump analysis: stack/heap OK; segmentation fault in strlen().
3. Dynamic null-pointer-dereference bug detection.
4. Delta generation: collection of good and bad inputs.
5. Delta analysis: code paths leading to the fault.
6. Report.
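
The protocol can be read as a short driver loop. Below is a hypothetical Python sketch of that control flow; the helper callables (replay, analyze_coredump, run_bug_detectors, delta_generation, delta_analysis) are assumed names standing in for the components described on the surrounding slides, not Triage's actual API.

```python
def tdp(checkpoint, failure, replay, analyze_coredump,
        run_bug_detectors, delta_generation, delta_analysis):
    """Top-down diagnosis driver: each step narrows the search, as a human would."""
    report = {}
    # Step 1: a plain re-execution from the checkpoint classifies determinism.
    report["deterministic"] = (replay(checkpoint) == failure)
    # Step 2: coredump analysis inspects stack/heap integrity at the trap.
    report["memory_state"] = analyze_coredump(failure)
    # Step 3: heavier dynamic bug detectors run only during re-execution,
    # so normal runs stay cheap.
    report["bug_type"] = run_bug_detectors(checkpoint)
    # Steps 4-5: collect nearby good/bad runs, then diff them.
    good_runs, bad_runs = delta_generation(checkpoint)
    report["fault_paths"] = delta_analysis(good_runs, bad_runs)
    return report
```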

TDP: Triage Diagnosis Protocol (example report)

Protocol Extensions and Variations
• Add different debugging techniques.
• Reorder diagnosis steps.
• Omit steps (e.g. memory checks for Java programs).
• The protocol may be custom-designed for specific applications.
• Try to fix bugs:
  • Filter failure-triggering inputs.
  • Dynamically delete code (risky).
  • Change variable values.
  • Automatic patch generation: future work?

Delta Generation
Two goals:
1. Generate many similar replays: some that fail and some that don't.
2. Identify signatures of failure-triggering inputs.
• Signatures may be used for:
  • Failure analysis and reproduction.
  • Input filtering (e.g. Vigilante, Autograph, etc.).

Delta Generation
Changing the input (see the sketch after this list):
• Replay previously stored client requests via the proxy; try different subsets and combinations.
• Isolate the bug-triggering part via data "fuzzing".
• Find non-failing inputs with minimum distance from failing ones.
• Make protocol-aware changes.
• Use a "normal form" of the input, if the specific triggering portion is known.
Changing the environment:
• Pad or zero-fill new allocations.
• Change message order.
• Drop messages.
• Manipulate thread scheduling.
• Modify the system environment.
• Use information from prior steps (e.g. target specific buffers).
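
A hypothetical sketch of the input side of delta generation follows; triggers_failure is an assumed predicate that replays a candidate request list from the checkpoint and reports whether the failure recurs.

```python
def generate_replays(requests, triggers_failure):
    """Collect similar failing and non-failing request sequences."""
    failing, passing = [], []
    # Subsetting: drop one recorded request at a time.
    for i in range(len(requests)):
        subset = requests[:i] + requests[i + 1:]
        (failing if triggers_failure(subset) else passing).append(subset)
    # Fuzzing: zero one byte of the last request to isolate the trigger.
    last = requests[-1]
    for i in range(len(last)):
        mutated = requests[:-1] + [last[:i] + b"\x00" + last[i + 1:]]
        (failing if triggers_failure(mutated) else passing).append(mutated)
    return failing, passing
```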

Delta Generation
Results passed to the next stage:
• Break the code into basic blocks.
• For each replay, extract a vector of execution counts per block, plus the block trace.
• The granularity can be changed.

Example Revisited
Good run:
  Trace: AHIKBDEFEF…EG
  Block vector: {A: 1, B: 1, D: 1, E: 11, F: 10, G: 1, H: 1, I: 1, K: 1}
Bad run:
  Trace: AHIJBCDE
  Block vector: {A: 1, B: 1, C: 1, D: 1, E: 1, H: 1, I: 1, J: 1}
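
With one letter per basic block, the two block vectors above can be reproduced directly. A small sketch; the good trace is assembled from the counts on this slide (ten EF repetitions plus the final EG give E: 11, F: 10):

```python
from collections import Counter

good = Counter("AHIKBD" + "EF" * 10 + "EG")  # {'E': 11, 'F': 10, 'A': 1, ...}
bad = Counter("AHIJBCDE")                    # every block executed exactly once
```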

Delta Analysis
Follows three steps:
1. Basic block vector (BBV) comparison: find the most similar pair of failing and non-failing replays, F and S.
2. Path comparison: compare the execution paths of F and S.
3. Intersection with the backward slice: find the difference that contributes to the failure.

Delta Analysis: BBV Comparison
• The number of times each block is executed is recorded using instrumentation.
• Calculate the Manhattan distance between every pair of failing and non-failing replays (one can relax the minimum requirement and settle for similar pairs).
• In the example, the difference is {C: -1, E: 10, F: 10, G: 1, J: -1, K: 1}, giving a Manhattan distance of 24 (reproduced in the sketch below).
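
A minimal, self-contained implementation of the comparison on the running example's vectors:

```python
from collections import Counter

good = Counter("AHIKBD" + "EF" * 10 + "EG")
bad = Counter("AHIJBCDE")

def manhattan(v1, v2):
    """Sum of absolute per-block count differences between two replays."""
    return sum(abs(v1[b] - v2[b]) for b in set(v1) | set(v2))

delta = {b: good[b] - bad[b] for b in set(good) | set(bad) if good[b] != bad[b]}
print(delta)                 # {'C': -1, 'E': 10, 'F': 10, 'G': 1, 'J': -1, 'K': 1}
print(manhattan(good, bad))  # 24, matching the slide
```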

Delta Analysis: Path Comparison
• Consider execution order.
• Find where the failing and non-failing runs diverge.
• Compute the minimum edit distance, i.e. the minimum number of insertion, deletion, and substitution operations needed to transform one trace into the other.
• Example: the sketch below computes this distance for the running example's traces.
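
The textbook dynamic program for this distance, as a sketch. Triage uses the O(ND) algorithm noted on the efficiency slide; this quadratic version just spells out the definition, and on the running example's traces it yields 23.

```python
def edit_distance(a, b):
    """Minimum insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete everything from a
    for j in range(n + 1):
        d[0][j] = j                  # insert everything from b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # deletion
                          d[i][j - 1] + 1,                          # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # match/sub
    return d[m][n]

good_trace = "AHIKBD" + "EF" * 10 + "EG"
bad_trace = "AHIJBCDE"
print(edit_distance(good_trace, bad_trace))  # 23
```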

Delta Analysis: Backward Slicing
• We want to eliminate differences that have no effect on the failure.
• Dynamic backward slicing extracts a program slice consisting of all and only those instructions that lead to a given instruction's execution.
• The starting point may be supplied by earlier steps of the protocol.
• Overhead is acceptable in post-hoc analysis.
• Optimization: dynamically build dependencies during replays.
• Experiments show that the overhead is acceptably low.
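
A compact sketch of a dynamic backward slice over a recorded dependence graph; deps is an assumed map from each executed instruction to the instructions it depends on (data and control), built during re-execution as the slide describes, and the instruction labels in the usage example are hypothetical.

```python
from collections import deque

def backward_slice(deps, failure_inst):
    """All and only the executed instructions the failure depends on."""
    slice_set = {failure_inst}
    work = deque([failure_inst])
    while work:
        inst = work.popleft()
        for dep in deps.get(inst, ()):   # producers of values inst read,
            if dep not in slice_set:     # plus branches that let it run
                slice_set.add(dep)
                work.append(dep)
    return slice_set

def prune(differences, slice_set):
    """Keep only path differences that can actually influence the failure."""
    return [d for d in differences if d in slice_set]

# Hypothetical three-instruction chain: the crash depends on both others.
deps = {"crash": ["p = lookup(k)"], "p = lookup(k)": ["k = parse(input)"]}
print(backward_slice(deps, "crash"))
```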

Backward Slicing and Result Intersection

Limitations and Extensions
• Need to define a privacy policy for the results sent to programmers.
• Very limited success with patch generation.
• Does not handle memory leaks well.
• The failure must occur; does not handle incorrect operation.
• Difficult to reproduce bugs that take a long time to manifest.
• No support for deterministic replay on multiprocessor architectures.
• False positives.

Evaluation Methodology
• Experimented with 10 real software failures in 9 applications.
• Triage is implemented in Linux (kernel 2.4.22).
• Hardware: 2.4 GHz Pentium 4, 512 KB L2 cache, 1 GB memory, 100 Mbps Ethernet.
• Triage checkpoints every 200 ms and keeps 20 checkpoints.
• User study: 15 programmers were given 5 bugs, with Triage's report provided for some of the bugs. Time to locate each bug was compared with and without the report.

Bugs Used for Evaluation
• Apache1: apache-1.3.27, a web server, 114K LOC. Stack smash: a long alias match pattern overflows a local array.
• Apache2: apache-1.3.12, a web server, 102K LOC. Semantic (NULL ptr): a missing part of the URL causes a NULL pointer dereference.
• CVS: cvs-1.11.4, GNU version control server, 115K LOC. Double free: error-handling code placed in the wrong order leads to a double free.
• MySQL: mysql-4.0.12, a database server, 1028K LOC. Data race: database logging error in case of a data race.
• Squid: squid-2.3, a web proxy cache server, 94K LOC. Heap buffer overflow: buffer length calculation misses special-character cases.
• BC: bc-1.06, interactive algebraic language, 17K LOC. Heap buffer overflow: wrong variable used in a for-loop end condition.
• Linux: linux-extract, extracted from linux-2.6.6, 0.3K LOC. Semantic (copy-paste error): variable identifier not changed after copy-paste.
• MAN: man-1.5h1, documentation tools, 4.7K LOC. Global buffer overflow: wrong for-loop end condition.
• NCOMP: ncompress-1.2.4, file (de)compression, 1.9K LOC. Stack smash: a fixed-length array cannot hold a long input file name.
• TAR: tar-1.13.25, GNU tar archive tool, 27K LOC. Semantic (NULL ptr): a directory property corner case is not well handled.

Experimental Results (no input testing)

Experimental Results
• For application bugs, delta generation only worked for BC and TAR.
• In all cases, Triage correctly diagnoses the nature of the bug (deterministic or non-deterministic).
• In all 6 applicable cases, Triage correctly pinpoints the bug type, buggy instruction, and memory location.
• When delta analysis is applied, it reduces the amount of data to be considered by 63% (best: 98%, worst: 12%).
• For MySQL, it finds an example interleaving pair as the trigger.

Case Study 1: Apache
• Failure at ap_pregsub.
• The bug detector catches a stack smash in lmatcher.
• How can lmatcher affect try_alias_list? The stack smash overwrites the stack frame above it, invalidating r.
• The trace shows how lmatcher is called by try_alias_list.
• The failure is independent of the headers.
• The failure is triggered by requests for a specific resource.

Case Study 2: Squid
• Coredump analysis suggests a heap overflow.
• It happens at a strcat of two buffers.
• Fault propagation shows how the buffers were allocated.
• One buffer has length strlen(user) while the other has strlen(user)*3.
• Input testing gives the failure-triggering input.
• It also gives minimally different non-failing inputs.

Efficiency and Overhead
Normal execution overhead:
• Negligible effect caused by checkpointing.
• In no case over 5%.
• With 400 ms checkpointing intervals, overhead is 0.1%.

Efficiency and Overhead
Diagnosis efficiency:
• Except for delta analysis, all steps are efficient.
• All (other) diagnostic steps finish within 5 minutes.
• Delta analysis time is governed by the edit distance D in the O(ND) computation (N = number of blocks).
• The comparison step of delta analysis may run in the background.

User Study
• Real bugs: on average, programmers took 44.6% less time debugging using Triage reports.
• Toy bugs: on average, programmers took 18.4% less time debugging using Triage reports.

Questions?