bddbddb Using Datalog and BDDs for Program Analysis
bddbddb: Using Datalog and BDDs for Program Analysis. John Whaley, Stanford University and moka5 Inc. June 11, 2006
Implementing Program Analysis vs. … 56 pages! • 2x faster • Fewer bugs • Extensible
Is it really that easy? • Requires: – A different way of thinking – Knowledge, experience, and intuition – Perseverance to try different techniques – A lot of tuning and tweaking – Luck • Despite all this, people who use it swear by it and could “never go back”
Tutorial Structure
Part I: Essential Background – Datalog for Program Analysis – Binary Decision Diagrams
Part II: Using the Tools – bddbddb – Compiler interface (Joeq compiler) – Datalog editor in Eclipse – Interactive mode
Part III: Developing Advanced Analyses – Context sensitivity – Combining multiple analyses – Race detection examples – Using advanced bddbddb features
Part IV: Profiling, Debugging, Avoiding Gotchas – Variable ordering – Iteration order – Machine learning – What it’s good for, what it isn’t good for
Short break every 30 minutes
Try it yourself… • Available as a moka5 LivePC – Non-intrusive installation in a VM – Automatically kept up to date – Easy to try, easy to share – Complete environment on a USB stick
Part I: Essential Background: Program Analysis in Datalog
Datalog • Declarative language for deductive databases [Ullman 1989] – Like Prolog, but no function symbols, no predefined evaluation strategy
Datalog Basics
• Atom = Predicate(arguments), e.g. Reach(d, x, i) – Arguments: variables or constants
• Literal = Atom or NOT Atom
• Rule = Atom :- Literal & … & Literal
– The head (the atom before :-): make this atom true.
– The body (the literals after :-): for each assignment of values to variables that makes all of these true …

Datalog Example
    parent(x, y) :- child(y, x).
    grandparent(x, z) :- parent(x, y), parent(y, z).
    ancestor(x, y) :- parent(x, y).
    ancestor(x, z) :- parent(x, y), ancestor(y, z).

Datalog • Intuition: subgoals in the body are combined by “and” (strictly speaking: “join”). • Intuition: Multiple rules for a predicate (head) are combined by “or.”
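The "and"/"or" intuition can be sketched with plain Python sets. This is only an illustration of the semantics, not how bddbddb evaluates rules; the people named in `child` are made-up sample facts.

```python
# EDB facts for child(y, x): "y is a child of x". Sample data, invented for illustration.
child = {("bob", "alice"), ("carol", "bob")}

# parent(x, y) :- child(y, x).
parent = {(x, y) for (y, x) in child}

# grandparent(x, z) :- parent(x, y), parent(y, z).
# The two subgoals are combined by "and": a join on the shared variable y.
grandparent = {(x, z) for (x, y) in parent for (y2, z) in parent if y == y2}

# ancestor has two rules; their results are combined by "or" (set union).
ancestor = set(parent)  # ancestor(x, y) :- parent(x, y).
while True:
    # ancestor(x, z) :- parent(x, y), ancestor(y, z).
    derived = {(x, z) for (x, y) in parent for (y2, z) in ancestor if y == y2}
    if derived <= ancestor:
        break  # nothing new: fixed point reached
    ancestor |= derived
```

The loop for `ancestor` is needed because the rule is recursive; `grandparent` is computed in a single pass.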
Another Datalog Example
    hasChild(x) :- child(_, x).
    hasNoChild(x) :- !child(_, x).      “!” inverts the relation, not the atom!
    hasSibling(x) :- child(x, y), child(z, y), z!=x.
    onlyChild(x) :- child(x, _), !hasSibling(x).
• _ means “Don’t-care” (at least one)
• ! means “Not”

Reaching Defs in Datalog
    Reach(d, x, j) :- Reach(d, x, i), StatementAt(i, s), !Assign(s, x), Follows(i, j).
    Reach(s, x, j) :- StatementAt(i, s), Assign(s, x), Follows(i, j).
Definition: EDB Vs. IDB Predicates • Some predicates come from the program, and their tuples are computed by inspection. – Called EDB, or extensional database predicates. • Others are defined by the rules only. – Called IDB, or intensional database predicates.
Negation • Negation makes things tricky. • Semantics of negation – No negation allowed [Ullman 1988] – Stratified Datalog [Chandra 1985] – Well-founded semantics [Van Gelder 1991]
Stratification • A risk occurs if there are negated literals involved in a recursive predicate. – Leads to oscillation in the result. • Requirement for stratification: – Must be able to order the IDB predicates so that if a rule with P in the head has NOT Q in the body, then Q is either EDB or earlier in the order than P.
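One way to read the stratification requirement: no negated dependency may lie on a cycle of the predicate dependency graph. The check below is a minimal sketch under that reading; the triple encoding `(head, body_predicate, negated)` is an assumption made for this example and is not bddbddb's internal representation.

```python
def is_stratifiable(deps):
    """deps: iterable of (head, body_predicate, negated) triples, one per body literal."""
    deps = list(deps)
    graph = {}  # edge body_predicate -> head
    for head, body, _ in deps:
        graph.setdefault(body, set()).add(head)

    def reaches(src, dst):
        # Depth-first search: is there a path of one or more edges from src to dst?
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            for succ in graph.get(node, ()):
                if succ == dst:
                    return True
                if succ not in seen:
                    seen.add(succ)
                    stack.append(succ)
        return False

    # A negated edge body -> head lies on a cycle exactly when head can reach body.
    return not any(neg and reaches(head, body) for head, body, neg in deps)


# P(x) :- E(x), !P(x).  Negation inside a recursive cycle.
bad = [("P", "E", False), ("P", "P", True)]
# hasSibling(x) :- child(x, y), child(z, y), z!=x.   onlyChild(x) :- child(x, _), !hasSibling(x).
good = [("hasSibling", "child", False), ("onlyChild", "child", False),
        ("onlyChild", "hasSibling", True)]
```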
Example: Nonstratification
    P(x) :- E(x), !P(x).
• If E(1) is true, is P(1) true? • It is after the first round. • But not after the second. • True after the third, not after the fourth, …
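The oscillation can be reproduced by re-evaluating the rule round by round. This is a small sketch that assumes each round recomputes P from the previous round's P.

```python
E = {1}
P = set()
history = []  # whether P(1) holds after each round

for _ in range(4):
    # P(x) :- E(x), !P(x).  The right-hand side reads the previous round's P.
    P = {x for x in E if x not in P}
    history.append(1 in P)
```

`history` alternates between true and false, so the iteration never reaches a fixed point.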
Iterative Algorithm for Datalog • Start with the EDB predicates = “whatever the code dictates,” and with all IDB predicates empty. • Repeatedly examine the bodies of the rules, and see what new IDB facts can be discovered from the EDB and existing IDB facts.
Datalog evaluation strategy • “Semi-naïve” evaluation – Remember that a new fact can be inferred by a rule in a given round only if it uses in the body some fact discovered on the previous round. • Evaluation strategy – Top-down (goal-directed) [Ullman 1985] – Bottom-up (infer from base facts) [Ullman 1989]
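A minimal sketch of bottom-up, semi-naïve evaluation for a transitive-closure program. The predicate names `edge` and `path` are hypothetical and chosen only for this example.

```python
def semi_naive_closure(edges):
    """path(x, y) :- edge(x, y).   path(x, z) :- edge(x, y), path(y, z)."""
    path = set(edges)    # facts known so far
    delta = set(edges)   # facts discovered in the previous round
    rounds = 0
    while delta:
        rounds += 1
        # Join only against delta: a new fact must use a fact from the previous round.
        derived = {(x, z) for (x, y) in edges for (y2, z) in delta if y == y2}
        delta = derived - path
        path |= delta
    return path, rounds


path, rounds = semi_naive_closure({(1, 2), (2, 3), (3, 4)})
```

A naïve evaluator would join `edges` against all of `path` in every round and rediscover the same facts repeatedly.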
Our Dialect of Datalog • Totally-ordered finite domains – Domains are of a given, finite size – Makes all Datalog programs “safe” – Cannot mix variables of different domains • Constants (named/integers) • Comparison operators: = != < <= > >= • Don’t-care: _ • Universe: *
Why Datalog? • Developed a tool to translate inference rules to BDD implementation • Later, discovered Datalog (Ullman, Reps) • Semantics of BDDs match Datalog exactly – Obvious implementation of relations – Operations occur a set-at-a-time – Fast set compare, set difference – Wealth of literature about semantics, optimization, etc.
Inference Rules
    Assign(v1, v2)    vPointsTo(v2, o)
    ----------------------------------
             vPointsTo(v1, o)

    vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).
• Datalog rules directly correspond to inference rules.
Flow-Insensitive Pointer Analysis
    o1: p = new Object();
    o2: q = new Object();
        p.f = q;
        r = p.f;
Input Tuples: vPointsTo(p, o1), vPointsTo(q, o2), Store(p, f, q), Load(p, f, r)
Output Relations: hPointsTo(o1, f, o2), vPointsTo(r, o2)
[Figure: points-to graph over p, q, r, o1, o2 with an f edge]

Inference Rule in Datalog
Assignments (v1 = v2;):
    vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).

Inference Rule in Datalog
Stores (v1.f = v2;):
    hPointsTo(o1, f, o2) :- Store(v1, f, v2), vPointsTo(v1, o1), vPointsTo(v2, o2).

Inference Rule in Datalog
Loads (v2 = v1.f;):
    vPointsTo(v2, o2) :- Load(v1, f, v2), vPointsTo(v1, o1), hPointsTo(o1, f, o2).

The Whole Algorithm
    vPointsTo(v, o) :- vPointsTo0(v, o).
    vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).
    hPointsTo(o1, f, o2) :- Store(v1, f, v2), vPointsTo(v1, o1), vPointsTo(v2, o2).
    vPointsTo(v2, o2) :- Load(v1, f, v2), vPointsTo(v1, o1), hPointsTo(o1, f, o2).
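The four rules can be run as a naive fixed-point loop over Python sets. This is a sketch of the rule semantics only; bddbddb itself represents the relations as BDDs. The function name `points_to` is hypothetical, and the input is the four-statement program from the flow-insensitive example.

```python
def points_to(vP0, assign, store, load):
    vP = set(vP0)   # vPointsTo(v, o) :- vPointsTo0(v, o).
    hP = set()
    while True:
        new_vP, new_hP = set(vP), set(hP)
        # vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).
        new_vP |= {(v1, o) for (v1, v2) in assign for (v, o) in vP if v == v2}
        # hPointsTo(o1, f, o2) :- Store(v1, f, v2), vPointsTo(v1, o1), vPointsTo(v2, o2).
        new_hP |= {(o1, f, o2) for (v1, f, v2) in store
                   for (a, o1) in vP if a == v1
                   for (b, o2) in vP if b == v2}
        # vPointsTo(v2, o2) :- Load(v1, f, v2), vPointsTo(v1, o1), hPointsTo(o1, f, o2).
        new_vP |= {(v2, o2) for (v1, f, v2) in load
                   for (a, o1) in vP if a == v1
                   for (h, g, o2) in hP if h == o1 and g == f}
        if new_vP == vP and new_hP == hP:
            return vP, hP  # fixed point
        vP, hP = new_vP, new_hP


# o1: p = new Object();  o2: q = new Object();  p.f = q;  r = p.f;
vP, hP = points_to(
    vP0={("p", "o1"), ("q", "o2")},
    assign=set(),
    store={("p", "f", "q")},
    load={("p", "f", "r")},
)
```

The result reproduces the output relations listed for the example: `hPointsTo(o1, f, o2)` and `vPointsTo(r, o2)`.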
Format of a Datalog file
• Domains: Name Size ( map file )
    V 65536 var.map
    H 32768
• Relations: Name ( <attribute list> ) flags
    Store (v1 : V, f : F, v2 : V)
    PointsTo (v : V, h : H)
    flags: input, output
• Rules: Head :- Body.
    PointsTo(v1, h) :- Assign(v1, v), PointsTo(v, h).
Key Point • Program information is stored in a relational database. – Everything in the program is numbered. • Write declarative inference rules to infer new facts about the program. • Negations OK if they are not in a recursive cycle.
Take a break… (Next up: Binary Decision Diagrams)
Part I: Essential Background: Binary Decision Diagrams
Call graph relation • Call graph expressed as a relation. – Five edges: Calls(A, B), Calls(A, C), Calls(A, D), Calls(B, D), Calls(C, D) [Figure: call graph over A, B, C, D]

Call graph relation • Relation expressed as a binary function. – A=00, B=01, C=10, D=11
    Calls(A, B) → 00 01
    Calls(A, C) → 00 10
    Calls(A, D) → 00 11
    Calls(B, D) → 01 11
    Calls(C, D) → 10 11

Call graph relation • Relation expressed as a binary function. – A=00, B=01, C=10, D=11
    from     to
    x1 x2    x3 x4    f
    0  0     0  0     0
    0  0     0  1     1
    0  0     1  0     1
    0  0     1  1     1
    0  1     0  0     0
    0  1     0  1     0
    0  1     1  0     0
    0  1     1  1     1
    1  0     0  0     0
    1  0     0  1     0
    1  0     1  0     0
    1  0     1  1     1
    1  1     0  0     0
    1  1     0  1     0
    1  1     1  0     0
    1  1     1  1     0
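The encoding can be checked mechanically. The sketch below builds the Boolean function f(x1, x2, x3, x4) from the five Calls tuples, using the two-bit node codes given on the slide; the helper names are invented for this example.

```python
import itertools

CODE = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
CALLS = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")}

# Each tuple becomes one satisfying assignment (x1, x2, x3, x4).
ONES = {CODE[src] + CODE[dst] for src, dst in CALLS}


def f(x1, x2, x3, x4):
    """1 if the encoded (from, to) pair is in the Calls relation, else 0."""
    return int((x1, x2, x3, x4) in ONES)


# Full truth table, in the same row order as the table above.
truth_table = [(bits, f(*bits)) for bits in itertools.product((0, 1), repeat=4)]
```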
Binary Decision Diagrams (Bryant 1986) • Graphical encoding of a truth table. [Figure: decision tree over x1, x2, x3, x4 with 0 edges and 1 edges and one terminal per truth-table row]
Binary Decision Diagrams • Collapse redundant nodes. [Figure sequence: identical terminals and identical subgraphs are merged step by step]
Binary Decision Diagrams • Eliminate unnecessary nodes. [Figure sequence: nodes whose 0 edge and 1 edge lead to the same successor are removed]
Binary Decision Diagrams • Size depends on amount of redundancy, NOT size of relation. – Identical subtrees share the same representation. – As set gets very large, more nodes have identical zero and one successors, so the size decreases.
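Both reduction rules fit in a few lines. The sketch below builds a reduced BDD for the Calls relation by brute-force enumeration, which is fine for four variables but is not how a real BDD package works; `build_bdd` and `calls_fn` are hypothetical names.

```python
def build_bdd(f, nvars):
    """Return (root, unique) where unique maps (level, low, high) -> node id.
    Ids 0 and 1 are the terminals; internal node ids start at 2."""
    unique = {}

    def mk(level, low, high):
        if low == high:            # eliminate unnecessary node
            return low
        key = (level, low, high)
        if key not in unique:      # collapse redundant nodes via the unique table
            unique[key] = len(unique) + 2
        return unique[key]

    def rec(bits):
        if len(bits) == nvars:
            return f(*bits)
        return mk(len(bits), rec(bits + (0,)), rec(bits + (1,)))

    return rec(()), unique


CODE = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
CALLS = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")}
ONES = {CODE[a] + CODE[b] for a, b in CALLS}


def calls_fn(x1, x2, x3, x4):
    return int((x1, x2, x3, x4) in ONES)


root, unique = build_bdd(calls_fn, 4)
tree_nodes = 2 ** 4 - 1   # internal nodes of the unreduced decision tree
```

For this relation the reduced diagram has far fewer internal nodes than the full decision tree.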
BDD Variable Order is Important! • Function: x1x2 + x3x4 [Figure: BDDs for the same function under two variable orders, x1<x2<x3<x4 and x1<x3<x2<x4]
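The effect of the order can be measured on the slide's function x1x2 + x3x4. The sketch below counts internal nodes of the reduced BDD under each order by brute-force enumeration; `bdd_size` is a hypothetical helper, not a library call.

```python
def bdd_size(f, order):
    """Number of internal nodes of the reduced BDD of f when variables are
    tested in the given order. f takes a dict {variable index: 0 or 1}."""
    unique = {}

    def mk(level, low, high):
        if low == high:
            return low
        key = (level, low, high)
        if key not in unique:
            unique[key] = len(unique) + 2   # ids 0 and 1 are the terminals
        return unique[key]

    def rec(level, env):
        if level == len(order):
            return int(f(env))
        var = order[level]
        low = rec(level + 1, {**env, var: 0})
        high = rec(level + 1, {**env, var: 1})
        return mk(level, low, high)

    rec(0, {})
    return len(unique)


def func(e):
    # x1 x2 + x3 x4
    return (e[1] and e[2]) or (e[3] and e[4])


good = bdd_size(func, [1, 2, 3, 4])   # x1 < x2 < x3 < x4
bad = bdd_size(func, [1, 3, 2, 4])    # x1 < x3 < x2 < x4
```

With only two product terms the gap is small; it widens quickly as more terms are added with the related variables kept apart.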
Variable ordering is NP-hard • No good general heuristic solutions • Dynamic reordering heuristics don’t work well on these problems • We use: – Trial and error – Active learning
Apply Operation
• Concept – Basic technique for building OBDD from Boolean formula.
• Arguments A, B, op – A and B: Boolean functions, represented as OBDDs – op: Boolean operation (e.g., ^, &, |)
• Result – OBDD representing composite function A op B
[Figure: OBDDs for A and B over variables a, b, c, d combined with | into the OBDD for A op B]

Apply Execution Example
• Operation: | applied to Argument A and Argument B
• Recursive calls (tree flattened): (A1, B1), (A2, B2), (A6, B2), (A3, B2), (A4, B3), (A5, B2), (A6, B5), (A3, B4), (A5, B4)
• Optimizations – Dynamic programming – Early termination rules

Apply Result Generation
• Recursive calls (tree flattened): (A1, B1), (A2, B2), (A6, B2), (A3, B2), (A4, B3), (A5, B2), (A6, B5), (A3, B4), (A5, B4)
– Recursive calling structure implicitly defines unreduced BDD
– Apply reduction rules bottom-up as return from recursive calls
[Figure: result over variables a, b, c, d, shown without reduction and with reduction]
BDD implementation • ‘Unique’ table – Huge hash table – Each entry: level, left, right, hash, next • Operation cache – Memoization cache for operations • Garbage collection – Mark and sweep, free list.
Code for BDD ‘and’ • Base case • Memo cache lookup • Recursive step • Memo cache insert [Code listing from the slide is not preserved]
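The four labeled steps can be sketched as follows. This is an illustrative Python version, not the slide's code and not BuDDy's implementation; the class `BDD` and its method names are assumptions made for this example.

```python
class BDD:
    def __init__(self):
        self.unique = {}      # 'unique' table: (level, low, high) -> node id
        self.nodes = {}       # node id -> (level, low, high)
        self.and_cache = {}   # operation cache for 'and'

    def mk(self, level, low, high):
        if low == high:
            return low
        key = (level, low, high)
        if key not in self.unique:
            node = len(self.unique) + 2   # ids 0 and 1 are the terminals
            self.unique[key] = node
            self.nodes[node] = key
        return self.unique[key]

    def var(self, level):
        return self.mk(level, 0, 1)

    def level(self, n):
        return self.nodes[n][0] if n > 1 else float("inf")

    def cofactors(self, n, level):
        if self.level(n) == level:
            _, low, high = self.nodes[n]
            return low, high
        return n, n   # n does not test this variable

    def and_(self, a, b):
        # Base case
        if a == 0 or b == 0:
            return 0
        if a == 1:
            return b
        if b == 1 or a == b:
            return a
        # Memo cache lookup ('and' is commutative, so normalize the key)
        key = (a, b) if a < b else (b, a)
        if key in self.and_cache:
            return self.and_cache[key]
        # Recursive step: split on the topmost variable of either argument
        top = min(self.level(a), self.level(b))
        a0, a1 = self.cofactors(a, top)
        b0, b1 = self.cofactors(b, top)
        result = self.mk(top, self.and_(a0, b0), self.and_(a1, b1))
        # Memo cache insert
        self.and_cache[key] = result
        return result


bdd = BDD()
x, y = bdd.var(0), bdd.var(1)
xy = bdd.and_(x, y)
```

A production package would use a fixed-size hash table for the cache and reference counting or mark-and-sweep for dead nodes, as the implementation slide lists.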
BDD Libraries • BuDDy – Simple, fast, memory-friendly – Identifies BDD by index in unique table • JavaBDD – 100% Java, based on BuDDy – Also native interface to BuDDy, CUDD, CAL, JDD • CUDD – Most popular, most feature-complete – Not as fast as BuDDy – Other types: ZDD, ADD • JDD – 100% Java, fresh implementation – Still under development
Depth-first vs. breadth-first • BDD algorithms have natural depth-first recursive formulations. • Some work on using breadth-first evaluation for better parallelism and locality – CAL: breadth-first BDD package • General idea: Assume independent, fix up if not. • Doesn’t perform well in practice.
Take a break… (Next up: Using the Tools)
Part II: Using the Tools: bddbddb (BDD-based deductive database)
bddbddb System Overview [Figure: Java bytecode → Joeq frontend → Input relations; Input relations + Datalog program → Output relations]
Compiler Frontend • Convert IR into tuples • Tuples format:
    # V0:16 F0:11 V1:16      (header line)
    0 0 1                    (one tuple per line)
    0 1 2
    1470 0 1464
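A reader for this format might look like the sketch below. It assumes that each header field is `domain:bits` and that tuple values are whitespace-separated integers; both are readings of the slide's example, not a documented specification, and `parse_tuples` is a hypothetical helper.

```python
def parse_tuples(text):
    """Parse a tuples file into (attributes, tuples).
    attributes: list of (domain name, bit width); tuples: list of int tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("#"):
        raise ValueError("missing header line")
    attrs = []
    for field in lines[0][1:].split():
        name, bits = field.split(":")
        attrs.append((name, int(bits)))
    tuples = []
    for row in lines[1:]:
        values = tuple(int(v) for v in row.split())
        if len(values) != len(attrs):
            raise ValueError(f"wrong arity in line: {row!r}")
        for value, (name, bits) in zip(values, attrs):
            if not 0 <= value < (1 << bits):   # assumed: value must fit the bit width
                raise ValueError(f"{value} out of range for {name}")
        tuples.append(values)
    return attrs, tuples


sample = """# V0:16 F0:11 V1:16
0 0 1
0 1 2
1470 0 1464
"""
attrs, tuples = parse_tuples(sample)
```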
Compiler Frontend • Robust frontends: – Joeq compiler – Soot compiler – SUIF compiler (for C code) • Still experimental: – Eclipse frontend – gcc frontend – …
Extracting Relations • Idea: Iterate through compiler IR, numbering and dumping relations of interest. – Types – Methods – Fields – Variables – …
joeq.Main.GenRelations • Generate initial relations for points-to analysis. – Does initial pass to discover call graph. • Options:
    -fly : dump on-the-fly call graph info
    -cs : dump context-sensitive info
    -ssa : dump SSA representation
    -partial : no call graph discovery
    -Dpa.dumppath= : where to save files
    -Dpa.icallgraph= : location of initial call graph
    -Dpa.dumpdotgraph : dump call graph in dot
Demo of joeq GenRelations
Part II: Using the Tools: bddbddb: From Datalog to BDDs
An Adventure in BDDs
• Context-sensitive numbering scheme – Modify BDD library to add special operations. – Can’t even analyze small programs. Time:
• Improved variable ordering – Group similar BDD variables together. – Interleave equivalence relations. – Move common subsets to edges of variable order. Time: 40 h
• Incrementalize outermost loop – Very tricky, many bugs. Time: 36 h
• Factor away control flow, assignments – Reduces number of variables. Time: 32 h
• Exhaustive search for best BDD order – Limit search space by not considering intradomain orderings. Time: 10 h
• Eliminate expensive rename operations – When rename changes relative order, result is not isomorphic. Time: 7 h
• Improved BDD memory layout – Preallocate to guarantee contiguous. Time: 6 h
• BDD operation cache tuning – Too small: redo work, too big: bad locality – Parameter sweep to find best values. Time: 2 h
• Simplified treatment of exceptions – Reduce number of variables, iterations necessary for convergence. Time: 1 h
• Change iteration order – Required redoing much of the code. Time: 48 m
• Eliminate redundant operations – Introduced subtle bugs. Time: 45 m
• Specialized caches for different operations – Different caches for and, or, etc. Time: 41 m
• Compacted BDD nodes – 20 bytes → 16 bytes. Time: 38 m
• Improved BDD hashing function – Simpler hash function. Time: 37 m
• Total development time: 1 year – 1 year per analysis?!?
• Optimizations obscured the algorithm.
• Many bugs discovered, maybe still more.
bddbddb: BDD-Based Deductive DataBase • Automatically generate from Datalog – Optimizations based on my experience with handcoded version. – Plus traditional compiler algorithms. • bddbddb even better than handcoded – handcoded: 37 m – bddbddb: 19 m

Datalog → BDDs
    Datalog                                  BDDs
    Relations                                Boolean functions
    Relation ops: ⋈, ∪, select, project      Boolean function ops: ∧, ∨, −, ∼
    Relation at a time                       Function at a time
    Semi-naïve evaluation                    Incrementalization
    Fixed-point                              Iterate until stable
Compiling Datalog to BDDs
1. Apply Datalog source level transforms.
2. Stratify and determine iteration order.
3. Translate into relational algebra IR.
4. Optimize IR and replace relational algebra ops with equivalent BDD ops.
5. Assign relation attributes to physical BDD domains.
6. Perform more optimizations after domain assignment.
7. Interpret the resulting program.
High-Level Transform: Magic Set Transformation • Add “magic” predicates to control generated tuples [Bancilhon 1986, Beeri 1987] – Combines ideas from top-down and bottom-up evaluation • Doesn’t always help – Leads to more iterations – BDDs are good at large operations
Predicate Dependency Graph • Add edge from RHS to LHS
    vPointsTo(v, o) :- vPointsTo0(v, o).
    vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).
    hPointsTo(o1, f, o2) :- Store(v1, f, v2), vPointsTo(v1, o1), vPointsTo(v2, o2).
    vPointsTo(v2, o2) :- Load(v1, f, v2), vPointsTo(v1, o1), hPointsTo(o1, f, o2).
[Figure: dependency graph over vPointsTo0, Assign, Load, Store, vPointsTo, hPointsTo]
Determining Iteration Order • Tradeoff between faster convergence and BDD cache locality • Static heuristic – Visit rules in reverse post-order – Iterate shorter loops before longer loops • Profile-directed feedback • User can control iteration order – pri=# keywords on rules/relations
Predicate Dependency Graph [Figure: dependency graph over vPointsTo0, Assign, Load, Store, vPointsTo, hPointsTo]
Datalog to Relational Algebra
    vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).

    t1 = ρ_{variable→source}(vPointsTo);
    t2 = assign ⋈ t1;
    t3 = π_{source}(t2);
    t4 = ρ_{dest→variable}(t3);
    vPointsTo = vPointsTo ∪ t4;
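The five relational-algebra statements can be mirrored with small Python helpers. In this sketch a relation is a pair (attribute names, set of rows); π_{source} is read as projecting the `source` attribute away, which is an assumption about the slide's notation, and the attribute name `heap` is invented for the second column of vPointsTo.

```python
def rename(rel, old, new):
    attrs, rows = rel
    return tuple(new if a == old else a for a in attrs), rows


def join(r, s):
    """Natural join on the attributes the two relations share."""
    (ra, rrows), (sa, srows) = r, s
    shared = [a for a in ra if a in sa]
    extra = [a for a in sa if a not in ra]
    out = set()
    for x in rrows:
        dx = dict(zip(ra, x))
        for y in srows:
            dy = dict(zip(sa, y))
            if all(dx[a] == dy[a] for a in shared):
                out.add(x + tuple(dy[a] for a in extra))
    return ra + tuple(extra), out


def project_away(rel, attr):
    attrs, rows = rel
    keep = [i for i, a in enumerate(attrs) if a != attr]
    return tuple(attrs[i] for i in keep), {tuple(r[i] for i in keep) for r in rows}


def union(r, s):
    (ra, rrows), (sa, srows) = r, s
    idx = [sa.index(a) for a in ra]   # align s's columns with r's
    return ra, rrows | {tuple(row[i] for i in idx) for row in srows}


assign = (("dest", "source"), {("a", "b")})        # a = b;
vPointsTo = (("variable", "heap"), {("b", "o1")})  # b points to o1

t1 = rename(vPointsTo, "variable", "source")
t2 = join(assign, t1)
t3 = project_away(t2, "source")
t4 = rename(t3, "dest", "variable")
vPointsTo = union(vPointsTo, t4)
```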
Incrementalization
Before:
    t1 = ρ_{variable→source}(vP);
    t2 = assign ⋈ t1;
    t3 = π_{source}(t2);
    t4 = ρ_{dest→variable}(t3);
    vP = vP ∪ t4;
After:
    vP’’ = vP – vP’;
    vP’ = vP;
    assign’’ = assign – assign’;
    assign’ = assign;
    t1 = ρ_{variable→source}(vP’’);
    t2 = assign ⋈ t1;
    t5 = ρ_{variable→source}(vP);
    t6 = assign’’ ⋈ t5;
    t7 = t2 ∪ t6;
    t3 = π_{source}(t7);
    t4 = ρ_{dest→variable}(t3);
    vP = vP ∪ t4;
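The incrementalized form joins each relation's new tuples against the other relation in full. A set-based sketch follows, with the renames and projection folded into the comprehensions; `step` and `solve` are hypothetical names and the sample assignments are invented.

```python
def step(vP, vP_prev, assign, assign_prev):
    d_vP = vP - vP_prev              # vP'' = vP - vP'
    d_assign = assign - assign_prev  # assign'' = assign - assign'
    # t2: all of assign joined with the new points-to tuples
    t2 = {(dest, o) for (dest, src) in assign for (v, o) in d_vP if v == src}
    # t6: the new assign tuples joined with all of vP
    t6 = {(dest, o) for (dest, src) in d_assign for (v, o) in vP if v == src}
    return vP | t2 | t6              # vP = vP ∪ t4


def solve(vP0, assign):
    vP, vP_prev, assign_prev = set(vP0), set(), set()
    while True:
        new_vP = step(vP, vP_prev, assign, assign_prev)
        vP_prev, assign_prev = vP, set(assign)   # vP' = vP; assign' = assign
        if new_vP == vP:
            return vP
        vP = new_vP


# b = a; c = b; d = c;  with a pointing to o
result = solve({("a", "o")}, {("b", "a"), ("c", "b"), ("d", "c")})
```

After the first round `assign''` is empty, so each later round only joins against the handful of tuples added in the round before.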
Optimize into BDD operations
Before:
    vP’’ = vP – vP’;
    vP’ = vP;
    assign’’ = assign – assign’;
    assign’ = assign;
    t1 = ρ_{variable→source}(vP’’);
    t2 = assign ⋈ t1;
    t5 = ρ_{variable→source}(vP);
    t6 = assign’’ ⋈ t5;
    t7 = t2 ∪ t6;
    t3 = π_{source}(t7);
    t4 = ρ_{dest→variable}(t3);
    vP = vP ∪ t4;
After:
    vP’’ = diff(vP, vP’);
    vP’ = copy(vP);
    t1 = replace(vP’’, variable→source);
    t3 = relprod(t1, assign, source);
    t4 = replace(t3, dest→variable);
    vP = or(vP, t4);

Physical domain assignment
Before:
    vP’’ = diff(vP, vP’);
    vP’ = copy(vP);
    t1 = replace(vP’’, variable→source);
    t3 = relprod(t1, assign, source);
    t4 = replace(t3, dest→variable);
    vP = or(vP, t4);
After:
    vP’’ = diff(vP, vP’);
    vP’ = copy(vP);
    t3 = relprod(vP’’, assign, V0);
    t4 = replace(t3, V1→V0);
    vP = or(vP, t4);
• Minimizing renames is NP-complete • Renames have vastly different costs • Priority-based assignment algorithm
Other optimizations • Dead code elimination • Constant propagation • Definition-use chaining • Redundancy elimination • Global value numbering • Copy propagation • Liveness analysis

Splitting rules
    R(a, e) :- A(a, b), B(b, c), C(c, d), R(d, e).
Can be split into:
    T1(a, c) :- A(a, b), B(b, c).
    T2(a, d) :- T1(a, c), C(c, d).
    R(a, e) :- T2(a, d), R(d, e).
Affects incrementalization, iteration. Use “split” keyword to auto-split rules.
Other Tools • Banshee (John Kodumal) – Results are harder to use (not relational) • Paddle/Jedd (Ondrej Lhotak) – Imperative style: more expressive – Not as efficient, doesn’t scale as well

Jedd code • Jedd code is like bddbddb internal IR before domain assignment:
    vP’’ = diff(vP, vP’);
    vP’ = copy(vP);
    t1 = replace(vP’’, variable→source);
    t3 = relprod(t1, assign, source);
    t4 = replace(t3, dest→variable);
    vP = or(vP, t4);

Demo of using bddbddb
Part III: Developing Advanced Analyses: Context Sensitivity
Old Technique: Summary-Based Analysis • Idea: Summarize the effect of a method on its callers. – Sharir, Pnueli [Muchnick 1981] – Landi, Ryder [PLDI 1992] – Wilson, Lam [PLDI 1995] – Whaley, Rinard [OOPSLA 1999]
Old Technique: Summary-Based Analysis • Problems: – Difficult to summarize pointer analysis. – Composed summaries can get large. – Recursion is difficult: Must find fixpoint. – Queries (e.g. which context points to x) require expanding an exponential number of contexts.
My Technique: Cloning-Based Analysis • Simple brute force technique. – Clone every path through the call graph. – Run context-insensitive algorithm on expanded call graph. • The catch: exponential blowup
Cloning is exponential!
Recursion • Actually, cloning is unbounded in the presence of recursive cycles. • Technique: We treat all methods within a strongly-connected component as a single node.
Recursion [Figure: example call graph over methods A to G, before and after the methods of a strongly-connected component are treated as a single node]
Top 20 Sourceforge Java Apps [Chart: log-scale axis from 10^0 to 10^16]
Cloning is infeasible (?) • Typical large program has ~10^14 paths. • If you need 1 byte to represent a clone: – Would require 256 terabytes of storage – >12 times size of Library of Congress – Registered ECC 1 GB DIMMs: $41.7 million • Power: 96.4 kilowatts = Power for 128 homes – 500 GB hard disks: 564 x $195 = $109,980 • Time to read sequential: 70.8 days
Key Insight • There are many similarities across contexts. – Many copies of nearly-identical results. • BDDs can represent large sets of redundant data efficiently. – Need a BDD encoding that exploits the similarities.
Expanded Call Graph [Figure: call graph over A to H and its expansion into clones A0; B0, C0, D0; E0 to E2; F0 to F2; G0 to G2; H0 to H5]
Numbering Clones [Figure: the same call graph with each edge labeled by the context numbers it carries; labels include 0, 1, 2, 0-2 and 3-5]
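The numbering can be sketched as path counting on the acyclic call graph: a method gets one clone per path from the root, and each incoming edge is assigned a contiguous range of the callee's context numbers. The edge list below is an assumption chosen to reproduce the clone counts in the figure (E, F, G with three clones, H with six); `number_clones` is a hypothetical helper, not a bddbddb function.

```python
def number_clones(edges, root):
    """edges: list of (caller, callee) in an acyclic call graph.
    Returns (count, edge_range): clones per method, and for each edge the
    inclusive range of callee context numbers it contributes."""
    preds = {}
    for caller, callee in edges:
        preds.setdefault(callee, []).append(caller)
    count = {root: 1}
    edge_range = {}

    def clones(method):
        if method not in count:
            offset = 0
            for caller in preds.get(method, []):
                c = clones(caller)
                edge_range[(caller, method)] = (offset, offset + c - 1)
                offset += c
            count[method] = offset
        return count[method]

    for _, callee in edges:
        clones(callee)
    return count, edge_range


# Assumed edges, picked to match the clone counts shown in the figure.
EDGES = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "E"), ("C", "E"), ("D", "E"),
         ("E", "F"), ("E", "G"), ("F", "H"), ("G", "H")]
count, edge_range = number_clones(EDGES, "A")
```

Recursive cycles have to be collapsed into single nodes first, as described on the Recursion slide; the recursion above does not terminate on a cyclic graph.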
Context-sensitive Pointer Analysis Algorithm 1. First, do context-insensitive pointer analysis to get call graph. 2. Number clones. 3. Do context-insensitive algorithm on the cloned graph. • Results explicitly generated for every clone. • Individual results retrievable with Datalog query.

Counting rule
    IEnum(i, m, vc2, vc1) :- roots(m), mI0(m, i), IE0(i, m). number
• Special rule to define numbering. • Head: result of numbering – First two variables: edge you want to number – Second two variables: context numbering • Subgoals: graph edges – Single variable: roots of graph

Demo of context-sensitive
Part III: Developing Advanced Analyses: Example: Race Detection

Object Sensitivity • k-object-sensitivity (Milanova, Ryder, Rountev 2003) • k=3 suffices in our experiments • CHA/context-insensitive/k-CFA too imprecise
    static main() {
      h1: C a = new A();
      h2: C b = new B();
      p1: foo(a);
      p2: foo(b);
      p3: foo(a);
    }
    static foo(C c) {
      p4: c.bar();
    }
Contexts of method bar():
    1-CFA: { p4 }
    2-CFA: { p1:p4, p2:p4, p3:p4 }
    1-objsens: { h1, h2 }
    2-objsens: { h1, h2 }

Open Programs • Analyzing open programs is important – Many “programs” are libraries – Developers need to understand behavior w/o a client • Standard approach – Write a “harness” manually – A client exercising the interface of the open program • Our approach – Generate the harness automatically
Race Detection • A multi-threaded program contains a race if: – Two threads can access a memory location – At least one access is a write – No ordering between the accesses • As a rule, races are bad – And common … – And hard to find …
Running Example
    public A() { f = 0; }
    public int get() { return rd(); }
    public sync int inc() {
      int t = rd() + (new A()).wr(1);
      return wr(t);
    }
    private int rd() { return f; }
    private int wr(int x) { f = x; return x; }
Harness (Note: Single-threaded):
    static public void main() {
      A a;
      a = new A();
      a.get();
      a.inc();
    }
Example: Two Object-Sensitive Contexts [Slide repeats the harness and class A from the running example.]
Example: 1st Context [Slide repeats the harness and class A from the running example.]
Example: 2nd Context [Slide repeats the harness and class A from the running example.]
Computing Original Pairs • All pairs of accesses such that – Both references are of one of the following forms: • e1.f and e2.f (the same instance field) • C.g and C.g (the same static field) • e1[e3] and e2[e4] (any array elements) – At least one is a write
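For the running example, the original pairs can be enumerated directly. The sketch below covers only the same-field case (no static fields or arrays), and the access labels are hypothetical names for the three accesses to `f` in class A.

```python
from itertools import combinations_with_replacement

# (label, field, kind) for each access to a field in class A.
ACCESSES = [
    ("A.<init>: f = 0", "f", "Wr"),
    ("A.rd: return f", "f", "Rd"),
    ("A.wr: f = x", "f", "Wr"),
]


def original_pairs(accesses):
    """Unordered pairs (including an access paired with itself) that touch
    the same field and contain at least one write."""
    pairs = []
    for a, b in combinations_with_replacement(accesses, 2):
        same_field = a[1] == b[1]
        has_write = "Wr" in (a[2], b[2])
        if same_field and has_write:
            pairs.append((a[0], b[0]))
    return pairs


pairs = original_pairs(ACCESSES)
```

The only pair dropped at this stage is the read paired with itself; the later steps (reachability, aliasing, escape, locks) each filter this set further.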
Example: Original Pairs public A() { f = 0; } public int get() { return rd(); } public sync int inc() { int t = rd() + (new A()). wr(1); return wr(t); } private int rd() { return f; } private int wr(int x) { f = x; return x; } June 11, 2006 Using Datalog and BDDs for Program Analysis 109
Computing Reachable Pairs • Step 1 – Access pairs with at least one write to same field • Step 2 – Consider any access pair (e 1, e 2) – To be a race e 1 must be: – Reachable from a thread-spawning call site s 1 • Without “switching” threads – Where s 1 is reachable from main – (and similarly for e 2) June 11, 2006 Using Datalog and BDDs for Program Analysis 110
Example: Reachable Pairs static public void main() { A a; a = new A(); a. get(); a. inc(); } public A() { f = 0; } public int get() { return rd(); } public sync int inc() { int t = rd() + (new A()). wr(1); return wr(t); } private int rd() { return f; } private int wr(int x) { f = x; return x; } June 11, 2006 Using Datalog and BDDs for Program Analysis 111
Computing Aliasing Pairs • Steps 1 -2 – Access pairs with at least one write to same field – And both are executed in a thread in some context • Step 3 – To have a race, both must access the same memory location – Use alias analysis June 11, 2006 Using Datalog and BDDs for Program Analysis 112
Example: Aliasing Pairs static public void main() { A a; a = new A(); a. get(); a. inc(); } public A() { f = 0; } public int get() { return rd(); } public sync int inc() { int t = rd() + (new A()). wr(1); return wr(t); } private int rd() { return f; } private int wr(int x) { f = x; return x; } June 11, 2006 Using Datalog and BDDs for Program Analysis 113
Computing Escaping Pairs
• Steps 1-3
  – Access pairs with at least one write to the same field
  – And both are executed in a thread in some context
  – And both can access the same memory location
• Step 4
  – To have a race, the memory location must also be thread-shared
  – Use thread-escape analysis
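Steps 3 and 4 both reduce to set operations once points-to and escape results are available. The sketch below assumes those results are given as plain Python sets; the access labels and the abstract objects `o1` and `o2` are hypothetical.

```python
def aliasing_escaping_pairs(pairs, points_to, escaped):
    """Keep pairs whose base expressions may point to a common, thread-shared object."""
    kept = []
    for a, b in pairs:
        common = points_to[a] & points_to[b]   # step 3: may-alias check
        if common & escaped:                   # step 4: object is thread-shared
            kept.append((a, b))
    return kept

# Hypothetical analysis results: o1 is the object created in main,
# o2 is the temporary created by "new A()" inside inc().
points_to = {
    "get/rd": {"o1"},
    "inc/wr": {"o1"},
    "inc/new.wr": {"o2"},
}
escaped = {"o1"}
candidate_pairs = [
    ("get/rd", "inc/wr"),
    ("get/rd", "inc/new.wr"),
    ("inc/new.wr", "inc/new.wr"),
]
surviving = aliasing_escaping_pairs(candidate_pairs, points_to, escaped)
```

Only the pair that touches the shared object `o1` from both sides remains.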
Example: Escaping Pairs
    static public void main() {
        A a;
        a = new A();
        a.get();
        a.inc();
    }
    public A() { f = 0; }
    public int get() { return rd(); }
    public sync int inc() {
        int t = rd() + (new A()).wr(1);
        return wr(t);
    }
    private int rd() { return f; }
    private int wr(int x) { f = x; return x; }
Computing Unlocked Pairs
• Steps 1-4
  – Access pairs with at least one write to the same field
  – And both are executed in a thread in some context
  – And both can access the same memory location
  – And the memory location is thread-shared
• Step 5
  – Discard pairs where the memory location is guarded by a common lock in both accesses
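Step 5 can be pictured as a lock-set intersection. In this hedged sketch the lock sets are assumed to be precomputed per access path; the labels and the abstract lock name `this` are illustrative only.

```python
def unlocked_pairs(pairs, locks_held):
    """Discard pairs whose two accesses hold at least one lock in common."""
    return [(a, b) for a, b in pairs
            if not (locks_held[a] & locks_held[b])]

# Hypothetical lock sets: inc() is synchronized, get() is not.
locks_held = {
    "get/rd": set(),
    "inc/rd": {"this"},
    "inc/wr": {"this"},
}
candidates = [
    ("get/rd", "inc/wr"),
    ("inc/rd", "inc/wr"),
    ("inc/wr", "inc/wr"),
]
races = unlocked_pairs(candidates, locks_held)
```

The remaining pair is a read reached through `get` against a write reached through `inc`.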
Example: Unlocked Pairs
    static public void main() {
        A a;
        a = new A();
        a.get();
        a.inc();
    }
    public A() { f = 0; }
    public int get() { return rd(); }
    public sync int inc() {
        int t = rd() + (new A()).wr(1);
        return wr(t);
    }
    private int rd() { return f; }
    private int wr(int x) { f = x; return x; }
Counterexamples • Each pair of paths in the context-sensitive call graph from a pair of roots to a pair of accesses along which a common lock may not be held • Different from most other systems – Pairs of paths (instead of single interleaved path) – At call-graph level June 11, 2006 Using Datalog and BDDs for Program Analysis 118
Example: Counterexample
    // file Harness.java
    static public void main() {
        A a;
        a = new A();
     4: a.get();
     5: a.inc();
    }

    // file A.java
    public A() { f = 0; }
    public int get() {
     4: return rd();
    }
    public sync int inc() {
        int t = rd() + (new A()).wr(1);
     7: return wr(t);
    }
    private int rd() {
    10: return f;
    }
    private int wr(int x) {
    12: f = x;
        return x;
    }

    field reference A.f (A.java:10) [Rd]
        A.get(A.java:4)
        Harness.main(Harness.java:4)
    field reference A.f (A.java:12) [Wr]
        A.inc(A.java:7)
        Harness.main(Harness.java:5)
Race Checker Datalog June 11, 2006 Using Datalog and BDDs for Program Analysis 120
Map Sensitivity
    ...
    String username = request.getParameter("user");
    map.put("USER_NAME", username);
    ...
    String query = (String) map.get("SEARCH_QUERY");
    stmt.executeQuery(query);
    ...
“USER_NAME” ≠ “SEARCH_QUERY”
• Maps with constant string keys are common
• Augment pointer analysis:
  – Model Map.put/get operations specially
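To illustrate what modeling `Map.put`/`get` specially buys, here is a small Python sketch that contrasts key-insensitive and key-sensitive treatment of a map. The tuple encodings and function names are assumptions; the real analysis is expressed as Datalog over points-to relations.

```python
from collections import defaultdict

def map_insensitive_gets(puts, gets):
    """Every get on a map may return anything ever put into that map."""
    contents = defaultdict(set)
    for map_obj, _key, value in puts:
        contents[map_obj].add(value)
    return {dest: set(contents[map_obj]) for map_obj, _key, dest in gets}

def map_sensitive_gets(puts, gets):
    """A get returns only values stored under the same constant key."""
    contents = defaultdict(set)
    for map_obj, key, value in puts:
        contents[(map_obj, key)].add(value)
    return {dest: set(contents[(map_obj, key)]) for map_obj, key, dest in gets}

# puts: (map, constant key, value); gets: (map, constant key, destination)
puts = [("map", "USER_NAME", "username")]
gets = [("map", "SEARCH_QUERY", "query")]

imprecise = map_insensitive_gets(puts, gets)
precise = map_sensitive_gets(puts, gets)
```

With keys ignored, `query` appears to receive the user-controlled `username`; with constant keys distinguished, it does not.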
Resolving Reflection
• Reflection is a dynamic language feature
• Used to query object and class information
  – static Class.forName(String className)
    • Obtain a java.lang.Class object
    • E.g., Class.forName("java.lang.String") gets an object corresponding to class String
  – Object Class.newInstance()
    • Object constructor in disguise
    • Create a new object of a given class
        Class c = Class.forName("java.lang.String");
        Object o = c.newInstance();
    • This makes a new empty string o
What to Do About Reflection?
    1. String className = ...;
    2. Class c = Class.forName(className);
    3. Object o = c.newInstance();
    4. T t = (T) o;
1. Anything goes
   + Obviously conservative
   - Call graph extremely big and imprecise
2. Ask the user
   + Good results
   - A lot of work for user, difficult to find answers
3. Subtypes of T
   + More precise
   - T may have many subtypes
4. Analyze className
   + Better still
   - Need to know where className comes from
Analyzing Class Names
• Looking at className seems promising
    String stringClass = "java.lang.String";
    foo(stringClass);
    ...
    void foo(String clazz) {
        bar(clazz);
    }
    void bar(String className) {
        Class c = Class.forName(className);
    }
• This is interprocedural constant + copy propagation on strings
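The propagation described here can be sketched as a small fixed-point computation. This Python version is an illustration under simplifying assumptions (flow-insensitive, parameter passing modeled as plain copy edges); the function name and data layout are hypothetical.

```python
def propagate_strings(constants, copies):
    """constants: var -> set of string constants; copies: list of (src, dst) edges."""
    values = {var: set(strings) for var, strings in constants.items()}
    changed = True
    while changed:
        changed = False
        for src, dst in copies:
            new = values.get(src, set()) - values.setdefault(dst, set())
            if new:
                values[dst] |= new
                changed = True
    return values

# Copy edges from the slide's example: the call foo(stringClass) binds clazz,
# and the call bar(clazz) binds className.
constants = {"stringClass": {"java.lang.String"}}
copies = [("clazz", "className"), ("stringClass", "clazz")]
values = propagate_strings(constants, copies)
```

After the fixed point, `className` carries the single constant needed to resolve the `Class.forName` call.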
Pointer Analysis Can Help
[Figure: points-to graph with stack variables stringClass, clazz, className and heap object java.lang.String]
Reflection Resolution Using Points-to
    1. String className = ...;
    2. Class c = Class.forName(className);
    3. Object o = c.newInstance();
    4. T t = (T) o;
• Need to know what className is
  – Could be a local string constant like java.lang.String
  – But could be a variable passed through many layers of calls
• Points-to analysis says what className refers to
  – className --> concrete heap object
Reflection Resolution
Constants, specification points
    1. String className = ...;
    2. Class c = Class.forName(className);
    3. Object o = c.newInstance();        Q: what object does this create?
    4. T t = (T) o;

    1. String className = ...;
    2. Class c = Class.forName(className);
       Object o = new T1();
       Object o = new T2();
       Object o = new T3();
    4. T t = (T) o;
Resolution May Fail!
    1. String className = r.readLine();
    2. Class c = Class.forName(className);
    3. Object o = c.newInstance();
    4. T t = (T) o;
• Need help figuring out what className is
• Two options
  1. Can ask user for help
     • Call to r.readLine on line 1 is a specification point
     • User needs to specify what can be read from a file
     • Analysis helps the user by listing all specification points
  2. Can use cast information
     • Constrain possible types instantiated on line 3 to subclasses of T
     • Need additional assumptions
1. Specification Files
• Format: invocation site => class
    loadImpl() @ 43 InetAddress.java:1231 => java.net.Inet4AddressImpl
    loadImpl() @ 43 InetAddress.java:1231 => java.net.Inet6AddressImpl
    lookup() @ 86 AbstractCharsetProvider.java:126 => sun.nio.cs.ISO_8859_15
    lookup() @ 86 AbstractCharsetProvider.java:126 => sun.nio.cs.MS1251
    tryToLoadClass() @ 29 DataFlavor.java:64 => java.io.InputStream
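As a rough illustration of how such a file could be consumed, the Python sketch below groups target classes by invocation site. It relies only on the "site => class" shape visible in the examples; the function name is hypothetical and the actual parser in the tool may differ.

```python
from collections import defaultdict

def parse_spec(text):
    """Group the classes listed for each invocation site, in file order."""
    targets = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or "=>" not in line:
            continue  # skip blank or malformed lines
        site, cls = (part.strip() for part in line.split("=>", 1))
        targets[site].append(cls)
    return dict(targets)

SPEC = """\
loadImpl() @ 43 InetAddress.java:1231 => java.net.Inet4AddressImpl
loadImpl() @ 43 InetAddress.java:1231 => java.net.Inet6AddressImpl
tryToLoadClass() @ 29 DataFlavor.java:64 => java.io.InputStream
"""
spec = parse_spec(SPEC)
```

A site listed on several lines, such as `loadImpl()` here, resolves to several possible classes.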
2. Using Cast Information
• Providing specification files is tedious, time-consuming, error-prone
• Leverage cast data instead
    1. String className = ...;
    2. Class c = Class.forName(className);
    3. Object o = c.newInstance();
    4. T t = (T) o;
  – o instanceof T
  – Can constrain type of o if
    1. Cast succeeds
    2. We know all subclasses of T
Analysis Assumptions
1. Assumption: Correct casts. Type cast operations that always operate on the result of a call to Class.newInstance are correct; they will always succeed without throwing a ClassCastException.
2. Assumption: Closed world. We assume that only classes reachable from the class path at analysis time can be used by the application at runtime.
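Under these two assumptions, the cast-based approximation amounts to enumerating the subtypes of the cast target within the known class set. A minimal Python sketch, assuming single inheritance and a hypothetical `superclass_of` map (interfaces and abstract classes are ignored here):

```python
def subtypes(superclass_of, target):
    """All classes in the closed world that are `target` or inherit from it."""
    result = set()
    for cls in superclass_of:
        current = cls
        while current is not None:
            if current == target:
                result.add(cls)
                break
            current = superclass_of.get(current)
    return result

# Hypothetical closed world: class -> direct superclass (None for the root).
superclass_of = {
    "Object": None,
    "T": "Object",
    "T1": "T",
    "T2": "T1",
    "U": "Object",
}
candidates = subtypes(superclass_of, "T")
```

The result is the set of types that `c.newInstance()` may create if the cast `(T) o` always succeeds.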
Casts Aren’t Always Present
• Can’t do anything if no cast post-dominates a Class.newInstance call
    Object factory(String className) {
        Class c = Class.forName(className);
        return c.newInstance();
    }
    ...
    SunEncoder t = (SunEncoder) factory("sun.io.encoder." + enc);
    SomethingElse e = (SomethingElse) factory("SomethingElse");
Call Graph Discovery Process
[Figure: flow diagram with boxes Program IR, call graph construction, reflection resolution using points-to, user-provided spec, cast-based approximation, specification points, resolved calls, final call graph]
Implementation Details • Call graph construction algorithm in the presence of reflection is integrated with pointer analysis – Pointer analysis already has to deal with virtual calls: new methods are discovered, points-to relations for them are created – Reflection analysis is another level of complexity • See Datalog specification June 11, 2006 Using Datalog and BDDs for Program Analysis 134
Reflection Resolution Results
• Applied to 6 large Java apps, 190,000 LOC combined
[Figure: call graph sizes compared]
Map relations
• Need to map from values in one domain to another?
• Use special operator “=>”
    mapAtoB(a, b) :- a => b.
• Elements in A are appended to domain of B
  – A must have a map file.
  – B must have enough space.
Using Code Fragments
• Execute a code fragment before/after every rule invocation.
    A(x, y) :- B(y, a), C(a, z). { code goes here }
• Can access:
  – Relations by name.
  – Rule information.
  – Solver information.
• Can also add code fragment to relations (triggered on change).
• Special keywords: “modifies”, “pre”, “post”
Take a break… (Next up: Profiling, Debugging) June 11, 2006 Using Datalog and BDDs for Program Analysis 138
Tutorial Structure
Part I: Essential Background
  – Datalog for Program Analysis
  – Binary Decision Diagrams
Part II: Using the Tools
  – bddbddb
  – Compiler interface (Joeq compiler)
  – Datalog editor in Eclipse
  – Interactive mode
Part III: Developing Advanced Analyses
  – Context sensitivity
  – Combining multiple analyses
  – Race detection examples
  – Using advanced bddbddb features
Part IV: Profiling, Debugging, Avoiding Gotchas
  – Variable ordering
  – Iteration order
  – Machine learning
  – What it’s good for, what it isn’t good for
Part IV: Profiling, Debugging, Avoiding Gotchas Variable Ordering June 11, 2006 Using Datalog and BDDs for Program Analysis 140
TryDomainOrders
• Try all possible domain orders for a given operation and inputs.
  – Bounded: if an order takes longer than the current best, abort it.
• To profile slow-running operations:
  – Run with -Ddumpslow, -Ddumpcutoff=5000
  – java net.sf.bddbddb.TryDomainOrders
• If you know ordering constraints, you can add them to rules/relations
  – Constraints are automatically propagated to other rules/relations
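The bounded search idea can be sketched in a few lines of Python. This is not the TryDomainOrders implementation: instead of timing real BDD operations, it assumes a hypothetical `run_steps` callback that reports cumulative cost as a trial progresses, so a losing trial can be cut short.

```python
from itertools import permutations

def try_domain_orders(domains, run_steps):
    """Exhaustively try orders; abort any trial once it reaches the best cost so far."""
    best_order, best_cost = None, float("inf")
    for order in permutations(domains):
        cost = None
        for cost in run_steps(order):
            if cost >= best_cost:   # bounded: this trial can no longer win
                break
        else:
            if cost is not None:    # trial ran to completion under the bound
                best_order, best_cost = order, cost
    return best_order, best_cost

# Toy cost model (an assumption): domains placed later cost more per unit weight.
WEIGHT = {"V": 1, "H": 5, "F": 3}

def toy_steps(order):
    total = 0
    for position, name in enumerate(order):
        total += (position + 1) * WEIGHT[name]
        yield total

best_order, best_cost = try_domain_orders(["V", "H", "F"], toy_steps)
```

In this toy model, several of the six trials stop before their last step because their running cost already matches or exceeds the current best.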
Variable Numbering: Active Machine Learning
• Must be determined dynamically
• Limit trials with properties of relations
• Each trial may take a long time
• Active learning: select trials based on uncertainty
  – Can build up trial database to improve accuracy
• Several hours
• Comparable to exhaustive for small apps
Using Machine Learning • -Dfindbestorder – Enable machine learning • -Dfbocutoff=# – Minimum runtime (in ms) for an operation to be considered • -Dfbotrials=# – Maximum number of trials to run • -Dtrialfile= – Filename to load/store trial information. June 11, 2006 Using Datalog and BDDs for Program Analysis 143
Changing Iteration Order • bddbddb uses simple iteration order heuristic – not always optimal • If a rule is iterating too many times: – Lower its priority with pri=5 – Increase other rules with pri=-5 – Can also adjust priority of relations • Solver prints iteration order on startup • Also try reformulating the problem or changing input relations June 11, 2006 Using Datalog and BDDs for Program Analysis 144
Reformulate the Problem
• Change rule form:
    A(a, c) :- A(a, b), A(b, c).
  vs.
    A(a, c) :- A(a, b), B(b, c).
• Change input relations
  – Short-circuit paths
• Filter relations as you go
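As a sketch of why rule form matters, the following Python compares two ways of writing transitive closure: a linear rule that joins the derived relation with a base edge relation, and a doubly recursive rule that joins the derived relation with itself. The naive evaluation loop and the relation names are assumptions for illustration; bddbddb's solver works differently.

```python
def closure(edges, doubling):
    """Naive fixed point; returns (relation, number of rounds that added tuples)."""
    derived = set(edges)
    rounds = 0
    while True:
        right = derived if doubling else edges
        new = {(x, z)
               for (x, y) in derived
               for (y2, z) in right
               if y == y2} - derived
        if not new:
            return derived, rounds
        derived |= new
        rounds += 1

# A chain 0 -> 1 -> ... -> 16, which is a long propagation path.
chain = {(i, i + 1) for i in range(16)}
linear_result, linear_rounds = closure(chain, doubling=False)
doubled_result, doubled_rounds = closure(chain, doubling=True)
```

Both forms compute the same relation, but on this chain the linear form needs 15 productive rounds while the doubly recursive form needs 4, since each round doubles the longest path covered. Which form is cheaper overall depends on how expensive each round is.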
Debugging • Debugging can be tricky – Relations are huge – Declarative: not so straightforward • Adding code fragments can help. • Try it on a small example with full trace information. • Best: Interactive solver June 11, 2006 Using Datalog and BDDs for Program Analysis 146
“Comes from” query
• Special kind of query:
    A(3,5) :- ?
  “What contributed to (3,5) being added to A?”
• Add ‘single’ keyword to get only one path.
• Doesn’t solve the negated problem (missing tuples)
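The idea behind such a query can be illustrated by recording one justification per derived tuple during solving and expanding it on demand. This Python sketch is a conceptual model only, using a hypothetical `path`/`edge` program; it does not reflect how bddbddb computes come-from results.

```python
def solve_with_provenance(edges):
    """Derive path(x, z) tuples and remember one justification for each."""
    why = {("path", e): [("edge", e)] for e in edges}
    changed = True
    while changed:
        changed = False
        for (_, (x, y)) in list(why):
            for (y2, z) in edges:
                fact = ("path", (x, z))
                if y == y2 and fact not in why:
                    # path(x, z) :- path(x, y), edge(y, z).
                    why[fact] = [("path", (x, y)), ("edge", (y, z))]
                    changed = True
    return why

def comes_from(why, fact):
    """Expand a single derivation of `fact` down to the input tuples."""
    if fact[0] == "edge":
        return [fact]
    leaves = []
    for premise in why[fact]:
        leaves.extend(comes_from(why, premise))
    return leaves

why = solve_with_provenance([(3, 4), (4, 5)])
explanation = comes_from(why, ("path", (3, 5)))
```

Keeping only the first justification per tuple corresponds loosely to asking for a single path rather than all contributing derivations.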
Solver options
    -Dnoisy
    -Dtracesolve
    -Dfulltracesolve
    -Dbddvarorder=
    -Dbddnodes=
    -Dbddcache=
    -Dbddminfree=
    -Dfindbestorder
    -Dbasedir=
    -Dincludedirs=
    -Ddumprulegraph
    -Duseir
    -Dprintir
    -Dsplit_all_rules
    -Dsplit_no_rules
Datalog directives
    .include
    .split_all_rules
    .report_stats
    .noisy
    .strict
    .singleignore
    .trace
    .bddvarorder
    .bddnodes
    .bddcache
    .bddminfree
    .findbestorder
    .incremental
    .dot
    .basedir
Relation options
    input / inputtuples
    output / outputtuples
    printsize
    pri=#
    { code fragment }
    x<y
Rule options
    split
    number
    single
    cacheafterrename
    findbestorder
    trace / tracefull
    pre / post { code }
    modifies R
Experimental Features
• Distributed computation: (dbddbddb?)
• Profile-directed feedback of iteration order
• Eclipse integration
• Touchgraph integration
• Debugging interface
• Tracing information
• Include rules in come-from query
What works well • Big sets of mostly redundant data – Pointer analysis – Context-sensitive analysis • Short propagation paths – Each iteration takes quite a bit of time, so >1000 iterations will hurt – Try to preprocess/reformulate problem to shorten paths • Natural ‘flow’ problems • Pure analysis problems (no transformations) June 11, 2006 Using Datalog and BDDs for Program Analysis 152
What doesn’t work well • Long propagation paths – Traditional dataflow analysis (use sparse form instead) • Huge problems with little redundancy – Too much context sensitivity • Domains that are not easily countable – Need to manufacture names on the fly • Problems that have inherently complicated formulations • Problems optimized for particular data structures (union-find, etc. ) June 11, 2006 Using Datalog and BDDs for Program Analysis 153
Using bddbddb in a class • bddbddb has been very useful in Stanford advanced compiler course – Comparing/contrasting analyses becomes easier – Students can implement and evaluate multiple techniques without much overhead • Projects: – Implement an algorithm from a paper in Datalog, make a small change and evaluate its effectiveness – Experiment with different kinds of context sensitivity on a given problem – Improve on BDD solver efficiency – Build a tool based on analysis results June 11, 2006 Using Datalog and BDDs for Program Analysis 154
Questions? June 11, 2006 Using Datalog and BDDs for Program Analysis 155
That’s all, folks! Thanks for sticking around for all 156 slides! June 11, 2006 Using Datalog and BDDs for Program Analysis 156
Experimental Results June 11, 2006 Using Datalog and BDDs for Program Analysis 157
Related Work
• Datalog in Program Analysis
  – Specify as Datalog query [Ullman 1989]
  – Toupie system [Corsini 1993]
  – Demand-driven using magic sets [Reps 1994]
  – Program analysis with logic programming [Dawson 1996]
  – Crocopat system [Beyer 2003]
  – Modular class analysis [Besson 2003]
• BDDs in Program Analysis
  – Predicate abstraction [Ball 2000]
  – Shape analysis [Manevich 2002, Yavuz-Kahveci 2002]
  – Pointer analysis [Zhu 2002, Berndl 2003, Zhu 2004]
  – Jedd system [Lhotak 2004]
Related Work
• BDD Variable Ordering
  – Variable ordering is NP-complete [Bollig 1996]
  – Interleaving [Fujii 1993]
  – Sifting [Rudell 1993]
  – Genetic algorithms [Drechsler 1995]
  – Machine learning for BDD orders [Grumberg 2003]
• Efficient Evaluation of Datalog
  – Semi-naïve evaluation [Balbin 1987]
  – Bottom-up evaluation [Ullman 1989, Ceri 1990, Naughton 1991]
  – Top-down evaluation with tabling [Tamaki 1986, Chen 1996]
  – Rule ordering [Ramakrishnan 1990]
  – Magic sets transformation [Bancilhon 1986]
  – Computing with BDDs [Iwaihara 1995]
  – Time and space guarantees [Liu 2003]
Program Analysis with bddbddb
• Context-sensitive Java pointer analysis
• C pointer analysis
• Escape analysis
• Type analysis
• External lock analysis
• Finding memory leaks
• Interprocedural def-use
• Interprocedural mod-ref
• Object-sensitive analysis
• Cartesian product algorithm
• Resolving Java reflection
• Bounds check elimination
• Finding race conditions
• Finding Java security vulnerabilities
• And many more…
Performance better than hand-coded!
Conclusion
• bddbddb: new paradigm in program analysis
  – Datalog compiled into optimized BDD operations
  – Efficiently and easily implement context-sensitive analyses
  – Easier to develop correct analyses
  – Easily experiment with new ideas
  – Growing library of program analyses
  – Easily use and build upon work of others
• Available as open-source LGPL: http://bddbddb.sourceforge.net
My Contribution (2) bddbddb (BDD-based deductive database) – Pointer analysis in 6 lines of Datalog (a database language) • Hard to create & debug efficient BDD-based algorithms (3451 lines, 1 man-year) • Automatic optimizations in bddbddb – Easy to create context-sensitive analyses using pointer analysis results (a few lines) – Created many analyses using bddbddb June 11, 2006 Using Datalog and BDDs for Program Analysis 173
Outline
• Pointer Analysis
  – Problem Overview
  – Brief History
  – Pointer Analysis in Datalog
• Context Sensitivity
• Improving Performance
• bddbddb: BDD-based deductive database
• Experimental Results
  – Analysis Time
  – Analysis Memory
  – Analysis Accuracy
• Conclusion
Performance is Tricky!
• Context-sensitive numbering scheme
  – Modify BDD library to add special operations.
  – Can’t even analyze small programs.
  – Time:
• Improved variable ordering
  – Group similar BDD variables together.
  – Interleave equivalence relations.
  – Move common subsets to edges of variable order.
  – Time: 40h
• Incrementalize outermost loop
  – Very tricky, many bugs.
  – Time: 36h
• Factor away control flow, assignments
  – Reduces number of variables.
  – Time: 32h
Java Security Vulnerabilities
    Application     Classes   Reported errors         Reported errors       Actual
    name                      (context-insensitive)   (context-sensitive)   errors
    blueblog        306       1                       1                     1
    webgoat         349       81                      6                     6
    blojsom         428       48                      2                     2
    personalblog    611       350                     2                     2
    snipsnap        653       >321                    27                    15
    road2hiberna    867       15                      1                     1
    pebble          889       427                     1                     1
    roller          989       >267                    1                     1
    Total           5356      >1508                   41                    29
(due to V. Benjamin Livshits)
Vulnerabilities Found
                 SQL         HTTP        Cross-site   Path
                 injection   splitting   scripting    traversal   Total
    Header       0           6           4            0           10
    Parameter    6           5           0            2           13
    Cookie       1           0           0            0           1
    Non-Web      2           0           0            3           5
    Total        9           11          4            5           29
Summary of Contributions
• The first scalable context-sensitive subset-based pointer analysis.
  – Cloning-based technique using BDDs
  – Clever context numbering
  – Experimental results on the effects of context sensitivity
• bddbddb: new paradigm in program analysis
  – Efficiently and easily implement context-sensitive analyses
  – Datalog compiled into optimized BDD operations
  – Library of program analyses (with many others)
  – Active learning for BDD variable orders (with M. Carbin)
• Artifacts:
  – Joeq compiler and virtual machine
  – JavaBDD library and BuDDy library
  – bddbddb tool
Conclusion
• The first scalable context-sensitive subset-based pointer analysis.
  – Accurate: Results for up to 10^14 contexts.
  – Scales to large programs.
• bddbddb: a new paradigm in program analysis
  – High-level spec → efficient implementation
• System is publicly available at: http://bddbddb.sourceforge.net