Using Datalog with Binary Decision Diagrams for Program Analysis
John Whaley, Dzintars Avots, Michael Carbin, Monica S. Lam
Stanford University
November 5, 2005

Implementing Program Analysis
vs. … 56 pages!
• 2x faster
• Fewer bugs
• Extensible

Outline
• Introduction
• Program Analysis in Datalog
  – Example of Pointer Analysis
• Binary Decision Diagrams (BDDs)
• Datalog to Efficient BDDs
• Experimental Results
• Conclusion

Program Analysis in Datalog

Datalog
• Declarative language for deductive databases [Ullman 1989]
  – Like Prolog, but no function symbols, no predefined evaluation strategy
• Semantics of negation
  – No negation allowed [Ullman 1988]
  – Stratified Datalog [Chandra 1985]
  – Well-founded semantics [Van Gelder 1991]
• Evaluation strategy
  – Top-down (goal-directed) [Ullman 1985]
  – Bottom-up (infer from base facts) [Ullman 1989]
• Additional restriction: finite domains

Flow-Insensitive Pointer Analysis
Program:
  o1: p = new Object();
  o2: q = new Object();
      p.f = q;
      r = p.f;
Input tuples:
  vPointsTo(p, o1)
  vPointsTo(q, o2)
  Store(p, f, q)
  Load(p, f, r)
Output relations:
  hPointsTo(o1, f, o2)
  vPointsTo(r, o2)
[Figure: points-to graph over variables p, q, r and objects o1, o2, with a field edge labeled f]

Inference Rule in Datalog
Assignments:
  vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).
Statement: v1 = v2;
[Figure: nodes v1, v2, o]

Inference Rule in Datalog
Stores:
  hPointsTo(o1, f, o2) :- Store(v1, f, v2), vPointsTo(v1, o1), vPointsTo(v2, o2).
Statement: v1.f = v2;
[Figure: nodes v1, o1, v2, o2 with a field edge labeled f]

Inference Rule in Datalog
Loads:
  vPointsTo(v2, o2) :- Load(v1, f, v2), vPointsTo(v1, o1), hPointsTo(o1, f, o2).
Statement: v2 = v1.f;
[Figure: nodes v1, o1, v2, o2 with a field edge labeled f]

The Whole Algorithm
  vPointsTo(v, o)      :- vPointsTo0(v, o).
  vPointsTo(v1, o)     :- Assign(v1, v2), vPointsTo(v2, o).
  hPointsTo(o1, f, o2) :- Store(v1, f, v2), vPointsTo(v1, o1), vPointsTo(v2, o2).
  vPointsTo(v2, o2)    :- Load(v1, f, v2), vPointsTo(v1, o1), hPointsTo(o1, f, o2).

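The four rules can be run directly as a naive bottom-up fixed point. The sketch below is a plain Python set-based evaluator, not how bddbddb evaluates the rules (it compiles them to BDD operations). The function name `pointer_analysis` and the tuple layouts are assumptions made for illustration. The input is the four-statement example from the Flow-Insensitive Pointer Analysis slide.

```python
def pointer_analysis(vpt0, assign, store, load):
    """Apply the four Datalog rules repeatedly until no new tuple appears."""
    vpt = set(vpt0)   # vPointsTo(v, o) :- vPointsTo0(v, o).
    hpt = set()
    changed = True
    while changed:
        new_vpt, new_hpt = set(vpt), set(hpt)
        # vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).
        for (v1, v2) in assign:
            for (v, o) in vpt:
                if v == v2:
                    new_vpt.add((v1, o))
        # hPointsTo(o1, f, o2) :- Store(v1, f, v2), vPointsTo(v1, o1), vPointsTo(v2, o2).
        for (v1, f, v2) in store:
            for (a, o1) in vpt:
                for (b, o2) in vpt:
                    if a == v1 and b == v2:
                        new_hpt.add((o1, f, o2))
        # vPointsTo(v2, o2) :- Load(v1, f, v2), vPointsTo(v1, o1), hPointsTo(o1, f, o2).
        for (v1, f, v2) in load:
            for (a, o1) in vpt:
                for (h, g, o2) in hpt:
                    if a == v1 and h == o1 and g == f:
                        new_vpt.add((v2, o2))
        changed = new_vpt != vpt or new_hpt != hpt
        vpt, hpt = new_vpt, new_hpt
    return vpt, hpt


# o1: p = new Object();  o2: q = new Object();  p.f = q;  r = p.f;
vpt, hpt = pointer_analysis(
    vpt0={("p", "o1"), ("q", "o2")},
    assign=set(),
    store={("p", "f", "q")},
    load={("p", "f", "r")},
)
print(sorted(hpt))  # [('o1', 'f', 'o2')]
print(sorted(vpt))  # [('p', 'o1'), ('q', 'o2'), ('r', 'o2')]
```

The store rule fires in the first pass and the load rule in the second, which is why the loop has to iterate until the relations stop changing.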
Inference Rules
  Assign(v1, v2)   vPointsTo(v2, o)
  ---------------------------------
          vPointsTo(v1, o)

  vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).
• Datalog rules directly correspond to inference rules!

Binary Decision Diagrams

Call graph relation
• Call graph expressed as a relation.
  – Five edges:
    • Calls(A, B)
    • Calls(A, C)
    • Calls(A, D)
    • Calls(B, D)
    • Calls(C, D)
[Figure: call graph with nodes A, B, C, D]

Call graph relation
• Relation expressed as a binary function.
  – A=00, B=01, C=10, D=11
  Calls(A, B) → 00 01
  Calls(A, C) → 00 10
  Calls(A, D) → 00 11
  Calls(B, D) → 01 11
  Calls(C, D) → 10 11

Call graph relation
• Relation expressed as a binary function.
  – A=00, B=01, C=10, D=11

  from     to
  x1 x2    x3 x4    f
  0  0     0  0     0
  0  0     0  1     1
  0  0     1  0     1
  0  0     1  1     1
  0  1     0  0     0
  0  1     0  1     0
  0  1     1  0     0
  0  1     1  1     1
  1  0     0  0     0
  1  0     0  1     0
  1  0     1  0     0
  1  0     1  1     1
  1  1     0  0     0
  1  1     0  1     0
  1  1     1  0     0
  1  1     1  1     0

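The encoding step can be sketched in a few lines of Python. This assumes the 2-bit codes from the slide (A=00, B=01, C=10, D=11); the names `CODES`, `CALLS` and `calls_function` are made up for the example.

```python
from itertools import product

CODES = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
CALLS = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")]


def calls_function(x1, x2, x3, x4):
    """f(x1, x2, x3, x4) = 1 iff (x1 x2) calls (x3 x4) in the call graph."""
    return int(any(CODES[src] == (x1, x2) and CODES[dst] == (x3, x4)
                   for src, dst in CALLS))


# Enumerate the inputs in the order 0000, 0001, ..., 1111.
truth_table = [calls_function(*bits) for bits in product((0, 1), repeat=4)]
print(truth_table)
# [0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
```

Five of the sixteen entries are 1, one per call edge.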
Binary Decision Diagrams (Bryant 1986)
• Graphical encoding of a truth table.
• Collapse redundant nodes.
• Eliminate unnecessary nodes.
[Figure: decision diagram over x1, x2, x3, x4 with terminals 0 and 1, reduced step by step; legend: 0 edge, 1 edge]

Binary Decision Diagrams
• Size depends on amount of redundancy, NOT size of relation.
  – Identical subtrees share the same representation.
  – As set gets very large, more nodes have identical zero and one successors, so the size decreases.

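A toy illustration of the size claim, as a sketch only. The builder below enumerates the whole truth table, so it is exponential in the number of variables and is not how a real BDD package such as BuDDy or JavaBDD works. The names `build_robdd`, `calls` and `everything` are assumptions for the example. It applies the two reduction rules (share identical subgraphs, drop a node whose two edges agree) and compares the five-edge Calls relation with the full 16-tuple relation.

```python
def build_robdd(f, nvars):
    """Return (root, nodes) for the reduced ordered BDD of f(x1, ..., xn).

    Terminals are the ints 0 and 1; decision nodes get ids 2, 3, ...
    `nodes` maps (level, low_child, high_child) -> node id.
    """
    nodes = {}

    def mk(level, low, high):
        if low == high:            # both edges agree: the test is unnecessary
            return low
        key = (level, low, high)
        if key not in nodes:       # identical subgraphs share one node
            nodes[key] = len(nodes) + 2
        return nodes[key]

    def build(level, prefix):
        if level == nvars:
            return int(f(*prefix))
        low = build(level + 1, prefix + (0,))
        high = build(level + 1, prefix + (1,))
        return mk(level, low, high)

    return build(0, ()), nodes


CODES = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
CALLS = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")}
ENCODED = {CODES[s] + CODES[d] for s, d in CALLS}


def calls(x1, x2, x3, x4):
    return (x1, x2, x3, x4) in ENCODED


def everything(x1, x2, x3, x4):
    return True                    # the full relation: all 16 tuples


_, sparse_nodes = build_robdd(calls, 4)
full_root, full_nodes = build_robdd(everything, 4)
print(len(sparse_nodes), len(full_nodes))  # 6 0
```

The relation with all 16 tuples needs no decision nodes at all (it is just the terminal 1), while the 5-tuple relation needs 6.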
BDD Variable Order is Important!
Function: x1x2 + x3x4
[Figure: two BDDs for the same function, one under the order x1<x2<x3<x4 and one under the order x1<x3<x2<x4]

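The effect of the order can be checked numerically. The sketch below counts decision nodes using the standard characterization that the nodes at a level correspond to the distinct cofactors that actually depend on that level's variable. It is brute force (exponential) and the name `robdd_size` is an assumption for the example.

```python
from itertools import product


def robdd_size(f, order):
    """Count decision nodes of the reduced ordered BDD of f under `order`.

    order[i] is the (0-based) argument index of f tested at level i.
    """
    n = len(order)

    def cofactor_table(prefix):
        # Truth table of f with the first len(prefix) ordered variables fixed.
        rows = []
        for rest in product((0, 1), repeat=n - len(prefix)):
            args = [0] * n
            for level, bit in enumerate(prefix + rest):
                args[order[level]] = bit
            rows.append(int(f(*args)))
        return tuple(rows)

    total = 0
    for level in range(n):
        distinct = set()
        for prefix in product((0, 1), repeat=level):
            table = cofactor_table(prefix)
            half = len(table) // 2
            # First half: this level's variable = 0; second half: = 1.
            if table[:half] != table[half:]:
                distinct.add(table)
        total += len(distinct)
    return total


def f(x1, x2, x3, x4):
    return (x1 and x2) or (x3 and x4)


print(robdd_size(f, (0, 1, 2, 3)))  # order x1<x2<x3<x4 -> 4
print(robdd_size(f, (0, 2, 1, 3)))  # order x1<x3<x2<x4 -> 6
```

Keeping x1 next to x2 and x3 next to x4 gives 4 nodes; interleaving them gives 6. The gap grows quickly for longer sums of products of this shape.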
bddbddb (BDD-based deductive database)

bddbddb System Overview
[Figure: system diagram with the components Java bytecode, Joeq frontend, Input relations, Datalog program, Output relations]

Datalog → BDDs
  Datalog                                  BDDs
  Relations                                Boolean functions
  Relation ops: ⋈, ∪, select, project      Boolean function ops: ∧, ∨, −, ∼
  Relation at a time                       Function at a time
  Semi-naïve evaluation                    Incrementalization
  Fixed-point                              Iterate until stable

Compiling Datalog to BDDs
1. Apply Datalog source level transforms.
2. Stratify and determine iteration order.
3. Translate into relational algebra IR.
4. Optimize IR and replace relational algebra ops with equivalent BDD ops.
5. Assign relation attributes to physical BDD domains.
6. Perform more optimizations after domain assignment.
7. Interpret the resulting program.

High-Level Transform: Magic Set Transformation
• Add “magic” predicates to control generated tuples [Bancilhon 1986, Beeri 1987]
  – Combines ideas from top-down and bottom-up evaluation
• Doesn’t always help
  – Leads to more iterations
  – BDDs are good at large operations
• Rely on user specification

Predicate Dependency Graph
• Add edge from RHS to LHS
  vPointsTo(v, o)      :- vPointsTo0(v, o).
  vPointsTo(v1, o)     :- Assign(v1, v2), vPointsTo(v2, o).
  hPointsTo(o1, f, o2) :- Store(v1, f, v2), vPointsTo(v1, o1), vPointsTo(v2, o2).
  vPointsTo(v2, o2)    :- Load(v1, f, v2), vPointsTo(v1, o1), hPointsTo(o1, f, o2).
[Figure: graph over the predicates vPointsTo0, Assign, Load, vPointsTo, Store, hPointsTo]

Determining Iteration Order
• Tradeoff between faster convergence and BDD cache locality
• Static heuristic
  – Visit rules in reverse post-order
  – Iterate shorter loops before longer loops
• Profile-directed feedback
• User can control iteration order

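A rough sketch of the graph construction and of a reverse post-order over it. Note the simplification: the slide orders rules, while this example orders predicates, and it ignores the "shorter loops first" heuristic. `RULES`, `dependency_edges` and `reverse_post_order` are illustrative names, not bddbddb APIs.

```python
# Each rule is (head predicate, [body predicates]).
RULES = [
    ("vPointsTo", ["vPointsTo0"]),
    ("vPointsTo", ["Assign", "vPointsTo"]),
    ("hPointsTo", ["Store", "vPointsTo", "vPointsTo"]),
    ("vPointsTo", ["Load", "vPointsTo", "hPointsTo"]),
]


def dependency_edges(rules):
    """One edge from every RHS (body) predicate to the LHS (head) predicate."""
    return {(body_pred, head) for head, body in rules for body_pred in body}


def reverse_post_order(edges):
    """Depth-first search over the graph; return nodes in reverse post-order."""
    successors, nodes = {}, set()
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
        nodes.update((src, dst))
    seen, post = set(), []

    def dfs(node):
        seen.add(node)
        for nxt in sorted(successors.get(node, ())):
            if nxt not in seen:
                dfs(nxt)
        post.append(node)

    for node in sorted(nodes):
        if node not in seen:
            dfs(node)
    return post[::-1]


edges = dependency_edges(RULES)
order = reverse_post_order(edges)
print(order)
```

Input relations such as vPointsTo0 and Store come out ahead of the predicates derived from them, while vPointsTo and hPointsTo depend on each other and so form the loop that has to be iterated to a fixed point.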
Datalog to Relational Algebra
  vPointsTo(v1, o) :- Assign(v1, v2), vPointsTo(v2, o).

  t1 = ρ_{variable→source}(vPointsTo);
  t2 = assign ⋈ t1;
  t3 = π_{source}(t2);
  t4 = ρ_{dest→variable}(t3);
  vPointsTo = vPointsTo ∪ t4;

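The five relational algebra steps can be traced on a tiny input. The sketch below assumes the attribute names (dest, source) for assign and (variable, heap) for vPointsTo; "heap" is a made-up name, since the slide does not name that attribute. π_{source} is read here as projecting the source attribute away, because the following step still renames dest. The helpers are illustrative and operate on plain Python sets, not BDDs.

```python
# A relation is (schema, rows): a tuple of attribute names and a set of tuples.

def rename(rel, old, new):
    schema, rows = rel
    return (tuple(new if a == old else a for a in schema), set(rows))


def join(r, s):
    """Natural join on the attributes the two schemas share."""
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    common = [a for a in r_schema if a in s_schema]
    extra = [a for a in s_schema if a not in r_schema]
    out = set()
    for x in r_rows:
        xd = dict(zip(r_schema, x))
        for y in s_rows:
            yd = dict(zip(s_schema, y))
            if all(xd[a] == yd[a] for a in common):
                out.add(x + tuple(yd[a] for a in extra))
    return (tuple(r_schema) + tuple(extra), out)


def project_away(rel, attr):
    schema, rows = rel
    keep = [i for i, a in enumerate(schema) if a != attr]
    return (tuple(schema[i] for i in keep),
            {tuple(row[i] for i in keep) for row in rows})


def union(r, s):
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    idx = [s_schema.index(a) for a in r_schema]   # align column order
    return (r_schema, r_rows | {tuple(row[i] for i in idx) for row in s_rows})


assign = (("dest", "source"), {("a", "b")})        # a = b;
vPointsTo = (("variable", "heap"), {("b", "o1")})  # b points to o1

t1 = rename(vPointsTo, "variable", "source")
t2 = join(assign, t1)
t3 = project_away(t2, "source")
t4 = rename(t3, "dest", "variable")
vPointsTo = union(vPointsTo, t4)
print(sorted(vPointsTo[1]))  # [('a', 'o1'), ('b', 'o1')]
```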
Incrementalization
Before:
  t1 = ρ_{variable→source}(vP);
  t2 = assign ⋈ t1;
  t3 = π_{source}(t2);
  t4 = ρ_{dest→variable}(t3);
  vP = vP ∪ t4;
After:
  vP'' = vP - vP';
  vP' = vP;
  assign'' = assign - assign';
  assign' = assign;
  t1 = ρ_{variable→source}(vP'');
  t2 = assign ⋈ t1;
  t5 = ρ_{variable→source}(vP);
  t6 = assign'' ⋈ t5;
  t7 = t2 ∪ t6;
  t3 = π_{source}(t7);
  t4 = ρ_{dest→variable}(t3);
  vP = vP ∪ t4;

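The incrementalized form joins only against what changed since the previous iteration. A minimal set-based sketch follows, with the rename and project steps folded into one comprehension; `apply_rule`, `incremental_step` and `solve` are illustrative names rather than bddbddb functions.

```python
def apply_rule(assign, vp):
    """vP(v1, o) :- Assign(v1, v2), vP(v2, o): rename, join, project in one step."""
    return {(dest, o) for (dest, src) in assign for (var, o) in vp if var == src}


def incremental_step(vp, vp_prev, assign, assign_prev):
    d_vp = vp - vp_prev               # vP''     = vP - vP'
    d_assign = assign - assign_prev   # assign'' = assign - assign'
    t2 = apply_rule(assign, d_vp)     # full assign joined with new vP tuples
    t6 = apply_rule(d_assign, vp)     # new assign tuples joined with full vP
    return vp | t2 | t6               # vP = vP ∪ t4


def solve(assign, vp0):
    vp, vp_prev, assign_prev = set(vp0), set(), set()
    while True:
        new_vp = incremental_step(vp, vp_prev, assign, assign_prev)
        if new_vp == vp:
            return vp
        vp_prev, assign_prev, vp = vp, set(assign), new_vp


# a = b; b = c; c = d;  and d points to o1
assign = {("a", "b"), ("b", "c"), ("c", "d")}
result = solve(assign, {("d", "o1")})
print(sorted(result))
# [('a', 'o1'), ('b', 'o1'), ('c', 'o1'), ('d', 'o1')]
```

After the first iteration assign'' is empty, so each later step only joins the handful of newly derived vP tuples instead of the whole relation.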
Optimize into BDD operations
Before:
  vP'' = vP - vP';
  vP' = vP;
  assign'' = assign - assign';
  assign' = assign;
  t1 = ρ_{variable→source}(vP'');
  t2 = assign ⋈ t1;
  t5 = ρ_{variable→source}(vP);
  t6 = assign'' ⋈ t5;
  t7 = t2 ∪ t6;
  t3 = π_{source}(t7);
  t4 = ρ_{dest→variable}(t3);
  vP = vP ∪ t4;
After:
  vP'' = diff(vP, vP');
  vP' = copy(vP);
  t1 = replace(vP'', variable→source);
  t3 = relprod(t1, assign, source);
  t4 = replace(t3, dest→variable);
  vP = or(vP, t4);

Physical domain assignment
Before:
  vP'' = diff(vP, vP');
  vP' = copy(vP);
  t1 = replace(vP'', variable→source);
  t3 = relprod(t1, assign, source);
  t4 = replace(t3, dest→variable);
  vP = or(vP, t4);
After:
  vP'' = diff(vP, vP');
  vP' = copy(vP);
  t3 = relprod(vP'', assign, V0);
  t4 = replace(t3, V1→V0);
  vP = or(vP, t4);
• Minimizing renames is NP-complete
• Renames have vastly different costs
• Priority-based assignment algorithm

Other optimizations
• Dead code elimination
• Constant propagation
• Definition-use chaining
• Redundancy elimination
• Global value numbering
• Copy propagation
• Liveness analysis

Variable Numbering: Active Machine Learning
• Must be determined dynamically
• Limit trials with properties of relations
• Each trial may take a long time
• Active learning: select trials based on uncertainty
• Several hours
• Comparable to exhaustive for small apps

Experimental Results

Related Work
• Datalog in Program Analysis
  – Specify as Datalog query [Ullman 1989]
  – Toupie system [Corsini 1993]
  – Demand-driven using magic sets [Reps 1994]
  – Program analysis with logic programming [Dawson 1996]
  – Crocopat system [Beyer 2003]
  – Modular class analysis [Besson 2003]
• BDDs in Program Analysis
  – Predicate abstraction [Ball 2000]
  – Shape analysis [Manevich 2002, Yavuz-Kahveci 2002]
  – Pointer Analysis [Zhu 2002, Berndl 2003, Zhu 2004]
  – Jedd system [Lhotak 2004]

Related Work
• BDD Variable Ordering
  – Variable ordering is NP-complete [Bollig 1996]
  – Interleaving [Fujii 1993]
  – Sifting [Rudell 1993]
  – Genetic algorithms [Drechsler 1995]
  – Machine learning for BDD orders [Grumberg 2003]
• Efficient Evaluation of Datalog
  – Semi-naïve evaluation [Balbin 1987]
  – Bottom-up evaluation [Ullman 1989, Ceri 1990, Naughton 1991]
  – Top-down evaluation with tabling [Tamaki 1986, Chen 1996]
  – Rule ordering [Ramakrishnan 1990]
  – Magic sets transformation [Bancilhon 1986]
  – Computing with BDDs [Iwaihara 1995]
  – Time and space guarantees [Liu 2003]

Program Analysis with bddbddb
• Context-sensitive Java pointer analysis
• C pointer analysis
• Escape analysis
• Type analysis
• External lock analysis
• Finding memory leaks
• Interprocedural def-use
• Interprocedural mod-ref
• Object-sensitive analysis
• Cartesian product algorithm
• Resolving Java reflection
• Bounds check elimination
• Finding race conditions
• Finding Java security vulnerabilities
• And many more…
Performance better than handcoded!

Conclusion
• bddbddb: new paradigm in program analysis
  – Datalog compiled into optimized BDD operations
  – Efficiently and easily implement context-sensitive analyses
  – Easier to develop correct analyses
  – Easily experiment with new ideas
  – Growing library of program analyses
  – Easily use and build upon work of others
• Available as open-source LGPL: http://bddbddb.sourceforge.net

That’s all, folks!
Thanks for sticking around for all 54 slides!

My Contribution (2)
bddbddb (BDD-based deductive database)
  – Pointer analysis in 6 lines of Datalog (a database language)
    • Hard to create & debug efficient BDD-based algorithms (3451 lines, 1 man-year)
    • Automatic optimizations in bddbddb
  – Easy to create context-sensitive analyses using pointer analysis results (a few lines)
  – Created many analyses using bddbddb

Outline
• Pointer Analysis
  – Problem Overview
  – Brief History
  – Pointer Analysis in Datalog
• Context Sensitivity
• Improving Performance
• bddbddb: BDD-based deductive database
• Experimental Results
  – Analysis Time
  – Analysis Memory
  – Analysis Accuracy
• Conclusion

Performance is Tricky!
• Context-sensitive numbering scheme
  – Modify BDD library to add special operations.
  – Can’t even analyze small programs.
  Time:
• Improved variable ordering
  – Group similar BDD variables together.
  – Interleave equivalence relations.
  – Move common subsets to edges of variable order.
  Time: 40 h
• Incrementalize outermost loop
  – Very tricky, many bugs.
  Time: 36 h
• Factor away control flow, assignments
  – Reduces number of variables.
  Time: 32 h

Performance is Tricky!
• Exhaustive search for best BDD order
  – Limit search space by not considering intradomain orderings.
  Time: 10 h
• Eliminate expensive rename operations
  – When rename changes relative order, result is not isomorphic.
  Time: 7 h
• Improved BDD memory layout
  – Preallocate to guarantee contiguous.
  Time: 6 h
• BDD operation cache tuning
  – Too small: redo work, too big: bad locality
  – Parameter sweep to find best values.
  Time: 2 h

Performance is Tricky!
• Simplified treatment of exceptions
  – Reduce number of variables, iterations necessary for convergence.
  Time: 1 h
• Change iteration order
  – Required redoing much of the code.
  Time: 48 m
• Eliminate redundant operations
  – Introduced subtle bugs.
  Time: 45 m
• Specialized caches for different operations
  – Different caches for and, or, etc.
  Time: 41 m

Performance is Tricky!
• Compacted BDD nodes
  – 20 bytes → 16 bytes
  Time: 38 m
• Improved BDD hashing function
  – Simpler hash function.
  Time: 37 m
• Total development time: 1 year
  – 1 year per analysis?!?
• Optimizations obscured the algorithm.
• Many bugs discovered, maybe still more.

bddbddb: BDD-Based Deductive DataBase
• Automatically generate from Datalog
  – Optimizations based on my experience with handcoded version.
  – Plus traditional compiler algorithms.
• bddbddb even better than handcoded!
  – handcoded: 37 m
  – bddbddb: 19 m

Java Security Vulnerabilities (due to V. Benjamin Livshits)

  Application     Classes   Reported errors        Reported errors      Actual
  name                      (context-insensitive)  (context-sensitive)  errors
  blueblog        306       1                      1                    1
  webgoat         349       81                     6                    6
  blojsom         428       48                     2                    2
  personalblog    611       350                    2                    2
  snipsnap        653       >321                   27                   15
  road2hiberna    867       15                     1                    1
  pebble          889       427                    1                    1
  roller          989       >267                   1                    1
  Total           5356      >1508                  41                   29

Vulnerabilities Found

              SQL         HTTP        Cross-site   Path
              injection   splitting   scripting    traversal   Total
  Header      0           6           4            0           10
  Parameter   6           5           0            2           13
  Cookie      1           0           0            0           1
  Non-Web     2           0           0            3           5
  Total       9           11          4            5           29

Summary of Contributions
• The first scalable context-sensitive subset-based pointer analysis.
  – Cloning-based technique using BDDs
  – Clever context numbering
  – Experimental results on the effects of context sensitivity
• bddbddb: new paradigm in program analysis
  – Efficiently and easily implement context-sensitive analyses
  – Datalog compiled into optimized BDD operations
  – Library of program analyses (with many others)
  – Active learning for BDD variable orders (with M. Carbin)
• Artifacts:
  – Joeq compiler and virtual machine
  – JavaBDD library and BuDDy library
  – bddbddb tool

Looking Forward
• Program analysis for the masses
  – Integrate into software development process
  – Programmers, domain-specialists specify their own “patterns”
• Important work still to come
  – Technology issues
  – User-interface issues
  – Programmer culture issues

Conclusion
• The first scalable context-sensitive subset-based pointer analysis.
  – Accurate: Results for up to 10^14 contexts.
  – Scales to large programs.
• bddbddb: a new paradigm in program analysis
  – High-level spec → Efficient implementation
• System is publicly available at: http://bddbddb.sourceforge.net