Petablox: Declarative Program Analysis for Big Code
Mayur Naik
Joint work with: Ravi Mangal, Xin Zhang (Georgia Tech); Hongseok Yang, Radu Grigore (Oxford Univ.); Aditya Nori (MSR)

Background
• Problem: Automatically infer or predict salient behaviors or vulnerabilities in a given program
  • Long-standing problem in program analysis
  • Difficult: tradeoffs, uncertain or missing specifications, etc.
• Idea: Can we leverage collective knowledge amassed from analyzing existing programs?
UC Berkeley 10/31/2021

Example: Integer overflow vulnerability (1/3)
• CVE-2009-1570 (GIMP)

    if (Bitmap_Head.biWidth < 0) {
        g_set_error (error, G_FILE_ERROR_FAILED,
                     _("'%s' is not a valid BMP file"),
                     gimp_filename_to_utf8 (filename));
        return -1;
    }
    ...
    rowbytes = ((Bitmap_Head.biWidth * Bitmap_Head.biBitCnt - 1) / 32) * 4 + 4;
    ...
    buffer = g_malloc (rowbytes);

Example: Integer overflow vulnerability (2/3)
• CVE-2011-2194 (VLC Player)

    if ( p_sys->i_track_id < 0 ) {
        input_item_node_AppendNode( p_input_node, p_new_node );
        vlc_gc_decref( p_new_input );
        return true;
    }
    ...
    input_item_t **pp;
    pp = realloc( p_sys->pp_tracklist,
                  (p_sys->i_track_id + 1) * sizeof(*pp) );

Example: Integer overflow vulnerability (3/3)
• CVE-2013-0913 (Linux Kernel)

    if (args->buffer_count < 1) {
        DRM_ERROR("execbuf2 with %d buffers\n", args->buffer_count);
        return -EINVAL;
    }
    exec2_list = kmalloc(sizeof(*exec2_list) * args->buffer_count,
                         GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);

What specification to check?
Integer overflows?
  + well-defined
  − necessary but not sufficient (many benign overflows)
The pattern: “Integer overflow on an expression derived from an input variable after some sanitization but before the expression is used to allocate a memory buffer”
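To make the pattern concrete, the following sketch replays the GIMP rowbytes expression with 32-bit wraparound. It is illustrative only: signed overflow is undefined behavior in C, and wraparound is merely what commonly happens in practice. The helper names and the input values (a width of 0x08000001 at 32 bits per pixel) are assumptions chosen for the example, not taken from the CVE report.

```python
def wrap_int32(x: int) -> int:
    """Reduce x to a signed 32-bit value (two's-complement wraparound)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def c_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


def rowbytes_int32(width: int, bit_count: int) -> int:
    """The rowbytes expression, wrapping to 32 bits after every step."""
    product = wrap_int32(width * bit_count)
    numerator = wrap_int32(product - 1)
    return wrap_int32(c_div(numerator, 32) * 4 + 4)


def rowbytes_exact(width: int, bit_count: int) -> int:
    """The same expression over unbounded integers (the intended value)."""
    return ((width * bit_count - 1) // 32) * 4 + 4


# Hypothetical attacker-chosen header values: the width passes the "< 0" check.
width, bit_count = 0x08000001, 32
print(rowbytes_exact(width, bit_count))   # 536870916 bytes intended per row
print(rowbytes_int32(width, bit_count))   # 4 bytes actually requested
```

The sanitization check on the sign of the width succeeds, yet the buffer passed to the allocator is far smaller than a row of pixel data, which is the situation the pattern describes.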

How to check the specification?
Combination of:
• Integer overflow analysis
• Information-flow analysis
• Alias analysis
• Concurrency analysis
The pattern: “Integer overflow on an expression derived from an input variable after some sanitization but before the expression is used to allocate a memory buffer”

What information do the analyses need?
• Information-flow analysis must know the sensitive sink:
  • first argument of g_malloc in GIMP
  • second argument of realloc in VLC
• Environment assumptions
• Behavior of missing program parts
• Loop invariants
• Function pre/post conditions
• …

How effective are the analyses?
• Necessarily approximate for undecidability reasons
• Must strike tradeoffs between soundness, completeness, and scalability

Declarative program analysis using Datalog

    flow(v1, v2) :- assign(v2, e1), ref(e1, v1).

Input tuples (e and f label the two right-hand-side expressions):
    assign(tmp, e)   ref(e, biwidth)   assign(rowbytes, f)   ref(f, tmp)
Derived tuples:
    flow(biwidth, tmp)   flow(tmp, rowbytes)

Expressing fixpoint computations

    flow(v1, v2) :- assign(v2, e1), ref(e1, v1).
    flow(v1, v3) :- flow(v1, v2), flow(v2, v3).

Input tuples:
    assign(tmp, e)   ref(e, biwidth)   assign(rowbytes, f)   ref(f, tmp)
Derived tuples:
    flow(biwidth, tmp)   flow(tmp, rowbytes)   flow(biwidth, rowbytes)
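The two flow rules can be evaluated by naive iteration to a fixpoint. The sketch below is a hand-written Python stand-in for a Datalog engine, using the input tuples from the slide; the function name is an assumption, and a real solver such as bddbddb evaluates the rules very differently.

```python
def flow_fixpoint(assign, ref):
    """Naively evaluate the two flow rules until no new tuple is derived."""
    # Rule 1: flow(v1, v2) :- assign(v2, e1), ref(e1, v1).
    flow = {(v1, v2) for (v2, e1) in assign for (e2, v1) in ref if e1 == e2}
    changed = True
    while changed:
        # Rule 2: flow(v1, v3) :- flow(v1, v2), flow(v2, v3).
        derived = {(v1, v3) for (v1, v2) in flow for (w, v3) in flow if v2 == w}
        changed = not derived <= flow
        flow |= derived
    return flow


# Input tuples from the slide.
assign = {("tmp", "e"), ("rowbytes", "f")}
ref = {("e", "biwidth"), ("f", "tmp")}

for fact in sorted(flow_fixpoint(assign, ref)):
    print("flow%s" % (fact,))
```

The loop stops once an iteration adds nothing new, which is the least fixpoint of the rules over the given inputs.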

Derivations of analysis results
• Expressive: enables analytics clients to mine rich features and patterns
• Uniform: spans reasoning performed across multiple analyses
• Portable: does not require modifying the underlying constraint solver
[Figure: derivation of flow(biwidth, rowbytes) from assign(tmp, e), ref(e, biwidth), assign(rowbytes, f), ref(f, tmp) via flow(biwidth, tmp) and flow(tmp, rowbytes)]

Combining logical and probabilistic reasoning
Hard constraints:

    flow(v1, v2) :- assign(v2, e1), ref(e1, v1).
    flow(v1, v3) :- flow(v1, v2), flow(v2, v3).

Soft constraints:

    vulnerable(v) :- source(v), overflow(v), sink(v).         weight 0.84
    sink(v) :- flow(v, v2), arg(v2, m, k), alloc(m, k).       weight 0.95

• Hard optimization problem (MaxSAT)
• Two phases: grounding, then solving; both hard to scale
• Where do weights come from?
  • Crowdsourcing, active learning, …
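As a rough illustration of mixing hard and soft constraints, the sketch below solves a tiny weighted MaxSAT instance by brute-force enumeration. It is not how a real MaxSAT solver works. The instance grounds the two soft rules for a single variable v; the proposition reaches_alloc (standing for the conjunction flow, arg, alloc for one grounding) and the 0.5-weight bias against reporting are assumptions added for the example.

```python
from itertools import product


def solve_maxsat(variables, hard, soft):
    """Return (model, weight) maximizing satisfied soft weight subject to hard clauses.

    A clause is a list of (variable, polarity) literals; soft is a list of
    (clause, weight) pairs. Exhaustive search, so only usable for tiny inputs.
    """
    best_model, best_weight = None, -1.0
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))

        def satisfied(clause):
            return any(model[var] == polarity for var, polarity in clause)

        if not all(satisfied(clause) for clause in hard):
            continue
        weight = sum(w for clause, w in soft if satisfied(clause))
        if weight > best_weight:
            best_model, best_weight = model, weight
    return best_model, best_weight


variables = ["source", "overflow", "reaches_alloc", "sink", "vulnerable"]
# Hard clauses: facts established by the logical part of the analysis.
hard = [[("source", True)], [("overflow", True)], [("reaches_alloc", True)]]
soft = [
    # sink(v) :- flow(v, v2), arg(v2, m, k), alloc(m, k).  weight 0.95
    ([("reaches_alloc", False), ("sink", True)], 0.95),
    # vulnerable(v) :- source(v), overflow(v), sink(v).  weight 0.84
    ([("source", False), ("overflow", False), ("sink", False), ("vulnerable", True)], 0.84),
    # Hypothetical bias against reporting (not on the slide).
    ([("vulnerable", False)], 0.5),
]

model, weight = solve_maxsat(variables, hard, soft)
print(model["sink"], model["vulnerable"], round(weight, 2))  # True True 1.79
```

Raising the bias weight above 0.84 would flip the outcome to not reporting the vulnerability, which is one way to see why the choice of weights matters.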

Declarative program analysis: Prevalent view
[Figure: Program text → Constraint generation → Datalog constraints → Constraint resolution → Analysis result]
• Separates analysis specification from implementation
• Enables sophisticated implementations
• Provides natural program specifications

Declarative program analysis: Our view
• Goal: extend these benefits in the context of common and emerging use-cases of analyses
  • Client-driven analysis: find good program abstractions
  • Summary-based analysis: transfer analysis results across programs
  • User-guided analysis: incorporate analysis users’ feedback
• Idea: Automatically synthesize analysis use-cases

Example use-case: client-driven analysis
[Figure: Program text → Constraint generation → Datalog constraints → Constraint resolution → Analysis result. Refinement loop: Counterexamples → Constraint generation → MaxSAT constraints → Constraint resolution → Refined abstraction]

Petablox program analysis framework

Rest of the talk: Two use-cases
• Client-driven analysis: finding suitable program abstractions
• User-guided analysis: incorporating analysis users’ feedback

Pointer analysis example

    f() {
        v1 = new ...;
        v2 = id1(v1);
        v3 = id2(v2);
        q2: assert(v3 != v1);
    }
    g() {
        v4 = new ...;
        v5 = id1(v4);
        v6 = id2(v5);
        q1: assert(v6 != v1);
    }
    id1(v) { return v; }
    id2(v) { return v; }

Pointer analysis as graph reachability
[Figure: graph over nodes 0 to 7 (including 6’, 6’’, 7’, 7’’) with edges labeled a0/a1, b0/b1, c0/c1, d0/d1]

Graph reachability in Datalog
[Figure: graph over nodes 0 to 7 (including 6’, 6’’, 7’, 7’’) with edges labeled a0/a1, b0/b1, c0/c1, d0/d1]
Input relations: edge(i, j, n), abs(n)
Output relations: path(i, j)
Rules:
    (1) path(i, i).
    (2) path(i, j) :- path(i, k), edge(k, j, n), abs(n).
Input tuples: edge(0, 6, a0), edge(0, 6’, a1), edge(3, 6, b0), …
Query tuple        Original query
q1: path(0, 5)     assert(v6 != v1)
q2: path(0, 2)     assert(v3 != v1)
16 possible abstractions in total

Desired result
[Figure: the same labeled graph]
Query              Answer
q1: path(0, 5)     a1 b0 c1 d0
q2: path(0, 2)     Impossibility

Iteration 1
[Figure: the same labeled graph]
Queries: q1: path(0, 5); q2: path(0, 2). No abstractions eliminated yet.
Derivations under the abstraction a0 b0 c0 d0:
    path(0, 0).
    path(0, 6) :- path(0, 0), edge(0, 6, a0), abs(a0).
    path(0, 1) :- path(0, 6), edge(6, 1, a0), abs(a0).
    path(0, 7) :- path(0, 1), edge(1, 7, c0), abs(c0).
    path(0, 2) :- path(0, 7), edge(7, 2, c0), abs(c0).
    path(0, 4) :- path(0, 6), edge(6, 4, b0), abs(b0).
    path(0, 7) :- path(0, 4), edge(4, 7, d0), abs(d0).
    path(0, 5) :- path(0, 7), edge(7, 5, d0), abs(d0).
    …
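The Iteration 1 derivations can be reproduced with a small fixpoint computation over the path rules. The sketch below uses only the edge tuples that appear on the slides (the full graph is elided there), so it should only be read for the abstraction a0 b0 c0 d0; the function and variable names are assumptions for the example.

```python
def paths(nodes, edges, abstraction):
    """Evaluate the two path rules, using only edges whose label is in abs."""
    # Rule (1): path(i, i).
    path = {(n, n) for n in nodes}
    changed = True
    while changed:
        # Rule (2): path(i, j) :- path(i, k), edge(k, j, n), abs(n).
        derived = {
            (i, j)
            for (i, k) in path
            for (k2, j, label) in edges
            if k == k2 and label in abstraction
        }
        changed = not derived <= path
        path |= derived
    return path


# Only the edges visible on the slides; the remaining edges are not shown there.
edges = {
    ("0", "6", "a0"), ("0", "6'", "a1"), ("3", "6", "b0"),
    ("6", "1", "a0"), ("1", "7", "c0"), ("7", "2", "c0"),
    ("6", "4", "b0"), ("4", "7", "d0"), ("7", "5", "d0"),
}
nodes = {n for (i, j, _) in edges for n in (i, j)}

result = paths(nodes, edges, {"a0", "b0", "c0", "d0"})
print(("0", "5") in result, ("0", "2") in result)  # True True
```

Both query tuples are derivable under the cheapest abstraction, so neither assertion is proven in this iteration and the abstraction has to be refined.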

Iteration 1 - derivation graph
[Figure: the same labeled graph]
Queries: q1: path(0, 5); q2: path(0, 2). No abstractions eliminated yet.

Iteration 1 - derivation graph
[Figure: derivation graph rooted at path(0, 0), with intermediate tuples path(0, 6), path(0, 1), path(0, 4), path(0, 7) and the edge and abs tuples of the Iteration 1 derivations, leading to path(0, 2) and path(0, 5)]

Iteration 1 - derivation graph
[Figure: the same labeled graph]
q1: path(0, 5)   eliminated abstractions: a0c0d0, a0b0d0 (4/16)
q2: path(0, 2)   eliminated abstractions: a0c0 (4/16)

Encoded as MaxSAT
• Avoid all the counterexamples
• Minimize the abstraction cost
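A minimal sketch of this optimization, with exhaustive enumeration standing in for a MaxSAT solver: each counterexample is the set of abs tuples used by one derivation of a query, and an abstraction is ruled out if it contains all of them. The cost model (number of parameters set to 1) and the function name are assumptions for the example.

```python
from itertools import product

PARAMS = ["a", "b", "c", "d"]


def cheapest_abstraction(counterexamples):
    """Return (cost, abstraction) avoiding every counterexample, or None.

    Assumed cost model: the number of parameters set to 1. Ties are broken
    by enumeration order, which a real solver would not guarantee.
    """
    best = None
    for bits in product([0, 1], repeat=len(PARAMS)):
        chosen = {f"{p}{b}" for p, b in zip(PARAMS, bits)}
        # An abstraction containing a whole counterexample reproduces
        # that derivation, so it cannot prove the query.
        if any(cex <= chosen for cex in counterexamples):
            continue
        cost = sum(bits)
        if best is None or cost < best[0]:
            best = (cost, chosen)
    return best


# Counterexamples for q1 taken from the Iteration 1 derivations of path(0, 5).
q1_counterexamples = [{"a0", "c0", "d0"}, {"a0", "b0", "d0"}]
cost, abstraction = cheapest_abstraction(q1_counterexamples)
print(cost, sorted(abstraction))
```

When the counterexamples rule out all 16 abstractions the function returns None, which corresponds to the impossibility verdict.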

Encoded as MaxSAT
q1: path(0, 5)   eliminated abstractions: a0c0d0, a0b0d0 (4/16)
q2: path(0, 2)   eliminated abstractions: a0c0 (4/16)

Iteration 2 and beyond
Iteration 1 [Figure: loop between Datalog solver and MaxSAT solver]
q1: path(0, 5)   eliminated abstractions: a0c0d0, a0b0d0 (4/16)
q2: path(0, 2)   eliminated abstractions: a0c0, a1c0 (4/16)

Iteration 2 and beyond
Iteration 2 [Figure: loop between Datalog solver and MaxSAT solver]
q1: path(0, 5)   eliminated abstractions: a0c0d0, a0b0d0 (4/16)
q2: path(0, 2)   eliminated abstractions: a0c0, a1c0 (4/16)

Iteration 2 and beyond
Iteration 2 [Figure: loop between Datalog solver and MaxSAT solver]
q1: path(0, 5)   eliminated abstractions: a0c0d0, a0b0d0 (4/16)
q2: path(0, 2)   eliminated abstractions: a0c0 (4/16)

Iteration 2 and beyond
Iteration 2 [Figure: loop between Datalog solver and MaxSAT solver]
q1: path(0, 5)   eliminated abstractions: a0c0d0, a0b0d0, a1c0d0 (6/16)
q2: path(0, 2)   eliminated abstractions: a0c0, a1c0 (8/16)

Iteration 2 and beyond
Iteration 3 [Figure: loop between Datalog solver and MaxSAT solver]
q1 is proven.
q1: path(0, 5)   eliminated abstractions: a0c0d0, a0b0d0, a1c0d0 (6/16)
q2: path(0, 2)   eliminated abstractions: a0c0, a1c0 (8/16)

Iteration 2 and beyond
Iteration 3 [Figure: loop between Datalog solver and MaxSAT solver]
q1 is proven. q2 is impossible to prove.
q1: path(0, 5)   eliminated abstractions: a0c0d0, a0b0d0, a1c0d0 (6/16)
q2: path(0, 2)   answer: Impossibility; eliminated abstractions: a0c0, a1c1, a0c1 (16/16)

Mixing counterexamples
[Figure: counterexamples from Iteration 1 and Iteration 3 and the abstractions they eliminate]

Mixing counterexamples
[Figure: counterexamples from Iteration 1 and Iteration 3, mixed, and the abstractions they eliminate]

Experimental setup
• Implemented using off-the-shelf solvers:
  • Datalog: bddbddb
  • MaxSAT: MiFuMaX
• Applied to two analyses that are challenging to scale:
  • k-object-sensitivity pointer analysis: flow-insensitive, weak updates, cloning-based
  • typestate analysis: flow-sensitive, strong updates, summary-based
• Evaluated on 8 Java programs (250-450 KLOC each)

Pointer analysis results
Baseline: 4-object-sensitivity. Slide callouts: baseline resolves < 50% of queries; final abstraction size is < 3% of max.

              queries   resolved   resolved     abstraction size    iterations
              total     (current)  (baseline)   final     max
toba-s        7         7          0            170       18K       10
javasrc-p     46        46         0            470       18K       13
weblech       5         5          2            140       31K       10
hedc          47        47         6            730       29K       18
antlr         143       ?          5            970       29K       15
luindex       138       ?          67           1K        40K       26
lusearch      322       ?          29           1K        39K       17
schroeder-m   51        51         25           450       58K       15

Performance of Datalog solver (lusearch)
[Figure: Datalog solver running time per iteration]
Baseline: k = 4: 3h 28m; k = 3: 590s; k = 2: 214s; k = 1: 153s

Performance of MaxSAT solver (lusearch)
[Figure: MaxSAT solver running time per iteration]

Statistics of MaxSAT formulae (pointer analysis)

              variables   clauses
toba-s        0.7M        1.5M
javasrc-p     0.5M        0.9M
weblech       1.6M        3.3M
hedc          1.2M        2.7M
antlr         3.6M        6.9M
luindex       2.4M        5.6M
lusearch      2.1M        5.0M
schroeder-m   6.7M        23.7M

User-guided analysis: Motivation
• Analysis writers make various approximations
  • Properties may be impossible to define precisely (e.g., security vulnerabilities, harmful race conditions, etc.)
  • Computing exact solutions is impossible or prohibitively costly
  • Program parts are missing or opaque to the analysis
• => Analyses produce false positives or false negatives
• Idea: shift decisions about approximation from analysis writers to analysis users

User-guided analysis: Our approach

Simplified datarace analysis in Datalog
Input relations: next(p1, p2), mayAlias(p1, p2), guarded(p1, p2)
Output relations: parallel(p1, p2), race(p1, p2)
Rules:
    (1) parallel(p3, p2) :- parallel(p1, p2), next(p3, p1).   weight w1
    (2) parallel(p1, p2) :- parallel(p2, p1).
    (3) race(p1, p2) :- parallel(p1, p2), mayAlias(p1, p2), ¬guarded(p1, p2).
    ¬parallel(p1, p2).   weight w0
    ¬race(p1, p2).       weight w0

A concurrent program: Apache ftp server

    public class RequestHandler {
        FtpRequestImpl request;
        FtpWriter writer;
        BufferedReader reader;
        Socket controlSocket;
        boolean isConnectionClosed;
        …
        // Return type changed from void so that the method can return the field.
        public FtpRequestImpl getRequest() {
            return request;               // x0
        }

        public void close() {
            synchronized (this) {
                if (isConnectionClosed)
                    return;
                isConnectionClosed = true;
            }
            request.clear();              // x1
            request = null;               // x2
            writer.close();               // y1
            writer = null;                // y2
            reader.close();
            reader = null;
            controlSocket.close();
            controlSocket = null;
        }
    }

Before user feedback

After user feedback

How does it work?
Input facts:
    next(x2, x1), mayAlias(x2, x1), ¬guarded(x2, x1),
    next(y1, x2), mayAlias(y2, y1), ¬guarded(y2, y1)
MaxSAT formula:
    (¬parallel(x1, x1) ∨ ¬next(x2, x1) ∨ parallel(x2, x1))   weight w1 ∧
    (¬parallel(x1, x2) ∨ ¬next(x2, x1) ∨ parallel(x2, x2))   weight w1 ∧
    … ∧
    (¬parallel(x2, x2) ∨ ¬next(y1, x2) ∨ parallel(y1, x2))   weight w1 ∧
    (¬parallel(y2, y1) ∨ ¬mayAlias(y2, y1) ∨ guarded(y2, y1) ∨ race(y2, y1)) ∧
    (¬parallel(x2, x1) ∨ ¬mayAlias(x2, x1) ∨ guarded(x2, x1) ∨ race(x2, x1)) ∧
    ¬race(x2, x1)   weight w2
Output facts (before feedback):
Output facts (after feedback):
    parallel(x0, x2), race(x0, x2), parallel(x2, x1), race(x2, x1), parallel(y2, y1), race(y2, y1)

Empirical evaluation
• Implemented using off-the-shelf solvers:
  • Datalog: bddbddb
  • MaxSAT: MCSls
• Applied to three different static analyses:
  • Datarace detection
  • Monomorphic call site inference
  • Downcast safety checking
• Evaluated on 7 Java programs (150-350 KLOC each)

Datarace analysis precision results

Datarace analysis scalability results

            Total ground    # iterations     Total time (hrs:mins)   # ground clauses
            clauses         Lazy    Guided   Lazy     Guided         Lazy    Guided
antlr       2.4 x 10^24     751     4        3:02     0:05           0.2M    0.3M
avrora      1.8 x 10^26     492     12       6:31     0:25           0.8M    1.6M
ftp         3.7 x 10^23     463     5        7:53     0:08           1.2M    1.4M
hedc        1.9 x 10^24     354     6        1:55     0:06           0.8M    0.9M
luindex     1.6 x 10^25     481     7        4:07     0:12           0.6M    1.1M
lusearch    1.7 x 10^25     429     6        2:38     0:14           0.6M    1.0M
weblech     4.4 x 10^24     416     6        1:59     0:07           0.6M    0.9M

Key takeaways
• Extend benefits of constraint-based analysis in the context of common and emerging use-cases of program analysis
• Requires reasoning about a mix of hard (inviolable, logical) and soft (violable, probabilistic) propositional constraints
• Motivates new problems and techniques to scale MaxSAT
• Motivates new problems and techniques in weight learning

Thank you!