Speeding Up Dataflow Analysis Using Flow-Insensitive Pointer Analysis


Speeding Up Dataflow Analysis Using Flow-Insensitive Pointer Analysis
Stephen Adams, Tom Ball, Manuvir Das, Sorin Lerner, Mark Seigle, Westley Weimer
Microsoft Research, University of Washington, UC Berkeley

Motivation
• Static analysis for program verification
• Complex dataflow analyses are popular
  – SLAM, ESP, BLAST, CQual, …
  – Flow-sensitive
  – Interprocedural
  – Expensive!
• Cut down on “data flow facts”
• Without losing anything important

General Idea
• If the complex analysis is worse than O(N)
• And you have a cheap analysis that
  – Is O(N)
  – Reduces N
• Then composing them saves time
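The arithmetic behind this slide can be made concrete with a toy cost model. This is only a sketch: the quadratic cost, the `linear_factor` constant, and the sizes are illustrative assumptions, not figures from the talk.

```python
def direct_cost(n):
    """Hypothetical cost of the complex analysis on n dataflow facts (assumed quadratic)."""
    return n * n


def composed_cost(n, n_reduced, linear_factor=10):
    """Cheap O(N) pre-pass over all n facts, then the complex analysis on the survivors."""
    return linear_factor * n + direct_cost(n_reduced)


print(direct_cost(1000))         # 1000000
print(composed_cost(1000, 100))  # 10 * 1000 + 100 * 100 = 20000
```

Under these assumptions the composition only pays off when the pre-pass really shrinks N: `composed_cost(1000, 1000)` is larger than `direct_cost(1000)`.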

Value Flow Graph (VFG)
• Variant of a points-to graph
• Encodes the flow of values in the program
• Conservative approximation
• Lightweight, fast to compute and query
• Early queries can safely reduce
  – data-flow facts considered
  – program points considered
• Like slicing a program with respect to value flow

Computing a VFG
• Use a subtyping-based pointer analysis
  – We used One-Level Flow [Das]
• Process all assignments
  – Not just those involving pointers
• Represent constant values explicitly
  – Put them in the graph
• Label graph with source locations
  – Encodes program slices

Example Points-To Graph
    1: int a, *x;
    2: x = &a;
    3: *x = 7;
[Figure: points-to graph with a points-to edge from x to a. Legend: points-to edge, source “address” node, expression node.]

One-Level Flow Graph
    1: int a, *x;
    2: x = &a;
    3: *x = 7;
[Figure: the same graph with flow edges added. Legend: flow edge, points-to edge, source “address” node, expression node.]

Value Flow Graph
    1: int a, *x;
    2: x = &a;
    3: *x = 7;
[Figure: value flow graph over x, a, and the constant 7, with edges labeled by source lines 2 and 3. Legend: flow edge, points-to edge, source “address” node, expression node.]

VFG Properties
• Computed in almost-linear time
• Get points-to sets from VFG in linear time
  – Backwards reachability via flow edges
  – Gather up all variables
• Get value flow from VFG in linear time
  – Backwards reachability via flow edges
  – Follow points-to edges up one level

VFG Query: Points-To of x
    1: int a, *x;
    2: x = &a;
    3: *x = 7;
[Figure: the value flow graph above, highlighting the backward walk from x along flow edges. Legend: flow edge, points-to edge, source “address” node, expression node.]

VFG Query: Value Flow into a
    1: int a, *x;
    2: x = &a;
    3: *x = 7;
[Figure: the value flow graph above, highlighting the backward walk from a. Legend: flow edge, points-to edge, source “address” node, expression node.]
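The two queries can be sketched on a hand-built graph for the three-line program. This is a simplified model, not One-Level Flow itself: the class `ValueFlowGraph`, its method names, and the exact edge set are assumptions made for illustration, and the way pointer-establishing lines are folded into the slice is only one reading of “follow points-to edges up one”.

```python
from collections import defaultdict


class ValueFlowGraph:
    """Toy VFG: flow edges carry source-line labels; names are illustrative."""

    def __init__(self):
        self.flow_in = defaultdict(list)    # node -> [(source node, line)]
        self.pointed_by = defaultdict(set)  # target node -> pointers to it
        self.address_nodes = {}             # "&a" -> "a"

    def add_address(self, var):
        node = "&" + var
        self.address_nodes[node] = var
        return node

    def add_flow(self, src, dst, line):
        self.flow_in[dst].append((src, line))

    def add_points_to(self, pointer, target):
        self.pointed_by[target].add(pointer)

    def _backward(self, start):
        """Backward reachability over flow edges; returns (nodes, line labels)."""
        seen, lines, stack = {start}, set(), [start]
        while stack:
            node = stack.pop()
            for src, line in self.flow_in[node]:
                lines.add(line)
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen, lines

    def points_to_set(self, var):
        """Walk backward from var and gather the variables behind address nodes."""
        seen, _ = self._backward(var)
        return {self.address_nodes[n] for n in seen if n in self.address_nodes}

    def value_flow(self, var):
        """Sources reaching var, plus the lines involved (assumed slice rule)."""
        seen, lines = self._backward(var)
        for node in list(seen):
            for pointer in self.pointed_by[node]:
                # One level up: lines that made `pointer` point at this node.
                lines |= self._backward(pointer)[1]
        sources = {n for n in seen if n != var and not self.flow_in[n]}
        return sources, lines


vfg = ValueFlowGraph()
vfg.add_flow(vfg.add_address("a"), "x", 2)  # line 2: x = &a
vfg.add_points_to("x", "a")
vfg.add_flow("7", "a", 3)                   # line 3: *x = 7 lands in a

print(vfg.points_to_set("x"))  # {'a'}
print(vfg.value_flow("a"))     # ({'7'}, {2, 3})
```

Each query visits every flow edge at most once per backward walk, which is the shape of the linear-time claim on the properties slide.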

VFG Summary
• Computed in almost-linear time
• Queries complete in linear time
• Approximates flow of values in program
• Two applications that benefit
  – ESP
  – SLAM

Application 1: ESP
• Verification tool for large C++ programs
• Tracks “typestate” of values
  – Encoded as a finite state machine
  – Special Error state
• Core: interprocedural data-flow engine
  – Flow-sensitive: state at every point
• Performed bottom-up on the call graph
• Requires function summaries
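A typestate property of the kind described above can be written as a small transition table. The states, operations, and transitions here are a hypothetical file-handle property chosen for illustration; the slides do not spell out the automaton ESP actually used.

```python
ERROR = "Error"

# Hypothetical typestate machine for a FILE * value.
TRANSITIONS = {
    ("Closed", "fopen"): "Opened",
    ("Opened", "fprintf"): "Opened",
    ("Opened", "fclose"): "Closed",
}


def step(state, op):
    """Any transition missing from the table leads to the special Error state."""
    return TRANSITIONS.get((state, op), ERROR)


def run(ops, state="Closed"):
    """Fold a sequence of operations over the machine and return the final state."""
    for op in ops:
        state = step(state, op)
    return state


print(run(["fopen", "fprintf", "fclose"]))  # Closed
print(run(["fprintf"]))                     # Error
```

Because no entry starts from `Error`, the error state is absorbing: once a value reaches it, every later operation keeps it there.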

ESP Function Summaries
• Consider stateful memory locations
• Summarize function behavior for each location
  – Reducing the number of locations would be good!
  – But C has evil casts, so types cannot be used
• Worst-case set of locations:
  – All globals and formal parameters
  – Everything transitively reachable from there

Reduce Location Set
• Location L needs to be considered in F if
  – Some expression E has its state changed in F
  – The value held by L at entry to F can flow into E
• Assuming state-changing operations are known
• Query VFG to find values that flow in

ESP Example
    FILE *e, *f, *g, *h;
    void foo() {
        FILE **p;
        int a = (int)h;
        if (…) p = &e;
        else p = &f;
        *p = fopen(…);
    }
Locations to consider for the foo() summary, worst case: { e, *e, f, *f, g, *g, h, *h }

With the VFG:
(1) Compute VFG
(2) Query value flow on *p
(3) Reduced locations to consider for the foo() summary: { e, f }
(4) Reduce lines to consider for dataflow
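The reduction for foo() can be sketched as a backward reachability query. The edge list below is a hand transcription of the assignments in foo(), and the helper names (`add_flow`, `cells_of`, `locations_to_summarize`) are hypothetical; treat it as an illustration of the idea rather than ESP's implementation.

```python
from collections import defaultdict

GLOBALS = ["e", "f", "g", "h"]

flow_in = defaultdict(list)  # destination -> list of sources


def add_flow(src, dst):
    flow_in[dst].append(src)


add_flow("h", "a")         # int a = (int)h;
add_flow("&e", "p")        # p = &e;
add_flow("&f", "p")        # p = &f;
add_flow("fopen()", "e")   # *p = fopen(...): p may point to e ...
add_flow("fopen()", "f")   # ... or to f


def backward(start):
    """All nodes whose value may flow into `start` (including itself)."""
    seen, stack = {start}, [start]
    while stack:
        for src in flow_in[stack.pop()]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


def cells_of(expr):
    """Cells an expression may denote: '*v' is whatever v may point to."""
    if expr.startswith("*"):
        return {n[1:] for n in backward(expr[1:]) if n.startswith("&")}
    return {expr}


def locations_to_summarize(state_changing_exprs, candidates):
    """Candidates whose entry value can flow into a state-changing expression."""
    needed = set()
    for expr in state_changing_exprs:
        for cell in cells_of(expr):
            needed |= backward(cell) & set(candidates)
    return needed


print(sorted(locations_to_summarize(["*p"], GLOBALS)))  # ['e', 'f']
```

In this model `h` drops out because its value only flows into `a`, which never changes state; if `a` were a state-changing expression, the same query would return `h`.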

ESP Results
• FILE * output in GCC
  – 140 KLOC, 2149 functions, 66 files, 1068 globals
• VFG queries take 200 seconds
• Reduce average number of locations per function summary from 1100 to <1
  – Median of 15 for functions with >0
• Verification takes 15 minutes
  – Infeasible otherwise

Application 2: SLAM
• Validates temporal safety properties
  – Boolean abstraction
  – Interprocedural dataflow analysis
  – Counterexample-driven refinement
• Convert C program to Boolean program
• Exhaustive dataflow analysis
  – No errors? Program is safe.
  – Real error? Program has a bug.
  – False error? Add predicates, repeat.

Boolean Programs
Predicates (important!): p means “x == 5”, q means “x < y”

    C Program          Boolean Program
    int x, y;          bool p, q;
    x = 5;             p = 1;
    y = 6;             q = 1;
    x = x * 2;         p = 0;
    y = y * 2;         q = 1;
    assert(x<y)        assert(q)

SLAM Predicates
• Hard to come up with good predicates
• Counterexample-driven refinement
  – Picks good predicates
  – Is very slow
• Taking all possible predicates
  – Is even slower
• Want “all the useful” predicates

Speeding Up SLAM
• For a simple subset of C
  – Similar to “Copy Constants”
  – Use VFG to find a sufficient set of predicates
  – Provably sufficient for this subset
• If this set fails to prove the real program
  – Fall back on counterexample-driven refinement

A Simple Language
    s ::= vi = n                 // constants
        | vi = vj                // variable copy
        | if (*) s1 else s2      // condition ignored
        | vi = fun(vj, …)        // function call
        | return(vi)             // function return
        | assert(vi » vj)        // safety property

Predicate Discovery
• High-level idea
  – Each flow edge in the VFG means “values may flow from X to Y”
  – Add predicates to see if they do
• For each assert(vi » vj)
  – Consider the chain of values flowing to vi, vj
  – Add an equality predicate for each link
  – Use constants to resolve scoping

SLAM Example
    int sel(int f) {
        int r;
        if (*) r = f;
        else r = 3;
        return(r);
    }
    void main() {
        int a, b, c;
        a = 1;
        b = sel(a);
        if (*) c = 2;
        else c = 4;
        assert(b > c);
    }
[Figure: value flow graph for the program. Values flow 1 → a → f → r → b and 3 → r; 2 → c and 4 → c.]

Predicates For “b”
(Same sel()/main() program as in the SLAM example; the chain of values flowing to b is 1 → a → f → r → b, plus 3 → r.)

Step 1. One equality predicate per link in the chain:
    b == r
    r == 3
    r == f
    f == a
    a == 1

Step 2. Check scoping:
    f == a   // no scope! f is a parameter of sel(), a is a local of main()

Step 3. Use constants to resolve the scoping problem:
    b == r
    r == 3
    r == f
    f == 1
    f == 3
    a == 1
    a == 3
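The discovery steps for b can be sketched as a backward walk over flow edges. Two things here are assumptions rather than statements from the slides: the per-edge `cross_scope` flag is supplied by hand (only the f/a link is flagged, as on the slide), and the rule “compare both ends of an out-of-scope link with every constant in the chain” is inferred from the final predicate list. `FLOW_EDGES` and `discover_predicates` are hypothetical names.

```python
from collections import defaultdict

# (source, destination, cross_scope) flow edges for the sel()/main() example.
FLOW_EDGES = [
    ("r", "b", False),  # b = sel(a), which returns r
    ("3", "r", False),  # r = 3
    ("f", "r", False),  # r = f
    ("a", "f", True),   # argument passing: f and a are never in scope together
    ("1", "a", False),  # a = 1
    ("2", "c", False),  # c = 2
    ("4", "c", False),  # c = 4
]


def discover_predicates(edges, var):
    """Equality predicates along the chain of values flowing into `var`."""
    flow_in = defaultdict(list)
    for src, dst, cross_scope in edges:
        flow_in[dst].append((src, cross_scope))

    # Backward reachability from the asserted variable, recording each link.
    seen, stack, links = {var}, [var], []
    while stack:
        dst = stack.pop()
        for src, cross_scope in flow_in[dst]:
            links.append((dst, src, cross_scope))
            if src not in seen:
                seen.add(src)
                stack.append(src)

    constants = sorted(n for n in seen if n.isdigit())
    predicates = set()
    for dst, src, cross_scope in links:
        if cross_scope:
            # No single scope holds both names: relate each to the constants.
            for k in constants:
                predicates.add(f"{dst} == {k}")
                predicates.add(f"{src} == {k}")
        else:
            predicates.add(f"{dst} == {src}")
    return predicates


print(sorted(discover_predicates(FLOW_EDGES, "b")))
print(sorted(discover_predicates(FLOW_EDGES, "c")))  # ['c == 2', 'c == 4']
```

For b this yields the seven predicates of the final list (b == r, r == 3, r == f, f == 1, f == 3, a == 1, a == 3); the out-of-scope f == a is never emitted.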

Why does this work?
• Simple language
  – No arithmetic, etc.
  – Just copying around initial values
• Knowing final values of variables
  – Completely decides safety condition
• Still related to real life
  – Cannot do arithmetic on locks, FILE *s, device driver status codes, etc.

Some SLAM Results

    Program    LOC    Original   Improved   Generated    Missing
                      Runtime    Runtime    Predicates   Predicates
    apmbatt    2207   229        22         85           0
    pnpmem     3849   1132       125        143          4
    floppy     7562   1063       600        154          33
    iscsiprt   4543   **         729        146          42

Generated predicates are between all and two-thirds of the necessary predicates. However, since SLAM must iterate once to generate 3-7 missing predicates, the net performance increase is more than linear. Predicates can be specialized or simplified if the assert() condition is a common relational operator (e.g., x==y, x<y, x==5).

Conclusions
• Complex interprocedural analyses can benefit from inexpensive value-flow
• VFG encodes value flow
  – Constructed and queried quickly
• Prune the set of dataflow facts and program points considered
• Large net performance increase