Mining Specifications
(lots of) code specifications
Glenn Ammons, Univ. of Wisconsin
Ras Bodík, Univ. of Wisconsin
Jim Larus, Microsoft Research

Verification: beyond engine-less cars
Recent successes:
✓ specification languages
✓ checkers
✓ abstractors
What's still missing?
? specifications
Drivers wanted.

So who formulates specifications?
Programmers? Probably not.
Why they won't:
• too busy; yet another language to learn?
• specifications aren't cool.
Why they shouldn't:
• may misunderstand usage rules.
• may not know all usage rules.
Mining specifications:
• Convenience.
• As in data mining, discover surprising rules.

Advantages of mining
Exploits the massive programmer effort reflected in the code.
• Programmers resolved many problems:
  • incomplete system requirements.
  • incomplete API documentation.
  • implementation-dependent rules.
• Want redundancy? (without redundant programming)
  • ask multiple programmers (and vote).

Our output: a specification
x = socket()   bind(x)   listen(x)   y = accept(x)   read(y)   write(y)   close(x)

How do we mine?
Underlying premise: Even bad software is debugged enough to show hints of correct behavior.
Maxim: Common usage is the correct usage.

Mining = machine learning
Reduce the problem to the well-known problem of learning regular languages.
Obstacles:
1. bugs in the source code may be learned into the specification
2. what is "common" behavior?
Solutions:
1. learn from dynamic behavior
2. learn probabilistically
In short: learn probabilistic FSMs from traces.
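A minimal sketch of "learning probabilistically" from traces. It is not the learner used in the talk: it swaps in a simple first-order frequency model (how often one call follows another) to show how common and rare behavior get different weights. The function name, the `<start>`/`<end>` markers, and the example traces are all hypothetical.

```python
from collections import Counter, defaultdict


def learn_transition_probs(traces):
    """Estimate P(next event | current event) from traces of event names."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Hypothetical markers so that first and last calls are also counted.
        events = ["<start>"] + list(trace) + ["<end>"]
        for cur, nxt in zip(events, events[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for cur, nexts in counts.items()
    }


# Hypothetical training traces: two well-behaved runs and one that skips bind().
traces = [
    ["socket", "bind", "listen", "accept", "read", "write", "close"],
    ["socket", "bind", "listen", "accept", "read", "write", "close"],
    ["socket", "listen", "accept", "read", "write", "close"],
]
probs = learn_transition_probs(traces)
```

Here the transition from `socket` to `bind` gets the higher weight and the one from `socket` to `listen` the lower, which is the kind of signal a probabilistic learner can use to separate common usage from a likely bug.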

Input: trace(s)
7 = socket(2, 1, 0);
bind(7, 0x400120, 16);
listen(7, 5);
8 = accept(7, 0x400200, 0x400240);
read(8, 0x400320, 255);
write(8, 0x400320, 12);
read(8, 0x400320, 255);
write(8, 0x400320, 7);
close(8);
10 = accept(7, 0x400200, 0x400240);
read(10, 0x400320, 255);
write(10, 0x400320, 13);
close(10);
close(7);
…
x = socket()   bind(x)   listen(x)   y = accept(x)   read(y)   write(y)   close(x)
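As a rough illustration of how a raw trace like this could be turned into structured events before any abstraction, here is a small parsing sketch. The trace syntax (`ret = name(args);`) is taken from the slide; the regular expression and the helper names `parse_call` and `parse_trace` are assumptions made for this example.

```python
import re

# Optional "ret = " prefix, then a call name, then a parenthesized argument list.
CALL_RE = re.compile(r"^(?:(\w+)\s*=\s*)?(\w+)\((.*)\)$")


def parse_call(text):
    """Parse one call such as '8 = accept(7, 0x400200, 0x400240)'."""
    match = CALL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a call: {text!r}")
    ret, name, arg_text = match.groups()
    args = [a.strip() for a in arg_text.split(",") if a.strip()]
    return ret, name, args


def parse_trace(raw):
    """Split a ';'-separated trace into (return value, name, args) tuples."""
    return [parse_call(chunk) for chunk in raw.split(";") if chunk.strip()]


raw = (
    "7 = socket(2, 1, 0); bind(7, 0x400120, 16); listen(7, 5); "
    "8 = accept(7, 0x400200, 0x400240); read(8, 0x400320, 255); "
    "close(8); close(7);"
)
calls = parse_trace(raw)
names = [name for _, name, _ in calls]
```

Values are kept as strings so that a later step can compare them for equality (for example, the `7` returned by `socket` and the `7` passed to `bind`) without interpreting them.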

The mining algorithm
1. dynamic execution (traces)
2. trace abstraction, producing usage scenarios (strings)
3. (off-the-shelf) RegExp learner, producing generalized scenarios (probabilistic NFA)
4. extract heavy core (and approve), producing the specification (NFA)
5. dynamic checker: takes the specification (NFA) and a dynamic execution to be checked (trace), reports OK/bug
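The last stage, checking a trace against an NFA specification, can be sketched as a plain NFA simulation. This is only an illustration under stated assumptions: the state numbering and the transitions in `SPEC` are hypothetical (the slides give the edge labels of the socket specification but not its exact shape), and events are reduced to bare call names.

```python
def accepts(transitions, start, accepting, events):
    """Simulate an NFA: track the set of states reachable after each event."""
    current = {start}
    for event in events:
        current = {
            dst
            for state in current
            for dst in transitions.get((state, event), ())
        }
        if not current:
            return False  # no state can consume this event
    return bool(current & accepting)


# Hypothetical specification automaton for the socket example.
SPEC = {
    (0, "socket"): {1},
    (1, "bind"): {2},
    (2, "listen"): {3},
    (3, "accept"): {4},
    (4, "accept"): {4},
    (4, "read"): {4},
    (4, "write"): {4},
    (3, "close"): {5},
    (4, "close"): {5},
}


def check(events):
    """Report 'OK' if the trace is accepted by SPEC, otherwise 'bug'."""
    return "OK" if accepts(SPEC, 0, {5}, events) else "bug"


good = ["socket", "bind", "listen", "accept", "read", "write", "close"]
bad = ["socket", "listen", "accept", "read", "close"]
results = (check(good), check(bad))
```

Tracking a set of states rather than a single state is what makes this work for a nondeterministic specification, at the cost of a little extra bookkeeping per event.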

Trace abstraction: 4 challenges
• Traces interleave useful and useless events.
  • RegExp learner cannot separate them.
• Specifications must include both temporal and value-flow constraints.
  • RegExp learner is only good with temporal constraints.
• Only some of the API calls' arguments impose "true" dependences.
  • Infeasible to learn value-flow constraints on all arguments.
• Specifications may impose only a partial order.
  • Encoding all legal partial orders would produce a huge FSM.

Trace abstraction
h(3, 5) c(10) a(4, 5) d(4, 7) b(0, 5) f(10) h(8, 11) e(7) f(50) d(15, 1) c(7) a(9, 11) b(6, 7) d(9, 14) f(20) e(7) …
h(_, 5) c(10) a(4, 5) d(4, 7) b(_, 5) f(10) h(_, 11) e(7) f(_) d(_, _) c(7) a(9, 11) b(_, 11) d(9, _) e(_) f(_) …
h(_, ) h(_, X) a( , ) d( , ) b(_, ) a(Y, X) d(Y, Z) b(_, X) a(Y, X) b(_, X) d(Y, Z) e(Z) h(_, X) a(Y, X) b(_, X) d(Y, Z) e(Z) h(_, X) a(Y, X) b(_, X) d(Y, Z)
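One step suggested by this slide is replacing concrete argument values with variables, so that scenarios differing only in their concrete values become the same string. The sketch below is an assumption about how that renaming could be done, not the authors' implementation: values are named in order of first appearance (`V0`, `V1`, ... instead of the slide's X, Y, Z), `_` marks an already-masked argument, and the two example scenarios are hypothetical.

```python
def standardize(scenario):
    """Rename concrete values to variables in order of first appearance."""
    names = {}
    result = []
    for call, args in scenario:
        new_args = []
        for value in args:
            if value == "_":
                new_args.append("_")  # masked argument: carries no dependence
            else:
                if value not in names:
                    names[value] = f"V{len(names)}"
                new_args.append(names[value])
        result.append((call, tuple(new_args)))
    return result


# Two hypothetical scenarios with the same value flow but different values.
s1 = [("h", ("_", 5)), ("a", (4, 5)), ("b", ("_", 5)), ("d", (4, 7)), ("e", (7,))]
s2 = [("h", ("_", 11)), ("a", (9, 11)), ("b", ("_", 11)), ("d", (9, 14)), ("e", (14,))]

same_shape = standardize(s1) == standardize(s2)
```

After renaming, both scenarios read h(_, V0) a(V1, V0) b(_, V0) d(V1, V2) e(V2), so a learner that only understands strings still sees the value-flow constraints encoded in the symbol names.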

Preliminary experiments
Attempted to learn and verify two published X Windows rules.
As of Friday:
1. A timestamp-passing rule
  • learned the rule! (compact: 6 states)
  • bugs in 2 out of 17 programs (ups, e93)
2. SetOwner(x) must be followed by GetSelection(x)
  • failed to learn the rule (small learning set)
  • but bugs in 2 out of 5 programs (xemacs, ups)

Related work
Arithmetic pre/post conditions
• Daikon, Houdini
• properties orthogonal to ours
• eventually, we may need to include and learn some arithmetic relationships
Temporal relationships over calls
• intrusion detection: [Ghosh et al], [Wagner and Dean]
• software processes: [Cook and Wolf]
• error checking: [Engler et al SOSP 2001]
  • lexical and syntactic pattern matching
  • user must write templates (e.g., <a> always follows <b>)

Ongoing work
• Mechanize tool.
• Find more gold.

Future work
Diagram labels: code, inputs; Mining; specifications; ESP, Vault, SPIN, Verisoft, SLAM, …; bugs; ?
Give gold to jewelers.

Summary
• Semi-automatically creating well-formed, non-trivial specifications is an important part of the verification tool chain.
• Contributions:
  • introduced specification mining
  • phrased it as probabilistic learning from dynamic traces
  • decomposed it into a sequence of subproblems (using an off-the-shelf learner)
  • developed a dynamic checker
  • found bugs

Discussion
Expressibility
• what classes of properties can/should we learn?
• can we learn more than we can check?
• can a single-threaded specification avoid race conditions?

Backup Slides