Mining Specifications
(lots of) code → specifications
Glenn Ammons, Univ. of Wisconsin
Ras Bodík, Univ. of Wisconsin
Jim Larus, Microsoft Research

Verification: beyond engine-less cars
Recent successes:
✓ specification languages
✓ checkers
✓ abstractors
What’s still missing?
? specifications
Drivers wanted.

So who formulates specifications?
Programmers? Probably not.
Why they won’t:
• too busy; yet another language to learn?
• specifications aren’t cool.
Why they shouldn’t:
• may misunderstand usage rules.
• may not know all usage rules.
Mining specifications:
• Convenience.
• As in data mining, discover surprising rules.

Advantages of mining
Exploits the massive programmer effort reflected in the code.
• Programmers have already resolved many problems:
  • incomplete system requirements.
  • incomplete API documentation.
  • implementation-dependent rules.
• Want redundancy (without redundant programming)?
  • ask multiple programmers (and vote).

Our output: a specification
State-machine diagram; transition labels:
x = socket()
bind(x)
listen(x)
y = accept(x)
read(y)
write(y)
close(x)
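The labels above come from a state-machine diagram whose edges did not survive extraction. As a minimal sketch, assuming one plausible reading of that diagram, such a specification can be written as a transition table and used to accept or reject an abstracted event sequence. The state names, the edge structure, and the `close(y)` event (taken from the trace on the input slide, not from this one) are assumptions made for illustration.

```python
# Hypothetical transition table: state -> {event label: next state}.
# The edges are an assumed reading of the slide's diagram.
SPEC = {
    "start":     {"x = socket()": "created"},
    "created":   {"bind(x)": "bound"},
    "bound":     {"listen(x)": "listening"},
    "listening": {"y = accept(x)": "connected", "close(x)": "done"},
    "connected": {
        "read(y)": "connected",
        "write(y)": "connected",
        "close(y)": "listening",  # assumed: closing y allows another accept
    },
    "done":      {},
}
ACCEPTING = {"done"}


def accepts(events):
    """Return True if the event sequence is a complete, legal scenario."""
    state = "start"
    for event in events:
        next_state = SPEC[state].get(event)
        if next_state is None:
            return False  # no transition for this event: a violation
        state = next_state
    return state in ACCEPTING
```

A sequence that skips `bind(x)` is rejected, and so is one that never reaches `close(x)`.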

How do we mine?
Underlying premise: Even bad software is debugged enough to show hints of correct behavior.
Maxim: Common usage is the correct usage.

Mining = machine learning
Reduce the problem to the well-known problem of learning regular languages.
Obstacles:
1. bugs in the source code may be learned into the specification
2. what is “common” behavior?
Solutions:
1. learn from dynamic behavior
2. learn probabilistically
Together: learn probabilistic FSMs from traces.
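To make "learn probabilistically" concrete, here is a minimal sketch of the simplest possible stand-in for a learner: a prefix tree over scenario strings, with each edge weighted by how often it was traversed. It is not the learner used in this work and it does no state merging or generalization; the function names and the `<end>` marker are illustrative assumptions.

```python
from collections import defaultdict


def build_prefix_tree(scenarios):
    """Count how often each (state, event) edge is taken.

    A state is the tuple of events seen so far; "<end>" marks termination.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for scenario in scenarios:
        state = ()
        for event in scenario:
            counts[state][event] += 1
            state = state + (event,)
        counts[state]["<end>"] += 1
    return counts


def edge_probabilities(counts):
    """Normalize the outgoing edge counts of every state to probabilities."""
    probs = {}
    for state, outgoing in counts.items():
        total = sum(outgoing.values())
        probs[state] = {event: n / total for event, n in outgoing.items()}
    return probs
```

With nine scenarios `open read close` and one `open close`, the edge for `read` after `open` gets probability 0.9 and the rare `close` edge gets 0.1, which is the kind of signal that separates common usage from a likely bug.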

Input: trace(s)
7 = socket(2, 1, 0);
bind(7, 0x400120, 16);
listen(7, 5);
8 = accept(7, 0x400200, 0x400240);
read(8, 0x400320, 255);
write(8, 0x400320, 12);
read(8, 0x400320, 255);
write(8, 0x400320, 7);
close(8);
10 = accept(7, 0x400200, 0x400240);
read(10, 0x400320, 255);
write(10, 0x400320, 13);
close(10);
close(7);
… …
x = socket()  bind(x)  listen(x)  y = accept(x)  read(y)  write(y)  close(x)
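A trace in the textual form shown above has to be turned into structured events before anything can be learned from it. The following is a minimal sketch of such a parser, assuming one call per `;`-terminated statement with an optional `result =` prefix; the regular expression and the function name are illustrative and not part of the original tool.

```python
import re

# Optional "result = ", then a call name, then a parenthesized argument list.
CALL = re.compile(r"(?:(\w+)\s*=\s*)?(\w+)\((.*?)\)\s*;")


def parse_trace(text):
    """Return a list of (result, name, args) tuples, all as strings.

    `result` is None for calls whose return value is not recorded.
    """
    events = []
    for match in CALL.finditer(text):
        result, name, raw_args = match.groups()
        args = [a.strip() for a in raw_args.split(",") if a.strip()]
        events.append((result, name, args))
    return events
```

For example, `parse_trace("7 = socket(2, 1, 0); bind(7, 0x400120, 16);")` yields one event with result `"7"` and one with result `None`, which makes it easy to see that the value returned by `socket` flows into `bind`.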

The mining algorithm
Mining:
dynamic execution (traces)
→ trace abstraction
→ usage scenarios (strings)
→ (off-the-shelf) Reg. Exp learner
→ generalized scenarios (probabilistic NFA)
→ extract heavy core (and approve)
→ specification (NFA)
Checking:
dynamic exe. to be checked (trace) + specification (NFA)
→ dynamic checker
→ OK/bug
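The slide does not define "heavy core". One plausible reading is the frequently traversed part of the probabilistic automaton. The sketch below, under that assumption, drops edges whose traversal count falls below a threshold and then discards anything no longer reachable from the start state. The edge representation, the `min_count` parameter, and the function name are all hypothetical.

```python
def heavy_core(edges, start, min_count):
    """Keep frequently used edges that remain reachable from `start`.

    `edges` maps (source, event, destination) to a traversal count.
    """
    kept = {edge: n for edge, n in edges.items() if n >= min_count}

    # Simple worklist reachability over the surviving edges.
    reachable, frontier = {start}, [start]
    while frontier:
        state = frontier.pop()
        for (src, _event, dst) in kept:
            if src == state and dst not in reachable:
                reachable.add(dst)
                frontier.append(dst)

    return {edge: n for edge, n in kept.items() if edge[0] in reachable}
```

The "(and approve)" step on the slide suggests that a human still inspects the result; a threshold like this only proposes a candidate.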

Trace abstraction: 4 challenges
• Traces interleave useful and useless events.
  • Reg. Exp learner cannot separate them.
• Specifications must include both temporal and value-flow constraints.
  • Reg. Exp learner is only good with temporal constraints.
• Only some of the API calls’ arguments impose “true” dependences.
  • Infeasible to learn value-flow constraints on all arguments.
• Specifications may impose only a partial order.
  • Encoding all legal partial orders would produce a huge FSM.

Trace abstraction
Raw trace:
h(3, 5) c(10) a(4, 5) d(4, 7) b(0, 5) f(10) h(8, 11) e(7) f(50) d(15, 1) c(7) a(9, 11) b(6, 7) d(9, 14) f(20) e(7) …
Trace with some arguments blanked out:
h(_, 5) c(10) a(4, 5) d(4, 7) b(_, 5) f(10) h(_, 11) e(7) f(_) d(_, _) c(7) a(9, 11) b(_, 11) d(9, _) e(_) f(_) …
Extracted scenario, with values replaced by variables (animation steps omitted):
h(_, X) a(Y, X) b(_, X) d(Y, Z) e(Z)
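The last step on this slide replaces concrete values with variables so that scenarios differing only in their values become the same string. A minimal sketch of that renaming follows, assuming the scenario's events have already been selected and its irrelevant arguments blanked to `"_"`. Variables are named `V0`, `V1`, … in order of first appearance rather than X, Y, Z; the function name is hypothetical.

```python
def canonicalize(scenario):
    """Rename concrete values to variables by order of first appearance.

    `scenario` is a list of (call name, argument list); "_" marks an
    argument that carries no value-flow constraint and is left as-is.
    """
    names = {}
    result = []
    for call, args in scenario:
        renamed = []
        for arg in args:
            if arg == "_":
                renamed.append("_")
            else:
                if arg not in names:
                    names[arg] = f"V{len(names)}"
                renamed.append(names[arg])
        result.append(f"{call}({', '.join(renamed)})")
    return result
```

Two scenarios with the same shape but different values, such as one built around 5, 4, 7 and another around 11, 9, 14, map to the same list of strings, which is what lets a string learner treat them as repeated evidence for one rule.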

Preliminary experiments
Attempted to learn and verify two published X Windows rules. As of Friday:
1. A timestamp-passing rule
  • learned the rule! (compact: 6 states)
  • bugs in 2 out of 17 programs (ups, e93)
2. SetOwner(x) must be followed by GetSelection(x)
  • failed to learn the rule (small learning set)
  • but bugs in 2 out of 5 programs (xemacs, ups)

Related work
Arithmetic pre/post conditions
• Daikon, Houdini
• properties orthogonal to ours
• eventually, we may need to include and learn some arithmetic relationships
Temporal relationships over calls
• intrusion detection: [Ghosh et al], [Wagner and Dean]
• software processes: [Cook and Wolf]
• error checking: [Engler et al SOSP 2001]
  • lexical and syntactic pattern matching
  • user must write templates (e.g., <a> always follows <b>)

Ongoing work
Mechanize tool.
Find more gold.

Future work
Diagram labels: code, inputs; Mining; specifications; ESP, Vault, SPIN, Verisoft, SLAM, …; bugs; ?
Give gold to jewelers.

Summary
• Semi-automatically creating well-formed, non-trivial specifications is an important part of the verification tool chain.
• Contributions:
  • introduced specification mining
  • phrased it as probabilistic learning from dynamic traces
  • decomposed it into a sequence of subproblems (using an off-the-shelf learner)
  • developed a dynamic checker
  • found bugs

Discussion
Expressibility
• what classes of properties can/should we learn?
• can we learn more than we can check?
• can a single-threaded specification avoid race conditions?

Backup Slides