
1 Introducing bias in automata inference
Introducing Domain and Typing Bias in Automata Inference
• François Coste, INRIA/IRISA (France)
• Daniel Fredouille (speaker), Robert Gordon University (UK)
• Christopher Kermorvant, Université de Montréal (Canada)
• Colin de la Higuera, EURISE, Université Jean Monnet (France)
F. Coste, D. Fredouille, C. Kermorvant, C. de la Higuera, ICGI, October 2004.

2 Automata Inference: Search Space
• The search space is ordered by generalization, from the MCA (maximal canonical automaton) up to the UA (universal automaton).
• Example: S+ = {baaa, bba}, with L(MCA) = S+.
• Pruning with counter-examples.

3 Where to introduce bias in the state merging framework?
A ← MCA
while merge_choice(A, q1, q2) do
    A' ← merge(A, q1, q2)
    if compatible(A') then
        A ← A'
    endif
endwhile
return A
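
The loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the automaton encoding, the function names `build_mca`, `accepts` and `infer`, the fixed merge order, and the counter-example set `S_MINUS` are all assumptions made for the example. Here `compatible` is reduced to "the merged automaton rejects every counter-example", which is the simplest place where a bias can be plugged in.

```python
from itertools import combinations

def build_mca(positives):
    """Build the MCA: one linear branch per positive string (an NFA)."""
    transitions, initials, finals = set(), set(), set()
    n = 0  # next unused state id
    for word in positives:
        state, n = n, n + 1
        initials.add(state)
        for symbol in word:
            transitions.add((state, symbol, n))
            state, n = n, n + 1
        finals.add(state)
    return n, initials, finals, transitions

def accepts(mca, block_of, word):
    """Run `word` on the quotient automaton defined by the partition block_of."""
    _, initials, finals, transitions = mca
    current = {block_of[q] for q in initials}
    for symbol in word:
        current = {block_of[t] for (s, a, t) in transitions
                   if a == symbol and block_of[s] in current}
    return any(block_of[q] in current for q in finals)

def infer(positives, negatives):
    """State merging: keep a merge only if the result stays compatible."""
    mca = build_mca(positives)
    block_of = {q: q for q in range(mca[0])}      # A <- MCA
    merged = True
    while merged:
        merged = False
        blocks = sorted(set(block_of.values()))
        for b1, b2 in combinations(blocks, 2):    # merge_choice (naive order)
            candidate = {q: (b1 if b == b2 else b)
                         for q, b in block_of.items()}   # merge
            if not any(accepts(mca, candidate, w) for w in negatives):
                block_of = candidate              # compatible: A <- A'
                merged = True
                break
    return mca, block_of

S_PLUS = ["baaa", "bba"]     # positive sample used on the slides
S_MINUS = ["", "a", "b"]     # hypothetical counter-examples
mca, block_of = infer(S_PLUS, S_MINUS)
```

Replacing the finite test inside `infer` by a test against an automaton describing forbidden strings is what turns this loop into a biased inference.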

4 Syntactic and semantic bias

5 Language Bias
Background knowledge on the syntax of strings.
Figure labels: set of all strings (Σ*), inferred language, L-, Lg (= Σ* − L-).
• (Infinite) set of counter-examples (L-)
• Domain (Lg)
• More general language (Lg)

6 Language Bias: Formalisation
• The set L- is given by an automaton: L- = L(A-) (complementation is needed if Lg is given instead).
• Diagram: A- and the strings are the inputs of automata inference, which outputs an automaton A.
• The algorithm ensures: L(A-) ∩ L(A) = Ø
• Examples:
  • Lg: correct boolean expressions
  • L-: forbidden patterns, e.g. ‘¬)’ should not appear in a correct boolean expression
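
The condition L(A-) ∩ L(A) = Ø can be checked by exploring the product of the two automata, as in the sketch below. It is only an illustration under assumptions: the dictionary encoding of NFAs, the helper names `forbidden_factor` and `single_word`, and the small alphabet are invented for the example, and the paper's own algorithm works incrementally rather than recomputing the product from scratch.

```python
from collections import deque

def intersection_is_empty(a, b):
    """True iff no string is accepted by both NFAs (product reachability)."""
    start = {(p, q) for p in a["initials"] for q in b["initials"]}
    seen, queue = set(start), deque(start)
    while queue:
        p, q = queue.popleft()
        if p in a["finals"] and q in b["finals"]:
            return False                      # a common accepted string exists
        for (s1, x, t1) in a["transitions"]:
            if s1 != p:
                continue
            for (s2, y, t2) in b["transitions"]:
                if s2 == q and x == y and (t1, t2) not in seen:
                    seen.add((t1, t2))
                    queue.append((t1, t2))
    return True

def forbidden_factor(alphabet, factor):
    """NFA for all strings over `alphabet` that contain `factor` (plays A-)."""
    k = len(factor)
    transitions = {(0, s, 0) for s in alphabet} | {(k, s, k) for s in alphabet}
    transitions |= {(i, factor[i], i + 1) for i in range(k)}
    return {"initials": {0}, "finals": {k}, "transitions": transitions}

def single_word(word):
    """NFA accepting exactly `word` (stands in for an inferred automaton A)."""
    transitions = {(i, c, i + 1) for i, c in enumerate(word)}
    return {"initials": {0}, "finals": {len(word)}, "transitions": transitions}

NOT = "\u00ac"                                # the negation sign
ALPHABET = {NOT, "(", ")", "x"}               # hypothetical toy alphabet
a_minus = forbidden_factor(ALPHABET, NOT + ")")

ok = intersection_is_empty(a_minus, single_word("(" + NOT + "x)"))
bad = intersection_is_empty(a_minus, single_word("(" + NOT + ")"))
```

In a state merging loop, a candidate merge would be rejected whenever this test returns False for the merged automaton.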

7 Typing Bias
Background knowledge on the semantics of strings, represented in the “shape” of the target automaton.
Example sequences:
...CSKPGVIFLTKRSRQVRQC...
...FLTKVIRCSKPSRQVCGFL...
...GVKPIFLTKRSRQVCCSKP...
...FCSKGVIGVIPLTKSKSRQ...

8 Typing Bias: Formalisation
Since types may be known for an infinite number of strings, a formalism is needed to express this knowledge [KH 02].
• Typing function: Σ* × Σ* → T
• Example: in acbbcacbabc b abcccbabc, the typed element b is shown with its left context (acbbcacbabc) and its right context (abcccbabc).
• Typing automaton [figure; transition labels: Σ−a, a, a, Σ, b, Σ]

9 Typing bias: Examples
• Prosite motifs: typing protein sequences. The motifs can be transformed into typing automata.
  Motif PDOC00028 (Zinc finger): C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
• Brill tagger: typing strings in natural language. The machine is NOT a typing automaton but can still be partially used (see paper).
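
To illustrate why such motifs are regular, the sketch below translates a Prosite-style pattern into a Python regular expression, which is one step away from an automaton. It is not the transformation into typing automata used in the paper. The function name `prosite_to_regex` is an assumption, only the core of the pattern syntax is handled (plain residues, `x`, `[...]`, `{...}` and repetition counts), and the test sequence is made up rather than a real protein.

```python
import re

# One pattern element: a residue, x, [allowed] or {excluded},
# optionally followed by a repetition count (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(motif):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in motif.replace(" ", "").rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                         # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any amino acid except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(ZINC_FINGER)

# Made-up sequence built to fit the motif, for illustration only.
toy = "GG" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
hit = re.search(regex, toy)
```

The span of `hit` marks the region of the sequence that would receive the motif's type, while positions outside it stay untyped.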

10 Bias: Improvements on the state of the art
• Semantic biases are no longer mixed with syntactic ones (both a conceptual and an algorithmic improvement).
• The proposed formalisation relaxes all the constraints of the previous formalism [KH 02]:
  • Non-determinism allowed:
    • Any kind of language/typing automaton allowed
    • Any kind of inferred automaton allowed
  • Incomplete typing function allowed:
    • Typing is not required to be given on all strings and/or all letters of strings
    • Typing is not required to cover the examples
  • Other: the typing automaton can have an identical type for two different states, …

11 What kind of Background Knowledge?
Diagram: the strings, with their sample annotation, and the formalised BK are the inputs of automata inference, which outputs an automaton.
• Annotation: knowledge on the example set
  - Parenthesizing (for CFG) [Sa 92, SM 03]
  - Typing [GBE 96, KH 02]
• Formalised BK: knowledge on any string

12 Algorithms and experiments

13 Algorithm: overview
• Complexity:
  • O(N1 × N2) per merge
  • The N1 factor is amortized over successive merges
• Particular cases identified:
  • O(1): specialization of a grammar (extension and better understanding of the [KH 02] results)
  • O(merge operator): DFA/UFA with typing on examples
• Idea: look for common acceptances between the BK and the inferred automata

14 Algorithm: Language inconsistency detection
[Figure: automata with transition labels b, Σ−a, a, b, c, a, a, Σ, b, Σ. Legend: “share a common prefix word”]

15 Algorithm: Language inconsistency detection
[Figure: automata with transition labels b, Σ−a, a, b, c, a, a, Σ, b, Σ. Legend: “share a common suffix word”]

16 Algorithm: Typing projection
[Figure: automata with transition labels b, Σ−a, a, b, c, a, a, Σ, b, Σ. Legend: “share both a common prefix and a common suffix word”]
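
One way to read the three legends ("common prefix", "common suffix", "both") is as relations between a state of the inferred automaton and a state of the BK automaton. The sketch below computes them on the product of the two automata and then projects types onto the inferred states. It is an interpretation of the slides, not the paper's algorithm: the NFA encoding, the function names, the toy automata and the type names `T1` and `T2` are all assumptions.

```python
def reachable(start, step):
    """Generic graph search: all pairs reachable from `start` using `step`."""
    seen, stack = set(start), list(start)
    while stack:
        for nxt in step(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def common_prefix_pairs(a, b):
    """Pairs (p, q) such that some word leads to p in a and to q in b."""
    def forward(pair):
        p, q = pair
        return {(t1, t2) for (s1, x, t1) in a["transitions"] if s1 == p
                for (s2, y, t2) in b["transitions"] if s2 == q and x == y}
    return reachable({(p, q) for p in a["initials"] for q in b["initials"]},
                     forward)

def common_suffix_pairs(a, b):
    """Pairs (p, q) such that some word is accepted from p in a and q in b."""
    def backward(pair):
        p, q = pair
        return {(s1, s2) for (s1, x, t1) in a["transitions"] if t1 == p
                for (s2, y, t2) in b["transitions"] if t2 == q and x == y}
    return reachable({(p, q) for p in a["finals"] for q in b["finals"]},
                     backward)

def project_types(a, bk, type_of):
    """Give each state of `a` the types of the BK states it is paired with."""
    both = common_prefix_pairs(a, bk) & common_suffix_pairs(a, bk)
    types = {}
    for p, q in both:
        if q in type_of:
            types.setdefault(p, set()).add(type_of[q])
    return types

# Hypothetical inferred automaton accepting {ab, aa}.
inferred = {"initials": {0}, "finals": {2, 3},
            "transitions": {(0, "a", 1), (1, "b", 2), (1, "a", 3)}}
# Hypothetical typing automaton accepting a*b, with one type per state.
bk = {"initials": {"u"}, "finals": {"v"},
      "transitions": {("u", "a", "u"), ("u", "b", "v")}}
types = project_types(inferred, bk, {"u": "T1", "v": "T2"})
```

In this toy case state 3 receives no type: it is paired with `u` only through a common prefix and with `v` only through a common suffix, never both with the same BK state.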

17 Experimental Results: Artificial Data
• Gowachin-generated automaton and examples
• RPNI algorithm
• 2 dimensions:
  • Increasing |S+ ∪ S-| (x-axis)
  • Increasing “size” of BK (different curves)
• Recognition level on a test set (y-axis)

18 Experimental Results: Typing on real data
• Task: ATIS
• Algorithm: Alergia
• Typing: part-of-speech tags (Brill tagger)
• Evaluation: perplexity and coverage
[Figure: results plot, annotated “Best results”]

19 Background Knowledge can now be introduced in regular GI! (New!)
• Algorithms can express complex BK:
  • They handle syntax and semantics independently.
  • They handle non-determinism and incompleteness.
• Algorithms have been tested on both artificial and real data.
  • They need to be tested on more real-world data.
• Theoretical basis?
  • Amount of knowledge needed to identify the target?
  • What are the links with MAT and similar?
• Extensions:
  • Using these biases in heuristics (handling noise?)
  • Extensions to more powerful representations (CFG...)