Information Extraction with Tree Automata Induction Raymond Kosala

Outline • Introduction: – information extraction (IE) – grammatical inference • • Approach k-testable

IE from unstructured documents Company Name Job Title Requirement 1 Requirement 2 • Extract

Grammatical inference • • • : finite alphabet Regular language L * Given: set

IE with grammatical inference • Mark field ‘x’ as special token • Infer DFA

IE from structured documents • Previous works learn string language • XML or HTML

Learning process < > …. . ………. < > Structured document Tree automata learner

Testing process < > …. . ………. < > Structured document With each text

Why do we need the context? |-- tr |-- td |-- lastupdate (CDATA) |--

Tree automata • Ranked alphabet : finite set of function symbols with arities. E.

Example • Given an automaton M with the following transitions: 1 : c q

Unranked trees • XML/HTML: bib … paper report book paper … • The number

Encoding of unranked trees • There are well-known methods of encoding to binary trees,

k-testable tree languages • Languages in which membership can be checked by just looking

k-testable tree languages (cont. ) • An example t: html head title r 2(t):

k-testable algorithm [Rico-Juan, et al. ] Given: a set of positive examples T, a

g-testable algorithm • Idea: generalize state transitions from forks that are not important for

g-testable algorithm Given: a set of positive examples T, positive integer k and l

g-testable algorithm example • A set of examples T (with k = 2 and

g-testable algorithm example (cont. ) • F’ = {html(*, *), head(*), body(h 1, x,

Experiment • Two benchmark datasets: Internet Address Finder (IAF) and Quote Server (QS). •

Experiment results The results in % are: Dataset: IAF-altname IAF-org QS-date QS-vol Shakespeare Prec

Further work • More generalization while using bigger context is achieved, but sometimes the

Slides: 24

Download presentation

Information Extraction with Tree Automata Induction Raymond Kosala 1, Jan Van den Bussche 2, Maurice Bruynooghe 1, Hendrik Blockeel 1 1 Katholieke Universiteit Leuven, Belgium 2 University of Limburg, Belgium

Outline • Introduction: – information extraction (IE) – grammatical inference • • Approach k-testable and g-testable algorithms Preliminary result Further work 2

IE from unstructured documents Company Name Job Title Requirement 1 Requirement 2 • Extract certain fields of interest from a text • Learner is trained with (positive) examples • Each learner focuses on a single field marked with ‘x’ 3

Grammatical inference • • • : finite alphabet Regular language L * Given: set of examples (pos. or neg. ) Task: infer a DFA compatible with examples Quality criterion: – Exact learning in the limit – PAC – etc. • Large body of work 4

IE with grammatical inference • Mark field ‘x’ as special token • Infer DFA for the language L = {S over ( x)* | the field to be extracted is marked by x} • Only positive examples 5

IE from structured documents • Previous works learn string language • XML or HTML data: tree structured • Natural extension is to learn a tree language a a a b x b “x has a b-brother” b c x a c c b c a c c • Extraction of a field can depend on structural context 6

Learning process < > …. . ………. < > Structured document Tree automata learner Parsed annotated Transformed 7

Testing process < > …. . ………. < > Structured document With each text nodes replaced, run the tree automaton Parsed Output Transformed 8

Page examples 9

Why do we need the context? |-- tr |-- td |-- lastupdate (CDATA) |-- td |-- b |-- 12/4/98 |-- td |-- tr |-- td |-- organization (CDATA) |-- td |-- b |-- ABC |-- td |-- tr |-- td • Not enough to differentiate the fields of interest that depend on the structural context. • Can be chosen automatically. 10

Tree automata • Ranked alphabet : finite set of function symbols with arities. E. g. = {a(2), b(2), c(0)} • Tree ground term over a c a b c a(c, a(b(c, c), b(c, c))) : tree with depth 3 b c c c • Tree automaton: M = ( , Q, , F). • is a set of transitions of the form: v(q 1, …, qn) q Where v , n is the arity of v , qi and q Q 11

Example • Given an automaton M with the following transitions: 1 : c q 0 2 : a(q 0 , q 0) q 0 3 : b(q 0 , q 0) accept • M accepts a tree t t has b-node as the root a a c b c a c a c c c 12

Unranked trees • XML/HTML: bib … paper report book paper … • The number of children is not fixed by the label • Two approaches: 1. Generalize notion of tree automaton to unranked trees. L transition rules: v(e) q , where e is regular expression over Q 2. Encode as ranked tree 13

Encoding of unranked trees • There are well-known methods of encoding to binary trees, we use: – encode(T) = – encodef(v(F 1), F 2) = Where: T : = v(F), v encodef(T) v vright(encodef(F 2)) vleft(encodef(F 1)) v(encodef(F 1), encodef(F 2)) F : = T, F if F 1 = F 2 = if F 1 = , F 2 if F 1 , F 2 = otherwise • Example: a_left a b c d becomes | b_right c a a d 14

k-testable tree languages • Languages in which membership can be checked by just looking at subtrees of length k-1 that appear in the tree. • k-roots: • k-forks: • k-subtrees: 15

k-testable tree languages (cont. ) • An example t: html head title r 2(t): body h 1 head table f 2(t): html head html body head body title h 1 table s 2(t): body h 1 table body head table title h 1 title 16

k-testable algorithm [Rico-Juan, et al. ] Given: a set of positive examples T, a positive integer k Q, FS, = Ø For each t T, – Let R = rk-1(t); F = fk(t); S = sk-1(t); – – Q = Q R rk-1( F ) S ; FS = FS R; v(t 1, …, tm) S : = { m(v(t 1, …, tm)) = v(t 1, …, tm)} v(t 1, …, tm) F : = { m(v(t 1, …, tm)) = rk-1(v(t 1, …, tm))} 17

g-testable algorithm • Idea: generalize state transitions from forks that are not important for the extraction. • Important forks are those that contain ‘x’ and (possibly) the distinguishing context. t: a b gen(t, 1): c d a * e gen(t, 2): b * * a * c * * 18

g-testable algorithm Given: a set of positive examples T, positive integer k and l FS, , Ss, CF, OF’ = Ø For each t T, – – – Let R = rk-1(t); F = fk(t); S = sk-1(t); Ss = Ss S FS = FS R v(t 1, …, tm) S : = { m(v(t 1, …, tm)) = v(t 1, …, tm)} CF = CF {f | f F, f contains x} OF = OF {f | f F, f does not contain x} For each of OF, – of’ = gen(of, l) – If of’ covers one of CF then OF’ = OF’ of else OF’ = OF’ of’ Let F’ = OF’ CF Q = Q F rk-1( F’ ) Ss v(t 1, …, tm) F’ : = { m(v(t 1, …, tm)) = rk-1(v(t 1, …, tm))} 19

g-testable algorithm example • A set of examples T (with k = 2 and l = 1): html head title • html body h 1 table title body h 1 x F = f 2(T): html head body • • x head title body h 1 x table FS = {html} OF = {html(head, body), head(title)} CF = {body(h 1, x), body(h 1, x, table)} OF’ = {html(*, *), head(*)} • R = r 1(T): html • Ss = s 1(T): table , title , h 1, x 20

g-testable algorithm example (cont. ) • F’ = {html(*, *), head(*), body(h 1, x, table)} • Transitions from the trees in the subtrees s 1(T): (table) = table (title) = title (h 1) = h 1 (x) = x Transitions from the trees in the generalized forks F’: – (html(*, *)) = html – (head(*)) = head – (body(h 1, x)) = body – (body(h 1, x, table)) = body Q = {html ; head ; body ; table ; title ; h 1; x} – – • • 21

Experiment • Two benchmark datasets: Internet Address Finder (IAF) and Quote Server (QS). • Comparison with: HMM, Stalker, and BWI. • The highlights of our method: – More expressive. – Doesn’t require: • manual specifications of windows length of the prefix and suffix of the target field (HMM and BWI) • special tokens of the delimiters such as “: ” “>” (Stalker and BWI) • embedded catalog tree (Stalker) • The limitations: – The field that can be extracted limited to whole node – Slower when extracting 22

Experiment results The results in % are: Dataset: IAF-altname IAF-org QS-date QS-vol Shakespeare Prec Rec F 1 Prec Rec F 1 -------------------------------------------------------------HMM 1. 7 90 3. 4 16. 8 89. 7 28. 4 36. 3 100 53. 3 18. 4 96. 2 30. 9 Stalker 100 48. 0 0 0 BWI 90. 9 43. 5 58. 8 77. 5 45. 9 57. 7 100 100 61. 9 76. 5 k-testable 100 73. 9 85 100 57. 9 73. 3 100 60. 5 75. 4 100 73. 6 84. 8 56. 2 90 69. 2 g-testable 100 73. 9 85 100 82. 6 90. 5 100 60. 5 75. 4 100 73. 6 84. 8 69. 2 90 78. 2 Parameters 4 and (5, 2) 2 and (3, 2) 5 and (6, 5) 3 and (4, 2) * F 1 is the harmonic mean of recall and precision * The results of HMM, Stalker and BWI are adopted from [Freitag & Kushmerick] 23

Further work • More generalization while using bigger context is achieved, but sometimes the binarisation makes the context far from the field of interest the generalization cannot go very far • Work on the algorithm that can work directly with unranked trees. 24