Graphical models for structure extraction and information integration

  • Slides: 41

Graphical models for structure extraction and information integration
Sunita Sarawagi, IIT Bombay
http://www.it.iitb.ac.in/~sunita

Information Extraction (IE) & Integration
The Extraction task: Given
– E: a set of structured elements
– S: an unstructured source
extract all instances of E from S.
The Integration task: Also given
– a database of existing inter-linked entities,
resolve which extracted entities already exist, and insert the appropriate links and entities.
• Many versions involving many source types
• Actively researched in varied communities
• Several tools and techniques
• Several commercial applications

IE from free format text
• Classical Named Entity Recognition
– Extract person, location, organization names
Example: "According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms."
• Several applications
– News tracking: monitor events
– Bio-informatics: protein and gene names from publications
– Customer care: part number, problem description from emails in help centers

Text segmentation
House number | Building | Road | City | State | Zip
4089 | Whispering Pines | Nobel Drive | San Diego | CA | 92122

Author | Year | Title | Journal | Volume | Page
P. P. Wangikar, T. P. Graycar, D. A. Estell, D. S. Clark, J. S. Dordick | (1993) | Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media | J. Amer. Chem. Soc. | 115 | 12231-12237

Information Extraction on the web

Personal Information Systems
– Automatically add a bibtex entry of a paper I download
– Integrate a resume in email with the candidates database
(Diagram: people, papers, files, emails, projects, web, resumes linked together)

History of approaches
• Manually-developed set of scripts
– Tedious, lots and lots of special cases
– Needs continuous refinement as new cases arise
– Ad hoc ways of combining a varied set of clues
– Example: wrappers, OK for regular tasks
• Learning-based approaches (lots!)
– Rule-based (Whisk, Rapier, etc.), '80s: brittle
– Statistical
• Generative: HMMs, '90s: intuitive but not too flexible
• Conditional (flexible feature set), '00s

Basic chain model for extraction
t: 1  | 2      | 3  | 4        | 5    | 6       | 7     | 8      | 9
x: My | review | of | Fermat's | last | theorem | by    | S.     | Singh
y: Other | Other | Other | Title | Title | Title | Other | Author | Author
Independent model: each y_t is predicted separately.

Features
• The word as-is
• Orthographic word properties
– Capitalized? Digit? Ends-with-dot?
• Part of speech
– Noun?
• Match in a dictionary
– Appears in a dictionary of people names?
– Appears in a list of stop-words?
• Fire these for each label and
– the token,
– W tokens to the left or right, or
– a concatenation of tokens.
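The templates above can be sketched as a token-level feature function. This is a minimal illustration; the dictionary contents, feature names, and window size are placeholders, not the talk's actual feature set:

```python
def token_features(tokens, i, people_dict=frozenset({"singh", "fagin"})):
    """Named features for token i: the word itself, orthographic
    properties, dictionary matches, and neighbouring-word templates.
    The people_dict contents here are illustrative placeholders."""
    w = tokens[i]
    feats = {
        "word=" + w.lower(): 1,
        "capitalized": int(w[:1].isupper()),
        "digit": int(w.isdigit()),
        "ends_with_dot": int(w.endswith(".")),
        "in_people_dict": int(w.lower().strip(".") in people_dict),
    }
    # the same word template fired on neighbouring tokens (window W = 1)
    for off, tag in [(-1, "prev"), (1, "next")]:
        j = i + off
        if 0 <= j < len(tokens):
            feats[f"{tag}_word=" + tokens[j].lower()] = 1
    return feats
```

In a chain model, each of these features would additionally be conjoined with the candidate label (and label pair) at that position.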

Basic chain model for extraction
t: 1  | 2      | 3  | 4        | 5    | 6       | 7     | 8      | 9
x: My | review | of | Fermat's | last | theorem | by    | S.     | Singh
y: Other | Other | Other | Title | Title | Title | Other | Author | Author
Global conditional model over Pr(y_1, y_2, …, y_9 | x): adjacent labels are linked in a chain.

Outline
Graphical models
Extraction
– Chain models: basic extraction (word-level)
– Associative Markov Networks: collective labeling
– Dynamic CRFs: two labelings (POS, extraction)
– 2-D CRFs: layout-driven extraction (web)
+ Integration
– Segmentation models: match with entity databases
– Constrained models: integrating to multiple tables

Undirected Graphical models
• Joint probability distribution of multiple variables expressed compactly as a graph
• Discrete variables over a finite set of labels; example: {Author, Title, Other}
• For the chain y_1 - y_2 - y_3 - y_4 - y_5:
– y_3 is directly dependent on y_4
– y_3 is independent of y_1 and y_5 given y_2 and y_4

The joint probability distribution
• Factorizes over the cliques of the graph
• Each clique carries a potential function; a normalizing constant makes the product a distribution
(Figure: example graph over y_1 … y_5.)
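Written out, the factorization this slide depicts is (using the five-variable example):

```latex
P(y_1,\dots,y_5) \;=\; \frac{1}{Z}\prod_{c\,\in\,\mathrm{cliques}(G)} \psi_c(\mathbf{y}_c),
\qquad
Z \;=\; \sum_{\mathbf{y}}\ \prod_{c\,\in\,\mathrm{cliques}(G)} \psi_c(\mathbf{y}_c)
```

where each clique c of the graph carries a potential function ψ_c and Z is the normalizing constant.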

Form of potentials: Conditional Random Fields (CRFs)
Lafferty et al., ICML 2001
• Model the probability of a set of labels given the observation x
• Built from the observed variables, model parameters, and numeric features
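In a CRF the potentials take a log-linear form in the observed variables x, the model parameters w, and numeric features f:

```latex
\psi_c(\mathbf{y}_c,\mathbf{x}) \;=\; \exp\Big(\sum_k w_k\, f_k(\mathbf{y}_c,\mathbf{x})\Big),
\qquad
\Pr(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,
\exp\Big(\sum_c\sum_k w_k\, f_k(\mathbf{y}_c,\mathbf{x})\Big)
```

where Z(x) sums the exponentiated score over all label assignments, so the model is conditional on x rather than generative.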

Inference on graphical models
• Probability of an assignment of variables
• Most likely assignment of variables
• Marginal probability of a subset of variables
Computed naively, each involves a sum over exponentially many terms.

Message passing
• Efficient two-pass dynamic programming algorithm for graphs without cycles
– Viterbi is a special case for chains
• Cyclic graphs
– Approximate answer after convergence, or
– Transform cliques to nodes in a junction tree
• Alternatives to message passing
– Exploit structure of potentials to design special algorithms (two examples in this talk)
– Upper bound using one or more trees
– MCMC sampling
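For a chain, the two-pass algorithm specializes to Viterbi. A minimal sketch over log-potentials; the `unary`/`transition` array framing is my own, not from the slides:

```python
def viterbi(n_states, unary, transition):
    """MAP label sequence for a chain model: the special case of
    two-pass message passing when the graph is a chain.

    unary: list of per-position log-potentials, unary[t][s]
    transition: transition[p][s], log-potential of state p followed by s
    Returns the argmax state sequence as a list of state indices.
    """
    T = len(unary)
    delta = [list(unary[0])]          # forward pass: best score ending in s
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for s in range(n_states):
            p = max(range(n_states),
                    key=lambda q: delta[-1][q] + transition[q][s])
            ptr.append(p)
            row.append(delta[-1][p] + transition[p][s] + unary[t][s])
        delta.append(row)
        back.append(ptr)
    # backward pass: trace pointers from the best final state
    s = max(range(n_states), key=lambda q: delta[-1][q])
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return path[::-1]
```

The forward loop plays the role of the first message pass; backtracking the pointers is the second.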

Outline
Graphical models
Extraction
– Chain models: basic extraction (word-level)
– Associative Markov Networks: collective labeling
– Dynamic CRFs: two labelings (POS, extraction)
– 2-D CRFs: layout-driven extraction (web)
+ Integration
– Segmentation models: match with entity databases
– Constrained models: integrating to multiple tables

Long range dependencies
• Extraction with repeated names (Bunescu et al., 2004)

Dependency graph
• Assume only word-level matches.
Example: repeated mentions "nitric oxide synthase", "eNOS", "… with … synthase interaction", "eNOS" are linked so they receive the same label (chain edges over y_1 … y_8 plus long-range match edges).
• Approximate message passing
• Sample results (Bunescu et al., ACL 2004)
– Protein names from Medline abstracts: F1 65% → 68%
– Person names, organization names etc. from news articles: F1 80% → 82%

Associative Markov Networks
• Consider a simpler graph
– Binary labels
– Only associative edges: higher potential when the same label is assigned to both endpoints
• Exact inference in polynomial time via mincut (Greig et al., 1989)
• Multi-class: metric labeling approximation algorithms with guarantees (Kleinberg & Tardos, 1999)
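A sketch of the mincut reduction for the binary associative case: unary costs become edges to a source/sink pair, associative penalties become inter-node edges, and the minimum s-t cut gives the MAP labeling. The max-flow routine is a plain Edmonds-Karp written here only for self-containment; the function and argument names are illustrative:

```python
from collections import defaultdict, deque

def mincut_map(unary, pairwise):
    """MAP labeling of a binary associative MRF via s-t mincut.

    unary: {node: (cost_label0, cost_label1)}
    pairwise: {(i, j): penalty}  # cost paid when i and j disagree
    Returns {node: 0 or 1}.
    """
    S, T = "source", "sink"
    cap = defaultdict(lambda: defaultdict(float))
    for i, (c0, c1) in unary.items():
        cap[S][i] += c1          # cutting s->i assigns label 1, pays c1
        cap[i][T] += c0          # cutting i->t assigns label 0, pays c0
    for (i, j), lam in pairwise.items():
        cap[i][j] += lam         # paid when i, j land on opposite sides
        cap[j][i] += lam

    def bfs_path():
        # shortest augmenting path in the residual graph
        parent = {S: None}
        q = deque([S])
        while q:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    if v == T:
                        return parent
                    q.append(v)
        return None

    while (parent := bfs_path()) is not None:
        path, v = [], T
        while v != S:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck

    # source side of the cut -> label 0, sink side -> label 1
    reachable, q = {S}, deque([S])
    while q:
        u = q.popleft()
        for v, c in cap[u].items():
            if c > 1e-12 and v not in reachable:
                reachable.add(v)
                q.append(v)
    return {i: 0 if i in reachable else 1 for i in unary}
```

For example, a node that strongly prefers label 1 can pull a weakly 0-preferring neighbour to label 1 when the associative penalty outweighs the neighbour's unary preference.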

Factorial CRFs: multiple linked chains
• Several synchronized inter-dependent tasks
– POS, noun phrase, entity extraction
• Cascading propagates errors
• Joint models
Example: the sentence "i saw mr. ray canning at the market" with one POS chain and one IE chain over the same words (w_1 … w_8, labels y_1 … y_8 per chain).

Inference with multiple chains
• Graph has cycles; exact inference most likely intractable
• Two alternatives
– Approximate message passing (combined inference)
– Upper bound on the marginals: piecewise training (staged)
• Treat each edge potential as an independent training instance
• Results (F1) on noun phrase + POS
– Piecewise training: 88%, faster
– Belief propagation: 86%
(Sutton et al., ICML 2004; McCallum et al., EMNLP/HLT 2005)

Outline
Graphical models
Extraction
– Chain models: basic extraction (word-level)
– Associative Markov Networks: collective labeling
– Dynamic CRFs: two labelings (POS, extraction)
– 2-D CRFs: layout-driven extraction (web)
+ Integration
– Segmentation models: match with entity databases
– Constrained models: integrating to multiple tables

Conventional Extraction Research
(Diagram, top: labeled unstructured text → training → model; unstructured text (text 1, text 2, text 3) → model → entities.
Bottom, with data integration: training additionally uses a linked entity database, and extracted entities are integrated with the existing data.)

Goals of integration
• Exploit the database to improve extraction
– Entity might already exist in the database
• Integrate extracted entities: resolve whether each entity is already in the database
– If existing, create links
– If not, create a new entry

R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10]
Top-level entities (example database; normalized, stores noisy variants):
• Articles (Id, Title, Year, Journal): e.g. 2, "Update Semantics", 1983
• Journals (Id, Name, Canonical): 10 "ACM TODS", 17 "AI", 16 "ACM Trans. Databases" (variant of 10)
• Authors (Id, Name, Canonical): 11 "M Y Vardi", 2 "J. Ullman" (variant of 4), 3 "Ron Fagin", 4 "Jeffrey Ullman"
• Writes (Article, Author) links articles to authors
• Variant links point to canonical entries

Segmentation models (Semi-CRFs)
t: 1  | 2     | 3   | 4  | 5        | 6      | 7         | 8
x: R. | Fagin | and | J. | Helpbern | Belief | Awareness | Reasoning
y: Author | Author | Other | Author | Author | Title | Title | Title
Features describe the single word "Fagin".

Segmentation models (Semi-CRFs)
Word-level view:
t: 1  | 2     | 3   | 4  | 5        | 6      | 7         | 8
x: R. | Fagin | and | J. | Helpbern | Belief | Awareness | Reasoning
y: Author | Author | Other | Author | Author | Title | Title | Title
Features describe the single word "Fagin".
Segment-level view:
l_1=1, u_1=2: "R. Fagin" → Author
l_2=u_2=3: "and" → Other
l_3=4, u_3=5: "J. Helpbern" → Author
l_4=6, u_4=8: "Belief Awareness Reasoning" → Title
Features describe the segment from l to u, e.g. similarity to the author column in the database.

Graphical models for segmentation
• Graph has many cycles; clique size = maximum segment length
• Two kinds of potentials
– Transition potentials: only across adjacent nodes
– Segment potentials: require all positions in a segment to have the same label
• Exact inference possible in time linear in the sequence length times the maximum segment length (Cohen & Sarawagi 2004)
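The exact decoding this slide describes can be sketched as a segment-level Viterbi whose inner loop ranges over segment lengths up to the maximum. The `score(l, u, y, y_prev)` interface below stands in for the transition and segment potentials and is my own framing:

```python
def semicrf_decode(n, labels, max_len, score):
    """Viterbi decoding for a semi-CRF over a sequence of n positions.

    labels: list of segment labels
    max_len: maximum segment length (the clique size on this slide)
    score(l, u, y, y_prev): log-potential of segment x[l:u] taking label y,
        given the previous segment's label y_prev (None at the start)
    Returns the best segmentation as a list of (l, u, y) half-open spans.
    """
    NEG = float("-inf")
    # best[j][y]: best score of segmenting x[:j] with last segment label y
    best = [{y: NEG for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    start = {None: 0.0}
    for j in range(1, n + 1):
        for y in labels:
            for d in range(1, min(max_len, j) + 1):
                l = j - d
                prevs = start if l == 0 else best[l]
                for yp, sc in prevs.items():
                    if sc == NEG:
                        continue
                    cand = sc + score(l, j, y, yp)
                    if cand > best[j][y]:
                        best[j][y] = cand
                        back[j][y] = (l, yp)
    # trace back the winning segmentation
    y = max(best[n], key=best[n].get)
    segs, j = [], n
    while j > 0:
        l, yp = back[j][y]
        segs.append((l, j, y))
        j, y = l, yp
    return segs[::-1]
```

The runtime is O(n · |labels|² · max_len), i.e. linear in the sequence length times the maximum segment length, as the slide states.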

Effect of database on extraction performance (F1)

PersonalBib       L      L+DB    %Δ
  author          75.7   79.5     4.9
  journal         33.9   50.3    48.6
  title           61.0   70.3    15.1
Address
  city_name       72.4   76.7     6.0
  state_name      13.9   33.2   138.5
  zipcode         91.6   94.3     3.0

L = only labeled structured data
L + DB = similarity to database entities and other DB features
(from Mansuri et al., ICDE 2006)

R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10]
Extraction:
• Author: R. Fagin
• Author: J. Helpern
• Title: Belief, . . reasoning
• Journal: AI
• Year: 1988
Integration (into the example database):
• New Articles row: Id 7, "Belief, awareness, reasoning", 1988, Journal 17 (AI)
• New Authors rows: 8 "R Fagin" (variant of canonical 3), 9 "J Helpern"
• New Writes rows link article 7 to authors 8 and 9
Match with existing linked entities while respecting all constraints.

CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI
Only extraction:
• Author: R. Fagin; Author: J. Helpern
• Title: Belief, . . reasoning
• Journal: AI; Year: 2000
Combined extraction+integration:
• Author: R. Fagin; Author: J. Helpern
• Title: Belief, . . reasoning in AI
• Journal: CACM; Year: 2000
Database: article 7, "Belief, awareness, reasoning", year 1988, journal 17. Year mismatch! The 2000 citation cannot match the 1988 entry, so "in AI" belongs to the title and CACM is the journal.

Combined extraction + matching
• Convert each predicted label to a pair y = (a, r)
• r = 0 means none-of-the-above, i.e. a new entry
Example segmentation with ids of matching entities:
– "CACM" → Journal, r = 0; "2000" → Year
– "Fagin" → Author, r = 3
– "Belief Awareness Reasoning In AI" → Title, r = 7
Constraints exist on the ids that can be assigned to two segments.

Constrained models
• Training
– Ignore constraints, or use max-margin methods that require only MAP estimates
• Application
– Formulate as a constrained integer programming problem (expensive), or
– Use generic A* search to find the most likely constrained assignment

Full integration performance (F1)

PersonalBib       L      L+DB    %Δ
  author          70.8   74.0     4.5
  journal         29.6   45.5    53.6
  title           51.6   65.0    25.9
Address
  city_name       70.1   74.6     6.4
  state_name       9.0   28.3   213.8
  pincode         87.8   90.7     3.3

• L = conventional extraction + matching
• L + DB = technology presented here
• Much higher accuracies possible with more training data
(from Mansuri et al., ICDE 2006)

What next in data integration?
• Lots to be done in building large-scale, viable data integration systems
• Online collective inference
– Cannot freeze the database
– Cannot batch too many inferences
– Need theoretically sound, practical alternatives to exact, batch inference
• Performance of integration (Chandel et al., ICDE 2006)
• Other operations
– Data standardization
– Schema management

Probabilistic Querying Systems
• Integration systems, while improving, cannot be perfect, particularly for domains like the web
• User supervision of each integration result is impossible
• Create uncertainty-aware storage and querying engines
• Two enablers:
– Probabilistic database querying engines over generic uncertainty models
– Conditional graphical models produce well-calibrated probabilities

Probabilities in CRFs are well-calibrated
(Plots: Cora citations and Cora headers; probability of segmentation vs. probability correct, against the ideal diagonal.)
E.g., of the segmentations assigned probability 0.5, about 50% are correct.
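The calibration check behind these plots can be sketched generically: bin predictions by their reported confidence and compare the mean confidence in each bin with the empirical accuracy. This is a generic reliability computation, not the Cora evaluation code:

```python
def calibration_bins(probs, correct, n_bins=10):
    """Group predictions by confidence and compare mean confidence with
    empirical accuracy; for well-calibrated output the two agree per bin.

    probs: predicted probabilities in [0, 1]
    correct: 1 if the corresponding prediction was right, else 0
    Returns [(mean_confidence, accuracy)] for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        idx = min(int(p * n_bins), n_bins - 1)   # clamp p = 1.0 to last bin
        bins[idx].append((p, c))
    out = []
    for b in bins:
        if b:
            mean_conf = sum(p for p, _ in b) / len(b)
            accuracy = sum(c for _, c in b) / len(b)
            out.append((round(mean_conf, 3), round(accuracy, 3)))
    return out
```

Plotting accuracy against mean confidence per bin reproduces a reliability curve of the kind shown on the slide; calibration means the points lie near the diagonal.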

Uncertainty in integration systems
(Diagram: unstructured text → model → alternative entity sets with probabilities p_1, p_2, …, p_k, stored in a probabilistic database system; very uncertain outputs are routed back as additional training data. Other more compact models?)
Example queries over the uncertain store:
• Select conference name of article RJ 03: "IEEE Intl. Conf. on Data Mining" with probability 0.8, "Conf. on Data Mining" with probability 0.2
• Find most cited author: D Johnson (16000 citations) with probability 0.6, J Ullman (13000) with probability 0.4

In summary
• Data integration provides scope for several interesting learning problems
• Probabilistic graphical models provide a robust, unified mechanism for exploiting a wide variety of clues and dependencies
• Lots of open research challenges remain in making graphical models work in practical settings