Statistical learning models for information extraction and entity resolution

Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita

Team: Rahul Gupta (Ph.D.), Abhishek Agarkar (M.Tech.), Upendra (M.Tech.), Pranav Kashyap (B.Tech.)

Information extraction

  • Formulate task as a statistical learning model
      • Exploiting entity-level features
      • Collective inference
          • New method of modeling the inference problem; interesting inference algorithms
  • Inference algorithms
      • Semi-markov conditional random fields for information extraction. In NIPS, 2004.
      • Efficient inference on sequence segmentation models. In ICML, 2006.
      • Efficient inference with cardinality-based clique potentials. In ICML, 2007.
  • Training algorithms for structured models
      • Better generalizability and more tractable inference. (ICML 08)
      • Domain adaptation: adapt models trained in one domain to new domains

Semi-markov models

Sequence model: one label y_t per token x_t. Features describe the single word "Fagin".

  t   1       2       3      4       5        6       7          8
  x   R.      Fagin   and    J.      Halpern  Belief  Awareness  Reasoning
  y   Author  Author  Other  Author  Author   Title   Title      Title

Segmentation model: one label y per segment (l, u). Features describe the full entity, e.g. similarity to the author's column in the database.

  l, u          x                            y
  l1=1, u1=2    R. Fagin                     Author
  l2=u2=3       and                          Other
  l3=4, u3=5    J. Halpern                   Author
  l4=6, u4=8    Belief Awareness Reasoning   Title
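To make the segmentation model concrete, here is a minimal sketch of Viterbi-style decoding over segments rather than single tokens. It is an illustration only, not the inference code from the cited papers: the function names, the `AUTHOR_DB` set, the `max_len` bound and all score values in `toy_score` are assumptions invented for this example, standing in for learned feature weights.

```python
from typing import Callable, Dict, List, Tuple

START = "<start>"  # hypothetical label for "no previous segment"
Segment = Tuple[int, int, str]  # (l, u, label), 1-based inclusive as on the slide


def semi_markov_viterbi(
    tokens: List[str],
    labels: List[str],
    score: Callable[[List[str], str, str], float],
    max_len: int,
) -> List[Segment]:
    """Return the highest-scoring segmentation with segments of at most max_len tokens."""
    n = len(tokens)
    # best[u][y]: best score of a segmentation of tokens[:u] whose last segment has label y
    best: List[Dict[str, float]] = [dict() for _ in range(n + 1)]
    back: List[Dict[str, Tuple[int, str]]] = [dict() for _ in range(n + 1)]
    best[0][START] = 0.0
    for u in range(1, n + 1):
        for y in labels:
            for l in range(max(0, u - max_len), u):
                for y_prev, prev_score in best[l].items():
                    cand = prev_score + score(tokens[l:u], y, y_prev)
                    if y not in best[u] or cand > best[u][y]:
                        best[u][y] = cand
                        back[u][y] = (l, y_prev)
    # Follow back-pointers from the best final label.
    segments: List[Segment] = []
    u = n
    y = max(best[n], key=best[n].get) if n else START
    while u > 0:
        l, y_prev = back[u][y]
        segments.append((l + 1, u, y))
        u, y = l, y_prev
    return segments[::-1]


# Assumed toy database and hand-set scores; a real model would learn these weights.
AUTHOR_DB = {"R. Fagin", "J. Halpern"}


def toy_score(words: List[str], label: str, prev_label: str) -> float:
    text, length = " ".join(words), len(words)
    if label == "Author":
        # Entity-level feature: the whole segment matches a database entry.
        return 3.0 if text in AUTHOR_DB else -1.0 * length
    if label == "Other":
        return 1.0 if words == ["and"] else -1.0 * length
    # Title: capitalized words without trailing periods, small per-segment penalty.
    looks_like_title = all(w[0].isupper() and not w.endswith(".") for w in words)
    return 0.5 * length - 0.2 if looks_like_title else -1.0 * length


tokens = ["R.", "Fagin", "and", "J.", "Halpern", "Belief", "Awareness", "Reasoning"]
segments = semi_markov_viterbi(tokens, ["Author", "Other", "Title"], toy_score, max_len=3)
for l, u, label in segments:
    print(l, u, " ".join(tokens[l - 1:u]), label)
```

Because the score function sees the whole candidate segment, a feature such as "segment text appears in the author database" can be expressed directly, which a per-token model cannot do. The cost is an extra loop over segment lengths, bounded here by `max_len`.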

Inference in segmentation models

Input: R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998

  • Surface features (cheap)
  • Database lookup features (expensive!)
      • Many large tables, e.g. Authors (Name): M Y Vardi, J. Ullman, Ron Fagin, Claire Cardie, J. Gherke, Thorsten, J Kleinberg, S Chakrabarti, Jay Shan, Jackie Chan, Bill Gates
      • Inverted index
  • Efficient search for top-k most similar entities
      1. Batch up to do better than individual top-k?
      2. Find top segmentation without top-k matches for all segments?
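The database lookup feature can be pictured with a small sketch of top-k similarity search through an inverted index. This is a plain per-query baseline, not the batched or pruned search the slide asks about. The class `NameIndex`, the choice of character trigrams and the Jaccard similarity are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of the lowercased text, padded with '#' at both ends."""
    padded = "#" * (n - 1) + text.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class NameIndex:
    """Hypothetical inverted index from trigram to the names that contain it."""

    def __init__(self, names: Iterable[str]) -> None:
        self.names: List[str] = list(names)
        self.grams: List[Set[str]] = [ngrams(name) for name in self.names]
        self.postings: Dict[str, Set[int]] = defaultdict(set)
        for idx, grams in enumerate(self.grams):
            for gram in grams:
                self.postings[gram].add(idx)

    def top_k(self, query: str, k: int) -> List[Tuple[str, float]]:
        """Return up to k (name, Jaccard similarity) pairs, most similar first."""
        query_grams = ngrams(query)
        # Only names sharing at least one trigram with the query are ever touched.
        shared: Dict[int, int] = defaultdict(int)
        for gram in query_grams:
            for idx in self.postings.get(gram, ()):
                shared[idx] += 1
        scored = [
            (count / (len(query_grams) + len(self.grams[idx]) - count), self.names[idx])
            for idx, count in shared.items()
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(name, sim) for sim, name in scored[:k]]


authors = NameIndex([
    "M Y Vardi", "J. Ullman", "Ron Fagin", "Claire Cardie", "J. Gherke",
    "Thorsten", "J Kleinberg", "S Chakrabarti", "Jay Shan", "Jackie Chan", "Bill Gates",
])
matches = authors.top_k("R. Fagin", k=3)
print(matches)
```

Running one such lookup for every candidate segment of the input is what makes the feature expensive, which motivates the two questions on the slide: batching the lookups, and finding the top segmentation without computing top-k matches for every segment.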

Collective information extraction

  • Y has character.
  • Mr. X lives in Y.
  • X buys Y Times daily.

Figure: label variables y11 y21 y31 y41, y12 y22 y32 y42 y52, y13 y23 y33 y43 y53.

Associative scores: w·f_e(i, i) > w·f_e(i, j)
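A small sketch may help show what an associative score does. It labels the three occurrences of the word Y jointly, adding a bonus whenever two occurrences receive the same label. Everything numeric here is assumed for illustration (the local scores, the `same` and `diff` edge scores, the two-label set), and the brute-force enumeration is only workable for tiny examples; it is not the cardinality-based inference algorithm cited in the outline.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

LABELS = ["Location", "Other"]

# Assumed local (per-sentence) scores for the three occurrences of "Y".
LOCAL: List[Dict[str, float]] = [
    {"Location": 0.4, "Other": 0.6},  # "Y has character."
    {"Location": 2.0, "Other": 0.1},  # "Mr. X lives in Y."
    {"Location": 0.5, "Other": 0.7},  # "X buys Y Times daily."
]


def joint_score(labeling: Sequence[str], local: List[Dict[str, float]],
                same: float, diff: float) -> float:
    """Sum of local scores plus one edge score per pair of occurrences."""
    total = sum(local[i][label] for i, label in enumerate(labeling))
    for i in range(len(labeling)):
        for j in range(i + 1, len(labeling)):
            # Associative: agreeing labels score higher than disagreeing ones.
            total += same if labeling[i] == labeling[j] else diff
    return total


def best_joint(local: List[Dict[str, float]], same: float, diff: float) -> Tuple[str, ...]:
    """Exhaustively search all labelings (exponential; for illustration only)."""
    return max(product(LABELS, repeat=len(local)),
               key=lambda labeling: joint_score(labeling, local, same, diff))


independent = tuple(max(LABELS, key=scores.get) for scores in LOCAL)
collective = best_joint(LOCAL, same=0.5, diff=0.0)
print("independent:", independent)
print("collective: ", collective)
```

With these assumed numbers, the confident evidence from "lives in Y" pulls the two ambiguous occurrences toward the same label, which is the effect the associative condition is meant to produce.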

Managing imprecision

  • Representing the imprecision of extraction as simple row and column uncertainty models for easy querying (VLDB 06)
  • Aggregate queries over data with uncertain duplicates (EDBT 08)
      • Given a large set of entities with noisy duplicates where finding duplicate groups is expensive, find
          • K largest groups
          • Groups with count >= threshold

All papers available at: http://www.it.iitb.ac.in/~sunita/pubs.html
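For orientation, the two aggregate queries can be stated as a naive baseline that first resolves all duplicates and then counts. This exhaustive pairwise approach is exactly the expensive step the slide wants to avoid, so it is not the EDBT 08 method; the token-set Jaccard similarity, the `min_sim` cutoff and the sample records are assumptions for this sketch.

```python
import re
from collections import defaultdict
from typing import Dict, FrozenSet, List


def name_tokens(record: str) -> FrozenSet[str]:
    """Lowercased alphabetic tokens, ignoring order and punctuation."""
    return frozenset(re.findall(r"[a-z]+", record.lower()))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def duplicate_groups(records: List[str], min_sim: float) -> List[List[str]]:
    """Group records by transitive closure of pairwise similarity, largest group first."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    tokens = [name_tokens(r) for r in records]
    # Quadratic number of comparisons: the cost a smarter method would prune.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(tokens[i], tokens[j]) >= min_sim:
                parent[find(i)] = find(j)

    groups: Dict[int, List[str]] = defaultdict(list)
    for i, record in enumerate(records):
        groups[find(i)].append(record)
    return sorted(groups.values(), key=len, reverse=True)


records = ["Ron Fagin", "Fagin, Ron", "ron fagin", "J. Ullman", "Ullman, J.", "Bill Gates"]
groups = duplicate_groups(records, min_sim=0.99)
k_largest = groups[:2]                           # K largest groups, K = 2
frequent = [g for g in groups if len(g) >= 2]    # groups with count >= threshold
print(k_largest)
print(frequent)
```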