Type-enabled Keyword Searches with Uncertain Schema
Soumen Chakrabarti
IIT Bombay
www.cse.iitb.ac.in/~soumen
ICML 2005
Evolution of Web search
§ The first decade of Web search
• Crawling and indexing at massive scale
• Macroscopic whole-page connectivity analysis
• Very limited expression of information need
§ Exploiting entities and relations: a clear trend
• Maintaining large type systems and ontologies
• Discovering mentions of entities and relations
• Deduplicating and canonicalizing mentions
• Forming uncertain, probabilistic E-R graphs
• Enhancing keyword or schema-aware queries
[Figure: system architecture. Lexical resources: KnowItAll, FrameNet, Wikipedia, WordNet, behind a uniform lexical network provider. Corpus side: raw corpus; disambiguation, named entity tagging, relation tagging; annotated corpus; indexer; text index and annotation index. Query side: question; answer type predictor; keyword match predictor; ranking engine; response snippets. Other input: past query workload stats. Components are numbered 1 to 4 in the diagram.]
Populating entity and relation tables
§ Hearst patterns (Hearst 1992)
• “T such as x”, “x and other T”, “x is a T”
§ DIPRE (Brin 1998)
§ Snowball (Agichtein+ 2000)
• [left] entity 1 [middle] entity 2 [right]
§ PMI-IR (Turney 2001)
• Recognize synonyms using Web stats
§ KnowItAll (Etzioni+ 2004)
§ C-PANKOW (Cimiano+ 2005)
• Is-a relations from Hearst patterns, lists, PMI
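The three Hearst templates listed above can be tried out with a few regular expressions. This is only a minimal sketch: the noun-phrase regex `NP`, the type regex `TYPE`, and the function name `extract_isa` are illustrative assumptions, not part of any cited system, and real extractors use chunkers rather than capitalization heuristics.

```python
import re

# Crude stand-ins (assumptions): a capitalized phrase for the instance x,
# a single lowercase word for the type T.
NP = r"([A-Z][\w-]*(?: [A-Z][\w-]*)*)"
TYPE = r"([a-z]+)"

# Each entry pairs a compiled template with a function that returns
# (instance, type) from the match.
PATTERNS = [
    # "T such as x"
    (re.compile(TYPE + r" such as " + NP), lambda m: (m.group(2), m.group(1))),
    # "x and other T"
    (re.compile(NP + r" and other " + TYPE), lambda m: (m.group(1), m.group(2))),
    # "x is a T" (also accepts "an")
    (re.compile(NP + r" is an? " + TYPE), lambda m: (m.group(1), m.group(2))),
]

def extract_isa(text):
    """Return (instance, type) pairs found by the three templates, in template order."""
    pairs = []
    for regex, build in PATTERNS:
        for match in regex.finditer(text):
            pairs.append(build(match))
    return pairs
```

The type string is returned as written in the text (for example "cities"), so a real system would still need to normalize plurals before filling a table.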
DIPRE and Snowball
[Figure: bootstrapping loop. Seed tuples; tag mentions in free text; generate extraction patterns; locate new tuples; augmented table.]
§ Left (ℓ), middle (m), and right (r) contexts encoded as bag-of-words
§ Example: “… the Irving-based Exxon Corporation …” with Irving tagged as location and Exxon Corporation as organization
Scoring patterns and tuples
Snowball, DIPRE
§ Pattern confidence = m+/(m+ + m−) over validation tuples
§ Uses 5-part encoding
§ Soft-or tuple confidence = [formula image not preserved]
§ Recent improvements: Urn model (Etzioni+ 2005)
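The pattern confidence above is a simple ratio, and a soft-or is commonly computed as a noisy-or over the confidences of the patterns that produced a tuple. The sketch below assumes that standard noisy-or form, since the slide's own formula did not survive extraction; published variants may also weight each term by how well the context matches the pattern, which is omitted here. Function names are illustrative.

```python
def pattern_confidence(m_pos, m_neg):
    """m+/(m+ + m-): share of a pattern's matches on validation tuples that are correct.

    Returns 0.0 when the pattern matched no validation tuple at all.
    """
    total = m_pos + m_neg
    return m_pos / total if total else 0.0

def soft_or(confidences):
    """Noisy-or combination (assumed form): 1 - prod(1 - c) over supporting patterns."""
    remaining = 1.0
    for c in confidences:
        remaining *= (1.0 - c)
    return 1.0 - remaining
```

Under this form, a tuple supported by two patterns of confidence 0.5 each scores 0.75, so independent weak evidence accumulates without ever exceeding 1.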
KnowItAll and C-PANKOW
§ A “propose-validate” approach
• Using existing patterns, generate queries
• For each web page w returned
• Extract potential fact e and assign confidence score
• Add fact to database if it has high enough score
§ Patterns use chunk info
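The propose-validate steps above amount to a nested loop over queries, returned pages, and extracted facts. The following is a minimal sketch under stated assumptions: `search` and `extract` are hypothetical callables supplied by the caller (no real search API is implied), and the threshold value is arbitrary.

```python
def propose_validate(queries, search, extract, threshold=0.8):
    """Collect facts whose confidence clears the threshold.

    queries: iterable of query strings generated from existing patterns
    search:  hypothetical callable, query -> iterable of page texts
    extract: hypothetical callable, page -> iterable of (fact, confidence)
    Returns {fact: best confidence seen}.
    """
    database = {}
    for query in queries:
        for page in search(query):
            for fact, confidence in extract(page):
                # Keep a fact only if it scores high enough; remember the best score.
                if confidence >= threshold and confidence > database.get(fact, 0.0):
                    database[fact] = confidence
    return database
```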
Exploiting answer types with PMI
§ From two word queries to two text boxes: answer type; keywords to match
• author; “Harry Potter”
• person; “Eiffel Tower”
• director; Swades movie
• city; India Pakistan cricket
§ Keywords → search engine → snippets
§ Every token/chunk in a snippet is a candidate
• Elimination hacks that we won’t discuss
§ Fire Hearst pattern queries between desired answer type and candidate token/chunk
Information carnivores at work
§ Sample snippet: “KO :: India Pakistan Cricket Series. A web site by Khalid Omar, sort of live from Karachi, Pakistan.”
§ Probe queries: “cities such as [probe]”, “[probe] and other cities”, “[probe] is a city”, etc.
§ “Garth Brooks is a country” [singer], “gift such as wall” [clock]
§ “person like Paris” [Hilton], “researchers like Michael Jordan” (which one?)
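The probe queries above can be generated mechanically for each candidate token, and their hit counts turned into a PMI-style score. This sketch assumes a hit-count ratio (pattern-with-probe hits divided by probe-alone hits) as the score; `hits` is a hypothetical stand-in for a search engine count lookup, and the caller supplies singular and plural forms of the answer type.

```python
def probe_queries(singular, plural, probe):
    """Build quoted Hearst-style probe queries for one candidate."""
    article = "an" if singular[0].lower() in "aeiou" else "a"
    return [
        f'"{plural} such as {probe}"',
        f'"{probe} and other {plural}"',
        f'"{probe} is {article} {singular}"',
    ]

def pmi_score(hits, singular, plural, probe):
    """Assumed PMI-style score: co-occurrence hits relative to hits for the probe alone.

    hits: hypothetical callable, quoted query string -> hit count
    """
    probe_hits = hits(f'"{probe}"')
    if probe_hits == 0:
        return 0.0
    co_hits = sum(hits(q) for q in probe_queries(singular, plural, probe))
    return co_hits / probe_hits
```

Dividing by the probe's own hit count is what keeps very popular strings from winning on raw frequency alone, though the misfires listed on the slide show that the patterns themselves remain ambiguous.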
Sample output
§ author; “Harry Potter”
• J K Rowling, Ron
§ person; “Eiffel Tower”
• Gustave, (Eiffel), Paris
§ director; Swades movie
• Ashutosh Gowariker, Ashutosh Gowarikar
§ Ambiguity and extremely skewed Web popularity
§ What can search engines do to help?
• Cluster mentions and assign IDs
• Allow queries for IDs: expensive!
• “Harry Potter” context in “Ron is an author”
Answer type (atype) prediction
§ Standard sub-problem in question answering
§ Increasingly important (but more difficult) for grammar-free Web queries (Broder 2002)
§ Current approaches
• Pattern matching, e.g. head of noun phrase adjacent to what or which; map when, who, where directly to classes time, person, place
• Coupled perceptrons (Li and Roth, 2002)
• Linear SVM on bag of 2-grams (Hacioglu 2002)
• SVM with tree kernel on parse (Zhang and Lee, 2004): slim gains
§ Surely a parse tree holds more usable info
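The pattern-matching baseline above can be approximated in a few lines. This is a deliberately crude sketch: it maps when, who, and where directly to classes, and for what/which it takes the next token as a stand-in for the head of the adjacent noun phrase unless that token is an auxiliary verb. The `AUXILIARIES` set and the function name are assumptions made for illustration.

```python
WH_TO_ATYPE = {"when": "time", "who": "person", "where": "place"}

# Assumed list of tokens that signal there is no noun phrase right after what/which.
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did"}

def guess_atype(question):
    """Return a guessed answer type, or None when the simple patterns do not apply."""
    tokens = question.lower().rstrip("?").split()
    if not tokens:
        return None
    first = tokens[0]
    if first in WH_TO_ATYPE:
        return WH_TO_ATYPE[first]
    if first in ("what", "which") and len(tokens) > 1 and tokens[1] not in AUXILIARIES:
        # Crude proxy for "head of noun phrase adjacent to what or which".
        return tokens[1]
    return None
```

Questions such as "What is the weight of a rhino?" fall through to None here, which is the kind of case that motivates looking deeper into the parse.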
Informer span
§ A short, contiguous span of question tokens reveals the anticipated answer type (atype)
§ Except in multi-function questions, one informer span is dominant and sufficient
• What is the weight of a rhino?
• How much does a rhino weigh?
• How much does a rhino cost?
• Who is the CEO of IBM?
§ Question → parse → informer span tagger
§ Learn atype label from informer + question
Example
[Figure: parse tree of “What is the capital city of Japan”, drawn in levels 0 to 4. POS tags: What/WP, is/VBZ, the/DT, capital/NN, city/NN, of/IN, Japan/NNP. Higher levels: WHNP, NP, PP, VP, SQ, SBARQ. Below it, a chain of states (start) → 1 → 2 → 3 emitting “What, is, the”, “capital, city”, and “of, Japan”.]
§ Pre-in-post Markov process produces question
§ Train a CRF with features derived from parse tree
• POS, attachments to neighboring chunks, multiple levels
• First noun chunk? Adjacent to second verb?
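The pre-in-post view assigns each question token one of three states: before the informer span, inside it, or after it. As a small sketch of how training labels for a sequence tagger could be derived from a marked span (the function name and the half-open span convention are assumptions):

```python
def pre_in_post_labels(tokens, start, end):
    """Label tokens 1 (pre), 2 (in), or 3 (post) for the informer span tokens[start:end]."""
    if not (0 <= start < end <= len(tokens)):
        raise ValueError("informer span must be a non-empty slice of tokens")
    return [1 if i < start else 2 if i < end else 3 for i in range(len(tokens))]
```

For the question on this slide, marking "capital city" as the span gives states 1, 1, 1, 2, 2, 3, 3 for its seven tokens.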
Atype guessing accuracy
[Figure: classification pipeline. Question; trained CRF; filter; informer feature generator; ordinary feature generator; merge; feature vector; linear SVM; atype.]
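One way to read the merge step in this pipeline is that features from the whole question and features from the informer span are kept in separate namespaces of a single sparse vector. The sketch below assumes binary unigram features and the prefixes `q=` and `inf=`; both choices are illustrative and are not taken from the slides.

```python
def merged_features(question_tokens, informer_tokens):
    """Build one sparse binary feature dict from ordinary and informer features."""
    features = {}
    for token in question_tokens:
        # Ordinary features: every question token.
        features["q=" + token.lower()] = 1
    for token in informer_tokens:
        # Informer features: same tokens, but in their own namespace so a
        # linear classifier can weight them differently.
        features["inf=" + token.lower()] = 1
    return features
```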
Scoring function for typed search
§ Instance of atype “near” keyword matches
• IR systems: “hard” proximity predicates
• Search engines: unknown reward for proximity
• XML+IR, XRank: “hard” word containment in subtree
[Figure: Question: Who invented the television? Atype: person#n#1. Selectors: invent*, television. Snippet: “… television was invented in 1925. Inventor John Baird was born …”. The candidate “John Baird” IS-A person#n#1; selectors count up to some maximum window around the candidate; one selector occurrence is marked “Not closest”.]
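A soft alternative to the hard proximity predicates listed above is to let each selector contribute a reward that decays with its token distance from the candidate. The sketch below is one assumed form, not the scoring function from the talk: it uses only the closest occurrence of each selector, an exponential decay, and a hard cutoff at a maximum window. All parameter names and defaults are illustrative.

```python
import math

def candidate_score(candidate_pos, selector_hits, idf, max_window=10, decay=0.5):
    """Score one candidate token position by nearby selector matches.

    selector_hits: {selector: [token positions where it matched]}
    idf:           {selector: weight}
    """
    score = 0.0
    for selector, positions in selector_hits.items():
        gaps = [abs(p - candidate_pos) for p in positions]
        gaps = [g for g in gaps if 0 < g <= max_window]
        if gaps:
            # Assumption: only the closest occurrence of a selector counts.
            score += idf.get(selector, 0.0) * math.exp(-decay * min(gaps))
    return score
```

With the snippet on this slide tokenized from position 0, "John" sits at position 6, "television" at 0, and "invented" at 2, so shrinking `max_window` to 5 drops the contribution of "television".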
Learning a scoring function
§ Assume parametric form for a ranking classifier
• Form of IDF, window size
• Can also choose among decay function forms
§ Question-answer pairs give partial orders (Joachims 2004)
§ Recall in top-50, mean reciprocal rank
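The two evaluation measures named above are straightforward to compute from ranked answer lists. In this sketch, "recall in top-k" is taken to mean the fraction of questions with at least one correct answer among the first k results, which is an assumption about the slide's exact definition.

```python
def reciprocal_rank(ranked, correct):
    """1/rank of the first correct item, or 0.0 if none is present."""
    for rank, item in enumerate(ranked, start=1):
        if item in correct:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked list, set of correct answers), one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, correct) for ranked, correct in runs) / len(runs)

def recall_in_top_k(runs, k=50):
    """Fraction of questions with a correct answer in the top k (assumed definition)."""
    if not runs:
        return 0.0
    found = sum(1 for ranked, correct in runs if any(x in correct for x in ranked[:k]))
    return found / len(runs)
```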
Indexing issues
§ Standard IR posting: word → {(doc, offsets)}
• word 1 near word 2 is standard
• instance-of(atype) near {word 1, word 2, …}
§ WordNet has 80000 atype nodes, 17000 internal, depth > 10
• “horse” also indexed as mammal, animal, sports equipment, chess piece, …
• Original corpus 4 GB, gzipped corpus 1.3 GB, IR index 0.9 GB, full atype index 4.3 GB
§ XML structure indices not designed for fine-grained, word-as-element-node use
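The size blow-up of a full atype index comes from posting every token under all of its ancestors in the type hierarchy. The sketch below shows that expansion on a toy taxonomy; the `PARENTS` table is hypothetical sample data, not WordNet, and the function names are illustrative.

```python
from collections import defaultdict

# Toy is-a hierarchy (child -> parents). Hypothetical data for illustration only.
PARENTS = {
    "horse": ["mammal", "chess piece"],
    "mammal": ["animal"],
}

def ancestors(atype):
    """All types reachable by following parent links from atype."""
    seen, stack = set(), [atype]
    while stack:
        node = stack.pop()
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def build_atype_index(docs):
    """docs: {doc_id: [tokens]} -> {atype: [(doc_id, offset)]}."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for offset, token in enumerate(tokens):
            # One posting per ancestor: this is where the index grows.
            for atype in ancestors(token):
                index[atype].append((doc_id, offset))
    return dict(index)
```

A single occurrence of "horse" produces three postings here; with a hierarchy more than ten levels deep, the same effect plausibly accounts for an atype index several times larger than the plain text index.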
Exploit skew in query atypes?
§ Index only a small registered set of atypes R
§ Relax query atype a to generalization g in R
§ Test each response for reachability from a, then retain or discard it
§ How to pick R? What is a good objective?
• Relaxed query and discarding steps cost extra time
• Rare atypes in what, which, and name questions: long-tailed distribution
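The relax-then-filter idea above can be sketched on a toy single-parent hierarchy: replace the query atype a by its nearest registered generalization g, fetch candidates indexed under g, and keep only those that are actually instances of a. The `PARENTS` table, the candidate-to-type map, and all function names below are hypothetical.

```python
# Toy single-parent hierarchy (child -> parent). Hypothetical data.
PARENTS = {"city": "location", "country": "location",
           "location": "entity", "person": "entity"}

def generalizations(atype):
    """The atype itself followed by its ancestors up to the root."""
    chain = [atype]
    while chain[-1] in PARENTS:
        chain.append(PARENTS[chain[-1]])
    return chain

def relax(atype, registered):
    """Nearest generalization of atype that is in the registered set R, or None."""
    for g in generalizations(atype):
        if g in registered:
            return g
    return None

def answer(atype, registered, index, instance_type):
    """Query the index under g, then discard candidates that are not instances of atype.

    index:         {registered atype: [candidates]}
    instance_type: {candidate: its most specific atype}
    """
    g = relax(atype, registered)
    if g is None:
        return []
    return [c for c in index.get(g, [])
            if atype in generalizations(instance_type[c])]
```

The filtering pass is the source of the extra query time: every candidate under g is scanned even though only those under a survive.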
Approx objective and approach
§ Index space approx [formula image not preserved]
§ Expected query time bloat is approx [formula image not preserved]
§ Minimize approx index space with an upper bound on bloat (hard, as expected)
§ Sparseness: queryProb(a) observed to be zero for most a-s in a large taxonomy
§ Smooth using similarity between atypes
Sample results
§ Index space approximation reasonable
§ Reasonable average query time bloat with small index space overheads
[Figure: runtime across queries, using g versus using a.]
Summary
§ Entity and relation annotators
• Maturing technology
• Unlikely to be perfect for open-domain sources
§ The future: query paradigms that combine text and annotations
• End-user friendly selection and aggregation
• Allow uncertainty, exploit redundancy
§ Can we scale to terabytes of text?
§ Will centralized search engines be feasible?
§ How to federate annotation management?