The Sphere Search Engine for Unified Ranked Retrieval

The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents
Jens Graupmann, Ralf Schenkel, Gerhard Weikum

Problem
Data sources: Web, intranet, enterprise information systems, databases, …
Search for…
• The inventor of the World Wide Web
• Gothic and romantic churches that are located in the same place
• Movies with an actor who is the governor of California

Arising Questions
• What do we know about the structure of the data?
• Why do we care about the structure of the data?
  – Plain text: Roy Smith works in Apple
  – XML: <worker> <name>Roy Smith</name> <company>Apple</company> </worker>
  – XPath query: //worker[company="Apple"]/name
• Can we "structure" data? How?
  – <person>Roy Smith</person> works in <company>Apple</company>
• Does John Doe also work at Apple?
  – <company>Apple</company> employs <person>John Doe</person>
• How do we interpret links?
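To make the benefit of structure concrete, here is a minimal Python sketch that runs the slide's XPath query with the standard library. The `<staff>` wrapper, the second worker, and the function name `names_at` are assumptions added for illustration; `xml.etree.ElementTree` supports only a subset of XPath, which happens to cover this query.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the slide's <worker> element plus one more,
# wrapped in an assumed <staff> root element.
DOC = """
<staff>
  <worker><name>Roy Smith</name><company>Apple</company></worker>
  <worker><name>Jane Roe</name><company>Acme</company></worker>
</staff>
"""

def names_at(company: str, xml_text: str = DOC) -> list:
    """Return the names of all workers whose <company> child equals `company`."""
    root = ET.fromstring(xml_text)
    # ElementTree supports the XPath predicate [child='text'].
    # Note: this naive string formatting breaks if `company` contains a quote.
    path = f".//worker[company='{company}']/name"
    return [el.text for el in root.findall(path)]

print(names_at("Apple"))
```

The same question asked of the plain sentence "Roy Smith works in Apple" would need text analysis, which is what the annotation step later in the talk addresses.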

Example query #1
What are the publications of Max Planck?
• "Max Planck" should be an instance of the concept person, not of the concept institute
Concept awareness

Example query #2
Conferences about XML in Norway 2005?
• Information is not present on a single page, but distributed across linked pages
• Linked pages in the example: "VLDB Conference 2005, Trondheim, Norway" and "Call for Papers …XML…"
Context awareness

Example query #3
Which professors from the Technion do research on theory of computer science?
• Different terminology in query and Web pages
• Page in the example: Dr. Amir Shpilka …senior lecturer…
Abstraction awareness

SphereSearch Concepts
Goal: Increase recall & precision for hard queries on linked and heterogeneous data
• Unified search for unstructured, semistructured, and structured data from heterogeneous sources
• Graph-based model, including links
• Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, …) for concept-aware queries
• Flexible yet simple abstraction-aware query language with context-aware scoring
• Compactness-based scores

Outline
• Challenges in search engines
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work

Unifying Search on Heterogeneous Data
• Sources: Web, intranet, databases, enterprise information systems, …
• Heuristics and type-specific transformations convert the sources to XML

Heuristic Transformation of HTML
Goal: Transform layout tags to semantic annotations
• Headlines
  – Input: <h1>Experiments</h1> <h2>Settings</h2> We evaluated... <h2>Results</h2> Our system...
  – Output: <Experiments> <Settings>...</Settings> <Results>...</Results> </Experiments>
• Patterns
  – Input: <b>Topic: </b>XML
  – Output: <Topic>XML</Topic>
• Rules for tables, lists, …
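The two heuristics on this slide can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not the SphereSearch transformer: the function names `headings_to_tags` and `label_patterns` are invented here, the input is assumed to be flat HTML with plain `<h1>`..`<h6>` tags, and headline text is turned into a tag name by replacing non-word characters with underscores.

```python
import re

def headings_to_tags(html: str) -> str:
    """Nest the text following each <hN> headline inside a tag named after it."""
    # re.split with two groups yields: text, level, title, text, level, title, ...
    parts = re.split(r"<h([1-6])>(.*?)</h\1>", html, flags=re.S)
    out = [parts[0].strip()]
    stack = []  # (level, tag) of the sections that are still open
    for i in range(1, len(parts), 3):
        level, title, body = int(parts[i]), parts[i + 1], parts[i + 2]
        # A headline closes every open section of the same or a deeper level.
        while stack and stack[-1][0] >= level:
            out.append(f"</{stack.pop()[1]}>")
        tag = re.sub(r"\W+", "_", title.strip())  # assumed naming rule
        stack.append((level, tag))
        out.append(f"<{tag}>")
        out.append(body.strip())
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(out)

def label_patterns(html: str) -> str:
    """Rewrite '<b>Label: </b>value' into '<Label>value</Label>'."""
    return re.sub(r"<b>(\w+):\s*</b>\s*(\w+)", r"<\1>\2</\1>", html)

page = "<h1>Experiments</h1><h2>Settings</h2>We evaluated<h2>Results</h2>Our system"
print(headings_to_tags(page))
print(label_patterns("<b>Topic: </b>XML"))
```

A production transformer would work on a parsed DOM rather than regular expressions and would have to handle headlines that are not valid XML names.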

Basic Data Model
<Professor> Gerhard Weikum <Course> IR </Course> Saarbrücken <Research> XML </Research> </Professor>
• Node 1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"
• Node 2: docid=1, tag="Course", content="IR"
• Node 3: docid=1, tag="Research", content="XML"
• Tags annotate content with the corresponding concept (the figure marks person and location)
• Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction

Information Extraction (IE)
• Named Entity Recognition
• Based on part-of-speech tagging and a large dictionary containing names of cities, countries, common person names, etc.
• Mature (out-of-the-box products, e.g., GATE/ANNIE)
• Extensible
Input: The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7 pm.
Output: The <company>Pelican Hotel</company> in <location>Salvador</location>, operated by <person>Roberto Cardoso</person>, offers comfortable rooms starting at <price>$100</price> a night, including breakfast. Please check in before <time>7 pm</time>.
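The dictionary-lookup part of named entity recognition can be illustrated with a toy Python annotator. This is not GATE/ANNIE and does no part-of-speech tagging; the gazetteer entries, the two regular expressions, and the function name `annotate` are assumptions chosen to reproduce the slide's example.

```python
import re

# Hypothetical toy gazetteer; real tools ship far larger dictionaries.
GAZETTEER = {
    "Pelican Hotel": "company",
    "Salvador": "location",
    "Roberto Cardoso": "person",
}
# Assumed surface patterns for two value-like concepts.
PATTERNS = [
    (re.compile(r"\$\d+(?:\.\d{2})?"), "price"),
    (re.compile(r"\b\d{1,2}(?::\d{2})? ?[ap]m\b"), "time"),
]

def annotate(text: str) -> str:
    """Wrap known names and simple patterns in concept tags."""
    # Longest names first, so a long entry wins over a shorter one it contains.
    for name in sorted(GAZETTEER, key=len, reverse=True):
        tag = GAZETTEER[name]
        text = re.sub(
            rf"\b{re.escape(name)}\b",
            lambda m, t=tag: f"<{t}>{m.group(0)}</{t}>",
            text,
        )
    for pattern, tag in PATTERNS:
        text = pattern.sub(lambda m, t=tag: f"<{t}>{m.group(0)}</{t}>", text)
    return text

sentence = ("The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers "
            "comfortable rooms starting at $100 a night. Please check in before 7 pm.")
print(annotate(sentence))
```

The lookup is context-free, so it cannot tell the person Max Planck from the institute; that disambiguation is where the statistical components of a real IE tool come in.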

Annotation-Aware Data Model
<Professor> Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research> </Professor>
Before annotation:
• Node 1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"
• Node 2: docid=1, tag="Course", content="IR"
• Node 3: docid=1, tag="Research", content="XML"
Annotation with GATE: "Saarbrücken" is of type "location". Annotation introduces new tags.
After annotation:
• Node 1: docid=1, tag="Professor", content="Gerhard Weikum"
• Node 2: docid=1, tag="Course", content="IR"
• Node 3: docid=1, tag="Research", content="XML"
• Node 4: docid=1, tag="location", content="Saarbrücken"

Unifying Search on Heterogeneous Data
• Sources: Web, intranet, databases, enterprise information systems, …
• Heuristics and type-specific transformations convert the sources to XML
• Annotation of named entities with IE tools (e.g., GATE) yields annotated XML

Data Model
• Given a collection of XML documents and links, we define the element-level graph G = (V, E)
  – V is the union of the elements of all documents
  – E contains all parent-child edges and links
  – Attributes are treated as if they were elements
• G is undirected, because it is easier to phrase queries without thinking about the direction of the edges
• Each element has a tag name and a content
• Each edge is assigned a nonnegative weight: 1 for a parent-child edge and a separate weight for links
• A distance function takes two elements as input and computes the weight of the shortest path in G between them
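The element-level graph and its distance function can be sketched as follows. The class name `ElementGraph`, the method names, and in particular the default `link_weight=2.0` are assumptions for illustration; the slide fixes only the parent-child weight of 1. Shortest paths are computed with Dijkstra's algorithm, which is valid because all weights are nonnegative.

```python
import heapq
from collections import defaultdict

class ElementGraph:
    """Undirected, weighted graph over XML elements (hypothetical sketch)."""

    def __init__(self, link_weight: float = 2.0):  # assumed link weight
        self.adj = defaultdict(dict)
        self.link_weight = link_weight

    def _add(self, u, v, w):
        # Undirected edge; keep the lighter weight if the edge already exists.
        w = min(w, self.adj[u].get(v, float("inf")))
        self.adj[u][v] = self.adj[v][u] = w

    def add_child(self, parent, child):
        self._add(parent, child, 1.0)  # parent-child edges weigh 1

    def add_link(self, source, target):
        self._add(source, target, self.link_weight)

    def distance(self, a, b) -> float:
        """Weight of the shortest path between a and b (inf if unconnected)."""
        dist = {a: 0.0}
        heap = [(0.0, a)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == b:
                return d
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, w in self.adj.get(u, {}).items():
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return float("inf")

g = ElementGraph()
g.add_child("professor", "course")
g.add_child("professor", "research")
g.add_link("course", "paper")
print(g.distance("research", "paper"))
```

Element identifiers are plain strings here; a real index would use (docid, node id) pairs as in the data model slides.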

SphereSearch Queries
Extended keyword queries:
• similarity conditions: ~professor, ~Information retrieval
• concept-based conditions: person=Max Planck, location=Paris
• grouping
• join conditions
Ranked results with context-aware scoring

SphereSearch Queries: Examples
• R(professor, location=~Jordan): R is a query group, location is a concept (tag), ~Jordan is a value (content)
• C(course, ~database) R(~seminar, ~XML)
• For each query group, a disjunction of basic conditions
• A(gothic, church) B(romantic, church) A.location=B.location, where A.location=B.location is a join condition
• Supports traditional keyword search

Formal Query Language
• A SphereSearch query consists of a set of query groups and a (possibly empty) set of join conditions
• Each query group consists of keyword conditions and concept-value conditions
• A join condition has one form for an exact-match join (for example A.T=B.S, where A and B are query groups and T and S are tag names) and another form for a similarity join
• A result for a query with g query groups is a list of g-tuples of elements, sorted by score
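To show how the pieces of a query fit together, here is a small Python parser for a subset of the syntax used in the examples. It is a sketch under assumptions: conditions are taken to be comma-separated, only exact-match joins of the form `A.T=B.S` are recognized, range conditions such as `1970<date<1980` are kept as plain keywords, and the names `Condition`, `Query`, and `parse` are invented here.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Condition:
    value: str
    concept: Optional[str] = None  # set for concept=value conditions
    similar: bool = False          # True if the value had a leading "~"

@dataclass
class Query:
    groups: Dict[str, List[Condition]] = field(default_factory=dict)
    joins: List[Tuple[str, str, str, str]] = field(default_factory=list)

GROUP = re.compile(r"(\w+)\(([^)]*)\)")                    # e.g. A(gothic, church)
JOIN = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")      # e.g. A.location=B.location

def parse(text: str) -> Query:
    query = Query()
    for name, body in GROUP.findall(text):
        for raw in filter(None, (c.strip() for c in body.split(","))):
            concept, sep, value = raw.partition("=")
            if not sep:  # plain keyword or ~keyword
                concept, value = None, raw
            value = value.strip()
            query.groups.setdefault(name, []).append(
                Condition(
                    value=value.lstrip("~"),
                    concept=concept.strip() if concept else None,
                    similar=value.startswith("~"),
                )
            )
    query.joins = JOIN.findall(text)  # (group, tag, group, tag) tuples
    return query

q = parse("A(gothic, church) B(romantic, church) A.location=B.location")
print(q.groups["A"], q.joins)
```

Each entry in `groups` corresponds to one position of the g-tuples in the result list, and each entry in `joins` later becomes additional edges in the element graph.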

Local Node Score
Query: Course ~XML location=Max Planck
• We compute a node score for each node, based on a standard scoring model (tf/idf, Okapi BM25, …) adapted for XML
• How can we use such a measure to score ~XML or location=Max Planck?

Similarity Conditions
Similarity conditions like ~professor, ~XML
• For a similarity condition ~K, we first compute exp(K), the set of all terms similar to K according to the ontology
• Example: δ-exp(K) = {w | sim(K, w) > δ}
• Local score: weighted maximum over all expansion terms
• Thesaurus/ontology: concepts, relationships, and glosses from WordNet, gazetteers, Web forms & tables, Wikipedia
• Relationships are quantified by statistical co-occurrence measures
• Ontology neighborhood of "professor" (figure): alchemist, primadonna, artist, director, wizard, investigator, intellectual, researcher, scientist, scholar, academic/academician/faculty member, educator, lecturer, mentor, teacher; one edge is labeled HYPONYM (0.7)
Abstraction awareness
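The expansion and the weighted maximum can be sketched in Python as below. The slide's scoring formula is not preserved in this transcript, so the product `sim(K, w) * node_score(w)` is an assumed reading of "weighted max"; treating K itself as an expansion term with weight 1, the similarity values, and the function names are also assumptions.

```python
def expand(term, sim, delta):
    """delta-exp(K): every ontology term w with sim(K, w) > delta.

    Assumption: K itself is always included with weight 1.0.
    """
    expansion = {term: 1.0}
    for w, s in sim.get(term, {}).items():
        if s > delta:
            expansion[w] = s
    return expansion

def similarity_score(term, node_scores, sim, delta=0.5):
    """Weighted maximum of a node's per-term scores over the expansion of `term`."""
    return max(
        (s * node_scores.get(w, 0.0) for w, s in expand(term, sim, delta).items()),
        default=0.0,
    )

# Hypothetical ontology weights and per-term scores of one element.
SIM = {"professor": {"lecturer": 0.7, "teacher": 0.6, "wizard": 0.1}}
node_scores = {"lecturer": 0.8, "wizard": 1.0}

print(expand("professor", SIM, 0.5))
print(similarity_score("professor", node_scores, SIM))
```

With δ = 0.5 the term "wizard" is excluded from the expansion, so its high local score does not leak into the result for ~professor.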

Concept-Based Conditions
Goal: Exploit explicit (tags) and automatic annotations in documents
• location=Jordan: location is the concept, Jordan is the value
• Matching node n: docid=1, tag="location", content="Jordan"
• Concept-specific distance measures support conditions like 1970<date<1980
Concept awareness

Spheres
• Most existing retrieval systems consider only the content of an element itself to assess its relevance for a query
• In SphereSearch this type of score is provided by the node score (ns)
• A local score may not be sufficient
  – In the presence of links
  – When content is spread over several elements

Score Aggregation: SphereScore
• Local score for each element e (tf/idf, BM25, …); the figure uses the query terms research and XML
• Sphere score: weighted aggregation of the local scores in the environment of an element
Context awareness
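The aggregation formula itself is not preserved in this transcript, so the following Python sketch shows one plausible form rather than the authors' definition: the local scores of all elements at hop distance i are summed and damped by `alpha ** i`, up to a maximum distance. The parameters `max_dist` and `alpha`, the use of unweighted hops, and the function name are all assumptions.

```python
from collections import deque

def sphere_score(graph, node_score, start, max_dist=2, alpha=0.5):
    """Damped sum of local scores in the neighborhood of `start`.

    graph: dict mapping each node to an iterable of neighbors (undirected).
    node_score: dict mapping nodes to their local scores.
    Assumed form: sum over i = 0..max_dist of alpha**i * (scores at hop distance i).
    """
    hops = {start: 0}
    queue = deque([start])
    total = 0.0
    while queue:  # breadth-first search, so hops[] holds shortest hop counts
        u = queue.popleft()
        d = hops[u]
        total += (alpha ** d) * node_score.get(u, 0.0)
        if d == max_dist:
            continue  # do not expand beyond the outermost sphere
        for v in graph.get(u, ()):
            if v not in hops:
                hops[v] = d + 1
                queue.append(v)
    return total

# Hypothetical chain a - b - c - d with local scores 1, 2, 4, 8.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
local = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 8.0}
print(sphere_score(chain, local, "a"))  # 1 + 0.5*2 + 0.25*4
```

Element d lies outside the radius and contributes nothing, which is the intended effect: nearby context raises a score, distant content does not.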

Query Groups
Goal: Related terms should occur in the same context
• Group conditions that relate to the same "entity"
• Example: the keywords professor teaching IR research XML become professor T(teaching IR) R(research XML)
• The SphereScore is computed for each group
• Find compact sets with one result for each group

Scores for Query Results
• Query result R: one result per query group
• Compactness ~ 1/size of a minimal spanning tree connecting the result elements
Context awareness
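Compactness can be illustrated with Prim's algorithm over the pairwise distances of the result elements. This is a sketch under assumptions: the slide states only the proportionality to 1/size, so returning exactly `1 / size`, treating a single-element result as infinitely compact, and the names `mst_size` and `compactness` are choices made here.

```python
def mst_size(nodes, dist):
    """Total weight of a minimal spanning tree over `nodes` (Prim's algorithm).

    dist(u, v) is assumed to return the shortest-path weight between two elements.
    """
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Cheapest edge that connects the tree to a node outside it.
        weight, nxt = min(
            (dist(u, v), v) for u in in_tree for v in nodes if v not in in_tree
        )
        total += weight
        in_tree.add(nxt)
    return total

def compactness(nodes, dist):
    size = mst_size(nodes, dist)
    # Assumption: a result consisting of one element is maximally compact.
    return 1.0 / size if size > 0 else float("inf")

# Hypothetical pairwise distances between the results of three query groups.
PAIRS = {frozenset("AB"): 1.0, frozenset("BC"): 2.0, frozenset("AC"): 5.0}
pair_dist = lambda u, v: PAIRS[frozenset((u, v))]

print(mst_size("ABC", pair_dist), compactness("ABC", pair_dist))
```

The expensive A-C connection is never used because B bridges the two, so a result whose elements sit close together in the graph gets a higher compactness.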

Join Conditions
Goal: Connect results of different query groups
• Example: A(research, XML) B(VLDB 2005 paper) A.person=B.person
• Figure: elements for groups A and B, including the person values "Ralf Schenkel" and "R. Schenkel", with edge labels 1.0 and 0.9
• Join conditions do not change the score for a node
• Join conditions create a new link with a specific weight

Score for Join Conditions
Join condition A.T=B.S:
• For all nodes n1 with type T and n2 with type S, add an edge (n1, n2) with weight 1/sim(n1, n2)
• sim(n1, n2): content-based similarity
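The edge construction for a join condition can be sketched as follows. The slide does not say which content-based similarity is used, so `difflib.SequenceMatcher` from the Python standard library stands in for it; the `min_sim` cutoff (which also avoids division by zero), the node representation, and the function names are assumptions.

```python
from difflib import SequenceMatcher

def content_sim(a: str, b: str) -> float:
    """Stand-in content similarity in [0, 1]; not the measure used by SphereSearch."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def join_edges(nodes, tag_t, tag_s, min_sim=0.5):
    """Edges (n1, n2, weight) for a join condition A.T=B.S.

    nodes: dict mapping node id -> (tag, content).
    The weight is 1 / sim(n1, n2), so more similar contents lie closer together.
    """
    edges, seen = [], set()
    for id1, (tag1, text1) in nodes.items():
        if tag1 != tag_t:
            continue
        for id2, (tag2, text2) in nodes.items():
            pair = frozenset((id1, id2))
            if id1 == id2 or tag2 != tag_s or pair in seen:
                continue
            seen.add(pair)  # the graph is undirected: one edge per pair
            s = content_sim(text1, text2)
            if s >= min_sim:  # assumed cutoff for dissimilar contents
                edges.append((id1, id2, 1.0 / s))
    return edges

# Hypothetical annotated elements.
NODES = {
    1: ("person", "Ralf Schenkel"),
    2: ("person", "R. Schenkel"),
    3: ("person", "Gerhard Weikum"),
    4: ("location", "Saarbrücken"),
}
for edge in join_edges(NODES, "person", "person"):
    print(edge)
```

Because the new edges only change distances in the graph, node scores stay untouched and the join influences the ranking through compactness alone, as stated on the previous slide.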

Architecture
Components: Crawler, Client GUI, Transformer, Ontology Service, Query Processor, Annotator, Indexer, Index

Setup for Experiments
No existing benchmark (INEX, TREC, …) fits
Three corpora:
• Wikipedia
• Extended Wikipedia with links to IMDB
• Extended DBLP corpus with links to homepages
50 queries like
• A(actor birthday 1970<date<1980) western
• G(California, governor) M(movie)
• A(Madonna, husband) B(director) A.person=B.director
Baseline: keyword queries with a standard TF/IDF-based score ("simplified Google")

Incremental Language Levels
• SSE-basic (keywords, SphereScores)
• SSE-CV (concept-based conditions)
• SSE-QG (query groups)
• SSE-Join (join conditions)

Experimental Results on Wikipedia

Experimental Results on Wiki++ and DBLP++
• SphereScores are better than local scores
• New SSE features improve precision

Qualitative Query Examples
• Concept-value: (American, politician, rice) rewritten as (person=rice, politician); precision@10 rises from 0 to 0.6. Another example: (1970<date<1980, actor)
• Query groups: (California, governor, movie) rewritten as G(California, governor) M(movie); precision@10 rises from 0 to 0.4
• Joins: A(Madonna, husband) B(director) A.person=B.director; precision@10: 0.4
  – Result element pair: (1) movie directed by Guy Ritchie, (2) information that Guy Ritchie is Madonna's husband

Conclusion
• We introduced SphereSearch for unified ranked retrieval of heterogeneous data
  – Transformation of heterogeneous data into a unified format
  – Annotation of latent concepts
  – Incorporation of concept, context, and abstraction features
• Query language that is
  – More expressive than traditional keyword search
  – Simpler than a full-fledged XML query language / SPARQL
• The spheres idea seems to be beneficial
• Preliminary experiments show improvement for certain queries

Future Work
• Further experimentation is needed
• Integration with the Semantic Web
  – Usage of existing tools (such as SPARQL)
  – Ontology-based, inheritance-aware search
  – Standardization of annotated tags
  – Easier integration with existing RDF data
• More efficient query evaluation
• Better similarity measure
• Parameter tuning with relevance feedback
• Deep Web search through automatic queries

Thank you!