The SphereSearch Engine for Unified Ranked Retrieval

The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents
Ralf Schenkel, joint work with Jens Graupmann and Gerhard Weikum

Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Summary
VLDB 2005, Trondheim, Norway

Example query #1
Which professors from Saarbrücken do research on XML?
Different terminology in query and Web pages:
• Director of Department 5 DBS & IS
• Professor at Saarland University
Abstraction Awareness

Example query #2
Conferences about XML in Norway 2005?
Information is not present on a single page, but distributed across linked pages:
• VLDB Conference 2005, Trondheim, Norway
• Call for Papers …XML…
Context Awareness

Example query #3
What are the publications of Max Planck?
Max Planck should be an instance of the concept person, not of the concept institute
Concept Awareness

SphereSearch Concepts
Goal: Increase recall & precision for hard queries on linked and heterogeneous data
• Unified search for unstructured, semistructured, and structured data from heterogeneous sources
• Graph-based model, including links
• Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, …) for concept-aware queries
• Flexible yet simple abstraction-aware query language with context-aware scoring
• Compactness-based scores

Some Related Work
• Web Query Languages, e.g., W3QS [VLDB 95], WebOQL [ICDE 98], …
• Web IR with thesauri, e.g., Qiu et al. [SIGIR 93], Liu et al. [SIGIR 04], …
• XML IR, e.g., XXL [WebDB 00], XIRQL [SIGIR 01], XSearch [VLDB 03], XRank [SIGMOD 03], …
• Information extraction, e.g., Lixto, KnowItAll, …
• Advanced Web graph IR, e.g., BANKS [ICDE 02], Hristidis et al. [VLDB 03], …

Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work

Unifying Search on Heterogeneous Data
[Figure: sources (Web, Intranet, Databases, Enterprise Information Systems, …) are converted to XML by heuristics and type-specific transformations]

Heuristic Transformation of HTML
Goal: Transform layout tags to semantic annotations
• Headlines:
  <h1>Experiments</h1> <h2>Settings</h2> We evaluated... <h2>Results</h2> Our system...
  becomes
  <Experiments> <Settings>...</Settings> <Results>...</Results> </Experiments>
• Patterns:
  <b>Topic: </b>XML becomes <Topic>XML</Topic>
• Rules for tables, lists, …
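The headline rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SphereSearch implementation: the function name `headlines_to_xml`, the regex-based splitting, and the restriction to well-formed `<hN>` tags whose titles are valid tag names are all hypothetical choices.

```python
import re

def headlines_to_xml(html: str) -> str:
    """Turn <hN>Title</hN> headings into nested <Title>...</Title> elements."""
    # With two capture groups, re.split yields:
    # text, level, title, text, level, title, text, ...
    parts = re.split(r"<h([1-6])>(.*?)</h\1>", html)
    out = [parts[0].strip()]
    stack = []  # (level, tag) of the sections that are currently open
    for i in range(1, len(parts), 3):
        level, title, text = int(parts[i]), parts[i + 1], parts[i + 2]
        # Assumption: the title becomes the tag name after replacing non-word characters.
        tag = re.sub(r"\W+", "_", title.strip())
        # A new heading closes all open sections at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            out.append(f"</{stack.pop()[1]}>")
        out.append(f"<{tag}>")
        stack.append((level, tag))
        if text.strip():
            out.append(text.strip())
    # Close whatever is still open at the end of the document.
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(p for p in out if p)
```

Calling it on the slide's example input (`<h1>Experiments</h1><h2>Settings</h2>We evaluated...<h2>Results</h2>Our system...`) nests the two `h2` sections inside an `<Experiments>` element.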

(Almost) Generic XML Data Model
<Professor> Gerhard Weikum <Course> IR </Course> Saarbrücken <Research> XML </Research> </Professor>
Element nodes:
1: docid=1 tag="Professor" content="Gerhard Weikum Saarbrücken"
2: docid=1 tag="Course" content="IR"
3: docid=1 tag="Research" content="XML"
Figure labels: person, location
• Tags annotate content with the corresponding concept
• Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction

Information Extraction (IE)
• Named Entity Recognition (NER)
• Named Entity ~ abstract datatype, concept (location, person, …, IP-address)
• Mature (out-of-the-box products, e.g. GATE/ANNIE)
• Extensible
Input text:
  The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7 pm.
Annotated text:
  The <company> Pelican Hotel </company> in <location> Salvador </location>, operated by <person> Roberto Cardoso </person>, offers comfortable rooms starting at <price> $100 </price> a night, including breakfast. Please check in before <time> 7 pm </time>.

Unifying Search on Heterogeneous Data
[Figure: sources (Web, Intranet, Databases, Enterprise Information Systems, …) are converted to XML by heuristics and type-specific transformations, then to annotated XML by annotation of named entities with IE tools (e.g., GATE)]

Annotation-Aware Data Model
<Professor> Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research> </Professor>
Before annotation:
1: docid=1 tag="Professor" content="Gerhard Weikum Saarbrücken"
2: docid=1 tag="Course" content="IR"
3: docid=1 tag="Research" content="XML"
Annotation with GATE: "Saarbrücken" is of type "location". Annotation introduces new tags:
1: docid=1 tag="Professor" content="Gerhard Weikum"
2: docid=1 tag="Course" content="IR"
3: docid=1 tag="Research" content="XML"
4: docid=1 tag="location" content="Saarbrücken"
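A minimal sketch of how an annotation might split a node, following the Saarbrücken example on this slide. The `Node` class and the `annotate` function are hypothetical names introduced for illustration, and the parent/child edges of the real graph model are left out.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    docid: int
    tag: str
    content: str

def annotate(nodes, node_id, phrase, concept):
    """Move `phrase` out of one node's content into a new node tagged `concept`."""
    target = next(n for n in nodes if n.node_id == node_id)
    if phrase not in target.content:
        return nodes  # nothing to annotate
    # Remove the phrase and normalise whitespace in what remains.
    target.content = " ".join(target.content.replace(phrase, " ").split())
    # Assumption: the new node simply gets the next free id in the same document.
    new_id = max(n.node_id for n in nodes) + 1
    nodes.append(Node(new_id, target.docid, concept, phrase))
    return nodes

nodes = [
    Node(1, 1, "Professor", "Gerhard Weikum Saarbrücken"),
    Node(2, 1, "Course", "IR"),
    Node(3, 1, "Research", "XML"),
]
annotate(nodes, 1, "Saarbrücken", "location")
```

After the call, node 1 holds only "Gerhard Weikum" and a fourth node with tag "location" holds "Saarbrücken", matching the node list on the slide.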

Data Model for Links

Architecture
[Figure: layered architecture. Labels, from bottom to top:
• Sources: Web Portal, Schedule, SIGIR Website, Hotel, Flight, Graupmann Homepage, Tourist Guide (XML), …
• Adapters: Web Adapter, XML, EMail Adapter, …
• Annotators: IE Processor with Annotation Modules (DATE, PRICE, LOCATION, …)
• Index Engine and Search Engine
• Example annotations: Event=SIGIR, FROM=SIGIR, SUBJECT=Notification, Location=Frankfurt, Location=Salvador, Date=15-18 August, Price=89 $, Person=Schenkel, Time=13:15]

Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work

SphereSearch Queries
Extended keyword queries:
• similarity conditions: ~professor, ~Saarbrücken
• concept-based conditions: person=Max Planck, location=Trondheim
• grouping
• join conditions
Ranked results with context-aware scoring

Score Aggregation: SphereScore
• Local score sL(e) for each element e (tf/idf, BM25, …)
• Sphere score, e.g. s(1) for element 1: weighted aggregation of local scores in the environment of the element
• Rewards proximity of terms and compactness of term distribution
Context awareness
[Figure: element graph with numbered nodes and the query terms "research" and "XML"]
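The slide does not spell out the weighting, so the following is a sketch under an explicit assumption: local scores of all elements within a fixed number of hops are summed, each damped by `alpha ** distance`. The names `sphere_score`, `radius`, and `alpha`, and the adjacency-list graph format, are hypothetical.

```python
from collections import deque

def sphere_score(graph, local, start, radius=2, alpha=0.5):
    """Sum local scores within `radius` hops of `start`, damped by alpha**distance."""
    dist = {start: 0}
    queue = deque([start])
    total = 0.0
    # Breadth-first search visits every element once, at its shortest distance.
    while queue:
        node = queue.popleft()
        d = dist[node]
        total += (alpha ** d) * local.get(node, 0.0)
        if d == radius:
            continue  # do not expand beyond the sphere boundary
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = d + 1
                queue.append(neighbour)
    return total

# Hypothetical element graph: 1 - 2, 1 - 3, 3 - 4; elements 2 and 4 match a query term.
graph = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
local = {2: 1.0, 4: 1.0}
score = sphere_score(graph, local, start=1)
```

With these toy values, element 2 at distance 1 contributes 0.5 and element 4 at distance 2 contributes 0.25, so matches closer to the element count more, which is the proximity reward named on the slide.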

Similarity Conditions
Similarity conditions like ~professor, ~Saarbrücken
• Disambiguation
• Query expansion: δ-exp(x) = {w | sim(x, w) > δ}
• Local score: weighted max over all expansion terms
  sL(e, ~professor) = max over t ∈ δ-exp(professor) of { sim(professor, t) * sL(e, t) }
Abstraction awareness
Thesaurus/Ontology: concepts, relationships, glosses from WordNet, Gazetteers, Web forms & tables, Wikipedia
[Figure: ontology graph around "professor" with the nodes alchemist, primadonna, artist, director, wizard, investigator, intellectual, researcher, scientist, scholar, "academic, academician, faculty member", educator, lecturer, mentor, teacher; one edge is labeled HYPONYM (0.7)]
Relationships quantified by statistical co-occurrence measures
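The expansion and the weighted-max local score can be written down directly from the two formulas on this slide. The function names, the dictionary-based similarity table, and the numeric values are hypothetical; treating a term as fully similar to itself is an added assumption.

```python
def expand(term, sim, delta):
    """delta-exp(term): all words whose similarity to `term` exceeds delta."""
    related = {w: s for w, s in sim.get(term, {}).items() if s > delta}
    related[term] = 1.0  # assumption: a term always matches itself with similarity 1
    return related

def local_score_sim(element_scores, term, sim, delta):
    """sL(e, ~term) = max over t in delta-exp(term) of sim(term, t) * sL(e, t)."""
    return max(
        s * element_scores.get(t, 0.0)
        for t, s in expand(term, sim, delta).items()
    )

# Hypothetical similarities and per-term local scores sL(e, t) of one element e.
sim = {"professor": {"lecturer": 0.8, "wizard": 0.1}}
element_scores = {"lecturer": 0.5, "wizard": 1.0}
score = local_score_sim(element_scores, "professor", sim, delta=0.5)
```

Here "wizard" falls below the threshold and is dropped, so the element is scored through "lecturer" only, even though it never mentions "professor".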

Concept-based conditions
Goal: Exploit explicit (tags) and automatic annotations in documents
Condition location=Trondheim (concept=value), matched against an element e: docid=1 tag="location" content="Trondheim"
sL(e, c=v) = score for concept-tag match + concept-specific score for value-content match
Allows similarity and range queries (for annotated concepts) with concept-specific distance measures, like
• location~Trondheim
• 1970<date<1980
Concept awareness
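As a rough sketch of the sum above, the following scores one element against a condition `concept=value`. Everything concrete here is an assumption made for illustration: the 0/1 tag score, the exact-match default, the `year_closeness` measure for dates, and the choice to return 0 when the tag does not match.

```python
def exact(value, content):
    """Default value-content match: 1.0 for a case-insensitive exact match."""
    return 1.0 if value.lower() == content.lower() else 0.0

def year_closeness(value, content):
    """Hypothetical concept-specific measure for dates: closer years score higher."""
    return 1.0 / (1.0 + abs(int(value) - int(content)))

# Concepts without an entry fall back to exact matching.
VALUE_SIM = {"date": year_closeness}

def local_score_concept(element, concept, value):
    """sL(e, c=v) = score for concept-tag match + score for value-content match."""
    tag_score = 1.0 if element["tag"] == concept else 0.0
    if tag_score == 0.0:
        return 0.0  # assumption: without a tag match the element does not qualify
    value_score = VALUE_SIM.get(concept, exact)(value, element["content"])
    return tag_score + value_score

e = {"docid": 1, "tag": "location", "content": "Trondheim"}
score = local_score_concept(e, "location", "Trondheim")
```

Registering a measure per concept in `VALUE_SIM` is one simple way to plug in concept-specific distances such as the date example.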

Query Groups
Goal: Related terms should occur in the same context
Group conditions that relate to the same "entity":
  professor teaching IR research XML
becomes
  professor T(teaching IR) R(research XML)
• SphereScore computed for each group
• Find compact sets with one result for each group

Scores for Query Results
• Query result R: one result per query group
• compactness ~ 1/size of a minimal spanning tree
Context awareness
[Figure: example graphs with result nodes A, B, X and numbered edge weights, illustrating spanning trees of different sizes]
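A small sketch of the compactness measure: given pairwise distances between the result nodes of a query result, compute the weight of a minimum spanning tree with Prim's algorithm and take its reciprocal. The function names, the `frozenset`-keyed distance table, and the example distances are hypothetical; how distances are derived from the data graph is not specified on the slide.

```python
def mst_weight(nodes, dist):
    """Total weight of a minimum spanning tree (Prim) over a complete graph."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Pick the cheapest edge that connects the tree to a node outside it.
        weight, nxt = min(
            (dist[frozenset((u, v))], v)
            for u in in_tree
            for v in nodes
            if v not in in_tree
        )
        total += weight
        in_tree.add(nxt)
    return total

def compactness(nodes, dist):
    """compactness ~ 1 / size of a minimum spanning tree over the result nodes."""
    size = mst_weight(nodes, dist)
    # Assumption: a single-node result is treated as maximally compact.
    return 1.0 / size if size > 0 else float("inf")

# Hypothetical distances between the results of three query groups.
dist = {
    frozenset(("A", "X")): 1.0,
    frozenset(("X", "B")): 2.0,
    frozenset(("A", "B")): 3.0,
}
c = compactness(["A", "B", "X"], dist)
```

The tree uses the edges A-X and X-B with total size 3, so results whose nodes lie closer together get a higher compactness.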

Join conditions
Goal: Connect results of different query groups
  A(research, XML) B(VLDB 2005 paper) A.person=B.person
[Figure: results for A (research, XML, Ralf Schenkel) and B (VLDB, 2004, 2005, R. Schenkel) connected by join links; the weights 1.0 and 0.9 appear in the figure]
• Join conditions do not change the score for a node
• Join conditions create a new link with a specific weight
Join links, dependent on database size and application, are either
• Precomputed
• Computed during query execution

Score for Join Conditions
Join condition A.T=B.S:
• For all nodes n1 with type T and n2 with type S, add edge (n1, n2) with weight (sim(n1, n2))^-1
• sim(n1, n2): content-based similarity
[Figure: example graph with result nodes A, B, X and edge weights, including an added join edge]
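The edge rule above can be sketched as follows. The node dictionaries, the function names, and the token-level Jaccard similarity standing in for "content-based similarity" are assumptions for illustration; pairs with similarity 0 are skipped to avoid dividing by zero.

```python
def jaccard(a, b):
    """Hypothetical content-based similarity: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def add_join_edges(edges, nodes, type_t, type_s, sim=jaccard):
    """For join condition A.T=B.S, add edge (n1, n2) with weight 1 / sim(n1, n2)."""
    for n1 in nodes:
        if n1["tag"] != type_t:
            continue
        for n2 in nodes:
            if n2["tag"] != type_s or n1 is n2:
                continue
            s = sim(n1["content"], n2["content"])
            if s > 0.0:  # dissimilar nodes get no join edge
                edges[(n1["id"], n2["id"])] = 1.0 / s
    return edges

nodes = [
    {"id": 1, "tag": "person", "content": "Ralf Schenkel"},
    {"id": 2, "tag": "person", "content": "R. Schenkel"},
    {"id": 3, "tag": "location", "content": "Trondheim"},
]
edges = add_join_edges({}, nodes, "person", "person")
```

Because the weight is the inverse similarity, near-identical values produce short edges, which keeps joined results compact, while weak matches produce long edges.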

Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work

Setup for Experiments
No existing benchmark (INEX, TREC, …) fits
Three corpora:
• Wikipedia
• extended Wikipedia with links to IMDB
• extended DBLP corpus with links to homepages
50 queries like
• A(actor birthday 1970<date<1980) western
• G(California, governor) M(movie)
• A(Madonna, husband) B(director) A.person=B.directo
Opponent: keyword queries with standard TF/IDF-based score ("simplified Google")

Incremental Language Levels
• SSE-Join (join conditions)
• SSE-QG (query groups)
• SSE-CV (concept-based conditions)
• SSE-basic (keywords, SphereScores)

Experimental Results on Wikipedia

Experimental Results on Wiki++ and DBLP++
• SphereScores better than local scores
• New SSE features nearly double precision

Current and Future Work
• Improve graphical user interface
• Refined type-specific similarity measures (like geographic distances) [SIGIR-WS 2005]
• Deep Web search through automatic portal queries
• Parameter tuning with relevance feedback
• Efficiency of query evaluation through precomputation and integrated top-k (TopX talk this afternoon)

Thank you!