

Text Search over XML Documents
Jayavel Shanmugasundaram
Cornell University

The HTML World
<body>
  <h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
  <p> The workshop was held on 28 July 2000. The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer </p>
  <h2> XQL and Proximal Nodes </h2>
  <p> The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below. </p>
  <p> We consider the recently proposed language … </p>
  <p> The paper references the following papers: <a href="http://www.acm.org/www8/paper/xmlql"> … </a> … </p>
  …

The XML World
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …

Key Aspect of XML
• Captures text and structure (semantics)
• Applications
  – Digital libraries
  – Content management
• Many such XML repositories already available
  – IEEE INEX collection
  – Library of Congress documents
  – Shakespeare’s plays
  – SIGMOD, DBLP, …

Searching XML Repositories
• Confluence of Information Retrieval (text) and Database (structure) techniques
• A spectrum of possibilities
  – “Pure” Keyword Search
  – Keyword Search in Context
  – Full-Text + DB Queries

Outline
• Pure Keyword Search
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion

Keyword Search over Unstructured Data
• Input: query keywords over hyperlinked HTML documents
• Output: ranked results

Keyword Search over XML [Guo et al., SIGMOD 2003]
• Input: query keywords over a mix of hyperlinked XML and HTML documents
• Output: ranked results

XRANK System
• Semi-structured XML documents
  – Predefined schema not necessary
• Hyperlink-based ranking
  – Exploit hyperlinked nature of XML for ranking
• Generalize unstructured keyword search
  – Can query a mix of XML and HTML documents (Internet, corporate intranets)
  – “Google for XML” (also Google for HTML!)

Outline
• Pure Keyword Search
  – Design Principles
  – Indexing and Query Processing
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion

Design Principles
1) Return most specific element containing the query keywords
2) Ranking has to be done at the granularity of elements
3) Generalize HTML keyword search

Data Model
<workshop>
  date: 28 July …
  <title>: XML and …
  <editors>: David Carmel …
  <proceedings>
    <paper>
      <title>: XQL and …
      <author>: Ricardo …
      …
    <paper> …
Edge types: containment edge, hyperlink edge

ElemRank
• Captures importance of an element
• Analogous to Google’s PageRank
  – But computed at granularity of elements
  – Exploit hyperlink edges and containment edges
• Naturally generalizes Google’s PageRank
  – Random walk interpretation

PageRank [Brin & Page 1998]
• Example: a page with three outgoing hyperlink edges follows each with probability d/3
• d: Probability of following hyperlink
• 1 − d: Probability of random jump

ElemRank
• Example: an element with three outgoing hyperlink edges (d1/3 each), two subelements reached by containment edges (d2/2 each), and a parent (d3)
• d1: Probability of following hyperlink
• d2: Probability of visiting a subelement
• d3: Probability of visiting parent
• 1 − d1 − d2 − d3: Probability of random jump
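
The random-walk reading of ElemRank can be sketched as a small power iteration. This is a simplified illustration, not the exact formula from the XRANK paper: the function name `elem_rank`, the default values of d1, d2 and d3, and the rule that an unusable move (no hyperlinks, no subelements, or no parent) falls back to a random jump are all assumptions made for the example.

```python
def elem_rank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Simplified ElemRank via power iteration (illustrative assumptions only)."""
    nodes = list(children)
    n = len(nodes)
    # Containment edges, inverted: child -> parent.
    parent = {c: p for p, kids in children.items() for c in kids}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            jump = 1.0 - d1 - d2 - d3          # probability of a random jump
            links = hyperlinks.get(u, [])
            if links:                           # follow a hyperlink edge
                for v in links:
                    nxt[v] += d1 * rank[u] / len(links)
            else:
                jump += d1                      # assumption: fall back to a jump
            kids = children[u]
            if kids:                            # visit a subelement
                for v in kids:
                    nxt[v] += d2 * rank[u] / len(kids)
            else:
                jump += d2
            if u in parent:                     # visit the parent
                nxt[parent[u]] += d3 * rank[u]
            else:
                jump += d3
            for v in nodes:                     # spread the jump mass uniformly
                nxt[v] += jump * rank[u] / n
        rank = nxt
    return rank


# Hypothetical toy document: paper2 cites (hyperlinks to) paper1.
children = {
    "workshop": ["title", "proceedings"],
    "title": [],
    "proceedings": ["paper1", "paper2"],
    "paper1": [],
    "paper2": [],
}
hyperlinks = {"paper2": ["paper1"]}
ranks = elem_rank(children, hyperlinks)
```

In this toy graph the two papers are otherwise symmetric, so the extra hyperlink edge is the only reason paper1 ends up ranked above paper2.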

Outline
• Pure Keyword Search
  – Design Principles
  – Indexing and Query Processing
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion

System Architecture
• XML/HTML documents feed the ElemRank computation
• XML elements with ElemRanks are stored in the Hybrid Dewey Inverted List
• Query evaluator takes the keyword query, performs data access on the Hybrid Dewey Inverted List, and returns ranked results
  – Compute top-k query results as per definition of ranking

Naïve Method
Element IDs: <workshop> 1, date 2, <title> 3, <editors> 4, <proceedings> 5, <paper> 6, <title> 7, <author> 8
Naïve inverted lists:
  Ricardo: 1; 5; 6; 8
  XQL: 1; 5; 6; 7
Problems:
1. Space Overhead
2. Spurious Results
Main issue: Decouples representation of ancestors and descendants

Dewey Encoding of IDs [1850s]
0          <workshop>
0.0          date: 28 July …
0.1          <title>: XML and …
0.2          <editors>: David Carmel …
0.3          <proceedings>
0.3.0          <paper>
0.3.0.0          <title>: XQL and …
0.3.0.1          <author>: Ricardo …
0.3.1          <paper> …
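
The numbering above can be reproduced with a short traversal. The sketch below is an illustration under stated assumptions: the function name `dewey_ids` is hypothetical, attributes are numbered before child elements (which matches the date attribute receiving 0.0 on the slide), and attribute names are prefixed with "@" only to tell them apart from elements.

```python
import xml.etree.ElementTree as ET


def dewey_ids(xml_text):
    """Return (dewey_id, label) pairs in document order."""
    root = ET.fromstring(xml_text)
    out = []

    def visit(elem, dewey):
        out.append((dewey, elem.tag))
        # Assumption: attributes come first, then child elements.
        kids = [("@" + name, None) for name in elem.attrib]
        kids += [(child.tag, child) for child in elem]
        for i, (label, child) in enumerate(kids):
            child_id = f"{dewey}.{i}"   # parent's ID is a prefix of the child's ID
            if child is None:
                out.append((child_id, label))
            else:
                visit(child, child_id)

    visit(root, "0")
    return out


# Abbreviated version of the workshop document (text content shortened).
DOC = (
    '<workshop date="28 July 2000">'
    "<title>XML and ...</title>"
    "<editors>David Carmel ...</editors>"
    "<proceedings>"
    "<paper><title>XQL and ...</title><author>Ricardo ...</author></paper>"
    "<paper/>"
    "</proceedings>"
    "</workshop>"
)
ids = dict(dewey_ids(DOC))
```

Because every ID carries the IDs of all its ancestors as prefixes, an ancestor test reduces to a prefix comparison.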

Dewey Inverted List (DIL)
Keyword   Dewey Id    ElemRank   Position List
XQL       5.0.3.0.0   85         32
          8.0.3.8.3   38         89, 91
          …           …          …
Ricardo   5.0.3.0.1   82         38
          8.2.1.4.2   99         52
          …           …          …
• Sorted by Dewey Id
• Store IDs of elements that directly contain keyword
  – Avoids space overhead

DIL: Query Processing
• Merge query keyword inverted lists in Dewey ID order
  – Entries with common prefixes are processed together
• Compute Longest Common Prefix of Dewey IDs during the merge
  – Longest common prefix ensures most specific results
  – Also suppresses spurious results
• Keep top-k results seen so far in output heap
  – Output contents of output heap after scanning inverted lists
• Algorithm works in a single scan over inverted lists
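
The effect of the longest-common-prefix step can be illustrated with a set-based sketch. It is not the single-scan merge described above and it ignores ranking and the output heap; it only shows which elements count as "most specific" for a conjunctive query. The function name `most_specific_results` is hypothetical, and Dewey IDs are represented as tuples of integers.

```python
def most_specific_results(inverted_lists):
    """Most specific elements whose subtree contains every query keyword.

    inverted_lists maps keyword -> list of Dewey IDs (tuples) of the
    elements that directly contain that keyword.
    """
    covered = []
    for postings in inverted_lists.values():
        # Every prefix of a Dewey ID is an ancestor (or the element itself).
        ancestors = set()
        for dewey in postings:
            for i in range(1, len(dewey) + 1):
                ancestors.add(dewey[:i])
        covered.append(ancestors)
    # Elements that contain all keywords somewhere in their subtree.
    candidates = set.intersection(*covered)
    # Keep a candidate only if no descendant is also a candidate.
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )


# Dewey IDs taken from the DIL example for the keywords XQL and Ricardo.
results = most_specific_results({
    "XQL": [(5, 0, 3, 0, 0), (8, 0, 3, 8, 3)],
    "Ricardo": [(5, 0, 3, 0, 1), (8, 2, 1, 4, 2)],
})
```

For these lists the results are 5.0.3.0 (the longest common prefix of 5.0.3.0.0 and 5.0.3.0.1) and 8 (the two occurrences under 8 share only the first component).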

Ranked Dewey Inverted List (RDIL)
• One inverted list per keyword (e.g., XQL, Ricardo), sorted by ElemRank
• One B+-tree on Dewey Id per inverted list
• Needs random probes on index

Motivation for DIL/RDIL Hybrid
• Correlation of query keywords: probability that the query keywords occur in same element
  – High correlation: RDIL likely to outperform DIL by stopping early
  – Low correlation: DIL likely to outperform RDIL because RDIL has to scan most (or entire) inverted list
• Dilemma
  – Depending on the correlation, either DIL or RDIL is likely to outperform the other
  – But they require inverted lists to be sorted in different orders
• Challenges
  – Get benefits of DIL and RDIL without doubling space?
  – How can keyword correlation be determined?

Hybrid Dewey Inverted List (HDIL)
• Per keyword (e.g., XQL): full inverted list sorted by Dewey Id, short list sorted by ElemRank, B+-tree on Dewey Id
• RDIL is better only when it scans little of inverted list
  – Short list sorted by ElemRank: saves space!
• Can reuse full inverted list as leaf of B+-tree
  – Saves space!

HDIL: Algorithm
• Start with RDIL (to learn correlation)
• Periodically calculate
  – time spent so far: t
  – number of results above threshold: r
  – expected remaining time: (m-r)*t/r, where m is desired number of query results
• If expected time for RDIL exceeds that for DIL, switch to DIL, else stick to RDIL
• Expected time for DIL can easily be calculated a priori because DIL scans the entire inverted list
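
The switching rule is simple arithmetic and can be written down directly. In this sketch the function names and the `dil_time` parameter (the a priori DIL estimate) are hypothetical, and treating r = 0 as an unbounded estimate is an assumption the slide does not state.

```python
def expected_rdil_remaining(t, r, m):
    """Expected remaining RDIL time, (m - r) * t / r.

    t: time spent so far, r: results above threshold so far,
    m: desired number of query results.
    """
    if r == 0:
        # Assumption: with no results yet, the estimate is unbounded.
        return float("inf")
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, dil_time):
    """Switch when the RDIL estimate exceeds the a priori DIL estimate."""
    return expected_rdil_remaining(t, r, m) > dil_time


# Example: 4 of 10 desired results found after 2.0 time units.
remaining = expected_rdil_remaining(t=2.0, r=4, m=10)   # (10 - 4) * 2.0 / 4
```

With these numbers the estimate is 3.0 time units, so the evaluator would switch to DIL only if a full DIL scan is expected to take less than that.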

DBLP: Low Correlation Keywords

Outline
• Pure Keyword Search
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion

Repositories: a collection of INEX IEEE, SIGMOD Record, Shakespeare's Plays (<3%), …; and Shakespeare's Plays alone
• Find relevant elements in Shakespeare’s plays about ‘the process of speech’
• 9 of top 10 results for one repository were not in the top 10 results of other repository
  – XIRQL’s [Fuhr & Großjohann, SIGIR 2001] TF-IDF scoring

Explaining the Results
• TF-IDF scoring for a keyword k:
  – TF (Term Frequency): # occurrences of k in element
    • Usually normalized by some factor
  – IDF (Inverse Document Frequency): (# elements)/(# elements that contain k)
• Score = sum of TF*IDF for all query keywords
• Main reason for skewed results
  – Language of engineers very different from language of Shakespeare!
  – ‘process’ common in INEX, ‘speech’ uncommon
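
A small sketch shows how the same element receives different scores depending on the collection used for the IDF statistics. It uses the unnormalized TF and the plain ratio IDF exactly as defined above; the function name `tf_idf_score` and the toy token lists are hypothetical and are not data from the experiments.

```python
def tf_idf_score(element, corpus, keywords):
    """Sum of TF * IDF over the query keywords.

    element: list of tokens; corpus: list of elements (token lists).
    """
    score = 0.0
    for k in keywords:
        tf = element.count(k)                          # occurrences of k in element
        containing = sum(1 for e in corpus if k in e)  # elements that contain k
        if containing == 0:
            continue                                   # keyword absent from corpus
        idf = len(corpus) / containing
        score += tf * idf
    return score


# Hypothetical toy corpus: 'process' is common, 'speech' is rare.
play = ["process", "speech"]
combined = [play, ["process"], ["process"], ["process", "data"]]
query = ["process", "speech"]

score_combined = tf_idf_score(play, combined, query)   # IDF from the whole collection
score_in_context = tf_idf_score(play, [play], query)   # IDF from the context only
```

In the combined collection 'speech' dominates the score because it is rare there, whereas within the restricted context both keywords weigh the same.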

Repository: INEX IEEE, SIGMOD Record, Shakespeare's Plays (<3%), …
• Need a way to efficiently compute IDF (or other corpus scoring statistic) “on-the-fly”

Context-Sensitive Ranking [Botev & Shanmugasundaram, WebDB 2005]
• Use Dewey inverted lists + context B+-trees
• Two pass algorithm
  – First pass: collect statistics
  – Second pass: compute results (entries cached from first pass)

Outline
• Pure Keyword Search
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion

Motivation
• Many new applications require sophisticated DB queries + “complex” full-text search
  – Example: Library of Congress documents in XML
• Current XML query languages are mostly “database” languages
  – Examples: XQuery, XPath
• Provide very rudimentary text/IR support
  – fn:contains(e, keywords)
• No support for complex IR queries
  – Distance predicates, stemming, scoring, …

Example Queries
• From XQuery Full-Text Use Cases Document
  – Find the titles of the books whose body contains the phrases “Usability” and “Web site” in that order, in the same paragraph, using stemming if necessary to match the tokens
  – Find the titles of the books whose body contains “Usability” and “testing” within a window of 3 words, and return them in score order

XQuery Full-Text
• Full-text search extension to XQuery
  – W3C Working Draft
• XQuery Full-Text Evolution
  – Quark query language
    • [Botev & Shanmugasundaram], 2003
  – TeXQuery
    • WWW 2004 [Amer-Yahia, Botev, Shanmugasundaram]
  – XQuery Full-Text
    • http://www.w3.org/TR/xquery-full-text
    • Invited experts (Botev and Shanmugasundaram)

XQuery Primer
• Find the titles of books:
    //book/title
• Find the titles of books with price < 25:
    //book[./price < 25]/title
• Find books written by Dawkins, in order of price:
    for $b in //book[./author = 'Dawkins']
    order by $b/price
    return $b
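
For readers who do not know XQuery, the three primer queries can be mirrored with Python's standard `xml.etree.ElementTree`. This is only an analogy: the helper names and the catalog document are hypothetical, and the filtering is done in Python because ElementTree's limited XPath support has no numeric comparisons.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog used only for illustration.
CATALOG = ET.fromstring("""
<catalog>
  <book><title>A</title><author>Dawkins</author><price>30</price></book>
  <book><title>B</title><author>Dawkins</author><price>12</price></book>
  <book><title>C</title><author>Knuth</author><price>20</price></book>
</catalog>
""")


def titles(root):
    """Analogue of //book/title (first title of each book)."""
    return [b.findtext("title") for b in root.iter("book")]


def titles_cheaper_than(root, limit):
    """Analogue of //book[./price < limit]/title."""
    return [
        b.findtext("title")
        for b in root.iter("book")
        if float(b.findtext("price")) < limit
    ]


def books_by_author_in_price_order(root, author):
    """Analogue of the for / order by / return query."""
    books = [b for b in root.iter("book") if b.findtext("author") == author]
    return sorted(books, key=lambda b: float(b.findtext("price")))


dawkins_titles = [
    b.findtext("title")
    for b in books_by_author_in_price_order(CATALOG, "Dawkins")
]
```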

Syntax Overview
• Two new XQuery constructs
1) FTContainsExpr
  • Expresses “Boolean” full-text search predicates
  • Seamlessly composes with other XQuery expressions
2) FTScoreClause
  • Extension to FOR expression
  • Can score FTContainsExpr and other expressions

FTContainsExpr
ContextExpr ftcontains FTSelection
  – ContextExpr (any XQuery expression) is context spec
  – FTSelection is search spec
  – Returns true iff at least one node in ContextExpr satisfies the FTSelection
• Examples
  – //book ftcontains 'Usability' && 'testing' distance 5
  – //book[./content ftcontains 'Usability' with stems]/title
  – //book ftcontains /article[author='Dawkins']/title
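
A word-proximity predicate of the kind these selections express can be sketched over token positions. This is an illustration, not the XQuery Full-Text semantics: the function name `within_window`, whitespace tokenization, case-insensitive matching, and the reading of a window as "both words fit in a span of that many tokens" are all assumptions.

```python
def within_window(text, word_a, word_b, window):
    """True if some occurrence of word_a and word_b fits in `window` tokens."""
    tokens = text.lower().split()          # naive tokenizer, no punctuation handling
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    # A pair at positions i and j spans abs(i - j) + 1 tokens.
    return any(abs(i - j) + 1 <= window for i in pos_a for j in pos_b)


near = within_window("Usability and testing", "Usability", "testing", 3)
far = within_window("usability of the new site testing", "Usability", "testing", 3)
```

The position lists stored in the Dewey inverted lists are the kind of per-keyword information that makes such position-based checks possible without rescanning the text.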

FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
ORDER BY …
RETURN

Example
FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s
RETURN $b

Example
FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
ORDER BY $s
RETURN $b

Example
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s
RETURN $b

Quark @ Cornell
• An efficient implementation of XQuery Full-Text
• http://www.cs.cornell.edu/database/quark

Outline
• Pure Keyword Search
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion

10000 Foot View of Data Management
[Diagram] Axes: Queries (Ranked Search; Complex and Structured) and Data (Structured; Unstructured). Regions: Information Retrieval Systems, Database Systems.

Related Work
• Semi-structured ranked keyword search
  – XIRQL [Fuhr and Großjohann, 2001]
  – XXL [Theobald and Weikum, 2001]
  – Commercial search engines [Luk et al., 2000]
  – SGML documents [Myaeng et al., 2001]
• Keyword search over databases
  – BANKS [Bhalotia et al., 2002]
  – DBXplorer [Agrawal et al., 2002]
  – DISCOVER [Hristidis et al., 2002]
  – LORE [Goldman et al., 1999]

Space Requirements
              DBLP                    XMARK
              Inv. List   Index       Inv. List   Index
Naïve-ID      258 MB      N/A         872 MB      N/A
Naïve-Rank    258 MB      217 MB      872 MB      527 MB
DIL           144 MB      N/A         254 MB      N/A
RDIL          144 MB      156 MB      254 MB      209 MB
HDIL          186 MB      7 MB        307 MB      3.2 MB

DBLP: High Correlation Keywords