Text Search over XML Documents Jayavel Shanmugasundaram Cornell University
The HTML World
<body>
  <h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
  <p> The workshop was held on 28 July 2000. The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer </p>
  <h2> XQL and Proximal Nodes </h2>
  <p> The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below. </p>
  <p> We consider the recently proposed language … </p>
  <p> The paper references the following papers: <a href="http://www.acm.org/www8/paper/xmlql"> … </a> … </p>
  …
The XML World
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
Key Aspect of XML
• Captures text and structure (semantics)
• Applications
  – Digital libraries
  – Content management
• Many such XML repositories already available
  – IEEE INEX collection
  – Library of Congress documents
  – Shakespeare’s plays
  – SIGMOD, DBLP, …
Searching XML Repositories
• Confluence of Information Retrieval (text) and Database (structure) techniques
• A spectrum of possibilities
  – “Pure” Keyword Search
  – Keyword Search in Context
  – Full-Text + DB Queries
Outline
• Pure Keyword Search
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion
Keyword Search over Unstructured Data
[Diagram labels: Query Keywords, Hyperlinked HTML Documents, Ranked Results]
Keyword Search over XML [Guo et al., SIGMOD 2003]
[Diagram labels: Query Keywords, Mix of Hyperlinked XML and HTML Documents, Ranked Results]
XRANK System
• Semi-structured XML documents
  – Predefined schema not necessary
• Hyperlink-based ranking
  – Exploit hyperlinked nature of XML for ranking
• Generalize unstructured keyword search
  – Can query a mix of XML and HTML documents (Internet, corporate Intranets)
  – “Google for XML” (also Google for HTML!)
Outline
• Pure Keyword Search
  – Design Principles
  – Indexing and Query Processing
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
Design Principles 1) Return most specific element containing the query keywords
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    <paper id="2"> …
Design Principles
1) Return most specific element containing the query keywords
2) Ranking has to be done at the granularity of elements
Design Principles
1) Return most specific element containing the query keywords
2) Ranking has to be done at the granularity of elements
3) Generalize HTML keyword search
Data Model
[Figure: element tree; legend: Containment edge, Hyperlink edge]
<workshop>
  date: 28 July …
  <title>: XML and …
  <editors>: David Carmel …
  <proceedings>
    <paper>
      <title>: XQL and …
      <author>: Ricardo …
    <paper> …
ElemRank
• Captures importance of an element
• Analogous to Google’s PageRank
  – But computed at granularity of elements
  – Exploit hyperlink edges and containment edges
• Naturally generalizes Google’s PageRank
  – Random walk interpretation
PageRank [Brin & Page 1998]
[Figure: random walk at page w; each outgoing hyperlink edge is labeled d/3]
• d: Probability of following hyperlink
• 1-d: Probability of random jump
ElemRank
[Figure: random walk at element w; hyperlink edges are labeled d1/3, containment edges to subelements d2/2, and the containment edge to the parent d3]
• d1: Probability of following hyperlink
• d2: Probability of visiting a subelement
• d3: Probability of visiting parent
• 1-d1-d2-d3: Probability of random jump
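The random walk on this slide can be sketched as a small power iteration. This is a minimal illustration, not the XRANK implementation: the function name `elem_rank`, the default values of d1, d2 and d3, the uniform random jump over all elements, and the choice to fold the unused probability into the random jump when an element lacks hyperlinks, subelements, or a parent are all assumptions made for the example.

```python
def elem_rank(nodes, hyperlinks, children, d1=0.35, d2=0.25, d3=0.25, iters=50):
    """Power iteration for the random walk sketched on the slide.

    nodes:      list of element ids
    hyperlinks: dict element -> list of hyperlinked elements
    children:   dict element -> list of subelements (containment edges)
    d1, d2, d3: illustrative probabilities (assumed values, not from the slides)
    """
    parent = {c: p for p, cs in children.items() for c in cs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        jump = 0.0
        for u in nodes:
            mass = rank[u]
            used = 0.0
            out = hyperlinks.get(u, [])
            kids = children.get(u, [])
            if out:  # follow a hyperlink with probability d1
                for v in out:
                    nxt[v] += d1 * mass / len(out)
                used += d1
            if kids:  # visit a subelement with probability d2
                for v in kids:
                    nxt[v] += d2 * mass / len(kids)
                used += d2
            if u in parent:  # visit the parent with probability d3
                nxt[parent[u]] += d3 * mass
                used += d3
            # Assumption: whatever probability is left becomes a random jump.
            jump += (1.0 - used) * mass
        for v in nodes:
            nxt[v] += jump / n
        rank = nxt
    return rank


nodes = ["workshop", "title", "paper1", "paper2"]
children = {"workshop": ["title", "paper1", "paper2"]}
hyperlinks = {"paper2": ["paper1"]}  # paper2 cites paper1
ranks = elem_rank(nodes, hyperlinks, children)
```

In this toy graph `paper1` ends up ranked above `title`, because it receives the same containment contribution from `workshop` plus the hyperlink contribution from `paper2`.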
Outline
• Pure Keyword Search
  – Design Principles
  – Indexing and Query Processing
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion
System Architecture
[Diagram]
• XML/HTML Documents → ElemRank Computation → XML Elements with ElemRanks → Hybrid Dewey Inverted List
• Keyword query → Query Evaluator (data access to the Hybrid Dewey Inverted List) → Ranked Results
• Query Evaluator: compute top-k query results as per definition of ranking
Naïve Method
[Figure: element tree with element IDs]
1 <workshop>
  2 date: 28 July …
  3 <title>: XML and …
  4 <editors>: David Carmel …
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and …
      8 <author>: Ricardo …
    <paper> …
Naïve inverted lists:
  Ricardo: 1; 5; 6; 8
  XQL: 1; 5; 6; 7
Problems:
1. Space Overhead
2. Spurious Results
Main issue: Decouples representation of ancestors and descendants
Dewey Encoding of IDs [1850s]
[Figure: element tree with Dewey IDs]
0 <workshop>
  0.0 date: 28 July …
  0.1 <title>: XML and …
  0.2 <editors>: David Carmel …
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and …
      0.3.0.1 <author>: Ricardo …
    0.3.1 <paper> …
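A minimal sketch of how Dewey IDs like the ones on this slide could be assigned, using Python's standard `xml.etree.ElementTree`. The function name `dewey_ids` and the convention of numbering attributes before child elements (so that `date` gets 0.0, as in the figure) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def dewey_ids(xml_text):
    """Map each Dewey ID (a tuple of ints) to the node it identifies."""
    root = ET.fromstring(xml_text)
    ids = {}

    def visit(elem, dewey):
        ids[dewey] = elem.tag
        pos = 0
        # Assumption: attributes are numbered first, then child elements.
        for name in elem.attrib:
            ids[dewey + (pos,)] = "@" + name
            pos += 1
        for child in elem:
            visit(child, dewey + (pos,))
            pos += 1

    visit(root, (0,))
    return ids


doc = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper><title>XQL and Proximal Nodes</title><author>Ricardo Baeza-Yates</author></paper>
    <paper/>
  </proceedings>
</workshop>
"""
ids = dewey_ids(doc)
for dewey in sorted(ids):
    print(".".join(map(str, dewey)), ids[dewey])
```

Because the ID of every ancestor is a prefix of the ID of its descendants, a single Dewey ID encodes the whole path from the root, which is what the inverted lists on the following slides rely on.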
Dewey Inverted List (DIL)
Keyword   Dewey Id    ElemRank   Position List
XQL       5.0.3.0.0   85         32
          8.0.3.8.3   38         89 91
          …           …          …
Ricardo   5.0.3.0.1   82         38
          8.2.1.4.2   99         52
          …           …          …
• Sorted by Dewey Id
• Store IDs of elements that directly contain keyword
  – Avoids space overhead
DIL: Query Processing
• Merge query keyword inverted lists in Dewey ID Order
  – Entries with common prefixes are processed together
• Compute Longest Common Prefix of Dewey IDs during the merge
  – Longest common prefix ensures most specific results
  – Also suppresses spurious results
• Keep top-k results seen so far in output heap
  – Output contents of output heap after scanning inverted lists
• Algorithm works in a single scan over inverted lists
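The merge described above can be sketched with a stack that mirrors the current Dewey path. This is a simplified illustration rather than the XRANK algorithm itself: it returns every element that contains all keywords and has no descendant that also does, and it leaves out ElemRank, position lists, and the top-k output heap. The function name `most_specific_matches` and the input layout are assumptions.

```python
import heapq


def most_specific_matches(lists):
    """lists: dict keyword -> list of Dewey IDs (tuples), each sorted ascending.

    Returns Dewey IDs of the most specific elements containing all keywords,
    found in one pass over the merged inverted lists.
    """
    full = set(lists)
    merged = heapq.merge(*([(d, k) for d in ids] for k, ids in lists.items()))
    path, seen, blocked, results = [], [], [], []

    def pop():
        kws, blk = seen.pop(), blocked.pop()
        dewey = tuple(path)
        path.pop()
        if not blk and kws == full:
            results.append(dewey)   # most specific element with all keywords
            blk = True              # ancestors must not be reported again
        if seen:
            seen[-1] |= kws         # keywords also occur in the parent's subtree
            blocked[-1] = blocked[-1] or blk

    for dewey, kw in merged:
        # The longest common prefix with the current path decides how far to unwind.
        lcp = 0
        while lcp < min(len(path), len(dewey)) and path[lcp] == dewey[lcp]:
            lcp += 1
        while len(path) > lcp:
            pop()
        for comp in dewey[lcp:]:
            path.append(comp)
            seen.append(set())
            blocked.append(False)
        seen[-1].add(kw)
    while path:
        pop()
    return results


lists = {
    "xql": [(5, 0, 3, 0, 0), (8, 0, 3, 8, 3)],
    "ricardo": [(5, 0, 3, 0, 1), (8, 2, 1, 4, 2)],
}
matches = most_specific_matches(lists)
```

With the Dewey IDs from the DIL slide, the two entries under 5.0.3.0 share that prefix, so 5.0.3.0 is reported; the two entries under document 8 only share the root, so the whole document 8 is the most specific match.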
Ranked Dewey Inverted List (RDIL)
[Diagram: for each keyword (XQL, Ricardo), an inverted list sorted by ElemRank plus a B+-tree on Dewey Id]
• Needs random probes on index
Motivation for DIL/RDIL Hybrid
• Correlation of query keywords: probability that the query keywords occur in same element
  – High correlation: RDIL likely to outperform DIL by stopping early
  – Low correlation: DIL likely to outperform RDIL because RDIL has to scan most (or entire) inverted list
• Dilemma
  – Each of DIL and RDIL is likely to outperform the other on some queries
  – But they require inverted lists to be sorted in different orders
• Challenges
  – Get benefits of DIL and RDIL without doubling space?
  – How can keyword correlation be determined?
Hybrid Dewey Inverted List (HDIL)
[Diagram: for keyword XQL, a full inverted list sorted by Dewey Id, a B+-tree on Dewey Id, and a short list sorted by ElemRank]
• RDIL is better only when it scans little of inverted list
  – Short list sorted by ElemRank: saves space!
• Can reuse full inverted list as leaf of B+-tree
  – Saves space!
HDIL: Algorithm
• Start with RDIL (to learn correlation)
• Periodically calculate
  – time spent so far: t
  – number of results above threshold: r
  – expected remaining time: (m-r)*t/r, where m is desired number of query results
• If expected time for RDIL exceeds that for DIL, switch to DIL, else stick to RDIL
• Expected time for DIL can easily be calculated a priori because DIL scans the entire inverted list
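The switching test above amounts to a few lines of arithmetic. The sketch below is illustrative: the function name `should_switch_to_dil`, the parameter `dil_time` (the a priori DIL estimate), and the handling of r = 0 are assumptions, since the slide does not say what to do before the first result is found.

```python
def should_switch_to_dil(t, r, m, dil_time):
    """Decide whether to abandon RDIL in favour of DIL.

    t:        time spent so far in RDIL
    r:        number of results above threshold found so far
    m:        desired number of query results
    dil_time: expected time for a full DIL scan (known a priori)
    """
    if r >= m:
        return False  # already have enough results
    if r == 0:
        # Assumption: with no results yet there is no estimate, so switch
        # only once RDIL has already used as much time as DIL would need.
        return t >= dil_time
    expected_remaining = (m - r) * t / r
    return expected_remaining > dil_time


# 1 result after 2 time units, 10 wanted: (10 - 1) * 2 / 1 = 18 units remaining.
print(should_switch_to_dil(t=2.0, r=1, m=10, dil_time=5.0))
# 8 results after 2 time units: (10 - 8) * 2 / 8 = 0.5 units remaining.
print(should_switch_to_dil(t=2.0, r=8, m=10, dil_time=5.0))
```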
DBLP: Low Correlation Keywords
Outline
• Pure Keyword Search
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion
[Figure: a combined repository (INEX IEEE, SIGMOD Record, Shakespeare's Plays (<3%), ...) next to a repository containing only Shakespeare's Plays]
Find relevant elements in Shakespeare’s plays about ‘the process of speech’
• 9 of top 10 results for one repository were not in the top 10 results of other repository
  – XIRQL’s [Fuhr & Großjohann, SIGIR 2001] TF-IDF scoring
Explaining the Results
• TF-IDF scoring for a keyword k:
  – TF (Term Frequency): # occurrences of k in element
    • Usually normalized by some factor
  – IDF (Inverse Document Frequency): (# elements)/(# elements that contain k)
• Score = sum of TF*IDF for all query keywords
• Main reason for skewed results
  – Language of engineers very different from language of Shakespeare!
  – ‘process’ common in INEX, ‘speech’ uncommon
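A toy calculation makes the skew concrete. The sketch below uses the unnormalized TF and the IDF ratio exactly as written on the slide; the function name `tf_idf_score` and the tiny corpora are invented for illustration and are not data from the talk.

```python
def tf_idf_score(element, corpus, keywords):
    """Score one element (a list of tokens) against a corpus of elements."""
    score = 0.0
    for k in keywords:
        tf = element.count(k)                         # raw term frequency
        containing = sum(1 for e in corpus if k in e)
        if containing == 0:
            continue
        idf = len(corpus) / containing                # (# elements)/(# elements with k)
        score += tf * idf
    return score


# Hypothetical elements: two from the plays, three from engineering articles.
plays = [["speech", "process", "speech"], ["speech", "love"]]
articles = [["process", "algorithm"], ["process", "query"], ["process"]]
combined = plays + articles

element = plays[0]
query = ["process", "speech"]
print(tf_idf_score(element, plays, query))     # IDF computed over the plays only
print(tf_idf_score(element, combined, query))  # IDF computed over everything
```

The same element scores differently in the two settings because ‘process’ becomes common and ‘speech’ becomes rare once the engineering articles are included, which reorders the results.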
[Figure: repository containing INEX IEEE, SIGMOD Record, Shakespeare's Plays (<3%), ...]
Need a way to efficiently compute IDF (or other corpus scoring statistic) “on-the-fly”
Context-Sensitive Ranking [Botev & Shanmugasundaram, WebDB 2005]
• Use Dewey inverted lists + context B+-trees
• Two pass algorithm
  – First pass: collect statistics
  – Second pass: compute results (entries cached from first pass)
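The two-pass idea can be illustrated with plain sorted lists standing in for the Dewey inverted lists and context B+-trees. Everything here is a hedged sketch under stated assumptions, not the published algorithm: the context is assumed to be a Dewey prefix, `n_context` (the number of elements in the context) is a hypothetical parameter supplied by the caller, and the names `context_range` and `context_search` are invented.

```python
from bisect import bisect_left


def context_range(entries, prefix):
    """entries: list of (dewey, tf) sorted by Dewey ID; prefix: non-empty tuple.

    Returns the contiguous run of entries that lie inside the subtree of prefix.
    """
    lo = bisect_left(entries, (prefix,))
    next_sibling = prefix[:-1] + (prefix[-1] + 1,)
    hi = bisect_left(entries, (next_sibling,))
    return entries[lo:hi]


def context_search(index, keywords, prefix, n_context):
    # First pass: collect statistics (and cache the entries) for the context.
    cached = {k: context_range(index.get(k, []), prefix) for k in keywords}
    idf = {k: n_context / len(v) for k, v in cached.items() if v}
    # Second pass: compute scores from the cached entries.
    scores = {}
    for k, entries in cached.items():
        for dewey, tf in entries:
            scores[dewey] = scores.get(dewey, 0.0) + tf * idf[k]
    return sorted(scores.items(), key=lambda kv: -kv[1])


index = {
    "speech": [((0, 1), 2), ((1, 0), 1), ((1, 2), 3)],
    "process": [((0, 0), 4), ((0, 1), 1), ((1, 2), 1)],
}
# Search only inside the subtree rooted at Dewey ID 1, assumed to hold 4 elements.
ranked = context_search(index, ["process", "speech"], prefix=(1,), n_context=4)
```

Because entries are sorted by Dewey ID, everything inside a context forms one contiguous run, so the statistics are computed only over that run rather than over the whole repository.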
Outline
• Pure Keyword Search
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion
Motivation
• Many new applications require sophisticated DB queries + “complex” full-text search
  – Example: Library of Congress documents in XML
• Current XML query languages are mostly “database” languages
  – Examples: XQuery, XPath
• Provide very rudimentary text/IR support
  – fn:contains(e, keywords)
• No support for complex IR queries
  – Distance predicates, stemming, scoring, …
Example Queries
• From XQuery Full-Text Use Cases Document
  – Find the titles of the books whose body contains the phrases “Usability” and “Web site” in that order, in the same paragraph, using stemming if necessary to match the tokens
  – Find the titles of the books whose body contains “Usability” and “testing” within a window of 3 words, and return them in score order
XQuery Full-Text
• Full-text search extension to XQuery
  – W3C Working Draft
• XQuery Full-Text Evolution
  – Quark query language
    • [Botev & Shanmugasundaram], 2003
  – TeXQuery
    • WWW 2004 [Amer-Yahia, Botev, Shanmugasundaram]
  – XQuery Full-Text
    • http://www.w3.org/TR/xquery-full-text
    • Invited experts (Botev and Shanmugasundaram)
XQuery Primer
Find the titles of books:
  //book/title
Find the titles of books with price < 25:
  //book[./price < 25]/title
Find books written by Dawkins, in order of price:
  for $b in //book[./author = 'Dawkins']
  order by $b/price
  return $b
Syntax Overview
• Two new XQuery constructs
1) FTContainsExpr
  • Expresses “Boolean” full-text search predicates
  • Seamlessly composes with other XQuery expressions
2) FTScoreClause
  • Extension to FOR expression
  • Can score FTContainsExpr and other expressions
FTContainsExpr
ContextExpr ftcontains FTSelection
  – ContextExpr (any XQuery expression) is context spec
  – FTSelection is search spec
  – Returns true iff at least one node in ContextExpr satisfies the FTSelection
• Examples
  – //book ftcontains 'Usability' && 'testing' distance 5
  – //book[./content ftcontains 'Usability' with stems]/title
  – //book ftcontains /article[author='Dawkins']/title
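To make the first example concrete, here is a toy check of a two-word distance predicate over one node's text. It is only a sketch of the idea: the tokenizer, the function name `ft_contains_within`, and the reading of "distance n" as "token positions differ by at most n" are assumptions, and the actual XQuery Full-Text semantics are defined by the W3C specification.

```python
import re


def ft_contains_within(text, word_a, word_b, max_distance):
    """True if word_a and word_b both occur within max_distance tokens of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, tok in enumerate(tokens) if tok == word_a.lower()]
    pos_b = [i for i, tok in enumerate(tokens) if tok == word_b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


body = "Usability and careful testing of Web sites"
print(ft_contains_within(body, "Usability", "testing", 5))
print(ft_contains_within(body, "Usability", "testing", 2))
```

A predicate like this returns a Boolean per node, which is why it composes with ordinary path expressions in the way the slide describes.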
FTScore Clause
  FOR $v [SCORE $s]? IN [FUZZY] Expr
  ORDER BY …
  RETURN
Example (score only)
  FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
  ORDER BY $s
  RETURN $b
Example (with a price predicate)
  FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
  ORDER BY $s
  RETURN $b
Example (with FUZZY)
  FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" && "testing"]
  ORDER BY $s
  RETURN $b
Quark @ Cornell
• An efficient implementation of XQuery Full-Text
• http://www.cs.cornell.edu/database/quark
Outline
• Pure Keyword Search
• Keyword Search in Context
• Full-Text + DB Queries
• Conclusion
10000 Foot View of Data Management
[Diagram: two axes. Queries: Ranked Search vs. Complex and Structured. Data: Structured vs. Unstructured. Regions: Information Retrieval Systems, Database Systems]
Related Work
• Semi-structured ranked keyword search
  – XIRQL [Fuhr and Großjohann, 2001]
  – XXL [Theobald and Weikum, 2001]
  – Commercial search engines [Luk et al., 2000]
  – SGML documents [Myaeng et al., 2001]
• Keyword search over databases
  – BANKS [Bhalotia et al., 2002]
  – DBXplorer [Agrawal et al., 2002]
  – DISCOVER [Hristidis et al., 2002]
  – LORE [Goldman et al., 1999]
Space Requirements
              DBLP                  XMARK
              Inv. List   Index     Inv. List   Index
Naïve-ID      258 MB      N/A       872 MB      N/A
Naïve-Rank    258 MB      217 MB    872 MB      527 MB
DIL           144 MB      N/A       254 MB      N/A
RDIL          144 MB      156 MB    254 MB      209 MB
HDIL          186 MB      7 MB      307 MB      3.2 MB
DBLP: High Correlation Keywords