XRANK: Ranked Keyword Search over XML Documents
SangJin Lee (sjinlee@snu.ac.kr)
Dept. of Industrial Engineering, Seoul National University

Contents
- Introduction: problem overview
- Data Model and Query Semantics
- Computing ElemRanks
- Efficiently Evaluating XML Keyword Search Queries
- Experimental Evaluation
- Conclusion

Introduction: Problem Overview
- Problem: efficiently producing ranked results for keyword search queries over hierarchical XML documents
- Challenges
  - Nested elements are results: return the most relevant element
  - Hyperlinks and containment links (parent/child elements)
  - Keyword proximity: keyword distance and ancestor distance
- E.g. query "XQL language" over the following document:
<workshop date="28 July 2000">
  <title> XML and IR: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> The recently proposed language …
      </abstract>
      <body>
        <section name="Introduction">
          Searching on structured text is more important …
        </section>
        <section name="Implementing XML Operations">
          <subsection name="Path Expressions">
            At first sight, the XQL query language looks …
          </subsection>
          …
        </section>
        <cite ref="2">Querying XML in Xyleme</cite>
        <cite xlink="../paper/xmlql/">A Query … </cite>
      </body>
    </paper>
    <paper id="2">
      <title> Querying XML in Xyleme </title>
      …
    </paper>
  </proceedings>
</workshop>

Data Model and Query Semantics (1/4)
Definitions
- XML document: $G = (N, CE, HE)$
- $N$: the set of nodes, $N = NE \cup NV$
  - $NE$: the set of elements
  - $NV$: the set of values
- $CE$: the set of containment edges relating nodes
- $HE$: the set of hyperlink edges relating nodes
- ※ contains*(v, k): the node v directly or indirectly contains keyword k
Example fragment:
<body>
  <section name="Introduction">
    Searching on structured text is more important …
  </section>
  <section name="Implementing XML Operations">
    <subsection name="Path Expressions">
      At first sight, the XQL query language looks …
    </subsection>
    …
  </section>
  <cite ref="2">Querying XML in Xyleme</cite>
  <cite xlink="../paper/xmlql/">A Query … </cite>
</body>

Data Model and Query Semantics (2/4)
Definitions (continued)
- Keyword query: $Q = \{k_1, \ldots, k_n\}$
  - Conjunctive semantics ($k_1 \wedge \ldots \wedge k_n$): results contain all of the query keywords
  - Disjunctive semantics ($k_1 \vee \ldots \vee k_n$): results contain at least one of the query keywords (*)
- Results
  - (1) $R_0 = \{ v \mid v \in NE \wedge \forall k \in Q\,(contains^*(v, k)) \}$: the elements that directly or indirectly contain all of the query keywords
  - (2) $Result(Q) = \{ v \mid \forall k \in Q,\ \exists c \in N\,((v, c) \in CE \wedge c \notin R_0 \wedge contains^*(c, k)) \}$: only the most specific results are returned, together with any element that has multiple independent occurrences of the query keywords
- ※ CE is considered for the result set; HE is considered for ranking
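The two result definitions can be traced on a small tree. The following is a minimal sketch under conjunctive semantics; the `Element` class, the `results` function, and the toy keyword sets are illustrative assumptions, not part of the original slides. Value nodes are modelled as the `words` set of their parent element.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    words: set = field(default_factory=set)       # keywords in the element's own value nodes
    children: list = field(default_factory=list)  # sub-elements (containment edges)

def results(root, query):
    """Return names of elements in Result(Q), in post-order."""
    query = set(query)
    hits = []

    def visit(e):
        all_words = set(e.words)  # contains*(e, .): everything below e
        usable = set(e.words)     # keywords reachable via children that are NOT in R0
        for c in e.children:
            sub = visit(c)
            all_words |= sub
            if not query <= sub:  # c is not in R0, so its keywords count for e
                usable |= sub
        if query <= usable:
            hits.append(e.name)
        return all_words

    visit(root)
    return hits

# Toy version of paper 1 from the example document (keyword sets abbreviated).
subsection = Element("subsection", {"xql", "query", "language"})
section = Element("section", children=[subsection])
body = Element("body", children=[section])
paper = Element("paper", children=[
    Element("title", {"xql", "proximal", "nodes"}),
    Element("abstract", {"language"}),
    body,
])
print(results(paper, ["xql", "language"]))
```

In this toy tree the subsection is the most specific result, while `section` and `body` are dropped because their only occurrences come from a child that is already in R0. The `paper` element is still returned because its title and abstract provide an independent occurrence of both keywords.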

Data Model and Query Semantics (3/4)
Ranking function: desired properties
- Result specificity: more specific results should be ranked higher than less specific results
- Keyword proximity
  - Keyword distance
  - Ancestor distance
- Hyperlink awareness
  - Hyperlinks
  - Containment links (parent/child elements)
※ Search keywords: XQL language
<paper id="2">
  <title> Querying XQL language... </title>
  …
</paper>
<paper id="1">
  <title> XQL and Proximal Nodes </title>
  <author> Ricardo Baeza-Yates </author>
  <author> Gonzalo Navarro </author>
  <abstract> The recently proposed language …
  </abstract>
  <body>
    <section name="Introduction">
      Searching …
    </section>
    <section name="Implementing XML Operations">
      <subsection name="Path Expressions">
        At first sight, the XQL query language looks …
      </subsection>
      …
    </section>
    <cite ref="2">Querying XML in Xyleme</cite>
    <cite xlink="../paper/xmlql/">A Query … </cite>
  </body>
</paper>

Data Model and Query Semantics (4/4)
Ranking function (continued)
- ElemRank(v)
  - Defined at the granularity of an element
  - Takes the nested structure of XML into account
- Consider
  - a keyword search query $Q = \{k_1, \ldots, k_n\}$
  - its results $R = Result(Q)$
  - a result element $v_1 \in R$
- With respect to one keyword: $r(v_1, Q)$
  - containment path $(v_1, v_2), (v_2, v_3), \ldots, (v_t, v_{t+1})$, where $v_{t+1}$ directly contains the keyword $k_i$
  - ※ f = max or f = sum
- Overall ranking
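The slide's formulas did not survive, so the sketch below is only an assumption about their shape: the per-occurrence rank is taken to be the ElemRank of `v_t` scaled down by a decay factor for each containment step, multiple occurrences of one keyword are combined with `f` (max or sum), and the overall rank is the sum over keywords times a proximity factor. The function names, the `decay` default, and the `proximity` argument are all hypothetical.

```python
def keyword_rank(elem_rank_vt, t, decay=0.75):
    """Assumed form of the rank for one occurrence: ElemRank(v_t) * decay^(t-1)."""
    return elem_rank_vt * decay ** (t - 1)

def combined_keyword_rank(occurrences, f=max, decay=0.75):
    """Combine several occurrences (elem_rank_vt, t) of one keyword with f = max or sum."""
    return f(keyword_rank(e, t, decay) for e, t in occurrences)

def overall_rank(per_keyword_occurrences, proximity=1.0, f=max, decay=0.75):
    """Assumed overall rank: sum of per-keyword ranks, scaled by a proximity factor."""
    total = sum(combined_keyword_rank(occ, f, decay) for occ in per_keyword_occurrences)
    return proximity * total

# Two keywords; the first occurs twice below the result element.
print(overall_rank([[(0.8, 1), (0.4, 3)], [(0.6, 2)]], decay=0.5))
```

With `decay=0.5` a keyword found directly in the result element keeps its full ElemRank, while one found two levels deeper is scaled by 0.25, which is how result specificity would enter the score under this assumption.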

Computing ElemRanks (1/2)
PageRank
- Sum of 2 probabilities
  - Visiting document v at random
  - Visiting document v by navigating a hyperlink from document u
- E.g. d = 0.85
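The two probabilities correspond to the standard PageRank recurrence, p(v) = (1 - d)/N + d * sum over links (u, v) of p(u)/Nh(u). A minimal power-iteration sketch follows; the function name, the dictionary-based graph, and the fixed iteration count are assumptions for illustration, and every document is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iters=50):
    """links: document -> list of documents it links to (all targets must be keys)."""
    nodes = list(links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Random-visit term, shared equally by all documents.
        nxt = {v: (1 - d) / n for v in nodes}
        # Navigation term: u passes its rank evenly along its out-links.
        for u, outs in links.items():
            for v in outs:
                nxt[v] += d * p[u] / len(outs)
        p = nxt
    return p

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```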

Computing ElemRanks (2/2)
ElemRank
- Element-level granularity; sum of 4 probabilities
  - Nd: the total number of documents
  - Nde(v): the number of elements in the document containing element v
  - Nc(u): the number of sub-elements of u
  - Nh(u): the number of outgoing hyperlinks from element u
  - d1: probability of navigating by hyperlink
  - d2: probability of navigating by forward containment edges
  - d3: probability of navigating by reverse containment edge
- Convergence
  - It is said to be proved, but ...?
Example:
<body>
  <section>Querying XML...</section>
  <section>Querying HTML..</section>
</body>
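The four terms can be iterated the same way as PageRank. The sketch below assumes the recurrence e(v) = (1 - d1 - d2 - d3)/(Nd * Nde(v)) + d1 * sum over hyperlinks (u, v) of e(u)/Nh(u) + d2 * sum over parents u of e(u)/Nc(u) + d3 * sum over children u of e(u), restricted to a single document (Nd = 1). The function name, the input format, and the default values of d1, d2, d3 are illustrative assumptions and are not taken from the slides.

```python
def elem_rank(parent, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iters=100):
    """parent: element -> its parent (None for the root) within ONE document.
    hyperlinks: list of (source_element, target_element) pairs."""
    elems = list(parent)
    n_de = len(elems)  # N_de: elements in this document; N_d = 1 is assumed
    children = {v: [] for v in elems}
    for v, p in parent.items():
        if p is not None:
            children[p].append(v)
    out_links = {v: [t for s, t in hyperlinks if s == v] for v in elems}

    e = {v: 1.0 / n_de for v in elems}
    for _ in range(iters):
        nxt = {v: (1 - d1 - d2 - d3) / n_de for v in elems}  # random visit
        for u in elems:
            for v in out_links[u]:        # by hyperlink
                nxt[v] += d1 * e[u] / len(out_links[u])
            for v in children[u]:         # by forward containment edge
                nxt[v] += d2 * e[u] / len(children[u])
            if parent[u] is not None:     # by reverse containment edge
                nxt[parent[u]] += d3 * e[u]
        e = nxt
    return e

ranks = elem_rank({"body": None, "s1": "body", "s2": "body"}, hyperlinks=[])
print(ranks)
```

In this variant the reverse containment term is not divided by a count, so a parent accumulates rank from all of its children; the two symmetric sections end up with equal rank.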

Efficiently Evaluating XML Keyword Search Queries (1/8)
How to produce ranked results efficiently?
- Naive approach
- Dewey Inverted List (DIL)
- Ranked Dewey Inverted List (RDIL)
- Hybrid Dewey Inverted List (HDIL)
Naive approach
- Treat each element as a document
- Problems
  - Space overhead
  - Spurious query results
  - Inaccurate ranking of results
- E.g. inverted lists of element IDs for two query keywords
  - XQL: 1, 5, 6, 8
  - Ricardo: 1, 5, 6, 7
[Figure: the workshop XML tree with a numeric ID on each element (<workshop>, date, <title>, <editors>, <proceedings>, <paper>, <title>, <author>)]
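The naive approach can be sketched as an index in which every element is listed for each keyword it contains directly or indirectly. The tuple-based tree, the `naive_index` function, and the element IDs below are illustrative assumptions; the IDs are chosen only so that the two lists match the example above.

```python
from collections import defaultdict

def naive_index(tree):
    """tree: (elem_id, text, children). Returns keyword -> sorted list of element IDs."""
    index = defaultdict(set)

    def visit(node):
        elem_id, text, children = node
        words = set(text.lower().split())
        for child in children:
            words |= visit(child)   # an element "contains" its descendants' keywords
        for w in words:
            index[w].add(elem_id)
        return words

    visit(tree)
    return {w: sorted(ids) for w, ids in index.items()}

# workshop(1) > proceedings(5) > paper(6) > title(8), author(7)
tree = (1, "", [(5, "", [(6, "", [
    (8, "XQL and Proximal Nodes", []),
    (7, "Ricardo Baeza-Yates", []),
])])])
index = naive_index(tree)
print(index["xql"], index["ricardo"])
print(sorted(set(index["xql"]) & set(index["ricardo"])))
```

Intersecting the two lists yields elements 1, 5 and 6, although only the paper (6) is the specific answer; the ancestors are the spurious results, and repeating every ancestor in every list is the space overhead.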

Efficiently Evaluating XML Keyword Search Queries (2/8)
Dewey Inverted List (DIL)
- Dewey ID: the ID of an ancestor is a prefix of the ID of each of its descendants
- Ancestor-descendant relationships are implicitly captured
- E.g. XQL: 0.3.0.0, Ricardo: 0.3.0.1
Dewey IDs in the example tree:
0          <workshop>
0.0          date (28 July …)
0.1          <title> (XML and …)
0.2          <editors> (David Carmel …)
0.3          <proceedings>
0.3.0          <paper>
0.3.0.0          <title> (XQL …)
0.3.0.1          <author> (Ricardo …)
0.3.1          <paper> …
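A short sketch of how Dewey IDs can be assigned and compared. The tuple representation, the helper names, and the choice to number the root 0 and children from 0 are assumptions made to mirror the example IDs.

```python
def assign_dewey(tree, prefix=(0,)):
    """tree: (name, children). Yields (dewey_id, name) in document order."""
    name, children = tree
    yield prefix, name
    for i, child in enumerate(children):
        yield from assign_dewey(child, prefix + (i,))

def is_ancestor(a, b):
    """True if the element with Dewey ID a is a proper ancestor of the one with ID b."""
    return len(a) < len(b) and b[:len(a)] == a

def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs, i.e. the deepest common ancestor."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

tree = ("workshop", [
    ("date", []), ("title", []), ("editors", []),
    ("proceedings", [
        ("paper", [("title", []), ("author", [])]),
        ("paper", []),
    ]),
])
for dewey, name in assign_dewey(tree):
    print(".".join(map(str, dewey)), name)
```

The deepest element containing both the title (0.3.0.0) and the author (0.3.0.1) is found by `common_prefix`, which gives 0.3.0, the first paper.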

Efficiently Evaluating XML Keyword Search Queries (3/8)
Dewey Inverted List (Continued)
DIL data structure
- The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k
- An entry in the DIL holds
  - the Dewey ID of the element
  - its ElemRank
  - the list of all positions where the keyword k appears in that element
- Entries are sorted by Dewey ID
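One possible in-memory layout for these entries is sketched below. The `DilEntry` class, the `build_dil` function, and the sample postings (Dewey IDs, ranks, and positions) are hypothetical and only illustrate the three fields and the sort order.

```python
from dataclasses import dataclass

@dataclass
class DilEntry:
    dewey_id: tuple   # e.g. (0, 3, 0, 0)
    elem_rank: float  # ElemRank of the element
    positions: list   # positions of the keyword inside the element

def build_dil(postings):
    """postings: iterable of (keyword, dewey_id, elem_rank, position).
    Returns keyword -> list of DilEntry sorted by Dewey ID."""
    lists = {}
    for keyword, dewey_id, elem_rank, position in postings:
        entries = lists.setdefault(keyword, {})
        entry = entries.setdefault(dewey_id, DilEntry(dewey_id, elem_rank, []))
        entry.positions.append(position)
    # Tuples compare lexicographically, which matches document order of Dewey IDs.
    return {k: sorted(e.values(), key=lambda x: x.dewey_id) for k, e in lists.items()}

dil = build_dil([
    ("xql", (0, 3, 0, 4, 1), 0.2, 8),
    ("xql", (0, 3, 0, 0), 0.5, 0),
    ("xql", (0, 3, 0, 4, 1), 0.2, 31),
    ("ricardo", (0, 3, 0, 1), 0.4, 0),
])
for entry in dil["xql"]:
    print(entry)
```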

Efficiently Evaluating XML Keyword Search Queries (4/8)
Dewey Inverted List (Continued)
DIL query processing
- Key idea: merge the query keyword inverted lists and simultaneously compute the longest common prefix of the Dewey IDs in different lists
[Figure: the example tree with its Dewey IDs, next to the Dewey stack with columns Dewey, Rank[1], Rank[2], PosList[1], PosList[2], ContainsAll]

Efficiently Evaluating XML Keyword Search Queries (4/8)
DIL query processing (Continued)
[Figure: the Dewey stack (columns Dewey, Rank[1], Rank[2], PosList[1], PosList[2], ContainsAll) at a later step of the merge over the example tree; labels: "compare top k", "RESULT"]

Efficiently Evaluating XML Keyword Search Queries (5/8)
Ranked Dewey Inverted List
- DIL challenge
  - If inverted lists are long (e.g. common keywords or large document collections), a single scan of the inverted lists can be expensive, even though users want only the top few results
- RDIL
  - Inverted lists are ordered by ElemRank (cf. DIL: by Dewey ID)
  - Higher-ranked results appear first in the inverted list
  - Each inverted list has a B+-tree index on the Dewey ID field
[Figure: the XQL inverted list, sorted by ElemRank, with a B+-tree on Dewey ID]
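The two access paths of an RDIL list can be illustrated with a small sketch. A sorted Python list with `bisect` stands in for the B+-tree here, which is an assumption for brevity; the `RankedList` class and the ElemRank values are hypothetical.

```python
import bisect

class RankedList:
    def __init__(self, entries):
        """entries: list of (dewey_id, elem_rank) for one keyword."""
        # Sequential access path: highest ElemRank first.
        self.by_rank = sorted(entries, key=lambda e: -e[1])
        # Lookup path: Dewey IDs in sorted order (stand-in for the B+-tree).
        self.by_dewey = sorted(e[0] for e in entries)

    def has_descendant_or_self(self, prefix):
        """Does some entry lie at or below the element with this Dewey ID?"""
        i = bisect.bisect_left(self.by_dewey, prefix)
        return i < len(self.by_dewey) and self.by_dewey[i][:len(prefix)] == prefix

xql = RankedList([((9, 0, 5, 6), 0.4), ((9, 0, 4, 1, 2), 0.9), ((8, 2, 1, 4, 2), 0.7)])
print(xql.by_rank[0])
print(xql.has_descendant_or_self((9, 0, 4)))
```

Because all IDs sharing a prefix are contiguous in sorted order, a single lookup for the first ID not smaller than the prefix is enough to decide whether a candidate ancestor also contains the keyword.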

Efficiently Evaluating XML Keyword Search Queries (6/8)
Ranked Dewey Inverted List (RDIL)
Threshold algorithm
- The threshold is computed from ElemRank(R1) and ElemRank(R2)
- If threshold <= Rank(S): stop!
[Figure: the XQL and Ricardo inverted lists, each with a B+-tree on Dewey ID, and the output heap S. Entries R1 and R2 are marked in the lists (Dewey IDs shown: 9.0.4.1.2, 8.2.1.4.2, 9.0.5.6, 10.8.3); a probe P: 9.0.4.2.0 is looked up through the B+-tree, giving Rank(9.0.4)]
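The stopping rule is easiest to see in the classic threshold algorithm over flat IDs, sketched below. This is a deliberate simplification and not XRANK's variant: it ignores Dewey prefixes and B+-tree probes, uses a plain sum as the aggregate, and all names and scores are hypothetical. At least one keyword list is assumed.

```python
import heapq

def threshold_top_m(lists, m):
    """lists: one dict per keyword, mapping id -> score. Conjunctive top-m by summed score."""
    ranked = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    heap, seen = [], set()  # heap keeps the m best (total, id) pairs, smallest on top
    # Once the shortest list is exhausted, every possible conjunctive result has been seen.
    for depth in range(min(len(r) for r in ranked)):
        frontier = [r[depth] for r in ranked]
        for item, _ in frontier:
            if item in seen:
                continue
            seen.add(item)
            if all(item in l for l in lists):  # random access into the other lists
                heapq.heappush(heap, (sum(l[item] for l in lists), item))
                if len(heap) > m:
                    heapq.heappop(heap)
        # No unseen id can score more than the sum of the scores at the current depth.
        threshold = sum(score for _, score in frontier)
        if len(heap) == m and heap[0][0] >= threshold:
            break
    return sorted(heap, reverse=True)

top = threshold_top_m(
    [{"a": 0.9, "b": 0.5, "c": 0.1}, {"b": 0.8, "a": 0.7, "d": 0.2}], m=1)
print(top)
```

In this example the scan stops after two entries per list, without ever reading `c` or `d`, which is the saving the rank-ordered lists are meant to provide.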

Efficiently Evaluating XML Keyword Search Queries (7/8)
Hybrid Dewey Inverted List (HDIL)
Motivation
- In many cases, RDIL is likely to perform well
- It may perform worse than DIL when the query keywords are not correlated
  - The individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document
  - Since the number of results is small, RDIL has to scan most (or all) of the inverted lists to produce the output
- HDIL combines the benefits of DIL and RDIL

Efficiently Evaluating XML Keyword Search Queries (8/8)
HDIL (Continued)
An adaptive strategy
- Calculate the estimated time for RDIL
  - Time spent so far: t
  - Number of results above the threshold: r
  - Desired number of query results: m
  - Estimated time remaining for RDIL = (m - r) * t / r
- The estimated time for DIL depends on
  - the number of query keywords
  - the size of each query keyword's inverted list
- If the estimated time for RDIL is more than the estimated time for DIL, switch to DIL
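The switching rule can be written down directly. In the sketch below the RDIL estimate follows the formula above; the DIL estimate (a per-entry cost times the total list length) and the handling of r = 0 are assumptions, as are all function and parameter names.

```python
def rdil_time_remaining(t, r, m):
    """(m - r) * t / r: time still needed if results keep arriving at the observed rate."""
    if r == 0:
        return float("inf")  # assumption: no result yet means no usable rate estimate
    return (m - r) * t / r

def dil_time(list_sizes, cost_per_entry):
    """Hypothetical cost model: DIL scans every entry of every keyword's list once."""
    return cost_per_entry * sum(list_sizes)

def should_switch_to_dil(t, r, m, list_sizes, cost_per_entry):
    return rdil_time_remaining(t, r, m) > dil_time(list_sizes, cost_per_entry)

# 4 of 10 results found after 2.0 s; two lists of 100 and 200 entries.
print(rdil_time_remaining(2.0, 4, 10))
print(should_switch_to_dil(2.0, 4, 10, [100, 200], cost_per_entry=0.001))
```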

Experimental Evaluation
- Setup: data sets
  - DBLP (real data, 143 MB, depth = 4, many small documents)
  - XMARK (synthetic data, 113 MB, depth = 10, one large document)
- Quality and ranking function
- Space requirements
- Query performance: DBLP, high-correlation keywords

Conclusion
- XRANK is the first system that takes into account
  - the hierarchical and hyperlinked structure of XML documents
  - a two-dimensional notion of keyword proximity
- Future work (open problems)
  - Incremental index maintenance
  - Integration with structured queries

Thank you

Appendix A. XRANK Architecture

Appendix B. DIL Query Processing Algorithm
01. procedure EvaluateQuery (k1, k2, …, kn, m) returns idList
02. // k1 … kn are the query keywords, m is the desired number of query results
03. // invertedList[i] is the inverted list for keyword ki
04. resultHeap = empty; // Initialize the result heap of size m
05. deweyStack = empty; // Initialize the Dewey stack
06. while (eof has not been reached on all inverted lists) {
07.   // Read the next entry from the inverted list having the smallest DeweyID
08.   find ilIndex such that the next entry of invertedList[ilIndex] is the smallest DeweyID
09.   currentEntry = invertedList[ilIndex].nextEntry;
10.   // Find the longest common prefix between deweyStack and currentEntry.deweyId
11.   find largest lcp such that deweyStack[i] = currentEntry.deweyId[i], 1 <= i <= lcp
12.   // Pop non-matching entries in the Dewey stack; add to result heap if appropriate
13.   while (deweyStack.size > lcp) {
14.     stackEntry = deweyStack.pop();
15.     if (stackEntry.posList non-empty for all keywords) {
16.       stackEntry.ContainsAll = true
17.       compute overall rank using formula in Section 2.3.2.2
18.       if overall rank is among top m seen so far, add deweyStack ID to resultHeap
19.     } else if (!stackEntry.ContainsAll) {
20.       deweyStack[deweyStack.size].posList[i] += stackEntry.posList[i] (for all i)
21.       deweyStack[deweyStack.size].rank[i] = rank as in Sec. 2.3.2.1 (for all i)
22.     }
23.     if (stackEntry.ContainsAll) deweyStack[deweyStack.size].containsAll = true
24.   }
25.   // Add non-matching part of currentEntry.deweyId to deweyStack
26.   for (all i such that lcp < i <= currDeweyIdLen) {
27.     deweyStack.push(deweyStackEntry);
28.   }
29.   // Add components to the top entry
30.   deweyStack[currDeweyIdLen].rank[ilIndex] = rank as in Section 2.3.2.1
31.   deweyStack[currDeweyIdLen].posList[ilIndex] += currentEntry.posList;
32. } // End of looping over all inverted lists
33. pop entries of deweyStack and add to result heap if appropriate (similar to lines 12-24)
34. return ids in resultHeap
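A compact runnable sketch of the same stack-based merge follows. It is a simplification under stated assumptions: position lists and keyword proximity are omitted, the per-keyword rank is assumed to decay by a constant factor per containment level with max as the combining function, and the overall rank is a plain sum. The function name, the `decay` default, and the sample ranks are hypothetical.

```python
import heapq

def dil_query(inverted_lists, m, decay=0.75):
    """inverted_lists[i]: list of (dewey_id, elem_rank) sorted by Dewey ID, for keyword i.
    Returns the top-m (overall_rank, dewey_id) pairs."""
    n = len(inverted_lists)
    # Merge all lists in Dewey-ID order, remembering which keyword each entry belongs to.
    merged = heapq.merge(*[[(d, i, r) for d, r in lst] for i, lst in enumerate(inverted_lists)])
    stack, path, results = [], [], []  # stack[j] describes the element with ID path[:j+1]

    def pop():
        entry = stack.pop()
        dewey = tuple(path)
        path.pop()
        if all(r is not None for r in entry["rank"]):
            entry["contains_all"] = True
            results.append((sum(entry["rank"]), dewey))
        elif not entry["contains_all"] and stack:
            # Pass partial occurrences up to the parent, one decay step further away.
            parent = stack[-1]
            for i, r in enumerate(entry["rank"]):
                if r is not None:
                    old = parent["rank"][i]
                    parent["rank"][i] = r * decay if old is None else max(old, r * decay)
        if entry["contains_all"] and stack:
            stack[-1]["contains_all"] = True

    for dewey, i, rank in merged:
        lcp = 0
        while lcp < min(len(path), len(dewey)) and path[lcp] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:   # pop entries that are not ancestors of the new ID
            pop()
        for comp in dewey[lcp:]:  # push the non-matching part of the new ID
            path.append(comp)
            stack.append({"rank": [None] * n, "contains_all": False})
        top = stack[-1]
        top["rank"][i] = rank if top["rank"][i] is None else max(top["rank"][i], rank)
    while stack:
        pop()
    return heapq.nlargest(m, results)

xql = [((0, 3, 0, 0), 0.5)]
ricardo = [((0, 3, 0, 1), 0.4)]
print(dil_query([xql, ricardo], m=10))
```

For the two-keyword example only the paper 0.3.0 is reported: once it is marked as containing all keywords, its ancestors 0.3 and 0 inherit the flag and receive no occurrences from it, so they are not returned as less specific duplicates.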

Appendix C. Experimental Evaluation
Space Requirements

              DBLP                  XMARK
              Inv. List   Index     Inv. List   Index
Naïve-ID      258 MB      N/A       872 MB      N/A
Naïve-Rank    258 MB      217 MB    872 MB      527 MB
DIL           144 MB      N/A       254 MB      N/A
RDIL          144 MB      156 MB    254 MB      209 MB
HDIL          186 MB      7 MB      307 MB      3.2 MB

Appendix D. Experimental Evaluation Query Performance: DBLP - Low Correlation Keywords