XRANK: Ranked Keyword Search over XML Documents
Lin Guo, Feng Shao, Chavdar Botev & Jayavel Shanmugasundaram
Department of Computer Science, Cornell University
Publication: SIGMOD 2003
Presented By: Akshaya Arora
12/27/2021 Spatial Database

Seven Steps Toward Better Searching
Mnemonic: "My plump starfish quickly lowered Lincoln's tie."
- Minus: -exclude
- plus: +include
- star: wildcard*
- quotes: "phrases in quotes"
- lower case: case MATTERS
- link: find page linked
- title: find words in title

OUTLINE
1. Introduction
2. Background
3. Problem Definition
4. Contributions
5. Ranking Keyword Query Results
6. XRANK System
7. Efficiently evaluating queries related to keywords over XML documents
8. Experimental Evaluation
9. Related Work
10. Conclusion & Future Work
11. References

1. Introduction

What is XML?
XML (the Extensible Markup Language) is an official recommendation of the W3C.
- A meta-language
  - A language used to describe other languages using "markup"
  - "Markup" describes properties of the data
- Designed to be structured
  - Strict rules about how data can be formatted
- Designed to be extensible
  - Users can define their own terms and markup

What is XML? (Contd.)
Structure and Hierarchy of XML Documents
[Figure: Logical Structure of Document. Labels: XML Document, Parent, Document Unit, Sub Units, Child, Sub Unit, Siblings]
An XML document can be described as a tree hierarchy; all elements must be nested.

What is XML? (Contd.)
XML Data Model
As noted, an XML document can be described as a tree hierarchy.
- It consists of a root element, nested sub-elements, attributes and values.
- It supports intra-document and inter-document references.
  - Intra-document references are represented using IDREFs.
  - Inter-document references are represented using XLink.
  - Both IDREFs and XLinks are referred to as hyperlinks.
A collection of hyperlinked XML documents can be defined as a directed graph $G = (N, CE, HE)$:
- $N$: the set of nodes, $N = NE \cup NV$
- $NE$: the set of elements
- $NV$: the set of values
- $CE$: the set of containment edges relating nodes
- $HE$: the set of hyperlink edges relating nodes
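As a rough illustration of the graph model above, the following Python sketch keeps the two node sets and the two edge sets separate. The class name `XMLGraph` and the example node labels are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class XMLGraph:
    """Directed graph G = (N, CE, HE) over a collection of XML documents."""
    elements: set = field(default_factory=set)      # NE: element nodes
    values: set = field(default_factory=set)        # NV: value nodes
    containment: set = field(default_factory=set)   # CE: (parent, child) pairs
    hyperlinks: set = field(default_factory=set)    # HE: (source, target) pairs

    @property
    def nodes(self):
        # N = NE U NV
        return self.elements | self.values


# Hypothetical example: a workshop containing a paper that holds a citation.
g = XMLGraph()
g.elements |= {"workshop", "paper", "cite"}
g.values |= {"XQL and Proximal Nodes"}
g.containment |= {("workshop", "paper"), ("paper", "cite"),
                  ("paper", "XQL and Proximal Nodes")}
g.hyperlinks |= {("cite", "paper")}  # an IDREF or XLink edge
```

Keeping containment and hyperlink edges in separate sets matters later, because the ranking function treats the two kinds of edges differently.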

An Example XML Document

XML vs HTML
HTML:
- HTML allows improper nesting.
- HTML allows start tags without end tags.
- HTML allows attribute values without quotes.
- HTML is case-insensitive.
- White space is not important in HTML.
XML:
- XML requires proper nesting.
- XML requires empty tags to be identified with a trailing slash.
- XML requires quoted attribute values.
- XML is case-sensitive.
- White space is important in XML.

What is Information Retrieval?
- Information Retrieval is the process of determining the relevant documents from a collection of documents, based on a query presented by the user.
- Information Retrieval applications require speed, consistency, accuracy and ease of use in retrieving relevant texts to satisfy user queries.
- Keyword search is a type of Information Retrieval.

2. Background

How is searching done over HTML documents (Google)? [2]
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank (PageRank) of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
Finally, sort the documents that have matched by rank and return the top results.

What is PageRank? [2]
- PageRank is one of the methods Google uses to determine a page's relevance or importance.
- PageRank is a "vote", by all the other pages on the Web, about how important a page is. A link to a page counts as a vote of support; if there is no link, there is no support.
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$
- $PR(A)$ is the PageRank of page A.
- $d$ is a damping factor between 0 and 1, nominally set to 0.85.
- $PR(T_1)$ is the PageRank of a page pointing to page A.
- $C(T_1)$ is the number of links going out of that page.
- $PR(T_n)/C(T_n)$ means the same term is added for each page pointing to page A.
PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
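The PageRank formula is usually evaluated by repeated substitution until the values stop changing. The following Python sketch does this for the formula exactly as written on the slide; the function name `pagerank`, the iteration count and the three-page link graph are illustrative assumptions.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        nxt = {p: 1.0 - d for p in pages}
        for src, targets in links.items():
            for t in targets:
                # Each outgoing link passes on an equal share PR(src) / C(src).
                nxt[t] += d * pr[src] / len(targets)
        pr = nxt
    return pr


# Hypothetical graph: A and B both link to C, and C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this example C receives two votes and ranks highest, while B receives none and stays at the base value 1 - d. With the (1 - d) term written as on the slide, and every page having at least one outgoing link, the values average to one per page; dividing (1 - d) by the number of pages instead gives values that sum to one.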

3. Problem Definition

Why do we need XRANK? [1]
For keyword searching over XML documents:
- One approach is to use a sophisticated query language such as XQuery.
  - It is effective in some cases.
  - Limitation: users have to learn a complex query language and the schema of the underlying XML.
- An alternative approach, which overcomes these issues, is XRANK.
  - Retain the simple keyword search query interface.
  - Exploit XML's tagged and nested structure during query processing.

Issues related to the Searching & Ranking of Keywords over XML documents [1]
- The result of a keyword search query can be a deeply nested XML element.
  - Return the "deepest" node.
- Ranking is not solely based on hyperlinks.
  - The semantics of containment links (relating parent and child elements) is very different from that of hyperlinks (such as IDREFs and XLinks).
- The notion of proximity among keywords is more complex.
  - In HTML, proximity among keywords translates directly to the distance between keywords in a document.
  - For XML there is a 2-dimensional proximity metric:
    - Keyword distance
    - Ancestor distance

Keyword Query Results
- There are two possible semantics for keyword search queries:
  - Conjunctive keyword query semantics: elements that contain all of the query keywords are returned.
  - Disjunctive keyword query semantics: elements that contain at least one of the query keywords are returned.
- This paper focuses on conjunctive keyword query semantics.
- An important requirement for keyword search is to RANK the query results so that the most relevant results appear first.

4. Contributions

Contributions
This paper describes the architecture, implementation and evaluation of the XRANK system. The contributions are:
- the problem definition and system architecture
- an algorithm for computing the ranking of XML elements
- new inverted list index structures and associated query processing algorithms
- an experimental evaluation of XRANK

5. Ranking Keyword Query Results

Properties of Ranking Function [3]
- Result specificity: more specific results should be ranked higher than less specific results. This is one dimension of result proximity.
- Keyword proximity: the proximity of the query keywords within a result. This is another dimension of result proximity.
- Hyperlink awareness: the ranking should take the hyperlinked structure of XML documents into account.

Ranking Function: Definition
- ElemRank is defined at the granularity of an element and takes the nested structure of XML into account. It is based on the hyperlinked structure of XML documents.
- It is similar to Google's PageRank.
- Let $Q = (k_1, k_2, \ldots, k_n)$ be the query, $R = Result(Q)$ its result set, and $v_1 \in R$ a result element.
- First define the ranking of $v_1$ with respect to one query keyword $k_i$, $r(v_1, k_i)$, before defining the overall rank, $rank(v_1, Q)$.

Overall Ranking
The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity $p(v_1, k_1, k_2, \ldots, k_n)$.
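Written as an equation, and assuming the proximity measure takes the result element and all $n$ query keywords as arguments, the sentence above corresponds to:

```latex
rank(v_1, Q) = \left( \sum_{i=1}^{n} r(v_1, k_i) \right) \times p(v_1, k_1, k_2, \ldots, k_n)
```

Here $r(v_1, k_i)$ is the rank of the result element $v_1$ with respect to the single keyword $k_i$.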

6. XRANK System

Architecture [1]
[Figure: XRANK architecture, including the HDIL index]
The Query Evaluator Module
- Generates an index structure called HDIL
- Evaluates queries using HDIL
- Returns ranked results
ElemRank Computation Module
- Computes the ElemRanks of XML elements
- ElemRanks are combined with ancestor information

ElemRank Computing [2]
- The PageRank of a document v is the sum of two probabilities (with d = 0.85):
  - The first is the probability of visiting v at random.
  - The second is the probability of visiting v by navigating through other documents.
  - $N_h(u)$ is the number of outgoing hyperlinks from document u.
- There are some issues in adapting this formula directly to XML documents by mapping each element to a document and all edges to hyperlink edges.
- Hyperlinks are treated as directed edges, so PageRank propagates in one direction only.
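The two probabilities can be combined into a single formula. The following is a reconstruction that assumes the form used in [1]; $N_d$, a symbol not defined on the slide, is taken to be the total number of documents, and $HE$ is the set of hyperlink edges.

```latex
p(v) = \frac{1 - d}{N_d} + d \times \sum_{(u, v) \in HE} \frac{p(u)}{N_h(u)}
```

The first term is the random-visit probability and the second is the probability of arriving at $v$ by following a hyperlink from some document $u$.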

Refinements of PageRank [1]
Bi-directional Transfer of ElemRanks
- A simple solution is to add reverse containment edges.
- $N_e$ is the total number of XML elements.
- $N_c(u)$ is the number of sub-elements of u.
- $E = HE \cup CE \cup CE^{-1}$, where $CE^{-1}$ is the set of reverse containment edges.
Limitation: it does not distinguish between containment and hyperlink edges.
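With reverse containment edges added, the propagation formula can be sketched as follows. This is a reconstruction that assumes the form given in [1]; the $+1$ in the denominator is assumed to account for the reverse containment edge from $u$ to its parent.

```latex
e(v) = \frac{1 - d}{N_e} + d \times \sum_{(u, v) \in E} \frac{e(u)}{N_h(u) + N_c(u) + 1}
```

Every outgoing edge of $u$, whether hyperlink or containment, receives the same share of $e(u)$, which is exactly the limitation noted on the slide.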

Refinements of PageRank (Contd.)
Discrimination between containment and hyperlink edges
- $d_1$ and $d_2$ are the probabilities of navigating through hyperlinks and containment edges, respectively.
Limitation: it weights forward and reverse containment relationships similarly.
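A reconstruction of the formula for this refinement, again assuming the form given in [1] and reusing $N_e$, $N_h(u)$ and $N_c(u)$ from the earlier slides:

```latex
e(v) = \frac{1 - d_1 - d_2}{N_e}
     + d_1 \sum_{(u, v) \in HE} \frac{e(u)}{N_h(u)}
     + d_2 \sum_{(u, v) \in CE \cup CE^{-1}} \frac{e(u)}{N_c(u) + 1}
```

Hyperlink and containment edges now carry separate weights, but forward and reverse containment edges still share the single weight $d_2$.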

Refinements of PageRank (Contd.)
Aggregate ElemRanks for reverse containment relationships
- $d_1$, $d_2$ and $d_3$ are the probabilities of navigating through hyperlinks, forward containment edges and reverse containment edges, respectively.
- $N_{de}(v)$ is the number of elements in the XML document containing the element v.
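The final refinement can be sketched as a small iterative computation. The Python code below assumes the formula has the form given in [1], namely e(v) = (1 - d1 - d2 - d3) / (Nd * Nde(v)) + d1 * sum over hyperlinks of e(u)/Nh(u) + d2 * sum over forward containment edges of e(u)/Nc(u) + d3 * sum over reverse containment edges of e(u), where Nd is the number of documents. The function name `elem_rank`, the default values of d1, d2 and d3, and the tiny example tree are illustrative assumptions.

```python
from collections import Counter


def elem_rank(elements, children, hyperlinks, doc_of,
              d1=0.35, d2=0.25, d3=0.25, iterations=50):
    """Iteratively evaluate the assumed ElemRank formula.

    children:   parent element -> list of its sub-elements (containment edges)
    hyperlinks: source element -> list of hyperlink targets
    doc_of:     element -> identifier of the document that contains it
    """
    n_docs = len(set(doc_of.values()))   # Nd
    doc_size = Counter(doc_of.values())  # document -> number of elements
    e = {v: 1.0 / len(elements) for v in elements}
    for _ in range(iterations):
        # Random-jump term: (1 - d1 - d2 - d3) / (Nd * Nde(v))
        nxt = {v: (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[v]])
               for v in elements}
        for u in elements:
            for v in hyperlinks.get(u, []):
                nxt[v] += d1 * e[u] / len(hyperlinks[u])  # hyperlink edge
            for v in children.get(u, []):
                nxt[v] += d2 * e[u] / len(children[u])    # forward containment
                nxt[u] += d3 * e[v]                       # reverse containment
        e = nxt
    return e


# Hypothetical single document: a root element with two sub-elements.
ranks = elem_rank(
    elements=["root", "a", "b"],
    children={"root": ["a", "b"]},
    hyperlinks={},
    doc_of={"root": 0, "a": 0, "b": 0},
)
```

Because the reverse containment term is not divided by the number of children, a parent aggregates the ElemRanks of all its sub-elements, which is why `root` ends up ranked above `a` and `b` in this example.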

7. Efficiently evaluating queries related to keywords over XML documents

Naïve Approach [3]
- Main difference between XML and HTML keyword search: the granularity of query results.
  - XML keyword search returns elements.
  - HTML keyword search returns documents.
- One way to do XML keyword search: treat each element as a document.
- Problems:
  - Space overhead
  - Spurious query results
  - Inaccurate ranking of results

Naïve Approach (Contd.)
- Space overhead
  - The naïve adaptation of an inverted list contains the list of elements that contain the keyword.
  - This causes a large space overhead, because each inverted list contains:
    - (1) each XML element that directly contains the keyword
    - all ancestors of the elements in (1), redundantly
- Spurious query results
  - The naïve approach ignores ancestor-descendant relationships; all elements are treated as independent documents.
  - Results will not correspond to the desired semantics for XML keyword search.
- Inaccurate ranking of results
  - Existing approaches do not take result specificity into account when ranking results.

Dewey Inverted List (DIL) [1]
DIL is introduced to overcome the problems of the naïve approach.
- Dewey encoding of element IDs jointly captures ancestor and descendant information.
- The ID of an ancestor is a prefix of the ID of a descendant.
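The prefix property can be illustrated with a few lines of Python. This is a minimal sketch that represents a Dewey ID as a tuple of integers; the helper names `parse_dewey`, `is_ancestor` and `longest_common_prefix` are hypothetical.

```python
def parse_dewey(text):
    """Turn a dotted Dewey ID such as '5.0.3.0.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_ancestor(a, d):
    """True if a is a proper ancestor of d, i.e. a is a strict prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def longest_common_prefix(a, b):
    """Dewey ID of the deepest element that contains both a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


x = parse_dewey("5.0.3.0.0")
y = parse_dewey("5.0.3.0.1")
common = longest_common_prefix(x, y)  # (5, 0, 3, 0)
```

Testing ancestry or finding the deepest common ancestor needs only the two IDs themselves, with no access to the document tree.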

Dewey Inverted List (Contd.)
DIL Data Structure
- The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k.
- For multiple documents, the first component of each Dewey ID is the document ID.
- An entry in DIL contains:
  - the ElemRank of the corresponding XML element
  - the list of all positions where the keyword k appears in that element
- Entries are sorted by Dewey ID.
- The size of DIL is smaller than that of the naïve approach.
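A DIL-like index can be sketched with Python's standard `xml.etree.ElementTree` parser. This is a simplified illustration, not the paper's implementation: the function name `build_dil` and the sample document are hypothetical, ElemRanks are omitted from the entries, and positions are counted within each element's own text rather than across the whole document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(xml_text, doc_id=0):
    """Map each keyword to a sorted list of (Dewey ID, positions) entries.

    Only elements that directly contain the keyword get an entry, so
    ancestors are not replicated in the lists.
    """
    index = defaultdict(list)

    def visit(elem, dewey):
        # Words directly contained in elem: its own text plus the text
        # that follows each child inside elem (the child's tail).
        words = (elem.text or "").split()
        for child in elem:
            words += (child.tail or "").split()
        positions = defaultdict(list)
        for pos, word in enumerate(words):
            positions[word.lower()].append(pos)
        for word, pos_list in positions.items():
            index[word].append((dewey, pos_list))
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))

    # The first Dewey component is the document ID.
    visit(ET.fromstring(xml_text), (doc_id,))
    for postings in index.values():
        postings.sort()
    return dict(index)


doc = ("<workshop><title>XML and IR</title>"
       "<paper><title>XQL and Proximal Nodes</title>"
       "<author>Ricardo</author></paper></workshop>")
dil = build_dil(doc, doc_id=5)
```

In this example the word "xql" is indexed only under the inner title element (5.1.0), not under its ancestors `paper` and `workshop`, which is where the space saving over the naïve approach comes from.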

Dewey Inverted List (Contd.)
DIL Query Processing Algorithm
The algorithm works in a single pass over the inverted lists of the query keywords.

Dewey Inverted List (Contd.)
DIL Query Processing Algorithm using an Example
With reference to Figure 4:
- Let the input (the keyword search query) be "XQL Ricardo".
- First, the entry with the smallest Dewey ID (5.0.3.0.0) is read.
  - Since the Dewey stack is initially empty, the longest common prefix is empty; the Dewey ID components are simply pushed onto the stack, and the rank and posList of the topmost entry are updated (Lines 25-32).
- Then the next smallest Dewey ID (5.0.3.0.1) is read from the "Ricardo" inverted list.
- The longest common prefix (5.0.3.0) of the current entry and the Dewey stack is determined (Lines 10-11).
- Non-matching entries (5.0.3.0.0) are popped from the stack (Lines 12-24).
  - The scaled-down rank and position list of the popped entry are copied to its parent entry (5.0.3.0) (Lines 19-22).
  - The rank and position of the current entry (5.0.3.0.1) are then pushed onto the stack.

Dewey Inverted List (Contd.)
DIL Query Processing Algorithm using an Example (Contd.)
- Then the next smallest Dewey ID (6.0.3.8.3) is read.
  - Since its longest common prefix with the Dewey stack is empty, all the entries of the Dewey stack are popped (Lines 13-24).
  - The topmost entry (5.0.3.0.1) does not contain all query keywords, so its scaled-down rank and position list are copied to its parent (5.0.3.0) when it is popped.
- Now the parent (5.0.3.0) contains all the query keywords, so its ContainsAll flag is set to true and it is added to the result heap (Lines 16-18).
- Conclusion: ancestors of the more specific result (5.0.3.0) are not returned, thus eliminating spurious results.
- The algorithm then pushes (6.0.3.8.3) onto the stack and proceeds as before.
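The stack-based merge walked through above can be sketched in Python. This is a simplified version, not the paper's algorithm as published: it only tracks which keywords each stack entry covers and omits ranks, position lists and the result heap. The function name `dil_query` is hypothetical, and placing 6.0.3.8.3 in the "XQL" list is an assumption made for the example.

```python
from heapq import merge


def dil_query(index, keywords):
    """Single pass over Dewey-sorted inverted lists (conjunctive semantics).

    index maps a keyword to a sorted list of Dewey IDs (tuples of ints).
    Returns the Dewey IDs of the most specific elements that contain all
    keywords.
    """
    n = len(keywords)
    streams = [[(dewey, i) for dewey in index.get(kw, [])]
               for i, kw in enumerate(keywords)]
    stack = []    # one (component, keywords seen) pair per Dewey component
    results = []

    def pop_to(depth):
        while len(stack) > depth:
            comp, seen = stack.pop()
            dewey = tuple(c for c, _ in stack) + (comp,)
            if len(seen) == n:
                results.append(dewey)       # contains all query keywords
            elif stack:
                stack[-1][1].update(seen)   # pass keywords up to the parent

    for dewey, kw in merge(*streams):       # entries in global Dewey order
        lcp = 0
        while (lcp < len(stack) and lcp < len(dewey)
               and stack[lcp][0] == dewey[lcp]):
            lcp += 1
        pop_to(lcp)                         # pop non-matching components
        for comp in dewey[lcp:]:
            stack.append((comp, set()))
        stack[-1][1].add(kw)
    pop_to(0)
    return results


index = {
    "XQL": [(5, 0, 3, 0, 0), (6, 0, 3, 8, 3)],
    "Ricardo": [(5, 0, 3, 0, 1)],
}
hits = dil_query(index, ["XQL", "Ricardo"])
```

An entry that already contains all keywords does not pass its keywords to its parent, so the ancestors 5.0.3, 5.0 and 5 are not reported and only 5.0.3.0 is returned.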

Ranked Dewey Inverted List (RDIL)
"If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results."
Solution 1
- Order the inverted lists by ElemRank instead of by Dewey ID.
  - Higher-ranked results appear first in the inverted lists, so query processing can usually be terminated without scanning the lists completely.
Limitation
- Processing queries with multiple keywords is challenging, because one query keyword may occur in an element with a high ElemRank (near the beginning of its list) while another may occur near the end of its list.
Solution 2
- A threshold algorithm is proposed that works for conjunctive queries too.

Ranked Dewey Inverted List (Contd.)
RDIL Data Structure
- RDIL is similar to DIL except that:
  - Inverted lists are ordered by ElemRank.
  - Each inverted list has a B+-tree index on the Dewey ID field.
Why use a B+-tree?
- Let d be the Dewey ID of an element that contains keyword ki, and consider another query keyword kj (≠ ki). We first need to find the longest prefix of d that also contains the keyword kj.
  - We just need to find the smallest Dewey ID, d2, in the kj inverted list that is larger than d.
    - This operation is easily supported by B+-trees because it is logically equivalent to starting a range scan at d and reading the first entry d2 in the range.
  - Then either d2 or its immediate predecessor in the B+-tree, d3, shares the longest common prefix with d.
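The lookup described above can be imitated with a sorted Python list and the standard `bisect` module, which here stands in for the B+-tree; a real system would use an on-disk index. The function names and sample Dewey IDs below are hypothetical.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Number of leading Dewey components shared by a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def longest_prefix_containing(d, sorted_ids):
    """Longest prefix of d that is shared with some ID in sorted_ids.

    sorted_ids plays the role of the B+-tree over the kj inverted list.
    """
    pos = bisect_left(sorted_ids, d)  # start of a range scan at d
    neighbours = []
    if pos < len(sorted_ids):
        neighbours.append(sorted_ids[pos])      # d2: first entry in the range
    if pos > 0:
        neighbours.append(sorted_ids[pos - 1])  # d3: immediate predecessor
    best = max((common_prefix_len(d, c) for c in neighbours), default=0)
    return d[:best]


ricardo_ids = [(5, 0, 3, 0, 1), (6, 0, 3, 8, 3)]
prefix = longest_prefix_containing((5, 0, 3, 0, 0), ricardo_ids)
```

Only the two neighbours of d in sorted order need to be inspected, because in a lexicographically sorted list the entry sharing the longest prefix with d is always adjacent to where d would be inserted. An empty tuple means that no element, not even the document, contains both keywords.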

Ranked Dewey Inverted List (Contd.)
RDIL Query Processing Algorithm
Consider an entry retrieved from the inverted list of keyword ki.
- The entry contains the Dewey ID d of a top-ranked element that directly contains the query keyword ki.
- To determine a query result, the longest prefix of d that also contains the other query keywords needs to be determined.

Hybrid Dewey Inverted List (HDIL)
- In many cases RDIL is likely to perform well.
- It may perform worse than DIL when the query keywords are not correlated:
  - The individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document.
  - Since the number of results is small, RDIL has to scan most (or all) of the inverted lists to produce the output.
- To overcome these issues, HDIL combines the benefits of DIL and RDIL without replicating the entire inverted list index.
[Figure 9: HDIL]

Hybrid Dewey Inverted List (Contd.)
HDIL Query Processing
- An adaptive strategy:
  - Periodically monitor performance.
  - Calculate:
    - t: the time spent so far
    - r: the number of results above the threshold
    - estimated time remaining for RDIL = (m - r) * t / r, where m is the desired number of query results
  - If the estimated time is more than the expected time for DIL, switch to DIL.
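The switching rule is simple arithmetic, sketched here in Python. The function names are hypothetical, and the handling of the corner cases r = 0 and r >= m is an assumption, since the slide does not cover them.

```python
import math


def estimate_rdil_remaining(t, r, m):
    """Estimated remaining RDIL time: (m - r) * t / r.

    t: time spent so far, r: results found above the threshold,
    m: desired number of query results.
    """
    if r >= m:
        return 0.0       # assumption: enough results have been found
    if r == 0:
        return math.inf  # assumption: no progress yet, so no finite estimate
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, expected_dil_time):
    """Switch when RDIL is expected to take longer than a DIL scan."""
    return estimate_rdil_remaining(t, r, m) > expected_dil_time


# 2 of 10 desired results found after 4 time units: (10 - 2) * 4 / 2 = 16.
remaining = estimate_rdil_remaining(t=4.0, r=2, m=10)
```

With an expected DIL time of 10 units the query would switch to DIL in this example, while with an expected DIL time of 20 units it would stay with RDIL.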

8. Experimental Evaluation

Query Performance
- There are four main factors that affect the performance of keyword search queries:
  - the number of query keywords
  - the correlation between keywords
  - the desired number of query results
  - the selectivity of the keywords
- RDIL performs well because the index probes to find common ancestors are successful.
- DIL, on the other hand, has to scan the entire inverted list, and hence performs relatively poorly.
- The performance of HDIL tracks that of RDIL by estimating a low completion time for RDIL.

Query Performance (Contd.)
- RDIL performs relatively badly for more than one query keyword because there are many unsuccessful random B+-tree lookups.
- In contrast, DIL sequentially scans the inverted lists and performs better.
- HDIL tracks the performance of DIL, but with a slight overhead because it starts off as RDIL and then switches to DIL.

9. Related Work

Related Work
Table columns: Systems | Consider Ranking | 2D Keyword Proximity | Integration with HTML Keyword Search | Query Processing Algorithms / Inverted Lists
Table entries (row alignment not preserved): XRANK ✓ ✓ | XIRQL [5] ✓ X | XXL [6] ✓ X X X | BANKS [4] ✓ X X X ✓ ✓
Algorithms for computing the deepest common ancestor of two nodes in a tree are well known, but these do not consider ranking and are not directly applicable to lists of nodes.

10. Conclusion & Future Work

Conclusion & Future Work
- This paper presents:
  - the XRANK system (design, implementation & evaluation), which handles these features of XML keyword search:
    - the hierarchical and hyperlinked structure of XML documents
    - a two-dimensional notion of keyword proximity
  - XRANK offers both space and performance benefits.
  - XRANK can be used to query both HTML and XML documents.
- Future work includes:
  - extension to other ranking functions,
  - integration with structured queries, and
  - incremental index maintenance.

11. References

References
[1] L. Guo, F. Shao, C. Botev, J. Shanmugasundaram, "XRANK: Ranked Keyword Search Over XML Documents", Cornell University Technical Report, 2003.
[2] S. Brin, L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW Conf., 1998.
[3] G. Salton, "Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer", Addison Wesley, 1989.
[4] G. Bhalotia, et al., "Keyword Searching and Browsing in Databases using BANKS", ICDE Conf., 2002.
[5] N. Fuhr, K. Großjohann, "XIRQL: A Language for Information Retrieval in XML Documents", SIGIR Conf., 2001.
[6] A. Theobald, G. Weikum, "The Index-Based XXL Search Engine for Querying XML Data with Relevance Rankings", EDBT Conf., 2002.

Questions?

Thank You