Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries

Dr. Yangjun Chen
Dept. Applied Computer Science, University of Winnipeg
515 Portage Ave., Winnipeg, Manitoba, Canada R3B 2E9

Outline
• Motivation
• Algorithm for unordered tree pattern query evaluation
  - Tree encoding
  - Algorithm description
• On the strict unordered tree matching
• Experiment results
• Summary

Motivation
• Efficient method to evaluate XPath expression queries: XML query processing
[Figure: a tree pattern query evaluated against XML documents.]

Motivation
Document:
  <Purchase>
    <Seller>
      <Name>dell</Name>
      <Item>
        <Manufacturer>IBM</Manufacturer>
        <Name>part#1</Name>
        <Item>
          <Manufacturer>Intel</Manufacturer>
        </Item>
      </Item>
      <Item>
        <Name>Part#2</Name>
      </Item>
      <Location>Houston</Location>
    </Seller>
    <Buyer>
      <Location>Winnipeg</Location>
      <Name>Y Chen</Name>
    </Buyer>
  </Purchase>
[Figure: the document tree, with nodes abbreviated P (Purchase), S (Seller), B (Buyer), N (Name), I (Item), M (Manufacturer) and L (Location).]

Motivation
Query: XPath expressions
  Q1: /Purchase[Seller[Loc = 'Boston']]/Buyer[Loc = 'New York']
  Q2: /Purchase//Item[Manufacturer = 'Intel']
  - d-edge: ancestor-descendant relationship
  - c-edge: parent-child relationship
[Figure: the document tree and the tree patterns for Q1 and Q2.]

Motivation
• XPath evaluation against XML documents
  - XPath expression: a[b[c and .//d]]/b[c and e//d]
  - Example: book[title = 'Art of Programming']//author[fn = 'Donald' and ln = 'Knuth']
    <document> <book> <title>Art of Programming</title> <author> <fn>Donald Knuth</fn> ……
[Figure: the tree patterns for the two expressions.]

Motivation
• XPath evaluation against XML documents
  - Evaluation based on unordered tree matching:
Definition 1. An embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
  (i) Preserve node label: for each u ∈ Q, label(u) matches label(f(u)).
  (ii) Preserve parent-child/ancestor-descendant relationships: if (u, v) is a c-edge in Q, then f(v) is a child of f(u) in T; if (u, v) is a d-edge in Q, then f(v) is a descendant of f(u) in T.
[Figure: a tree pattern Q (nodes q1 to q3) and a document tree T (nodes v1 to v6), with labels a, b and c.]
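The two conditions of Definition 1 can be checked mechanically for a given mapping. The following is a minimal sketch, not the evaluation algorithm of these slides: the dictionary-based tree representation, the function names and the small example trees are all assumptions made for illustration, and label matching is taken to be plain equality (no wildcards).

```python
from typing import Dict, Optional, Tuple

# Each tree maps a node id to (label, parent id); the root has parent None.
# This representation is an assumption of the sketch, not the slides' encoding.
Tree = Dict[str, Tuple[str, Optional[str]]]


def is_proper_ancestor(tree: Tree, anc: str, node: str) -> bool:
    """Walk parent links upward from node, looking for anc."""
    parent = tree[node][1]
    while parent is not None:
        if parent == anc:
            return True
        parent = tree[parent][1]
    return False


def is_embedding(Q: Tree, T: Tree, edge: Dict[str, str], f: Dict[str, str]) -> bool:
    """edge[q] is 'c' or 'd' for the edge entering q from its parent in Q."""
    for q, (label, q_parent) in Q.items():
        if T[f[q]][0] != label:                 # condition (i): labels agree
            return False
        if q_parent is None:
            continue
        if edge[q] == "c":                      # condition (ii), c-edge
            if T[f[q]][1] != f[q_parent]:
                return False
        elif not is_proper_ancestor(T, f[q_parent], f[q]):  # condition (ii), d-edge
            return False
    return True


# Hypothetical pattern: a with a c-child b and a d-child c.
Q = {"q1": ("a", None), "q2": ("b", "q1"), "q3": ("c", "q1")}
edge = {"q2": "c", "q3": "d"}
# Hypothetical document: a with two b children; the second b has a c child.
T = {"v1": ("a", None), "v2": ("b", "v1"), "v3": ("b", "v1"), "v4": ("c", "v3")}

ok = is_embedding(Q, T, edge, {"q1": "v1", "q2": "v2", "q3": "v4"})
bad = is_embedding(Q, T, edge, {"q1": "v1", "q2": "v4", "q3": "v4"})
print(ok, bad)
```

The mapping need not be injective under Definition 1, so the check treats every query edge independently.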

Algorithm for query evaluation
• Tree encoding
Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
Example document with word positions:
   1 <A>
   2   <B>
   3     <C> 3 </C>
   4     <B>
   5       <C> 5 </C>
   6       <C> 6 </C>
   7       <D> 7 </D>
   8     </B>
   9   </B>
  10   <B> 10 </B>
  11 </A>
Encoded nodes of T:
  v1  A  (1, 1, 11, 1)
  v2  B  (1, 2, 9, 2)
  v3  C  (1, 3, 3, 3)
  v4  B  (1, 4, 8, 3)
  v5  C  (1, 5, 5, 4)
  v6  C  (1, 6, 6, 4)
  v7  D  (1, 7, 7, 4)
  v8  B  (1, 10, 10, 2)

Algorithm for query evaluation
• Tree encoding
The quadruple (DocId, LeftPos, RightPos, LevelNum) associated with a node v is denoted as a(v). With this encoding, structural relationships can be checked directly:
  (i) ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.
  (ii) parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.
  (iii) from left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2 and r1 < l2.
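The three tests translate directly into comparisons on the quadruples. A minimal sketch follows; the `Pos` type and the function names are hypothetical, and the sample quadruples are the ones given for v2, v3, v4 and v5 in the example document.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Quadruple (DocId, LeftPos, RightPos, LevelNum); field names are illustrative."""
    doc_id: int
    left: int
    right: int
    level: int


def is_ancestor(a: Pos, b: Pos) -> bool:
    # (i) same document, and a's interval strictly encloses b's
    return a.doc_id == b.doc_id and a.left < b.left and a.right > b.right


def is_parent(a: Pos, b: Pos) -> bool:
    # (ii) ancestor, and exactly one level deeper
    return is_ancestor(a, b) and b.level == a.level + 1


def is_left_of(a: Pos, b: Pos) -> bool:
    # (iii) same document, and a ends before b starts
    return a.doc_id == b.doc_id and a.right < b.left


v2 = Pos(1, 2, 9, 2)   # B
v3 = Pos(1, 3, 3, 3)   # C
v4 = Pos(1, 4, 8, 3)   # B
v5 = Pos(1, 5, 5, 4)   # C

print(is_ancestor(v2, v5), is_parent(v2, v5), is_parent(v2, v4), is_left_of(v3, v4))
```

Each test costs a constant number of integer comparisons, which is what makes the encoding attractive for stream-based evaluation.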

Algorithm for query evaluation
• Tree encoding
Data streams for the example document (one stream per tag name, sorted by LeftPos values):
  A: (1, 1, 11, 1)
  B: (1, 2, 9, 2), (1, 4, 8, 3), (1, 10, 10, 2)
  C: (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4)
  D: (1, 7, 7, 4)

Algorithm for query evaluation
• Algorithm description
  - Our algorithm works bottom-up. Therefore, we need to sort the XML data streams by (DocId, RightPos) values.
  - Each time a query Q is submitted to the system, we associate each query node q with a data stream L(q) such that label(v) = label(q) for each v ∈ L(q). In this way, each query node is attached to a list of matching nodes of the document tree.
Example, for a query Q with nodes q1 (A), q2 and q5 (B), q3 and q4 (C); streams are sorted by RightPos values:
  L(q1) = (1, 1, 11, 1) - {v1}
  L(q2) = L(q5) = (1, 4, 8, 3), (1, 2, 9, 2), (1, 10, 10, 2) - {v4, v2, v8}
  L(q3) = L(q4) = (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4) - {v3, v5, v6}
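Building the per-label streams and attaching them to query nodes can be sketched as follows. This is an illustration only: the helper name `build_streams` is hypothetical, and the quadruples for v1 and v8 are taken to be (1, 1, 11, 1) and (1, 10, 10, 2), as read off the numbered example document.

```python
from collections import defaultdict

# (label, quadruple) pairs for the example document tree, in document order.
nodes = [
    ("A", (1, 1, 11, 1)),   # v1 (assumed reading of the example)
    ("B", (1, 2, 9, 2)),    # v2
    ("C", (1, 3, 3, 3)),    # v3
    ("B", (1, 4, 8, 3)),    # v4
    ("C", (1, 5, 5, 4)),    # v5
    ("C", (1, 6, 6, 4)),    # v6
    ("D", (1, 7, 7, 4)),    # v7
    ("B", (1, 10, 10, 2)),  # v8 (assumed reading of the example)
]


def build_streams(nodes, key_index):
    """Group quadruples by label and sort each group by (DocId, quad[key_index])."""
    streams = defaultdict(list)
    for label, quad in nodes:
        streams[label].append(quad)
    for quads in streams.values():
        quads.sort(key=lambda quad: (quad[0], quad[key_index]))
    return dict(streams)


by_left = build_streams(nodes, 1)    # sorted by LeftPos
by_right = build_streams(nodes, 2)   # sorted by RightPos, for bottom-up processing

# Attach a stream L(q) to every query node q with label(v) = label(q).
query_labels = {"q1": "A", "q2": "B", "q3": "C", "q4": "C", "q5": "B"}
L = {q: by_right[label] for q, label in query_labels.items()}

print(by_left["B"])
print(L["q2"])
```

Query nodes with the same label share one stream object here, which mirrors L(q2) = L(q5) and L(q3) = L(q4) in the example.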

Algorithm for query evaluation
• Algorithm description
1. S(v): a set of query nodes q associated with document tree node v such that Q[q] can be embedded in T[v].
2. δ(q): a variable associated with query node q.
  • During the process, δ(q) will be dynamically assigned a series of values a_0, a_1, ..., a_m for some m in sequence, where a_0 = ∅ and the a_i's (i = 1, ..., m) are different nodes of T.
  • Initially, δ(q) is set to a_0 = ∅. δ(q) will be changed from a_{i-1} to a_i = v (i = 1, ..., m) when the following conditions are satisfied:
    i) v is the node currently encountered.
    ii) q appears in S(u) for some child node u of v.
    iii) q is a d-child, or q is a c-child and u is a c-child with label(u) = label(q).
[Figure: a document node v in T with its set S(v), and a query node q in Q with its variable δ(q).]

Algorithm for query evaluation
• Algorithm description
δ(q) is changed from a_{i-1} to a_i = v when:
  i) v is the node currently encountered.
  ii) q appears in S(u) for some child node u of v.
  iii) q is a d-child, or q is a c-child and u is a c-child with label(u) = label(q).
The two cases of condition iii), for a query node q with parent q':
  - q is a d-child of q': it suffices that S(u) = {…, q, …} for a child u of v.
  - q is a c-child of q': S(u) = {…, q, …} and label(u) = label(q); u must be a direct child of v.

Algorithm for query evaluation
• Algorithm description
3. Subtree embedding by using δ(q)
Let q' be a query node with children q1, …, ql. If label(v) = label(q') and δ(q1) = δ(q2) = … = δ(ql) = v, then T[v] embeds Q[q'].

Algorithm for query evaluation
• Algorithm description
4. Construction of S(v)
  • The nodes of Q are numbered in postorder, so the nodes in Q will be referenced by their postorder numbers.
  • For a leaf node v in T, any encountered leaf node q in Q will be inserted into S(v) if label(v) = label(q). (Since Q is explored bottom-up, the query nodes in S(v) must be increasingly sorted.)
  • For a non-leaf node v with children v1, …, vk, S(v) = S(v1) ∪ … ∪ S(vk) ∪ {all nodes q with T[v] embedding Q[q]}.
  • Since S(v) is sorted, 'union' can be implemented using the 'merge' operation.
Example: postorder numbers for Q are q3 (C) = 1, q4 (C) = 2, q2 (B) = 3, q5 (B) = 4, q1 (A) = 5.
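The bottom-up construction of S(v) can be illustrated with a deliberately simplified sketch. It is not the algorithm of these slides: instead of the per-query-node variable, it keeps a second set per document node (called `exact` here, a hypothetical name) holding the query nodes whose subtree embeds with its root mapped exactly to that node, it uses Python sets rather than sorted lists with merging, and it makes no claim about the stated complexity. The edge types chosen for the example query are also assumptions.

```python
def postorder(children, root):
    """Iterative postorder traversal; children maps a node to its child list."""
    order, stack = [], [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            order.append(node)
        else:
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return order


def match_sets(t_label, t_children, t_root, q_label, q_children, q_root):
    """q_children[q] lists (child, edge) pairs, with edge 'c' or 'd'."""
    q_plain = {q: [c for c, _ in kids] for q, kids in q_children.items()}
    q_order = postorder(q_plain, q_root)
    S, exact = {}, {}
    for v in postorder(t_children, t_root):
        kids = t_children.get(v, [])
        below = set().union(*(S[u] for u in kids))         # embeds under a child of v
        at_child = set().union(*(exact[u] for u in kids))  # embeds exactly at a child
        here = set()
        for q in q_order:
            if q_label[q] != t_label[v]:
                continue
            if all((c in at_child) if e == "c" else (c in below)
                   for c, e in q_children.get(q, [])):
                here.add(q)
        exact[v] = here
        S[v] = below | here            # union of the children's sets plus new matches
    return S, exact


# Example document tree from the slides.
t_label = {"v1": "A", "v2": "B", "v3": "C", "v4": "B",
           "v5": "C", "v6": "C", "v7": "D", "v8": "B"}
t_children = {"v1": ["v2", "v8"], "v2": ["v3", "v4"], "v4": ["v5", "v6", "v7"]}

# Example query; the edge types are assumed for illustration.
q_label = {"q1": "A", "q2": "B", "q3": "C", "q4": "C", "q5": "B"}
q_children = {"q1": [("q2", "d"), ("q5", "c")], "q2": [("q3", "c"), ("q4", "c")]}

S, exact = match_sets(t_label, t_children, "v1", q_label, q_children, "q1")
print(sorted(S["v4"]), sorted(exact["v8"]), "q1" in exact["v1"])
```

Because the embedding of Definition 1 does not have to be injective, each child of a query node can be checked independently, which is why a simple `all(...)` test suffices in this sketch.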

Strict unordered tree matching
Definition 2. A strict embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
  (i) same as (i) in Definition 1.
  (ii) same as (ii) in Definition 1.
  (iii) For any two nodes v1 ∈ Q and v2 ∈ Q, if v1 and v2 are not related by an ancestor-descendant relationship, then f(v1) and f(v2) in T are not related by an ancestor-descendant relationship.
[Figure: a tree pattern Q (nodes q1 to q3) and a document tree T (nodes v1 to v6), with a mapping from Q to T. By Definition 2, this mapping is not allowed.]
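Condition (iii) can be tested on top of an interval encoding of both trees. The sketch below is an illustration under stated assumptions: nodes are represented only by hypothetical (LeftPos, RightPos) pairs, the function names are invented, conditions (i) and (ii) are assumed to have been checked separately, and two query nodes mapped to the same document node are not counted as related, following the literal wording of the definition.

```python
from itertools import combinations


def related(a, b):
    """True if one (left, right) interval strictly encloses the other."""
    return (a[0] < b[0] and a[1] > b[1]) or (b[0] < a[0] and b[1] > a[1])


def satisfies_condition_iii(q_pos, t_pos, f):
    """Check condition (iii) of Definition 2 for the mapping f."""
    for u, w in combinations(q_pos, 2):
        if not related(q_pos[u], q_pos[w]) and related(t_pos[f[u]], t_pos[f[w]]):
            return False
    return True


# Hypothetical pattern: q1 (a) with two children q2 (b) and q3 (c).
q_pos = {"q1": (1, 4), "q2": (2, 2), "q3": (3, 3)}

# Hypothetical document 1: a chain v1 (a) -> v2 (b) -> v3 (c).
chain = {"v1": (1, 5), "v2": (2, 4), "v3": (3, 3)}
# Hypothetical document 2: v1 (a) with two children v2 (b) and v3 (c).
flat = {"v1": (1, 4), "v2": (2, 2), "v3": (3, 3)}

f = {"q1": "v1", "q2": "v2", "q3": "v3"}
print(satisfies_condition_iii(q_pos, chain, f), satisfies_condition_iii(q_pos, flat, f))
```

In the chain document, the siblings q2 and q3 are mapped to a node and its descendant, so the mapping can be an embedding under Definition 1 (with d-edges) while failing the strict condition.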

Strict unordered tree matching needs exponential time.
  - Time complexity: O(|T|·|Q|·d^k), where d is the largest out-degree of the nodes in T and k is the largest out-degree of the nodes in Q.
  - Using the concept of hypergraphs, the time complexity can be slightly improved to O(|T|·|Q|·2^k).
[Figure: a tree pattern Q and a document tree T whose nodes carry the same labels (a, b, c, d, e, …) in different sibling orders.]

Algorithm for query evaluation
• Experiments
  - We conducted our experiments on a DELL desktop PC equipped with a Pentium(R) 4 CPU at 2.80 GHz, 0.99 GB RAM and a 20 GB hard disk. The code was compiled using the Microsoft Visual C++ compiler version 6.0, running standalone.

Algorithm for query evaluation
• Experiments
Tested methods. In the experiments, we have tested five methods:
  - TwigStack (TS for short) [1],
  - Twig2Stack (T2S for short) [2],
  - TwigList (TL for short) [3],
  - One-Phase Holistic (OPH for short) [4],
  - tree-embedding (discussed in this paper, TE for short).
[1] N. Bruno, N. Koudas, and D. Srivastava, "Holistic Twig Joins: Optimal XML Pattern Matching," in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321.
[2] S. Chen, H.-G. Li, J. Tatemura, W.-P. Hsiung, D. Agrawal, and K. S. Candan, "Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents," in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294.
[3] L. Qin, J. X. Yu, and B. Ding, "TwigList: Make Twig Pattern Matching Fast," in Proc. 12th Int'l Conf. on Database Systems for Advanced Applications (DASFAA), Apr. 2007, pp. 850-862.
[4] Z. Jiang, C. Luo, W.-C. Hou, Q. Zhu, and D. Che, "Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution," in Proc. 18th Int'l Conf. on Database and Expert Systems Applications (DEXA), Sept. 2007, pp. 87-97.

Algorithm for query evaluation
• Experiments
Theoretical computational complexities:
  Method       Query time                        Runtime space usage
  TwigStack    O(|D||Q|)
  Twig2Stack   O(|D||Q|^2 + |subTwigResults|)    O(|D||Q|)
  TL           O(|D||Q|^2)                       O(|D||Q|)
  OPH          O(|D||Q|^2)                       O(|D||Q|)
  TE           O(|D||Q|)                         O(|D||Q|)
D: the largest data stream associated with a node q of Q such that for each v in the data stream we have label(v) = label(q).

Algorithm for query evaluation
• Experiments
Indexes. All the tested methods use XB-trees as their index structure.

Algorithm for query evaluation
• Experiments
Data sets. The data sets used for the tests are the DBLP data set [30] and a synthetic XMark data set [35]. The DBLP data set is a real data set with high similarity in structure; it is in fact a wide and shallow document.
                    DBLP      XMark
  Data size         127 MB    113 MB
  Number of nodes   3332k     1666k
  Max/Avg. depth    6/2.9     12/5.5

Algorithm for query evaluation
• Experiments
Queries: altogether 20 queries, divided into 4 groups.
Group I:
  Q1: //inproceedings[author]//year[text() = '2004']
  Q2: //inproceedings[author and title]//year[text() = '2004']
  Q3: //inproceedings[author and title and .//pages]//year[text() = '2004']
  Q4: //inproceedings[author and title and .//pages and .//url]//year[text() = '2004']
  Q5: //articles[author and title and .//volume and .//pages and .//url]//year[text() = '2004']

Algorithm for query evaluation
• Experiments
Test results. For all the experiments, the buffer pool size was fixed at 2000 pages, with a page size of 8 KB. For each data set, all the tag names are stored in a single list, and each tag name is then represented by its order number in that list during the evaluation of queries. In our implementation, each DocId occupies 4 bytes, while a number in a Prüfer sequence, a LeftPos or a RightPos occupies 2 bytes. A LevelNum value takes only 1 byte.
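With the field widths stated above, one (DocId, LeftPos, RightPos, LevelNum) quadruple takes 4 + 2 + 2 + 1 = 9 bytes. The sketch below packs one quadruple with Python's `struct` module to make that arithmetic concrete. The little-endian, unpadded, unsigned layout is an assumption of the sketch; the slides state only the field sizes, and the original implementation was written in C++.

```python
import struct

# "<" = little-endian without padding; I = 4-byte unsigned, H = 2-byte unsigned,
# B = 1-byte unsigned. The byte order and signedness are assumptions.
RECORD = struct.Struct("<IHHB")

quad = (1, 4, 8, 3)               # quadruple of node v4 in the example document
packed = RECORD.pack(*quad)

print(RECORD.size, len(packed), RECORD.unpack(packed))
```

A 2-byte unsigned field can hold values up to 65535, so this layout bounds the positions that can be stored per field.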

Algorithm for query evaluation
• Experiments
[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for the Group I queries.]

Algorithm for query evaluation
• Experiments
Queries: altogether 20 queries, divided into 4 groups.
Group II:
  Q6: //inproceedings[author/* and ./*]/year
  Q7: //inproceedings[author/* and title and ./*]/year
  Q8: //inproceedings[author/* and title and .//pages and ./*]/year
  Q9: //inproceedings[author/* and title and .//pages and .//url and ./*]/year
  Q10: //articles[author/* and title and .//volume and .//pages and .//url and ./*]/year

Algorithm for query evaluation
• Experiments
[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for queries Q6 to Q10.]

Algorithm for query evaluation
• Experiments
Queries: altogether 20 queries, divided into 4 groups.
Group III:
  Q11: /site//open_auction[.//seller/person]//date[text() = '10/23/2006']
  Q12: /site//open_auction[.//seller/person and .//bidder]//date[text() = '10/23/2006']
  Q13: /site//open_auction[.//seller/person and .//bidder/increase]//date[text() = '10/23/2006']
  Q14: /site//open_auction[.//seller/person and .//bidder/increase and .//initial]//date[text() = '10/23/2006']
  Q15: /site//open_auction[.//seller/person and .//bidder/increase and .//initial and .//description]//date[text() = '10/23/2006']

Algorithm for query evaluation
• Experiments
[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for the Group III queries.]

Algorithm for query evaluation
• Experiments
Queries: altogether 20 queries, divided into 4 groups.
Group IV:
  Q16: /site//open_auction[.//seller/person/* and ./*]/date
  Q17: /site//open_auction[.//seller/person/* and .//bidder and ./*]/date
  Q18: /site//open_auction[.//seller/person/* and .//bidder/increase]/date
  Q19: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and ./*]/date
  Q20: /site//open_auction[.//seller/person/* and .//bidder/increase and .//initial and .//description and ./*]/date

Algorithm for query evaluation
• Experiments
[Figure: execution time (sec.) of TS, T2S, OPH, TL and TM for queries Q16 to Q20.]

Algorithm for query evaluation
• Experiments
Test results. In the following figure, we compare the runtime memory usage of all five tested approaches for the second group of queries. By memory usage, we mean the intermediate data structures, not including the data streams (concretely, path stacks for TwigStack; hierarchical stacks for Twig2Stack, TL and OPH; and QSs for ours).
[Figure: runtime space usage.]

Summary
• An efficient method for evaluating unordered tree pattern queries in XML document databases
  - parent/child and ancestor/descendant relations
  - O(|D||Q|) time and O(|D||Q|) space
• Strict unordered tree matching
  - exponential time
• Experiments
  - TreeBank database, DBLP and XMark documents
  - I/O time and CPU processing time

Thank you.
