Introduction to XML Algebra
CS 561

Data Model
Data model: the core data structures and data types supported by a DBMS.
- A relational database uses a table (set-oriented) data model.
- The XML format is a tree-structured, hierarchical model.

Why Query Algebra (for XML)?
It is common to translate a query language into an algebra.
- First, the algebra is used to give a semantics for the query language.
- Second, the algebra is used to support query optimization.

NIAGARA
Title: Following the paths of XML Data: An algebraic framework for XML query evaluation
By: Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier. Univ. of Wisconsin

Outline
- Concepts of Niagara Algebra
- Operations
- Optimization

Goals of Niagara Algebra
- Be independent of schema information
- Query on both structure and content
- Generate simple, flexible, yet powerful algebraic expressions
- Allow re-use of traditional optimization techniques

Example: XML Source Documents

Invoice.xml
    <Invoice_Document>
      <invoice No = 1>
        <account_number>2</account_number>
        <carrier>AT&T</carrier>
        <total>$0.25</total>
      </invoice>
      <invoice>
        <account_number>1</account_number>
        <carrier>Sprint</carrier>
        <total>$1.20</total>
      </invoice>
      <invoice>
        <account_number>1</account_number>
        <carrier>AT&T</carrier>
        <total>$0.75</total>
      </invoice>
    </Invoice_Document>

Customer.xml
    <Customer_Document>
      <customer>
        <account>1</account>
        <name>Tom</name>
      </customer>
      <customer>
        <account>2</account>
        <name>George</name>
      </customer>
    </Customer_Document>

XML Data Model and Tree Graph
Example: Invoice_Document (ordered tree graph, semi-structured data)
    <Invoice_Document>
      <invoice>
        <number>2</number>
        <carrier>Sprint</carrier>
        <total>$0.25</total>
      </invoice>
      <invoice>
        <number>1</number>
        <carrier>Sprint</carrier>
        <total>$1.20</total>
      </invoice>
    </Invoice_Document>
[Figure: ordered tree graph with root Invoice_Document; each Invoice child has the leaves number, carrier, and total]

XML Data Model (for Querying)
- SQL: relations in, relation out.
- Relational Algebra: relations in, relation out.
- XQuery: XML docs in, XML docs out.
- XML Algebra: ??

XML Data Model [GVDNM01]
- Collection of bags of vertices.
- Vertices in a bag have no order.
- Example:
    Root invoice.xml
    invoice: <invoice> invoice-element-content </invoice>
    invoice.account_number: <account_number> element-content </account_number>
    Bag: [Root "invoice.xml", invoice.account_number]

Data Model
- Bag elements are reachable by path expressions.
- A path expression consists of two parts:
    - An entry point
    - A relative forward part
- Example: account_number: invoice (relative forward part account_number, entry point invoice)

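One way to picture a bag and a path expression is the short Python sketch below. It is only an illustration under stated assumptions: a bag is modelled as a dict from entry-point name to an ElementTree element, and the helper name `reachable` is hypothetical, not part of Niagara.

```python
import xml.etree.ElementTree as ET

# Assumption: a bag is a dict from entry-point name to a vertex (Element).
invoice = ET.fromstring(
    "<invoice><account_number>2</account_number>"
    "<carrier>Sprint</carrier></invoice>"
)
bag = {"invoice": invoice}


def reachable(bag, path_expr):
    """Evaluate 'relative_part: entry_point' against a single bag."""
    relative, entry = (part.strip() for part in path_expr.split(":"))
    # Start at the entry point and follow the relative forward part.
    return bag[entry].findall(relative)


hits = reachable(bag, "account_number: invoice")
print([vertex.text for vertex in hits])
```

The path expression is split at the colon, so the entry point selects a vertex already in the bag and the relative part is evaluated from there.
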
Outline
- Concepts of Niagara Algebra
- Operations
- Optimization

Operators
- Source (S), Follow (Φ), Expose, Vertex
- Select, Join, Rename, Group, Union, Intersection, Difference (-), Cartesian Product

Source Operator S
- Input: a list of documents
- Output: a collection of singleton bags
- Examples:
    S(*)                 All known XML documents
    S(invoice*.xml)      All XML documents whose filename matches "invoice*.xml"
    S(*, schema.dtd)     All known XML documents that conform to schema.dtd

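A minimal sketch of the Source operator, assuming the "known documents" are held in an in-memory dict (the name `KNOWN_DOCS` and the file names in it are hypothetical). It covers only the filename-pattern form; DTD conformance checking is left out.

```python
from fnmatch import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical in-memory set of known documents: file name -> XML text.
KNOWN_DOCS = {
    "invoice1.xml": "<Invoice_Document><invoice/></Invoice_Document>",
    "customer.xml": "<Customer_Document><customer/></Customer_Document>",
}


def source(pattern, docs=KNOWN_DOCS):
    """Return one singleton bag per document whose name matches pattern."""
    return [
        {"Root " + name: ET.fromstring(text)}
        for name, text in sorted(docs.items())
        if fnmatch(name, pattern)
    ]


all_docs = source("*")
invoices = source("invoice*.xml")
print(len(all_docs), len(invoices))
```

Each output bag holds exactly one vertex, the document root, which matches the "collection of singleton bags" described on the slide.
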
Follow operator
- Input: a path expression in entry point notation
- Functionality: extracts the vertices reachable by the path expression
- Output: a new bag that consists of the extracted vertex plus all contents of the original bag (in the case of an unnesting follow)

Follow operator (Example*)
Input:    {[Root invoice.xml, invoice]}
            invoice: <invoice> invoice-element-content </invoice>
Operator: Φμ(carrier: invoice)
Output:   {[Root invoice.xml, invoice, invoice.carrier]}
            invoice.carrier: <carrier> carrier-element-content </carrier>
*Unnesting Follow

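The unnesting follow can be sketched in a few lines of Python. This is an illustration, not the Niagara implementation: bags are dicts, the function name `follow_unnest` is hypothetical, and the sketch assumes that an input bag with no reachable vertex produces no output bag.

```python
import xml.etree.ElementTree as ET


def follow_unnest(collection, path_expr):
    """One output bag per reached vertex; the input bag's entries are kept."""
    relative, entry = (part.strip() for part in path_expr.split(":"))
    out = []
    for bag in collection:
        for vertex in bag[entry].findall(relative):
            new_bag = dict(bag)  # copy the original bag's contents
            new_bag[entry + "." + relative] = vertex
            out.append(new_bag)
    return out


doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><carrier>Sprint</carrier></invoice>"
    "<invoice><total>$0.75</total></invoice>"
    "</Invoice_Document>"
)
invoices = [{"invoice": inv} for inv in doc.findall("invoice")]
result = follow_unnest(invoices, "carrier: invoice")
print([sorted(bag) for bag in result])
```

Only the first invoice has a carrier, so the result holds a single bag with the entries invoice and invoice.carrier.
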
Select operator
- Input: a set of bags
- Functionality: filters the bags of a collection using a predicate
- Output: a set of bags that conform to the predicate
- Predicate: built from logical operators and simple qualifications (for example, invoice.carrier = Sprint)

Select operator (Example)
Input:    {[Root invoice.xml, invoice], [Root invoice.xml, invoice], ...}
            invoice: <invoice> invoice-element-content </invoice>
Operator: Select (invoice.carrier = Sprint)
Output:   {[Root invoice.xml, invoice], ...}

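As a rough sketch, Select is a filter over a collection of bags. The dict-based bag model and the function name `select` are assumptions made for illustration; the ampersand in AT&T is escaped because ElementTree requires well-formed XML.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number>"
    "<carrier>AT&amp;T</carrier></invoice>"
    "<invoice><account_number>1</account_number>"
    "<carrier>Sprint</carrier></invoice>"
    "<invoice><account_number>1</account_number></invoice>"
    "</Invoice_Document>"
)
collection = [{"invoice": inv} for inv in doc.findall("invoice")]


def select(collection, predicate):
    """Keep only the bags for which the predicate holds."""
    return [bag for bag in collection if predicate(bag)]


def carrier_is_sprint(bag):
    # findtext returns None when the invoice has no carrier child.
    return bag["invoice"].findtext("carrier") == "Sprint"


sprint = select(collection, carrier_is_sprint)
print([bag["invoice"].findtext("account_number") for bag in sprint])
```
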
Join operator
- Input: two collections of bags
- Functionality: joins the two collections based on a predicate
- Output: the concatenation of the pairs of bags that satisfy the predicate

Join operator (Example)
Inputs:    {[Root invoice.xml, invoice]}
             invoice: <invoice> invoice-element-content </invoice>
           {[Root customer.xml, customer]}
             customer: <customer> customer-element-content </customer>
Predicate: account_number: invoice = number: customer
Output:    {[Root invoice.xml, invoice, Root customer.xml, customer]}

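The join can be sketched as a nested loop that concatenates each qualifying pair of bags. This is an illustration only: the dict-based bags and the names `join` and `same_account` are hypothetical, and the predicate compares against the customer's account element as it appears in the sample documents.

```python
import xml.etree.ElementTree as ET

invoices = [
    {"invoice": ET.fromstring(
        f"<invoice><account_number>{number}</account_number></invoice>")}
    for number in (2, 1, 1)
]
customers = [
    {"customer": ET.fromstring(
        f"<customer><account>{number}</account><name>{name}</name></customer>")}
    for number, name in ((1, "Tom"), (2, "George"))
]


def join(left, right, predicate):
    """Concatenate every pair of bags that satisfies the predicate."""
    return [
        {**left_bag, **right_bag}
        for left_bag in left
        for right_bag in right
        if predicate(left_bag, right_bag)
    ]


def same_account(left_bag, right_bag):
    return (left_bag["invoice"].findtext("account_number")
            == right_bag["customer"].findtext("account"))


joined = join(invoices, customers, same_account)
print([bag["customer"].findtext("name") for bag in joined])
```

A nested loop is the simplest strategy to show; a real engine would be free to pick a hash or merge join for the same algebraic operator.
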
Expose operator
- Input: a list of path expressions of vertices to be exposed
- Output: a set of bags that contain the vertices in the parameter list, in the same order

Expose operator (Example)
Input:    {[Root invoice.xml, invoice, invoice.carrier, invoice.bill_period]}
            invoice: <invoice> invoice-element-content </invoice>
            invoice.carrier: carrier-element-content
            invoice.bill_period: bill_period-element-content
Operator: Expose (bill_period, carrier)
Output:   {[Root invoice.xml, invoice.bill_period, invoice.carrier]}

Vertex operator
- Creates the actual XML vertex that will encompass everything created by an expose operator
- Example: Vertex (Customer_invoice)[Expose (Vertex (account)[invoice.account_number], Vertex (inv_total)[invoice.total])]

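To make the construction step concrete, here is a small Python sketch that builds a Customer_invoice element around two exposed vertices. The helper names `vertex` and `renamed` are hypothetical, and copying only the text content into the new element is a simplifying assumption.

```python
import xml.etree.ElementTree as ET


def vertex(tag, children):
    """Create a new element named tag that wraps the given children."""
    elem = ET.Element(tag)
    elem.extend(children)
    return elem


def renamed(tag, source):
    """Create a new element named tag carrying the text of source."""
    elem = ET.Element(tag)
    elem.text = source.text
    return elem


invoice = ET.fromstring(
    "<invoice><account_number>1</account_number>"
    "<total>$1.20</total></invoice>"
)
bag = {
    "invoice.account_number": invoice.find("account_number"),
    "invoice.total": invoice.find("total"),
}
result = vertex("Customer_invoice", [
    renamed("account", bag["invoice.account_number"]),
    renamed("inv_total", bag["invoice.total"]),
])
print(ET.tostring(result, encoding="unicode"))
```
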
Other operators
- Group: used for arbitrary grouping of elements based on their values
    - Aggregate functions can be used with the group operator (e.g., average)
- Rename: changes the entry point annotation of elements of a bag
    - Example: Rename (invoice.bill_period, date)

Example: XML Source Documents

Invoice.xml
    <Invoice_Document>
      <invoice>
        <account_number>2</account_number>
        <carrier>AT&T</carrier>
        <total>$0.25</total>
      </invoice>
      <invoice>
        <account_number>1</account_number>
        <carrier>Sprint</carrier>
        <total>$1.20</total>
      </invoice>
      <invoice>
        <account_number>1</account_number>
        <total>$0.75</total>
      </invoice>
      <auditor>maria</auditor>
    </Invoice_Document>

Customer.xml
    <Customer_Document>
      <customer>
        <account>1</account>
        <name>Tom</name>
      </customer>
      <customer>
        <account>2</account>
        <name>George</name>
      </customer>
    </Customer_Document>

XQuery Example
List account number, customer name, and invoice total for all invoices that have carrier = "Sprint".
    FOR $i in (invoices.xml)//invoice,
        $c in (customers.xml)//customer
    WHERE $i/carrier = "Sprint"
      and $i/account_number = $c/account
    RETURN
      <Sprint_invoices>
        $i/account_number, $c/name, $i/total
      </Sprint_invoices>

Example: XQuery output
    <Sprint_invoices>
      <account_number>1</account_number>
      <name>Tom</name>
      <total>$1.20</total>
    </Sprint_invoices>

Algebra Tree Execution
Read bottom-up; the bags produced by each operator are shown after the arrow.
    Expose (*.account_number, *.name, *.total)  ->  account_number, name, total
      Join (*.invoice.account_number = *.customer.account)  ->  invoice(2) customer(1)
        Select (carrier = "Sprint")  ->  invoice(2)
          Follow (*.invoice)  ->  invoice(1), invoice(2), invoice(3)
            Source (Invoices.xml)
        Follow (*.customer)  ->  customer(1), customer(2)
          Source (customers.xml)

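The plan above can be traced end to end with a small Python sketch over the sample documents. It is a toy evaluator under stated assumptions, not Niagara's engine: bags are dicts, all function names are hypothetical, the documents are inlined as strings, and AT&T is escaped so the XML parses.

```python
import xml.etree.ElementTree as ET

INVOICES = """<Invoice_Document>
<invoice><account_number>2</account_number><carrier>AT&amp;T</carrier>
<total>$0.25</total></invoice>
<invoice><account_number>1</account_number><carrier>Sprint</carrier>
<total>$1.20</total></invoice>
<invoice><account_number>1</account_number><total>$0.75</total></invoice>
<auditor>maria</auditor>
</Invoice_Document>"""
CUSTOMERS = """<Customer_Document>
<customer><account>1</account><name>Tom</name></customer>
<customer><account>2</account><name>George</name></customer>
</Customer_Document>"""


def source(text):
    return [{"root": ET.fromstring(text)}]


def follow(collection, entry, tag):
    # Unnesting follow: one bag per descendant named tag.
    return [{**bag, tag: v} for bag in collection for v in bag[entry].iter(tag)]


def select(collection, predicate):
    return [bag for bag in collection if predicate(bag)]


def join(left, right, predicate):
    return [{**a, **b} for a in left for b in right if predicate(a, b)]


def expose(collection, paths):
    return [tuple(bag[e].findtext(t) for e, t in paths) for bag in collection]


invoices = follow(source(INVOICES), "root", "invoice")
customers = follow(source(CUSTOMERS), "root", "customer")
sprint = select(invoices, lambda b: b["invoice"].findtext("carrier") == "Sprint")
joined = join(
    sprint, customers,
    lambda a, b: a["invoice"].findtext("account_number")
    == b["customer"].findtext("account"),
)
rows = expose(joined, [("invoice", "account_number"),
                       ("customer", "name"),
                       ("invoice", "total")])
print(rows)
```

The single row produced, account number 1 with name Tom and total $1.20, agrees with the XQuery output shown earlier.
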
Outline
- Concepts of Niagara Algebra
- Operations
- Optimization

Optimization with Niagara
Optimizer based on the Niagara algebra:
- Use the operations more efficiently
- Produce simpler expressions by combining operations

Language Convention
- A and B are path expressions
- A < B: path expression A is a prefix of B
- An.B: common prefix of paths A and B
- AńB: greatest common prefix of paths A and B
- ⊥: null path expression

Heuristics using Rewrite Rules
- Allow optimization based on path selectivity
- When applying un-nesting with the operation Φμ

Interchangeability of Follow operation
Φμ(A)[Φμ(B)] = Φμ(B)[Φμ(A)]: TRUE or FALSE?
TRUE when
- there exists C such that C < A and C < B and C = AńB, or
- An.B = ⊥

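The condition can be checked mechanically. The sketch below is an illustration under assumptions: a path expression such as "acc_Num: invoice" is represented as a tuple of steps starting at the entry point, the prefix relation is read as a proper prefix, and the function names are hypothetical.

```python
def greatest_common_prefix(a, b):
    """Longest sequence of leading steps shared by paths a and b."""
    prefix = []
    for step_a, step_b in zip(a, b):
        if step_a != step_b:
            break
        prefix.append(step_a)
    return tuple(prefix)


def is_proper_prefix(c, path):
    return len(c) < len(path) and path[:len(c)] == c


def follows_commute(a, b):
    """True when the two unnesting follows may be interchanged."""
    c = greatest_common_prefix(a, b)
    if not c:  # null common prefix: the paths are unrelated
        return True
    return is_proper_prefix(c, a) and is_proper_prefix(c, b)


acc_num = ("invoice", "acc_Num")
carrier = ("invoice", "carrier")
cust_acc = ("customer", "acc_Num")

print(follows_commute(acc_num, carrier))
print(follows_commute(acc_num, cust_acc))
print(follows_commute(("invoice",), carrier))
```

The third call shows the case the rule excludes: when one path is itself a prefix of the other, the second follow depends on the vertices produced by the first, so the order cannot be swapped.
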
Application of Rule on Invoice
Φμ(acc_Num: invoice)[Φμ(carrier: invoice)] = Φμ(carrier: invoice)[Φμ(acc_Num: invoice)]?
TRUE or FALSE?

Application of Rule on Invoice
Φμ(acc_Num: invoice)[Φμ(carrier: invoice)] = Φμ(carrier: invoice)[Φμ(acc_Num: invoice)]
TRUE, because both paths share the common prefix "invoice" (case AńB = invoice).

Benefit of Rule Application
NOTE: Assume acc_Num is required for each invoice element, while carrier is not.
THEN: Φμ(acc_Num: invoice)[Φμ(carrier: invoice)] = Φμ(carrier: invoice)[Φμ(acc_Num: invoice)]
Which algebra tree do we prefer?

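Discussion
Reduction of input size on the first sub-operation: Φμ(carrier: invoice) vs. Φμ(acc_Num: invoice)
If carrier is optional while acc_Num is required, applying Φμ(carrier: invoice) first can pass fewer bags to the second follow than applying Φμ(acc_Num: invoice) first.
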
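The effect on intermediate result size can be seen in a short Python sketch. It is an illustration only: bags are dicts, `follow_unnest` is a hypothetical helper that emits no bag when nothing is reachable, and the sample data has three invoices of which two carry a carrier element.

```python
import xml.etree.ElementTree as ET


def follow_unnest(collection, relative, entry):
    """Unnesting follow: one output bag per vertex reached from entry."""
    out = []
    for bag in collection:
        for vertex in bag[entry].findall(relative):
            out.append({**bag, entry + "." + relative: vertex})
    return out


doc = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><account_number>2</account_number>"
    "<carrier>AT&amp;T</carrier></invoice>"
    "<invoice><account_number>1</account_number>"
    "<carrier>Sprint</carrier></invoice>"
    "<invoice><account_number>1</account_number></invoice>"
    "</Invoice_Document>"
)
invoices = [{"invoice": inv} for inv in doc.findall("invoice")]

# Plan A: the optional path (carrier) is followed first.
carrier_first = follow_unnest(invoices, "carrier", "invoice")
plan_a = follow_unnest(carrier_first, "account_number", "invoice")

# Plan B: the required path (account_number) is followed first.
acc_first = follow_unnest(invoices, "account_number", "invoice")
plan_b = follow_unnest(acc_first, "carrier", "invoice")

print(len(carrier_first), len(acc_first))
print(len(plan_a), len(plan_b))
```

Both plans end with the same number of bags, but Plan A hands two bags to its second follow where Plan B hands three, which is the input-size reduction the slide refers to.
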
Can we apply the rule below?
Φμ(acc_Num: invoice)[Φμ(acc_Num: Customer)]

Example
"acc_Num: invoice" and "acc_Num: customer" are two totally different paths.
The case is An.B = ⊥, so yes, the rule is valid.

Summary
- XML Algebra
- Operations
- Optimization