Universal Database Systems Part 4 Databases and XML
Universal Database Systems Part 4: Databases and XML UDBS Part 4 -Winter 2001/2
Overview 3. Introduction to XML DTDs and Schemas for XML Documents Languages for XML, in particular XSL 4. Querying and Storing XML 5. Summary and Outlook 1. 2. UDBS Part 4 -Winter 2001/2 154
In Detail Query Usage Scenarios, Query Requirements, Query Data Model, l XML Query Algebra as a foundation l XQuery: a query language for XML l Storing XML Documents in Databases l DBMS vendor developments l UDBS Part 4 -Winter 2001/2 155
Why Query Languages for XML ? l XML with SAX or DOM: • SAX: very simple "event-based" queries • DOM: simple navigational queries (get. Child. Nodes, get. Next. Sibling, get. Elements. By. Tag. Name, …) l These are "low-level" APIs • • • l similar to an iterator/cursor API for RDBs used to write XML applications "high-level" querying, restructuring and transformation is tedious Thus, analogue to high-level relational query languages (SQL, QBE, Logic) needed UDBS Part 4 -Winter 2001/2 156
e-shopper‘s_heaven. com l How can e-shopper‘s let Web users search or query its product database ? e-shopper‘s heaven UDBS Part 4 -Winter 2001/2 Select director From moviecatalog Where title = ‘MI 2‘ 157
History of XQuery l l l Dave Maier‘s requirements at QL’ 98 XML-QL shortly thereafter Many other proposal between 1998 and 2000 Initial products, e. g. , XQL in Tamino @ 2000 XQuery Working Draft since Feb. 15, 2001 (revised June 7, 2001; December 20, 2001) UDBS Part 4 -Winter 2001/2 158
XML Query Specifications (Working drafts – ongoing work!) l l l l XML Query Requirements XML Query Use Cases XQuery 1. 0 and XPath 2. 0 Data Model XQuery 1. 0: An XML Query Language XML Syntax for XQuery 1. 0 (XQuery. X) XQuery 1. 0 and XPath 2. 0 Functions and Operators Version 1. 0 XML Path Language (XPath) 2. 0 UDBS Part 4 -Winter 2001/2 159
Usage Scenarios for XML Queries l l l Queries on structured, human-readable documents or collections of documents to create other documents Queries on XML data (from multiple sources) to create new XML data Mixed document- and data-oriented queries Queries on administrative data such as configuration files or user profiles Filtering queries on streams of XML data UDBS Part 4 -Winter 2001/2 160
Usage Scenarios (cont‘d) l l Queries on DOM structures to return sets of nodes Queries on collections of documents managed by native XML repositories Queries to search catalogs that describe document servers, document types, XML schemas, or documents Queries in multiple syntactic environments, e. g. , a URL, an XML page, a JSP or ASP page UDBS Part 4 -Winter 2001/2 161
Corresponding Use Cases l l l l l XMP: Experiences and Exemplars TREE: Queries that preserve hierarchy SEQ: Queries based on Sequence R: Access to Relational Data SGML: Standard Generalized Markup Language TEXT: Full-text Search NS: Queries Using Namespaces PARTS: Recursive Parts Explosion REF: Queries based on Reference FNPARM: Functions and Parameters UDBS Part 4 -Winter 2001/2 162
Use Case XMP: DTD and Data l l <!ELEMENT bib (book* )> <!ELEMENT book (title, (author+ | editor+ ), publisher, price )> <!ATTLIST book year CDATA #REQUIRED > <!ELEMENT author (last, first )> <!ELEMENT editor (last, first, affiliation )> <!ELEMENT title (#PCDATA )> <!ELEMENT last (#PCDATA )> <!ELEMENT first (#PCDATA )> <!ELEMENT affiliation (#PCDATA )> <!ELEMENT publisher (#PCDATA )> <!ELEMENT price (#PCDATA ) Sample Data: <bib> <book year="1994"> <title>TCP/IP Illustrated</title> <author><last>Stevens</last><first>W. </first></author> <publisher>Addison-Wesley</publisher> <price> 65. 95</price> </book>. . . . </bib> UDBS Part 4 -Winter 2001/2 163
Use Case XMP: Sample Queries l l l Q 1: List books published by Addison-Wesley after 1991, including their year and title Q 4: For each author in the bibliography, list the author's name and the titles of all books by that author, grouped inside a "result" element. Q 11: For each book with an author, return the book with its title and authors. For each book with an editor, return a reference with the book title and the editor's affiliation. UDBS Part 4 -Winter 2001/2 164
Use Case R: Background l l Relational data(base) represented as XML document Tables become elements with their tuples nested inside Queries in XQuery can be run on this XML representation of the actual tables Example: database used by an online auction USERS: information on registered users ITEMS: items currently/recently for sale BIDS: has all bids on record, keyed by the id of the bidder and the number of the item to which the bid applies UDBS Part 4 -Winter 2001/2 165
Use Case R: Relational Data & DTD l Tables: USERS ( USERID, NAME, RATING ) ITEMS ( ITEMNO, DESCRIPTION, OFFERED_BY, START_DATE, END_DATE, RESERVE_PRICE ) BIDS ( USERID, ITEMNO, BID_DATE ) l DTDs: <!DOCTYPE users [ <!ELEMENT users (user_tuple*)> <!ELEMENT user_tuple (userid, name, rating? )> <!ELEMENT userid (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT rating (#PCDATA)>]>. . . UDBS Part 4 -Winter 2001/2 166
Use Case R: Sample Queries l l l Q 2: For all bicycles, list the item number, description, and highest bid (if any), ordered by item number. Q 4: List item numbers and descriptions of items that have no bids. Q 14: List item numbers and average bids for items that have received three or more bids, in descending order by average bid. UDBS Part 4 -Winter 2001/2 167
Maier‘s Requirements to a Query Language Closedness: language maps XML to XML l Precise semantics: formally defined l Optimizability: queries can be improved prior to execution to achieve better performance l Adequateness: language makes full use of the various XML features l UDBS Part 4 -Winter 2001/2 168
More Requirements l l Query operations: language supports selection, extraction, reduction, restructuring, and combination (join) No schema needed: language can be used in the absence of a DTD Schema exploitation: if a DTD is available, it can be used for syntax checking Server-side processing: queries are independent of their creation context UDBS Part 4 -Winter 2001/2 169
W 3 C Query Requirements l l l Query language may have more than one syntax binding, syntax must be expressed in XML Declarativity Protocol independence Defined error conditions Update capabilities in future versions Defined for finite instances UDBS Part 4 -Winter 2001/2 170
XQuery Data Model l l Also datamodel of XSLT 2. 0 (W 3 C Working Draft) Relies on XML Information Set (W 3 C Recommendation) plus • represents both XML 1. 0 character data and the simple and complex types of XML Schema, • represents collections of documents and collections of simple and complex values l l Schema availability Namespace awareness UDBS Part 4 -Winter 2001/2 171
Data Model - Formally l l Node-labeled, tree-constructor representation with node identity 7 kinds of tree nodes: document, element, attribute, namespace, processing_instruction, comment , text Functions to construct tree nodes ("constructors") Functions to access nodes' structure ("accessors") UDBS Part 4 -Winter 2001/2 172
XQuery 1. 0 Formal Semantics (As of June 2001) l l l Is closed (XML fragment -> XML fragment) Is a functional language, no side effects Provides a semantics for the query language supports query optimization through commutativity, associativity, and other laws Features include attributes, namespaces, scalar types, element identity, collation, key constraints Is relationally complete UDBS Part 4 -Winter 2001/2 173
Sample Data <bib> <book year="2000" isbn="1 -55860 -622 -X"> <title> Data on the Web </title> <author> Abiteboul </author> <author> Buneman </author> <author> Suciu </author> </book> <book year="2001" isbn="1 -xxxxx-yyy-z"> <title> XML Query </title> <author> Fernandez </author> <author> Suciu </author> </book> </bib> UDBS Part 4 -Winter 2001/2 174
XSD Fragment <xsd: group name="Bib"> <xsd: element name="bib"> <xsd: complex. Type> <xsd: group ref="Book" min. Occurs="0" max. Occurs="unbounded"/> </xsd: complex. Type> </xsd: element> </xsd: group> <xsd: group name="Book"> <xsd: element name="book"> <xsd: complex. Type> <xsd: attribute name="year" type="xsd: integer"/> <xsd: attribute name="isbn" type="xsd: string"/> <xsd: element name="title" type="xsd: string"/> <xsd: element name="author" type="xsd: string" min. Occurs="1" max. Occurs="unbounded" /> </xsd: complex. Type> </xsd: element> </xsd: group> UDBS Part 4 -Winter 2001/2 175
Algebra Representation TYPE Bib = ELEMENT bib (Book*) TYPE Book = ELEMENT book 2 types ( ATTRIBUTE year (xs: integer) & ATTRIBUTE isbn (xs: string) ELEMENT title (xs: string), (ELEMENT author(xs: string))+) LET $bib 0 : = global variable, bound <bib> to literal XML value <book year="2000" isbn="1 -55860 -622 -X"><title>Data on the Web</title> <author>Abiteboul</author><author>Buneman</author><author>Suciu</author> </book> <book year="2001" isbn="1 -XXXXX-YYY-Z"><title>XML Query</title> <author>Fernandez</author><author>Suciu</author> </book>), </bib> : Bib RETURN … UDBS Part 4 -Winter 2001/2 176
Algebra Characteristics l l Strongly typed, i. e. , the value of variable $bib 0 must be an instance of its declared type (or the expression is ill-typed) Operations: • Projection, Iteration, Selection • Quantification • Join, Restructuring, Aggregation • Functions, Structural Recursion UDBS Part 4 -Winter 2001/2 177
Projection l Return all author elements contained in book elements contained in $bib 0: $bib 0/book/author ==>(<author>Abiteboul</author>, <author>Buneman</author>, <author>Suciu</author>, <author>Fernandez</author>, <author>Suciu</author>) : (ELEMENT author (xs: string))* UDBS Part 4 -Winter 2001/2 1. an expression 2. the value of the expression 3. the type of the expression 178
Notes on Projection Order of author elements is preserved l Duplicate elements are preserved l Although a book can have >= 1 (+) authors, the query result may contain >= 0 (*) authors l The type of an expression depends only on the type of its subexpressions l UDBS Part 4 -Winter 2001/2 179
Goals of XQuery l Design a small, clean, easily implementable language l Cover the functionality required by all the XML Query use cases in a single language l Write queries that are concise and easily understood l Derived from Quilt, which in turn borrowed from XPath, XQL, XML-QL, SQL, and OQL UDBS Part 4 -Winter 2001/2 188
Antecedents: XPath and XQL l l Closely-related languages for navigating in a hierarchy A path expression is a series of steps Each step moves along an axis (children, ancestors, attributes, etc. ) and may apply a predicate XPath has an abbreviated syntax, adapted from XQL: § § § l /book[title = "War and Peace"] /chapter[title = "War"] //figure[contains(caption, "Guns")] XQL has some additional operators: BEFORE, AFTER, . . . UDBS Part 4 -Winter 2001/2 189
Antecedent: XML-QL l l Proposed by Deutsch, Fernandez, Florescu, Levy, Suciu WHERE-clause binds variables according to a pattern, CONSTRUCT-clause generates output: WHERE <part pno = $pno> $pname </> in "parts. xml", <supp sno = $sno> $sname </> in "supp. xml", <sp pno = $pno sno = $sno> </> in "sp. xml" CONSTRUCT <purchase> <partname> $pname </> <suppname> $sname </> </purchase> UDBS Part 4 -Winter 2001/2 190
Antecedents: SQL and OQL l l l SQL and OQL are database query languages SQL derives a new table from other tables by a series of clauses: SELECT - FROM - WHERE OQL is a functional language • • A query is an expression Expressions can take several forms Expressions can be nested and combined SELECT-FROM-WHERE is one form of OQL expression UDBS Part 4 -Winter 2001/2 191
XQuery: XML Query Language l Functional language • query is an expression • expressions can be nested • strongly typed (operands must conform to designated types) Some SQL design errors corrected l Semantics: based on "core syntax" l Proposed by Chamberlin, Clark, Florescu, Robie, Simeon, Stefanescu l UDBS Part 4 -Winter 2001/2 192
Types of Expressions l l l Primary expressions (variable, literal, function call …) Path expressions Sequence expressions Arithmetic expressions Comparison expressions Logical expressions Constructors FLWR expressions Sorting expressions Conditional expressions Quantified expressions Expressions on data types UDBS Part 4 -Winter 2001/2 193
Sample Expressions l A path expression (using abbreviated XPath syntax): • document("bids. xml")//bid[itemno="47"]/bid_amount l An expression using operators and functions: • ($x + $y) * foo($z) l An element constructor: • <bid> <userid> {$u} </userid> , <bid_amount> {$a} </userid> </bid> UDBS Part 4 -Winter 2001/2 194
A Sample FLWR Expression l "Find the description and average price of each red part that has at least 10 orders" FOR $p IN document("parts. xml") //part[color = "Red"] LET $o : = document("orders. xml") //order[partno = $p/partno] WHERE count($o) >= 10 RETURN <important_red_part> {$p/description} <avg_price> {avg($o/price)} </avg_price> </important_red_part> UDBS Part 4 -Winter 2001/2 195
A FLWR Expression l A FLWR expression binds some variables, applies a predicate, and constructs a new result. ( FOR | LET )+. . . WHERE? . . . RETURN. . RETURN_clause FOR_clause LET_clause UDBS Part 4 -Winter 2001/2 WHERE_clause 196
Data Flow in a FLWR Expression UDBS Part 4 -Winter 2001/2 197
FOR Clause , FOR variable IN expression l FOR is used for iterating over one or more collections l FOR introduces one or more variables, associating with each an expression l Tuples of variable bindings are drawn from Cartesian product of the sequences of values to which the expressions evaluate l Variable bindings are generated as ordered sequence UDBS Part 4 -Winter 2001/2 198
LET Clause , LET variable : = expression l LET is also used for binding variables (without iteration) l A LET clause produces a single binding for each variable (therefore it does not affect the number of binding-tuples) l The variable is bound to the value of expression, which may contain many nodes. l Document order is preserved among the nodes in each bound collection, unless expression contains a non-order-preserving function such as distinct( ). UDBS Part 4 -Winter 2001/2 199
FOR vs. LET l FOR $x IN /library/book results in many bindings, each of which binds $x to one book in the library l LET $x : = /library/book results in a single binding which binds $x to a list containing all books in the library UDBS Part 4 -Winter 2001/2 200
WHERE Clause WHERE boolean-expression l Applies predicate(s) to the tuples of bound variables l Retains only tuples that satisfy the predicate(s) l Preserves order of tuples, if any l May contain AND and OR l Applies scalar conditions to variables bound in a FOR clause (to individual nodes), e. g. , $p/color = "Red" l Applies set conditions to variables bound by a LET clause (to sequences of nodes), e. g. , avg($p/price) > 100 UDBS Part 4 -Winter 2001/2 201
RETURN Clause RETURN expression l Constructs the result of the FLWR expression, which may be a value, a node, or an ordered forest of nodes l Executed once for each tuple of bound variables generated by FOR and LET and satisfying WHERE l Preserves order of tuples, if any l OR, can impose a new order using a SORTBY clause l Often contains references to bound variables, nested subexpressions, or an element constructor, e. g. , <item> {$item/itemno} <avg_bid> {avg($b/bid_amount)} </avg_bid> </item> SORTBY itemno UDBS Part 4 -Winter 2001/2 202
Example 1 (a) in XQuery FOR $b IN document("eshoppers. xml")//BOOK[@category="technical"] RETURN <Technical. Book> {$b/title} </Technical. Book> In XSL: <xsl: template match="/"> <xsl: for-each select="//BOOK[@category='technical']"> <Technical. Book> <xsl: value-of select="TITLE"/> </Technical. Book> </xsl: for-each> </xsl: template> UDBS Part 4 -Winter 2001/2 203
Example 1 (b) in XQuery In XSL: FOR $b IN document("eshoppers. xml") //BOOK, $v IN document("eshoppers. xml") //VIDEO WHERE $b/YEAR = $v/YEAR RETURN <Book. And. Video. In. Year> <Year> {$b/YEAR} </Year> <Book> {$b/TITLE} </Book> <Video> {$v/TITLE} </Video> </Book. And. Video. In. Year> <xsl: template match="/"> <xsl: for-each select="//BOOK[YEAR]"> <xsl: variable name="book" select=". "/> <xsl: for-each select="//VIDEO[YEAR=$book/YEAR]"> <xsl: variable name="video" select=". "/> <Book. And. Video. In. Year <Year><xsl: value-of select="$book/YEAR"/></Year> <Book><xsl: value-of select="$book/TITLE"/></Book> <Video><xsl: value-of select="$video/TITLE"/></Video> </Book. And. Video. In. Year> </xsl: for-each> </xsl: template> UDBS Part 4 -Winter 2001/2 204
Sample Document: bib. xml has the following structure: <bib> <book> <title>. . . </title> <author>. . . </author>. . . <publisher>. . . </publisher> <year>. . . </year> <price>. . . </price> </book>. . . </bib> UDBS Part 4 -Winter 2001/2 205
Simple FLWR Queries l Find all the books published in 2002 by Morgan Kaufmann: FOR $b IN document("bib. xml")//book WHERE $b/year = "2002" AND $b/publisher = "Morgan Kaufmann" RETURN $b SORTBY(author, title) l Find titles of books that have no authors: <orphan_books> FOR $b IN document("bib. xml")//book WHERE empty($b/author) RETURN $b/title </orphan_books> UDBS Part 4 -Winter 2001/2 206
More FLWRs l List each publisher and the average price of its books: FOR $p IN distinct(document("bib. xml")//publisher) LET $a : = avg(document("bib. xml")//book[publisher = $p]/price) RETURN <publisher> <name> {$p/text()} </name> <avgprice> {$a} </avgprice> </publisher> UDBS Part 4 -Winter 2001/2 207
More FLWRs l List the publishers who have published more than 100 books: <big_publishers> { FOR $p IN distinct(document("bib. xml")//publisher) LET $b : = document("bib. xml")//book[publisher = $p] WHERE count($b) > 100 RETURN $p } </big_publishers> UDBS Part 4 -Winter 2001/2 208
A Nested Query l Invert the hierarchy from publishers inside books to books inside publishers: FOR $p IN distinct(document("bib. xml")//publisher) RETURN <publisher> <name> {$p/text()} </name> { FOR $b IN document("bib. xml")//book[publisher = $p] RETURN <book> $b/title, $b/price </book> SORTBY(price DESCENDING) } </publisher> SORTBY(name) UDBS Part 4 -Winter 2001/2 209
More FLWRs l For each book whose price is greater than the average price, return the title of the book and the amount by which the book's price exceeds the average price: <result> { LET $a : = avg(document("bib. xml")//book/price) FOR $b IN document("bib. xml")//book WHERE $b/price > $a RETURN <expensive_book> {$b/title} <price_difference> {$b/price - $a} </price_difference> </expensive_book> } </result> UDBS Part 4 -Winter 2001/2 210
Conditional Expressions IF expr 1 THEN expr 2 ELSE expr 3 l Make a list of holdings, ordered by title. For journals, include the editor; otherwise include the author: FOR $h IN //holding RETURN <holding> { $h/title, IF $h/@type = "Journal" THEN $h/editor ELSE $h/author } </holding> SORTBY(title) UDBS Part 4 -Winter 2001/2 211
Quantified Expressions: Some SOME EVERY l l var IN expr SATISFIES predicate Quantified expressions are a form of predicate (return Boolean) Find titles of books in which both sailing and windsurfing are mentioned in the same paragraph: FOR $b IN document("bib. xml")//book WHERE SOME $p IN $b//para SATISFIES (contains($p, "Sailing") AND contains($p, "Windsurfing")) RETURN $b/title UDBS Part 4 -Winter 2001/2 212
Quantified Expressions: Every l Find titles of books in which sailing is mentioned in every paragraph: FOR $b IN //book WHERE EVERY $p IN $b//para SATISFIES contains($p, "sailing") RETURN $b/title UDBS Part 4 -Winter 2001/2 213
Nested Quantifications l Let employees have multiple skills and multiple duties. Find names of employees who have some duty that is not matched by a skill: FOR $e IN //emp WHERE SOME $d IN $e/duty SATISFIES not(SOME $s IN $e/skill SATISFIES $s = $d) RETURN $e/name UDBS Part 4 -Winter 2001/2 214
Functions l XQuery has a library of built-in functions, e. g. , l A query can define its own local functions Version 1 does not allow user-defined functions to be overloaded (multiple functions under the same name) The filter function can select a set of nodes from a hierarchy while preserving the original relationships among these nodes l l document, avg, sum, count, max, min, distinct, empty UDBS Part 4 -Winter 2001/2 215
Functions l l Functions can be recursive Example: "Compute the maximum depth of nested parts in the document named partlist. xml" NAMESPACE xsd = "http: //www. w 3. org/2001/XMLSchema" DEFINE FUNCTION depth($e) RETURNS xsd: integer { {-- empty element has depth 1 --} {-- otherwise, add 1 to max depth of children --} IF (empty($e/*)) THEN 1 ELSE max(depth($e/*)) + 1 } depth(document("partlist. xml") ) UDBS Part 4 -Winter 2001/2 216
Another Function l A function that returns all the elements that are "connected" to a given element by child or reference connections, and a recursive function that returns all the elements that are "reachable" from a given element by child or reference connections: In "company. xml", find all the elements that are reachable from the employee with serial number 12345 by child or reference connections. DEFINE FUNCTION connected($e) { $e/* UNION $e/@*=>* } DEFINE FUNCTION reachable($e) { $e UNION reachable(connected($e)) } reachable(document("company. xml")/emp[serial="12345"]) UDBS Part 4 -Winter 2001/2 217
The FILTER Function FILTER ( expression ) l Example: Result contains copies of all nodes of type A and B in the original hierarchy, with their original relationships preserved UDBS Part 4 -Winter 2001/2 218
Projection (Filtering a document) l "Generate a table of contents for cookbook. xml containing nested sections and their titles" <toc> { let $b : = document("book 1. xml") return filter($b//section | $b//section/title/text()) } </toc> UDBS Part 4 -Winter 2001/2 219
A Join Example l Generate a "descriptive catalog" derived from the catalog document, but containing part descriptions instead of part numbers and supplier names instead of supplier numbers. Order the new catalog alphabetically by part description and secondarily by supplier name: <descriptive-catalog> { FOR $i IN document("catalog. xml")//item, $p IN document("parts. xml")//part[partno = $i/partno], $s IN document("suppliers. xml")//supplier[suppno = $i/suppno] RETURN <item> { $p/description, $s/suppname, $i/price } </item> SORTBY(description, suppname) } </descriptive-catalog> UDBS Part 4 -Winter 2001/2 220
SQL vs. XQuery l SQL: SELECT var. title, var. isbn, var. year FROM bookcatalog AS var WHERE var. year > 1998 ORDER BY var. year DESC l XQuery: <RESULT> { FOR $var IN document("catalog. xml")//book WHERE $var/year > 1998 RETURN <BOOK> { $var/title, $var/isbn, $var/year } </BOOK> SORTBY(year) DESCENDING } </RESULT> UDBS Part 4 -Winter 2001/2 221
XQuery 1. 0 l l Many additional features, e. g. , sequencerelated operators, operations on data types, notion of a "query module" Many issues still under discussion (many changes from 02/15/01 to 06/07/01 and again to 20/12/01) A formal grammar is presented in the W 3 C document Moreover, XQuery. X is an XML representation of XQuery UDBS Part 4 -Winter 2001/2 222
XQuery Processors l Kawa-XQuery • Partial implementation • Queries are complied into Java bytecode • see http: //www. gnu. org/software/kawa/xquery l Microsoft's XQuery Language Demo at 131. 107. 228. 20 UDBS Part 4 -Winter 2001/2 223
Summary on XQuery Declarative query language for XML l Likely to become the standard for querying XML l Still working draft l Implementations under development l UDBS Part 4 -Winter 2001/2 224
- Slides: 64