Universal Database Systems Part 4 Databases and XML

Universal Database Systems Part 4: Databases and XML UDBS Part 4 -Winter 2001/2

Overview 3. Introduction to XML DTDs and Schemas for XML Documents Languages for XML, in particular XSL 4. Querying and Storing XML 5. Summary and Outlook 1. 2. UDBS Part 4 -Winter 2001/2 154

What Else Makes XML Relevant to Databases? l l l Document storage, retrieval, and indexing Views and data warehouses Transformations between relational data representations and XML Mediators, data integration Schema aspects, integrity constraints, design, reverse engineering UDBS Part 4 -Winter 2001/2 155

Big Picture application object-relational Integrate XML Data Transform WEB (HTTP) Warehouse application relational data UDBS Part 4 -Winter 2001/2 legacy data 156

What Do We Look At? Storage of XML documents l Publication of XML document l Sample vendor approaches l UDBS Part 4 -Winter 2001/2 157

Storage vs Publication l Storage • Where and how to store native XML documents • Here: Storage in relational DBs l Publication • Wrap arbitrary content in XML documents • Here: Publication of relational content UDBS Part 4 -Winter 2001/2 158

Storing XML Data l Scenario: • receive a large XML data instance • want to store, manage it, query it l Solutions: • build an XML management system from scratch (e. g. , Tamino) • preferably: use existing database systems l The Storage Problem: map XML data into relational UDBS Part 4 -Winter 2001/2 159

Approaches l Table-oriented: • As a column value • Across multiple tables As a BLOB or CLOB l With appropriate (object) functionality, e. g. , l • Extender • Data Blade UDBS Part 4 -Winter 2001/2 160

Options XML document as a text file (or CLOB) l Using ternary relations l Using a DTD or XML Schema for deriving a database schema l Alternatively: derivation of a schema through mining from the data l UDBS Part 4 -Winter 2001/2 161

Text File/CLOB l Advantages • simple • less space than you think • reasonable clustering l Disadvantages • no updates • needs specific query processor UDBS Part 4 -Winter 2001/2 162

Ternary Relation (edge-oriented) Ref Source Label Dest &o 1 &o 2 paper &o 2 title &o 3 author year author &o 4 "The Calculus" "…" &o 5 &o 6 "…" "1986" Val Node &o 3 &o 4 &o 5 &o 6 paper title author year &o 2 &o 3 &o 4 &o 5 &o 6 Value The Calculus … … 1986 Order, attributes, comments? UDBS Part 4 -Winter 2001/2 163

Other Relational Representations l l More detailed edge representation, e. g. , (Source, Dest, Name, Type, Ordinal) with additional tables for all types Universal relation for storing edges (as full outer join of all ternary relations as above) Separate value tables per type Inline representation of the various types, e. g. , (Source, Dest, Name, Int, String, Ordinal) with null values for the non-applicable types UDBS Part 4 -Winter 2001/2 164

Derive Schema from DTD (1) l DTD: <!ELEMENT bib (paper*)> <!ELEMENT paper (author*, title, year)> <!ELEMENT author (firstname, lastname)> l l Relational Schema: • • • Paper(pid, title, year) Author(aid, fn, ln) Paper. Author(pid, aid) Sometimes this is poor. E. g. • • • 80% of papers have <= 2 authors 18% have 3 authors 2% have 4 or more… UDBS Part 4 -Winter 2001/2 165

Derive Schema from DTD (2) l DTD: <!ELEMENT employee (name, address, project*)> <!ELEMENT address (street, city, state, zip)> l ODMG classes: class Employee public type tuple (name: string, address: Address, project: List(Project)) class Address public type tuple (street: string, …) UDBS Part 4 -Winter 2001/2 166

Storage vs Publication l Storage • Where and how to store native XML documents l Publication • Wrap arbitrary content in XML documents • Here: Publication of relational content § § Current business data is relational data Scalability, reliability, performance UDBS Part 4 -Winter 2001/2 167

Publication Scenario Relational Database publish XML Documents transform Web Server (XSL) UDBS Part 4 -Winter 2001/2 HTML 168

Pragmatic ("Rowset") Approach l Tables represented as simple XML trees: • • • table = root each row becomes a nested element each value becomes another nested element SUPPLIERS SNO SNAME <suppliers> <s_tuple> <sno> <sname> UDBS Part 4 -Winter 2001/2 PARTS PNO CATALOG DESCRIP <parts> <p_tuple> <pno> <descrip> SNO PRICE <catalog> <c_tuple> <sno> <price> 169

Pragmatic ("Rowset") Approach (2) No "natural" XML l No nesting, no hierarchies, no mapping for (foreign) keys l May use XSLT to obtain "real" XML l Better: Publish structured documents l UDBS Part 4 -Winter 2001/2 170

Publish XML Documents Following presentation based on l Jayavel Shanmugasundaram et al: Efficiently Publishing Relational Data as XML Documents, VLDB 2000 Two issues l Language for conversion • Flat relational data to nested XML l Implementation of conversion • Efficient conversions UDBS Part 4 -Winter 2001/2 171

Sample Conversion Language Relational data SQL extension l Natural extension UDFs l More specific l • Nesting through subqueries • UDFs to construct XML elements/attributes from SQL data • Aggregate functions to group children UDBS Part 4 -Winter 2001/2 172

Sample Relational Database Department Dept. Id Dept. Name 10 Purchasing Employee Emp. Id 101 Dept. Id Emp. Name Salary 10 John 50 K 91 Task: Publish single XML document with information on departments (including their employees and projects) UDBS Part 4 -Winter 2001/2 10 Mary 70 K Project 888 Dept. Id 10 Proj. Name Internet 795 10 Recycling Proj. Id 173

Publication Query – Structure Select DEPT(d. name, <subquery to produce emplist>, <subquery to produce projlist> ) From Department d UDBS Part 4 -Winter 2001/2 174

XML Constructor Create Function DEPT(dname: varchar(20), emplist: xml, projlist: xml) As ( <department name={dname}> <emplist> {emplist} </emplist> <projlist> {projlist} </projlist> </department> ) UDBS Part 4 -Winter 2001/2 175

Publication Query Select DEPT(d. name, (Select XMLAGG(EMP(e. name)) <subquery From Employee to produce e emplist>, Where e. deptno = d. deptno), (Select XMLAGG(PROJ(p. name)) <subquery From Project to produce p projlist> Where ) p. deptno = d. deptno)) From Department d UDBS Part 4 -Winter 2001/2 176

Query Result <department name="Purchasing"> <emplist> <employee> John </employee> <employee> Mary </employee> </emplist> <project> Internet </project> <project> Recycling </project> </projlist> </department> UDBS Part 4 -Winter 2001/2 177

Implementation of Conversion l Two main differences: • Nesting (structuring) • Tagging l Space of alternatives: Early Tagging Early Structuring Outside Engine Inside Engine Late Structuring Late Tagging Outside Engine Inside Engine Outside Engine Not applicable Inside Engine UDBS Part 4 -Winter 2001/2 178

Options l Late vs early tagging • Late tagging: final step of query processing • Early tagging: otherwise l Late vs early structuring • Late structuring: final step of query processing • Early structuring: otherwise l Inside vs outside engine • Inside: completely inside db engine • Outside: otherwise (ignored in following) UDBS Part 4 -Winter 2001/2 179

Early Tagging, Early Structuring, Outside Engine: Stored Procedure Approach l l Issue queries for sub-structures and tag them Could be a Stored Procedure (10, Purchasing) DBMS Engine Department Employee (Internet) (Recycling) (John) (Mary) Project l Problem: Too many SQL queries! UDBS Part 4 -Winter 2001/2 180

Early Tagging, Early Structuring, Inside Engine: Correlated CLOB Approach Query seen above (with UDFs to create XML) l Problem: Correlated execution of sub-queries l UDBS Part 4 -Winter 2001/2 181

Early Tagging, Early Structuring, Inside Engine: De-Correlated CLOB Approach Compute employee lists associated with all departments l Compute project lists associated with all departments l Join results above on department id l Problem: CLOBs during query processing l UDBS Part 4 -Winter 2001/2 182

Late Tagging, Late Structuring: Redundant Relation Approach l How do we represent nested content as relations? (10, John) (10, Mary) (10, Purchasing) (10, Internet) (10, Recycling) l (Purchasing, John, Internet) (Purchasing, John, Recycling) (Purchasing, Mary, Internet) (Purchasing, Mary, Recycling) Problem: Large relation due to data redundancy! UDBS Part 4 -Winter 2001/2 183

Late Tagging, Late Structuring: Outer Union Approach l How do we represent nested content as relations? Union (Purchasing, null, Internet , 0) (Purchasing, null, Recycling, 0) (Purchasing, John, null , 1) (Purchasing, Mary, null , 1) Department Employee Project (Purchasing, John) (Purchasing, Mary) Project Department (Purchasing, Internet) (Purchasing, Recycling) (10, Purchasing) l Problem: Wide tuples (having many columns) UDBS Part 4 -Winter 2001/2 184

Late Tagging, Late Structuring: Hash-based Tagger l Results not structured early • In arbitrary order l Tagger has to enforce order during tagging • Hash-based approach Inside/Outside engine tagger l Problem: Requires memory for entire document l UDBS Part 4 -Winter 2001/2 185

Late Tagging, Early Structuring: Sorted Outer Union Approach A ABn. Dnnn B ABnn. Enn C An. Cnn. Fn D E F G An. Cnnn. G Sort By: Aid, Bid, Cid l Problem: Only partial ordering required UDBS Part 4 -Winter 2001/2 186

Late Tagging, Early Structuring: Constant Space Tagger Detects changes in XML document hierarchy l Adds appropriate opening/closing tags l Inside/outside engine l UDBS Part 4 -Winter 2001/2 187

Performance of Alternatives Constructing XML inside engine more efficient than outside l When processing can be done in main memory: Late tagging, late structuring with outer union l Otherwise: Late tagging, early structuring with sorted outer union l UDBS Part 4 -Winter 2001/2 188

Sample DB Vendors and XML 1. 2. Oracle 9 i IBM DB 2 UDB V 7 UDBS Part 4 -Winter 2001/2 189

Oracle – CLOB Datatype XMLType (internally a CLOB) l Predefined functions l • create. XML: creates XMLType-instance from string (if well-formed) • extract: applies XPath expression to XMLType-instance and returns XMLType • exists. Node: checks whether XMLTypeinstance has non-empty result for given XPath expression UDBS Part 4 -Winter 2001/2 190

Oracle – Generate XML l Functions • SYS_XMLGEN: takes single argument and converts it to an element • SYS_XMLAGG: concatenates XML fragments l Utility XSU (XML SQL Utility): Implements "Rowset" approach UDBS Part 4 -Winter 2001/2 191

DB 2 – CLOB l Three datatypes • XMLCLOB (outside table) • XMLVARCHAR (inside table) • XMLFILE (external file) Checks against DTD possible l PAGE tables? Via DADs l UDBS Part 4 -Winter 2001/2 192

Basic Approach l XML document is stored • completely (in a column of type XML, or an "XML column") or as file reference, or • in multiple tables as result of a mapping (an as "XML collection") l A Document Access Definition (DAD) specifies how XML documents are stored and published, how XML maps to tables and vice versa UDBS Part 4 -Winter 2001/2 193

XML Column DB 2 XML document <? xml? > <!DOCTYPE. . . > <Order key = "1"> </Order> XMLCLOB/XMLVARCHAR UDBS Part 4 -Winter 2001/2 UDTs: - XMLCLOB - XMLVARCHAR - XMLFILE UDFs: - Import/Storage - Retrieval - Extract - Update 194

Legend l l XMLFile for external file names XMLVarchar for internal short documents XMLCLOB for internal long documents Extract • Extracts XML element/attribute values from documents • Converts values from XML documents into SQL data types • Provides scalar as well as tabular UDFs UDBS Part 4 -Winter 2001/2 195

Example Invoice. Number Order 355 . . . 356 . . <order>. . <part>. . <extended. Price>1000</. . . 357 . . . Select db 2 xml. extract. Double(Order, ‘/order/part/extended. Price‘) from Order. Table where Invoice. Number = 356 UDBS Part 4 -Winter 2001/2 196

XML Collection DB 2 XML document <? xml? > DAD <!DOCTYPE. . . > <Order key = "1"> </Order> Stored Procedures - Composition - Decomposition Collection UDBS Part 4 -Winter 2001/2 197

Sample DAD <DAD> <Xcollection> <SQL_stmt> SELECT book_id, price_date, price_text FROM book_table ORDER BY price_date </SQL_stmt> <doctype> <root_node>. . . </root_node> </doc_type> </Xcollection> </DAD> UDBS Part 4 -Winter 2001/2 198

Overview 3. Introduction to XML DTDs and Schemas for XML Documents Languages for XML, in particular XSL 4. Querying and Storing XML 5. Summary and Outlook 1. 2. UDBS Part 4 -Winter 2001/2 199

Discussion l l Do XML documents have to be stored directly in the database? XML documents are highly redundant (from a database perspective) The efficiency of a relational system (partially) comes from normalization Compromise: XML as an "intermediary" between the database and, say, a Web server UDBS Part 4 -Winter 2001/2 200

Summary l l XML asks for query languages, database-style Database vendors experiment with • • l XML extensions architectures languages internal data models Many open issues, e. g. , • • • Graphical query languages Updates ACM SIGMOD 2001 Views defined in the query language Referential integrity, triggers, rules Distributed XML storage systems UDBS Part 4 -Winter 2001/2 201

Outlook for DBMS l l l XML is and important database topic (both for practitioneers and for theoreticians) Declarative querying SQL-style is attractive Will there be a renaissance of hierarchical DBMS? Workshop: Web. DB, annually, German counterpart as GIArbeitskreis "Web und Datenbanken" Initiatives found on the Web: XML: DB and Xindice UDBS Part 4 -Winter 2001/2 202

x. Lx Competition - Results 2. Place, 248 Points • Christian Birmes • Victor Pankratius • Tobias Rieke 1. Place, 250 Points • Kai Honsel UDBS Part 4 -Winter 2001/2 203

UDBS Winter 2001/2 Thank You For Listening! UDBS Part 4 -Winter 2001/2 204
- Slides: 52