
XML: Data Driving Business? Laks V. S. Lakshmanan, IIT Bombay and Concordia University

XML: Data Model • What is an XML document? – A linearization of a tree structure – Every node of the tree can have several character strings associated with it – The information content of the document is the tree structure together with the character strings • Is XML just a syntax for data interchange and serialization?

XML: Data Model • Types of nodes · Element, e.g. <p a1="A1" ... an="An">c1 ... cm</p> · Document, e.g. <!DOCTYPE name [markup declarations]> · Processing instruction, e.g. <?xml version="1.0"?> · Comment, e.g. <!-- comment text --> · Atomic data, e.g. <Data>

What is a DTD? • A Document Type Definition (DTD) serves as a grammar • A document type definition specifies: – the elements that are permissible in a document of this type – for each element, the possible attributes, their range of values and defaults – for each element, the structure of its contents, including: • which elements can occur and in what order • whether text characters can occur

Example of a DTD
<!DOCTYPE Bookslist [
  <!ELEMENT Bookslist (book)*>
  <!ELEMENT book (title, author*, publisher)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
]>

XML and DTD • Well-formed documents – Tags must be nested properly and attributes must be unique within an element • Valid documents – Well-formed documents that conform to a Document Type Definition (DTD) • DTDs are used to – Constrain structure – Declare entities – Provide default values for attributes
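A minimal sketch of the well-formedness check, assuming Python's standard `xml.etree.ElementTree` parser. The sample strings are made up for illustration. This parser does not validate against a DTD, so checking validity would need a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, i.e. it is well formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative inputs (not from the slides).
good = "<book><title>Data on the Web</title></book>"
bad_nesting = "<book><title>Data on the Web</book></title>"  # tags cross
dup_attr = '<book id="1" id="2"/>'                           # attribute repeated

for sample in (good, bad_nesting, dup_attr):
    print(is_well_formed(sample))
```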

DTD Limitations • Too document-oriented • Too simple and too complicated at the same time • Too limited to represent complex structures • IDREFs are not typed • No notion of inheritance/sub-typing • Too many ways to represent the same thing • Names are global, not local

DTD vs. Database Schema • Order is of significance in DTD and not in DB • DTD does not provide for data types • DTD cannot specify keys

XMLSchema • Why XMLSchema? – Based on XML syntax – Can be parsed and manipulated like any XML document – Supports a variety of data types – Allows extension of vocabularies and inheritance from elements – Provides namespace integration – Provides logical grouping of attributes

XMLSchema: An example
<datatype name="PriceType">
  <basetype name="decimal"/>
  <minExclusive>0.00</minExclusive>
  <scale>2</scale>
</datatype>
<element name="price" type="PriceType"> </element>
<element name='Person'> ... </element>
<element name='Employee'> <refines name='Person'/> ... </element>

XMLSchema vs. DTD

XML Data • Superset of XMLSchema • Can express database relationships too • Eg:
<elementType id="booktable">
  <element id="titleID" type="#title"/>
  <element type="#author"/>
  <element type="#pages"/>
  <key id="bookkey"> <keyPart href="#titleID"/> </key>
</elementType>

Semistructured data • Data that is neither raw nor as strictly typed as in databases • Examples of semistructured data – HTML file with one entry per restaurant that provides info on prices, addresses, styles – BibTeX files – Genome and scientific databases – Online documentation

Semistructured data: Main aspects • Structure – Irregular – Implicit – Partial • Schema – Very large – Rapidly evolving – Distinction between data and schema is blurred

Semistructured data: Data model • Object Exchange Model (OEM) – Lightweight and flexible – Data representation • As a graph with objects as vertices and labels on edges • Each object has a unique object identifier • Some objects are atomic, e.g., integer, real, … • Complex objects have a set of object references as their value
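A minimal sketch of an OEM-style graph in Python. The class name `OEMObject`, the helper `follow` and the sample data are assumptions for illustration. They are not part of OEM or Lore.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Tuple, Union

_oids = count(1)  # hypothetical generator of unique object identifiers


@dataclass
class OEMObject:
    """Atomic object (has a value) or complex object (has labelled references)."""
    value: Union[int, float, str, None] = None
    refs: List[Tuple[str, "OEMObject"]] = field(default_factory=list)
    oid: int = field(default_factory=lambda: next(_oids))

    def add(self, label: str, child: "OEMObject") -> "OEMObject":
        """Attach `child` under an edge labelled `label` and return it."""
        self.refs.append((label, child))
        return child

    def get(self, label: str) -> List["OEMObject"]:
        """All objects reachable over one edge labelled `label`."""
        return [c for (l, c) in self.refs if l == label]


def follow(root: OEMObject, path: str) -> List[OEMObject]:
    """Follow a dotted label path such as 'biblio.book.author' from `root`."""
    nodes = [root]
    for label in path.split("."):
        nodes = [c for n in nodes for c in n.get(label)]
    return nodes


# Illustrative database.
db = OEMObject()
biblio = db.add("biblio", OEMObject())
book = biblio.add("book", OEMObject())
book.add("title", OEMObject("Data on the Web"))
book.add("author", OEMObject("Abiteboul"))
book.add("author", OEMObject("Buneman"))
paper = biblio.add("paper", OEMObject())
paper.add("year", OEMObject(1998))

print([a.value for a in follow(db, "biblio.book.author")])
```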

OEM: An example

Semistructured data: Query Languages • Lorel – Based on OQL – Eg., • select author: X from biblio.book.author X • Computes the set of book authors • Forms a new node and connects it, with edges labelled author, to the nodes resulting from evaluation of the path expression

Lorel: Salient features • Coercion • Forces comparison operators to handle comparisons between objects of different types, e.g. between a string and an integer • Eg. select row: X from biblio.paper X where X.year = 1998 • Comment: year could have been a string or an integer
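A minimal sketch of coercion in a comparison, in Python. The function `coerce_eq` and the sample papers are assumptions for illustration. Lorel's actual coercion rules are richer than this.

```python
def coerce_eq(a, b) -> bool:
    """Compare two atomic values, coercing to numbers when the types differ."""
    if type(a) is type(b):
        return a == b
    try:
        return float(a) == float(b)
    except (TypeError, ValueError):
        # Values that cannot be coerced make the predicate fail, not the query.
        return False


# Illustrative data: year is an integer, a string, unparsable, or missing.
papers = [
    {"title": "A", "year": 1998},
    {"title": "B", "year": "1998"},
    {"title": "C", "year": "unknown"},
    {"title": "D"},
]

rows = [p["title"] for p in papers if coerce_eq(p.get("year"), 1998)]
print(rows)
```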

Lorel: Salient Features • Path expressions • Data model allows arbitrary nesting • Queries should hence be able to probe to arbitrary depth • Provided by path expressions • Eg. select title: t from chapter(.section)* s, s.title t where t like "*XML*"
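A minimal sketch of how `chapter(.section)*` probes arbitrary depth, in Python. The graph encoding, the helper names and the sample titles are assumptions for illustration. The `like` pattern is treated here as a simple wildcard match.

```python
from fnmatch import fnmatchcase

# Hypothetical edge-labelled graph: oid -> list of (label, target oid).
edges = {
    "root": [("chapter", "c1")],
    "c1": [("title", "t1"), ("section", "s1")],
    "s1": [("title", "t2"), ("section", "s2")],
    "s2": [("title", "t3")],
}
values = {"t1": "Introduction", "t2": "XML basics", "t3": "Querying XML data"}


def step(nodes, label):
    """Targets of all edges labelled `label` leaving any node in `nodes`."""
    return [t for n in nodes for (l, t) in edges.get(n, []) if l == label]


def closure(nodes, label):
    """Nodes reachable from `nodes` over zero or more `label` edges."""
    seen, frontier = list(nodes), list(nodes)
    while frontier:
        nxt = []
        for t in step(frontier, label):
            if t not in seen:
                seen.append(t)
                nxt.append(t)
        frontier = nxt
    return seen


def titles_like(pattern):
    """from chapter(.section)* s, s.title t where t like pattern"""
    s_nodes = closure(step(["root"], "chapter"), "section")
    return [values[t] for t in step(s_nodes, "title")
            if fnmatchcase(values[t], pattern)]


print(titles_like("*XML*"))
```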

UnQL • Based on an edge-labeled graph model • Coercion not supported • More precise knowledge of the data needed • Pattern usage – Eg. select title: X where {biblio: {paper: {title: X, year: Y}}} in db, Y > 1998

UnQL • Path variables – Paths can be used as data too – Eg. select @P from db1 @P.X where matches(".*(U|u)biquitin.*", X) ==> Determines where the string "ubiquitin" appears in db1

Semistructured vs. XML • Both are schema-less and self-describing • XML is ordered; semistructured data is not • XML can mix text and elements • XML has lots of other stuff: entities, processing instructions, comments

Requirements of an XML Query Language • XML output • Server-side processing • Query operations – Selection, Extraction, Reduction, Restructuring, Combination • No schema required • Exploit available schema • Preserve order and association • Programmatic manipulation

Requirements of an XML Query Language • XML representation • Mutual embedding with XML • XLink and XPointer cognizant • Support for new data types • Suitable for metadata

XML Query Languages • XQL • XML-QL • Quilt

XQL • Simple expressions • //product[@maker='BSA'] : All products with attribute maker ‘BSA’ • Filters • author/address[@type='email']: Address nodes with attribute type as email • Subscripts • section[1, 3 to 5]: Nodes with position 1, 3, 4, 5
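A rough Python analogue of the first two XQL expressions, assuming the limited XPath subset of the standard `xml.etree.ElementTree` module. The sample catalog is made up for illustration. ElementTree is not an XQL engine, and only the descendant step and the attribute filter carry over.

```python
import xml.etree.ElementTree as ET

# Illustrative document (not from the slides).
doc = ET.fromstring("""
<catalog>
  <product maker="BSA"><name>Bantam</name></product>
  <product maker="Triumph"><name>Tiger</name></product>
  <group><product maker="BSA"><name>Gold Star</name></product></group>
  <author>
    <address type="email">a@example.org</address>
    <address type="postal">1 Main St</address>
  </author>
</catalog>
""")

# //product[@maker='BSA'] : products at any depth whose maker is 'BSA'
bsa = [p.findtext("name") for p in doc.findall(".//product[@maker='BSA']")]

# author/address[@type='email'] : address children of author, filtered by type
emails = [a.text for a in doc.findall("author/address[@type='email']")]

print(bsa)
print(emails)
```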

XQL • Supports boolean and set operators • q1 and q2 • q1 union q2 • Grouping • //invoice{q1} : Groups the results of q1 by invoice • Sequence • a before b • Others : node(), text(), ...

XQL: Limitations • Flattening – Not possible, as the results of patterns and filters are not modeled by an intermediate relation • Restructuring – Not possible, as it depends on flattening • Tag variables – Not supported • Sorting

XML Query Languages • XQL • XML-QL • Quilt

XML-QL • Simple examples
WHERE <book>
        <publisher> <name>Addison-Wesley</name> </publisher>
        <title> $t </title>
        <author> $a </author>
      </book> IN "www.a.b.c/bib.xml"
CONSTRUCT <result> <author>$a</author> <title>$t</title> </result>

XML-QL • Grouping
WHERE <book> $p </> IN "www.a.b.c/bib.xml",
      <title> $t </>,
      <publisher> <name>Addison-Wesley</> </publisher> IN $p
CONSTRUCT <result>
            <title> $t </>
            WHERE <author> $a </> IN $p
            CONSTRUCT <author> $a </>
          </>
==> Groups by title.

XML-QL • Tag variables
WHERE <$p>
        <title> $t </title>
        <year> 1995 </>
        <$e> Smith </>
      </> IN "www.a.b.c/bib.xml",
      $e IN {author, editor}
CONSTRUCT <$p> <title> $t </title> <$e> Smith </> </>
==> List of publications where Smith could be either author or editor

XML-QL • Regular Path Expressions
WHERE <part*> <name>$r</> <brand>Ford</> </> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>$r</>
==> Gets the list of names of parts, irrespective of the nesting of parts in the document.

XML-QL • Skolem functions
WHERE <$>
        <author> <firstname> $fn </> <lastname> $ln </> </>
        <title> $t </>
      </> IN "www.a.b.c/bib.xml"
CONSTRUCT <person ID=PersonID($fn, $ln)>
            <firstname> $fn </>
            <lastname> $ln </>
            <publicationtitle> $t </>
          </>
==> PersonID is a Skolem function. It generates a new id for each distinct value of ($fn, $ln); otherwise the output is appended to the existing node.
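A minimal sketch of what a Skolem function does, in Python. The class `Skolem`, the id format and the sample bindings are assumptions for illustration. They are not XML-QL syntax.

```python
from itertools import count


class Skolem:
    """Maps each distinct argument tuple to one stable, freshly generated id."""

    def __init__(self, prefix: str):
        self._prefix = prefix
        self._ids = {}
        self._counter = count(1)

    def __call__(self, *args) -> str:
        if args not in self._ids:
            self._ids[args] = f"{self._prefix}{next(self._counter)}"
        return self._ids[args]


person_id = Skolem("p")

# Hypothetical variable bindings ($fn, $ln, $t) produced by the WHERE clause.
bindings = [
    ("Ann", "Smith", "Title 1"),
    ("Bob", "Jones", "Title 1"),
    ("Ann", "Smith", "Title 2"),
]

persons = {}
for fn, ln, title in bindings:
    pid = person_id(fn, ln)
    # First sighting creates the node; later ones append to the same node.
    node = persons.setdefault(
        pid, {"firstname": fn, "lastname": ln, "publicationtitle": []})
    node["publicationtitle"].append(title)

print(persons)
```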

XML-QL • Allows integrating data from multiple sources • Can query order as well • Provides for embedding query within data • Allows function definitions • Is relationally complete

XML-QL • Is everything fine? – Pattern specifications are too verbose – Result of the WHERE clause is a relation composed of scalar values • So cannot preserve information about hierarchy and sequence • Can hence not handle hierarchy and sequence related queries

XML Query Languages • XQL • XML-QL • Quilt

Quilt • Combines strengths of XML-QL and XQL • Derives ability to navigate and select nodes based on sequence from XQL • Binding of variables done like in XML-QL

Quilt • An example
FOR $b IN //book
WHERE exists($b/title) AND NOT exists($b/author)
RETURN $b/title
==> Lists the titles of those books that have no author information
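A rough Python rendering of the same FOR / WHERE / RETURN query, assuming the standard `xml.etree.ElementTree` module. The sample bibliography is made up for illustration. This is not a Quilt implementation.

```python
import xml.etree.ElementTree as ET

# Illustrative input document.
bib = ET.fromstring("""
<bib>
  <book><title>T1</title><author>A</author></book>
  <book><title>T2</title></book>
  <book><author>B</author></book>
</bib>
""")

# FOR $b IN //book : bind $b to every book element, at any depth.
tuples = list(bib.iter("book"))

# WHERE exists($b/title) AND NOT exists($b/author) : keep matching bindings.
selected = [b for b in tuples
            if b.find("title") is not None and b.find("author") is None]

# RETURN $b/title : build the output from each selected binding.
result = [b.findtext("title") for b in selected]
print(result)
```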

Quilt • Flow of data in a Quilt expression: XML input -> FOR/LET -> tuples of bound variables -> WHERE -> tuples selected -> RETURN -> XML output

Quilt: Filtering Documents • Need to preserve the relationships among selected elements • Eg: filter = A|B [Figure: trees of A, B and C nodes illustrating the filter]

Quilt • Can perform Sorting • Aggregation provided • Allows recursive functions

Quilt: The real power of it • Sample document
<section>
  <section.title>Procedure</section.title>
  The patient was taken to the operating room where she was placed in a supine position and
  <Anesthesia>induced under general anesthesia.</Anesthesia>
  <Prep>
    <action>Foley catheter was placed to decompress the bladder</action>
    and the abdomen was then prepped and draped in sterile fashion.
  </Prep>
  <Incision>
    A curvilinear incision was made
    <Geography>in the midline immediately infraumbilical</Geography>
    and the subcutaneous tissue was divided
    <Instrument>using electrocautery.</Instrument>
  </Incision>
  The fascia was identified and
  <action>#2-0 Maxon stay sutures were placed on each side of the midline.</action>
  <Incision>
    The fascia was divided using <Instrument>electrocautery</Instrument>
    and the peritoneum was entered.
  </Incision>
  <Observation>The small bowel was identified</Observation>
  and <action> the <Instrument>Hasson trocar</Instrument> </action>
  :
</section>

Quilt: The real power of it • In each section with title "Procedure", what Instruments were used in the second Incision?
FOR $s IN //section[section.title="Procedure"]
RETURN ($s//Incision)[2]/Instrument
• In each section with title "Procedure", what are the first two Instruments to be used?
FOR $s IN //section[section.title="Procedure"]
RETURN ($s//Instrument)[1-2]

Quilt: The real power of it • In the first procedure, what happened between the first incision and the second incision?
FOR $proc IN //section[section.title="Procedure"][1],
    $bet IN $proc//((* AFTER ($proc//Incision)[1]) BEFORE ($proc//Incision)[2])
RETURN $bet

XML Storage • Text files • Simple • Would require a special-purpose query processor • Relational databases • Ternary relations [Florescu et al] • Inlining methods [Shanmugasundaram et al] • STORED [Mary Fernandez]

XML Storage • Object-oriented databases [Sophie Cluet et al] • Native storage

XML Storage • Using Ternary relations • Edge labels are maintained in a table with the object ids that the edge connects • Value of leaf nodes are stored using yet another table
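A minimal sketch of shredding an XML tree into the two tables, in Python. The table names `ref` and `val`, the `&o` id format and the sample document (including the author value) are assumptions for illustration. Attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET
from itertools import count


def shred(root):
    """Return (ref, val).

    ref rows are (source oid, edge label, target oid);
    val rows are (leaf oid, text value).
    """
    ids = count(1)
    ref, val = [], []

    def visit(elem, parent_oid):
        oid = f"&o{next(ids)}"
        ref.append((parent_oid, elem.tag, oid))   # one row per edge
        children = list(elem)
        if not children:                           # leaf: store its value
            val.append((oid, (elem.text or "").strip()))
        for child in children:
            visit(child, oid)

    root_oid = f"&o{next(ids)}"                    # object the document hangs off
    visit(root, root_oid)
    return ref, val


doc = ET.fromstring(
    "<paper><title>The Calculus</title>"
    "<author>A. Author</author><year>1986</year></paper>")

ref, val = shred(doc)
for row in ref:
    print("Ref", row)
for row in val:
    print("Val", row)
```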

Store XML in Ternary Relation • [Tables: Ref holds the edges, with labels paper, title, author and year connecting objects &o1 to &o6; Val holds the leaf values, including "The Calculus", "…" and, for &o6, "1986"]

XML Storage • DTDs converted into DTD graph • Inlining methods • Basic inlining • Shared inlining • Hybrid inlining

Corresponding DTD graph

Element graph for Editor Element

XML Storage • Basic inlining • For each node in the DTD graph a relation is created • Creates a large no. of relations

Relations created using Basic inlining

XML Storage • Shared inlining • Create relations for elements with in-degree > 1 • Each element node is represented in exactly one relation • For mutually recursive elements, make one of them a separate relation

Relations created using shared inlining

XML Storage • Hybrid inlining • inlines elements with in-degree > 1 that are not recursive or reached through a “*” node

Relations created using hybrid inlining

XML Storage • STORED • Uses a query language to specify mappings. • Mappings are generated using mining algorithms • Nonconforming data is stored in overflow graphs.

XML Storage • STORED (contd.) • Given a data instance D, a STORED query is generated automatically.
FROM Audit.taxpayer: $X { name: $N, phone: $P1, optional { phone: $P2 } }
STORE R1($X, $N, $P1, $P2)
• Given relational mappings, generate explicit overflow mappings so that the query is lossless.
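A minimal sketch of the idea behind the R1 mapping and its overflow, in Python. The function `store`, the record encoding and the sample taxpayers are assumptions for illustration. STORED itself generates such mappings automatically and keeps the overflow as a graph.

```python
def store(taxpayers):
    """Split records into relation R1 and an overflow store.

    Each record is (oid, [(label, value), ...]). A record fits R1 if it has
    exactly one name and at least one phone; a second phone is optional.
    Anything R1 cannot hold goes to overflow, so no data is lost.
    """
    r1, overflow = [], []
    for oid, fields in taxpayers:
        names = [v for (l, v) in fields if l == "name"]
        phones = [v for (l, v) in fields if l == "phone"]
        others = [(l, v) for (l, v) in fields if l not in ("name", "phone")]
        if len(names) == 1 and len(phones) >= 1:
            second = phones[1] if len(phones) > 1 else None
            r1.append((oid, names[0], phones[0], second))
            extra = [("phone", p) for p in phones[2:]] + others
        else:
            extra = list(fields)                  # nonconforming record
        if extra:
            overflow.append((oid, extra))
    return r1, overflow


# Illustrative data.
taxpayers = [
    ("&t1", [("name", "Ann"), ("phone", "555-0100")]),
    ("&t2", [("name", "Bob"), ("phone", "555-0101"),
             ("phone", "555-0102"), ("audited", "1999")]),
    ("&t3", [("name", "Cy")]),
]

r1, overflow = store(taxpayers)
print(r1)
print(overflow)
```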

XML Storage • Object-oriented method • Using the DTD, a hierarchy of the elements is obtained • Each element is then modeled as a class • To handle "*" in the DTD, a list of objects is maintained • To handle union types (e.g., phone|email), a new class can be introduced

XML Storage • eXcelon way – eXcelon XML Data Engine is a high-performance XML data management engine – Based on the ObjectStore DBMS – When XML data is parsed in eXcelon, it is represented in XMLStore as discrete XML elements – The hierarchical structure of XML is therefore preserved in its persistent representation

XML Algebra • Why yet another algebra? – Structure of data • Deeply structured • Exact structure not specific – Recursion • Structurally recursive • Proposed algebra: too much stress on type conformance

XML Algebra • Sample Data
<bib>
  <book>
    <title>Data on the Web</title>
    <year>1999</year>
    <author>Abiteboul</author>
    <author>Buneman</author>
  </book>
  <book>
    <title>XML Query</title>
    <year>2000</year>
    <author>Mary</author>
  </book>
</bib>

XML Algebra
type Bib = bib [ Book{0,*} ]
type Book = book [ title [ String ], year [ Integer ], author [ String ]{1,*} ]
let bib0 : Bib = bib [
  book [ title [ "Data on the Web" ], year [ 1999 ], author [ "Abiteboul" ], author [ "Buneman" ] ],
  book [ title [ "XML Query" ], year [ 2000 ], author [ "Mary" ] ]
]

XML Algebra • Projection Eg: project book (children(bib0)) – Allows a more convenient notation as well (similar to XPath notation) – Eg. bib0/book/author ==> author [ "Abiteboul" ], author [ "Buneman" ], author [ "Mary" ] : author [ String ]{0,*}

XML Algebra • Selection Eg:
for b <- bib0/book in where value(b/year) < 2000 then b
==> book [ title [ "Data on the Web" ], year [ 1999 ], author [ "Abiteboul" ], author [ "Buneman" ] ] : Book{0,*}

XML Algebra • Join:
type Reviews = reviews [ book [ title [ String ], review [ String ] ]{0,*} ]
let review0 : Reviews = reviews [
  book [ title [ "XML Query" ], review [ "A fine book" ] ],
  book [ title [ "Data on the Web" ], review [ "This is great" ] ]
]

XML Algebra • Join
for b <- bib0/book in for r <- review0/book in where value(b/title) = value(r/title) then book [ b/title, b/author, r/review ]
==> book [ title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], review [ "This is great" ] ],

XML Algebra • Join
book [ title [ "XML Query" ], author [ "Mary" ], review [ "A fine book" ] ]
: book [ title [ String ], author [ String ]{1,*}, review [ String ] ]{0,*}
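A minimal sketch of the join as a nested Python comprehension. The dictionary encoding of `bib0` and `review0` is an assumption for illustration, and the review titles are written so that they match the titles in `bib0` exactly. The algebra's typing is not modelled.

```python
# Hypothetical plain-Python encoding of bib0 and review0.
bib0 = [
    {"title": "Data on the Web", "year": 1999,
     "author": ["Abiteboul", "Buneman"]},
    {"title": "XML Query", "year": 2000, "author": ["Mary"]},
]
review0 = [
    {"title": "XML Query", "review": "A fine book"},
    {"title": "Data on the Web", "review": "This is great"},
]

# for b in bib0/book, for r in review0/book, where the titles are equal:
# build a new book from b/title, b/author and r/review.
joined = [
    {"title": b["title"], "author": b["author"], "review": r["review"]}
    for b in bib0
    for r in review0
    if b["title"] == r["title"]
]

for book in joined:
    print(book)
```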

XML Algebra • Querying Order – The index function pairs an integer index with each element in a forest – Eg: index(book0/author) ==> pair [ fst [ 1 ], snd [ author [ "Abiteboul" ] ] ], pair [ fst [ 2 ], snd [ author [ "Buneman" ] ] ], pair [ fst [ 3 ], snd [ author [ "Suciu" ] ] ] : pair [ fst [ Integer ], snd [ author [ String ] ] ]{1,*}

XML Algebra • Aggregation – Has five built-in aggregation functions: avg, count, max, min and sum – Eg:
for b <- bib0/book in where count(b/author) >= 2 then b/title
==> title [ "Data on the Web" ] : title [ String ]{0,*}
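A minimal Python sketch of the aggregation query, with `len` standing in for `count`. The dictionary encoding of `bib0` is an assumption for illustration. The last lines show an `enumerate`-based analogue of the index function on one book's authors.

```python
# Hypothetical plain-Python encoding of bib0.
bib0 = [
    {"title": "Data on the Web", "year": 1999,
     "author": ["Abiteboul", "Buneman"]},
    {"title": "XML Query", "year": 2000, "author": ["Mary"]},
]

# Titles of books with at least two authors (count(b/author) >= 2).
titles = [b["title"] for b in bib0 if len(b["author"]) >= 2]
print(titles)

# Analogue of index(...): pair each author with its 1-based position.
indexed = list(enumerate(bib0[0]["author"], start=1))
print(indexed)
```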

XML Algebra • Additional Features – Structural Recursion • To define documents with recursive structure, recursive types are used – Sorting • sort(pairs) – Grouping • Group(pairs)

Kweelt • Is a framework to query XML Data • An implementation of Quilt • Architecture :

XML Indexing • [Figure: semistructured data graph with nodes 1 to 13 and edges labelled t, a, b, c and d]

XML Indexing • Data guides (used in Lore) • A data guide is a concise and accurate summary of the data graph • [Figure: data guide for the example graph, in which data nodes reached by the same label path are grouped into one node]
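A minimal sketch of building a data guide by grouping the nodes reached by each label path, in the style of a subset construction. The function `dataguide` and the small graph are assumptions for illustration. The graph is not the one from the slide figure, and this is not Lore's implementation.

```python
from collections import deque


def dataguide(edges, root):
    """Summarise an edge-labelled graph.

    edges: oid -> list of (label, target oid).
    Returns a dict mapping each guide node (a frozenset of data oids) to
    {label: guide node}. Every label path of the data appears exactly once.
    """
    start = frozenset([root])
    guide = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in guide:
            continue
        by_label = {}
        for node in sorted(state):
            for label, target in edges.get(node, []):
                by_label.setdefault(label, set()).add(target)
        guide[state] = {l: frozenset(ts) for l, ts in by_label.items()}
        queue.extend(guide[state].values())
    return guide


# Hypothetical data graph.
edges = {
    1: [("t", 2), ("t", 3)],
    2: [("a", 4)],
    3: [("a", 5), ("b", 6)],
}

guide = dataguide(edges, 1)
for state, out in guide.items():
    print(sorted(state), {l: sorted(t) for l, t in out.items()})
```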

XML Indexing • T-Index • [Figure: T-index for the example graph]

Challenges • Storage issues • Relational or native? • Query optimization • Query plan? • Other than queries…say triggers? • Updates to data • Mining of XML data