An algebra for XML
Leonidas Galanis, Stratis Viglas
University of Wisconsin-Madison, Department of Computer Sciences
Outline
- What kind of operations do we need?
- Why are XML data different?
- How do we overcome the problems that arise?
- A concrete algebra
- Using this algebra inside Niagara
9/16/2020
What do we need?
- Pattern retrieval
- Selections
- Projections
- Joins
- Element construction
So, why is it different?
- Relational algebra has selections, projections and joins
- Object-oriented algebras have pattern-like constructs (path expressions)
- Just use these, add a construction operator and we’re set!
- …not really
Key underlying difference
- Relational model: there is a database schema, everything is flat
- Object-oriented models: there is a class definition, a known kind of hierarchy
- What does XML have?
  - DTDs, XML Schemata can act as a schema
  - Most XML files out there do not conform to a DTD/XML Schema
  - We don’t really know the schema of the data. We just know the data are there and they have some context
The data model
- An XML file is a DAG of vertices
- Arcs come out of each vertex
- Three types of arcs:
  - Attribute
  - Element
  - IDRef
- Each arc is named
- Even more, there is an ordering, on arcs and nodes
[Figure: a book vertex (1) with outgoing arcs to isbn (2), title (3) and author (4) vertices; the arcs carry order labels [1], [2], [3], …, [n]]
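A minimal Python sketch of this data model, for illustration only. The class names `Vertex` and `Arc`, the `kind` strings and the `add` helper are assumptions made here, not part of the slides; the position of an arc in the list stands in for arc ordering.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Arc:
    name: str        # arc label, e.g. "title"
    kind: str        # "element", "attribute" or "idref" (assumed encoding)
    target: Vertex   # vertex the arc points to


@dataclass
class Vertex:
    value: str | None = None
    arcs: list[Arc] = field(default_factory=list)  # list order = arc order

    def add(self, name: str, kind: str, value: str | None = None) -> Vertex:
        """Attach a new child vertex through a named, typed arc."""
        child = Vertex(value)
        self.arcs.append(Arc(name, kind, child))
        return child


# The first book of books.xml expressed in this model.
book = Vertex()
book.add("isbn", "attribute", "01")
book.add("title", "element", "Foundations of Databases")
for last_name in ("Abiteboul", "Vianu", "Hull"):
    book.add("author", "element", last_name)

arc_names = [arc.name for arc in book.arcs]
```

Because every arc records its kind, an operator can tell attribute, element and IDRef arcs apart.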
Use the bare minimum
- All operators operate on a set of vertices of the same type
- Use relative path expressions
- Use selections and joins to filter out the data
  - Conditions are based on path expressions
- Use plain projections to project out specific elements
- Output construction based on wrapping elements with a tag
- Build on these principles as we go along
Example File: books.xml
<bib>
  <book isbn="01">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Vianu</author>
    <author>Hull</author>
  </book>
  <book isbn="02">
    <title>Principles of Database Systems</title>
    <author>Ramakrishnan</author>
  </book>
  <book isbn="03">
    <title>Niagara Blues</title>
    <author>Galanis</author>
  </book>
</bib>
Example File: articles.xml
<proc>
  <article>
    <title>The OO7 Benchmark</title>
    <author>DeWitt</author>
    <author>Carey</author>
    <author>Naughton</author>
  </article>
  <article>
    <title>Magic is relevant</title>
    <author>Ramakrishnan</author>
  </article>
  <article>
    <title>The Niagara Insomniac</title>
    <author>Viglas</author>
  </article>
</proc>
Vertex specification
- Given a vertex, follow down a path of descendant vertices
- Return all vertices reachable by the path expression
- Assume that, given an arc, we can differentiate between element, attribute and IDRef arcs
Vertex specification example
- Suppose we have reached vertices pointed to by “book” arcs
- We want the authors of these books
- So we follow the author arc
- Result is a set of vertices pointed to by “author” arcs
- Let’s call this operator Follow
[Figure: Follow on (author) applied to a book vertex, producing its author vertices]
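A rough sketch of this first version of Follow, using Python's standard `xml.etree.ElementTree` elements as stand-ins for vertices. The function name `follow` and the inlined copy of books.xml are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

BOOKS = """<bib>
  <book isbn="01"><title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Vianu</author><author>Hull</author></book>
  <book isbn="02"><title>Principles of Database Systems</title>
    <author>Ramakrishnan</author></book>
  <book isbn="03"><title>Niagara Blues</title><author>Galanis</author></book>
</bib>"""


def follow(vertices, arc_name):
    """Return every vertex reachable from `vertices` over an arc named `arc_name`."""
    return [child for vertex in vertices for child in vertex.findall(arc_name)]


root = ET.fromstring(BOOKS)
books = follow([root], "book")        # vertices pointed to by "book" arcs
authors = follow(books, "author")     # vertices pointed to by "author" arcs
names = [author.text for author in authors]
```

Note that the result holds only author vertices: the books they came from are no longer part of the output.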
Selections
- Filter out the input based on some qualification (condition)
  - e.g.: (book.author = “Hull”)
- What are the semantics?
  - What kind of elements are flowing through the system?
  - Can we overlay multiple selections?
Selections (example)
- Suppose we want the titles of books written by a specific author
- How far should we go into the initial Follow?
- If we follow to book.author, then we lose access to book.title
- If we follow to book, we are better off
- What if we want the author as well? (i.e., only the specified author should appear in the output)
- This can be a problem…
[Figure: a book vertex with title and author children, annotated “Selection here”]
Query on books.xml
<bib>
  <book isbn="01">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Vianu</author>
    <author>Hull</author>
  </book>
  <book isbn="02">
    <title>Principles of Database Systems</title>
    <author>Ramakrishnan</author>
  </book>
  <book isbn="03">
    <title>Niagara: A programmer’s waterfall</title>
    <author>Galanis</author>
  </book>
</bib>
Proposed Solution
- Permit more than one Follow operator
- Change the assumption: operators no longer work on a single type of input
- A collection { } of bags [ ] of vertices
- Example:
  - { [book1, book1.title1], [book2, book2.title2], … }
- Relational analogy:
  - Vertex = attribute
  - Bag = tuple
  - Collection = relation
Solution
- Even more, change the semantics of the Follow operator
- Evaluate a specified path expression in all elements of all bags of a collection
- For each qualifying element, create a new bag containing the old vertex plus the qualifying vertex
- Same as un-nesting in Object-Oriented algebras
[Figure: Follow on (book) yields { [book1], [book2], [book3] }; Follow on (author) then yields { [book1, author1], [book1, author2], …, [book3, author1] }, i.e. { [Foundations…, Vianu], [Foundations…, Hull], …, [Niagara…, Galanis] }]
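The bag-based Follow can be sketched as follows. This is a simplification under stated assumptions: bags are modelled as Python tuples, the caller names the bag position on which the arc is evaluated (the slide evaluates the expression on all elements), and `follow` and `select` are illustrative names.

```python
import xml.etree.ElementTree as ET

BOOKS = """<bib>
  <book isbn="01"><title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Vianu</author><author>Hull</author></book>
  <book isbn="02"><title>Principles of Database Systems</title>
    <author>Ramakrishnan</author></book>
  <book isbn="03"><title>Niagara Blues</title><author>Galanis</author></book>
</bib>"""


def follow(collection, position, arc_name):
    """Un-nest: emit one extended bag per vertex reached from bag[position]."""
    return [bag + (match,)
            for bag in collection
            for match in bag[position].findall(arc_name)]


def select(collection, predicate):
    """Keep only the bags that satisfy the predicate."""
    return [bag for bag in collection if predicate(bag)]


root = ET.fromstring(BOOKS)
books = [(book,) for book in root.findall("book")]   # { [book1], [book2], [book3] }
book_author = follow(books, 0, "author")             # { [book1, author1], ... }

# Select on the author, but the book (and hence its title) is still in the bag.
hull = select(book_author, lambda bag: bag[1].text == "Hull")
hull_titles = [bag[0].findtext("title") for bag in hull]
```

Since each bag keeps the old vertex next to the new one, the selection on the author no longer costs us access to the title.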
Joins
- Join two collections based on some qualification
  - j(condition)
- What is the output of a join?
  - [Beech, Malhotra, Rys]: Add an IDRef arc from one vertex to the joining vertex
  - But IDRef arcs are directed
  - So in their model, joins are not commutative
[Figure: an IDRef arc between the joining vertices]
Our Solution
- Each bag of the resulting collection is a concatenation of the joining bags
- The same as concatenating tuples in the relational paradigm
- Even more, bags are unordered
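A small sketch of join as bag concatenation. To reflect that bags are unordered, each bag is modelled here as a Python dict keyed by a label; the labels, the `join` function and the flattened sample values are assumptions made for illustration.

```python
def join(left, right, condition):
    """Concatenate every pair of bags that satisfies the join condition."""
    return [{**l_bag, **r_bag}
            for l_bag in left
            for r_bag in right
            if condition(l_bag, r_bag)]


books = [
    {"book.title": "Foundations of Databases", "book.author": "Hull"},
    {"book.title": "Principles of Database Systems", "book.author": "Ramakrishnan"},
    {"book.title": "Niagara Blues", "book.author": "Galanis"},
]
articles = [
    {"article.title": "Magic is relevant", "article.author": "Ramakrishnan"},
    {"article.title": "The Niagara Insomniac", "article.author": "Viglas"},
]

books_articles = join(
    books, articles,
    lambda b, a: b["book.author"] == a["article.author"])
articles_books = join(
    articles, books,
    lambda a, b: a["article.author"] == b["book.author"])

# Dict equality ignores key order, so swapping the inputs gives the same bags.
commutative = books_articles == articles_books
```

With unordered bags there is no directed arc between the joined vertices, so the operand order does not change the result.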
Problems
- Suppose we are operating on two streams: books and articles
- We have joined on the author
- We want a selection on the book’s title
- Using relative path expressions, what path expression are we going to specify?
[Figure: a selection (title = Niagara) on top of a join j(author = author) over the book and article streams]
Possible solution
- Use absolute path expressions
- Now we can distinguish between different sources
- But what if we can evaluate the path expression on different elements of the bag?
- For instance, given bags of [book, book.author], book.author.lastname can be evaluated on both elements of the bag
- Choose the element of the bag with the greatest common prefix for evaluation
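The greatest-common-prefix rule could look like the sketch below. It assumes each bag element is identified by the dotted absolute path that produced it; `common_prefix_len` and `choose_element` are hypothetical helper names.

```python
def common_prefix_len(label, path):
    """Number of leading path steps that `label` and `path` share."""
    shared = 0
    for label_step, path_step in zip(label.split("."), path.split(".")):
        if label_step != path_step:
            break
        shared += 1
    return shared


def choose_element(bag_labels, path):
    """Pick the bag element whose label shares the longest prefix with `path`.

    Returns the chosen label and the steps that remain to be followed from it.
    """
    best = max(bag_labels, key=lambda label: common_prefix_len(label, path))
    remaining = path.split(".")[common_prefix_len(best, path):]
    return best, remaining


# book.author.lastname can start from either element; book.author wins.
nested = choose_element(["book", "book.author"], "book.author.lastname")

# After a join, the absolute path also tells the two sources apart.
joined = choose_element(["book", "article"], "book.title")
```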
Cleaner solution
- The previous solution works, but it relies on an implicit rule for choosing where a path expression is evaluated
- Introduce a reverse part in the path expressions
- A reverse part designates backward satisfaction constraints
- Examples:
  - lastname: book.author instructs following the lastname arc from book.author vertices
  - author.lastname: book instructs following the author.lastname path from book vertices
- This way, the path expression by itself specifies on which element of the bag it is to be evaluated
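A possible reading of the reverse part in code, as a sketch only. It assumes the text before the colon is the forward path and the text after it names the bag element to start from; `parse_path`, `evaluate`, the dict-based bag and the `lastname` sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_path(expression):
    """Split 'forward: reverse' into forward steps and the anchor label."""
    forward, separator, anchor = expression.partition(":")
    if not separator:
        raise ValueError("expected '<forward path>: <reverse part>'")
    return forward.strip().split("."), anchor.strip()


def evaluate(bag, expression):
    """Follow the forward steps starting from the bag element named by the reverse part."""
    steps, anchor = parse_path(expression)
    vertices = [bag[anchor]]
    for step in steps:
        vertices = [child for vertex in vertices for child in vertex.findall(step)]
    return vertices


# Hypothetical book with a structured author, as in the slide's examples.
book = ET.fromstring(
    "<book><title>Foundations of Databases</title>"
    "<author><lastname>Hull</lastname></author></book>")
bag = {"book": book, "book.author": book.find("author")}

from_author = evaluate(bag, "lastname: book.author")
from_book = evaluate(bag, "author.lastname: book")
```

Both expressions reach the same lastname vertex, but each one states explicitly which bag element it starts from.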
Projections
- With the tools we have, it’s easy to project out elements
- We just specify, using the correct path expression, which element of the bag we wish to project
- Let’s call this operator Expose
- Example:
  - (lastname: book.author, title: book)
- Expose creates element content
Element construction
- We need a way to specify the vertex that encloses the projected ones
- Call this operator Vertex (v)
- Creates the vertex, as well as the named arc that leads to it
- Example:
  - v(book_author)
[Figure: a vertex reached by an arc named book_author]
One last step…
- We need to be able to construct complex elements, i.e., a way to handle arbitrary nesting
- Each path expression designated inside an Expose operator can be tagged with a Vertex operation
Element construction example
- v(book_info)[ (v(name)[lastname: book.author], v(title)[title: book]) ] constructs:
[Figure: a book_info vertex with two children, name (lastnames…) and title (titles…)]
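Nested construction can be imitated with `xml.etree.ElementTree`, as in the sketch below. The `vertex` helper and the already-projected values in `bag` are assumptions; the code only illustrates how a Vertex operation wraps exposed content in a new tag.

```python
import xml.etree.ElementTree as ET


def vertex(tag, content):
    """Wrap `content` (text or a list of elements) in a new element named `tag`."""
    element = ET.Element(tag)
    if isinstance(content, str):
        element.text = content
    else:
        element.extend(content)
    return element


# Values assumed to have been exposed already for one bag.
bag = {
    "lastname: book.author": "Hull",
    "title: book": "Foundations of Databases",
}

book_info = vertex("book_info", [
    vertex("name", bag["lastname: book.author"]),
    vertex("title", bag["title: book"]),
])
xml_text = ET.tostring(book_info, encoding="unicode")
```

Tagging each exposed path with its own Vertex is what gives arbitrary nesting: any `vertex(...)` call can itself appear in the content list of another.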
The Niagara Algebra
- Six basic operators:
  - Source
  - Follow
  - Select
  - Join
  - Expose
  - Vertex
- Regular path expressions used for element specification
- Differentiation between tags, elements, contents
- Filtering and construction operators
- Assume an unordered XML data model
Source operator
- Input: the initial collection
  - Singleton bags, each containing the root of one XML file
- Output: either the initial collection, or a subset of it
- The selection can also be based on conformance to a DTD or XML schema
- Examples
  - Source(“*”): the initial collection (the “from *” clause)
  - Source(“foo.xml”): { [foo.xml] }
  - Source(“bib*.xml”): { [bib90.xml], [bib91.xml], … }
  - Source(“*”, “book.dtd”): { [books.xml], [morebooks.xml], … }
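A toy version of Source, to make the examples concrete. The initial collection is a hypothetical mapping from file name to the DTD each file declares, file names stand in for document roots, and DTD conformance is reduced to comparing that declared name (no validation is performed). Pattern matching uses the standard `fnmatch` module.

```python
from fnmatch import fnmatch

# Hypothetical initial collection: file name -> declared DTD (None if absent).
FILES = {
    "foo.xml": None,
    "bib90.xml": "bib.dtd",
    "bib91.xml": "bib.dtd",
    "books.xml": "book.dtd",
    "morebooks.xml": "book.dtd",
}


def source(files, pattern, dtd=None):
    """Return singleton bags for files matching `pattern` (and `dtd`, if given)."""
    return [[name]
            for name, declared in sorted(files.items())
            if fnmatch(name, pattern) and (dtd is None or declared == dtd)]


everything = source(FILES, "*")
foo = source(FILES, "foo.xml")
bibs = source(FILES, "bib*.xml")
book_dtd = source(FILES, "*", "book.dtd")
```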
Putting it all together…
- Query: “Book titles of authors named Ramakrishnan who have written an article as well”
- Plan, from the top down:
  - v(raghu_title)
  - (book: title)
  - j(author: book = author: article)
    - first input: (book) over s(books.xml)
    - second input: (author: article = Ramakrishnan) over (article) over s(articles.xml)
- Output: <raghu_title>Principles…</raghu_title>
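The whole plan can be traced end to end with a rough Python sketch. Everything here is an illustrative assumption rather than Niagara's implementation: bags are dicts keyed by label, the helper names are invented, articles.xml is abridged, and conditions on author are read existentially (a bag qualifies if any author matches).

```python
import xml.etree.ElementTree as ET

BOOKS = """<bib>
  <book isbn="01"><title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Vianu</author><author>Hull</author></book>
  <book isbn="02"><title>Principles of Database Systems</title>
    <author>Ramakrishnan</author></book>
  <book isbn="03"><title>Niagara Blues</title><author>Galanis</author></book>
</bib>"""

ARTICLES = """<proc>
  <article><title>Magic is relevant</title><author>Ramakrishnan</author></article>
  <article><title>The Niagara Insomniac</title><author>Viglas</author></article>
</proc>"""


def source(text, label):
    """Singleton bag holding the root of one document."""
    return [{label: ET.fromstring(text)}]


def follow(collection, anchor, arc, label):
    """Extend each bag with every vertex reached from bag[anchor] over `arc`."""
    return [{**bag, label: match}
            for bag in collection
            for match in bag[anchor].findall(arc)]


def select(collection, predicate):
    return [bag for bag in collection if predicate(bag)]


def join(left, right, condition):
    return [{**l_bag, **r_bag}
            for l_bag in left for r_bag in right if condition(l_bag, r_bag)]


def vertex(tag, text):
    element = ET.Element(tag)
    element.text = text
    return element


def authors(element):
    return {author.text for author in element.findall("author")}


books = follow(source(BOOKS, "bib"), "bib", "book", "book")
articles = follow(source(ARTICLES, "proc"), "proc", "article", "article")
raghu = select(articles, lambda bag: "Ramakrishnan" in authors(bag["article"]))
joined = join(books, raghu,
              lambda b, a: bool(authors(b["book"]) & authors(a["article"])))
output = [ET.tostring(vertex("raghu_title", bag["book"].findtext("title")),
                      encoding="unicode")
          for bag in joined]
```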
Summary
- Operators operate on a collection of bags of vertices
- Path expressions identify vertices
- Following of path expressions, Selections and Joins filter the input
- XML output is constructed with Expose/Vertex operations
- These are complicated data, so it’s a complicated algebra
- …but it seems to work