XML Publishing Bridging Theory and Practice Wenfei Fan

XML Publishing: Bridging Theory and Practice Wenfei Fan University of Edinburgh and Bell Laboratories 1

XML documents Rooted, node-labeled, ordered, unranked tree ü element: e. g. , course, prereq – tagged, subtree, – subelement, e. g. , the prereq child of course ü text node, e. g. , “CS 650”, carrying text, not tagged, leaf db course cno title course . . . type “CS 650” “Web DB” regular course prereq cno . . . cno prereq . . . 2

XML publishing: data exchange on the Web XML view RDB mapping source ü Most legacy data is stored in relational databases ü XML has become the prime standard for data exchange Web XML Q: XML view XML publishing DB 1 DB 2 3

XML publishing: an XML interface of databases query answer XML publishing middleware DTD query translation DBMS RDB Querying and updating traditional databases via XML views 4

Example: XML publishing Registrar DB db XML view R course cno title course . . . type “CS 650” “Web DB” regular course prereq cno . . . cno prereq Relational schema R 0: course (cno, title, type) prereq (cno 1, cno 2) -- prerequisite hierarchy . . . XML DTD D 0: db course prereq type course* cno, title, type, prereq cno*, prereq regular | project 5

XML publishing languages in practice view RDB XML view mapping source XML view definition languages: XML views published from RDB ü Commercial products: – Microsoft SQL Server 2005 (FOR-XML, XSD) – IBM DB 2 XML Extender (SQL/XML, DAD: SQL, RDB) – Oracle 10 g XML DB (SQL/XML, DBMS_XMLGEN) … ü Research prototypes: – XPERANTO – Tree. QL (Silk. Route) – ATG (PRATA) 6

XML publishing in practice XML view RDB mapping source Top-down from the root, via embedded relational queries db Q Q 1 course cno title . . . course type “CS 650” “Web DB” regular relationa lquery course RDB course Q 2 prereq cno . . . cno prereq . . . 7

XML publishing: question of the users What language should a user choose to express the view? ü unbounded depth, nondeterministic “shape”, cannot be decided statically at compile time prereq cno*, prereq type regular | project db course cno title course . . . course type X “CS 650” “Web DB” regular prereq X project cno . . . Few publishing languages can define this view course collection cno prereq . . . unbounded 8

XML publishing: question of database vendors ü XML view: under each course, list all its prerequisites, direct or not – collapsing prerequisite hierarchy – a tree of depth three ü Question: is it necessary to upgrade DBMS and support SQL’ 99? db course Q 1 cno title “CS 650” “Web DB” course type . . . Q cno course . . . course cno project The expressive power and complexity of XML publishing languages 9

Outline ü XML publishing transducers ü Characterization of XML publishing languages in practice ü Complexity: evaluation cost, static analyses ü Expressive power: tree generation, relational characterization ü Dynamic aspect: incremental XML publishing, view updates ü Open research issues Joint work with ü Theory: Floris Geerts, Frank Neven [PODS’ 07] ü System: Michael Benedikt, Phil Bohannon, Cheeyong Chan, Rajeev Rastogi, … [SIGMOD’ 03, 04; VLDB’ 02, 04, 05; ICDE’ 07] 10

Outline ü XML publishing transducers ü Characterization of XML publishing languages in practice ü Complexity: evaluation cost, static analyses ü Expressive power: tree generation, relational characterization ü Dynamic aspect: incremental XML publishing, view updates ü Open research issues 11

XML publishing transducers = (Q, , q 0, ) for a relational schema R ü Q: a finite set of states ü : a finite alphabet of XML tags, with a root r and text ü q 0: the start state ü : for each pair (q, a) in Q (q, a) (q 1, a 1, 1(x 1, y 1)), . . . , (qk, ak, k(xk, yk)), – to generate the children of a nodes: a 1*, . . . , ak* – register Rega: set-valued, fixed arity, with each a-node – i: query R Regai in a relational query language L – xi: a list of free variables in i, grouping attributes – deterministic: • (q, text) . -- Empty RHS: text nodes have no children 12

Top-down transduction ü Start rule: (q 0, r) -- q 0, r 0 do not appear on the RHS of any rule (q 0, db) (q, course, 1(c, t; )) 1(c, t; nil) = t’ course(c, t’, t) x = (c, t) y= recall course (cno, title, type) ü tuple register Regc: group the result by all attributes: for each distinct tuple tp in the result of 1(x; ) – create a course element – carry the tuple tp in Regc ü expand at leaf nodes (q 0, db) (q, a) labeled carrying Reg (q, course) Regc . . . (q, course) Regc 13

Registers: tuple vs. relation (q, course) (q, cno, 2(c; )), (q, type, 3(t; )), (q, prereq, 4( ; c)) 2(c; ) = t Regc(c, t) 4( ; c) = t, c’ (Regc(c’, t) prereq(c’, c)) x= recall prereq(cno 1, cno 2) y=(c) ü tuple registers Regcno, Regt ü relation register Regp : x = , the result of 4( ; c) is a set ü top down information passing: the parent register Regc in 4( ; c) (q 0, db) . . . (q, course) Regc (q, cno) (q, type) (q, prereq) Regcno Regt Regp Regc (q, course) Regc 14

Recursive transducer and stop condition (q, prereq) (q, cno, 5(c; )), (q, prereq, 5( ; c)) 5( ; c) = t, c’ (Regp(c’, t) prereq(c’, c)) relation Reg ü Stop conditions: tuple Reg – 5( ; c) returns an empty set – the RHS of (q, a) is empty (e. g. , for text nodes) – there is an ancestor node with the same label, tag and register No new information can be added to the tree (q 0, db) (q, prereq) Regp (q, cno) Regcno . . . (q, course) (q, a) (q, cno) Regcno (q, prereq) Regp STOP Rega (q, a) Rega 15

Transformation of a publishing transducer terminates on a DB of R if all leaf nodes satisfy a stop condition ü (DB): XML tree, by striking out states and registers ü (R): the set of XML trees generated by for all DB of R db db) (q 0, cno (q, cno) course (q, course) type Reg (q, type) “CS 650” Reg regular Reg “CS 650” (q, regular) course (q, course) Reg . . . . cno course (q, course) prereq (q, prereq) Reg Reg cno prereq (q, cno) Reg (q, prereq) . . . Reg 16

publishing transducers with virtual nodes = (Q, , a, q 0, ) ü a : a subset of , virtual tags Recall the view: under each course, list all its prerequisites ’ = (Q, , a = {prereq}, q 0, ) db db course cnocno type “CS 650” regular “CS 650” course type regular . . course cnoprereq cno course cno prereq Virtual nodes are removed from the output . . . 17

Various classes of publishing transducers ü PT(L, S, O) – L: the relational query language (CQ, FO, FP, with = and ) – S: register, relation vs. tuple (a special case of relation Reg) – O: output nodes, normal vs. virtual ü PTnr(L, S, O): non-recursive subset of PT(L, S, O) Example: ü View 1: PT(CQ, relation, normal) ü View 2: PT(CQ, relation, virtual) and PTnr(FP, tuple, normal) As opposedtotorecent querywork automata In contrast on schema mapping ü take a relational as input, rather an existing tree relations to XML, database not relation-to-relation orthan XML-to-XML output a new tree, ratherqueries, than accepting a tree or selecting nodes ü via embedded relational not source-to-target constraints 18

Outline ü XML publishing transducers ü Characterization of XML publishing languages in practice ü Complexity: evaluation cost, static analyses ü Expressive power: tree generation, relational characterization ü Dynamic aspect: incremental XML publishing, view updates ü Open research issues 19

Existing XML publishing languages ü Extensions of SQL by incorporating XML publishing functions – – – Microsoft SQL Server 2005 (FOR-XML) IBM DB 2 XML Extender (SQL/XML) Oracle 10 g XML DB (SQL/XML, DBMS_XMLGEN) XPERANTO … ü Annotating schema or fixed tree template with relational queries – – – Microsoft SQL Server 2005 (XSD) IBM DB 2 XML Extender (DAD: SQL, RDB) Tree. QL (Silk. Route) ATG (PRATA). . . 20

Extensions of SQL for XML publishing ü SQL/XML: XMLElement, XMLForest, XMLAgg, XMLConcat, … SELECT XMLELEMENT {NAME=“course”, XMLFOREST{ c. cno AS “cno”, c. title AS “title”}} FROM course c course db course . . . course cno title PTnr(FO, tuple, normal): no recursion, virtual nodes ü XPERANTO: PTnr(FO, tuple, normal) ü Microsoft SQL Server 2005 (FOR-XML): PTnr(FO, tuple, normal) ü Oracle 10 g XML DB – DBMS_XMLGEN: PT(FP, tuple, normal) (connect-by of SQL’ 99) 21

Annotating schema or tree template ü ATG of PRATA: DTD-directed view definition, inherited attributes prereq cno*, prereq $cno Q($prereq_p), $prereq_c = Q($prereq_p) /* semantic rules */ Q: SELECT cno 2 FROM prereq p, $prereq_p p’ WHERE p. cno 1 = $prereq_p. cno – $prereq_p: parent attribute (relation register) prereq cno . . . cno prereq PT(FO, relation, virtual): recursive views, virtual nodes, DTD-conformance ü Microsoft SQL Server 2005 (XSD): PTnr(CQ, tuple, normal): ü IBM DB 2 XML Extender DAD-SQL: PTnr(CQ, tuple, normal), DAD-RDB: PTnr(CQ, tuple, normal) ü Tree. QL (Silk. Route): PTnr(CQ, tuple, virtual) 22

Putting these together Microsoft SQL Server 2005 IBM DB 2 XML Extender Oracle 10 g XML DB FOR XML PTnr(FO, tuple, normal) annotated XSD PTnr(CQ, tuple, normal) SQL/XML PTnr(FO, tuple, normal) DAD-SQL PTnr(FO, tuple, normal) DAD-RDB PTnr(CQ, tuple, normal) SQL/XML PTnr(FO, tuple, normal) DBMS_XMLGEN PT(FP, tuple, normal) XPERANTO PTnr(FO, tuple, normal) Silk. Route Tree. QL PTnr(CQ, tuple, virtual) PRATA ATG PT(FO, relation, virtual) 23

Outline ü XML publishing transducers ü Characterization of XML publishing languages in practice ü Complexity: evaluation cost, static analyses ü Expressive power: tree generation, relational characterization ü Dynamic aspect: incremental XML publishing, view updates ü Open research issues 24

Termination and evaluation cost Given a publishing transducer defined for a relational schema R, ü does the transformation of on DB terminate on all DB of R? ü how expensive is it to compute (DB)? (DB) is always defined on any instance DB of R. Worst-case data complexity: is in PT(L, tuple, O) 2 EXPTIME if is in PT(L, relation, O) PTIME if is in PTnr(L, S, O) ü EXPTIME if ü ü q Tight bounds: DAG tree, n-digit binary counter q L and O have no impact on the worst-case data complexity 25

Static analyses For a class PT(L, tuple, O) of publishing transducers, ü The emptiness problem: given in PT(L, tuple, O), can generate a nontrivial XML tree? Does the publishing transducer make sense? ü The membership problem: given an XML tree T and transducer in PT(L, tuple, O), can generate T with some DB? Can generate XML views that the user wants? ü The equivalence problem: given 1, 2 in PT(L, tuple, O) on the same relational schema R, do 1 and XML views over all instances of R? 2 generate the same Optimization: Can 1 be replaced by a more efficient 2? 26

Matching complexity bounds for static analyses ü PT(L, S, O) when L is either FO or FP: beyond reach – emptiness, membership and equivalence: undecidable ü PT(CQ, S, O): slightly better – Emptiness • PTIME if O is normal • NP-complete if O is virtual – Membership: • 2 p-complete for PT(CQ, tuple, normal) • undecidable if S is relation or O is virtual Reduction from (a) the satisfiability problem for FO queries, and (b) the emptiness problem for 2 -head DFA – Equivalence: undecidable Reduction from the halting problem for 2 RMs 27

Complexity bounds for non-recursive transducers ü PTnr(FO, S, O): all three problems remain undecidable ü PTnr(CQ, S, O): make our lives easier – Emptiness: the same as PT(CQ, S, O) – Membership (S is tuple): • PTnr(CQ, tuple, normal): 2 p-complete – no better • PTnr(CQ, tuple, virtual): undecidable 2 p-complete Establish the small model property – Equivalence • PTnr(CQ, tuple, O): undecidable 3 p-complete Lower bound: reduction from * * *3 SAT Upper bound: a constructive proof 28

Summary: complexity bounds fragments Equivalence Emptiness Membership PT(FP, S, O) undecidable PT(FO, S, O) undecidable PT(CQ, tuple, normal) undecidable PTIME 2 p-complete PT(CQ, relation, normal) undecidable PTIME undecidable PT(CQ, S, virtual) undecidable NP-complete 2 p-complete PTnr(FO, O, S) undecidable PTnr(CQ, tuple, normal) 3 p-complete PTIME undecidable PTnr(CQ, tuple, virtual) 3 p-complete NP-complete 2 p-complete 29

Outline ü XML publishing transducers ü Characterization of XML publishing languages in practice ü Complexity: evaluation cost, static analyses ü Expressive power: tree generation, relational characterization ü Dynamic aspect: incremental XML publishing, view updates ü Open research issues 30

Containment relation PT(FP, relation, virtual) = PT(FO, relation, virtual) PT(CQ, rel, virt) PT(FP, tup, virt) PT(FP, rel, nm) PT(FP, tup, nm) PT(FO, rel, nm) PT(FO, tup, virt) PT(FO, tup, nm) PTnr(FO, tup, nm) PT(CQ, rel, nm) PT(CQ, tup, virt) PT(CQ, tup, nm) PTnr(CQ, tup, virt) PTnr(CQ, tup, nm) XML view: under each course, list all its prerequisites, direct or not No need to upgrade DBMS and support SQL’ 99 31

Compared to logical transduction ü ( dom(x), root(x), edge(x; y), <(x; y), fc(x; y), ns(x; y), a(x)) – domain, root, edge, order, first-child, next-sibling, label – define DAGs, unfold into a tree – FO-transductions, SO-transduction (fixed k-arity), PTIME FOtransductions, PSPACE-SO-transductions Publishing transducers vs. logical transductions ü L-transductions PT(L, tuple, virtual) strict for FO ü PSPACE-SO-transductions PT(FP, relation, virtual) (ordered) ü PTIME-FO-transductions PT(FO, relation, virtual) (ordered) ü fixed-depth L-transductions = PTnr(L, tuple, O) (unordered tree) ü PTnr(L, tuple, O) fixed-depth L-transductions (L: FP, FO) No need to code stop conditions 32

DTD and specialized DTD ü DTD D = ( , r, ), : a for each a – normalized: : : = a 1, …, ak | a 1 + … + ak | a*, Specialized DTD D’ = ( ’, D, g), D: a DTD, g: ’ – – T’ conforms to D’: there is T s. t. T = g(T’) and T conforms to D Captures MSO definable trees and regular trees Capturing (specialized) DTD: ü specialized DTDs are definable in PT(FO, tuple, virtual) ü normalized DTDs are definable in PT(FO, tuple, normal) ü there are normalized DTDs not definable in PT(CQ, S, O) Check each a in FO, return a default in the presence of violation DTD-directed publishing: All members of a community (or industry) agree on a DTD and then exchange data w. r. t. the predefined DTD 33

publishing transducer as a relational query ü Input: = (Q, , q 0, ) for R, an output tag o , a DB of R ü Output: the union of Rego(v) for all v in the tree generated db Q 1 course cno title course type regular . . . course relational query RDB Q 2 prereq Reg cno . . . cno prereq . . . Reg output 34

Containment hierarchy: as relational queries Flattened: PT(L, S, virtual) = PT(L, S, normal) PT(FP, relation, O) = PT(FO, relation, O) PT(FO, rel, O) PT(FP, tup, O) PT(CQ, rel, O) PT(FO, tup, O) not strict if NLOGSPACE = PTIME PT(CQ, tup, O) PTnr(FO, tuple, O) PTnr(CQ, tuple, O) 35

complexity classes and relational query languages ü PT(FO, relation, O) captures PSPACE (ordered or unordered) (a) Recognition problem can be determined using PSPACE TM (b) Simulate partial fixpoint query and define a total order ü PT(FP, tuple, O) captures FP and thus PTIME (ordered) ü PT(FO, tuple, O) – captures TC 0[FO] and thus NLOGSPACE (ordered) – TC 0[FO] (unordered) Simulate transitive closure logic and vice versa ü PT(CQ, relation, O) contains deterministic datalog ü PT(CQ, tuple, O) captures linear datalog: p(x) p 1(x 1), …, pk(xk) – deterministic: each p(x) has only one rule – linear: at most one pj is an IDB 36

non-recursive classes as relational query languages ü PTnr(FO, tuple, O) captures FO (ordered or unordered) ü PTnr(CQ, tuple, O) captures UCQ (ordered or unordered) Simulate union of conjunctive queries and vice versa Those corresponding to existing XML publishing languages ü PTnr(FO, tuple, O): SQL/XML, FOR-XML (Microsoft), IBM DAD (SQL), … ü PTnr(CQ, tuple, O): XSD (Microsoft), Tree. QL 37

Expressiveness as relational queries fragments Complexity/language PT(FP, relation, O) PSPACE PT(FO, relation, O) PSPACE PT(FP, tuple, O) FP, PTIME (ordered databases) PT(FO, tuple, O) TC 0[FO], NLOGSPACE (ordered databases) PT(CQ, relation, O) deterministic datalog PT(CQ, tuple, O) TC 0[CQ], linear datalog PTnr(FO, tuple, O) FO PTnr(CQ, tuple, O) UCQ PT(L, S, virtual) = PT(L, S, normal) 38

Outline ü XML publishing transducers ü Characterization of XML publishing languages in practice ü Complexity: evaluation cost, static analyses ü Expressive power: tree generation, relational characterization ü Dynamic aspect: incremental XML publishing, view updates ü Open research issues 39

Incremental publishing ü Input: – – a publishing transducer for relational schema R an instance DB of R XML view T = (DB) relational updates DB ü Output: XML updates T such that T + T = (DB + DB) Commercial products: limited support XML publishing middleware incremental updates RDB T DBMS DB 40

Why incremental update? DB source Updates: DB database XML publishing incremental update T cached T ü Batch computation: recompute the entire XML tree from scratch; large XML views may take several hours to produce! ü Incremental computation: compute XML change T – Idea: the new view T’ = the old view T + T – Typically more efficient to compute T (small) and update the old view T with T – Why? the new view T’ often differs slight from the old view T 41 – reuse partial results computed earlier

Reduction Approach ü Most XML middleware takes a “reduction approach”: – treat Relational Database Systems (DBMS) as a black box, – re-use as much functionality of DBMS as possible ü Why not the reduction approach for incremental updates? – XML views are recursive – Few systems support WITH…RECURSIVE (linear recursion) – Fewer support its use in views – None supports incremental update of recursive views (many algorithms are known for incremental updates of recursive views, but unfortunately not in practice) ü The lowest common denominator of functionality of DBMS -- no need for (recursive) view-update support 42

Sub-Tree Property report patient SSN name treatment policy# “ 123” “Bush” tname “insane” in. Treatment . . . patient SSN name treatment policy# “ 234” “Cheney” tname treatment “insane” Sub-tree Property: Given a transducer and DB, each sub-tree is uniquely determined by (q, a, Rega), e. g. , (q, treatment, Regtr), in. Treatment treatment 43

Storing and updating XML – a DAG representation ü Storing each XML sub-tree only once, at any level of granularity - Associate an ID with each node in the tree (Skolem function) Small, unique value derived from the node’s register - A hash table H to map from (q, type, ID) to a node in the graph - Sub-tree pool: each node has a reference count and a children list [(q 1, type 1, ID 1), (q 2, type 2, ID 2), …] ü XML update T = (E+, E-) of edges ((q 1, type 1, ID 1), (q 2, type 2, ID 2)) - E-: remove (q 2, type 2, ID 2) from the child list of (q 1, type 1, ID 1) and decrement reference count on (q 2, type 2, ID 2) - E+: insert (q 2, type 2, ID 2) in the child list of (q 1, type 1, ID 1) and increment reference count on (q 2, type 2, ID 2) - Nodes with 0 reference counts move to sub-tree pool – to be reused (treatment, “t 123”) (in. Treatment, “i. T 234”) H [(tname, “chemo”), (in. Treatment, “i. T 23”)] [(treatment, “t 345”), (treatment, “t 567”), 44 … ]

Computing XML changes ü Computing XML changes T from database changes DB by incrementalizing SQL queries in a transducer: select from where IP, P. tname 2 Procedure P, in. Treatment IP P. tname 1 = IP ü Cuts (deletions): given DB, deletions of the existing edges of T are determined by executing a fixed number of non-recursive SQL queries – no recursion is involved (sub-tree property) ü Buds (new sub-tree generation): top-down iteration, evaluating non-recursive SQL queries at each step – Each new sub-tree is computed at most once, by sub-tree reusing (sub-tree pool) – minimizing recomputations – Partial results are “complete” up to a certain level at each step, allowing lazy evaluation and parallel processing 45

Steps to Bud-Cut 1. For a set of database changes, DB, execute a fixed number of non-recursive queries which determine direct edge changes, E-, E+ report E- are cuts E+ are buds (or cross edges) patient SSN name treatment policy# 2. Generate the patient SSN name treatment policy# “ 234” “Cheney” tname in. Treatment “ 123”“Bush” tname in. Treatment sub-trees under the buds, re-using as much existing and deleted subtrees as possible 3. Collect Garbage. X treatment 46

The XML view update problem ü Input: – a publishing transducer for relational schema R – an instance DB of R – XML view T = (DB) – XML updates T ü Output: relational updates DB such that T + T = (DB + DB) Commercial systems: limited support, already hard for relational views XML publishing middleware view updates RDB T DBMS DB 47

New challenges introduced by XML view updates ü Revising the semantics of side effects T: delete course[cno=`CS 650’]//course[cno=`CS 450’]/prereq/* Subtree property: remove the prerequisites of all CS 450 occurrences? ü DTD validation (if any) ü recursively defined – XML views db – XML updates course cno “CS 650” prereq course . . . course cno prereq “CS 450” X “CS 450” ? 48

Processing XML view updates Deriving relational views V from XML views (edge relations of DAG – external storage) 1. DTD validation – reject T if violation 2. Computing view updates V from T XML T relational views V 3. Computing updates DB from V May not exist – reject T if not V 4. Update the underlying DB and view V with DB from V DB DB Main challenges: relational view updates ü Hard: deciding view updatability is intractable/undecidable ü Open: complexity, algorithm, commercial system support 49

Outline ü XML publishing transducers ü Characterization of XML publishing languages in practice ü Complexity: evaluation cost, static analyses ü Expressive power: tree generation, relational characterization ü Dynamic aspect: incremental XML publishing, view updates ü Open research issues 50

XML integration: complexity and expressiveness DTD DB DB integration DB multiple, distributed sources constraints XML view XML integration transducers ü Two-way vs. top-down: context-dependent generation ü Integrity constraints: conformance to XML schema ü Information preservation: data migration XML integration language: Attribute Integration Grammar (AIG) 51

XML shredding query answer XML shredding middleware query translation DBMS RDB ü Storing XML data in relations: storage, query processing, RDBMS transaction control, … ü Primary goal: – store part or entire XML documents – content based – increment existing relations, rather than build a new one – directed by recursive XML schema 52

XML shredding automata db Q Q 1 course Reg cno title course type regular . . . XML query course XML Q 2 prereq Reg cno . . . cno prereq . . . ü Shredding automata vs. publishing transducers Reg RDB – take an existing tree as input, rather than relations – embedded XML queries, not relational, to compute Reg – output: union of relation registers – tuples to insert – combining XML SAX parsing and shredding, e. g. , XML 2 DB 53 ü Primary goal: expressive power and complexity

Summary ü XML publishing: a synergy between theory and practice – characterization of XML publishing languages in practice; – expressive power and matching complexity bounds. helpful guidance for both the users and database vendors ü Dynamic aspects: incremental publishing and view updates. important yet overlooked by and large ü Open research issues: – XML integration transducers – XML shredding automata –. . . An attempt to bridge theory and practice 54