Change-Centric Management of Versions in an XML Warehouse

Amélie Marian, Columbia University
Serge Abiteboul, Grégory Cobéna, Laurent Mignet, INRIA-Rocquencourt

Overview
  • The Xyleme Project
  • Change Management
  • Version Management
      – XIDs
      – XML Diff
      – Deltas
      – Storage of XML document versions
      – Implementation and experiments

VLDB-Sept 2001 Amélie Marian

The Xyleme Project
  • A dynamic XML data warehouse with high-level services:
      – User-friendly query engine
      – Semantic data integration
      – Version management
      – Query subscription and change monitoring services
  • The Xyleme project is now finished
  • A start-up is also called Xyleme

Change Management
  • Version Management
  • Learning about Changes
  • Monitoring Changes: Query Subscription
  • Querying the Past: Temporal Queries

Version Management
Our requirements:
  • Obtain the current version
  • Get the modifications since time t
  • Subscribe to change notifications, query changes
  • Compute temporal queries
  • Rebuild the version Vi of a document at time ti

Getting the Documents
  • XML documents are fetched from the web
  • We only have snapshots of the documents

[Figure: two snapshots of a Catalog of products (Pr), each with a name (N) and a price (P). Version 1: Camera 300, TV 100, VCR 200. Version 2: TV 100, DVD 500, VCR 150.]

XIDs
  • Unique identifiers are needed to track XML nodes through time:
      – Track changes on a specific node (e.g., a product in a catalog)
      – Reconstruct the history of a node
  • But physically adding an ID attribute to each node is expensive storage-wise

XIDs make it possible to attach persistent IDs to every node in a storage-efficient manner.

XIDs
  • XIDs are stored separately as a list (XID-map)
      – List of the node IDs in a postorder traversal of the tree
      – XIDnext: gives the next available XID
  • Compact representation
  • Document is not modified

[Figure: a tree whose nodes are labelled with their XIDs.]
XID-map: (1-3, 14-15, 7-13|16)
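
The compact XID-map notation on this slide can be illustrated with a short Python sketch. It assumes the map is a comma-separated list of ranges of consecutive XIDs in postorder, followed by `|` and XIDnext. The function names `parse_xid_map` and `format_xid_map` are hypothetical and are not taken from the Xyleme implementation.

```python
def parse_xid_map(text):
    """Expand '(1-3, 14-15, 7-13|16)' into (postorder XID list, XIDnext)."""
    body, _, next_part = text.strip().strip("()").partition("|")
    xids = []
    for chunk in body.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        lo, _, hi = chunk.partition("-")
        lo = int(lo)
        hi = int(hi) if hi else lo  # a single XID is a range of length one
        xids.extend(range(lo, hi + 1))
    next_xid = int(next_part) if next_part else None
    return xids, next_xid


def format_xid_map(xids, next_xid):
    """Collapse a postorder XID list into ranges of consecutive integers."""
    runs = []
    for x in xids:
        if runs and x == runs[-1][1] + 1:
            runs[-1][1] = x          # extend the current run
        else:
            runs.append([x, x])      # start a new run
    parts = [f"{lo}-{hi}" if lo != hi else str(lo) for lo, hi in runs]
    return "(" + ", ".join(parts) + "|" + str(next_xid) + ")"


xids, next_xid = parse_xid_map("(1-3, 14-15, 7-13|16)")
print(xids)                            # [1, 2, 3, 14, 15, 7, 8, 9, 10, 11, 12, 13]
print(format_xid_map(xids, next_xid))  # (1-3, 14-15, 7-13|16)
```

Twelve nodes are described by three ranges, which is why the map stays small when most of the document keeps its original postorder numbering.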

XML Diff
  • We implemented an XML diff algorithm to compute changes between two versions of a document:
      – Use of XML structure for matching
      – Content matching
  • Linear in the size of the document
  • XML diff has two roles:
      – Match nodes
      – Build the delta
  • Ongoing work on improving the XML diff

Node Matching using a Diff Algorithm

[Figure: Version 1 and Version 2 of the catalog, with every node labelled by its XID. The Camera product (XID 5) is deleted, the VCR price (XID 13) is updated, and a DVD product (XIDs 17-21) is inserted under the Catalog (XID 16).]

XID-map of Version 1: (1-16|17)
Diff(V1, V2):
    delete(5)
    update(13, 150)
    insert(16, 2, (17-21))
New XID-map: (6-10, 17-21, 11-16|22)
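
The catalog example can be replayed with a small Python sketch that applies Diff(V1, V2) to Version 1 and reads the XIDs back in postorder. The `Node` class and the helper functions are hypothetical illustrations, not the authors' C++ code. The sketch assumes that text nodes carry XIDs, that positions are 1-based, and that the deletion is applied before the insertion.

```python
class Node:
    """An XML node carrying its persistent XID."""
    def __init__(self, xid, label, children=None):
        self.xid = xid
        self.label = label
        self.children = list(children or [])


def postorder(node):
    for child in node.children:
        yield from postorder(child)
    yield node


def find(root, xid):
    """Return (node, parent) for the node with the given XID."""
    stack = [(root, None)]
    while stack:
        node, parent = stack.pop()
        if node.xid == xid:
            return node, parent
        stack.extend((child, node) for child in node.children)
    raise KeyError(xid)


def product(xids, name, price):
    """Build a Product subtree; the five XIDs are given in postorder."""
    t_name, n, t_price, p, pr = xids
    return Node(pr, "Product", [
        Node(n, "Name", [Node(t_name, name)]),
        Node(p, "Price", [Node(t_price, price)]),
    ])


def delete(root, xid):
    node, parent = find(root, xid)
    parent.children.remove(node)


def update(root, xid, value):
    find(root, xid)[0].label = value


def insert(root, parent_xid, position, subtree):
    parent, _ = find(root, parent_xid)
    parent.children.insert(position - 1, subtree)  # positions are 1-based


# Version 1, XID-map (1-16|17)
doc = Node(16, "Catalog", [
    product(range(1, 6), "Camera", "300"),
    product(range(6, 11), "TV", "100"),
    product(range(11, 16), "VCR", "200"),
])

# Diff(V1, V2) from the slide
delete(doc, 5)
update(doc, 13, "150")
insert(doc, 16, 2, product(range(17, 22), "DVD", "500"))

new_xids = [node.xid for node in postorder(doc)]
print(new_xids)  # [6, 7, 8, 9, 10, 17, 18, 19, 20, 21, 11, 12, 13, 14, 15, 16]
```

The resulting postorder list is exactly the new XID-map (6-10, 17-21, 11-16|22) once consecutive XIDs are collapsed into ranges and the next free XID, 22, is appended.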

Edit-Scripts = SEQUENCE
  • Sequences of basic operations over XML trees:
      – Delete(n)
      – Update(n, v)
      – Insert(m, k, T)
      – Move(n, k, m)
  • An edit script can be applied to a document D if its operations are consistent with D
  • An edit script applied to a document D results in a unique document D'
  • Several edit scripts applied to a document D can result in the same document D'

Deltas (Δ) = SET
  • We introduce an alternative way of representing changes: deltas
  • Δ(i, j) (unit delta) contains the set of operations needed to go from Vi to Vj (Diff(Vi, Vj))
  • A delta (Δ) over a document D is the sequence of unit deltas over D: Δ = {Δ(1, 2), ..., Δ(k-1, k)}
  • There is an (almost) unique delta from Vi to Vj
  • We represent deltas as XML documents

Shortcomings of Deltas
Deltas are not reversible and cannot be composed (information on position is missing)

Storage Policies
    a) V1, Δ(1, 2), … Δ(now-1, now)
    b) Δ(2, 1), … Δ(now, now-1), Vnow
    c) V1, Δ(2, 1), … Δ(now, now-1)
    d) Δ(1, 2), … Δ(now-1, now), Vnow
  • Only a) and b) are lossless
  • But we would like to have fast access to:
      – Vnow
      – Δ(i, now)

Completed Deltas (Δ+)
  • Completed deltas contain more information:
      – Delete(m, k, T)
      – Update(n, ov, nv)
      – Insert(m, k, T)
      – Move(n, k, m, p, q)
  • Completed deltas can be reversed and composed
  • Completed deltas are in the spirit of some logs in DB systems

Example of XML Δ+

    <delta>
      <unit_delta> … </unit_delta>
      <unit_delta>
        <time from="1" to="2"/>
        <delete parent="16" position="1" xid-map="(1-5)">
          <Product>
            <Name>Camera</Name>
            <Price>300</Price>
          </Product>
        </delete>
        <update xid="13" new_value="150" old_value="200"/>
        <insert parent="16" position="2" xid-map="(17-21)">
          <Product>
            <Name>DVD</Name>
            <Price>500</Price>
          </Product>
        </insert>
      </unit_delta>
    </delta>

Operations on Deltas
  • Compute with a version:
      – Vi ∘ Δ+(i, j) = Vj
      – Vi ∘ Δ(i, j) = Vj
  • Reverse: (Δ+(i, j))^-1 = Δ+(j, i)
  • Compose: Δ+(i, j) ; Δ+(j, k) = Δ+(i, k)
  • Simplify: Δ+(i, j) → Δ(i, j)
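
The Reverse operation can be sketched in Python under a few assumptions: a completed unit delta is modelled as a set of operations, a subtree is abbreviated to a string, and Move is left out because the slides do not spell out the meaning of its parameters. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delete:
    parent: int      # m: XID of the parent
    position: int    # k: position under the parent
    subtree: str     # T: the deleted subtree (abbreviated here)


@dataclass(frozen=True)
class Insert:
    parent: int
    position: int
    subtree: str


@dataclass(frozen=True)
class Update:
    xid: int
    old_value: str
    new_value: str


def invert(op):
    """Return the operation that undoes op."""
    if isinstance(op, Delete):
        return Insert(op.parent, op.position, op.subtree)
    if isinstance(op, Insert):
        return Delete(op.parent, op.position, op.subtree)
    if isinstance(op, Update):
        return Update(op.xid, op.new_value, op.old_value)
    raise TypeError(f"unsupported operation: {op!r}")


def reverse(delta):
    """Reverse a completed unit delta, given as a set of operations."""
    return frozenset(invert(op) for op in delta)


# The completed delta from Version 1 to Version 2 of the catalog example
delta_1_2 = frozenset({
    Delete(16, 1, "Product(Camera, 300)"),
    Update(13, "200", "150"),
    Insert(16, 2, "Product(DVD, 500)"),
})
delta_2_1 = reverse(delta_1_2)
```

Every inverse is built only from fields the completed operation already stores. A plain Delete(n) or Update(n, v) lacks the deleted subtree or the old value, so the same function could not be written for plain deltas.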

Storage of Versions
  • For a document D (or a query result Q), we store:
      – Current version: Vk
      – XID-map (as text) of Vk
      – Current Δ+ = {Δ+(1, 2), ..., Δ+(k-1, k)}
  • When a new version k+1 arrives:
      – Compute the XML diff between Vk and Vk+1, compute Δ+(k, k+1)
      – Replace the current version with Vk+1
      – Replace the XID-map
      – Append Δ+(k, k+1) to Δ+
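
The storage procedure can be illustrated with a deliberately simplified Python sketch. As an assumption made only for this example, a document is a flat dictionary from XID to text value, so the toy `diff` below stands in for the XML diff and no separate XID-map is kept. `VersionStore` and its methods are hypothetical names.

```python
def diff(old, new):
    """Toy completed delta: XID -> (old value, new value); None means absent."""
    return {
        key: (old.get(key), new.get(key))
        for key in old.keys() | new.keys()
        if old.get(key) != new.get(key)
    }


def reverse(delta):
    """Swap old and new values, which is possible because both are stored."""
    return {key: (nv, ov) for key, (ov, nv) in delta.items()}


def apply(doc, delta):
    result = dict(doc)
    for key, (_, nv) in delta.items():
        if nv is None:
            result.pop(key, None)   # the node was deleted
        else:
            result[key] = nv        # the node was inserted or updated
    return result


class VersionStore:
    def __init__(self, first_version):
        self.current = dict(first_version)  # V_k, the only full version kept
        self.deltas = []                    # deltas[i] leads from version i+1 to i+2

    def add_version(self, new_version):
        """Compute the completed delta, replace V_k, append the delta."""
        self.deltas.append(diff(self.current, new_version))
        self.current = dict(new_version)

    def version(self, i):
        """Rebuild version i by undoing deltas backwards from the current one."""
        if not 1 <= i <= len(self.deltas) + 1:
            raise IndexError(i)
        doc = self.current
        for delta in reversed(self.deltas[i - 1:]):
            doc = apply(doc, reverse(delta))
        return dict(doc)


v1 = {1: "Camera", 3: "300", 6: "TV", 8: "100", 11: "VCR", 13: "200"}
v2 = {6: "TV", 8: "100", 17: "DVD", 19: "500", 11: "VCR", 13: "150"}
store = VersionStore(v1)
store.add_version(v2)
```

After `add_version`, `store.current` gives direct access to the latest version, while `store.version(1)` rebuilds the first snapshot by reversing the stored completed delta.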

Levels of Versioning
  • Full versioning is expensive, so we support different levels of versioning:
      – Full Versioning: Vnow + Δ+
      – Partial Versioning: Vnow + Δ
      – Last Version Update: Vnow + Δ(now-1, now)
      – Change Support: Vnow + XML diff computed for Query Subscription
      – Not Versioned: Vnow

Implementation
  • Version Manager and XML diff implemented in C++
  • A change simulator was implemented for tests
  • A GUI was implemented

GUI Interface

Deltas Statistics
  • Reasonable when there are not many modifications
  • Relatively expensive for small documents
  • Depends on the quality of the diff

Deltas Statistics (2)
  • 30% of modifications on the document
  • From left to right:
      – Snapshots
      – Completed Deltas
      – Deltas: composition and previous version reconstruction are not possible
      – Composed Completed Deltas: advantages of Completed Deltas but coarser granularity and higher cost

Conclusion
  • Management of versions based on change representation:
      – Representation in tree data (XML)
      – Study of storage policies
      – Implementation of running prototypes
  • Completed Deltas: a set of modifications
      – Mathematical properties of completed deltas (algebraic group)
  • Current work on Query Subscription, Continuous Queries and Changes over Collections of Documents

Example: Edit-Scripts vs. Deltas

[Figure: Version 0 is P with the single child A; Version 1 is P with children C, B, A.]

  • A possible edit script: Insert(B, 1, P), Insert(C, 1, P)
  • The delta: Insert(B, 2, P), Insert(C, 1, P)

Edit-scripts: relative position (at time of operation)
Deltas: absolute position (final)
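
The difference between relative and absolute positions can be checked with a few lines of Python. The sketch keeps only the child list of P and represents each insertion as a (node, position) pair with 1-based positions. Both function names are hypothetical, and applying a delta by sorting on final position is one possible strategy, not necessarily the one used in the prototype.

```python
def apply_edit_script(children, script):
    """Apply inserts one after another; each position is relative to the
    child list as it stands when the operation runs."""
    result = list(children)
    for node, position in script:
        result.insert(position - 1, node)
    return result


def apply_delta(children, delta):
    """Apply a set of inserts whose positions are final (absolute).
    Processing them by increasing final position puts each node in place."""
    result = list(children)
    for node, position in sorted(delta, key=lambda op: op[1]):
        result.insert(position - 1, node)
    return result


version0 = ["A"]
edit_script = [("B", 1), ("C", 1)]   # a sequence: order matters
delta = {("B", 2), ("C", 1)}         # a set: no order is stored

print(apply_edit_script(version0, edit_script))        # ['C', 'B', 'A']
print(apply_delta(version0, delta))                    # ['C', 'B', 'A']
print(apply_edit_script(version0, edit_script[::-1]))  # ['B', 'C', 'A']
```

Running the edit script in the opposite order yields a different document, whereas the delta describes Version 1 without reference to any order of execution.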

Example: Missing Information for Delta Composition (Δ(0, 2))

[Figure: Version 0 is P with children C, A; Version 1 is P with children C, B, A; Version 2 is P with children B, D, A.]

    Δ(0, 1):  Insert(B, 2, P)
    Δ(1, 2):  Delete(C), Insert(D, 2, P)
    Δ+(1, 2): Delete(C, 1, P), Insert(D, 2, P)

  • Deltas do not give information on parents and positions of deleted elements
    → Positions of inserted elements in the composition cannot be computed