Arch IS An Efficient TransactionTime Temporal Database System

Arch. IS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML Fusheng Wang University of California, Los Angeles

Temporal Databases: the Reality � Over 40 temporal data models and query languages have been proposed in the past �A long struggle to get around the limitations of RDBMS � No DBMS vendors have moved aggressively to extend SQL with temporal support � The � SQL problem is not due to a lack of applications. is at least in part to blame for the problem.

Outline � Transaction Time History of Database relations in XML u and Temporal Queries with XQuery � The Performance Problem u The Arch. IS System

Background: Publishing Relational Database as XML � Publishing relational DBs as XML uas actual XML documents: SQL/XML uas XML views: Silk. Route, XPeranto � Here we publish the history of relational content as XML documents or XML views u. Using a temporally grouped representation.

Example: Transaction-Time History of Tables Timestamped tuple snapshots (temporally ungrouped) deptno DOB start end Engineer d 01 1945 -04 -09 1995 -01 -01 1995 -05 -31 70000 Engineer d 01 1945 -04 -09 1995 -06 -01 1995 -09 -30 10003 70000 Sr Engineer d 02 1945 -04 -09 1995 -10 -01 1996 -01 -31 10003 70000 Tech Leader d 02 1945 -04 -09 1996 -02 -01 1996 -12 -31 name empno salary title Bob 10003 60000 Bob 10003 Bob Temporally grouped history of employees name empno salary 60000 1995 -01 -01: 199605 -31 Bob 10003 1995 -01 -01: 199612 -31 title Engineer 1995 -01 -01: 199509 -30 deptno d 01 1995 -01 -01: 199509 -30 Sr Engineer d 02 70000 1995 -10 -01: 199601 -31 1995 -06 -01: 1996 - Tech Leader DOB 1995 -10 -01: 199612 -31 1945 -04 -09 1995 -01 -01: 1996 -12 -31

XML Representation of DB History <employees tstart="1995 -01 -01" tend="1996 -12 -31"> <employee tstart="1995 -01 -01" tend="1996 -12 -31"> <empno tstart="1995 -01 -01" tend="1996 -12 -31">10003</empno> <name tstart="1995 -01 -01" tend="1996 -12 -31">Bob</name> <salary tstart="1995 -01 -01" tend="1995 -05 -31">60000</salary> <salary tstart="1995 -06 -01" tend="1996 -12 -31">70000</salary> <title tstart="1995 -01 -01" tend="1995 -09 -30">Engineer</title> <title tstart="1995 -10 -01" tend="1996 -01 -31">Sr Engineer</title> <title tstart="1996 -02 -01" tend="1996 -12 -31">Tech Leader</title> <deptno tstart="1995 -01 -01" tend="1995 -09 -30">d 01</deptno> <deptno tstart="1995 -10 -01" tend="1996 -12 -31">d 02</deptno> <DOB tstart="1995 -01 -01" tend="1996 -12 -31">1945 -04 -09</DOB> </employee>  </employees>

Departments <? xml version="1. 0" encoding="UTF-8"? > <departments tend="9999 -12 -31 tstart="1985 -01 -01"> <department tend="9999 -12 -31" tstart="1985 -01 -01"> <deptno tend="9999 -12 -31" tstart="1985 -01 -01">d 001</deptno> <deptname tend="9999 -12 -31"tstart="1985 -0101">Marketing</deptname> <mgrno tend="1991 -10 -01" tstart="1985 -01 -01">110022</mgrno> <mgrno tend="9999 -12 -31" tstart="1991 -10 -01">110039</mgrno> </department> <department tend="9999 -12 -31" tstart="1985 -01 -01"> <deptno tend="9999 -12 -31" tstart="1985 -01 -01">d 002</deptno> <deptname tend="9999 -12 -31” tstart="1985 -0101">Finance</deptname> <mgrno tend="1989 -12 -17" tstart="1985 -01 -01">110085</mgrno> <mgrno tend="9999 -12 -31” tstart="1989 -12 -17">110114</mgrno> </department>

Advantages of XML Representations � The attribute value history is grouped, and can be queried directly without coalescing � The H-document has a well-defined schema generated from the current table

Temporal Queries with XQuery � XQuery: the coming standard query language for XML � With XQuery, we can specify temporal queries without any extension: u Temporal projection, snapshot queries, temporal joins, interval queries: A SINCE B, continuous periods, period containment u Complex

Temporal Queries with XQuery � Temporal projection: retrieve the salary history of “Bob”: element salary_history { for $s in doc("employees. xml")/ employees/employee/[name=“Bob”]/salary return $s }

Temporal Queries with XQuery Snapshot queries: retrieve the departments on 1996 -01 -31: for $d in doc("depts. xml")/depts/dept [tstart(. ) <= "1996 -01 -31" and tend(. ) >= "1996 -01 -31"] let $n : = $d/name[tstart(. )<="1996 -01 -31" and tend(. )>="1996 -01 -31"] let $m : = $d/manager[tstart(. )<="1996 -01 -31" and tend(. )>= "1996 -01 -31"] return( element dept{$n, $m } )

Temporal Functions � Shield the user from the low-level details used in representing time, e. g. , “now” � Eliminate the need for the user to write complex functions, e. g. , coalescing � Predefined functions: u Restructuring: coalese($l) u Period tmeets comparison : toverlaps, tprecedes, tcontains, tequals, u Duration and date/time: tstart($e), tend($e), timespan($e) u telement(Ts, Te): constructs an empty element timestamped as tstart=Ts, tend=Te

Support for ‘now’ � ‘now’: no change until now � Internally, “end of time” values are used to denote ‘now’, e. g. , 9999 -12 -31 � Intervals are only accessed through built-in functions: tstart() returns the start of an interval, tend() returns the end or CURRENT_DATE if it’s different from 9999 -12 -31 � In the output, tend value can be: u “ 9999 -12 -31” u CURRENT_DATE by using rtend($e) that recursively replaces all the occurrence of 9999 -12 -31 with the current date, u “now”, using externalnow($e) that recursively replaces all the occurrence of 9999 -12 -31" with the string now".

Outline � Viewing Relation History in XML u Temporal Queries with XQuery � Performance u The Issues Arch. IS (Archival Information System) Project

Architecture: Two approaches 1. Native XML database approach: store H-documents directly into XML DB—limited performance and scalability 2. XML-enabled RDBMS. Design issues include: u mapping (shredding) the XML views representing the Hdocuments into tables (H-tables) u translation of queries from the XML views to the H-tables u indexing, clustering and query mapping techniques u Arch. IS: Archival Information System

The Arch. IS System: Architecture Current Database Relational Data SQL Queries Active Rules/ update logs Temporal XML Data H-views (H-documents) H-tables Temporal XML Queries A R C H I S

H-tables � Assumptions u Each entity or relation has a unique key ( or composite keys) to identify it which will not change along the history. e. g. , employee: empno � H-tables: u attribute history table: store history of each attribute u key table: built for the key u global relation table: record the history of relations � e. g. : current database: u employee(empno, title) name, sex, DOB, deptno, salary,

H-tables current table employee (cont’d) H-tables relations(relationname, tstart, tend) global relation table empno key table employee_id(id, tstart, tend) name attribute history employee_name(id, name, tstart, tend) … table … salary employee_salary(id, salary, tstart, tend) title employee_title(id, title, tstart, tend)

H-tables (cont’d) � Sample contents of employee_salary: ID ======= 100022 100022 100023. . . SALARY ======= 58805 61118 65103 64712 65245 43162 TSTART ===== 02/04/1985 02/05/1986 02/05/1987 02/05/1988 02/04/1989 07/13/1988 TEND ===== 02/04/1986 02/04/1987 02/04/1988 02/03/1989 02/03/1990 07/13/1989

Query Mapping � General purpose query mapping: XPeranto � In Arch. IS, we have well-defined mapping between H-documents (or H-views) and H-tables � We map temporal XQuery queries into SQL, utilizing SQL/XML u SQL/XML is a new standard to map between RDBMS and XML u Both tag-binding and structure construction is pushed inside the relational engine, thus be very efficient

XQuery Mapping to SQL with SQL/XML � Temporal projection: retrieve the salary history of “Bob”: element salary_history { for $s in doc("employees. xml")/ employees/employee/[name=“Bob”]/salary return $s } select XMLElement (Name "salaryhistory", XMLAgg (XMLElement (Name as "salary", XMLAttributes (S. tstart as tstart, S. tend as "tend"), S. salary))) from employee_salary as S, employee_name as N where N. id = S. id and N. name = 'Bob' group by N. id

SQL/XML Publishing Functions � XMLElement and XMLAttribute select XMLElement (Name "dept", XMLAttributes (tstart as "tstart", tend as "tend"), deptname) from dept where deptname = ‘Sales’ <dept tstart = "02/04/1985" tend = "12/31/9999"> Sales </dept>

SQL/XML Publishing Functions XMLAgg select XMLElement (Name as "new_employees", XMLAttributes ("02/04/2003" as "Since") XMLAgg (XMLElement (Name as "employee", e. name)) from employee_name as e where e. tstart >= ‘ 02/04/2003’ <new_employees Since = "02/04/2003"> <employee>Bob</employee> <employee>Jack</employee> </new_employees>

Performance Study: Experimental Setup � Systems: Tamino, DB 2, and Arch. IS uses Berkeley. DB as its storage manager, and it builds on top of it its SQL query engine � Temporal data set: the history of 300, 024 employees over 17 years u The simulation models real world salary increases, changes of titles, and changes of departments u The size of the XML data is 334 MB u The single large XML document is cut into a collection of 15, 000 small XML documents with around 25 KB each � Machine: Pentium IV 2. 4 GHz PC with Red. Hat 8. 0

Performance Study: Query Performance DB 2 and Arch. IS: with clustering Tamino: without clustering snapshot query Q 2 on Arch. IS is 137 times faster than that on Tamino; interval query Q 5 is 91 times faster; history Q 6 is 25 times faster; Q 4 4 times faster, and Q 3 near 3 times faster. Tamino with clustering: snapshot Q 2 is 3. 3 times faster than without clustering ( still 41 times slower than arch. IS); interval query Q 5 is 2. 9 times faster than without clustering ( still 31 times slower than on Arch. IS); history queries are much slower

Update Performance � For RDBMS, only the current segment is used for updates. For Tamino, current data and historical data are clustered together � Update an employee’s salary: u � DB 2: 0. 29 seconds; Tamino: 1. 2 seconds Assume that every employee gets updated once a year: about 1/260 of the total employee get updated every day on average u DB 2: 1. 52 seconds; Tamino: 15 seconds u In the worse case for segment-based archiving: 39 seconds for copying segments and 36 segments for compression: but only once

Arch. IS Summary � We built a transaction time temporal database on RDBMS and XML, with: u XML to support temporally grouped (virtual) representations of the database history u XQuery to express powerful temporal queries on such views u The XML historical views are shredded back into relations (H-tables) u SQL/XML for executing the queries on the XML views as equivalent queries on the relational DB u Temporal clustering schemes based on the concept of page useful deliver good performance.

Temporal Clustering and Indexing � Tuples in H-tables are stored in the order of updates, thus neither temporally clustered nor clustered by objects � Traditional indexes such as B+ Tree will not help on snapshot queries, and better temporal clustering is needed � For every segment, usefulness: U = Nlive/Nall u At the beginning, U =100%, and it decreases with updates u The minimum tolerable usefulness: Umin

Segment-based Clustering Scheme Live All All Segment 1 Segment 2 Segment 3 segstart 1 segend 1 segstart 2 segend 2 tstarttuple <= segend. SEG tendtuple >= segstart. SEG segstart 3 segend 3

Segment-based Clustering Scheme � Initially all tuples for an attribute history table archived in a live segment SEGlive with usefulness U =100%. With updates, when U drops below Umin: 1. A new segment is allocated; 2. The interval of this segment is recorded in the table segment(segno, segstart, segend); 3. All tuples in SEGlive are copied into a new segment Si sorted by id; 4. All live tuples in SEGlive are copied into a new live segment SEGlive', and the old live segment is dropped; After that, the new segment SEGlive’ becomes the new starting segment for updates

Segment-based Clustering Scheme (cont’d) � Sample segments: Segment 1 (01/01/1985 - 10/17/1991): ID SALARY TSTART TEND 100002 40000 02/20/1988 02/19/1989 100002 42010 02/20/1989 02/19/1990 100002 42525 02/20/1990 02/19/1991 100002 42727 02/20/1991 12/31/9999. . . Segment 2 (10/18/1991 - 07/08/1995): ID SALARY TSTART TEND 100002 42727 02/20/1991 02/19/1992 100002 45237 02/20/1992 02/18/1993 100002 46465 02/19/1993 02/18/1994 100002 47418 02/19/1994 02/18/1995 100002 47273 02/19/1995 12/31/9999. . .

Advantages of Segment-based Clustering Scheme � The current live segment always has a high usefulness, assuring efficient updates; � Records are globally temporally clustered on segments; � For snapshot queries, only one segment is used; for interval queries, only segments involved are used; � Flexibility to control the number of redundant tuples in segments with Umin

Storage Usage of Segment-based Clustering Relative storage size with different Umin Nseg <= N 0/(1 -Umin) NS

Query Performance on Temporal Data with Segment-based Clustering Queries: Point: Q 1 Snapshot: Q 2 Interval: Q 5 History: Q 3, Q 4, Q 6