Evolution of Hybrid DBMS Based on ROX Relational

Evolution of Hybrid DBMS Based on: ROX: Relational Over XML [VLDB 2004] System RX: One Part Relational, One Part XML [SIGMOD 2005] 1

The Trend n n The use of XML for representing information grows rapidly It is natural to store it natively (in XML) This opens new opportunities to use it later to exchange so called business objects The meaning behind all this is that portions of XML will have to be queried (by XQuery, for example) 2

The Situation Today n n n RDBMS (Relational DBMS) have been evolving for the past 2 decades It is still an active research field in academics as well as in industry SQL support is required for every system, otherwise it is hardly considered serious An enormous commercial success Industry-wide product developing continues 3

The Problem n n n A lot of applications today use RDBMS The trend suggests that they will have to access information stored as XML Rewriting an application to support XML accessing can be very expensive Various solutions were suggested Some of them exist today and some are predicted 4

Solution I n n n XML-Over-Relational (XOR) architecture (exists) Classic RDBMS storage “Shredding” XML document into a relational table 5

Solution I (contd. ) n XQuery to SQL translation layer n Advantages: q Slight modification of existing RDBMSs 6

Solution I (contd. ) n Disadvantages: q q q n Problematic XQuery to SQL translation Everything shredded incl. unused documents Inefficient for complex queries Some research prototypes: q q q Lego. DB XPeranto Shre. X 7

Solution II n n n Co-processor architecture (exists) Classic RDBMS storage XML documents stored as text in LOBs or VARCHARs ID Receive. Date Purchase. Order 4023 2001 -12 -01 <purchase. Order xmlns=“…”> <originator> <contact. Name>…</contact. Name> … </originator> <order>…</order> </purchase. Order> 8

Solution II (contd. ) n XML data opaque to RDBMS n Implies external XQuery processor: q q Implemented as userdefined function Communicates in textual format with SQL processor 9

Solution II (contd. ) n Implemented in most commercial RDBMSs (IBM, Oracle, …) n Advantages: q q n Modularity of query processors Simplicity Disadvantages: q Loose coupling of query processors 10

Solution III n n n Side-by-Side architecture (exists) Evolvement of XOR architecture Tighter coupling between query processors Inherently complex Intermediate solution 11

System RX – Overview n n n n An instance of Solution III above Developed by IBM Research Centers Extension of DB 2 UDB Same components as in existing relational DBMS Applications can easily migrate from relational to XML Some components (eg. optimization) have still unresolved issues which are open for research It is example of Hybrid System 12

System RX – Architecture n n Native XML store Unified query model used for XQuery & SQL XML indexes for efficient query evaluation Relational views of XML data for relationalcentric users 13

System RX – XML Store n n n XML documents stored as instances of QDM (XQuery Data Model) trees Trees are stored in binary form with each node having pointers to children/parent Saves repeated parsing & validation Related nodes stored on the same page Direct access to a page saves root node traversal 14

System RX – XML Store (contd. ) n n Node names & URIs are compressed into identifiers to save space A group of XML documents are viewed as a column in relational table. The column type is XML Type: However, instead of using LOB, Regions Index is used to reach the relevant page SQL/XML defines functions which produce / consume the XML Type 15

System RX – XML Store (contd. ) 16

System RX – Querying XML Data n XML-centric users use XQuery: 17

System RX – Querying XML Data (contd. ) n XMLTable presents relational view of XML data. In this case, for each bib document it evaluates FLWOR expression. Relational-centric users use SQL: Each time returns a row which corresponds to (price, names) schema. 18

System RX – Querying XML Data (contd. ) n n Each query, either XQuery or SQL is parsed into a query graph which is an instance of an extended query-graph model (QGM) The extended model is used to capture what is possible in SQL and XQuery – models the data flow in the query The query graph for both of the above queries is very similar Optimization is performed on query graph 19

System RX – XML Indexes n Uses 2 types of indexes: q q n n n Path Index – maps a reverse path to a path ID. Reverse path is a list of node labels from leaf to root: (name, author, book) Value Index – maps node values to path ID Implemented with 2 B+ trees Special syntax for index creation Indexes are chosen carefully to give maximal efficiency without too much storage overhead 20

System RX – Query Run-Time Evaluation n Extends relational query run-time evaluation to support XQuery: q q q XML Navigation – evaluates path & predicate expressions over XML store. Returns node references to be used by other run-time components Index Run-time – path-indexes used to locate path IDs for given path expression. Value-indexes used to constrain only the needed paths XQuery Functions Library 21

Back to: Solution IV (ROX) n n n ROX (Relational-Over-XML) architecture (predicted) Evolvement of Solution III – less complex, because of a “thinner” SQL support Native XML storage: q q Documents are broken into nodes Node information stored in B+ tree 22

ROX overview n The direct opposite of the XOR architecture: q q n n n XML is stored natively XQuery: primary query & processing language Data modeled by QDM SQL is supported through parse-rewrite layer Requires full implementation of XQuery engine XQuery & QDM subsumes SQL Implies gradual evolution (of System RX ? ) 23

ROX overview (contd. ) n n Output of SQL queries is a tabular view over XML documents In other words, ROX implies System RX’s Some XML rowset translation required infrastructure. However, SQL is no longer Implies that XML documents have schemas complements the DB, but is just an with sufficient homogeneity extension ! Relational optimization depends on schema homogeneity 24

Issues with ROX n Semantic perspective of SQL to XQuery translation: q q n Different data models Some differences in operational semantics XQuery is designed for structured data manipulation Arithmetic & boolean operators translation is easy Normalization: q q XML storage must permit normalization & denormalization De-normalized documents can be more efficient 25

Issues with ROX (contd. ) n Performance q q Sort order of XML tree: depth-first or breadth-first Document structure is stored inline with data XML index required for better efficiency Native store allows creation of indices over XML XPath expressions Node IDs XML (Path) Index • Pre-calculated path expressions • Node IDs = Node references 26

Optimization Issue n n Join & predicate expressions in SQL query must be matched to XPath expression and placed in correct places Automating this is a separate challenge XQuery queries must also be optimized System RX solves these tasks to some degree (XQuery optimization is a problem in System RX too) 27

Manageability Issue n “Google” model for DB: q q n Everything stored in one large heap One index over entire heap suffices Virtually no design No normalization needed An interesting approach but problematic: q q Normalization is still needed Logical boundaries required for admin purposes Hardware performance issues impose design Materialized views impose design 28

Experimental Prototype 29

XML Wrapper n n n Component of IBM’s DB 2 Information Integrator In System RX, XML documents are Creates relational represented by a column type. The ROX views (“nicknames”) of prototype uses a table. As if XMLTable XML data stored in function was already used. It means that XML Store System RX gives more flexibility for Nickname creation relational views over XML. syntax similar to CREATE TABLE 30

n n Queries the XML in order to produce rows according to the nickname Uses Xerces XML parser and Xalan XPath evaluator Homogeneity plays important role Consider: XML Wrapper (contd. ) Not considered as part of the nickname! 31

Walkthrough I n SQL query to be evaluated: n SQL parse tree is given to the Query Optimizer uses XML Wrapper to: n q q Get alternative execution plans Get cost estimates for each plan 32

Walkthrough I (contd. ) n Various execution plans: q q REGION only NATION only Rows with REGION and NATION columns reduced by the predicate NATION with r_regionkey as input; returns rows with an equal n_regionkey column Ø Ø r_regionkey is primary_key n_regionkey is foreign_key 33

Walkthrough I (contd. ) n n The last 2 plans are different Each plan has a data structure associated with it q q q Necessary data in order to be executed later Can be an XQuery For example (REGION scan): 34

Walkthrough I (summary) 35

Walkthrough II n The best execution plan is fed to Query Runtime n Any data associated with the plan is fed to XML Wrapper to get rows n First request in our case: scan(REGION) 36

Walkthrough II (contd. ) n For each row of REGION that XML Wrapper returns: q q q We get a value of r_regionkey, say: k Next request: scan(NATION, k) k references REGION element being a parent of NATION elements to return ! These elements can be already in memory n_count is just the number of these elements n DB 2 handles the “GROUP BY” n Possibly more efficient if handled by XML Wrapper 37

Walkthrough II (summary) 38

Dataset for Experiment n n n Uses TPC-H dataset (http: //www. tpc. org/tpch/spec/tpch 2. 3. 0. pdf) Benchmark dataset for business oriented queries Consists of 8 entities (scale factor of 1): q q q q REGION (5) NATION (25) SUPPLIER (~ 10 K) PART (~ 200 K) PARTSUPP (~ 800 K) CUSTOMER (~ 150 K) ORDERS (~ 1500 K) LINEITEM (~ 6 M) 39

Dataset for Experiment (1 -level) n Unnest (1 -level nesting): one XML document per row per relational table (entity) One row from the REGION entity 40

Dataset for Experiment (2 -level) n Nest 2 (2 -level nesting): q q q LINEITEM elements nested within correct ORDERS element PARTSUPP nested within PART All the rest as Unnest <ORDERS> <O_ORDERKEY>123</O_ORDERKEY> <O_ORDERDATE>12 -03 -02</O_ORDERDATE>. . . <LINEITEM> <L_ORDERKEY>123</L_ORDERKEY> <L_QUANTITY>4</L_QUANTITY>. . . </LINEITEM> </ORDERS> <PART> <P_PARTKEY>76</P_PARTKEY> <P_PARTNAME>wheel</P_PARTNAME>. . . <PARTSUPP> <PS_PARTKEY>76</PS_PARTKEY> <PS_SUPPKEY>4</PS_SUPPKEY> <PS_AVAILQTY>500</PS_AVAILQTY>. . . </PARTSUPP> </PART> 41

Dataset for Experiment (3 -level) n Nest 3 (3 -level nesting): q q q n LINEITEM elements nested within ORDERS elements nested within CUSTOMER elements All the rest as Unnest Maximal level possible <CUSTOMER> <C_CUSTKEY>99</C_CUSTKEY> <C_NAME>Some. Firm Inc. </C_NAME>. . . <ORDERS> <O_ORDERKEY>123</O_ORDERKEY> <O_CUSTKEY>99</O_CUSTKEY> <O_ORDERDATE>12 -03 -2002</O_ORDERDATE>. . . <LINEITEM> <L_ORDERKEY>123</L_ORDERKEY> <L_QUANTITY>4</L_QUANTITY>. . . </LINEITEM> </ORDERS> </CUSTOMER> 42

Experiment Environment n n 4 Power. PC processors AIX 5. 1 OS 16 GB main memory Data managed on 22 5 GB SCSI disks 43

Storage Comparison n Storage (number of disk pages used): q q n Native XML storage is ~5 times larger compared to relational storage of the same data XML store uses Unicode encoding Document structure duplicated for every record XML data stored in text format incl. numbers Bufferpool (disk pages stored in memory): q q Under same constraint XML takes more time Larger scale factor of dataset constraints the bufferpool (< 10%) 44

Queries for Experiment n TPC-H Q 10 and Q 22 performance compared q q Q 10 (customers, parts shipment problems) joins: n NATION n CUSTOMER Exactly the Nest 3 structure ! n ORDERS n LINEITEM Q 22 (countries, customers of which have no orders and good balance), occasional join: n CUSTOMER n ORDERS 45

Experiment Results n Performance of queries varies under different schemas (Unnest/Nest 2/Nest 3) 46

Analysis of Results n Nest 3 – better for Q 10, worst for Q 22: q q n Q 10: Saves joins Q 22: Needless reading of ORDERS & LINEITEM information for each CUSTOMER Nest 2 should be better for Q 10: q q XML index used to join ORDERS, LINEITEM in Unnest XML index performs well same results for Unnest 47

XML Index Benefits n TPC-H Q 5 used for XML Index performance comparison q q Joins 6 out 8 entities Uses all 6 of possible equi-join predicates 48

XML Index Benefits (contd. ) n n n Nest 2 structure saves expensive join in Hash. Join Carefully chosen index is better performance How you select what indexes to build ? – it is an open research problem [XIST: An XML Index Selection Tool] 49

Conclusion n n ROX prototype shows that it is possible to integrate XQuery and SQL queries. However work is still required to make it more efficient System RX is more mature – provides better efficiency and achieves the same goal It seems that ROX architecture is a natural evolution path of System RX However, my opinion is that economic factors will make System RX retain full relational support for quite a long time System RX has many things to improve in its XQuery processing, and it will 50