Designing Functional Dependencies For XML Mong Li LEE

  • Slides: 41
Download presentation
Designing Functional Dependencies For XML Mong Li LEE, Tok Wang LING, Wai Lup LOW

Designing Functional Dependencies For XML Mong Li LEE, Tok Wang LING, Wai Lup LOW EDBT 2002

Contents 1. 2. 3. 4. 5. 6. 7. Introduction FDs for XML : FDXML

Contents 1. 2. 3. 4. 5. 6. 7. Introduction FDs for XML : FDXML Replication cost model using FDXML Verification of FDXML Performance Studies Conclusion Q&A 2

Introduction

Introduction

Introduction l l l XML - Extensible Markup Language Simplified descendant of Standard Generalized

Introduction l l l XML - Extensible Markup Language Simplified descendant of Standard Generalized Markup Language (SGML) Used for information interchange over the Web – – l l Presentation-Oriented Publishing (POP) Message-Oriented Middleware (MOM) New view of XML : Data model Why is XML suitable as a data model ? – – Data semantics Data independence 4

Introduction Motivation l l l Projects have suppliers who supply them with a quantity

Introduction Motivation l l l Projects have suppliers who supply them with a quantity of parts at a certain price. Each project is identified by a JName. Each supplier is identified by a SName. Each part is identified by a Part. No. Constraint : Supplier must supply a part at the same price regardless of projects. JName, SName, Part. No Qty SName, Part. No Price 5

Introduction Motivation l l Use XML to model the Project-Supplier-Part database Additional requirements: –

Introduction Motivation l l Use XML to model the Project-Supplier-Part database Additional requirements: – – l Preserve natural inherent hierarchical structure. Order of nesting : Project, Supplier, Part Possible solutions. . . 6

Introduction Solution 1 JSP Project @JName Supplier ‘Road Works’ S P @Pid Qty @Pid

Introduction Solution 1 JSP Project @JName Supplier ‘Road Works’ S P @Pid Qty @Pid ‘ 500’ l l l S S @Sid P P Qty @Pid ‘ 200’ Supplier @SName @JName ‘Garden’ @Sid Project Qty @ denotes attributes @Sid is a reference to a Supplier element. @Pid is a reference to a Part Element. @Pid ‘ABC Part Trading’ @Part. No Qty Price @Part. No ‘P 789’ ‘DEF Pte Ltd’ Part Price @Part. No ‘ 80’ ‘P 123’ ‘ 10’ ‘P 123’ Price ‘ 12’ ‘ 1000’ ‘ 50000’ l l Normalized. No (little) redundancy. Extensive use of references, pointing relationships. Model not natural. Difficult to understand. Less efficient from query processing point of view. 7

Introduction Solution 2 JSP Supplier @SName ‘ABC Trading’ @Part. No ‘P 123’ Part @Part.

Introduction Solution 2 JSP Supplier @SName ‘ABC Trading’ @Part. No ‘P 123’ Part @Part. No Price ‘ 10’ Project @JName ‘Garden’ l l ‘P 789’ Qty ‘ 200’ Part ‘DEF Pte Ltd’ @Part. No Price ‘ 80’ @JName Project Qty ‘P 123’ Project @JName Price Qty ‘Garden’ ‘ 500’ ‘Road ‘ 50000’ Works’ ‘ 12’ @JName ‘Road Works’ Project Qty ‘ 1000’ A good solution with clear semantics. But requires re-ordering of elements (i. e. from Project, Supplier, Part to Supplier, Part, Project. But this is not what the user wants. 8

Introduction Solution 3 JSP Project @JName ‘Garden’ ‘Road Works’ Supplier ‘ABC Trading’ ‘P 789’

Introduction Solution 3 JSP Project @JName ‘Garden’ ‘Road Works’ Supplier ‘ABC Trading’ ‘P 789’ Supplier @SName @Part. No Project @JName Part Price Qty @Part. No Price ‘ABC Trading’ Qty @Part. No @SName Part Price ‘DEF Pte Ltd’ Qty @Part. No Part Price ‘P 789’ ‘P 123’ ‘ 12’ ‘ 10’‘ 50000’ ‘ 80’ ‘ 500’ ‘P 123’ ‘ 10’ ‘ 200’ l Ordering (Project, Supplier, Part) is maintained. l De-normalized. Controlled redundancy. l Containment (Parent-Child) relationships. l Natural model. Easy to understand. l More efficient from processing point of view (compared to Sol 1). BUT ö Data redundancy. Possible data inconsistency. ö How do we know that Sname, Part. No Price ? Qty ‘ 1000’ 9

FDXML

FDXML

FDXML Functional Dependency in Relational Databases l Let r be a relation on scheme

FDXML Functional Dependency in Relational Databases l Let r be a relation on scheme R. X and Y subsets of attributes in R. l l Relation r satisfies the FD X Y if for every XValue x, Y( X=x(r)) has at most one tuple. E. g. SName, Part. No Price This definition is defined for flat tables. How can we extend it for the hierarchical structure of XML databases? 11

FDXML Functional Dependency for XML l An XML functional dependency, FDXML: (Q, [ Pxi

FDXML Functional Dependency for XML l An XML functional dependency, FDXML: (Q, [ Pxi , . . . , Pxn Py ]) where – Q is the FDXML header path, a fully qualified path expression (i. e. the expression starts from the root) – Each Pxi is a LHS entity type ( which consists of an element name in the XML document, and the optional key attibute(s) ). – Py is a RHS entity type ( which consists of an element name in the XML document, and an optional attribute name ). – For any 2 instance subtrees identified by Q, if all LHS entities agree on their values, they must also agree on the value of the RHS entity, if it exists. 12

FDXML Example FDXML JSP Project @JName ‘Garden’ ‘Road Works’ Supplier Part @Part. No Price

FDXML Example FDXML JSP Project @JName ‘Garden’ ‘Road Works’ Supplier Part @Part. No Price Qty @Part. No Price ‘P 789’ Supplier @SName ‘ABC Trading’ Project @JName ‘ 80’ ‘ 500’ ‘P 123’ ‘ABC Trading’ Qty ‘ 10’ ‘ 200’ @Part. No ‘P 789’ Part Price ‘DEF Pte Ltd’ Qty @Part. No ‘ 10’ ‘ 50000’ ‘P 123’ Part Price Qty ‘ 12’ ‘ 1000’ ( /JSP/Project , [ Supplier , Part Price ] ) 13

FDXML Different Notations for FDXML Show identifier of elements ( /JSP/Project , [ Supplier

FDXML Different Notations for FDXML Show identifier of elements ( /JSP/Project , [ Supplier {SName} , Part {Part. No} Price ] ) ( /JSP/Project , [ Supplier , Basic Notation Part Price ] ) Header path is implied ( [ Supplier , Part Price ] ) 14

FDXML Distributing FDXML l l Can make use of existing XML tools if FDXML

FDXML Distributing FDXML l l Can make use of existing XML tools if FDXML is expressed in XML too. Need a DTD to facilitate distribution of FDXMLs <!ELEMENT Constraints (Fd*)> <!ELEMENT Fd (Header. Path, LHS+, RHS)> <!ATTLIST Fd Fid ID #REQUIRED> <!ELEMENT LHS (Element. Name, Attribute*)> <!ELEMENT RHS (Element. Name, Attribute*)> <!ELEMENT Header. Path (#PCDATA)> <!ELEMENT Element. Name (#PCDATA)> <!ELEMENT Attribute (#PCDATA)> l Can be easily translated to its XML Schema equivalent. 15

FDXML Distributing FDXML l DTD for the running Project-Supplier-Part database. <!ELEMENT JSP (Project)*> <!ELEMENT

FDXML Distributing FDXML l DTD for the running Project-Supplier-Part database. <!ELEMENT JSP (Project)*> <!ELEMENT Project (Supplier*)> <!ELEMENT Supplier (Part*)> <!ELEMENT Part (Price? , Quantity? )> <!ATTLIST Project JName IDREF REQUIRED> <!ATTLIST Supplier SName IDREF #REQUIRED> <!ATTLIST Part. No IDREF #REQUIRED> <!ELEMENT Price (#PCDATA)> <!ELEMENT Quantity (#PCDATA)> 16

FDXML Distributing FDXML l FDXML for the Project-Supplier-Part XML database. Conceptual Notation ( /JSP/Project

FDXML Distributing FDXML l FDXML for the Project-Supplier-Part XML database. Conceptual Notation ( /JSP/Project , [ Supplier , Part Price ] ) DTD for FDXML <!ELEMENT Constraints (Fd*)> <!ELEMENT Fd (Header. Path, LHS+, RHS)> <!ATTLIST Fd Fid ID #REQUIRED> <!ELEMENT LHS (Element. Name, Attribute*)> <!ELEMENT RHS (Element. Name, Attribute*)> <!ELEMENT Header. Path (#PCDATA)> <!ELEMENT Element. Name (#PCDATA)> <!ELEMENT Attribute (#PCDATA)> FDXML Instance <Constraints> <Fd Fid="SP_Price_FD"> <Header. Path>/JSP/Project</Header. Path> <LHS> <Element. Name>Supplier</Element. Name> <Attribute>SName</Attribute> </LHS> <Element. Name>Part</Element. Name> <Attribute>Part. No</Attribute> </LHS> <RHS> <Element. Name>Price</Element. Name> </RHS> </Fd> </Constraints> 17

Replication Cost Model for FDXML

Replication Cost Model for FDXML

Replication Cost Model for FDXML l Data replication is sometimes unavoidable (or even desirable!)

Replication Cost Model for FDXML l Data replication is sometimes unavoidable (or even desirable!) – l Measure the degree of replication – l Provided it does not get out of hand. Gauge if it is worth the increased effort for checking consistency, and the increased risk of data inconsistency. We need a replication cost model. 19

Replication Cost Model for FDXML Definitions Full FDXML A full FDXML is one which

Replication Cost Model for FDXML Definitions Full FDXML A full FDXML is one which the LHS entity types are minimal, that is, no redundant LHS entity types. Lineage A set of nodes, L, in a tree is a lineage if: 1. There is a node N in L such that all the nodes in the set are ancestors of N, and 2. For every node M in L, if L contains an ancestor of M, it also contains the parent of M. * Informal definition : “a straight and unbroken line of elements" 20

Replication Cost Model for FDXML Definitions Well-structured FDXML Consider the DTD : <!ELEMENT H

Replication Cost Model for FDXML Definitions Well-structured FDXML Consider the DTD : <!ELEMENT H 1 (H 2 *)> … <!ELEMENT Hm (P 1*)> … <!ELEMENT Pk (Pk+1*)> The FDXML, F =(Q, [P 1, … , Pk Pk+1]), where Q = /H 1/…/Hm, holds on this DTD. F is well-structured if : 1. there is a single RHS entity type (i. e. Pk+1). 2. the ordered XML elements in Q (i. e. H 1, …, Hm), LHS entity types (i. e. P 1, …, Pk) and RHS entity type (i. e. Pk+1), in that order, form a lineage. 3. The LHS entity types are minimal (i. e. no redundant LHS entity types). 21

Replication Cost Model for FDXML Definitions (last one!) Context Cardinality The context cardinality of

Replication Cost Model for FDXML Definitions (last one!) Context Cardinality The context cardinality of XML element X to XML element Y is the number of times Y can participate in a relationship with X in the context of X’s entire ancestry in the XML document. Denoted as: where D is the schema on which this context cardinality is defined, and Q is the header path of X. JSP (Document root) Supplier Project X In ERD 1: M Part (Participation Constraint) Supplier Y Part Traditional Cardinality “The number of parts a supplier can supply to a project ” Project 1: N Context Cardinality Part 22

Replication Cost Model for FDXML Replication Cost Model H 1 H 2 Suppose we

Replication Cost Model for FDXML Replication Cost Model H 1 H 2 Suppose we have the following wellstructured FDXML and it holds on DTD D. Hm-1 Hm The model for the replication factor is P 1 Pk Pk+1 23

Replication Cost Model for FDXML Using the Cost Model F = ( /JSP/Project, [Supplier,

Replication Cost Model for FDXML Using the Cost Model F = ( /JSP/Project, [Supplier, Part Price]) JSP Project (Max. no. of Projects under /JSP) (Max. no. of 20 Supplier projects a What if each supplier is now constrained to supply to at most 20 projects? supplier can supply to, in the context of /JSP) Part Price 24

Replication Cost Model for FDXML Design insights from Cost Model l Length of FDXML

Replication Cost Model for FDXML Design insights from Cost Model l Length of FDXML header path, Q, should be as short as possible. l Minimize value of 2 nd parameter of RF(F). – l If there are several acceptable designs, choose the one with the smallest value for the 2 nd parameter of RF(F). Use model to gauge extra storage requirements due to replication. 25

Verification of FDXML

Verification of FDXML

Verification of FDXML Scenario FDXML Specifications XML Database Distribution XML Database FDXML Specifications Verification

Verification of FDXML Scenario FDXML Specifications XML Database Distribution XML Database FDXML Specifications Verification Process Verification Results 27

Verification of FDXML Verification Process State Variables FDXML Specifications Context XML Parser information XML

Verification of FDXML Verification Process State Variables FDXML Specifications Context XML Parser information XML Database l Only a single pass through the database is required. Hash structure (with LHS values as hash keys) Set-up using information from FDXML 28

Verification of FDXML Running the verification process 29

Verification of FDXML Running the verification process 29

Performance Studies

Performance Studies

Performance Studies Dataset l l l DBLP – a widely-used, large XML bibliographical database.

Performance Studies Dataset l l l DBLP – a widely-used, large XML bibliographical database. 80, 000 journal records Check dependency Journal, Volume Year A sample DBLP journal record <article key="journals/is/Hofstede. V 97"> <author>A. H. M. ter Hofstede</author> <author>T. F. Verhoef</author> <title>On the Feasibility of Situational Method Engineering. </title> <pages>401 -422</pages> <year>1997</year> <volume>22</volume> <journal>IS</journal> <number>6/7</number> <url>db/journals/is/is 22. html#Hofstede. V 97</url> </article> 31

Performance Studies DOM vs. SAX l Document Object Model (DOM) – l Simple API

Performance Studies DOM vs. SAX l Document Object Model (DOM) – l Simple API for XML (SAX) – l l l Builds in-memory tree of nodes. Event-driven parsing DOM requires too much memory for large datasets. By maintaining simple context information, we do not need the whole database to be in memory. SAX parsing is more suitable for our verification technique. 32

Performance Studies DOM vs. SAX Out of memory error • Experiments done on P

Performance Studies DOM vs. SAX Out of memory error • Experiments done on P 3 700 MHz machine (128 MB RAM) running Win. NT 4. 0 33

Performance Studies Memory requirements l Hash structure for efficient access. l How much memory

Performance Studies Memory requirements l Hash structure for efficient access. l How much memory does the hash structure (with LHS values as hash keys) take? l Affects the feasibility of incremental checking. 34

Performance Studies Memory requirements 2960 No. of entries in the hash table No. of

Performance Studies Memory requirements 2960 No. of entries in the hash table No. of “errors” • • • 149 Experiments done on P 3 700 MHz machine (128 MB RAM) running Win. NT 4. 0. A SAX-based parser is used to parse the XML data. FDXML verification does not take up much memory and scales up well. 35

Conclusion

Conclusion

Conclusion Contributions l Representation for FDs in XML databases. l Replication cost model based

Conclusion Contributions l Representation for FDs in XML databases. l Replication cost model based on FDXML. l FDXML verification. l A framework for FDXML use and deployment. 37

Conclusion Future work l Inference rules for FDXML. l Incremental FDXML checking for XML

Conclusion Future work l Inference rules for FDXML. l Incremental FDXML checking for XML updates. l Integration of FDXML with next generation XML DBMS. Mining FDXML from XML databases. MVDXML l l 38

Conclusion Everything in ONE slide l To make XML a data model FDXML l

Conclusion Everything in ONE slide l To make XML a data model FDXML l l l To distribute/disseminate the known FD constraints Schema for FDXML Is redundancy in the XML database controlled? Replication cost model To verify FDXML efficiently A single-pass hash-based technique 39

References l l l P. Buneman, S. Davidson, W. Fan, C Hara, WC Tan.

References l l l P. Buneman, S. Davidson, W. Fan, C Hara, WC Tan. Keys for XML. In Proceedings of WWW’ 10, Hong Kong, China 2001. TW Ling, CH Goh, ML Lee. Extending classical functional dependencies for physical database design. Information and Software Technology, 9(38): 601 -608, 1996. Jennifer Widom. Data Management for XML: Research Directions. IEEE Data Engineering Bulletin, 22(3): 44 -52, 1999 XY Wu, TW Ling, ML Lee, G Dobbie. Designing Semistructured Databases Using the ORA-SS Model. In Proceedings of the 2 nd International Conf on Web Information Systems Engineering (WISE). IEEE Computer Society, 2001. Michael Ley. DBLP Bibliography. 40

Q&A

Q&A