XML Research Issues in Database Perspective Kyuseok Shim

Kyuseok Shim
shim@cs.kaist.ac.kr
http://cs.kaist.ac.kr/~shim
Korea Advanced Institute of Science and Technology
Saturday, October 28 2000, KISS '00 Fall

XML Working Groups
• Core XML
  – XML, namespaces, XML Infoset
• XML Linking
  – XPath, XPointer, XLink
• XML Schema
  – XML Schema
• XML Query
  – XML Query, XML Query Data Model
• Document Object Model (DOM)
• XSL

XML
• A W3C standard to complement HTML
• An instance of semistructured data [Abi97]
  – Document Type Definition (DTD)
• Origin: SGML
• Tags describe the semantics of the data
  – HTML tags simply specify how a data item is to be displayed
• An element can contain a sequence of nested sub-elements
• Sub-elements may themselves be tagged elements or character data

Document Type Definition (DTD)
• Part of the XML specification
• An XML document may have a DTD
• A grammar for describing the structure of an XML document
• The structure of an element is specified by a regular expression
• Terminology for XML
  – well-formed: if tags are correctly closed
  – valid: if it has a DTD and conforms to it
• For exchanges of data, validation is useful

Document Type Definition (DTD)
• Syntax
  – comma: sequence
  – |: or
  – (): grouping
  – ?, *, +: zero or one, zero or more, one or more occurrences
  – ANY: allows an arbitrary XML fragment to be nested within the element
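
The operators above map almost directly onto ordinary regular expressions over child-element names. The following is a minimal sketch, not part of the original slides; the helpers `content_model_to_regex` and `conforms` are hypothetical names, and the sketch assumes a content model built only from element names and the operators listed above (no #PCDATA, ANY, or EMPTY).

```python
import re

def content_model_to_regex(model: str) -> re.Pattern:
    """Translate a simple DTD content model into a regex over child names.

    Assumption: the model contains only element names and , | ( ) ? * +.
    Each child name is encoded as 'name;' so adjacent names cannot run together.
    """
    # Wrap every element name in a non-capturing group terminated by ';'.
    pattern = re.sub(r"[A-Za-z_][\w.-]*", lambda m: f"(?:{m.group(0)};)", model)
    # A comma means sequence, which is plain concatenation in a regex.
    pattern = pattern.replace(",", "").replace(" ", "")
    return re.compile(pattern)

def conforms(children: list[str], model: str) -> bool:
    """Check whether a sequence of child element names matches the model."""
    encoded = "".join(name + ";" for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None

print(conforms(["booktitle", "author", "author"], "(booktitle, author*)"))
print(conforms(["name", "address", "affiliation"], "(name, (address | affiliation))"))
```

The first call checks a sequence that fits `(booktitle, author*)`; the second offers both `address` and `affiliation`, which the `|` operator does not allow.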

A DTD Example
<!ENTITY USA "United States of America">
<!ELEMENT book (booktitle, author*)>
<!ATTLIST book id ID #IMPLIED>
<!ELEMENT booktitle (#PCDATA)>
<!ELEMENT author (name, (address | affiliation))>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address ANY>
<!ELEMENT affiliation (#PCDATA)>

An XML Document Example
<book id="123">
  <booktitle> The Selfish Gene </booktitle>
  <author id="dawkins">
    <name> Richard Dawkins </name>
    <address> <city> Timbuktu </city> <zip> 99999 </zip> </address>
  </author>
</book>
<book>
  <booktitle> The C Programming Language </booktitle>
  <author>
    <name> Brian W. Kernighan </name>
    <address> <country> &USA; </country> </address>
  </author>
  <author>
    <name> Dennis M. Ritchie </name>
    <affiliation> Bell Labs </affiliation>
  </author>
</book>

An XML Namespace
• Provides a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references.
• Is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names.
<x xmlns:edi='http://ecommerce.org/schema'>
  <!-- the 'price' element's namespace is http://ecommerce.org/schema -->
  <edi:price units='Euro'>32.18</edi:price>
</x>
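
To see how a parser resolves the `edi` prefix in the fragment above, the following sketch uses Python's standard `xml.etree.ElementTree` module. It is an illustration only; the choice of parser is an assumption and not something the slides prescribe.

```python
import xml.etree.ElementTree as ET

doc = """<x xmlns:edi='http://ecommerce.org/schema'>
  <edi:price units='Euro'>32.18</edi:price>
</x>"""

root = ET.fromstring(doc)
price = root[0]

# ElementTree expands the prefix into '{namespace-URI}local-name'.
print(price.tag)

# Lookups go through a prefix map; the prefix itself is only a local alias.
ns = {"edi": "http://ecommerce.org/schema"}
found = root.find("edi:price", ns)
print(found.text, found.get("units"))
```

The prefix never appears in the parsed tree; only the URI identifies the namespace, so two documents may use different prefixes for the same names.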

XML Schemas
• Recently proposed
  – http://www.w3c.org/TR/xmlschema-1
  – http://www.w3c.org/TR/xmlschema-2
• Unifies previous schema proposals
• Generalizes DTDs
• Uses XML syntax

XML Schema
<elementType name="article">
  <sequence>
    <elementTypeRef name="title"/>
    <elementTypeRef name="author" minOccurs="0"/>
  </sequence>
</elementType>
DTD: <!ELEMENT article (title, author*)>

XTRACT: Extracting DTD from XML Documents
• [Garofalakis, Gionis, Rastogi, Seshadri, Shim 99]
• DTDs contain valuable information on the structure of the documents
  – they play a critical role in storage as well as in the formulation and optimization of queries
• DTDs are not mandatory
  – an XML database frequently has no accompanying DTDs
• XTRACT can infer concise and semantically meaningful DTDs for XML documents

XTRACT: Motivation
• DTD is very useful!
  – Plays a crucial role in efficient storage of XML data
    • [SHT+99], [DFS99]: the DTD is exploited to generate an effective relational schema
  – Helps devise efficient plans for queries
    • [GW97], [FS97]: the DTD allows the search to be restricted to only the relevant portions of the data
  – Helps users form meaningful queries over the XML database
• However, an XML document may not always have an accompanying DTD

XTRACT: Related Work
• Mining DTDs from a collection of XML documents has not been addressed in the literature
• Extraction of schema from semistructured data
  – [NAM98, GW97, FS97]
    • attempts to find a typing for semistructured data
    • finding a typing is tantamount to grouping objects that have similar edges
    • In a DTD, outgoing edges from a type can be described by an arbitrary regular expression
    • No ordering is imposed on edges

XTRACT: Related Work
• [Gol67, Gol78, Ang78]
  – Infer formal languages from examples
  – Purely theoretical; focus on investigating the computational complexity of the language inference problem
• [KMU95]
  – Infers a pattern language from positive examples
  – MDL principle was used
  – Assumes the set of simple patterns is available
  – Cannot find general regular expressions
  – Patterns are not known a priori

XTRACT: Problem Formulation
• Given a set I of N input sequences nested within element e
• Compute a DTD for e such that every sequence in I conforms to the DTD

XTRACT: Naive Approaches
• Factor as much as possible
  – e.g. t, taa, taaaa
    • t | t (a| a(a | aa)))
    • much more voluminous and a lot less intuitive
• Find the automaton with the smallest number of states that accepts I and derive regular expressions from the automaton
  – may not be the shortest regular expression

XTRACT: Desirable DTDs
• The DTD should be concise (i.e. small in size)
  – easy to understand, succinct
• The DTD should be precise
  – does not cover too many sequences not contained in I
  – not too general, and captures the structure of the input sequences
Trade-off!

XTRACT: Example
I = {ab, abab, ababab}
• (a | b)*
  – a gross over-generalization of the input
  – completely fails to capture any structure inherent in the input
• ab | abab | ababab, ab | ab(ab | abab)
  – accurately reflect the structure of the input sequences but do not generalize
• (ab)*
  – succinct, and generalizes the input sequences without losing too much structure information

XTRACT: MDL Principle
• An information-theoretic measure for quantifying, and thereby resolving, the tradeoff between conciseness and preciseness
• The MDL principle has been successfully applied in a variety of situations
  – e.g. decision tree classifiers
• Roughly speaking, the best theory to infer from a set of data is the one that minimizes the sum of
  – the length of the theory, in bits (conciseness)
  – the length of the data, in bits, when encoded with the help of the theory (preciseness)

XTRACT: Example
I = {ab, abab, ababab}
• (a | b)*
  – abab: cost of 5 (1 character for the number of repetitions (4) + 4 characters for the chosen characters)
  – MDL cost = 6 (encoding DTD) + 3 + 5 + 7 = 21
• ab | abab | ababab
  – MDL cost = 14 + 3 = 17
• ab | ab(ab | abab)
  – MDL cost = 14 + 1 + 2 + 2 = 19
• (ab)*
  – MDL cost = 5 + 3 = 8
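
The arithmetic on this slide can be reproduced with a small sketch. It is a simplification, not XTRACT's actual encoding: costs are counted in characters rather than bits, the three encoder functions are hypothetical and hand-written for these three DTD shapes only, and the input is assumed to be the three sequences ab, abab and ababab, which is what the per-sequence terms 3 + 5 + 7 imply.

```python
INPUT = ["ab", "abab", "ababab"]

def mdl_cost(dtd, sequences, encode):
    """Length of the DTD plus the length of each sequence encoded with its help."""
    theory = len(dtd.replace(" ", ""))
    data = sum(len(encode(seq)) for seq in sequences)
    return theory + data

def encode_any_star(seq):
    # (a|b)*: one symbol for the repetition count, then one symbol per chosen character.
    return [len(seq)] + list(seq)

def encode_enumeration(seq):
    # ab|abab|ababab: one symbol naming the alternative that matches.
    return [INPUT.index(seq)]

def encode_unit_star(seq):
    # (ab)*: one symbol for the number of repetitions of "ab".
    return [len(seq) // 2]

costs = {
    "(a|b)*": mdl_cost("(a|b)*", INPUT, encode_any_star),
    "ab|abab|ababab": mdl_cost("ab|abab|ababab", INPUT, encode_enumeration),
    "(ab)*": mdl_cost("(ab)*", INPUT, encode_unit_star),
}
best = min(costs, key=costs.get)
print(costs, best)
```

Under these assumptions the three candidates cost 21, 17 and 8, so `(ab)*` is selected: it is both short as a theory and cheap for encoding the data.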

XTRACT
• Generalization
  – generalizes zero or more candidate DTDs by replacing patterns in the input sequence with metacharacters like *
  – e.g. abab => (ab)*, bbbe => b*e
• Factorization
  – factors common subexpressions from the generalized candidate DTDs
  – e.g. b*d | b*e => b* (d | e)
• Minimum Description Length (MDL) Principle
  – MDL ranks each candidate DTD and chooses the minimum cost DTD
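
The two rewriting steps can be illustrated with a deliberately naive sketch. This is not the XTRACT algorithm: the hypothetical `generalize` only collapses immediately repeated substrings, and the hypothetical `factor` only pulls out a common character prefix, so it could split a parenthesized group in less trivial inputs.

```python
import os

def generalize(seq: str) -> str:
    """Replace each run of two or more consecutive copies of a unit with unit*."""
    out, i = [], 0
    while i < len(seq):
        replaced = False
        # Try the shortest repeating unit first.
        for width in range(1, (len(seq) - i) // 2 + 1):
            unit = seq[i:i + width]
            reps = 1
            while seq.startswith(unit, i + reps * width):
                reps += 1
            if reps >= 2:
                out.append(f"{unit}*" if width == 1 else f"({unit})*")
                i += reps * width
                replaced = True
                break
        if not replaced:
            out.append(seq[i])
            i += 1
    return "".join(out)

def factor(alternatives: list[str]) -> str:
    """Factor a common prefix out of a list of alternatives."""
    prefix = os.path.commonprefix(alternatives)
    rest = [alt[len(prefix):] for alt in alternatives]
    if not prefix or not all(rest):
        return "|".join(alternatives)
    return f"{prefix}({'|'.join(rest)})"

print(generalize("abab"), generalize("bbbe"))
print(factor(["b*d", "b*e"]))
```

On the slide's examples the sketch turns abab into (ab)*, bbbe into b*e, and the pair b*d, b*e into b*(d|e). Choosing among the resulting candidates is then left to the MDL ranking.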

XTRACT: Example

XML Storage
• Existing approaches unnecessarily sacrifice either efficiency or flexibility
  – Traditional DBMSs (RDB or OODB) have rigid schemas.
    • Integrating a new site requires complex mapping and risks loss of information
    • Integrating a new site may require schema evolution.
  – Existing fully semi-structured data storage techniques sacrifice query efficiency and space.
    • they require excessive interpretation (harming query efficiency) and
    • redundant storage

XML Storage
• Need to store and query XML data flexibly and efficiently
  – improve the tradeoff between storage space and query efficiency for a given degree of flexibility
  – allow the user to choose the degree of storage flexibility

XML Storage
• text file
• relational DBMS
• object-oriented DBMS
• build special purpose repository

XML Storage: Text File
• To store the flat streams, a file system or a BLOB manager in a DBMS is used
  – e.g. [Abiteboul, Cluet, Milo: VLDB'93]
• Pros
  – simple
  – fast for storing and retrieving whole documents
  – takes less space than one might think
  – reasonable clustering
• Cons
  – incremental update is difficult
  – requires a special purpose query processor
  – accessing a document's structure is only possible through parsing

XML Storage: Relational DBMS
• Advantages
  – RDBMS products are mature and scale well
  – Traditional and semi-structured data can co-exist
  – An RDBMS can process even complex queries on large databases within seconds
• Disadvantages
  – expensive to reconstruct the original XML data from relational data
  – updates are both complicated and expensive in certain cases
  – extra effort is needed to translate XML queries and updates into SQL

XML Storage: RDBMS (1)
• [Florescu, Kossmann: IEEE Data Eng. Bulletin 99]

XML Storage: RDBMS (2)
• [Shanmugasundaram et al. 99]
• process the DTD to generate a relational schema
  – Use the DTD graph and the element graph
• three approaches
  – Basic
  – Shared
  – Hybrid

DTD

XML Document

The Basic Inline Technique
• Creates relations for every element
  – an XML document can be rooted at any element in a DTD
  – the element graph is used to decide the relations
• Inlines as many descendants as possible
  – e.g. the author relation has attributes firstname, lastname, address and authorid
• Creates a separate relation to handle "*" in the DTD graph using a foreign key
• Expresses the recursive relationship using the notion of relational keys

Building an Element Graph
• Do a depth first traversal of the DTD graph starting at the element node
• Each node is marked as "visited" the first time it is reached
• Each node is unmarked once all of its children have been traversed
• If an unmarked node in the DTD graph is reached, a new node with the same name is created in the element graph
• If an attempt is made to traverse a marked DTD node, a backpointer edge is added
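
The traversal described above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the DTD graph is assumed to be a plain dictionary from element name to child names, the hypothetical function `build_element_graph` ignores operators such as "*", and the sample DTD graph is invented for the example.

```python
from itertools import count

def build_element_graph(dtd_graph: dict[str, list[str]], root: str):
    ids = count()
    nodes = {}        # element-graph node id -> element name
    edges = []        # regular edges (parent id, child id)
    back_edges = []   # backpointer edges (from id, to id)
    marked = {}       # currently marked DTD node -> its element-graph node id

    def visit(name, parent_id):
        if name in marked:
            # Marked node reached again: add a backpointer and do not descend.
            back_edges.append((parent_id, marked[name]))
            return
        node_id = next(ids)            # unmarked: create a fresh element-graph node
        nodes[node_id] = name
        if parent_id is not None:
            edges.append((parent_id, node_id))
        marked[name] = node_id         # mark on first visit
        for child in dtd_graph.get(name, []):
            visit(child, node_id)
        del marked[name]               # unmark once all children are traversed

    visit(root, None)
    return nodes, edges, back_edges

# Hypothetical DTD graph with a monograph/editor cycle.
dtd = {
    "monograph": ["title", "author", "editor"],
    "author": ["name"],
    "editor": ["monograph"],
}
nodes, edges, back_edges = build_element_graph(dtd, "monograph")
print(nodes, edges, back_edges)
```

Because a node is unmarked after its subtree is finished, an element reachable along two different paths is duplicated in the element graph, while a genuine cycle produces a single backpointer edge.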

DTD Graph

An Example Element Graph

Creation of Relations
• Given an element graph, relations are created as follows:
  – A relation is produced for the root element
  – All descendant elements are inlined into that relation except
    • children directly below a "*" node
    • each node having a backpointer edge pointing to it
  – A separate relation is created for each of the above exception nodes
  – Each relation has ID and parentID fields

Basic Inline Schema

Basic Inline Technique
• Pros
  – Good for queries such as "List all authors of books"
• Cons
  – "List all authors having first name Jack" requires 5 separate queries
  – A large number of relations are created

Shared Inline Technique
• Relations are created for all elements in the DTD graph whose nodes have in-degree greater than one
• Nodes with an in-degree of one are inlined
• Nodes with an in-degree of zero are made separate relations
• Of mutually recursive elements all having in-degree one, one of them is made a separate relation
  – e.g. monograph and editor
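
The in-degree rules above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the DTD graph is a dictionary from element name to child names, the hypothetical function `shared_inline_plan` covers only the three in-degree rules, and the special case for mutually recursive elements is left out.

```python
from collections import Counter

def shared_inline_plan(dtd_graph: dict[str, list[str]]) -> dict[str, str]:
    """Decide per element whether it gets its own relation or is inlined."""
    indegree = Counter()
    for parent, children in dtd_graph.items():
        indegree[parent] += 0              # make sure roots appear with in-degree 0
        for child in set(children):
            indegree[child] += 1
    # In-degree 1 is inlined into the parent; in-degree 0 or >1 becomes a relation.
    return {name: ("inline" if deg == 1 else "relation")
            for name, deg in indegree.items()}

# Hypothetical DTD graph: author is shared by book and article.
dtd = {
    "book": ["booktitle", "author"],
    "article": ["title", "author"],
    "author": ["name"],
}
plan = shared_inline_plan(dtd)
print(plan)
```

Here `author` is shared by two parents and therefore stored once in its own relation, which is what lets a single query find all authors.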

Shared Inlining Technique
• Small number of relations compared to the Basic schema
• Uses an isRoot field for inlining problems
• Requires only one query for finding all authors
• Still, Basic is superior for reducing the number of joins

Hybrid Inlining Technique
• Additionally inlines elements with in-degree greater than one that are not recursive or reached through a "*" node
  – e.g. author is inlined with book and monograph

XML Storage: STORED
• [Deutsch, Fernandez, Suciu: SIGMOD'99]
• Maps semistructured data into relational data
• Integrates both relational and overflow systems
• Uses a data mining algorithm to find frequent subtrees
  – because there is no notion of a DTD in semistructured data
• An overflow mapping is used to ensure the mapping is lossless
  – overflow objects or object parts are stored in a separate semistructured data object repository
• Incremental updates and ordering of elements are not considered

XML Storage: STORED
• Derive schema from data with data mining algorithm

XML Storage: OODBMS
• Stores XML elements with their structural semantics
• Flexible locking down to the element level
  – In an RDBMS, because XML data is disassembled into various tables, implementing an effective locking scheme is hard
  – With a flat file, no portion of a document being modified is available to other users
• Uses a separate record for each tree node
• Systems available
  – POET (POET Content Management system)
  – eXcelon (Object Design)
  – LORE

XML Storage: NATIX
• [Kanne, Moerkotte: ICDE'00]
• Native repository
• Classical record manager
  – Accesses raw disk or file system files
  – Provides a memory space divided into segments (equal sized pages)
• Tree storage manager
  – maps the trees used to model documents
• Schema manager
  – maintains the system catalog data (e.g. DTD)
  – the system catalog is stored in XML format

NATIX
• Store the whole document in one record, instead of storing each tree node in a separate record
• Semantically split a large tree based on the underlying tree structure
• Partition the data into subtrees and store each subtree in a record
• Connected subtrees residing in other records are represented by proxy objects
  – proxy objects consist of a RID
  – substituting all proxies by the respective subtrees reconstructs the original data tree
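
The proxy substitution in the last bullet can be sketched as a recursive expansion. The record layout below is purely an assumption for illustration (nested dictionaries keyed by a hypothetical RID string); it does not reflect NATIX's physical format.

```python
def reconstruct(rid: str, records: dict) -> dict:
    """Rebuild the full tree by replacing every proxy with the subtree it points to."""
    def expand(node):
        if "proxy" in node:
            # A proxy holds only the RID of the record storing the subtree.
            return expand(records[node["proxy"]])
        return {"tag": node["tag"],
                "children": [expand(c) for c in node.get("children", [])]}
    return expand(records[rid])

# Hypothetical records: a book split into three subtrees.
records = {
    "r1": {"tag": "book", "children": [{"tag": "booktitle"}, {"proxy": "r2"}]},
    "r2": {"tag": "author", "children": [{"tag": "name"}, {"proxy": "r3"}]},
    "r3": {"tag": "address", "children": [{"tag": "city"}]},
}
tree = reconstruct("r1", records)
print(tree)
```

Reading a whole document costs one record fetch per subtree instead of one per node, which is the motivation for splitting along the tree structure.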

XML Query Processing
• [McHugh, Widom: Workshop 99]
  – Expand regular path expressions at compile time using a structural summary
  – Guarantee to visit, at run-time, a subset of the objects visited with the original path expression
  – e.g. Library.#
    • Proceedings.Conference.Paper
    • Books.Book
    • Movies.Movie.BasedOn

XML Query Processing
• [Fernandez, Suciu: ICDE 98]
  – Optimize regular path expressions
  – Restrict navigation to only a fragment of the data
  – Use state extents to eliminate and reduce navigation
• [McHugh, Widom: VLDB 99]
  – Propose a cost-based query optimizer
    • Transform a query into a logical query plan
    • Explore the space of possible physical plans
  – Introduce new types of indexes for efficient traversals through data graphs
  – Suggest an appropriate set of statistics and devise methods for computing and storing statistics

XML Query Processing
• [Christophides, Cluet, Simeon: SIGMOD 00]
• Propose an XML algebra
  – Captures the expressive power of semistructured or XML query languages
  – Can wrap more structured languages such as SQL or OQL
• New optimization techniques
  – Exploit type information
  – Push query evaluation to the external source

XML View of Relational Data
• [Fernandez, Tan, Suciu: WWW 00]
• Mediator system
• Automatically convert the relational data into XML
  – An XML view of the relational database is defined using a declarative query language
  – Some other application formulates a query over the virtual view
• Fully exploit the underlying RDBMS query engine
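
As a rough illustration of publishing relational data as XML while leaving joins and sorting to the database engine, the sketch below uses Python's `sqlite3` and `xml.etree.ElementTree`. Everything in it is an assumption made for the example: the two-table schema, the sample rows, and the hypothetical function `books_as_xml`. It is not the mediator system cited on the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE author (book_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO book VALUES (?, ?)",
                 [(1, "The Selfish Gene"), (2, "The C Programming Language")])
conn.executemany("INSERT INTO author VALUES (?, ?)",
                 [(1, "Richard Dawkins"),
                  (2, "Brian W. Kernighan"),
                  (2, "Dennis M. Ritchie")])

def books_as_xml(conn) -> ET.Element:
    """Build an XML view from one sorted join; the engine does the join and ordering."""
    root = ET.Element("books")
    current_id, book = None, None
    rows = conn.execute(
        "SELECT b.id, b.title, a.name FROM book b "
        "LEFT JOIN author a ON a.book_id = b.id ORDER BY b.id, a.name")
    for book_id, title, name in rows:
        if book_id != current_id:          # first row of a new book: open a new element
            book = ET.SubElement(root, "book", id=str(book_id))
            ET.SubElement(book, "booktitle").text = title
            current_id = book_id
        if name is not None:               # books without authors yield a NULL name
            ET.SubElement(book, "author").text = name
    return root

root = books_as_xml(conn)
print(ET.tostring(root, encoding="unicode"))
```

The client only tags rows in the order they arrive, so the expensive work stays inside the relational engine.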

XML View of Relational Data
• [Shanmugasundaram et al.: VLDB 00]
• Propose new scalar and aggregate functions in SQL to construct complex XML documents
• Explore different execution plans for generating the contents of XML documents
• Constructing the XML document inside the relational engine benefits performance the most
• Outer union plan

Metadata Management
• Generic data model
  – Not impossible, but unlikely
  – Proliferation of data models
  – No proof any one is superior
  – Semantics aren't fully captured in any data model

Metadata Management
• [Philip Bernstein: VLDB 00 Panel]
• Generality: the representation of metadata must apply to all application areas
• Usefulness: exploit application-specific semantics
• Is there an effective middle ground?
• Define generic high-level operations on models and mappings, e.g., Match, Merge, Select, Compose, …
  – Match(M1, M2, map), Merge(M1, M2, map), Compose(map1, map2)
• Implement operations on a DBMS

Metadata Management
[Figure: schemas rdb1, dtd1, dtd2 and rdb2 connected by mappings map1 to map4]
1. map2 = Match(dtd1, dtd2)
2. map3 = map1 ∘ map2
3. <map4, rdb2> = Copy(map3⁻¹)

Metadata Management: Clio
• [Miller, Haas, Hernandez: VLDB 00]
• Tool to support mapping between data representations
  – Mapping represented as SQL
  – Heterogeneous query middleware to examine data and schemas
• Build database competencies in query and schema management, data mining
  – Exploit user knowledge of target semantics
  – Enhance user knowledge of source schema and data
  – Provide knowledge of query subtleties, alternative mappings

Metadata Management: Clio
• User indicates which schema and data values are needed for the target
• Tool enumerates and ranks mappings
  – Many are possible, with subtle differences
  – The best mappings are simple but lose the least information possible
  – Allows immediate user feedback

Filtering XML Documents
• [Altinel, Franklin: VLDB 00]
• XFilter: provides highly efficient matching of XML documents to user profiles
• Event filtering system
• Highly scalable
• Uses XPath as a profile language
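
The filtering task itself, matching one incoming document against many stored XPath profiles, can be shown with a naive sketch. This is not how XFilter achieves its efficiency; it simply evaluates every profile in turn using the limited XPath subset of Python's `xml.etree.ElementTree`. The function `matching_users`, the user names and the profiles are all hypothetical.

```python
import xml.etree.ElementTree as ET

def matching_users(document: str, profiles: dict[str, str]) -> list[str]:
    """Return the users whose XPath profile selects at least one node."""
    root = ET.fromstring(document)
    return sorted(user for user, path in profiles.items()
                  if root.find(path) is not None)

doc = ("<book><booktitle>The C Programming Language</booktitle>"
       "<author><name>Dennis M. Ritchie</name>"
       "<affiliation>Bell Labs</affiliation></author></book>")

profiles = {
    "alice": "./author/affiliation",   # authors with an affiliation
    "bob": ".//zip",                   # any zip code anywhere
    "carol": ".//name",                # any name anywhere
}
matched = matching_users(doc, profiles)
print(matched)
```

The cost of this loop grows linearly with the number of profiles, which is exactly the scalability problem a dedicated filtering engine sets out to avoid.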

XML Data Compression
• [Liefke, Suciu: SIGMOD 00]
• Structure, consisting of tags and attributes, is compressed separately
• Group related data items and compress each related group separately
• Apply semantic compression
• Automatic data mining tools to cluster data need to be developed
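
The idea of separating structure from data and compressing related values together can be sketched as follows. The sketch is an illustration only: grouping values by tag name is an assumed clustering rule, attributes and whitespace are ignored so the result is not a lossless encoding, `zlib` stands in for whatever compressors a real system would choose, and `split` and `compress` are hypothetical names.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def split(xml_text: str):
    """Separate the tag structure from the text values, grouping values by tag."""
    structure, containers = [], defaultdict(list)

    def walk(elem):
        structure.append(elem.tag)          # opening tag
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
        for child in elem:
            walk(child)
        structure.append("/")               # closing tag marker

    walk(ET.fromstring(xml_text))
    return structure, dict(containers)

def compress(xml_text: str) -> dict[str, bytes]:
    """Compress the structure stream and each value container independently."""
    structure, containers = split(xml_text)
    packed = {"#structure": zlib.compress(" ".join(structure).encode())}
    for tag, values in containers.items():
        packed[tag] = zlib.compress("\n".join(values).encode())
    return packed

doc = ("<books><book><name>A</name><zip>1</zip></book>"
       "<book><name>B</name><zip>2</zip></book></books>")
structure, containers = split(doc)
packed = compress(doc)
print(structure, containers)
```

Values of the same kind end up adjacent in one container, which tends to give a general-purpose compressor more regularity to exploit than the interleaved original.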

Future Research Issues
• XML views of traditional databases
  – Relational database
  – Object-relational database
• XML Storage
  – Object-relational databases
  – Alternative storage methods
• Indexes for XML data
• XML query processing and optimization
  – Centralized and distributed processing

Future Research Issues
• Schema mapping
• Mixing structure search with full-text search
• XML-based mediators
• XML data compression

Summary
• XML poses many challenges to the database community
  – XML Storage Issues
  – XML Indexes
  – DTD Extraction
  – Query language
  – Query processing
  – Metadata Management

Biography
Kyuseok Shim is an Assistant Professor in the Computer Science Department at KAIST in Korea. He is also currently an Advisory Committee Member for ACM SIGKDD. Before joining KAIST, he was a member of technical staff (MTS) in the Database Systems Research Department at Bell Laboratories. While at Bell Laboratories, he started and worked on the Serendip data mining project and the eXcalibur XML storage project. Before joining Bell Laboratories, he worked on Rakesh Agrawal's Quest data mining project at IBM Almaden Research Center. He also worked with Surajit Chaudhuri as a summer intern for two summers at Hewlett Packard Laboratories. He received his B.S. degree in Electrical Engineering from Seoul National University, and his M.S. and Ph.D. degrees in Computer Science from the University of Maryland, College Park. Kyuseok Shim has been working in the area of databases, focusing on XML, data mining, data warehousing, query processing and query optimization. He has published more than 30 research papers in prestigious international conferences and journals. He has also served as a program committee member for several international conferences including ICDE'97, SIGKDD'98, SIGMOD'99, SIGKDD'99, ICDE'00, VLDB'00 and SIGKDD'01.