XML
Salman Azhar
- Semi-structured Data
- XML (Extensible Markup Language)
- Well-formed and Valid XML
- Document Type Definitions
- IDs and IDREFs
These slides use some figures, definitions, and explanations from Elmasri-Navathe’s Fundamentals of Database Systems and Garcia-Molina-Ullman-Widom’s Database Systems
2/6/05 Salman Azhar: Database Systems
Framework
1. Information Integration: making databases from various places work as one.
2. Semi-structured Data: a new data model designed to cope with problems of information integration.
3. XML: a standard language for describing semi-structured data schemas and representing data.
1. Information Integration
- Databases in an enterprise generally have:
  - Several underlying database management systems
    - Oracle, Informix, MS SQL Server, Sybase (SQL Server), DB2, MS Access, etc.
  - Several underlying database schemas
    - Information in an employee table can contain:
      - Employee Name, SSN, DOB, title, hrsPerWeek, modifiedTime, modifiedBy
      - Employee Name, SSN, DOB, title, degree, createTime, createBy
      - Employee Name, SSN, DOB, title, salary, modifiedTime, modifiedBy, createTime, createBy
2. Semi-structured Data
- A new data model designed to cope with problems of information integration
- Accommodates different DBMSs
  - Oracle, Informix, MS SQL Server, Sybase (SQL Server), DB2, MS Access, etc.
- Integrates different schemas
  - Employee Name, SSN, DOB, title, hrsPerWeek, modifiedTime, modifiedBy
  - Employee Name, SSN, DOB, title, degree, createTime, createBy
  - Employee Name, SSN, DOB, title, salary, createTime, createBy, modifiedTime, modifiedBy
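The idea that one model can hold all three employee schemas can be sketched in Python. This is only an illustration, not part of the original slides: every record carries its own labels, and the values, the key spellings, and the helper names labels_in_use and lookup are all made up for this sketch.

```python
# Three hypothetical employee records, one per schema listed above.
# All values are invented; each record simply carries its own labels.
employees = [
    {"name": "A. Smith", "ssn": "000-00-0001", "dob": "1970-01-01",
     "title": "Engineer", "hrsPerWeek": 40,
     "modifiedTime": "2005-02-01", "modifiedBy": "hr1"},
    {"name": "B. Jones", "ssn": "000-00-0002", "dob": "1975-05-05",
     "title": "Analyst", "degree": "MS",
     "createTime": "2004-11-15", "createBy": "hr2"},
    {"name": "C. Lee", "ssn": "000-00-0003", "dob": "1980-09-09",
     "title": "Manager", "salary": 90000,
     "createTime": "2003-03-03", "createBy": "hr3",
     "modifiedTime": "2005-01-20", "modifiedBy": "hr1"},
]


def labels_in_use(records):
    """Return the union of all labels that appear on any record."""
    labels = set()
    for record in records:
        labels.update(record)  # updating a set with a dict adds its keys
    return labels


def lookup(record, label):
    """Return the value stored under label, or None if this record lacks it."""
    return record.get(label)


print(sorted(labels_in_use(employees)))
print([lookup(e, "salary") for e in employees])  # [None, None, 90000]
```

No record is forced into a shared schema; a query simply has to tolerate a label being absent.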
3. XML
- A standard language for describing semi-structured data schemas and representing data.
The Information-Integration Problem
- Major bottleneck in enterprise application integration
- For example:
  - Hewlett Packard split into HP and Agilent
    - Need to separate data into different destinations
  - HP bought Compaq
    - Need to integrate data from different sources
The Information-Integration Problem
- Related data exists in many places and could, in principle, work together.
- But different databases differ in:
  1. Model: relational, object-oriented?
  2. Schema: normalized/denormalized?
  3. Terminology: are consultants employees? Retirees? Subcontractors?
  4. Conventions: meters versus feet?
Example
- Consider the merger of two stores in a mall
  - There may be some overlap in the products sold
  - but the databases are different
Example
- Each company has a database
  - One may use a relational DBMS
    - the other keeps the data in an MS-Word document
  - One stores the phones of distributors
    - the other does not
  - One distinguishes products in one department
    - the other doesn’t
  - One counts inventory by number of items
    - the other by cases
Two Approaches to Integration
1. Warehousing
   - Makes a copy of the data
   - More developed of the two
2. Mediation
   - Creates a view of the data
   - Newer and less developed
Warehouse Diagram
[Figure: the user sends a query to the Warehouse and gets the result back; the Warehouse is loaded through a Wrapper from Source 1 and Source 2.]
A Mediator
[Figure: the user sends a query to the Mediator and gets the result back; the Mediator sends a query through a Wrapper to each of Source 1 and Source 2, and each result returns through its Wrapper to the Mediator.]
Warehousing
- Make copies of the data sources at a central site and transform them to a common schema
  - Reconstruct data daily/weekly
  - Do not try to keep it more up-to-date than that
- Pro:
  - Very well-developed
  - Several commercial tools are available
- Con:
  - Data can be old since updates are expensive
  - 24-hour availability is threatened by large data updates
Mediation
- Create a view of all sources, as if they were integrated
  - Answers a view query by translating it to the terminology of the sources and querying them
- Pro:
  - Current data
- Con:
  - Can be slow, as it requires real-time merging of different data sources
  - Lack of available tools
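The mediation approach can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real mediator: the two sources, the wrapper functions, and the figure of 12 items per case are all hypothetical, chosen to echo the "items versus cases" difference from the store-merger example.

```python
# Hypothetical sources: store A counts inventory in items,
# store B counts in cases. Nothing is copied to a central site.
SOURCE_A = {"Pepsi": 48, "Sobe": 10}  # units: items
SOURCE_B = {"Pepsi": 3}               # units: cases
ITEMS_PER_CASE = 12                   # assumption made for this sketch


def wrapper_a(product):
    """Wrapper for source A: it already uses the mediator's convention."""
    return SOURCE_A.get(product, 0)


def wrapper_b(product):
    """Wrapper for source B: translate cases into items."""
    return SOURCE_B.get(product, 0) * ITEMS_PER_CASE


def mediator_inventory(product):
    """Answer a query on the integrated view by asking every source now."""
    return sum(wrapper(product) for wrapper in (wrapper_a, wrapper_b))


print(mediator_inventory("Pepsi"))  # 48 + 3 * 12 = 84
```

Because each query reaches the sources at query time, the answer is current, which is also why mediation can be slow. A warehouse would instead run the same translation once per reload and answer from its stored copy.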
Semi-structured: Motivation
- Most effective approach to information integration:
  - Semi-structured data model, or
  - Semi-structured objects
Semi-structured: Motivation
- Main limitation of object-oriented models:
  - Object models are strongly typed
    - Objects of a class have one structure only
- The semi-structured approach solves this problem
Semi-structured Data
- Purpose:
  - Represent data from independent sources more flexibly than either relational or object-oriented models
Semi-structured Data
- Each object has a class of its own, and its properties are defined by whatever labels are attached to that object
- Properties mean
  - attributes, relationships, methods, etc.
Semi-structured Data
- Think of objects
  - but with the type of each object being that object’s own business
  - not that of its “class”
- Labels indicate the meaning of substructures
Semi-structured Graphs
- It is easy to think of semi-structured data as graphs
  - Nodes = objects
  - Labels on arcs =
    - attributes, leading to a leaf node
    - relationships, leading to another node
Semi-structured Graphs
- Atomic values at leaf nodes
  - nodes with no arcs out
- Flexibility: no restriction on
  - labels out of a node
  - number of successors with a given label
Example: Data Graph
[Figure: data graph with a root node. Arc labels: rest, soda, manf, name, sellsAt, prize, year, award, addr. Leaf values: PepsiCo, Pepsi, Sobe, KFC, Main St, 2003, BestSeller.]
- Root object represents the entire DB
- Data graphs often look like trees, but are not
- The restaurant object for KFC (arc in labeled rest; arc out labeled name to KFC)
- The soda object for Pepsi (arc in labeled soda; arc out labeled name to Pepsi)
- Notice a new kind of data
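A data graph like this one can be sketched with nested Python dictionaries. The labels and leaf values come from the figure, but the figure's exact wiring is not recoverable from the text, so which soda carries the prize and sellsAt arcs is an assumption of this sketch, as is the helper name follow.

```python
# Each node maps an arc label to a list of targets. A target is either
# another node (a dict) or an atomic value (a str or int) at a leaf.
kfc = {"name": ["KFC"], "addr": ["Main St"]}
prize = {"year": [2003], "award": ["BestSeller"]}
pepsi = {"name": ["Pepsi"], "manf": ["PepsiCo"],
         "prize": [prize],     # assumed: the prize belongs to Pepsi
         "sellsAt": [kfc]}     # assumed: Pepsi is the soda sold at KFC
sobe = {"name": ["Sobe"], "manf": ["PepsiCo"]}
root = {"rest": [kfc], "soda": [pepsi, sobe]}


def follow(node, *labels):
    """Follow a path of arc labels from node and return every target reached."""
    frontier = [node]
    for label in labels:
        next_frontier = []
        for current in frontier:
            if isinstance(current, dict):  # leaves have no arcs out
                next_frontier.extend(current.get(label, []))
        frontier = next_frontier
    return frontier


print(follow(root, "soda", "name"))             # ['Pepsi', 'Sobe']
print(follow(root, "soda", "sellsAt", "name"))  # ['KFC']
# The KFC node is reachable along two different paths, so this is not a tree.
print(follow(root, "rest")[0] is follow(root, "soda", "sellsAt")[0])  # True
```

Note that Sobe has no prize or sellsAt arc at all: nothing restricts which labels leave a node or how many successors share a label.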
Stage is Now Set for XML
- A technology has applications to different situations
  - foundations remain the same
  - applications change
Extensible Markup Language (XML)
- XML
  - uses tags for semantics (e.g., “this is an address”)
- HTML
  - uses tags for formatting (e.g., “italic”)
- Key idea:
  - create tag sets for a domain (e.g., genomics)
  - translate all data into properly tagged XML docs
Well-Formed and Valid XML
- Well-Formed XML
  - allows you to invent your own tags
    - similar to labels in a semi-structured data graph
- Valid XML
  - involves a DTD (Document Type Definition)
  - The DTD gives
    - a grammar for the use of labels
    - limits on the set of labels out of a node
    - the order and number of times a label occurs
Well-Formed XML
- All XML documents have
  - Header
  - Body
- Header defines
  - version
  - specifies that the document is in well-formed XML
- Body can include
  - root tag
  - several properly matching tags
Well-Formed XML: Header
- Start the document with a declaration
  - surrounded by <? … ?>
- Normal declaration for well-formed XML is:
    <?xml version="1.0" standalone="yes"?>
  - The declaration is case-sensitive: xml, version, and standalone are written in lowercase
  - version indicates the version number
  - standalone="yes" means the document relies on no external DTD
    - no DTD means the document only has to be well-formed
Well-Formed XML: Body
- Body of the document is a root tag surrounding nested tags
- Body can include:
  - several properly matching tags
    - (as in HTML structure)
  - a special tag called the root tag
    - can have a special meaning such as document type
    - or can be generic
Tags
- Tags, as in HTML
  - are normally matched pairs, as
    - <BLAH> … </BLAH>
  - may be nested arbitrarily
- Unlike HTML, where a tag such as <P> needs no matching ending, XML requires every opening tag to be closed
  - an element with no content can be written as a single self-closing tag, <BLAH/>
  - however, we will not use these in examples
Example: Well-Formed XML
    <?xml version="1.0" standalone="yes"?>
    <RESTS>
      <REST>
        <NAME>Taco Bell</NAME>
        <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
        <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
      </REST>
      <REST> …
      </REST>
      …
    </RESTS>
- Root tag RESTS surrounds the entire document
- One of several nested REST tags, each representing information about a single REST
- <NAME> tag specifies the REST name
- <SODA> tags have a name and a price for each soda, nested in <NAME> and <PRICE> tags
- Literal data items are contained at the atomic level
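Well-formedness can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree parser, which rejects documents whose tags do not match; the constant names GOOD and BAD and the helper is_well_formed are made up for this illustration, and the document is cut down to one complete REST so that it parses.

```python
import xml.etree.ElementTree as ET

# A complete version of the slide's document with a single REST element.
GOOD = """<?xml version="1.0" standalone="yes"?>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
    <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
  </REST>
</RESTS>"""

# Same text with the first </SODA> removed, so the tags no longer match.
BAD = GOOD.replace("</SODA>", "", 1)


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(GOOD))  # True
print(is_well_formed(BAD))   # False
```

This parser checks well-formedness only; it does not validate a document against a DTD.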
XML and Semi-structured Data
- Consider this:
  - Are well-formed XML documents with nested tags exactly the same idea as trees of semi-structured data?
- Tags
  - are the labels on edges
- Nodes
  - represent data between matching tags
- Parent-child relationship
  - is immediate nesting in XML
XML and Semi-structured Data
- The semi-structured approach allows for non-tree structures
- We shall see that XML also enables non-tree structures
  - mimics the semi-structured data model
Group Exercise
- Convert the following into a semi-structured representation:
    <?xml version="1.0" standalone="yes"?>
    <RESTS>
      <REST>
        <NAME>Taco Bell</NAME>
        <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
        <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
      </REST>
      <REST> …
      </REST>
      …
    </RESTS>
- Note: Do not turn over to the next page before attempting this exercise yourself!
Solution: The semi-structured representation
[Figure: the exercise document drawn as a tree, shown here as an indented outline.]
    RESTS
      REST
        NAME: Taco Bell
        SODA
          NAME: Pepsi
          PRICE: 1.00
        SODA
          NAME: Sobe
          PRICE: 2.00
      REST
        ...
- Note: Data is stored in leaf nodes and structure (tags) in internal nodes
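The conversion from nested tags to a tree can be sketched with Python's standard xml.etree.ElementTree. The helper name outline is made up for this illustration, and the document is limited to the one fully specified REST.

```python
import xml.etree.ElementTree as ET

DOC = """<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
    <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
  </REST>
</RESTS>"""


def outline(element, depth=0):
    """Return indented lines: tags become internal nodes, text sits at leaves."""
    indent = "  " * depth
    children = list(element)
    if not children:
        # A leaf: show the tag together with the data it encloses.
        return [f"{indent}{element.tag}: {(element.text or '').strip()}"]
    lines = [f"{indent}{element.tag}"]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(ET.fromstring(DOC))))
```

Each level of indentation in the printed outline corresponds to one level of immediate nesting in the XML, which is the parent-child relationship of the tree.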
Valid XML
- Switching gears: well-formed to valid XML
  - Valid XML is the most interesting use of XML
  - Essentially a context-free grammar for describing XML tags and their nesting
  - Specified by a DTD
- Each domain of interest creates one DTD that describes all the documents this group will share
  - For example, electronic components, travel industry, etc., will have their own DTDs
DTD Structure
    <!DOCTYPE <root tag> [
      <!ELEMENT <name> ( <components> )>
      <more elements>
    ]>
- Note: !DOCTYPE is a keyword, with <root tag> being the name of the DOCTYPE
- Between [ … ] is a list of ELEMENT definitions
- Each !ELEMENT has a <name> with the allowed list of <components>, usually in the order listed
DTD Elements
- An element definition consists
  - of its name (tag) and
  - a parenthesized description of any nested tags
    - includes order of subtags and
    - their multiplicity (0, 1, or many times)
- Leaves (text elements)
  - have #PCDATA in place of nested tags
Example: DTD
    <!DOCTYPE RESTS [
      <!ELEMENT RESTS (REST*)>
      <!ELEMENT REST (NAME, SODA+)>
      <!ELEMENT NAME (#PCDATA)>
      GROUP EXERCISE: COMPLETE THE DTD
    ]>
- RESTS can have * (0 or more) REST
- REST has NAME and then + (1 or more) SODA. Order matters!
- SODA has NAME followed by PRICE
- SODA’s NAME and PRICE are data (#PCDATA): no more tags, just text
- Note: Do not turn over to the next page before attempting this exercise yourself!
Example: DTD
    <!DOCTYPE RESTS [
      <!ELEMENT RESTS (REST*)>
      <!ELEMENT REST (NAME, SODA+)>
      <!ELEMENT SODA (NAME, PRICE)>
      <!ELEMENT NAME (#PCDATA)>
      <!ELEMENT PRICE (#PCDATA)>
    ]>
- RESTS can have * (0 or more) REST
- REST has NAME and then + (1 or more) SODA. Order matters!
- SODA has NAME followed by PRICE
- NAME and PRICE are data (#PCDATA): no more tags, just text
- NAME is declared only once, even though both REST and SODA use it; a DTD may not declare the same element twice
Element Description Rules
- Subtags must appear in the order shown
- A tag may be followed by a symbol to indicate its multiplicity, as in UNIX regular expressions:
  - * = zero or more
  - + = one or more
  - ? = zero or one
- Alternative sequences of tags can be connected by
  - the symbol |
Example: Element Description
- A name is
  - either an optional title (e.g., “Dr.”), a first name, and a last name, in that order,
  - or it is an IP address
    <!ELEMENT NAME ( (TITLE?, FIRST, LAST) | IPADDR )>
- | is the alternative symbol
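Because element descriptions behave like regular expressions over child tags, the rules can be sketched with Python's re module. This is a hand translation for illustration, not a DTD validator: the patterns, the helper names children_string and matches_model, and the sample names are all assumptions of the sketch.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the slides, translated by hand into regular
# expressions over the sequence of child tags (each tag followed by a space).
REST_MODEL = r"NAME (?:SODA )+"                        # (NAME, SODA+)
NAME_MODEL = r"(?:(?:TITLE )?FIRST LAST |IPADDR )"     # ((TITLE?, FIRST, LAST) | IPADDR)


def children_string(element):
    """Spell out the element's immediate child tags, in document order."""
    return "".join(child.tag + " " for child in element)


def matches_model(element, pattern):
    """Return True if the child tag sequence matches the whole pattern."""
    return re.fullmatch(pattern, children_string(element)) is not None


with_title = ET.fromstring(
    "<NAME><TITLE>Dr.</TITLE><FIRST>Ada</FIRST><LAST>Lovelace</LAST></NAME>")
wrong_order = ET.fromstring(
    "<NAME><LAST>Lovelace</LAST><FIRST>Ada</FIRST></NAME>")
no_soda = ET.fromstring("<REST><NAME>KFC</NAME></REST>")

print(matches_model(with_title, NAME_MODEL))   # True
print(matches_model(wrong_order, NAME_MODEL))  # False: order matters
print(matches_model(no_soda, REST_MODEL))      # False: + needs at least one SODA
```

A real validating parser derives this check from the DTD itself; writing the patterns by hand only shows why order and multiplicity are enforced the way they are.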
Use of DTDs
- To specify that a document follows a particular DTD:
  1. Set standalone = "no"
  2. Either
     a) include the DTD as a preamble of the XML document, or
     b) follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD is stored
Example (a)
- DTD:
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE RESTS [
      <!ELEMENT RESTS (REST*)>
      <!ELEMENT REST (NAME, SODA+)>
      <!ELEMENT SODA (NAME, PRICE)>
      <!ELEMENT NAME (#PCDATA)>
      <!ELEMENT PRICE (#PCDATA)>
    ]>
- Document (same as earlier, but this time it conforms to the above DTD):
    <RESTS>
      <REST>
        <NAME>Taco Bell</NAME>
        <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
        <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
      </REST>
      <REST> …
    </RESTS>
Example (b)
- Assume the RESTS DTD is in file rest.dtd
- Get the DTD from the file rest.dtd:
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE RESTS SYSTEM "rest.dtd">
- Document (same as earlier, but this time it conforms to the DTD in rest.dtd):
    <RESTS>
      <REST>
        <NAME>Taco Bell</NAME>
        <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
        <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
      </REST>
      <REST> …
    </RESTS>
- The name after DOCTYPE must match the root tag exactly, including case
Attributes
- Attributes are another important component of DTDs and XML docs
- Opening tags in XML can have attributes
  - like <A HREF = "…"> in HTML
- In a DTD, <!ATTLIST <elementname> … >
  - gives a list of attributes and their data types for this element
Example: Attributes
- Rests can have an attribute kind
  - which is either qsr, family, or other
- The element definition is unchanged
- However, we add an ATTLIST:
    <!ELEMENT REST (NAME, SODA+)>
    <!ATTLIST REST kind (qsr | family | other) #IMPLIED>
- The allowed values are listed in parentheses without quotes; #IMPLIED marks the attribute as optional
Example: Attribute Use
- In a document that allows REST tags, we might see:
    <REST kind="qsr">
      <NAME>KFC</NAME>
      <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
      ...
    </REST>
- New info: kind = "qsr"
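Reading the attribute and checking it against the allowed values can be sketched with Python's standard xml.etree.ElementTree. The names ALLOWED_KINDS and kind_is_valid are made up for this illustration, and treating a missing kind as acceptable is an assumption based on "Rests can have an attribute kind".

```python
import xml.etree.ElementTree as ET

# The three values the slides allow for the kind attribute.
ALLOWED_KINDS = {"qsr", "family", "other"}


def kind_is_valid(rest):
    """Accept a missing kind (assumed optional) or one of the listed values."""
    kind = rest.get("kind")  # returns None when the attribute is absent
    return kind is None or kind in ALLOWED_KINDS


rest = ET.fromstring(
    '<REST kind="qsr"><NAME>KFC</NAME>'
    '<SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA></REST>')

print(rest.get("kind"), kind_is_valid(rest))  # qsr True
```

A validating parser would perform this check itself from the ATTLIST declaration; the sketch only makes the rule visible.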
IDs and IDREFs
- Introduce links from one object to another
- Allow the structure of an XML document to be a general graph
  - rather than just a tree
- These are pointers from one object to another
  - in analogy to HTML’s NAME = “blah” and HREF = “#blah”
Creating IDs
- We give an element Elephant an attribute Attention of type ID in the DTD
- When using tag <Elephant> in an XML document, give its attribute Attention a unique value
  - For example,
    <Elephant Attention="e213">
  - An ID value must be a valid XML name, so it cannot start with a digit or contain spaces
Creating IDREFs
- IDREFs are similar to IDs:
  - To allow objects of type Fig to refer to another object with an ID attribute,
    - give Fig an attribute of type IDREF (a single string of type ID)
  - Or, let the attribute have type IDREFS,
    - so the Fig object can refer to any number of other objects (any number of strings of type ID, separated by spaces)
Example: IDs and IDREFs
- Let us redesign our RESTS DTD to include both REST and SODA sub-elements
  - Both rests and sodas will have ID attributes called name
  - Rests have PRICE sub-objects,
    - consisting of a number (the price of one soda) and
    - an IDREF theSoda leading to that soda
  - Sodas have attribute soldBy,
    - which is an IDREFS leading to all the rests that sell it
The DTD
    <!DOCTYPE RESTS [
      <!ELEMENT RESTS (REST*, SODA*)>
      <!ELEMENT REST (PRICE+)>
      <!ATTLIST REST name ID #REQUIRED>
      <!ELEMENT PRICE (#PCDATA)>
      <!ATTLIST PRICE theSoda IDREF #REQUIRED>
      <!ELEMENT SODA EMPTY>
      <!ATTLIST SODA name ID #REQUIRED
                     soldBy IDREFS #IMPLIED>
    ]>
- RESTS have 0+ REST and 0+ SODA
- REST objects have name as an ID attribute and have one or more PRICE sub-objects
- PRICE objects have a number (the price) and one reference to a soda
- SODA objects have no content (EMPTY), an ID attribute called name, and a soldBy attribute that is a set of REST names
- Each attribute declaration ends with a default such as #REQUIRED or #IMPLIED; attributes in one ATTLIST are separated by whitespace, not commas
Example Document
    <RESTS>
      <REST name="TacoBell">
        <PRICE theSoda="Pepsi">1.00</PRICE>
        <PRICE theSoda="Sobe">2.00</PRICE>
      </REST>
      …
      <SODA name="Pepsi" soldBy="KFC TacoBell …">
      </SODA>
      …
    </RESTS>
- ID values cannot contain spaces, so the rest is named TacoBell
- Attributes in a tag and the names in an IDREFS value are separated by spaces, not commas
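How IDs and IDREFs turn the document into a graph can be sketched with Python's standard xml.etree.ElementTree. The standard-library parser does not interpret ID or IDREF types, so the sketch resolves them by hand. The helper names by_id, dangling_references, and sellers_of are made up, and the KFC element with its price is added here only so that every reference has a target.

```python
import xml.etree.ElementTree as ET

# A complete, made-up instance of the redesigned document.
DOC = """<RESTS>
  <REST name="TacoBell">
    <PRICE theSoda="Pepsi">1.00</PRICE>
    <PRICE theSoda="Sobe">2.00</PRICE>
  </REST>
  <REST name="KFC">
    <PRICE theSoda="Pepsi">1.00</PRICE>
  </REST>
  <SODA name="Pepsi" soldBy="KFC TacoBell"/>
  <SODA name="Sobe" soldBy="TacoBell"/>
</RESTS>"""

root = ET.fromstring(DOC)

# Index every element that carries the ID attribute (called name here).
by_id = {}
for element in root.iter():
    ident = element.get("name")
    if ident is not None:
        if ident in by_id:
            raise ValueError(f"duplicate ID: {ident}")
        by_id[ident] = element


def dangling_references(root, by_id):
    """List IDREF and IDREFS values that match no ID in the document."""
    missing = []
    for price in root.iter("PRICE"):
        if price.get("theSoda") not in by_id:
            missing.append(price.get("theSoda"))
    for soda in root.iter("SODA"):
        for ref in soda.get("soldBy", "").split():  # IDREFS: space-separated
            if ref not in by_id:
                missing.append(ref)
    return missing


def sellers_of(soda_name):
    """Follow the soldBy links of a soda to the REST elements they point to."""
    soda = by_id[soda_name]
    return [by_id[ref].get("name") for ref in soda.get("soldBy", "").split()]


print(dangling_references(root, by_id))  # []
print(sellers_of("Pepsi"))               # ['KFC', 'TacoBell']
```

The PRICE elements point from rests to sodas and soldBy points back, so the links form cycles that a pure tree of nested tags could not express.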
Recap
- Semi-structured Data
- XML (Extensible Markup Language)
- Well-formed and Valid XML
- Document Type Definitions
- IDs and IDREFs
Perspective
- Here XML is used as an EDI medium
  - EDI = electronic data interchange
- There are many other uses for XML
  - Each has its own utilization
Questions?
- Doesn’t mean you will get all the answers!