CSE 636 Data Integration XML Semistructured Data Document Type Definitions


Semistructured Data
• Another data model, based on trees
• Motivation: flexible representation of data
  – Often, data comes from multiple sources with differences in notation, meaning, etc.
• Motivation: sharing of documents among systems and databases

Graphs of Semistructured Data
• Nodes = objects
• Labels on arcs (attributes, relationships)
• Atomic values at leaf nodes (nodes with no arcs out)
• Flexibility: no restriction on:
  – Labels out of a node
  – Number of successors with a given label

Example: Data Graph
[Figure: a data graph. Arcs out of the root are labeled beer, bar, beer. Other arc labels: manf, name, servedAt, prize, year, award, addr. Leaf values: Bud, A.B., M’lob, Joe’s, Maple, 1995, Gold. Callouts: “The bar object for Joe’s Bar” and “The beer object for Bud”.]

XML
HTML
• Uses tags for formatting the presentation (e.g., “italic”)
• Hard for applications to process
XML = Extensible Markup Language
• Uses tags for semantics (e.g., “this is an address”)
  – Similar to labels in semistructured data
• Allows you to invent your own tags
• Easy for applications to process

HTML vs. XML
HTML:
    <html>
      <body>
        <h1> Bibliography </h1>
        <p> <i>Foundations of Databases</i> Abiteboul, Hull, Vianu <br/> Addison Wesley, 1995 </p>
        <p> <i>Data on the Web</i> Abiteboul, Buneman, Suciu <br/> Morgan Kaufmann, 1999 </p>
      </body>
    </html>
XML:
    <?xml version="1.0" standalone="yes"?>
    <bibliography>
      <book>
        <title>Foundations of Databases</title>
        <author> Abiteboul </author>
        <author> Hull </author>
        <author> Vianu </author>
        <publisher> Addison Wesley </publisher>
        <year> 1995 </year>
      </book>
      …
    </bibliography>
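
To make "easy for applications to process" concrete, here is a minimal Python sketch that reads the bibliography with the standard library's xml.etree.ElementTree. The helper name summarize_books is hypothetical, and only the first book from the slide is included because the remaining entries are elided there.

```python
import xml.etree.ElementTree as ET

# Only the first <book> from the slide; the elided entries are left out.
BIBLIOGRAPHY = """<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>"""


def summarize_books(xml_text):
    """Return one (title, authors, year) tuple per <book> element."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        title = book.findtext("title").strip()
        authors = [author.text.strip() for author in book.findall("author")]
        year = int(book.findtext("year"))  # int() tolerates surrounding spaces
        books.append((title, authors, year))
    return books


if __name__ == "__main__":
    print(summarize_books(BIBLIOGRAPHY))
```

The tags name what each value means, so the program asks for "author" or "year" directly. The HTML version would need guesses about where the italic title ends and the author list begins.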

Why XML is of Interest to Us
• XML is just syntax for data
  – Note: we have no syntax for relational data
  – But XML is not relational: semistructured
• This is exciting because:
  – Can translate any data to XML
  – Can ship XML over the Web (HTTP, SOAP)
  – Can input XML into any application
  – Thus: data sharing and exchange on the Web

XML Data Sharing and Exchange
[Figure: XML data flowing over the Web (HTTP, SOAP) between sources and consumers. Box labels: Applications, XML DB, Relational DB, Web Site, Web Service. Operation labels: Transform, Integrate, Warehouse.]

XML Tags & Elements
• Tags: book, title, author, …
  – XML tags are case sensitive
• Tags, as in HTML, are normally matched pairs
  – <book> … </book>
  – Start tag: <book>, End tag: </book>
• Elements: a start tag, its matching end tag, and everything between them
  – Example 1: <title>Foundations of Databases</title>
  – Example 2: <book> <title>Foundations of Databases</title> </book>
• Elements may be nested arbitrarily
• Empty element: <book></book>
  – Abbreviation: <book/>
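
The points about nesting, case sensitivity, and the empty-element abbreviation can be checked with a short Python sketch using xml.etree.ElementTree. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Example 2 from the slide: a <title> element nested inside a <book> element.
nested = ET.fromstring("<book> <title>Foundations of Databases</title> </book>")
child_tags = [child.tag for child in nested]   # tags of the directly nested elements
title_text = nested.findtext("title")          # exact-case lookup succeeds
wrong_case = nested.find("Title")              # tags are case sensitive: no match

# <book></book> and its abbreviation <book/> parse to equivalent empty elements.
long_form = ET.fromstring("<book></book>")
short_form = ET.fromstring("<book/>")
same_empty = (
    len(long_form) == len(short_form) == 0
    and long_form.text == short_form.text
)

if __name__ == "__main__":
    print(child_tags, title_text, wrong_case, same_empty)
```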

XML Attributes
    <book price="55" currency="USD">
      <title> Foundations of Databases </title>
      <author> Abiteboul </author>
      …
      <year> 1995 </year>
    </book>
• Attributes are alternative ways to represent data

Replacing Attributes with Elements
    <book>
      <title> Foundations of Databases </title>
      <author> Abiteboul </author>
      …
      <year> 1995 </year>
      <price> 55 </price>
      <currency> USD </currency>
    </book>
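
The rewrite from attributes to child elements is mechanical. A minimal Python sketch follows, assuming a hypothetical helper attributes_to_elements and a shortened version of the book from the slide.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(element):
    """Move every attribute of `element` into a new child element of the same name."""
    for name, value in list(element.attrib.items()):
        child = ET.SubElement(element, name)  # appended after the existing children
        child.text = value
    element.attrib.clear()
    return element


# Shortened book: only the title is kept from the slide's example.
book = ET.fromstring(
    '<book price="55" currency="USD"><title>Foundations of Databases</title></book>'
)
attributes_to_elements(book)

if __name__ == "__main__":
    print(ET.tostring(book, encoding="unicode"))
```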

Elements vs. Attributes
• Too many attributes make documents hard to read
• Attributes do not specify document structure
• Attributes are good for simple information

More XML: CDATA Section
• Syntax: <![CDATA[ ... any text here ... ]]>
• Example:
    <example>
      <![CDATA[ some text here </notAtag> <>]]>
    </example>
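
A quick Python check, using xml.etree.ElementTree, that a parser hands back the CDATA content as plain text and does not treat the tag-like characters inside it as markup:

```python
import xml.etree.ElementTree as ET

example = ET.fromstring(
    "<example><![CDATA[ some text here </notAtag> <>]]></example>"
)
cdata_text = example.text   # the CDATA content, delivered as ordinary character data
child_count = len(example)  # nothing inside the CDATA section became an element

if __name__ == "__main__":
    print(repr(cdata_text), child_count)
```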

More XML: Entity References
• Syntax: &entityname;
• Example: <element> this is less than &lt; </element>
• Some entities:
    &lt;    <
    &gt;    >
    &amp;   &
    &apos;  '
    &quot;  "
    &#38;   & (numeric reference to a Unicode character)
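
In Python, the parser decodes entity references on input, and xml.sax.saxutils offers escape and unescape for producing and reversing them. The sample strings below are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Parsing replaces the entity reference with the character it stands for.
element = ET.fromstring("<element> this is less than &lt; </element>")
decoded = element.text

# escape() rewrites &, < and > so that text can be embedded in a document.
escaped = escape("Tom & Jerry < 3")
round_trip = unescape(escaped)

# A numeric character reference names a character by its code point.
numeric = ET.fromstring("<e>&#38;</e>").text

if __name__ == "__main__":
    print(repr(decoded), escaped, round_trip, numeric)
```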

More XML: Comments
• Syntax: <!-- ... comment text ... -->
• Yes, they are part of the data model!!!

XML Semantics: a Tree!
    <data>
      <person age="25">
        <name> Mary </name>
        <address>
          <street> Maple </street>
          <no> 345 </no>
          <city> Seattle </city>
        </address>
      </person>
      <person>
        <name> John </name>
        <address>Thailand</address>
        <phone> 23456 </phone>
      </person>
    </data>
[Figure: the corresponding tree. Element nodes: data, person, name, address, street, no, city, phone. Attribute node: age (value 25). Text nodes: Mary, Maple, 345, Seattle, John, Thai, 23456.]
• Order matters!!!
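
The tree reading of this document can be sketched in Python. xml.etree.ElementTree stores attributes and text as properties of elements, not as separate nodes, so the hypothetical walk generator below surfaces them as (depth, kind, label) triples in document order.

```python
import xml.etree.ElementTree as ET

DATA = """<data>
  <person age="25">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address>Thailand</address>
    <phone> 23456 </phone>
  </person>
</data>"""


def walk(element, depth=0):
    """Yield (depth, kind, label) for element, attribute, and text nodes in document order."""
    yield depth, "element", element.tag
    for name, value in element.attrib.items():
        yield depth + 1, "attribute", f"{name}={value}"
    if element.text and element.text.strip():
        yield depth + 1, "text", element.text.strip()
    for child in element:  # children are visited in document order
        yield from walk(child, depth + 1)


nodes = list(walk(ET.fromstring(DATA)))

if __name__ == "__main__":
    for depth, kind, label in nodes:
        print("  " * depth + f"{kind}: {label}")
```

Whitespace-only text between elements is skipped here, which is a simplification of this sketch and not a property of the XML data model.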

Well-Formed XML
• Start the document with a declaration, surrounded by <?xml … ?>
• Normal declaration is: <?xml version="1.0" standalone="yes"?>
  – standalone="yes" roughly means “no external DTD is needed”
• Has a single root element surrounding nested elements
• Has matching tags
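
A non-validating parser rejects input that is not well-formed, so well-formedness can be tested by trying to parse. A minimal Python sketch, with the helper name is_well_formed chosen for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed('<?xml version="1.0" standalone="yes"?><a><b/></a>'))
    print(is_well_formed("<a><b></a></b>"))  # tags do not nest properly
    print(is_well_formed("<a/><b/>"))        # more than one root element
```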

XML Data
• XML is self-describing
• Schema elements become part of the data
  – Relational schema: person(name, phone)
  – In XML, <person>, <name>, <phone> are part of the data, and are repeated many times
• Consequence: XML is much more flexible
• XML = semistructured data
  – Well-formed XML with nested tags is exactly the same idea as trees of semistructured data
  – XML also enables nontree structures, as does the semistructured data model

XML is Semistructured Data
• Missing attributes:
    <person> <name>John</name> <phone>1234</phone> </person>
    <person> <name>Joe</name> </person>      (no phone!)
• Could represent in a table with nulls:
    name | phone
    John | 1234
    Joe  | null

XML is Semistructured Data
• Repeated attributes:
    <person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person>      (two phones!)
• Impossible in tables:
    name | phone
    Mary | 2345   3456   ???
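
One way to see why missing and repeated attributes are unproblematic in XML is to read each person's phones into a list, which may be empty or hold several values. In this Python sketch the <people> wrapper element and the helper name phones_by_name are assumptions added for illustration. The person records are the ones from the slides.

```python
import xml.etree.ElementTree as ET

# The <people> root is an assumption; the slides show only the <person> elements.
PEOPLE = """<people>
  <person> <name>John</name> <phone>1234</phone> </person>
  <person> <name>Joe</name> </person>
  <person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person>
</people>"""


def phones_by_name(xml_text):
    """Map each person's name to the list of that person's phone numbers."""
    root = ET.fromstring(xml_text)
    return {
        person.findtext("name").strip(): [
            phone.text.strip() for phone in person.findall("phone")
        ]
        for person in root.findall("person")
    }


if __name__ == "__main__":
    print(phones_by_name(PEOPLE))
```

A flat (name, phone) table has no direct equivalent of the empty and two-element lists without nulls or repeated rows.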

XML is Semistructured Data
• Attributes with different types in different objects:
    <person>
      <name> <first>John</first> <last>Smith</last> </name>      (structured name!)
      <phone>1234</phone>
    </person>
• Nested collections (no 1NF)
• Heterogeneous collections:
  – <db> contains both <book>s and <publisher>s

Document Type Definition (DTD)
• Part of the original XML specification
• An XML document may have a DTD
• Valid XML: if it has a DTD and conforms to it
• Validation is useful in data exchange

Very Simple DTD
    <!DOCTYPE db [
      <!ELEMENT db ((book|publisher)*)>
      <!ELEMENT book (title, author*, year?)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT year (#PCDATA)>
      <!ELEMENT publisher (#PCDATA)>
    ]>

DTD: The Content Model
• Content model: <!ELEMENT tag (CONTENT)>
  – Complex = a regular expression over other elements
  – Text-only = #PCDATA
  – Empty = EMPTY
  – Any = ANY
  – Mixed content = (#PCDATA | B | C)*

DTD: Regular Expressions
• sequence
  – DTD: <!ELEMENT name (firstName, lastName)>
  – XML: <name> <firstName>…</firstName> <lastName>…</lastName> </name>
• optional
  – DTD: <!ELEMENT name (firstName?, lastName)>
  – XML: <name> <lastName>…</lastName> </name>
• zero or more
  – DTD: <!ELEMENT person (name, phone*)>
  – XML: <person> <name>…</name> <phone>…</phone> <phone>…</phone> <phone>…</phone> … </person>
• one or more
  – DTD: <!ELEMENT person (name, phone+)>
  – XML: <person> <name>…</name> <phone>…</phone> … </person>
• alternation
  – DTD: <!ELEMENT person (name, (phone|email))>
  – XML: <person> <name>…</name> <phone>…</phone> </person>
  – XML: <person> <name>…</name> <email>…</email> </person>
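
Because an element content model is a regular expression over child element names, it can be turned into an ordinary regex and matched against the sequence of child tags. The Python sketch below is an illustration of that idea, not a DTD validator. The helpers content_model_to_regex and children_match are hypothetical, each child tag is encoded as its name followed by ";", and #PCDATA, EMPTY, ANY, and attributes are not handled.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Translate a DTD element content model into a regex over 'tag;' strings."""
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.\-]*|[()|,?*+]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in {")", "|", "?", "*", "+"}:
            parts.append(token)  # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(token) + ";)")  # one child element
    return re.compile("".join(parts))


def children_match(element, model):
    """Check whether the child tags of `element`, in order, satisfy the content model."""
    child_string = "".join(child.tag + ";" for child in element)
    return content_model_to_regex(model).fullmatch(child_string) is not None


if __name__ == "__main__":
    book = ET.fromstring(
        "<book><title>t</title><author>a</author><author>b</author></book>"
    )
    print(children_match(book, "(title, author*, year?)"))
```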

DTD: Attributes
    <!ELEMENT person (ssn, name, office, phone?)>
    <!ATTLIST person age CDATA #REQUIRED
                     height CDATA #IMPLIED>
    <person age="25" height="6"> <name> ... </person>

DTD: Attributes
    <!ATTLIST tag (name type kind)+>
Types:
• CDATA = string
• (Mon | Wed | Fri) = enumeration
• ID = key
• IDREF = foreign key
• IDREFS = foreign keys separated by space
• others = rarely used
Kind:
• #REQUIRED
• #IMPLIED = optional
• "value" = default value
• #FIXED "value" = the only value allowed

XML: IDs and References
• Attributes can be pointers from one object to another
  – Compare to HTML’s NAME = "foo" and HREF = "#foo"
• Allows the structure of an XML document to be a general graph, rather than just a tree

XML: Creating IDs
• Give an element E an attribute A of type ID
• When using tag <E> in an XML document, give its attribute A a unique value
• Example: <E A="xyz">

XML: Creating References
• To allow objects of type F to refer to another object with an ID attribute, give F an attribute of type IDREF
• Or, let the attribute have type IDREFS, so the F-object can refer to any number of other objects

XML: IDs and References
    <person id="o555"> <name>Jane</name> </person>
    <person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person>
    <person id="o123" mother="o456"> <name>John</name> </person>
• IDs and references in XML are just syntax
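
Since IDs and references are just syntax, it is the application that follows them. A minimal Python sketch: index the elements by their id attribute, then resolve the idref and mother attributes through that index. The <family> wrapper element and the helper name_of are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The <family> root is an assumption; the slide shows only the <person> elements.
FAMILY = """<family>
  <person id="o555"> <name>Jane</name> </person>
  <person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person>
  <person id="o123" mother="o456"> <name>John</name> </person>
</family>"""

root = ET.fromstring(FAMILY)

# Index every person by its ID so that references can be looked up.
by_id = {person.get("id"): person for person in root.iter("person")}


def name_of(person):
    """Return the text of a person's <name> child, without padding."""
    return person.findtext("name").strip()


# An IDREFS value is a space-separated list of IDs.
mary = by_id["o456"]
children = [name_of(by_id[ref]) for ref in mary.find("children").get("idref").split()]

# An IDREF value is a single ID.
mother_of_john = name_of(by_id[by_id["o123"].get("mother")])

if __name__ == "__main__":
    print(children, mother_of_john)
```

Following the references turns the document tree into a graph: John points to Mary, and Mary points back to John and Jane.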

DTD: ID and IDREF(S) Attributes
    <!ELEMENT person (ssn, name, office, phone?)>
    <!ATTLIST person age CDATA #REQUIRED
                     id ID #REQUIRED
                     manager IDREF #REQUIRED
                     manages IDREFS #REQUIRED>
    <person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
      <name> ... </name> ...
    </person>

Use of DTDs
1. Set standalone = "no"
2. Either:
   a) Include the DTD as a preamble of the XML document, or
   b) Follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD can be found, or
   c) Mix the two (e.g., to override the external definition)

Example (a)
    <?xml version="1.0" standalone="no"?>
The DTD:
    <!DOCTYPE BARS [
      <!ELEMENT BARS (BAR*)>
      <!ELEMENT BAR (NAME, BEER+)>
      <!ELEMENT NAME (#PCDATA)>
      <!ELEMENT BEER (NAME, PRICE)>
      <!ELEMENT PRICE (#PCDATA)>
    ]>
The document:
    <BARS>
      <BAR><NAME>Joe’s Bar</NAME>
        <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
        <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
      </BAR>
      <BAR> …
    </BARS>
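
A document that carries its DTD as a preamble can be read by an ordinary parser. The Python sketch below parses Example (a) with xml.etree.ElementTree, which accepts the DOCTYPE block but does not validate against it. The helper name price_list is hypothetical, the elided second <BAR> is left out, and a plain apostrophe is used in the bar's name.

```python
import xml.etree.ElementTree as ET

BARS_DOC = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""


def price_list(xml_text):
    """Map each bar name to a dictionary of beer name -> price."""
    root = ET.fromstring(xml_text)
    menu = {}
    for bar in root.findall("BAR"):
        bar_name = bar.findtext("NAME")  # the bar's own NAME, a direct child
        menu[bar_name] = {
            beer.findtext("NAME"): float(beer.findtext("PRICE"))
            for beer in bar.findall("BEER")
        }
    return menu


if __name__ == "__main__":
    print(price_list(BARS_DOC))
```

Checking that the document actually conforms to the DTD needs a validating parser, which the Python standard library does not provide.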

Example (b)
• Assume the BARS DTD is in file bar.dtd
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE BARS SYSTEM "bar.dtd">      (get the DTD from the file bar.dtd)
    <BARS>
      <BAR><NAME>Joe’s Bar</NAME>
        <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
        <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
      </BAR>
      <BAR> …
    </BARS>

DTDs as Grammars
    <!DOCTYPE db [
      <!ELEMENT db ((book|publisher)*)>
      <!ELEMENT book (title, author*, year?)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT year (#PCDATA)>
      <!ELEMENT publisher (#PCDATA)>
    ]>

DTDs as Grammars
Same thing as:
    db        ::= (book|publisher)*
    book      ::= (title, author*, year?)
    title     ::= string
    author    ::= string
    year      ::= string
    publisher ::= string
• A DTD is an EBNF (Extended BNF) grammar
• An XML tree is precisely a derivation tree
• A valid XML document = a parse tree for that grammar

DTDs as Grammars
    <!DOCTYPE paper [
      <!ELEMENT paper (section*)>
      <!ELEMENT section ((title, section*) | text)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT text (#PCDATA)>
    ]>
    <paper>
      <section> <text> </text> </section>
      <section> <title> </title> <section> … </section> </section>
    </paper>
• XML documents can be nested arbitrarily deep
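
The recursive rule for section is what permits unbounded nesting, and a recursive function is the natural way to process such documents. A small Python sketch, with a hypothetical helper section_depth and a made-up paper that follows the DTD above:

```python
import xml.etree.ElementTree as ET


def section_depth(element):
    """Return the deepest nesting level of <section> elements at or below `element`."""
    deepest_child = max((section_depth(child) for child in element), default=0)
    return deepest_child + (1 if element.tag == "section" else 0)


# Illustrative document: one flat section, and one section containing a subsection.
paper = ET.fromstring(
    "<paper>"
    "<section><text>intro</text></section>"
    "<section><title>t</title><section><text>body</text></section></section>"
    "</paper>"
)

if __name__ == "__main__":
    print(section_depth(paper))
```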

DTDs as Schemas
Not so well suited:
• impose unwanted constraints on order:
  – <!ELEMENT person (name, phone)>
• references cannot be constrained
  – ID/IDREFS can reference any ID
• can be too vague:
  – <!ELEMENT person ((name|phone|email)*)>

DTDs as Schemas
• No context-dependent typing
[Figure: a tree rooted at dealer with children UsedCars and NewCars; both contain ad elements. Recoverable labels below ad: model, year.]
• Cannot distinguish between used car ads and new car ads
  – Different structure in different contexts

XML APIs
• Document Object Model (DOM)
  – Manipulation of XML data
  – Provides a representation of an XML document as a tree
  – Reads the XML document into memory
  – http://www.w3.org/DOM
  – Many implementations (Sun JAXP, Apache Xerces, …)
• Simple API for XML (SAX)
  – Event-based framework for parsing XML data
  – http://www.saxproject.org/
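
The difference between the two API styles can be sketched with Python's standard library, which ships a DOM implementation (xml.dom.minidom) and a SAX implementation (xml.sax). The sample document and the AuthorCounter handler are made up for illustration. The slides themselves name Java implementations.

```python
import xml.dom.minidom
import xml.sax

DOC = (
    b"<bibliography><book><title>Foundations of Databases</title>"
    b"<author>Abiteboul</author><author>Hull</author></book></bibliography>"
)

# DOM: the whole document is read into memory as a tree that can be queried.
dom = xml.dom.minidom.parseString(DOC)
dom_authors = [node.firstChild.data for node in dom.getElementsByTagName("author")]


# SAX: the parser reports events; the handler keeps only the state it needs.
class AuthorCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "author":
            self.count += 1


counter = AuthorCounter()
xml.sax.parseString(DOC, counter)

if __name__ == "__main__":
    print(dom_authors, counter.count)
```

DOM is convenient when the tree must be navigated or modified. SAX suits large inputs, because no tree is built.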

References
• Lecture Slides
  – Jeffrey D. Ullman
    http://www-db.stanford.edu/~ullman/dscb/pslides.html
  – Dan Suciu
    http://www.cs.washington.edu/homes/suciu/COURSES/590DS/02xmlsyntax.htm
    http://www.cs.washington.edu/homes/suciu/COURSES/590DS/11dtd.htm
  – Alon Levy
    http://www.cs.washington.edu/education/courses/csep544/02sp/lectures/lecture5cut.ppt
• BRICS XML Tutorial
  – A. Moeller, M. Schwartzbach
    http://www.brics.dk/~amoeller/XML/index.html
• W3C's XML homepage
  – http://www.w3.org/XML
• XML School: an XML tutorial
  – http://www.w3schools.com/xml