From Semistructured Data to XML Dan Suciu ATT

  • Slides: 118
Download presentation
From Semistructured Data to XML Dan Suciu AT&T Labs http: //www. research. att. com/~suciu/vldb

From Semistructured Data to XML Dan Suciu AT&T Labs http: //www. research. att. com/~suciu/vldb 99 -tutorial. pdf

How the Web is Today • HTML documents • all intended for human consumption

How the Web is Today • HTML documents • all intended for human consumption • many generated automatically by applications Easy to fetch any Web page, from any server, any platform

Limits of the Web Today • application cannot consume HTML • HTML wrapper technology

Limits of the Web Today • application cannot consume HTML • HTML wrapper technology is brittle – screen scraping • OO technology (Corba) requires controlled environment • companies merge, form partnerships; need interoperability fast people are inventive: send data by fax !

Paradigm Shift on the Web • new Web standard XML: – XML generated by

Paradigm Shift on the Web • new Web standard XML: – XML generated by applications – XML consumed by applications • data exchange – across platforms: enterprise interoperability – across enterprises Web: from collection of documents to data and documents

Database Community Can Help • • • query optimization, processing views, transformations data warehouses,

Database Community Can Help • • • query optimization, processing views, transformations data warehouses, data integration mediators, query rewriting secondary storage, indexes

But Needs a Paradigm Shift Too • Web data differs from database data: –

But Needs a Paradigm Shift Too • Web data differs from database data: – self-describing, schema-less – structure changes without notice – heterogeneous, deeply nested, irregular – documents and data mixed together • designed by document, not db experts • need Web data management

What This Tutorial is About • what the database community has done – semistructured

What This Tutorial is About • what the database community has done – semistructured data model – query languages, schemas • what the Web community has done: – data formats/models: XML, RDF – transformation language (XSL), schemas • where they meet and where they differ

Outline • • • Semistructured data and XML Query languages Schemas Systems issues Conclusions

Outline • • • Semistructured data and XML Query languages Schemas Systems issues Conclusions

Part 1 Semistructured Data and XML

Part 1 Semistructured Data and XML

Semistructured Data Origins: • integration of heterogeneous sources • data sources with non-rigid structure

Semistructured Data Origins: • integration of heterogeneous sources • data sources with non-rigid structure • biological data • Web data

The Semistructured Data Model Bib &o 1 complex object paper book references &o 12

The Semistructured Data Model Bib &o 1 complex object paper book references &o 12 &o 24 references author title year &o 29 references author http page author title publisher title author &o 43 &25 &96 1997 firstname lastname atomic object last firstname lastname &243 “Serge” “Abiteboul” “Victor” Object Exchange Model (OEM) first &206 “Vianu” 122 133

Syntax for Semistructured Data Bib: &o 1 { paper: &o 12 { … },

Syntax for Semistructured Data Bib: &o 1 { paper: &o 12 { … }, book: &o 24 { … }, paper: &o 29 { author: &o 52 “Abiteboul”, author: &o 96 { firstname: &243 “Victor”, lastname: &o 206 “Vianu”}, title: &o 93 “Regular path queries with constraints”, references: &o 12, references: &o 24, pages: &o 25 { first: &o 64 122, last: &o 92 133} } }

Syntax for Semistructured Data May omit oid’s: { paper: { author: “Abiteboul”, author: {

Syntax for Semistructured Data May omit oid’s: { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu”}, title: “Regular path queries …”, page: { first: 122, last: 133 } } }

Characteristics of Semistructured Data • • missing or additional attributes multiple attributes different types

Characteristics of Semistructured Data • • missing or additional attributes multiple attributes different types in different objects heterogeneous collections self-describing, irregular data, no a priori structure

Comparison with Relational Data row row name phone “John” 3634 “Sue” 6343 “Dick” 6363

Comparison with Relational Data row row name phone “John” 3634 “Sue” 6343 “Dick” 6363 { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } }

XML • a W 3 C standard to complement HTML • origins: structured text

XML • a W 3 C standard to complement HTML • origins: structured text SGML • motivation: – HTML describes presentation – XML describes content • • http: //www. w 3. org/TR/REC-xml (2/98)

From HTML to XML HTML describes the presentation

From HTML to XML HTML describes the presentation

HTML <h 1> Bibliography </h 1> <p> <i> Foundations of Databases </i> Abiteboul, Hull,

HTML <h 1> Bibliography </h 1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteoul, Buneman, Suciu Morgan Kaufmann, 1999

XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu

XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content

XML Terminology • • • tags: book, title, author, … start tag: <book>, end

XML Terminology • • • tags: book, title, author, … start tag: <book>, end tag: </book> elements: <book>…<book>, <author>…</author> elements are nested empty element: <red></red> abbrv. <red/> an XML document: single root element well formed XML document: if it has matching tags

More XML: Attributes <book price = “ 55” currency = “USD”> <title> Foundations of

More XML: Attributes <book price = “ 55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> attributes are alternative ways to represent data

More XML: Oids and References <person id=“o 555”> <name> Jane </name> </person> <person id=“o

More XML: Oids and References <person id=“o 555”> <name> Jane </name> </person> <person id=“o 456”> <name> Mary </name> <children idref=“o 123 o 555”/> </person> <person id=“o 123” mother=“o 456”><name>John</name> </person> oids and references in XML are just syntax

XML Data Model • does not exists • Document Object Model (DOM): – –

XML Data Model • does not exists • Document Object Model (DOM): – – http: //www. w 3. org/TR/REC-DOM-Level-1 (10/98) class hierarchy (node, element, attribute, …) objects have behavior defines API to inspect/modify the document

XML Parsers • traditional: return data structure (DOM? ) • event based: SAX (Simple

XML Parsers • traditional: return data structure (DOM? ) • event based: SAX (Simple API for XML) – http: //www. megginson. com/SAX – write handler for start tag and for end tag

XML Namespaces • http: //www. w 3. org/TR/REC-xml-names (1/99) • name : : =

XML Namespaces • http: //www. w 3. org/TR/REC-xml-names (1/99) • name : : = [prefix: ]localpart <book xmlns: isbn=“www. isbn-org. org/def”> <title> … </title> <number> 15 </number> <isbn: number> …. </isbn: number> </book>

XML Namespaces • syntactic: <number> , <isbn: number> • semantic: provide URL for schema

XML Namespaces • syntactic: <number> , <isbn: number> • semantic: provide URL for schema <tag xmlns: mystyle = “http: //…”> defined here … <mystyle: title> … </mystyle: title> <mystyle: number> … </tag>

XML v. s. Semistructured Data • both described best by a graph • both

XML v. s. Semistructured Data • both described best by a graph • both are schema-less, self-describing

Similarities and Differences <person id=“o 123”> { person: &o 123 <name> Alan </name> {

Similarities and Differences <person id=“o 123”> { person: &o 123 <name> Alan </name> { name: “Alan”, <age> 42 </age> age: 42, <email> ab@com </email> email: “ab@com” } </person> } <person father=“o 123”> … </person> father person name Alan age 42 email ab@com { person: { father: &o 123 …} } person father name age email Alan similar on trees, different on graphs 42 ab@com

More Differences • XML is ordered, ssd is not • XML can mix text

More Differences • XML is ordered, ssd is not • XML can mix text and elements: <talk> Making Java easier to type and easier to type <speaker> Phil Wadler </speaker> </talk> • XML has lots of other stuff: entities, processing instructions, comments

RDF • http: //www. w 3. org/TR/REC-rdf-syntax (2/99) • purpose: metadata for Web –

RDF • http: //www. w 3. org/TR/REC-rdf-syntax (2/99) • purpose: metadata for Web – help search engines • syntax in XML • semantics: edge-labeled graphs

RDF Syntax <rdf: Description about=“www. mypage. com”> <about> birds, butterflies, snakes </about> <author> <rdf:

RDF Syntax <rdf: Description about=“www. mypage. com”> <about> birds, butterflies, snakes </about> <author> <rdf: Description> <firstname> John </firstname> <lastname> Smith </lastname> </rdf: Description> </author> </rdf: Description>

RDF Data Model www. mypage. com about author birds, butterflies, snakes firstname John lastname

RDF Data Model www. mypage. com about author birds, butterflies, snakes firstname John lastname Smith the RDF Data Model is very close to semistructured data

More RDF Examples related www. mypage. com about www. anotherpage. com author birds, butterflies,

More RDF Examples related www. mypage. com about www. anotherpage. com author birds, butterflies, snakes firstname John lastname Smith author Joe Doe

<rdf: Description about=“www. mypage. com”> <about> birds, butterflies, snakes </about> <author> <rdf: Description ID=“&o

<rdf: Description about=“www. mypage. com”> <about> birds, butterflies, snakes </about> <author> <rdf: Description ID=“&o 55”> <firstname> John </firstname> <lastname> Smith </lastname> </rdf: Description> </author> </rdf: Description> <rdf: Description about=“www. anotherpage. com”> <related> <rdf: Description about=“www. mypage. com”/> </related> <author rdf: resource=“&o 55”/> <author> Joe Doe </author> </rdf: Description>

RDF Terminology subject predicate object statement

RDF Terminology subject predicate object statement

More RDF: Containers • bag, sequence, alternative <rdf: Description> <a> <rdf: Bag> <rdf: li>

More RDF: Containers • bag, sequence, alternative <rdf: Description> <a> <rdf: Bag> <rdf: li> s 1 </rdf: li> <rdf: li> s 2 </rdf: li> </rdf: Bag> </a> </rdf: Description>

RDF Containers (cont’d) a rdf: type Bag rdf_1 s 1 rdf_2 s 2

RDF Containers (cont’d) a rdf: type Bag rdf_1 s 1 rdf_2 s 2

More RDF: Higher Order Statements “the author of www. thispage. com says: ‘the topic

More RDF: Higher Order Statements “the author of www. thispage. com says: ‘the topic of www. thatpage. com is environment’ “ www. thispage. com www. thatpage. com topic author says environment RDF uses reification

Summary of Data Models • semistructured data, XML, RDF • data is self-describing, irregular

Summary of Data Models • semistructured data, XML, RDF • data is self-describing, irregular • schema embedded in the data

Part 2 Query Languages • • • Semistructured data and XML Query languages Schemas

Part 2 Query Languages • • • Semistructured data and XML Query languages Schemas Systems issues Conclusions

Query Languages: Motivation • granularity of the HTML Web: one file • granularity of

Query Languages: Motivation • granularity of the HTML Web: one file • granularity of Web data varies: – single data item: “get John’s salary” – entire database: “get all salaries” – aggregates: “get average salary” • need query language to define granularity

Query Languages: Outline • for semistructured data: – Lorel – Un. QL – Stru.

Query Languages: Outline • for semistructured data: – Lorel – Un. QL – Stru. QL • for XML: XML-QL • a different paradigm – structural recursion – XSL

Lorel • part of the Lore system (Stanford) • adapts OQL to semistructured data

Lorel • part of the Lore system (Stanford) • adapts OQL to semistructured data example: select X. title from Bib. paper X where X. year > 1995 select Bib. paper. title abbreviated to: from Bib. paper where Bib. paper. year > 1995

Lorel v. s. OQL • implicit coercions: 1995 to “ 1995” • missing attributes

Lorel v. s. OQL • implicit coercions: 1995 to “ 1995” • missing attributes – empty answer v. s. type error • set-valued attributes – in X. year>1995, X may have several years • regular path expressions (next)

Regular Path Expressions select X. title from Bib. paper X, Bib. (paper|book) Y where

Regular Path Expressions select X. title from Bib. paper X, Bib. (paper|book) Y where Y. author. lastname? = “Ullman” and Y. reference+ X Useful for: • syntactic substitute for inheritance: paper|book • navigating partially known structures: lastname? • transitive closure: reference+

Un. QL • Unstructured Query Language • patterns, templates, structural recursion • patterns: select

Un. QL • Unstructured Query Language • patterns, templates, structural recursion • patterns: select T where Bib. paper: { title: T, year: Y, journal: “TODS”} and Y > 1995

Un. QL: Templates select result: { fn: F, ln: L, pub: { title: T,

Un. QL: Templates select result: { fn: F, ln: L, pub: { title: T, year: Y }} where Bib. paper: { title: T, year: Y, journal: “TODS”} and Y > 1995 Result looks like: { result: { fn: “John”, ln: “Smith”, pub: { title: “P equals NP”, year: 2005}}, result: { fn: “Joe”, ln: “Doe”, pub: { title: “Errata to P=NP”, year: 2006}} …}

Skolem Functions • Maier, 1986 – in OO systems • Kifer et al, 1989

Skolem Functions • Maier, 1986 – in OO systems • Kifer et al, 1989 – F-logic • Hull and Yoshikawa, 1990 – deductive db (ILOG) • Papakonstantinou et al. , 1996 – semistructured db (MSL) • illustrate with Strudel (next)

Skolem Functions in Stru. QL • Strudel: a Web Site Management System • Stru.

Skolem Functions in Stru. QL • Strudel: a Web Site Management System • Stru. QL: its query language

Example: Bibliography Data {Bib: { paper: { author: “Jones”, author: “Smith”, title: “The Comma”,

Example: Bibliography Data {Bib: { paper: { author: “Jones”, author: “Smith”, title: “The Comma”, year: 1994 } }, { paper: …. . } }

Example: A Complex Web Site person Home. Page(“Smith”) yearentry Year. Page(“Smith”, 1996) publication Pub.

Example: A Complex Web Site person Home. Page(“Smith”) yearentry Year. Page(“Smith”, 1996) publication Pub. Page(“The Comma”) title person Home. Page(“Jones”) yearentry author Year. Page(“Smith”, 1994) author Root() Home. Page(“Mark”) yearentry Year. Page(“Jones”, 1994) publication Year. Page(“Mark”, 1996) Year. Page(“Jones”, 1998) publication Pub. Page(“The Dot”) publication title publication author

Example: Skolem Functions in Stru. QL where Root -> “Bib” -> X, X ->

Example: Skolem Functions in Stru. QL where Root -> “Bib” -> X, X -> “paper” -> P, P -> “author” -> A, P -> “title” -> T, P -> “year” -> Y create Root(), Home. Page(A), Year. Page(A, Y), Pub. Page(P) link Root() -> “person” -> Home. Page(A), Home. Page(A) -> “yearentry” -> Year. Page(A, Y), Year. Page(A, Y) -> “publication” -> Pub. Page(P), Pub. Page(P) -> “author” -> Home. Page(A), Pub. Page(P) -> “title” -> T

XML-QL: A Query Language for XML • http: //www. w 3. org/TR/NOTE-xml-ql (8/98) •

XML-QL: A Query Language for XML • http: //www. w 3. org/TR/NOTE-xml-ql (8/98) • features: – regular path expressions – patterns, templates – Skolem Functions • based on OEM data model

Pattern Matching in XML-QL where <book language=“french”> <publisher> <name> Morgan Kaufmann </name> </publisher> <author>

Pattern Matching in XML-QL where <book language=“french”> <publisher> <name> Morgan Kaufmann </name> </publisher> <author> $a </author> </book> in “www. a. b. c/bib. xml” construct $a

Simple Constructors in XML-QL where <book language = $l> <author> $a </> in “www.

Simple Constructors in XML-QL where <book language = $l> <author> $a </> in “www. a. b. c/bib. xml” construct <result> <author> $a </> <lang> $l </> Note: </> abbreviates </book> or </result> or. . . <result> <author>Smith</author><lang>English</lang></result> <author>Smith</author><lang>Mandarin</lang></result> <author>Doe</author><lang>English</lang></result>

Skolem Functions in XML-QL where <book language = $l> <author> $a </> in “www.

Skolem Functions in XML-QL where <book language = $l> <author> $a </> in “www. a. b. c/bib. xml” construct <result> <author id=F($a)> $a</> <lang> $l </> <result> <author>Smith</author> <lang>English</lang> <lang>Mandarin</lang> </result> <author>Doe</author> <lang>English</lang> </result>

A Different Paradigm: Structural Recursion Data as sets with a union operator: {a: 3,

A Different Paradigm: Structural Recursion Data as sets with a union operator: {a: 3, a: {b: ”one”, c: 5}, b: 4} = {a: 3} U {a: {b: ”one”, c: 5}} U {b: 4}

Structural Recursion Example: retrieve all integers in the data f(T 1 U T 2)

Structural Recursion Example: retrieve all integers in the data f(T 1 U T 2) = f({L: T}) = f({}) = f(V) = a 3 b a b c “one” 5 f(T 1) U f(T 2) f(T) {} if is. Int(V) then {result: V} else {} result 4 3 result 5 result 4 standard textbook programming on trees

Structural Recursion Example: increase all engine prices by 10% f(T 1 U T 2)

Structural Recursion Example: increase all engine prices by 10% f(T 1 U T 2) = f(T 1) U f(T 2) f({L: T}) = if L= engine then {L: g(T)} else {L: f(T)} f({}) = {} f(V) = V engine part price 100 body price part 1000 g(T 1 U T 2) = g(T 1) U g(T 2) g({L: T}) = if L= price then {L: 1. 1*T} else {L: g(T)} g({}) = {} g(V) = V engine price 1000 100 part price 110 body price part 1100 price 1000 100

XSL • two W 3 C drafts: XSLT and XPATH – http: //www. w

XSL • two W 3 C drafts: XSLT and XPATH – http: //www. w 3. org/TR/xpath, 7/99 – http: //www. w 3. org/TR/WD-xslt, 7/99 • in commercial products (e. g. IE 5. 0) • purpose: stylesheet specification language: – stylesheet: XML -> HTML – in general: XML -> XML

XSL Templates and Rules • query = collection of template rules • template rule

XSL Templates and Rules • query = collection of template rules • template rule = match pattern + template Retrieve all book titles: <xsl: template> <xsl: apply-templates/> </xsl: template> <xsl: template match = “/bib/*/title”> <result> <xsl: value-of/> </result> </xsl: template>

XPath Expressions in Match Patterns bib * / /bib bib/paper bib//paper|book @price bib/book/@price matches

XPath Expressions in Match Patterns bib * / /bib bib/paper bib//paper|book @price bib/book/@price matches a bib element matches any element matches the root element matches a bib element under root matches a paper in bib, at any depth matches a paper or a book matches a price attribute matches price attribute in book, in bib

Flow Control in XSL <xsl: template> <xsl: apply-templates/> </xsl: template> <xsl: template match=“a”> <A><xsl:

Flow Control in XSL <xsl: template> <xsl: apply-templates/> </xsl: template> <xsl: template match=“a”> <A><xsl: apply-templates/></A> </xsl: template> <xsl: template match=“b”> <B><xsl: apply-templates/></B> </xsl: template> <xsl: template match=“c”> <C><xsl: value-of/></C> </xsl: template>

XSL is Structural Recursion Equivalent to: f(T 1 U T 2) = f(T 1)

XSL is Structural Recursion Equivalent to: f(T 1 U T 2) = f(T 1) U f(T 2) f({L: T}) = if L= c then {C: t} else L= b then {B: f(t)} else L= a then {A: f(t)} else f(t) f({}) = {} f(V) = V XSL query = single function XSL query with modes = multiple function

XSL and Structural Recursion XSL: • trees only • may loop Structural Recursion: •

XSL and Structural Recursion XSL: • trees only • may loop Structural Recursion: • arbitrary graphs • always terminates add the following rule: <xsl: template match = “e”> <xsl: apply-patterns select=“/”/> </xsl: template> stack overflow on IE 5. 0

Summary of Query Languages • • • studied extensively in semistructured data some quite

Summary of Query Languages • • • studied extensively in semistructured data some quite powerful features no standard for XML QL yet (WG soon) XSL available today (for stylesheets) XSL = structural recursion

Part 3 Schemas • • • Semistructured data and XML Query languages Schemas Systems

Part 3 Schemas • • • Semistructured data and XML Query languages Schemas Systems issues Conclusions

Schemas • why ? here lies our interest – XML: to describe semantics –

Schemas • why ? here lies our interest – XML: to describe semantics – semistructured data: to improve processing • what ? – semistructured data: foundational – XML: several concrete proposals

Schemas • when ? – semistructured data, XML: a posteriori – RDBMS: a priori,

Schemas • when ? – semistructured data, XML: a posteriori – RDBMS: a priori, to interpret binary data • how ? – semistructured data: schema is independent – XML: schema is hardwired with the data

Outline • schemas for semistructured data: – foundations – schema extraction • schemas for

Outline • schemas for semistructured data: – foundations – schema extraction • schemas for XML: – DTD – XML-Schema – RDF-Schema

Schemas: An Example Some database: &r 1 person company person manages company works-for employee

Schemas: An Example Some database: &r 1 person company person manages company works-for employee &p 1 &c 1 &p 2 &c 2 c. e. o. &p 3 c. e. o. works-for position phone name address namepositionworks-for nameaddress &s 0 &s 1 &s 2 &s 3 &s 4 &s 5 &s 6 url &s 7 &s 8 &s 9 description “Smith” “Manager” “Widget” “Trenton” “Jones” “ 5552121” “Gadget” description salesrep &a 2 “Sales” eval &a 5 1998 &a 4 1997 &a 3 task &a 6 “below target” “www. gp. fr” &a 1 procurement &s 10 “Paris” “Dupont” contact “on target”

Lower-Bound Schemas person Root company works-for Company c. e. o. Employee name address name

Lower-Bound Schemas person Root company works-for Company c. e. o. Employee name address name managed-by string

Upper Bound Schemas person Root company Company works-for managed-by Employee c. e. o. |

Upper Bound Schemas person Root company Company works-for managed-by Employee c. e. o. | employee name | address | url name | phone | position description string - Any

The Two Questions to Ask Conformance: does that data conform to this schema ?

The Two Questions to Ask Conformance: does that data conform to this schema ? Classification: if so, then which objects belong to what classes ?

Graph Simulation Definition Two edge-labeled graphs G 1, G 2 A simulation is a

Graph Simulation Definition Two edge-labeled graphs G 1, G 2 A simulation is a relation R between nodes: • if (x 1, x 2) in R, and (x 1, a, y 1) in G 1, then exists (x 2, a, y 2) in G 2 (same label) s. t. (y 1, y 2) in R G 1 x 1 R x 2 a a y 1 R G 2 y 2 Note: a simulation can be efficiently computed [Henzinger, et a. 1995]

Using Simulation Data graph D, schema S • upper bound schema: – conformance: find

Using Simulation Data graph D, schema S • upper bound schema: – conformance: find simulation R from D to S – classification: check if (x, c) in R • lower bound schema – conformance: find simulation R from S to D – classification: check if (c, x) in R [Buneman et al 1997]

Example person Root company works-for managed-by Company c. e. o. “Smith” “Manager” name address

Example person Root company works-for managed-by Company c. e. o. “Smith” “Manager” name address name Employee &r 1 person company person manages company works-for &p 1 c. e. o. &c 1 employee&p 2 &c 2 c. e. o. &p 3 works-for phone name address position nameaddress name &s 0 &s 1 &s 2 &s 3 &s 4 &s 5 &s 6 url &s 7 &s 8 &s 9 description string “Widget” “Trenton” “Jones” &a 1 procurement &a 2 “Paris”“Dupont” “Sales” “ 5552121”“Gadget” description &s 10 eval &a 5 1998 &a 4 1997 &a 3 task &a 6 “below target” “www. gp. fr” salesrep contact person Root company works-for managed-by Company Employee c. e. o. | employee name | address | url name | phone | position description string - “on target” Lower Bound Database Upper Bound simulation: efficient technique for checking conformance to schema Any

Application 1: Improve Secondary Storage Company person Root company works-for managed-by Company name address

Application 1: Improve Secondary Storage Company person Root company works-for managed-by Company name address name Employee c. e. o. string Employee Lower-bound schema Store rest in overflow graph

Application 2: Query Optimization Bib paper year int journal select X. title from Bib.

Application 2: Query Optimization Bib paper year int journal select X. title from Bib. _ X where X. *. zip = “ 12345” book address title string title author string last first zip city streetname string Upper-bound schema select X. title from Bib. book X where X. address. zip = “ 12345” [Fernandez, Suciu 1998]

Schema Extraction (From Data) Problem statement • given data instance D • find the

Schema Extraction (From Data) Problem statement • given data instance D • find the “most specific” schema S for D In practice: S too large, need to relax [Nestorov et al. 1998]

Schema Extraction: Sample Data &r employee employee manages manages &p 1 &p 2 managedby

Schema Extraction: Sample Data &r employee employee manages manages &p 1 &p 2 managedby worksfor &p 4 &p 5 managedby company worksfor &c &p 3 worksfor employee &p 6 &p 7 managedby worksfor &p 8

Lower Bound Schema Extraction Root &r employee company employee Bosses &p 1, &p 4,

Lower Bound Schema Extraction Root &r employee company employee Bosses &p 1, &p 4, &p 6 worksfor Company &c manages managedby worksfor Regulars &p 2, &p 3, &p 5, &p 7, &p 8

Upper Bound Schema Extraction: Data Guides Root &r company managedby worksfor Bosses &p 1,

Upper Bound Schema Extraction: Data Guides Root &r company managedby worksfor Bosses &p 1, &p 4, &p 6 worksfor Company &c employee Employees &p 1, &p 3, P 4 &p 5, &p 6, &p 7, &p 8 manages managedby worksfor manages Regulars &p 2, &p 3, &p 5, &p 7, &p 8

Schemas in XML • Document Type Definition (DTD) • XML Schema • RDF Schema

Schemas in XML • Document Type Definition (DTD) • XML Schema • RDF Schema

Document Type Definition: DTD • part of the original XML specification • an XML

Document Type Definition: DTD • part of the original XML specification • an XML document may have a DTD • terminology for XML: – well-formed: if tags are correctly closed – valid: if it has a DTD and conforms to it • validation is useful in data exchange

DTDs as Grammars <!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title, section*) |

DTDs as Grammars <!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title, section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)> ]> <paper> <section> <text> </section> <title> </title> <section> … </section> </paper>

DTDs as Schemas Not so well suited: • impose unwanted constraints on order <!ELEMENT

DTDs as Schemas Not so well suited: • impose unwanted constraints on order <!ELEMENT person (name, phone)> • references cannot be constraint • can be to vague: <!ELEMENT person ((name|phone|email)*)>

XML Schemas • • • very recent proposal unifies previous schema proposals generalizes DTDs

XML Schemas • • • very recent proposal unifies previous schema proposals generalizes DTDs uses XML syntax two documents: structure and datatypes – http: //www. w 3. org/TR/xmlschema-1 – http: //www. w 3. org/TR/xmlschema-2

XML Schemas <element. Type name=“paper”> <sequence> <element. Type. Ref name=“title”/> <element. Type. Ref name=“author”

XML Schemas <element. Type name=“paper”> <sequence> <element. Type. Ref name=“title”/> <element. Type. Ref name=“author” min. Occurs=“ 0”/> <element. Type. Ref name=“year”/> <choice> <element. Type. Ref name=“journal”/> <element. Type. Ref name=“conference”/> </choice> </sequence> </element. Type> DTD: <!ELEMENT paper (title, author*, year, (journal|conference))>

RDF Schemas • http: //www. w 3. org/TR/PR-rdf-schema (3/99) • object-oriented flavor

RDF Schemas • http: //www. w 3. org/TR/PR-rdf-schema (3/99) • object-oriented flavor

RDF Schemas • recall RDF data: subject predicate object statement – resources – properties

RDF Schemas • recall RDF data: subject predicate object statement – resources – properties • RDF schema: – classes – properties

RDF Schemas Data: <rdf: Description ID=“car 001”> <name> My Honda </name> <miles> 50000 </miles>

RDF Schemas Data: <rdf: Description ID=“car 001”> <name> My Honda </name> <miles> 50000 </miles> <rdf: type resource=“#Motor. Vehicle”/> </rdf: Description>

RDF Schemas Schema: <rdf: Description ID=“Motor. Vehicle”> <rdf: type resource=“#Class”/> <rdf: sub. Class. Of

RDF Schemas Schema: <rdf: Description ID=“Motor. Vehicle”> <rdf: type resource=“#Class”/> <rdf: sub. Class. Of resource=“#Resource”/> </rdf: Description> <rdf: Description ID=“Truck”> <rdf: type resource=“#Class”/> <rdf: sub. Class. Of resource=“#Motor. Vehicle”/> </rdf: Description>

RDF Schemas car 001 name type Truck sub. Class. Of type Class miles Motor.

RDF Schemas car 001 name type Truck sub. Class. Of type Class miles Motor. Vehicle type My Honda 50000

RDF Schemas • different from object-oriented systems: – OO: define a class by set

RDF Schemas • different from object-oriented systems: – OO: define a class by set of properties – RDF: define a property in terms of its classes • metadata in RDF: – an RDF schema described as an RDF data

Summary of Schemas • in SS data: – graph theoretic – data and schema

Summary of Schemas • in SS data: – graph theoretic – data and schema are decoupled – used in data processing • in XML – from grammar to object-oriented – schema wired with the data – emphasis on semantics for exchange

Part 4 Systems Issues • • • Semistructured data and XML Query languages Schemas

Part 4 Systems Issues • • • Semistructured data and XML Query languages Schemas Systems issues Conclusions

Systems Issues • servers • mediators

Systems Issues • servers • mediators

Servers for Semistructured Data / XML • storage • index • query evaluation [Mc.

Servers for Semistructured Data / XML • storage • index • query evaluation [Mc. Hugh, Widom 1999]

XML Storage • • • text file (XML) store in ternary relation use DTD

XML Storage • • • text file (XML) store in ternary relation use DTD to derive schema mine data to derive schema build special purpose repository (Lore)

XML Storage: Text File • advantages – simple – less space than one thinks

XML Storage: Text File • advantages – simple – less space than one thinks – reasonable clustering • disadvantage – no updates – require special purpose query processor

Store XML in Ternary Relation Ref &o 1 paper &o 2 title &o 3

Store XML in Ternary Relation Ref &o 1 paper &o 2 title &o 3 author &o 4 “The Calculus” “…” year Val &o 5 “…” [Florescu, Kossman 1999] &o 6 “ 1986”

Use DTD to derive Schema • DTD: <!ELEMENT employee (name, address, project*)> <!ELEMENT address

Use DTD to derive Schema • DTD: <!ELEMENT employee (name, address, project*)> <!ELEMENT address (street, city, state, zip)> • ODMG classes: class Employee public type tuple (name: string, address: Address, project: List(Project)) class Address public type tuple (street: string, …) • [Christophides et al. 1994 , Shanmugasundaram et al. 1999]

Mine Data to Derive Schema paper Paper 1 paper year author title authortitleauthor title

Mine Data to Derive Schema paper Paper 1 paper year author title authortitleauthor title fn ln Paper 2 [Deutsch et al. 1999]

Indexing Semistructured Data • coercions: 1995 v. s. “ 1995” • regular path expressions

Indexing Semistructured Data • coercions: 1995 v. s. “ 1995” • regular path expressions – data guides [Goldman, Widom, 1997] – T-indexes [Milo, Suciu, 1999]

Indexing All Paths in the Data 1 t t 2 a t 3 b

Indexing All Paths in the Data 1 t t 2 a t 3 b 7 a 8 t t 4 c a d 5 a 9 10 11 12 6 a b 13 Semistructured Data t 1 t 23456 a b 7 8 10 12 13 d c 7 13 Data Guide 9 11 1 a 23456 b c b 7 13 8 10 12 T-Index d 9 11

Mediators for Semistructured Data / XML • XML = virtual view of Relational/OO/OR sources

Mediators for Semistructured Data / XML • XML = virtual view of Relational/OO/OR sources • mediator = translation, integration • issues: – query composition and rewriting [Papakonstatinou, et al. 1996] – limited source capabilities [Yerneni, et al. 1999]

Example: An XML Mediator Store SB • relational database: • virtual XML view: <store>

Example: An XML Mediator Store SB • relational database: • virtual XML view: <store> <name> n 1 </name> <book>. . . </book>. . . </store> <name>n 2 </name> <book>. . . </book> … </store> Book

Example: An XML Mediator • specify mediator declaratively (a view): from where Store, SB,

Example: An XML Mediator • specify mediator declaratively (a view): from where Store, SB, Book Store. sid=SB. sid and SB. bid=Book. bid construct <store ID=f(Store. sid)> <name> Store. name </name> <book> Book. title </book> </store>

Example: An XML Mediator • users ask XML-QL queries: – find stores who sell

Example: An XML Mediator • users ask XML-QL queries: – find stores who sell “The Calculus” where <store> <name> $n </name> <book> The Calculus </book> <store> construct <result> $n </result>

Example: An XML Mediator • system composes query with view: from Store, SB, Book

Example: An XML Mediator • system composes query with view: from Store, SB, Book where Store. sid=SB. sid and SB. bid=Book. bid and Book. title=“The Calculus” construct <result> Store. name </result>

Summary of Systems • unclear today how XML will be used – materialized ?

Summary of Systems • unclear today how XML will be used – materialized ? Need servers – virtual ? Need mediators • most work is still ahead

Part 5 Conclusions • • • Semistructured data and XML Query languages Schemas Systems

Part 5 Conclusions • • • Semistructured data and XML Query languages Schemas Systems issues Conclusions

Summary • • XML = what is out there semistructured data = what we

Summary • • XML = what is out there semistructured data = what we can process paradigm shift, for both Web and db covered in tutorial: – data models, queries, schemas

Current and Future Technologies • Web applications possible today: – export relational data to

Current and Future Technologies • Web applications possible today: – export relational data to XML (e. g. Oracle) – import XML directly into applications • Web applications in the future: – mediator technology (XML view) – store/process native XML data – compress XML – mine/analyze XML

Why This Is Cool for Database Researchers • put to work what you teach

Why This Is Cool for Database Researchers • put to work what you teach in CS 101 ! – tree traversals (structural recursion, XSL) – automata theory (DTD’s, path expressions) – graph theory (simulation) • adapt old DB tricks to new kind of data • save the trees: from fax to XML The End

Further Readings www. w 3. org/XML www-db. stanford. edu/~widom www-rocq. inria. fr/~abiteboul db. cis.

Further Readings www. w 3. org/XML www-db. stanford. edu/~widom www-rocq. inria. fr/~abiteboul db. cis. upenn. edu www. research. att. com/~suciu Abiteboul, Buneman, Suciu Data on the Web: From Relational to Semistructured to XML Morgan Kaufmann, 1999 (appears in October)