XML, Schemas, and Queries
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems

Readings & Reminders
§ Reminder: Homework 1 Milestone 2 due 2/15 @ 11:59 PM
§ XML, DTD, Schema
§ XPath
§ XSLT
§ For next week: Altinel & Franklin paper on XFilter

Kinds of Content
§ Keyword search and inverted indices are great for locating text documents
§ … But what if we want to index and/or share other kinds of content?
  § Spreadsheets
  § Maps
  § Purchase records
  § Objects
  § etc.
§ Let’s talk about structured data representation and transport, then later indexing and retrieval…

Sending Data
§ How do we send data within a program?
§ What is the implicit model?
§ How does this change when we need to make the data persistent?
§ What happens when we are coupling systems?
§ How do we send data between programs on the same machine?
§ Between different machines?

Marshalling
§ Converting from an in-memory data structure to something that can be sent elsewhere
  § Pointers -> something else
  § Specific byte orderings
  § Metadata
§ Note that the same logical data gets a different physical encoding
  § A specific case of Codd’s idea of logical-physical separation
  § “Data model” vs. “data”
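A minimal Python sketch of marshalling, assuming a made-up record layout (the field names and the 4-byte / 1-byte / 8-byte layout are illustrative and not from the lecture). It shows one logical record producing two different physical encodings, depending on the byte order chosen:

```python
import struct

# Hypothetical in-memory record; the field names are illustrative only.
record = {"id": 23, "grade": "A", "score": 3.7}

def marshal(rec, byte_order):
    """Pack as: 4-byte unsigned int, 1-byte char, 8-byte double.

    byte_order is '>' (big-endian) or '<' (little-endian).
    """
    return struct.pack(byte_order + "Icd",
                       rec["id"], rec["grade"].encode("ascii"), rec["score"])

def unmarshal(data, byte_order):
    """Inverse of marshal(): rebuild the dictionary from raw bytes."""
    rec_id, grade, score = struct.unpack(byte_order + "Icd", data)
    return {"id": rec_id, "grade": grade.decode("ascii"), "score": score}

big = marshal(record, ">")
little = marshal(record, "<")
print(big.hex())     # same logical data ...
print(little.hex())  # ... different bytes on the wire
```

The receiver has to know the layout and byte order out of band, which is exactly the metadata problem the slide points at.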

Communication and Streams
§ When storing data to disk, we have a combination of sequential and random access
§ When sending data on “the wire”, data is only sequential
  § “Stream-based communication” based on packets
§ What are the implications here?
  § Pipelining, incremental evaluation, …

Why Data Interchange Is Hard
Need to be able to understand:
§ Data encoding (physical data model)
  § May have syntactic heterogeneity
    § Endian-ness, marshalling issues
    § Impedance mismatches
§ Data representation (logical data model)
  § May have semantic heterogeneity
  § Imprecise and ambiguous values/descriptions

Examples
MP3 ID3 format – record at end of file

offset  length  description
0       3       "TAG" identifier string.
3       30      Song title string.
33      30      Artist string.
63      30      Album string.
93      4       Year string.
97      28      Comment string.
125     1       Zero byte separator.
126     1       Track byte.
127     1       Genre byte.
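The table above maps directly onto a fixed-width record. The following Python sketch reads such a 128-byte trailer; the function name, the sample bytes, and the choice of Latin-1 for decoding text fields are assumptions made for illustration:

```python
import struct

# One struct field per table row: 3 + 30 + 30 + 30 + 4 + 28 + 1 + 1 + 1 = 128 bytes.
ID3V1_FORMAT = "<3s30s30s30s4s28sBBB"

def parse_id3v1(data):
    """Parse the last 128 bytes of `data`; return None if no tag is present."""
    if len(data) < 128:
        return None
    (tag, title, artist, album, year,
     comment, _zero, track, genre) = struct.unpack(ID3V1_FORMAT, data[-128:])
    if tag != b"TAG":
        return None

    def text(raw):
        # Fields are padded; keep everything before the first zero byte.
        return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()

    return {"title": text(title), "artist": text(artist), "album": text(album),
            "year": text(year), "comment": text(comment),
            "track": track, "genre": genre}

# Fabricated file contents: some "audio" bytes followed by a tag record.
sample = b"fake audio bytes" + struct.pack(
    ID3V1_FORMAT, b"TAG", b"Song", b"Artist", b"Album", b"1999", b"", 0, 7, 17)
parsed = parse_id3v1(sample)
```

Nothing in the bytes themselves says where the title ends and the artist begins; the reader must already have the table.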

Examples
JPEG “JFIF” header:
§ Start of Image (SOI) marker -- two bytes (FFD8)
§ JFIF marker (FFE0)
§ length -- two bytes
§ identifier -- five bytes: 4A, 46, 49, 46, 00 (the ASCII code equivalent of a zero terminated "JFIF" string)
§ version -- two bytes: often 01, 02
  § the most significant byte is used for major revisions
  § the least significant byte for minor revisions
§ units -- one byte: Units for the X and Y densities
  § 0 => no units, X and Y specify the pixel aspect ratio
  § 1 => X and Y are dots per inch
  § 2 => X and Y are dots per cm
§ Xdensity -- two bytes
§ Ydensity -- two bytes
§ Xthumbnail -- one byte: 0 = no thumbnail
§ Ythumbnail -- one byte: 0 = no thumbnail
§ (RGB)n -- 3n bytes: packed (24-bit) RGB values for the thumbnail pixels, n = Xthumbnail * Ythumbnail
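As a rough sketch, the fixed part of this header can be decoded the same way. The function name and the sample bytes below are made up for illustration; the code relies on JPEG storing multi-byte fields in big-endian order and ignores any thumbnail data:

```python
import struct

# SOI, APP0 marker, length, identifier, version (major, minor),
# units, Xdensity, Ydensity, Xthumbnail, Ythumbnail; '>' = big-endian.
JFIF_HEADER = ">HHH5sBBBHHBB"

def parse_jfif_header(data):
    """Return the fixed JFIF header fields, or None if the markers do not match."""
    size = struct.calcsize(JFIF_HEADER)
    if len(data) < size:
        return None
    (soi, app0, length, ident, major, minor,
     units, xdensity, ydensity, xthumb, ythumb) = struct.unpack(
        JFIF_HEADER, data[:size])
    if soi != 0xFFD8 or app0 != 0xFFE0 or ident != b"JFIF\x00":
        return None
    return {"length": length, "version": (major, minor), "units": units,
            "xdensity": xdensity, "ydensity": ydensity,
            "thumbnail": (xthumb, ythumb)}

# Fabricated 20-byte header: version 01.02, 72 x 72 dots per inch, no thumbnail.
sample = bytes.fromhex("FFD8" "FFE0" "0010" "4A46494600"
                       "0102" "01" "0048" "0048" "00" "00")
header = parse_jfif_header(sample)
```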

Finding File Formats
§ http://www.wikipedia.org/
§ http://www.wotsit.org/
§ etc.

The Problem
§ You need to look into a manual to find file formats
  § (At best, e.g., MS .DOC file format)
§ The Web is about making data exchange easier… Maybe we can do better!
  § “The mother of all file formats”

Desiderata for Data Interchange
§ Ability to represent many kinds of information
  § Different data structures
§ Hardware-independent encoding
  § Endian-ness, UTF vs. ASCII vs. EBCDIC
§ Standard tools and interfaces
§ Ability to define “shape” of expected data
  § With forwards- and backwards-compatibility!
§ That’s XML…

Consumers of XML
A myriad of tools and interfaces, including:
§ DOM – document object model
  § Standard OO representation of an XML tree
§ SAX – simple API for XML
  § An event-driven parser interface for XML
    § startElement, endElement, etc.
§ Ant – Java-based “make” tool with XML “makefile”
§ XPath, XQuery, XSLT
§ Web service standards
§ Anything AJAX (“mash-ups”)
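To make the event-driven SAX style concrete, here is a small sketch using Python's standard xml.sax module (the course itself works in Java, so treat this as an analogue). The handler class name and the sample document are invented for illustration:

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Records each start/end event together with the nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, self.depth))
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        self.events.append(("end", name, self.depth))

# Made-up sample document.
SAMPLE = b"<dblp><article key='a1'><title>T</title></article></dblp>"

handler = EventRecorder()
xml.sax.parseString(SAMPLE, handler)
for event in handler.events:
    print(event)
```

The parser never builds a tree; the handler sees one event at a time, which suits the sequential, stream-based setting discussed earlier.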

XML as a Data Model
XML “information set” includes 7 types of nodes:
§ Document (root)
§ Element
§ Attribute
§ Processing instruction
§ Text (content)
§ Namespace
§ Comment
XML data model includes this, plus typing info, plus order info and a few other things

Example XML Document
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
</dblp>
Slide callouts: Processing Instr., Open-tag, Element, Attribute, Close-tag

XML Data Model Visualized (~ Document Object Model)
[Figure: tree view of the example document. Under Root sit the ?xml processing instruction (p-i) and the dblp element; dblp has mastersthesis and article element children, each with attribute nodes (mdate, key), child elements (author, title, year, school; editor, title, journal, volume, year, ee, ee), and text leaves. Legend: root, p-i, element, attribute, text.]

A Few Common Uses of XML
Serves as an extensible HTML
§ Allows custom tags (e.g., used by MS Word, openoffice)
§ Supplement it with stylesheets (XSL) to define formatting
Provides an exchange format for data (still need to agree on terminology)
§ Tables, objects, etc.
Format for marshalling and unmarshalling data in Web Services

XML as a Super-HTML (MS Word)
<h1 class="Section1">
  <a name="_top" />CIS 550: Database and Information Systems</h1>
<h2 class="Section1">Fall 2003</h2>
<p class="MsoNormal">
  <place>311 Towne</place>, Tuesday/Thursday
  <time Hour="13" Minute="30">1:30 PM – 3:00 PM</time>
</p>

XML Easily Encodes Relations
Student-course-grade

id   course    grade
1    330-f03   B
23   455-s04   A

<student-course-grade>
  <tuple>
    <sid>1</sid><course>330-f03</course><grade>B</grade>
  </tuple>
  <tuple>
    <sid>23</sid><course>455-s04</course><grade>A</grade>
  </tuple>
</student-course-grade>
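The encoding above is mechanical enough to automate. A minimal sketch with Python's xml.etree.ElementTree follows; the helper names relation_to_xml and xml_to_relation are invented here, and every value is treated as a string:

```python
import xml.etree.ElementTree as ET

COLUMNS = ("sid", "course", "grade")
ROWS = [("1", "330-f03", "B"), ("23", "455-s04", "A")]

def relation_to_xml(name, columns, rows):
    """Encode each row as a <tuple> element with one child per column."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for column, value in zip(columns, row):
            ET.SubElement(tup, column).text = value
    return root

def xml_to_relation(root):
    """Decode the <tuple> children back into a list of row tuples."""
    return [tuple(field.text for field in tup) for tup in root.findall("tuple")]

root = relation_to_xml("student-course-grade", COLUMNS, ROWS)
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

Sender and receiver still have to agree on what "sid" or "grade" means; XML only fixes the syntax.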

It Also Encodes Objects (with Pointers Represented as IDs)
<projects>
  <project class="cse455">
    <type>Programming</type>
    <memberList>
      <teamMember>Joan</teamMember>
      <teamMember>Jill</teamMember>
    </memberList>
    <codeURL>www….</codeURL>
    <incorporatesProjectFrom class="cse330" />
  </project>
  …

XML and Code
§ Web Services (.NET, Java web service toolkits) are using XML to pass parameters and make function calls – marshalling as part of remote procedure calls
  § SOAP + WSDL
§ Why?
  § Easy to be forwards-compatible
  § Easy to read over and validate (?)
  § Generally firewall-compatible
§ Drawbacks?
  § XML is a verbose and inefficient encoding!
  § But if the calls are only sending a few 100s of bytes, who cares?

XML When Tags Are Used by Different Sources
§ Namespaces allow us to specify a context for different tags
§ Two parts:
  § Binding of namespace to URI
  § Qualified names
<tag xmlns:myns="http://www.fictitious.com/mypath"
     xmlns="http://www.default/mypath">
  <thistag>is in default namespace</thistag>
  <myns:thistag>this is a different tag</myns:thistag>
</tag>
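One way to see that the two thistag elements really are different names is to parse the example. In the sketch below, Python's xml.etree.ElementTree expands each qualified name to {URI}local-name; the prefix "m" in the lookup is an arbitrary choice, since only the URI matters:

```python
import xml.etree.ElementTree as ET

DOC = """<tag xmlns:myns="http://www.fictitious.com/mypath"
     xmlns="http://www.default/mypath">
  <thistag>is in default namespace</thistag>
  <myns:thistag>this is a different tag</myns:thistag>
</tag>"""

root = ET.fromstring(DOC)

# ElementTree reports every tag with its namespace URI in braces.
tags = [child.tag for child in root]
for tag in tags:
    print(tag)

# Lookups bind a prefix of our own choosing to the same URI.
other = root.find("m:thistag", {"m": "http://www.fictitious.com/mypath"})
print(other.text)
```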

XML Isn’t Enough on Its Own
It’s too unconstrained for many cases!
§ How will we know when we’re getting garbage?
§ How will we query?
§ How will we understand what we got?

Document Type Definitions (DTDs)
DTD is an EBNF grammar defining XML structure
§ XML document specifies an associated DTD, plus the root element
§ DTD specifies children of the root (and so on)
DTD defines special significance for attributes:
§ IDs – special attributes that are analogous to keys for elements
§ IDREFs – references to IDs
§ IDREFS – space-delimited list of IDREFs

An Example DTD:
  <!ELEMENT dblp ((mastersthesis | article)*)>
  <!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
  <!ATTLIST mastersthesis mdate   CDATA #REQUIRED
                          key     ID    #REQUIRED
                          advisor CDATA #IMPLIED>
  <!ELEMENT author (#PCDATA)>
  …
Example use of DTD in XML file:
  <?xml version="1.0" encoding="ISO-8859-1" ?>
  <!DOCTYPE dblp SYSTEM "my.dtd">
  <dblp>…

DTDs Are Very Limited
DTDs capture grammatical structure, but have some drawbacks:
§ Only string scalar types
§ Global ID/reference space is inconvenient
§ No way of defining OO-like inheritance

XML Schema: DTDs Rethought
Features:
§ XML syntax
§ Better way of defining keys using XPaths
§ Type subclassing
§ … And, of course, built-in datatypes

Basic Constructs of Schema
Separation of elements (and attributes) from types:
§ complexType is a structured type
  § It can have sequences or choices
§ element and attribute have name and type
  § Elements may also have minOccurs and maxOccurs
Subtyping, most commonly using:
  <complexContent>
    <extension base="prevType"> … </…>

Simple Schema Example
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>

Embedding XML Schema
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="s1.xsd">
  <grade>a</grade>
</root>
<s1:root xmlns:s1="http://www.schemaValid.com/s1ns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.schemaValid.com/s1ns http://www.schemaValid.com/s1ns.xsd">
  <s1:grade>a</s1:grade>
</s1:root>
But the XML parser is actually free to ignore this – the schema is typically specified “from outside” the document

Designing an XML Schema/DTD
§ Often we are given a DTD/Schema; if not, we need to design one
§ We orient the XML tree around the “central” objects in a particular application

Manipulating XML
Sometimes:
§ Need to restructure an XML document
§ Or simply need to retrieve certain parts that satisfy a constraint, e.g.:
  § All books by author XYZ

Document Object Model (DOM) vs. Queries
§ Build a DOM tree (as we saw earlier) and access via Java (etc.) DOMNode object
  § DOM objects have methods like “getFirstChild()”, “getNextSibling()”
  § Common way of traversing the tree
  § Can also modify the DOM tree – alter the XML – via insertBefore(), etc.
§ Alternate approach: a query language
  § Define some sort of a template describing traversals from the root of the directed graph
  § In XML, the basis of this template is called an XPath
    § Can also declare some constraints on the values you want
    § The XPath returns a node set of matches
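A sketch of the DOM style of access, using Python's xml.dom.minidom rather than the Java API the slide refers to. minidom exposes firstChild and nextSibling as attributes instead of getter methods. The helper child_elements and the trimmed sample document are assumptions for illustration:

```python
from xml.dom import minidom

# Trimmed, whitespace-free version of the running example.
DOC = ("<dblp><article><title>The 1995 SQL Reunion</title>"
       "<year>1997</year></article></dblp>")

def child_elements(node):
    """Walk the children via firstChild / nextSibling, yielding elements only."""
    child = node.firstChild
    while child is not None:
        if child.nodeType == child.ELEMENT_NODE:
            yield child
        child = child.nextSibling

dom = minidom.parseString(DOC)
article = dom.documentElement.firstChild
before = [element.tagName for element in child_elements(article)]

# Modify the tree: put an <editor> element in front of <title>.
editor = dom.createElement("editor")
editor.appendChild(dom.createTextNode("Paul R. McJones"))
article.insertBefore(editor, article.firstChild)
after = [element.tagName for element in child_elements(article)]

print(before, after)
print(dom.documentElement.toxml())
```

Every step of the traversal is spelled out by hand here; the query-language approach replaces the loop with a declarative path.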

XPaths
In its simplest form, an XPath is like a path in a file system:
  /mypath/subpath/*/morepath
§ The XPath returns a node set representing the XML nodes (and their subtrees) at the end of the path
§ XPaths can have node tests at the end, returning only particular node types, e.g., text(), processing-instruction(), comment(), element(), attribute()
§ XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order

Sample XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
</dblp>

XML Data Model Visualized
[Figure: the same tree view of the sample document as before. Root has the ?xml processing instruction (p-i) and the dblp element; dblp has mastersthesis and article children with attribute nodes (mdate, key), child elements, and text leaves. Legend: root, p-i, element, attribute, text.]

Some Example XPath Queries
§ /dblp/mastersthesis/title
§ /dblp/*/editor
§ //title/text()
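These three queries can be approximated with Python's xml.etree.ElementTree, which implements only a subset of XPath. Two adaptations are assumptions of this sketch: paths are written relative to the dblp root element (so /dblp/... becomes ./...), and text() is emulated by reading each element's .text:

```python
import xml.etree.ElementTree as ET

# The sample document, without the XML declaration.
DOC = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
  </article>
</dblp>"""

root = ET.fromstring(DOC)  # root is the <dblp> element

# /dblp/mastersthesis/title
thesis_titles = [e.text for e in root.findall("./mastersthesis/title")]

# /dblp/*/editor
editors = [e.text for e in root.findall("./*/editor")]

# //title/text()
all_titles = [e.text for e in root.findall(".//title")]

print(thesis_titles, editors, all_titles, sep="\n")
```

Results come back in document order, matching the ordered semantics of XPath.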

Context Nodes and Relative Paths
XPath has a notion of a context node: it’s analogous to a current directory
§ “.” represents this context node
§ “..” represents the parent node
§ We can express relative paths: subpath/sub-subpath/../.. gets us back to the context node
§ By default, the document root is the context node

Predicates – Filtering Operations
A predicate allows us to filter the node set based on selection-like conditions over sub-XPaths:
  /dblp/article[title = "Paper1"]
which is equivalent to:
  /dblp/article[./title/text() = "Paper1"]
because of type coercion.
What does this do:
  /dblp/article[@key = "123" and ./title/text() = "Paper1" and ./author/*/element()]
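A hedged sketch of predicate filtering with Python's xml.etree.ElementTree. Its XPath subset has no "and" operator, so the conjunction is written here as two chained predicates, and paths are relative to the dblp root element. The sample data is a trimmed version of the running example, not the “Paper1” document the slide imagines:

```python
import xml.etree.ElementTree as ET

DOC = """<dblp>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
  <article key="123">
    <title>Another Report</title>
    <year>2001</year>
  </article>
</dblp>"""

root = ET.fromstring(DOC)

# Like /dblp/article[title = "The 1995 SQL Reunion"]
by_title = root.findall("./article[title='The 1995 SQL Reunion']")

# Like /dblp/article[@key = "123"]
by_key = root.findall("./article[@key='123']")

# Conjunction of two conditions, expressed as chained predicates.
both = root.findall("./article[@key='123'][title='Another Report']")

# No article has this title, so the node set is empty.
none = root.findall("./article[title='Paper1']")

print([a.findtext("year") for a in by_title])
print([a.findtext("title") for a in by_key])
```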

Axes: More Complex Traversals
Thus far, we’ve seen XPath expressions that go down the tree (and up one step)
§ But we might want to go up, left, right, etc.
§ These are expressed with so-called axes:
  self::path-step
  child::path-step
  descendant-or-self::path-step
  preceding-sibling::path-step
  preceding::path-step
  parent::path-step
  ancestor-or-self::path-step
  following-sibling::path-step
  following::path-step
§ The previous XPaths we saw were in “abbreviated form”

Users of XPath
§ XML Schema uses simple XPaths in defining keys and uniqueness constraints
§ XLink and XPointer, hyperlinks for XML
§ XSLT – useful for converting from XML to other representations (e.g., HTML, PDF, SVG)
§ XQuery – useful for restructuring an XML document or combining multiple documents
  § Might well turn into the “glue” between Web Services, etc.

A Functional Language for XML
§ XSLT is based on a series of templates that match different parts of an XML document
  § There’s a policy for what rule or template is applied if more than one matches (it’s not what you’d think!)
  § XSLT templates can invoke other templates
  § XSLT templates can be nonterminating (beware!)
§ XSLT templates are based on XPath “match”es, and we can also apply other templates (potentially to “select”ed XPaths)
  § Within each template, directly describe what should be output

An XSLT Template
§ An XML document itself
§ XML tags create output OR are XSL operations
  § All XSL tags are prefixed with “xsl” namespace
  § All non-XSL tags are part of the XML output
§ Common XSL operations:
  § template with a match XPath
  § Recursive call to apply-templates, which may also select where it should be applied
§ Attach to XML document with a processing-instruction:
  <?xml version="1.0" ?>
  <?xml-stylesheet type="text/xsl" href="http://www.com/my.xsl" ?>

An Example XSLT Stylesheet
<xsl:stylesheet version="1.1" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/dblp">
    <html><head>This is DBLP</head>
      <body>
        <xsl:apply-templates />
      </body>
    </html>
  </xsl:template>
  <xsl:template match="inproceedings">
    <h2><xsl:apply-templates select="title" /></h2>
    <p><xsl:apply-templates select="author"/></p>
  </xsl:template>
  …
</xsl:stylesheet>

XSLT Processing Model
§ List of source nodes → result tree fragment(s)
§ Start with root
§ Find all template rules with matching patterns from root
  § Find “best” match according to some heuristics
  § Set the current node list to be the set of things it matches
§ Iterate over each node in the current node list
  § Apply the operations of the template
  § “Append” the results of the matching template rule to the result tree structure
    § Repeat recursively if specified to by apply-templates
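The recursion in this processing model can be illustrated with a toy dispatcher in Python. This is not XSLT and uses no XSLT library: templates are plain functions keyed by element name, there is no priority resolution, and the fallback for unmatched elements only loosely imitates a built-in rule. All names below are invented for the sketch:

```python
import xml.etree.ElementTree as ET

def dblp_template(node, apply_templates):
    # Emit a wrapper, then recurse into the children (like xsl:apply-templates).
    return "<html><body>" + apply_templates(node) + "</body></html>"

def article_template(node, apply_templates):
    # Emit output directly and do not recurse any further.
    return "<h2>" + node.findtext("title", default="") + "</h2>"

# Toy "stylesheet": element name -> template function.
TEMPLATES = {"dblp": dblp_template, "article": article_template}

def transform(node):
    """Apply the matching template to node; append results in document order."""
    def apply_templates(parent):
        return "".join(transform(child) for child in parent)

    template = TEMPLATES.get(node.tag)
    if template is None:
        # Fallback: emit the element's own text, then process its children.
        return (node.text or "").strip() + apply_templates(node)
    return template(node, apply_templates)

DOC = ("<dblp><article><title>The 1995 SQL Reunion</title></article>"
       "<mastersthesis><title>PRPL</title></mastersthesis></dblp>")
result = transform(ET.fromstring(DOC))
print(result)
```

The mastersthesis element has no template of its own, so its text falls through the default rule, which is the same surprise real stylesheets produce when a rule is missing.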

What If There’s More than One Match?
§ Eliminate rules of lower precedence due to importing
§ Break a rule into any | branches and consider separately
§ Choose rule with highest computed or specified priority
Simple rules for computing priority based on “precision”:
§ QName preceded by child or attribute axis specifier: priority 0
§ NCName:* preceded by child or attribute axis specifier: priority -0.25
§ NodeTest preceded by child or attribute axis specifier: priority -0.5
§ else priority 0.5

Other Common Operations
§ Iteration:
  <xsl:for-each select="path"> </xsl:for-each>
§ Conditionals:
  <xsl:if test="./text() &lt; 'abc'"> </xsl:if>
§ Copying current node and children to the result set:
  <xsl:copy> <xsl:apply-templates /> </xsl:copy>

Creating Output Nodes
§ Return text/attribute data (this is a default rule):
  <xsl:template match="text()|@*"> <xsl:value-of select="."/> </xsl:template>
§ Create an element from text (attribute is similar):
  <xsl:element name="{text()}"> <xsl:apply-templates/> </xsl:element>
§ Copy nodes matching a path:
  <xsl:copy-of select="*"/>

Embedding Stylesheets
§ You can “import” or “include” one stylesheet from another:
  <xsl:import href="http://www.com/my.xsl"/>
  <xsl:include href="http://www.com/my.xsl"/>
§ “Include”: the rules get the same precedence as in the including stylesheet
§ “Import”: the rules are given lower precedence

XSLT Summary
§ A very powerful, template-based transformation language for XML document → other structured document
  § Commonly used to convert XML → PDF, SVG, GraphViz DOT format, HTML, WML, …
§ Primarily useful for presentation of XML or for very simple conversions
§ But sometimes we need more complex operations when converting data from one source to another
  § Joins – combining and correlating information from multiple sources
  § Aggregation – computing averages, counts, etc.

XSLT and Alternatives
XSLT is focused on reformatting documents
§ Stylesheets are focused around one XML file
§ XML file must reference the stylesheet
What if we want to:
§ Manage and combine collections of XML documents?
§ Make Web service requests for XML?
§ “Glue together” different Web service requests?
§ Query for keywords within documents, with ranked answers
§ This is where XQuery plays a role – see CIS 330 / 550 for details