Lecture 14: Metadata and Markup
SIMS 202: Information Organization and Retrieval
Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm, Fall 2003
http://www.sims.berkeley.edu/academics/courses/is202/f03/
IS 202 – FALL 2003, 2003.10.09 - SLIDE 1

Lecture Overview
• Review
  – XML and Document Engineering
• Metadata And Markup
  – XML As A Metadata Lingua Franca
    • METS
  – SGML vs. XML DTD Construction
  – XML Schemas
  – XML For Protocols And Metadata Languages
• Readings/Discussion

XML as a common syntax
• XML (and SGML) provide a way of expressing the structure of documents that can be verified and validated by document processing systems
• “Documents” can be metadata structures
  – Such as the description of a particular photograph in our Phone project
• XML thus provides a way of representing metadata descriptions as well as the content that they describe

XML as a common syntax
• All XML documents follow some simple rules that make them interchangeable and usable across different systems
  – All data and markup is in Unicode
  – All elements are marked by begin and end tags (an empty element may use a single empty-element tag such as <br/>)
  – All markup is case-sensitive
  – XML DTDs and/or Schemas define the valid structure (and sometimes content) of the documents
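
The begin/end tag and case-sensitivity rules above can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample memo strings are assumptions made for illustration, not part of the lecture.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Every begin tag has a matching end tag: well-formed.
good = "<memo><to>Ray</to><from>Marc</from></memo>"
# <To> and </to> differ only in case, so they do not match.
bad_case = "<memo><To>Ray</To></memo>".replace("</To>", "</to>")
# The <to> element is never closed.
bad_open = "<memo><to>Ray</memo>"

results = [is_well_formed(s) for s in (good, bad_case, bad_open)]
```

Note that this only tests well-formedness. Checking a document against a DTD or Schema (validity) is a separate step that ElementTree does not perform.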

Example – METS
• METS – the Metadata Encoding and Transmission Standard is a new Schema intended to provide:
  – “a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium”
• METS can be used to “wrap” complex sets of data (the actual data, with rules for encoding binary forms), the metadata describing the parts of that data, and the sequence and conditions under which the data can or should be presented or displayed

SGML/XML Structure
• An SGML document consists of three parts:
  – The SGML Declaration
  – The Document Type Definition (DTD)
  – The Document Instance
• An XML document REQUIRES only the document instance, but for effective processing a DTD is very important
• XML Schema (later) provides an alternative to DTDs for XML applications

Document Type Definitions
• The DTD describes the structural elements and "shorthand" markup for a particular document type and defines:
  – Names of "legal" elements
  – How many times elements can appear
  – The order of elements in a document
  – Whether markup can be omitted (SGML only)
  – Contents of elements (i.e., nested structures)
  – Attributes associated with elements
  – Names of "entities"
  – Short-hand conventions for element tags (SGML only)

DTD Components
• The major components of a DTD are:
  – Entity Declarations
  – Element Declarations
  – Attribute Declarations

Document Type Definitions
• Entity Declarations are a "macro" definition facility for both DTD and Document instance parts
  – General Internal Entity Definitions
      <!ENTITY name "substitute string">
    referenced by &name;
  – General External Entity Definitions
      <!ENTITY name SYSTEM "file path">
    referenced by &name;
  – Parameter Entity Definitions (used only inside DTDs)
      <!ENTITY % name "substitute string">
    or
      <!ENTITY % name SYSTEM "file path">
    referenced by %name; or %name (XML always requires the closing semicolon)
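
As a small illustration of the "macro" behavior of a general internal entity, the sketch below declares one entity in an internal DTD subset and lets the parser expand it. It uses Python's standard xml.etree.ElementTree; the entity name sims and the memo document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "sims" is declared in the internal
# DTD subset and referenced as &sims; in the document instance.
doc = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ENTITY sims "School of Information Management and Systems">
]>
<memo><from>&sims;</from></memo>"""

root = ET.fromstring(doc)
# The parser substitutes the declared string for the entity reference.
expanded = root.find("from").text
```

External entities (SYSTEM "file path") are a different matter: ElementTree does not fetch them, so this sketch covers only the internal case.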

Document Type Definitions
• SGML Element Declarations define the structural elements of a document and its associated markup
      <!ELEMENT name - - content_model or declared_content +(include_list) -(exclude_list) >
  – Omitted tag minimization (the two positions after the name) indicates whether the start-tag and end-tag can be omitted (O) or not (-)
  – The (O) or (-) indicators are required in SGML but can NOT be used in XML

Document Type Definitions
• Content model provides a nested structural description of the elements that make up this element, e.g.:
      <!ELEMENT memo - - ((to & from), body, close?)>
      <!ELEMENT body - O (p)*>
      <!ELEMENT p    - O (#PCDATA | q)*>
      <!ELEMENT q    - - (#PCDATA)>
      ...
  – ANY (in SGML) may be used to indicate a content model of any elements in the DTD, in any order

Document Type Definitions
• Same content model in XML
      <?xml version="1.0"?>
      <!DOCTYPE memo [
      <!ELEMENT memo ((to | from)+, body, close?)>
      <!ELEMENT body (p)*>
      <!ELEMENT p (#PCDATA | q)*>
      <!ELEMENT q (#PCDATA)>
      …
      ]>
  – Note the XML declaration (<?xml … ?>) that opens the prolog
  – Note that the & connector used in the SGML version is not legal in XML; the replacement (to | from)+ is looser, since it also accepts a repeated to or from, or only one of the two
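
A DTD content model is essentially a regular expression over child element names. The sketch below makes that concrete by hand-translating the XML memo model into a Python regular expression and testing the sequence of child tags against it. This is an illustration of the idea only, not how a real validating parser is implemented; MEMO_MODEL and matches_memo_model are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation (an assumption for illustration) of
#   <!ELEMENT memo ((to | from)+, body, close?)>
# Each child name is followed by one space so names cannot run together.
MEMO_MODEL = re.compile(r"((to|from) )+body (close )?")

def matches_memo_model(xml_text: str) -> bool:
    """Check the direct children of <memo> against the content model."""
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in root)
    return root.tag == "memo" and MEMO_MODEL.fullmatch(children) is not None

ok_both = matches_memo_model("<memo><to/><from/><body/></memo>")
ok_one = matches_memo_model("<memo><to/><body/><close/></memo>")
no_header = matches_memo_model("<memo><body/></memo>")
wrong_order = matches_memo_model("<memo><to/><close/><body/></memo>")
```

The second case shows the looseness of (to | from)+ compared with the SGML (to & from): a memo with a to but no from is accepted.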

Document Type Definitions
• Declared content can be: CDATA, RCDATA, EMPTY
• Inclusion and Exclusion lists can be used to indicate elements that can occur or are forbidden to occur in any sub-elements of the content model (NOT in XML), e.g.:
      <!ELEMENT memo - - ((to & from), body, close?) +(fn)>
  – Says that element fn can appear anyplace in the memo

Document Type Definitions
• Attribute Declarations define attributes associated with (potentially) each element of a document and provide the acceptable values for those attributes

Attributes Example
• <!ATTLIST associate_element attribute_name declared_value default_value >
• <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC>
  – In markup of a document:
      <memo status="CONFIDENTIAL">
    also, because of the default set:
      <memo>
    would be the same as
      <memo status="PUBLIC">
There are a variety of special defaults and data types that can be given in attribute definitions
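
The effect of the declared default can be sketched in a few lines. The code below does not read a DTD; it keeps a hand-written table standing in for the ATTLIST declaration and applies the default and the enumerated-value check itself. The table layout and the function name effective_attributes are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for:
#   <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC>
ATTLIST = {
    "memo": {
        "status": {"allowed": {"PUBLIC", "CONFIDENTIAL"}, "default": "PUBLIC"},
    },
}

def effective_attributes(xml_text: str) -> dict:
    """Return the root element's attributes after applying declared defaults."""
    root = ET.fromstring(xml_text)
    result = dict(root.attrib)
    for name, decl in ATTLIST.get(root.tag, {}).items():
        # Use the declared default when the attribute is absent.
        value = result.setdefault(name, decl["default"])
        if value not in decl["allowed"]:
            raise ValueError(f"{name}={value!r} is not an allowed value")
    return result

defaulted = effective_attributes("<memo/>")
explicit = effective_attributes('<memo status="CONFIDENTIAL"/>')
```

With this table, <memo> and <memo status="PUBLIC"> yield the same attribute set, which is the behavior the slide describes.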

Sample SGML DTD
    <!doctype ELIB-TEXTS [
    <!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->
    <!ELEMENT ELIB-TEXTS o o (ELIB-BIB*)>
    <!-- We allow most elements to occur any number of times in any order -->
    <!-- this is because there is little consistency in the actual usage. -->
    <!ELEMENT ELIB-BIB - - (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
        (SERIES | TYPE | REVISION-DATE | AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL |
         AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL | CONTACT AUTHOR |
         PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP | LOCATION |
         ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
        (TEXT-REF | PAGED-REF)* )>
    <!-- We won't make any assumptions about content... all PCDATA -->
    <!ELEMENT ID - o (#PCDATA)>
    <!ELEMENT ABSTRACT - o (#PCDATA)>
    <!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL - o (#PCDATA)>
    <!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL - o (#PCDATA)>
    <!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING - o (#PCDATA)>
    … etc…
    ]>

XML Version
    <!DOCTYPE ELIB-TEXTS [
    <!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->
    <!ELEMENT ELIB-TEXTS (ELIB-BIB*)>
    <!-- We allow most elements to occur any number of times in any order -->
    <!-- this is because there is little consistency in the actual usage. -->
    <!ELEMENT ELIB-BIB (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
        (SERIES | TYPE | REVISION-DATE | AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL |
         AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL | CONTACT AUTHOR |
         PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP | LOCATION |
         ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
        (TEXT-REF | PAGED-REF)* )>
    <!-- We won't make any assumptions about content... all PCDATA -->
    <!ELEMENT ID (#PCDATA)>
    <!ELEMENT ABSTRACT (#PCDATA)>
    <!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL (#PCDATA)>
    <!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL (#PCDATA)>
    <!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING (#PCDATA)>
    … etc…
    ]>

Document Using That DTD
    <ELIB-BIB>
    <BIB-VERSION>ELIB-v1.0</BIB-VERSION>
    <ID>6</ID>
    <ENTRY>February 13 1995</ENTRY>
    <DATE>March 1, 1993</DATE>
    <TITLE>Water Conditions in California Report 2</TITLE>
    <ORGANIZATION>California Department of Water Resources</ORGANIZATION>
    <SERIES>120-93</SERIES>
    <TYPE>bulletin</TYPE>
    <AUTHOR-INSTITUTIONAL>California Department of Water Resources</AUTHOR-INSTITUTIONAL>
    <PAGES>17</PAGES>
    <TEXT-REF>/elib/data/disk5/documents/6/HYPEROCR/hyperocr.html</TEXT-REF>
    <PAGED-REF>/elib/data/disk5/documents/6/OCR-ASCII-NOZONE</PAGED-REF>
    </ELIB-BIB>

Dublin Core
• Review…
• Simple metadata for describing internet resources
• For “Document-Like Objects”
• 15 Elements

Dublin Core Elements
• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management

DC XML DTD Implementation
• There have been various versions
• This one is the one recommended (required) by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
• Uses XML namespaces
• Available at http://dublincore.org/documents/2001/09/20/dcmes-xml/

DC Element and Attribute Definitions
    <!-- The elements from DCMES 1.1 -->
    <!-- The name given to the resource. -->
    <!ELEMENT dc:title (#PCDATA)>
    <!ATTLIST dc:title xml:lang CDATA #IMPLIED>
    <!-- An entity primarily responsible for making the content of the resource. -->
    <!ELEMENT dc:creator (#PCDATA)>
    <!ATTLIST dc:creator xml:lang CDATA #IMPLIED>
    <!-- The topic of the content of the resource. -->
    <!ELEMENT dc:subject (#PCDATA)>
    <!ATTLIST dc:subject xml:lang CDATA #IMPLIED>
    <!-- An account of the content of the resource. -->
    <!ELEMENT dc:description (#PCDATA)>
    <!ATTLIST dc:description xml:lang CDATA #IMPLIED>
    <!-- The entity responsible for making the resource available. -->
    <!ELEMENT dc:publisher (#PCDATA)>
    <!ATTLIST dc:publisher xml:lang CDATA #IMPLIED>
    <!-- An entity responsible for making contributions to the content of the resource. -->
    <!ELEMENT dc:contributor (#PCDATA)>
    <!ATTLIST dc:contributor xml:lang CDATA #IMPLIED>
    <!-- A date associated with an event in the life cycle of the resource. -->
    <!ELEMENT dc:date (#PCDATA)>
    <!ATTLIST dc:date xml:lang CDATA #IMPLIED>

DC Element Definitions (cont.)
    <!-- The nature or genre of the content of the resource. -->
    <!ELEMENT dc:type (#PCDATA)>
    <!ATTLIST dc:type xml:lang CDATA #IMPLIED>
    <!-- The physical or digital manifestation of the resource. -->
    <!ELEMENT dc:format (#PCDATA)>
    <!ATTLIST dc:format xml:lang CDATA #IMPLIED>
    <!-- An unambiguous reference to the resource within a given context. -->
    <!ELEMENT dc:identifier (#PCDATA)>
    <!ATTLIST dc:identifier xml:lang CDATA #IMPLIED>
    <!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>
    <!-- A Reference to a resource from which the present resource is derived. -->
    <!ELEMENT dc:source (#PCDATA)>
    <!ATTLIST dc:source xml:lang CDATA #IMPLIED>
    <!ATTLIST dc:source rdf:resource CDATA #IMPLIED>
    <!-- A language of the intellectual content of the resource. -->
    <!ELEMENT dc:language (#PCDATA)>
    <!ATTLIST dc:language xml:lang CDATA #IMPLIED>
    <!-- A reference to a related resource. -->
    <!ELEMENT dc:relation (#PCDATA)>
    <!ATTLIST dc:relation xml:lang CDATA #IMPLIED>
    <!ATTLIST dc:relation rdf:resource CDATA #IMPLIED>
    <!-- The extent or scope of the content of the resource. -->
    <!ELEMENT dc:coverage (#PCDATA)>
    <!ATTLIST dc:coverage xml:lang CDATA #IMPLIED>
    <!-- Information about rights held in and over the resource. -->
    <!ELEMENT dc:rights (#PCDATA)>
    <!ATTLIST dc:rights xml:lang CDATA #IMPLIED>
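
To show what an instance using these dc: elements might look like, here is a minimal sketch that builds a small record with Python's standard xml.etree.ElementTree. The namespace URI is the Dublin Core element set namespace; the wrapper element name record, the helper dc_record, and the ISO-style date are assumptions for illustration, with field values borrowed from the ELIB sample record.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Ask the serializer to use the conventional "dc" prefix for this namespace.
ET.register_namespace("dc", DC_NS)

def dc_record(fields: dict) -> ET.Element:
    """Build a record whose children are namespace-qualified dc: elements."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record

record = dc_record({
    "title": "Water Conditions in California Report 2",
    "creator": "California Department of Water Resources",
    "date": "1993-03-01",
    "type": "bulletin",
})
xml_text = ET.tostring(record, encoding="unicode")
```

The serialized text declares xmlns:dc once on the wrapper and writes each field as dc:title, dc:creator, and so on, which is the prefixed form the DTD above expects.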

A More Complex SGML DTD
    <!DOCTYPE USMARC [
    <!-- USMARC DTD. UCB-SLIS v.0.08 -->
    <!-- By Jerome P. McDonough, April 1, 1994 -->
    <!ELEMENT USMARC - - (Leader, Directry, VarFlds)>
    <!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE) "BK"
                     id CDATA #IMPLIED>
    <!-- Author's Note: the id attribute for the USMARC element is intended to hold a unique record number for each MARC record in the local database. That is to say, it is intended ONLY as an aid in maintaining the local database of MARC records -->
    <!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount, BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
    <!ELEMENT Directry - O (#PCDATA)>
    <!ELEMENT VarFlds - O (VarCFlds, VarDFlds)>
    <!-- Component parts of Leader -->
    <!-- Logical Record Length -->
    <!ELEMENT LRL - O (#PCDATA)>
    …etc…

More Complex DTD (cont.)
    <!-- Variable Data Fields -->
    <!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?, Series?, Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?, Fld9XX?)>
    <!-- Component Parts of Variable Data Fields -->
    <!-- Numbers & Codes -->
    <!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017*, Fld018?, Fld019*, Fld020*, Fld022*, Fld023*, Fld024*, Fld025*, Fld027*, Fld028*, Fld029*, Fld030*, Fld032*, Fld033*, Fld034*, Fld035*, Fld036?, Fld037*, Fld039*, Fld040?, Fld041?, Fld042?, Fld043?, Fld044?, Fld045?, Fld046?, Fld047?, Fld048*, Fld050*, Fld051*, Fld052*, Fld055*, Fld060*, Fld061*, Fld066?, Fld069*, Fld070*, Fld071*, Fld072*, Fld074*, Fld080?, Fld082*, Fld084*, Fld086*, Fld088*, Fld090*, Fld096*)>
    <!-- Main Entries -->
    <!ELEMENT MainEnty - O (Fld100?, Fld111?, Fld130?)>
    <!-- Titles -->
    <!ELEMENT Titles - O (Fld210?, Fld211*, Fld212*, Fld214*, Fld222*, Fld240?, Fld242*, Fld243?, Fld245, Fld246*, Fld247*)>
    <!-- Edition, Imprint, etc. -->
    <!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255*, Fld256?, Fld257?, Fld260?, Fld261?, Fld262?, Fld263?, Fld265?)>
    <!-- Physical Description, etc. -->
    <!ELEMENT PhysDesc - O (Fld300*, Fld305*, Fld306?, Fld310?, Fld315?, Fld321*, Fld340*, Fld350?, Fld351*, Fld355*, Fld357*, Fld362*)>
    …etc…

Complex DTD (cont.)
    <!-- Title Statement -->
    <!ELEMENT Fld245 - O (Six?, (a|b|c|f|g|h|k|n|p|s)+)>
    <!ATTLIST Fld245 AddEnty (No|Yes|Blank) #IMPLIED
                     NFChars (0|1|2|3|4|5|6|7|8|9|Blnk) #IMPLIED>
    …etc…
    <!-- Subfield Element Declarations -->
    <!ELEMENT a - O (#PCDATA)>
    <!ELEMENT b - O (#PCDATA)>
    <!ELEMENT c - O (#PCDATA)>
    <!ELEMENT d - O (#PCDATA)>
    <!ELEMENT e - O (#PCDATA)>

Document Markup
• All document markup is derived from the DTD for the particular document type
• In SGML the DTD should be referenced in the document using the DOCTYPE declaration:
      <!DOCTYPE name SYSTEM "file_path" >
  or
      <!DOCTYPE name SYSTEM "file_path" [doctype_declaration_subset]>
  or
      <!DOCTYPE name [doctype_declaration_subset]>
The doctype_declaration_subset can be any combination of element, entity, and attribute declarations

HTML
• HTML was not originally "real" SGML; the DTD was invented after the language
• It is often more concerned with the form of the output on the screen than with the structural contents of the HTML docs
• Relies on the application (such as Netscape) to implement interesting actions like hypertext linking
• XHTML is now a W3C “recommendation” that applies XML conventions to HTML, and provides a growing set of capabilities within an XML framework (our phones use XHTML)

What are XML Schemas?
• An XML vocabulary for expressing your data's structure AND content types, and even the business rules involved in processing the data
• Written in XML themselves
• Support namespaces for combining multiple schemas in the same documents
  – The slides in this section are based on an XML tutorial by Roger L. Costello

Example
    <location>
      <latitude>32.904237</latitude>
      <longitude>73.620290</longitude>
      <uncertainty units="meters">2</uncertainty>
    </location>
Is this data valid? To be valid, it must meet these constraints (data business rules):
1. The location must be comprised of a latitude, followed by a longitude, followed by an indication of the uncertainty of the lat/lon measurements.
2. The latitude must be a decimal with a value between -90 and +90
3. The longitude must be a decimal with a value between -180 and +180
4. For both latitude and longitude the number of digits to the right of the decimal point must be exactly six digits.
5. The value of uncertainty must be a non-negative integer
6. The uncertainty units must be either meters or feet.
We can express all these data constraints using XML Schemas
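
To make the six constraints concrete, the sketch below checks them by hand in Python. This is not an XML Schema and not a schema validator; it only spells out, rule by rule, what a schema for this data would have to express. The function name location_errors and the error-message wording are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Lexical form assumed for rule 4: optional sign, digits, exactly six fraction digits.
SIX_FRACTION_DIGITS = re.compile(r"[+-]?\d+\.\d{6}")

def location_errors(xml_text):
    """Return a list of rule violations; an empty list means the data is ok."""
    root = ET.fromstring(xml_text)
    children = list(root)
    # Rule 1: latitude, then longitude, then uncertainty.
    if root.tag != "location" or [c.tag for c in children] != ["latitude", "longitude", "uncertainty"]:
        return ["rule 1: expected latitude, longitude, uncertainty in that order"]
    latitude, longitude, uncertainty = children
    errors = []
    # Rules 2-4: decimal range and number of fraction digits.
    for rule, element, limit in ((2, latitude, 90.0), (3, longitude, 180.0)):
        text = (element.text or "").strip()
        if not SIX_FRACTION_DIGITS.fullmatch(text):
            errors.append(f"rule 4: {element.tag} needs exactly six fraction digits")
        elif abs(float(text)) > limit:
            errors.append(f"rule {rule}: {element.tag} is outside -{limit:g}..+{limit:g}")
    # Rule 5: non-negative integer.
    if not re.fullmatch(r"\d+", (uncertainty.text or "").strip()):
        errors.append("rule 5: uncertainty must be a non-negative integer")
    # Rule 6: enumerated units.
    if uncertainty.get("units") not in ("meters", "feet"):
        errors.append("rule 6: units must be meters or feet")
    return errors

SAMPLE = (
    "<location><latitude>32.904237</latitude>"
    "<longitude>73.620290</longitude>"
    '<uncertainty units="meters">2</uncertainty></location>'
)
sample_errors = location_errors(SAMPLE)
```

In a schema, rules 2 to 6 would become datatype facets on the element and attribute declarations rather than procedural checks, which is the point of the slides that follow.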

Validating your data
[Diagram: the instance document and an XML Schema are fed to an XML Schema validator, which reports "Data is ok!"]
    <location>
      <latitude>32.904237</latitude>
      <longitude>73.620290</longitude>
      <uncertainty units="meters">2</uncertainty>
    </location>
The XML Schema tells the validator to:
– check that the latitude is between -90 and +90
– check that the longitude is between -180 and +180
– check that the fraction digits is 6 for lat and lon
– ...

Purpose of XML Schemas
• Specify:
  – the structure of instance documents
    • "this element contains these elements, which contains these other elements, etc"
  – the datatype of each element/attribute
    • "this element shall hold an integer with the range 0 to 12,000" (DTDs don't do too well with specifying datatypes like this)

Why Schemas? Motivation for XML Schemas
• People are dissatisfied with DTDs
  – It's a different syntax
    • You write your XML (instance) document using one syntax and the DTD using another syntax --> bad, inconsistent
  – Limited datatype capability
    • DTDs support a very limited capability for specifying datatypes. You can't, for example, express "I want the <elevation> element to hold an integer with a range of 0 to 12,000"
  – Desire a set of datatypes compatible with those found in databases
    • DTD supports 10 datatypes; XML Schemas supports 44+ datatypes

Highlights of XML Schemas
• XML Schemas are a tremendous advancement over DTDs:
  – Enhanced datatypes
    • 44+ versus 10
    • Can create your own datatypes
      – Example: "This is a new type based on the string type and elements of this type must follow this pattern: ddd-dddd, where 'd' represents a digit".
  – Written in the same syntax as instance documents
    • less syntax to remember
  – Object-oriented'ish
    • Can extend or restrict a type (derive new type definitions on the basis of old ones)
  – Can express sets, i.e., can define the child elements to occur in any order
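
The ddd-dddd example is a pattern restriction on the string type. As a rough sketch of what such a restriction accepts, the code below expresses the same pattern as a Python regular expression. XML Schema patterns must match the whole value, so fullmatch is used; the names DDD_DDDD and matches_pattern are made up for this example.

```python
import re

# "ddd-dddd, where 'd' represents a digit", written as a regular expression.
DDD_DDDD = re.compile(r"\d{3}-\d{4}")

def matches_pattern(value: str) -> bool:
    """True when the whole value has the form ddd-dddd."""
    return DDD_DDDD.fullmatch(value) is not None

checks = {
    "555-1234": matches_pattern("555-1234"),
    "5551234": matches_pattern("5551234"),
    "555-12345": matches_pattern("555-12345"),
}
```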

Highlights of XML Schemas
• Can specify element content as being unique (keys on content) and uniqueness within a region
• Can define multiple elements with the same name but different content
• Can define elements with nil content
• Can define substitutable elements - e.g., the "Book" element is substitutable for the "Publication" element.

BookStore.dtd
    <!ELEMENT BookStore (Book)+>
    <!ELEMENT Book (Title, Author, Date, ISBN, Publisher)>
    <!ELEMENT Title (#PCDATA)>
    <!ELEMENT Author (#PCDATA)>
    <!ELEMENT Date (#PCDATA)>
    <!ELEMENT ISBN (#PCDATA)>
    <!ELEMENT Publisher (#PCDATA)>

[Diagram: two sets of names]
– DTD vocabulary: ELEMENT, ATTLIST, #PCDATA, ID, NMTOKEN, CDATA, ENTITY
– New vocabulary being defined: BookStore, Book, Title, Author, Date, ISBN, Publisher
This is the vocabulary that DTDs provide to define your new vocabulary

[Diagram: two namespaces]
– http://www.w3.org/2001/XMLSchema: schema, element, complexType, sequence, string, integer, boolean
– http://www.books.org (targetNamespace): BookStore, Book, Title, Author, Date, ISBN, Publisher
This is the vocabulary that XML Schemas provide to define your new vocabulary
One difference between XML Schemas and DTDs is that the XML Schema vocabulary is associated with a name (namespace). Likewise, the new vocabulary that you define must be associated with a name (namespace). With DTDs neither set of vocabulary is associated with a name (namespace) [DTDs pre-dated namespaces].

BookStore.xsd (xsd = Xml-Schema Definition)
    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                targetNamespace="http://www.books.org"
                xmlns="http://www.books.org"
                elementFormDefault="qualified">
        <xsd:element name="BookStore">
            <xsd:complexType>
                <xsd:sequence>
                    <xsd:element ref="Book" minOccurs="1" maxOccurs="unbounded"/>
                </xsd:sequence>
            </xsd:complexType>
        </xsd:element>
        <xsd:element name="Book">
            <xsd:complexType>
                <xsd:sequence>
                    <xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
                    <xsd:element ref="Author" minOccurs="1" maxOccurs="1"/>
                    <xsd:element ref="Date" minOccurs="1" maxOccurs="1"/>
                    <xsd:element ref="ISBN" minOccurs="1" maxOccurs="1"/>
                    <xsd:element ref="Publisher" minOccurs="1" maxOccurs="1"/>
                </xsd:sequence>
            </xsd:complexType>
        </xsd:element>
        <xsd:element name="Title" type="xsd:string"/>
        <xsd:element name="Author" type="xsd:string"/>
        <xsd:element name="Date" type="xsd:string"/>
        <xsd:element name="ISBN" type="xsd:string"/>
        <xsd:element name="Publisher" type="xsd:string"/>
    </xsd:schema>

BookStore.xsd compared with BookStore.dtd: each DTD declaration and the schema fragment that replaces it
    <!ELEMENT BookStore (Book)+>
        <xsd:element name="BookStore"> … <xsd:element ref="Book" minOccurs="1" maxOccurs="unbounded"/> …
    <!ELEMENT Book (Title, Author, Date, ISBN, Publisher)>
        <xsd:element name="Book"> … a sequence of
        <xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Author" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Date" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="ISBN" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Publisher" minOccurs="1" maxOccurs="1"/>
    <!ELEMENT Title (#PCDATA)>
        <xsd:element name="Title" type="xsd:string"/>
    <!ELEMENT Author (#PCDATA)>
        <xsd:element name="Author" type="xsd:string"/>
    <!ELEMENT Date (#PCDATA)>
        <xsd:element name="Date" type="xsd:string"/>
    <!ELEMENT ISBN (#PCDATA)>
        <xsd:element name="ISBN" type="xsd:string"/>
    <!ELEMENT Publisher (#PCDATA)>
        <xsd:element name="Publisher" type="xsd:string"/>

BookStore.xsd, annotated: the root element
    <xsd:schema … > … </xsd:schema>
All XML Schemas have "schema" as the root element.

BookStore.xsd, annotated: the xsd prefix
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
The elements and datatypes that are used to construct schemas
– schema
– element
– complexType
– sequence
– string
come from the http://…/XMLSchema namespace

XMLSchema Namespace
[Diagram: the namespace http://www.w3.org/2001/XMLSchema contains complexType, element, sequence, schema, string, boolean, integer]

BookStore.xsd, annotated: the target namespace
    targetNamespace="http://www.books.org"
Says that the elements defined by this schema
– BookStore
– Book
– Title
– Author
– Date
– ISBN
– Publisher
are to go in this namespace

Book Namespace (targetNamespace): http://www.books.org
Names in this namespace: BookStore, Book, Title, Author, Date, ISBN, Publisher
IS 202 – FALL 2003. 10. 09 - SLIDE 49

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.books.org"
            xmlns="http://www.books.org"
            elementFormDefault="qualified">
  <xsd:element name="BookStore">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="Book" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="Book">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Author" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Date" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="ISBN" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Publisher" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="Title" type="xsd:string"/>
  <xsd:element name="Author" type="xsd:string"/>
  <xsd:element name="Date" type="xsd:string"/>
  <xsd:element name="ISBN" type="xsd:string"/>
  <xsd:element name="Publisher" type="xsd:string"/>
</xsd:schema>
Note on xmlns="http://www.books.org": the default namespace is http://www.books.org, which is the targetNamespace!
Note on ref="Book": this is referencing a Book element declaration. The Book in what namespace? Since there is no namespace qualifier, it is referencing the Book element in the default namespace, which is the targetNamespace! Thus, this is a reference to the Book element declaration in this schema.
IS 202 – FALL 2003. 10. 09 - SLIDE 50
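The lookup described on this slide can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not a schema processor: the helper name resolve_refs is hypothetical, the schema is abbreviated from the slide, and the prefix map is kept flat (it ignores namespace scoping), which is enough here because every declaration sits on the root element.

```python
import io
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Abbreviated copy of the BookStore schema from the slide.
SCHEMA = """<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.books.org"
            xmlns="http://www.books.org"
            elementFormDefault="qualified">
  <xsd:element name="BookStore">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="Book" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="Book" type="xsd:string"/>
</xsd:schema>"""


def resolve_refs(schema_text):
    """Return (ref value, namespace URI) for each xsd:element with a ref attribute.

    Hypothetical helper: it resolves the prefix of each ref (or the lack of
    one) against the namespace declarations seen so far in the document.
    """
    prefixes = {}  # prefix -> URI; the empty string is the default namespace
    resolved = []
    source = io.BytesIO(schema_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif payload.tag == f"{{{XSD_NS}}}element" and "ref" in payload.attrib:
            ref = payload.attrib["ref"]
            prefix, _, _local = ref.rpartition(":")  # "" when unqualified
            resolved.append((ref, prefixes.get(prefix)))
    return resolved


print(resolve_refs(SCHEMA))
```

Because ref="Book" carries no prefix, it is resolved against the default namespace declaration, which in this schema is the same URI as the targetNamespace.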

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.books.org"
            xmlns="http://www.books.org"
            elementFormDefault="qualified">
  <xsd:element name="BookStore">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="Book" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="Book">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Author" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Date" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="ISBN" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Publisher" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="Title" type="xsd:string"/>
  <xsd:element name="Author" type="xsd:string"/>
  <xsd:element name="Date" type="xsd:string"/>
  <xsd:element name="ISBN" type="xsd:string"/>
  <xsd:element name="Publisher" type="xsd:string"/>
</xsd:schema>
Note on elementFormDefault="qualified": this is a directive to any instance documents which conform to this schema. Any elements used by the instance document which were declared in this schema must be namespace qualified.
IS 202 – FALL 2003. 10. 09 - SLIDE 51
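What "namespace qualified" means for an instance document can be made concrete with a small Python check. This is only a sketch of the namespace test, not schema validation: the helper name unqualified_elements and the two sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TARGET_NS = "http://www.books.org"

# Every element inherits the default namespace, so all are qualified.
QUALIFIED = """<BookStore xmlns="http://www.books.org">
  <Book><Title>My Life and Times</Title></Book>
</BookStore>"""

# Only the root carries the prefix; Book and Title are in no namespace.
PARTLY_QUALIFIED = """<bk:BookStore xmlns:bk="http://www.books.org">
  <Book><Title>My Life and Times</Title></Book>
</bk:BookStore>"""


def unqualified_elements(xml_text, namespace=TARGET_NS):
    """List the tags of elements that are not in the given namespace.

    ElementTree expands a qualified name to "{namespace-uri}local-name",
    so a tag without that leading part is not in the namespace.
    """
    root = ET.fromstring(xml_text)
    wanted = f"{{{namespace}}}"
    return [el.tag for el in root.iter() if not el.tag.startswith(wanted)]


print(unqualified_elements(QUALIFIED))
print(unqualified_elements(PARTLY_QUALIFIED))
```

The first document passes because a default namespace declaration qualifies every unprefixed element; the second would be rejected under elementFormDefault="qualified" because Book and Title are unqualified.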

Referencing a schema in an XML instance document
<?xml version="1.0"?>
<BookStore xmlns="http://www.books.org"                              (1)
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"     (3)
           xsi:schemaLocation="http://www.books.org BookStore.xsd">  (2)
  <Book>
    <Title>My Life and Times</Title>
    <Author>Paul McCartney</Author>
    <Date>July, 1998</Date>
    <ISBN>94303-12021-43892</ISBN>
    <Publisher>McMillin Publishing</Publisher>
  </Book>
  ...
</BookStore>
1. First, using a default namespace declaration, tell the schema-validator that all of the elements used in this instance document come from the http://www.books.org namespace.
2. Second, with schemaLocation tell the schema-validator that the http://www.books.org namespace is defined by BookStore.xsd (i.e., schemaLocation contains a pair of values).
3. Third, tell the schema-validator that the schemaLocation attribute we are using is the one in the XMLSchema-instance namespace.
IS 202 – FALL 2003. 10. 09 - SLIDE 52
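The three steps above can be observed from code. The following Python sketch, using only the standard library, reads the instance document and extracts the (namespace, schema document) pairs from xsi:schemaLocation. The helper name schema_locations is an assumption, and the snippet only reads the hint; it does not fetch or apply BookStore.xsd.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Instance document from the slide, shortened to one Book.
INSTANCE = """<?xml version="1.0"?>
<BookStore xmlns="http://www.books.org"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.books.org BookStore.xsd">
  <Book>
    <Title>My Life and Times</Title>
  </Book>
</BookStore>"""


def schema_locations(xml_text):
    """Map each namespace URI in xsi:schemaLocation to its schema document.

    The attribute value is a whitespace-separated list of pairs:
    namespace URI first, schema location second.
    """
    root = ET.fromstring(xml_text)
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    return dict(zip(tokens[0::2], tokens[1::2]))


root = ET.fromstring(INSTANCE)
print(root.tag)                    # root element name with its namespace
print(schema_locations(INSTANCE))  # namespace -> schema document
```

The attribute is looked up under the XMLSchema-instance namespace rather than by the literal prefix xsi, which mirrors step 3: the prefix is arbitrary, the namespace is what identifies the attribute.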

XMLSchema-instance Namespace: http://www.w3.org/2001/XMLSchema-instance
Names in this namespace: schemaLocation, type, noNamespaceSchemaLocation, nil
IS 202 – FALL 2003. 10. 09 - SLIDE 53

Referencing a schema in an XML instance document
BookStore.xml: schemaLocation="http://www.books.org BookStore.xsd" - uses elements from namespace http://www.books.org
BookStore.xsd: targetNamespace="http://www.books.org" - defines elements in namespace http://www.books.org
A schema defines a new vocabulary. Instance documents use that new vocabulary.
IS 202 – FALL 2003. 10. 09 - SLIDE 54

Note multiple levels of checking
BookStore.xml against BookStore.xsd: validate that the XML document conforms to the rules described in BookStore.xsd
BookStore.xsd against XMLSchema.xsd (schema-for-schemas): validate that BookStore.xsd is a valid schema document, i.e., it conforms to the rules described in the schema-for-schemas
IS 202 – FALL 2003. 10. 09 - SLIDE 55

Default Value for minOccurs and maxOccurs • The default value for minOccurs is "1" • The default value for maxOccurs is "1"
<xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
Equivalent!
<xsd:element ref="Title"/>
IS 202 – FALL 2003. 10. 09 - SLIDE 56
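The equivalence can be checked with a few lines of Python. This is a minimal sketch: the helper name occurs is hypothetical, and using None to stand for "unbounded" is a choice made here for illustration, not something the schema language prescribes.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def occurs(element):
    """Return (minOccurs, maxOccurs), applying the default of 1 to each.

    maxOccurs="unbounded" is reported as None.
    """
    min_occurs = int(element.get("minOccurs", "1"))
    raw_max = element.get("maxOccurs", "1")
    max_occurs = None if raw_max == "unbounded" else int(raw_max)
    return min_occurs, max_occurs


# The two declarations from the slide.
explicit = ET.fromstring(
    f'<xsd:element xmlns:xsd="{XSD_NS}" ref="Title" minOccurs="1" maxOccurs="1"/>'
)
implicit = ET.fromstring(f'<xsd:element xmlns:xsd="{XSD_NS}" ref="Title"/>')

print(occurs(explicit), occurs(implicit))
```

Both declarations yield the same occurrence bounds, which is why the shorter form is usually preferred when an element must appear exactly once.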

Much More to XMLSchema! • This was an overview of some basics • There are many other features, such as: – The ability to import other schemas or parts of schemas – The ability to specify many data types – Etc. • XMLSchema definitions are at W3C – http://www.w3.org/TR/xmlschema-0/ is a good place to start IS 202 – FALL 2003. 10. 09 - SLIDE 57

Lecture Overview • Review – XML and Document Engineering • Metadata And Markup – XML As A Metadata Lingua Franca • METS – SGML vs. XML DTD Construction – XML Schemas – XML For Protocols And Metadata Languages • Readings/Discussion IS 202 – FALL 2003. 10. 09 - SLIDE 58

Other Protocols and Metadata Systems Using XML • SOAP (Simple Object Access Protocol) • DAV/DASL (Distributed Authoring and Versioning) • SDLIP (Simple Digital Library Interoperability Protocol) • RDF (Resource Description Framework) • ADL Gazetteer Protocol • OAI-PMH (already discussed) • MPEG-7 (more next time) • METS • Also versions of MARC and other formats in XML IS 202 – FALL 2003. 10. 09 - SLIDE 59

SGML and XML Sources and Resources • Books: – van Herwijnen, Eric. Practical SGML. (2nd Ed.) Boston: Kluwer Academic Publishers, 1994. – Goldfarb, Charles F. The SGML Handbook. Oxford: Clarendon Press, 1990. (and MANY XML books) • Web Sites: – The W3C web site (all XML standards documents) • http://www.w3.org – Robin Cover’s SGML/XML Site • http://www.oasis-open.org/cover/sgml-xml.html IS 202 – FALL 2003. 10. 09 - SLIDE 60

Lecture Overview • Review – XML and Document Engineering • Metadata And Markup – XML As A Metadata Lingua Franca • METS – SGML vs. XML DTD Construction – XML Schemas – XML For Protocols And Metadata Languages • Readings/Discussion IS 202 – FALL 2003. 10. 09 - SLIDE 61

Discussion – Vam Makam • Kirk covers examples of DTDs for books and newspapers. Many individuals and corporations have been creating numerous DTDs for themselves and general purposes. What are some innovative and useful ideas for areas where designing DTDs might be useful? For ideas that may have already been thought of, how could they be improved or extended? IS 202 – FALL 2003. 10. 09 - SLIDE 62

Discussion – Vam Makam • Although XML DTDs have emerged only recently, newer ideas such as XML Schemas have presented themselves as a better option. Given the thought and work that have gone into designing existing DTDs, at what point is it worth converting an existing DTD to an XML Schema? • Now that you have learned how to design a DTD and have basic knowledge of XML, what are some existing technologies that become more useful when combined with XML? IS 202 – FALL 2003. 10. 09 - SLIDE 63

Discussion – Annie Yeh • Kirk addresses the advantages of using external DTDs: the reusability of public DTDs, the ability to focus on content rather than structure, easier management of multiple documents, and easier data error checking. What are some of the existing repositories in which we can store these DTDs? What are some of the ways in which we can facilitate this process? What are their pros and cons? What would be some ideal interfaces for facilitating this? IS 202 – FALL 2003. 10. 09 - SLIDE 64

Discussion – Annie Yeh • What are the differences between DTDs and Schemas, and what are the pros and cons of each? IS 202 – FALL 2003. 10. 09 - SLIDE 65

Next Time • Metadata for Motion Pictures: MPEG-7 • Readings/Discussion – MPEG-7 (Part 1) (J. M. Martinez, R. Koenen, F. Pereira) – MPEG-7 (Part 2) (J. Martinez) IS 202 – FALL 2003. 10. 09 - SLIDE 66

IS 202 – FALL 2003. 10. 09 - SLIDE 67