An Introduction to XML
Instructors: Geoffrey Fox and Bryan Carpenter
Dept. of Computer Science
School of Computational Science and Information Technology
400 Dirac Science Library
Florida State University
Tallahassee, Florida 32306-4120
http://www.csit.fsu.edu
Nancy McCracken, Ozgur Balsoy
http://aspen.csit.fsu.edu/webtech/xml/
4/1/99 it2xml01 http://aspen.csit.fsu.edu/it2spring01

Outline of XML Introduction
• Overview of XML and its relationship to HTML and SGML
• XML as an object structure versus XML as "just a better HTML" for documents
• Basic XML and being well formed
• XML Prolog and Processing Instructions
• Namespaces
• What is a DTD and allowed declarations
• Content Models for Elements
• Entities: internal and external; general, character and parameter
• INCLUDE and IGNORE
• Attribute values and types
• NOTATIONS
• Unparsed Entities
• Example

Overview of HTML
• HTML = Hypertext Markup Language
  – the lingua franca of the World Wide Web
  – HTML is a simple language well suited for hypertext, multimedia and the display of small and reasonably simple documents
• HTML 2.0 spec completed in Nov 95
• HTML+ and HTML 3.0 never released
• HTML 3.2 (Jan 97) added tables, applets, and other capabilities (approximately 70 tags); this is what most people are familiar with today
• HTML 4.0 spec released in Dec 97
• XHTML (XML version of HTML 4.0) released January 2000, with refinements still being worked on.

Beyond HTML
• Limitations of HTML:
  – Extensibility: HTML does not allow users to specify their own tags or attributes in order to parameterize or otherwise semantically qualify their data.
  – Structure: HTML does not support the specification of deep structures needed to represent database schemas or object-oriented hierarchies.
  – Validation: HTML does not support the kind of language specification that allows applications to check data for structural validity when it is imported.

What is XML?
• XML = eXtensible Markup Language
• XML is a subset of Standard Generalized Markup Language (SGML), but unlike the latter, XML is specifically designed for the web
• Specification of W3C: http://www.w3.org/XML
• XML 1.0 in February 98, related specifications since then
• How XML fits into the new HTML world:
  – XML describes the logical structure of the document.
  – CSS (Cascading Style Sheets) or another style language describes the visual presentation of the document.
  – The DOM (Document Object Model) allows scripting languages, such as JavaScript, to access document objects.
  – DHTML (Dynamic HTML) allows a dynamic presentation of the document.

Logical vs. Visual Design
• The logical design of a document (content) should be separate from its visual design (presentation)
• Separation of logical and visual design
  – promotes sound typography
  – encourages better writing
  – is more flexible
• XML can be used to define the logical design, while XSL (Extensible Style Language) is used to define the visual design (usually by mapping XML into HTML).

What is SGML?
• SGML = Standard Generalized Markup Language
• An SGML document carries with it a grammar called a Document Type Definition (DTD). The DTD defines the tags and the meaning of those tags
• Presentation is governed by a style sheet written in the Document Style Semantics and Specification Language (DSSSL)
• Note that HTML is a fixed SGML application, a hardwired set of about 70 tags and 50 attributes, and does not need to have a DTD.

SGML Example
• A simple SGML document with embedded DTD:
    <!DOCTYPE DOCUMENT [
    <!ELEMENT DOCUMENT O O (p*, BIGP*)>
    <!ELEMENT p - O (#PCDATA)>
    <!ELEMENT BIGP - O (#PCDATA)>
    ]>
    <DOCUMENT>
    <p>Welcome to
    <BIGP>XML Style!
    </DOCUMENT>

SGML Example (cont'd)
• A corresponding DSSSL style sheet:
    <!DOCTYPE style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">
    (root (make simple-page-sequence))
    (element p (make paragraph))
    (element BIGP
      (make paragraph
        font-size: 24pt
        space-before: 12pt))

XML is SGML Lite
• Like HTML, XML derives from SGML, but XML is extensible (it is also a metalanguage), so the tags of an XML document are defined by an accompanying DTD rather than being hardwired. A document that is only required to be well formed may omit the DTD.
• XML is a compromise between the non-extensible, limited capabilities of HTML and the full power and complexity of SGML
• XML offers "80% of the benefits of SGML for 20% of its complexity"
  – XML designers tried to leave out all the SGML that would be rarely used on the web
  – Note that the XML specification is 30 pages and the SGML specification is 500 pages.
• XML allows you to define your own tags and to describe nested hierarchies of information.

XML Design Goals
• 1) XML shall be usable over the Internet
• 2) XML shall support a variety of applications
• 3) XML shall be compatible with SGML
• 4) It shall be easy to write programs that process XML documents
• 5) Optional features in XML shall be kept to the absolute minimum, ideally zero
• 6) XML documents should be human-legible and reasonably clear
• 7) Design of XML should be prepared quickly
• 8) Design of XML shall be formal and concise
• 9) XML documents shall be easy to create
• 10) Terseness in XML markup is of minimal importance

Features of XML I
• The documents are stored in plain text and thus can be transferred and processed anywhere.
• Inline reusability: documents can be composed of many pieces
• Unifying principles make it easily acceptable
  – "everything is a tree"
  – UNICODE for different languages
• XML documents enable several types of uses
  – traditional data processing: XML documents can be the data interchange medium
  – document-driven programming
  – archiving

Features of XML II
• It is important to remember that XML is a markup language, not a programming language. XSL can be viewed as a way of programming data whose structure is defined in XML
• The M in XML is Markup, reflecting its origin in the "publication" community, where markup specifies the layout of a document, the fonts to use, etc.
• XML's most important use is not this original one but specifying abstract data structures, equivalent to structures in C++, classes in Java or entity relationships in the database world

Origins of XML
• First draft of XML spec released by W3C in Nov 96 (four other drafts published in 1997)
• The first XML parser (written in Java) released by Microsoft in July 97
• Microsoft released version 1.8 of its XML parser (which supports XML 1.0) in Jan 98
• W3C finalized the XML 1.0 spec in Feb 98
• First XML-aware beta versions of Netscape and IE 5.0 released in June 98
• Sun announced Java Standard Extension for XML (XML API) in March 99
• W3C working drafts for extensions: 99/00

"Hello World!" in XML
• An XML document with external DTD:
    <?xml version="1.0"?>
    <!DOCTYPE greeting SYSTEM "hello.dtd">
    <greeting>Hello World!</greeting>
• An XML document with embedded DTD:
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE greeting [
    <!ELEMENT greeting (#PCDATA)>
    ]>
    <greeting>Hello World!</greeting>
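
As a rough illustration, the embedded-DTD version of the greeting can be read with the ElementTree parser from the Python standard library. This is only a sketch: ElementTree is a non-validating parser, so it accepts the internal DTD but does not check the document against it, and the helper name read_greeting is made up for this example.

```python
import xml.etree.ElementTree as ET

# The slide's second example: a document with an embedded (internal) DTD.
HELLO = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello World!</greeting>"""


def read_greeting(document: str) -> tuple:
    """Return (root element name, text content) of a small XML document.

    Hypothetical helper for illustration; the DTD is parsed but not enforced.
    """
    root = ET.fromstring(document)
    return root.tag, root.text


if __name__ == "__main__":
    print(read_greeting(HELLO))
```

The external-DTD version parses the same way with this parser, because a non-validating parser is not obliged to fetch hello.dtd.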

XML and Related Acronyms
• Document Type Definition (DTD), which defines the tags and their relationships
• Extensible Style Language (XSL) style sheets, which specify the presentation of the document
• Cascading Style Sheets (CSS), a less powerful presentation technology without tag-mapping capability
• XPATH, which specifies a location in a document
• XLINK and XPOINTER, which define link-handling details
• Resource Description Framework (RDF), document metadata
• Document Object Model (DOM), an API for converting the document to a tree object in your program for processing and updating
• Simple API for XML (SAX), a "serial access", fast-to-execute protocol for processing a document on the fly
• XML Namespaces, for an environment of multiple sets of XML tags
• XHTML, a definition of HTML tags for XML documents (which are then just HTML documents)
• XML Schema, which offers a more flexible alternative to DTD
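
The difference between the DOM and SAX styles listed above can be sketched with the xml.dom.minidom and xml.sax modules of the Python standard library. The tiny LINKS document and the function names are assumptions made for this illustration; both functions extract the same titles, one from an in-memory tree and one from a stream of parser events.

```python
import xml.dom.minidom
import xml.sax

# A small sample document, invented for this sketch.
DOC = b"<LINKS><LINK><TITLE>A</TITLE></LINK><LINK><TITLE>B</TITLE></LINK></LINKS>"


class TitleCollector(xml.sax.ContentHandler):
    """SAX style: react to events while the parser streams through the text."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def startElement(self, name, attrs):
        if name == "TITLE":
            self._in_title = True
            self.titles.append("")

    def characters(self, content):
        # Character data may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "TITLE":
            self._in_title = False


def titles_with_sax(data: bytes) -> list:
    handler = TitleCollector()
    xml.sax.parseString(data, handler)
    return handler.titles


def titles_with_dom(data: bytes) -> list:
    """DOM style: build the whole tree in memory, then query it."""
    tree = xml.dom.minidom.parseString(data)
    return [node.firstChild.data for node in tree.getElementsByTagName("TITLE")]
```

The SAX version never holds the whole document in memory, which is why it suits processing on the fly; the DOM version is more convenient when the tree must be updated.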

Document Type Definition
• The DTD specifies the logical structure of the document; it is a formal grammar describing document syntax and semantics
• The DTD does not describe the physical layout of the document; this is left to the style sheets and the scripts
• It is no mean task to write a DTD, so most users will adopt predefined DTDs (or can write an XML document without a DTD).
• DTDs can be written in separate files to facilitate re-use.
• Content providers, industries and other groups can collaborate to define sets of tags: the essence of "any" field (physics, music …) is captured in a domain-specific DTD
• XML Schema will tend to replace DTD and we will discuss both later

XML must be "well-formed"
• For the data contained in an XML document to be parsed correctly, its markup must be well-formed, meaning in part that properly nested and non-abbreviated start and end tags are used.
  – This well-formedness provides a well-defined encapsulation mechanism allowing designated sections of the data to be accessed programmatically.
• XML documents are made up of markup and character data
  – PCDATA (parsed character data) is obtained by parsing text and processing markup as necessary
• "markup" includes
  – Tags, Entity references, Character references, Comments, CDATA Section delimiters, DTD declarations and Processing Instructions
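
A well-formedness check can be sketched in a few lines with Python's ElementTree, which raises ParseError when the markup is not well formed. The function name is_well_formed is an assumption for this example, and the check says nothing about validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for sample in ("<a><b>x</b></a>",   # properly nested
                   "<a><b>x</a></b>",   # overlapping tags
                   "<p>unclosed"):      # abbreviated: end tag missing
        print(sample, is_well_formed(sample))
```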

Characters in XML
• We can choose the character encoding, such as UTF-8 (the default, in which ASCII characters occupy 8 bits), UTF-16 (16-bit Unicode character codes as used by Java) or even UCS-4, which offers 32 bits for each character. This is specified in the XML declaration in the document prolog.
• You can use character reference markup
  – &#x03C0; is the Unicode for π, wrapped in the &#x…; syntax for a 16-bit (4 hexadecimal symbols) character reference in Unicode (ISO/IEC 10646)
  – &#960; is also π, using the decimal form of the Unicode value
• One can use the five built-in entity references
  – &amp; for &
  – &apos; for '
  – &gt; for >
  – &lt; for <
  – &quot; for "
• We will later see how to define arbitrary entity references
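
The character and entity references above can be checked with a short sketch using Python's standard library: the parser expands the references, and xml.sax.saxutils.escape goes the other way when writing text out. The helper name text_of is made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_of(fragment: str) -> str:
    """Parse a one-element document and return its expanded character data."""
    return ET.fromstring(fragment).text


# Hexadecimal and decimal character references for the same character.
PI_HEX = text_of("<c>&#x03C0;</c>")
PI_DEC = text_of("<c>&#960;</c>")

# The five built-in entity references.
BUILT_IN = text_of("<c>&amp; &apos; &gt; &lt; &quot;</c>")

# Going the other way: escape markup characters before writing text out.
ESCAPED = escape("a < b & c")
```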

White Space in XML
• By default XML treats spaces, tabs, line feeds and carriage returns "just" as white space. Thus <greeting>Hello World!</greeting> and the same element with extra spaces or line breaks between Hello and World! are treated as identical
• This is similar to HTML. One can overrule this using the attribute xml:space with syntax
    <greeting xml:space="preserve">Hello World!</greeting>
• This attribute must be declared in the DTD with
    <!ATTLIST greeting xml:space (default|preserve) 'preserve'>
  – this defines element greeting to allow an attribute xml:space which can take the values default or preserve, with the latter as default
• If you specify xml:space, it holds not only for the given element but for all those contained within it.
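
In practice the parser hands all white space inside an element to the application, and it is the application that decides whether to collapse it. The sketch below, using Python's ElementTree, shows both views; the function names and the second sample document are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

A = "<greeting>Hello World!</greeting>"
B = "<greeting>Hello \n   World!</greeting>"  # extra space and a line break


def raw_text(document: str) -> str:
    """Character data exactly as the parser reports it."""
    return ET.fromstring(document).text


def normalized_text(document: str) -> str:
    """Collapse every run of white space to one space, as a default-mode
    application might choose to do."""
    return " ".join(raw_text(document).split())
```

An application honouring xml:space="preserve" would use the raw form; one applying the default treatment would compare the normalized form.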

XML Example
• Another example which could be used for URL exchanges between network-capable applications:
    <LINK>
      <TITLE>XML Recommendation</TITLE>
      <URL>http://www.w3.org/TR/REC-xml</URL>
      <DESCRIPTION>The official XML spec from W3C</DESCRIPTION>
    </LINK>

XML Example (cont'd)
• A document may have many such links:
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <?xml-stylesheet type="text/css" href="fred.css"?>
    <DOCUMENT>
      <LINKS>
        <LINK>…</LINK>
        …
      </LINKS>
    </DOCUMENT>
• Here we have also added the XML declaration and a processing instruction in the prolog.

XML Prolog and Processing Instructions
• Every XML file starts with the prolog, giving information about the document. The minimal prolog identifies it as an XML document:
    <?xml version="1.0"?>
• The prolog may also include the encoding and whether it is a standalone document:
    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
• If it is not standalone, it may specify external "entities" which may be named in the document, or an external DTD
• An XML file may also contain more general processing instructions for the application processing the document:
    <?target instructions?>
  where target is the name of the application.
• Only <?xml … ?> is understood by all XML processors
• Specification of a stylesheet by <?xml-stylesheet … ?> is common
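
How an application sees processing instructions can be sketched with xml.dom.minidom from the Python standard library. The sample document reuses the stylesheet instruction from the earlier example, and the function name is an assumption. Note that the <?xml … ?> declaration itself is consumed by the parser and is not reported as a processing instruction node.

```python
import xml.dom.minidom

DOC = (b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
       b'<?xml-stylesheet type="text/css" href="fred.css"?>\n'
       b'<DOCUMENT/>')


def processing_instructions(data: bytes) -> list:
    """Return (target, instructions) pairs found in the prolog."""
    tree = xml.dom.minidom.parseString(data)
    return [(node.target, node.data)
            for node in tree.childNodes
            if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]
```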

XML Prolog and Comments
• The Prolog can contain:
  – Processing Instructions
  – DTD Specifications (we have illustrated these and will discuss them in detail later)
  – Comments
• Comments have the same form anywhere in the XML document and are just like comments in HTML
    <!--This is the Prolog and <tag> Lousy Course</tag> is not treated as a tag-->
  – You cannot have -- inside comments, but <tag> </tag> is not treated as markup

XML tag structure
• In XML terminology, a pair of start and end tags is an element.
• XML documents must have a strict hierarchical structure.
  – All start tags must have an end tag.
  – Any element must be properly nested within another.
  – <LI> XML requires <B><I>proper nesting</I></B>. </LI> is well formed
  – <LI> XML requires <B><I>proper nesting</I></LI>. </B> would be rejected by an XML parser
• Empty tags are allowed as elements in XML documents.
  – An empty tag is a start and end tag together and is identified by a trailing / after the tag name. So in XHTML one uses <br/> for the empty break tag.
  – A start tag and end tag with nothing in between can also be considered an empty tag: <IMG SRC="face.gif"></IMG>
• XML tags are case-sensitive (<H1> is not the same as <h1>).
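
The two ways of writing an empty element, and the case sensitivity of tag names, can be observed with a small ElementTree sketch. The helper name describe is made up for this example.

```python
import xml.etree.ElementTree as ET


def describe(fragment: str) -> tuple:
    """Return (tag name, attributes, text) for a one-element document."""
    element = ET.fromstring(fragment)
    return element.tag, dict(element.attrib), element.text


# Both spellings of an empty element look the same to the application.
SHORT_FORM = describe("<br/>")
LONG_FORM = describe("<br></br>")

# An empty element can still carry attributes.
IMAGE = describe('<IMG SRC="face.gif"></IMG>')

# Tag names differing only in case are different names.
UPPER = describe("<H1>title</H1>")[0]
LOWER = describe("<h1>title</h1>")[0]
```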

Document is a Single Tree
• XML documents allow only one root element, so there is only one tree in each document.
• So it must be
    <?xml version="1.0"?>
    <rootoftree> ……… </rootoftree>
• And not
    <?xml version="1.0"?>
    <rootoftree> ……… </rootoftree>
    <rootoftree> ……… </rootoftree>

XML Attributes I
• Tags can have any number of attributes (which must be declared inside the DTD)
• All attribute values must be within single or double quotes.
    <FONT COLOR="#FF00CC"> quoted attribute </FONT>
• If you have a double quote inside an attribute value, then either
  – Use &quot; for the inside quote, as in quote="&quot;"
  – Enclose the attribute value in single quotes, as in quote='"'
• Each attribute can only appear once in a given element
• One can choose between
    <person name="Fox" role="teacher"></person>
  and
    <person><name>Fox</name><role>teacher</role></person>
• Note you can repeat elements, but you cannot repeat attributes, to represent multiple occurrences

XML Attributes II
• Note that with DTD (this changes with Schema), all element and attribute values are text, not numbers, and so must be "converted" by the application to the intended form
• So <item>weekdays<quantity>5</quantity></item> or <item quantity="5">weekdays</item>
  – returns the string "5", not the number 5, for quantity
• xml:lang is a useful attribute (in the xml Namespace) which can be used (as always, if declared in the DTD or Schema as an allowed attribute) to specify language
  – <text xml:lang="en">Good English</text>
    <text xml:lang="x-youth">Coolio, Wax On, Wax Off, Dude</text>
  – xml:lang can take values from an official vocabulary (such as en above, which is from ISO 639) or your private code starting with x
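
The conversion step can be sketched as follows, using Python's ElementTree. The function name quantity is an assumption; it accepts either of the two encodings shown above and performs the string-to-number conversion that the parser does not do for you.

```python
import xml.etree.ElementTree as ET

ITEM_AS_ATTRIBUTE = '<item quantity="5">weekdays</item>'
ITEM_AS_CHILD = "<item>weekdays<quantity>5</quantity></item>"


def quantity(document: str) -> int:
    """Read quantity from an attribute or a child element and convert it."""
    item = ET.fromstring(document)
    raw = item.get("quantity")             # a str such as "5", or None
    if raw is None:
        raw = item.findtext("quantity")    # text of the <quantity> child
    return int(raw)                        # the application converts


# What the parser itself returns is always text.
RAW_VALUE = ET.fromstring(ITEM_AS_ATTRIBUTE).get("quantity")
```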

XML Names and NMTOKEN
• Name Characters are letters, digits, hyphens, underscores, colons or full stops.
• An NMTOKEN is any collection of Name Characters
• NMTOKENS is any list of NMTOKENs separated by white space (space, tab, newline etc.)
• Case is significant: PERSON and person are distinct names
• Attribute and Element names must be (a subset of) NMTOKEN with restrictions
  – Names cannot begin with a digit (nor with a hyphen or full stop)
  – Names cannot begin with xml (or any variant obtained by case changes); the system reserves this prefix
• Colons are ONLY to be used in Namespaces (currently an informal rule only)

CDATA Sections
• CDATA sections allow you to include unparsed characters in a document
    <![CDATA[ <ignored>Anything </ignored> ]]>
• In this example the ignored tag is not processed by the XML parser
• Unfortunately you must guarantee that there is no ]]> string in the text between <![CDATA[ and ]]>
    <script language="JavaScript">
    <![CDATA[
    var fred = 0;
    if (fred < 10) {
      document.writeln("> and < here are NOT parsed");
    }
    ]]>
    </script>
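
The effect of a CDATA section can be seen by parsing the same script text with and without it, here with Python's ElementTree. The names below are assumptions for this sketch; the version without CDATA fails because the bare < is taken as the start of markup.

```python
import xml.etree.ElementTree as ET

WITH_CDATA = """<script language="JavaScript"><![CDATA[
if (fred < 10) { document.writeln("> and < here are NOT parsed"); }
]]></script>"""

WITHOUT_CDATA = "<script>if (fred < 10) { }</script>"


def script_text(document: str):
    """Return the element's character data, or None if it does not parse."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None
```

The application receives the contents of the CDATA section as ordinary text; the delimiters themselves are not part of it.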

XML Namespaces I
• This is an extension to XML adopted January 1999 at http://www.w3.org/TR/1999/REC-xml-names-19990114/
• Namespaces address the problem that an element name cannot be declared twice with different structures
• Suppose you had a DTD with <student> and <teacher> and you wanted to write
    <student><name>you</name></student>
    <teacher><name>me<special>Prof</special></name></teacher>
• This is invalid unless <name> is identical in structure for both teacher and student, as each element in the tree must have a unique structure.
• We can get round it by using <studentname> and <teachername>, but this is not so satisfactory, especially if you get this conflict by joining two different sets of tags together
  – This is seen in XHTML, where you could add MathML, SMIL, SVG tags ….

XML Namespaces II
• So we use the new syntax xmlns="http://aspen.csit.fsu.edu/namespaces/university.dtd" to define an XML Namespace
• The value of xmlns is hopefully a useful URL telling you about the tags. However this is not required.
  – Microsoft in its cunning way uses in Office web export:
    <xml xmlns:v="urn:schemas-microsoft-com:vml"
         xmlns:o="urn:schemas-microsoft-com:office"
         xmlns:p="urn:schemas-microsoft-com:office:powerpoint">
  – And teaches Internet Explorer to understand these obscure "uniform resource names" for the VML, Office and PowerPoint Namespaces respectively.
• xmlns is an attribute which can be used in any element (depending on the parser, you may need to declare this as an allowed attribute in the DTD)
    <student xmlns="studentdtd"><name> ….

XML Namespaces III
• And when we come to teacher, use
    <bigboss:teacher xmlns:bigboss="teacherdtd"><bigboss:name> ….
• In the above, we made student elements the default
• We can more symmetrically write
    <university xmlns:bigboss="teacherdtd" xmlns:downtrodden="studentdtd">
      <downtrodden:student><downtrodden:name>you</downtrodden:name></downtrodden:student>
      …….
      <bigboss:teacher><bigboss:name>me</bigboss:name></bigboss:teacher>
    </university>
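
A namespace-aware parser resolves each prefix to the namespace name it was bound to, so the two name elements stay distinct. The sketch below uses Python's ElementTree with the placeholder namespace names from the slide (teacherdtd, studentdtd); the short prefixes t and s in the lookup table are arbitrary and need not match those in the document.

```python
import xml.etree.ElementTree as ET

UNIVERSITY = """<university xmlns:bigboss="teacherdtd" xmlns:downtrodden="studentdtd">
  <downtrodden:student><downtrodden:name>you</downtrodden:name></downtrodden:student>
  <bigboss:teacher><bigboss:name>me</bigboss:name></bigboss:teacher>
</university>"""

# Prefixes used for searching; only the namespace names have to match.
NS = {"t": "teacherdtd", "s": "studentdtd"}


def names(document: str) -> dict:
    """Look up the student's and the teacher's name elements separately."""
    root = ET.fromstring(document)
    return {
        "student": root.findtext("s:student/s:name", namespaces=NS),
        "teacher": root.findtext("t:teacher/t:name", namespaces=NS),
    }


# ElementTree reports qualified names in the form {namespace}localname.
FIRST_CHILD_TAG = ET.fromstring(UNIVERSITY)[0].tag
```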

Document Type Definition
• A powerful feature of XML that provides a formal set of rules to define a document structure
• Defines the elements that may be used, and dictates where they may be applied in relation to each other; therefore specifies the document hierarchy and granularity
• Comprises a set of declarations that define a document structure tree
• Declarations are stored either at the top of each document that must conform to the rules or, alternatively and more usually, in separate data files referred to by a special instruction at the top of each document.
• Although formally optional, it is required by many XML tools
• Schemas are in many ways more elegant, but DTD will teach us the syntax of XML!

Document Type Definition
• Each DTD element must either be a container element, or be empty (a placeholder). Container elements may contain text, child elements, or a mixture of both.
• The DTD also specifies the names of attributes, and dictates which elements they may appear in. For each attribute it specifies whether it is optional or required.
  – It gives the list of possible values for an enumerated attribute
• Comparing XML with Java, a DTD corresponds to the class and an XML file to an object (an instance of a class)
• Files that obey XML syntax rules are well formed
• Files consistent with a DTD are valid
• One can "punt" and specify ANY for the document structure
  – This implies the file can have any elements and tags and there is no validation needed, but all elements still need to be declared (see later example)

DTD definitions
• A DTD allows you to create new tags by writing grammar rules which the tags must obey. The rules specify which tags and attributes are valid and their context.
  – You specify the order and number of times each element can appear
• A DTD element declaration looks like:
    <!ELEMENT person (name, email*)>
  – ELEMENT is the declaration type
  – person is the name of the element being declared
  – (name, email*) is the element content model
  – name and email are the children of person and define the hierarchy of the document. email must follow name in the file
  – Note that this is called a grammar rule because it could have been written in BNF: person ::= (name, email*)

Document Type Definition I
• A DTD consists of a set of declarations
• Each declaration must use the markup format <!…>, and can only use one of the following keywords:
  – ELEMENT (tag definition)
  – ATTLIST (attribute definitions)
  – ENTITY (entity definition)
  – NOTATION (data type notation definition)
• Comments may also appear, in the same <!-- … --> format as already described
• The declarations should appear inside a <!DOCTYPE document type declaration, which is at its simplest
    <!DOCTYPE Rootname [
    DTD Declarations starting with one for element Rootname
    ]>

Document Type Definition II
• Notice that the root element of the document must have the same name, Rootname, as the first argument of the DOCTYPE declaration
• There are several types of DTD:
• Internal:
    <!DOCTYPE Rootname [DTD Declarations]>
• External, in one of two forms:
    <!DOCTYPE Rootname SYSTEM URL>
    <!DOCTYPE Rootname PUBLIC Identifier URL>
• And Mixed:
    <!DOCTYPE Rootname SYSTEM URL [DTD Local Declarations]>
    <!DOCTYPE Rootname PUBLIC Identifier URL [DTD Local Declarations]>
• Normally one uses the SYSTEM type of external DTD

XML <LINKS> Example of External DTD File
• The URL can be a file "something.dtd" in the same directory as the XML file, which is a typical relative address, or a full URL such as "http://aspen.csit.fsu.edu/EXTRNL.DTD"
• Here is a DTD for the earlier example of a tree DOCUMENT with the ability to define <LINKS>:
    <!ELEMENT DOCUMENT (LINKS)>
    <!-- Any Comment with usual syntax -->
    <!ELEMENT LINKS (LINK)*>
    <!ELEMENT LINK (TITLE, URL, DESCRIPTION)>
    <!ELEMENT TITLE (#PCDATA)>
    <!ELEMENT URL (#PCDATA)>
    <!ELEMENT DESCRIPTION (#PCDATA)>
• PCDATA stands for "parsed character data"
• Note the external file starts with an <!ELEMENT declaration and does not have a <!DOCTYPE declaration

XML Example using LINKS DTD
• Now store this DTD in a file (links.dtd) and write an XML document based on this DTD as follows:
    <?xml version="1.0"?>
    <!DOCTYPE DOCUMENT SYSTEM "links.dtd">
    <DOCUMENT>
      <LINKS>
        <LINK>…</LINK>
        …
      </LINKS>
    </DOCUMENT>
• This is an instance (object) based on the class defined by the DOCUMENT DTD
• Instance and "class definition" (links.dtd) are stored in the same directory

Document Type Definition Summary
• Declarations are grouped within a DTD and can be fully contained in the file, as below
    <!DOCTYPE Rootname [
    <!--The DTD for tree Rootname appears here e.g. -->
    <!ELEMENT person (name, email*, link?)>
    ……….
    ]>
• One can store a DTD in a separate file with the syntax <!DOCTYPE Rootname SYSTEM URL> where URL is an absolute or relative location. Examples are:
    <!DOCTYPE Rootname SYSTEM "EXTRNL.DTD">
    <!DOCTYPE Rootname SYSTEM "http://aspen.csit.fsu.edu/EXTRNL.DTD">
• In mixed format
    <!DOCTYPE MYDTD SYSTEM "EXTRNL.DTD" [
    <!-- Some of MYDTD appears here augmenting or modifying declarations in external file -->
    <!ELEMENT person (name, email*, link?)>
    ]>
  formally you can modify a declaration in the external file (i.e. the internal declaration takes precedence, which holds for entity and attribute declarations), but not all XML parsers allow this

Examples of Official DOCTYPE Declarations
• W3C asks you to use for XHTML:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
• They say that documents using the MathML DTD should contain a doctype declaration of the form:
    <!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN"
      "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd">
  The URL (URI) may be changed to that of a local copy of the DTD if required. So an alternative is:
    <!DOCTYPE math SYSTEM "mathml2.dtd">
• If a namespace prefix is being used, so that for example the document element is:
    <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> ... </mml:math>
  then the prefix must be declared in the local subset of the DTD, as follows:
    <!DOCTYPE mml:math PUBLIC "-//W3C//DTD MathML 2.0//EN"
      "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" [
    <!ENTITY % MATHML.prefixed "INCLUDE">
    <!ENTITY % MATHML.prefix "mml">
    ]>

Public Identifier in DOCTYPE
• If you use the PUBLIC keyword in the DOCTYPE declaration, then it is followed by an FPI or formal public identifier of the form standard//group//type//language
• standard is - if you are defining it; + if approved by a non-standards body; and the name of the standard if it exists
• group is the name of the person or group that invented the DTD, e.g. Geoffrey Fox or W3C
• type represents the name of the DTD, including a version number
• language is a 2-character language abbreviation
• This is exemplified on the previous foil by PUBLIC "-//W3C//DTD MathML 2.0//EN"
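
The pieces of a DOCTYPE declaration can be inspected with the expat parser from the Python standard library, whose StartDoctypeDeclHandler callback reports the root name, system identifier and public identifier. This is a sketch only: the function names and the field labels of the returned dictionaries are invented for this example, and the external DTD is not fetched.

```python
import xml.parsers.expat

XHTML = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
         '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html/>')


def read_doctype(document: str) -> dict:
    """Collect what the DOCTYPE declaration says about the document."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(root=name, system=system_id, public=public_id,
                     internal=bool(has_internal_subset))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found


def split_fpi(public_id: str) -> dict:
    """Split a formal public identifier into its four // separated fields."""
    standard, group, dtd_type, language = public_id.split("//")
    return {"standard": standard, "group": group,
            "type": dtd_type, "language": language}
```

Applied to the XHTML declaration, split_fpi separates the leading - (self-defined), the group W3C, the DTD name with its version, and the language code EN.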

Element Declarations: EMPTY
• Keyword ELEMENT introduces a new element
    <!ELEMENT NAME CONTENT_MODEL>
• The element name must begin with a letter, and may additionally contain digits and some punctuation, i.e. '.', '-', '_' and ':', as we described earlier under NMTOKEN
• If an element can hold no child elements, and also no text, then it is known as an empty element and denoted by EMPTY for CONTENT_MODEL
  – This seems trivial but it isn't, because the presence or absence of this element in an XML file can be used as a flag
  – As an example we can find several in HTML, such as HR and IMG, which never have children and include no text. Here we would write <!ELEMENT HR EMPTY> and then <HR/> or <HR></HR> generates a horizontal line
• EMPTY elements can have attributes, such as the SRC attribute in <IMG/> to specify the source of the image.

Element Declarations: ANY
• An element declared to have a content of ANY may contain all of the other elements declared in the DTD
• This is not quite the same as no DTD for the file
    <!DOCTYPE fred [
    <!ELEMENT fred ANY>
    ]>
    <fred>
      <people>Me and You</people>
      <people>Them</people>
    </fred>
• This gets an error due to the presence of the <people> tag
• Adding <!ELEMENT people ANY> inside the DTD declaration produces a valid document.
• Go to http://www.stg.brown.edu/service/xmlvalid and paste files into their textbox to see this

Element Declaration Content Model I
• <!ELEMENT elementname Content_Model >
• The Content_Model is either a collection of child elements, or parsed character data, or a mixture
• A Content_Model is bounded by parentheses and contains at least one token
• When a Content_Model contains more than one content token, the child elements are controlled using two logical connectors: the sequence connector ',' and the choice connector '|'
• <!ELEMENT element1 (a, b, c)> indicates that a is followed by b, which in turn is followed by c
• <!ELEMENT element2 (a | b | c)> indicates that exactly one of a, b or c appears
• Combinations are possible: (a, b, (c|d)) or ((a, b, c) | d)

Element Declaration Content Model II
• Quantity indicators can also be used
– '?' indicates an element is optional and cannot repeat
– '+' indicates an element is required and may repeat
– '*' indicates an element is optional and may repeat
• Document text is indicated by the keyword #PCDATA (Parsed Character Data)
  <!ELEMENT emph (#PCDATA|sub|super)*>
  <!ELEMENT sub (#PCDATA)>
  <!ELEMENT super (#PCDATA)>
  <emph>H<sub>2</sub>O is water.</emph>
• Note that if no quantity indicator is given, the element MUST appear exactly once, and if the sequence connector ',' is used, the order must be preserved
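Element-only content models behave much like regular expressions over the sequence of child names. The sketch below makes that analogy concrete by translating a model into a Python regex; it is an illustration only (it does not handle #PCDATA, EMPTY or ANY, and it is not how a real validating parser is implemented). The function names are hypothetical.

```python
import re

# Tokens we rewrite: whitespace, ',', '(' and element names.
# ')', '|', '?', '*' and '+' already mean the same thing in a regex.
_TOKEN = re.compile(r"\s+|,|\(|[A-Za-z_:][\w.:-]*")


def model_to_regex(model: str) -> str:
    """Translate an element-only DTD content model into a regex source string."""
    def translate(match: re.Match) -> str:
        token = match.group()
        if token == "(":
            return "(?:"                      # non-capturing group
        if token == "," or token.isspace():
            return ""                         # sequence is plain concatenation
        return "(?:" + re.escape(token) + " )"  # a name, terminated by a space
    return _TOKEN.sub(translate, model)


def children_match(model: str, children: list) -> bool:
    """Check a list of child element names against a content model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(model_to_regex(model), sequence) is not None


print(children_match("(a, b, (c|d))", ["a", "b", "d"]))
print(children_match("(name, email*, link?)", ["email", "name"]))
```

Each child name is followed by a space in both the pattern and the input, so a name such as a can never accidentally match the start of a longer name such as ab.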

Element Declaration Content Model III
• Cleanest is to use either #PCDATA on its own or a general specification of multiple child elements
• Mixed content models cannot specify limits on occurrences. For instance:
  <!DOCTYPE fred [
    <!ELEMENT fred (people)+ >
    <!ELEMENT people (#PCDATA | name)+ >
    <!ELEMENT name (#PCDATA) >
  ]>
  <fred>
    <people>Me and You<name>Fox</name></people>
    <people><name>Bryan</name>Them</people>
  </fred>
• is illegal, because a mixed content model may only be followed by '*'. I must use <!ELEMENT people (#PCDATA | name)* >, but then I have no constraints on the number of occurrences of #PCDATA strings or of <name> children
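To see what mixed content looks like once parsed, here is a small sketch with Python's xml.etree.ElementTree, which stores text before the first child in .text and text after a child in that child's .tail. The helper mixed_parts is a hypothetical name introduced for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<fred>"
    "<people>Me and You<name>Fox</name></people>"
    "<people><name>Bryan</name>Them</people>"
    "</fred>"
)


def mixed_parts(element: ET.Element) -> list:
    """Flatten mixed content into (kind, value) pairs in document order."""
    parts = []
    if element.text:
        parts.append(("text", element.text))
    for child in element:
        parts.append((child.tag, child.text))
        if child.tail:
            parts.append(("text", child.tail))
    return parts


first, second = list(doc)
print(mixed_parts(first))
print(mixed_parts(second))
```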

Element Declaration Content Model IV
  <!DOCTYPE fred [
    <!ELEMENT fred (people)+ >
    <!ELEMENT people (comment | name)+ >
    <!ELEMENT comment (#PCDATA) >
    <!ELEMENT name (#PCDATA) >
  ]>
  <fred>
    <people><comment>Me and You</comment><name>Fox</name></people>
    <people><name>Bryan</name><comment>Them</comment></people>
  </fred>
• This is valid. To be more precise, you can replace the third line by
  <!ELEMENT people ((comment, name) | (name, comment)) >
  and require one name and one comment in either order
• Either DTD will give an error if I add <people></people> before </fred>

DTD’s and Namespaces I • DTD’s and Namespaces are a little confusing as it

DTD’s and Namespaces I • DTD’s and Namespaces are a little confusing as it is not clear how much about namespaces are understood by parser. • Best is to use something like <university xmlns: bigboss=“teacherdtd” xmlns: downtrodden=“studentdtd” > with teacherdtd, studentdtd as conventional DTD’s without any special prefixes • However this will lead to errors for parsers that do not understand Namespaces. In this case you need you make explicit the namespace: prefixes in DTD and allow attribute 4/1/99 it 2 xml 01 http: //aspen. csit. fsu. edu/it 2 spring 01 50

DTDs and Namespaces II
  <!DOCTYPE jim:fred [
    <!ELEMENT jim:fred (jim:people)* >
    <!ATTLIST jim:fred xmlns:jim CDATA #FIXED "http://www.csit.fsu.edu" >
    <!ELEMENT jim:people (jim:comment | jim:name)* >
    <!ELEMENT jim:comment (#PCDATA) >
    <!ELEMENT jim:name (#PCDATA) >
  ]>
  <jim:fred xmlns:jim="http://www.csit.fsu.edu" >
  </jim:fred>
• This is an example of an always valid use of namespace prefixes
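The difference between what the DTD sees and what a namespace-aware parser sees can be sketched with Python's xml.etree.ElementTree. The DTD works purely on the literal prefixed name jim:fred, while the namespace-aware parser reports the element under its namespace URI. Note that ElementTree does not validate, so this only illustrates the name resolution.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE jim:fred [
  <!ELEMENT jim:fred (jim:people)*>
  <!ATTLIST jim:fred xmlns:jim CDATA #FIXED "http://www.csit.fsu.edu">
  <!ELEMENT jim:people (jim:comment | jim:name)*>
  <!ELEMENT jim:comment (#PCDATA)>
  <!ELEMENT jim:name (#PCDATA)>
]>
<jim:fred xmlns:jim="http://www.csit.fsu.edu"></jim:fred>"""

root = ET.fromstring(document)

# ElementTree writes qualified names as {namespace-uri}localname,
# so the prefix "jim" itself no longer appears in the tag.
print(root.tag)
```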

Entities I
• The DTD of an XML document can contain entity declarations. These are like macro substitutions in other languages
• ENTITYs are defined in the DTD and come in several flavors:
• General Entities are referenced as &EntName;
• Parameter Entities are referenced as %EntName;
• We have already seen the character entities
– &amp;amp; for &
– &amp;apos; for '
– &amp;gt; for >
– &amp;lt; for <
– &amp;quot; for "
• These are built in, but you could add other such entities with
– <!ENTITY aitself "A" > and &aitself; would be replaced by A
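The five built-in entities can be seen in both directions with Python's standard library: the parser replaces them on input, and xml.sax.saxutils.escape produces them on output. This is a small sketch; the sample string is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the parser replaces the five predefined entity references.
element = ET.fromstring("<c>&amp; &lt; &gt; &apos; &quot;</c>")
print(element.text)

# Writing: escape() always handles &, < and >; quotes are opt-in
# via the second argument.
raw = 'Fish & Chips <"hot">'
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)
```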

General Entities II
• As another example, I can use in the DTD
  <!ENTITY TODAY "12 January 2001" >
  and
  <comment>&TODAY; was very quiet in CSIT</comment>
  is parsed as
  <comment>12 January 2001 was very quiet in CSIT</comment>
• General Entity references can be nested inside a DTD, e.g. one can write
  <!ENTITY YEAR "2001" >
  <!ENTITY TODAY "12 January &YEAR;" >
• However, one must use Parameter Entities and not General Entities for macro substitution in other DTD declarations like <!ATTLIST and <!ELEMENT
• Parameter entities are defined as in
  <!ENTITY % CUSTARDTAGS "(NAME, DATE, ORDERS)" >
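The nested TODAY/YEAR example can be run through a real parser. This sketch uses Python's xml.etree.ElementTree, which expands general entities declared in the internal DTD subset while parsing.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE comment [
  <!ENTITY YEAR "2001">
  <!ENTITY TODAY "12 January &YEAR;">
]>
<comment>&TODAY; was very quiet in CSIT</comment>"""

comment = ET.fromstring(document)

# &TODAY; is expanded first, and the &YEAR; inside it in turn.
print(comment.text)
```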

Nested Entity Example
• An entity declaration specifies replacement text for the entity, including some macro-preprocessing capability
  <!ENTITY pub "&#xC9;ditions Gallimard">
  <!ENTITY rights "All rights reserved">
  <!ENTITY book "La Peste: Albert Camus, &#xA9; 1947 &pub;. &rights;">
• A reference to &book; then expands to the text
  La Peste: Albert Camus, © 1947 Éditions Gallimard. All rights reserved
  where &#xA9; gives the copyright symbol and &#xC9; gives the accented E

General External Entities I
• These allow one to insert not just text but complete files. The simplest syntax is <!ENTITY Entityname SYSTEM URL >, with for example
  <?xml version="1.0" standalone="no" ?>
  ………
  <!ENTITY TODAY SYSTEM "date.txt" >
  where date.txt just contains 12 January 2001
• You can even put an entire document in a file contents.xml and write
  <?xml version="1.0" standalone="no" ?>
  <!DOCTYPE treename [
    ….
    <!ENTITY REALSTUFF SYSTEM "contents.xml" >
    ….
  ]>
  <treename>
    &REALSTUFF;
  </treename>

More on Entities
• One can also use the syntax we introduced for DOCTYPE, namely <!ENTITY Entityname PUBLIC FPI URL >
• Here the Formal Public Identifier FPI has the four fields described earlier
• Finally we describe parameter entities, which can be used for the real meat of a DTD. These are just as before, except that there is an extra % in the definition and they are referenced as %Entityname;
  <!ENTITY % Entityname Definition >
  <!ENTITY % Entityname SYSTEM URL >
  <!ENTITY % Entityname PUBLIC FPI URL >
• These are very useful if you have multiple ELEMENTs with related specifications

Parameter Entity Example
  <!ENTITY % peopletags "(firstname, lastname, dateofbirth)" >
  <!ELEMENT student %peopletags; >
  <!ELEMENT teacher %peopletags; >
  <!ELEMENT administrator %peopletags; >
• This defines a group of ELEMENTs that are people to have the same child elements
• Parameter entities are even more commonly used for attributes, because almost always several ELEMENTs share the same attributes (often with a basic set being augmented in different ways for different ELEMENTs)
– This basic set can be defined in a parameter entity
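The macro-substitution idea behind parameter entities can be mimicked with plain text replacement. The sketch below is a toy illustration only: it handles just internal parameter entities written with double quotes and is not how an XML processor actually reads a DTD. The function name expand_parameter_entities is hypothetical.

```python
import re

# Matches declarations such as: <!ENTITY % name "replacement text">
ENTITY_DECL = re.compile(r'<!ENTITY\s+%\s+(\w+)\s+"([^"]*)"\s*>')
# Matches references such as: %name;
ENTITY_REF = re.compile(r"%(\w+);")


def expand_parameter_entities(dtd: str) -> str:
    """Textually substitute %name; references with their declared text."""
    entities = dict(ENTITY_DECL.findall(dtd))
    body = ENTITY_DECL.sub("", dtd)  # drop the declarations themselves
    return ENTITY_REF.sub(lambda m: entities[m.group(1)], body).strip()


dtd = """<!ENTITY % peopletags "(firstname, lastname, dateofbirth)">
<!ELEMENT student %peopletags;>
<!ELEMENT teacher %peopletags;>"""

expanded = expand_parameter_entities(dtd)
print(expanded)
```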

Use of INCLUDE and IGNORE
• One can write in a DTD
  <![INCLUDE[ Normal DTD Declarations ]]>
  or
  <![IGNORE[ Normal DTD Declarations ]]>
  or
  <!ENTITY % ignorer "IGNORE" >
  ………………….
  <![%ignorer;[ Normal DTD Declarations ]]>
• This technique allows one to divide a DTD into modules and select those to be included with a set of parameter entity declarations

Attributes
• The rules for attribute declarations follow a similar structure to those for elements, as in the following example
  <!ELEMENT person %persontags; >
  <!ATTLIST person gender (male|female) #IMPLIED >
– ATTLIST is the declaration type
– person is the element name
– gender is the attribute name
– (male|female) #IMPLIED is the attribute definition
• In general the syntax is
  <!ATTLIST ELEMENT_NAME
     ATTRIBUTE_NAME1 TYPE1 DEFAULT_VALUE1
     ATTRIBUTE_NAME2 TYPE2 DEFAULT_VALUE2
     ……………. >
• We now describe the last two fields

Attribute DEFAULT_VALUEs
• The DEFAULT_VALUE keywords following an attribute type can be
• #IMPLIED: the attribute is optional and there is no default value
• "…………": the string in quotes is the default value for the attribute
• #REQUIRED: the attribute is required but there is no default value
• #FIXED "….": the attribute is assigned the fixed value that follows the #FIXED keyword
– If the attribute is NOT set in the XML file, it is generated automatically with the fixed value
– If the attribute is set in the XML file, the parser generates an error unless the value set is equal to the fixed value
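Default values are filled in by the parser, which can be observed with Python's xml.etree.ElementTree. This is a sketch under stated assumptions: the action attribute is a made-up addition to show #IMPLIED, and since ElementTree does not validate, the #REQUIRED and #FIXED error cases are not demonstrated.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE form [
  <!ATTLIST form method (GET | POST) "POST"
                 action CDATA #IMPLIED>
]>
"""

defaulted = ET.fromstring(DTD + "<form/>")
explicit = ET.fromstring(DTD + '<form method="GET"/>')

# The declared default appears when the attribute is omitted;
# an #IMPLIED attribute that is omitted simply stays absent.
print(defaulted.get("method"), defaulted.get("action"))
print(explicit.get("method"))
```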

Attribute Types I
• CDATA type is character data: any text string, with the markup characters < and & written as entity references
  <!ATTLIST form method CDATA #FIXED 'POST'>
• Enumerated type is a list of possible values, each of which must be a legal XML name token
  <!ATTLIST form method (GET | POST) 'POST' >
– Note that in the enumeration one does NOT need quotes around the values, but quotes are needed when specifying the default value
– Note that the type is the list itself and not a keyword ENUMERATED
• Less important types are NMTOKEN and NMTOKENS, which restrict CDATA values to strings that contain only XML name characters (or, for NMTOKENS, a white-space-separated set of such strings)
  <!ATTLIST form method NMTOKEN 'POST'
                 possiblemethods NMTOKENS 'POST GET' >

Attribute Types II
• Any ELEMENT is allowed at most one attribute of type ID, and within any document all values of such attributes must be distinct
• An ID must be a valid XML name and so canNOT begin with a digit. However ID="X123456" is allowed
  <!ATTLIST CUSTOMER CUSTOMER_ID ID #REQUIRED>
  ……..
  and in the XML file one uses
  <CUSTOMER CUSTOMER_ID="X123456">Sucker</CUSTOMER>
• An attribute of type IDREF is required to have a value that matches an ID attribute within the same document
  <!ATTLIST BUGS CUSTOMER_SOURCE IDREF #REQUIRED>
  and in the XML file one uses
  <BUGS CUSTOMER_SOURCE="X123456">PC Caught Fire</BUGS>
• Often ID and IDREF values are mumbo jumbo generated by control software

Attribute Types III
• One can define an attribute to have type ENTITY when its value must be the name of an ENTITY defined in the DTD (strictly, an unparsed entity declared with NDATA, as described under Unparsed Entity Declarations)
  <!ENTITY image1 SYSTEM "beauty.gif" >
  <!ENTITY image2 SYSTEM "beast.jpeg" >
  <!ATTLIST views picture ENTITY #REQUIRED
                  allowedpictures ENTITIES 'image1 image2' >
  ……..
  <views picture='image2' allowedpictures='image2'>Out of Focus</views>
• The attribute type ENTITIES is a list of white-space-separated ENTITY names
• The final attribute type is NOTATION, but we must first define the NOTATION declaration

Notations and use in Attributes
• Notations specify the format of non-XML data and have the syntax
  <!NOTATION Name SYSTEM "External_ID" >
  or
  <!NOTATION Name PUBLIC FPI "External_ID" >
• Here External_ID is something like a MIME type, with an example DTD fragment:
  <!NOTATION GIF SYSTEM "image/gif" >
  <!NOTATION JPEG SYSTEM "image/jpeg" >
  <!ATTLIST STUDENT imageurl CDATA #REQUIRED
                    image_type NOTATION (GIF|JPEG) #IMPLIED>
• This is used in an XML file with syntax like
  <STUDENT imageurl="postcard.gif" image_type="GIF" >

Unparsed Entity Declarations
• These are specified as
  <!ENTITY NAME SYSTEM VALUE NDATA TYPE>
  or
  <!ENTITY NAME PUBLIC FPI VALUE NDATA TYPE>
• NAME is the name to be given to the unparsed external entity
• SYSTEM or PUBLIC FPI have the roles described earlier for external entities
• VALUE is the value of the entity, such as the URL of an external file
• NDATA signifies unparsed
• TYPE is any declared NOTATION
• A typical example in a DTD would be
  <!NOTATION GIF SYSTEM "image/gif" >
  <!ENTITY IMAGE1 SYSTEM "image.gif" NDATA GIF >
  <!ATTLIST STUDENT IMAGE ENTITY #IMPLIED>
  ]>
  ……..
  and used in the XML file as
  <STUDENT IMAGE="IMAGE1" >
• This is a (rather clumsy) way of including "binary" (non-XML) format data in a document

XML Example - the DTD
• Create a DTD file for an address book named "ab.dtd"
  <!ELEMENT addressBook (person)+>
  <!ELEMENT person (name, email*, link?) >
  <!ATTLIST person id ID #REQUIRED >
  <!ATTLIST person gender (male|female) #IMPLIED >
  <!ELEMENT name (#PCDATA | family | given)*>
  <!ELEMENT family (#PCDATA)>
  <!ELEMENT given (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT link EMPTY >
  <!ATTLIST link manager IDREF #IMPLIED
                 subordinates IDREFS #IMPLIED >

XML Example - the XML document
  <?xml version="1.0"?>
  <!DOCTYPE addressBook SYSTEM "ab.dtd">
  <addressBook>
    <person id="B.WALLACE" gender="male">
      <name>
        <family>Wallace</family> <given>Bob</given>
      </name>
      <email>bwallace@megacorp.com</email>
      <link manager="C.TUTTLE"/>
    </person>
    <person id="C.TUTTLE" gender="female">
      <name>
        <family>Tuttle</family> <given>Claire</given>
      </name>
      <email>ctuttle@megacorp.com</email>
      <link subordinates="B.WALLACE"/>
    </person>
  </addressBook>
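The ID and IDREF rules for the address book can be checked by hand with a few lines of Python. This is a sketch only: xml.etree.ElementTree does not read the external DTD, so the code hard-wires the assumption that person/@id is the ID attribute and that link/@manager and link/@subordinates are IDREF and IDREFS. The function name check_references is hypothetical, and the XML below is a well-formed copy of the address book without its DOCTYPE line.

```python
import xml.etree.ElementTree as ET

ADDRESS_BOOK = """<addressBook>
  <person id="B.WALLACE" gender="male">
    <name><family>Wallace</family> <given>Bob</given></name>
    <email>bwallace@megacorp.com</email>
    <link manager="C.TUTTLE"/>
  </person>
  <person id="C.TUTTLE" gender="female">
    <name><family>Tuttle</family> <given>Claire</given></name>
    <email>ctuttle@megacorp.com</email>
    <link subordinates="B.WALLACE"/>
  </person>
</addressBook>"""


def check_references(root: ET.Element) -> list:
    """Return a list of ID/IDREF problems (empty if none are found)."""
    ids = [person.get("id") for person in root.iter("person")]
    problems = []
    # Every ID value must be distinct within the document.
    duplicates = {value for value in ids if ids.count(value) > 1}
    problems += [f"duplicate ID {value}" for value in sorted(duplicates)]
    # Every IDREF (and each token of an IDREFS) must match some ID.
    for link in root.iter("link"):
        refs = link.get("manager", "").split() + link.get("subordinates", "").split()
        problems += [f"dangling IDREF {ref}" for ref in refs if ref not in ids]
    return problems


problems = check_references(ET.fromstring(ADDRESS_BOOK))
print(problems)
```

A validating parser given ab.dtd would perform these checks itself; the sketch only shows what the ID, IDREF and IDREFS constraints amount to.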