Lecture 5 XML Schema
(Based on Møller and Schwartzbach, 2006, pp. 113-159)
David Meredith
d.meredith@gold.ac.uk
www.titanmusic.com/teaching/cis336-2006-7.html
CIS 336 Website design, implementation and management (also Semester 2 of CIS 219, CIS 221 and IT 226)
Problems with DTDs
• DTDs cannot constrain character data
  – e.g., cannot specify that (#PCDATA) must only be a valid integer representation
  – need a more powerful datatype mechanism
• Attribute types are too limited
  – e.g., cannot specify that an attribute value must be an integer, a URI, etc.
• Element and attribute definitions cannot depend on context
  – e.g., cannot specify that the unit attribute is only allowed if the amount attribute is present
• Character data cannot be combined with a regular expression content model
  – i.e., mixed content always has the form (#PCDATA | e1 | e2)*
    • cannot specify the order in which character data may be interspersed with elements
• Element content model lacks an "interleaving" operator that allows us to specify that an element may occur anywhere inside another element
  – e.g., cannot (easily) specify that a comment element may occur anywhere in the contents of a recipe element
More problems with DTDs
• DTDs provide very limited support for modularity, reuse and evolution of schemas
  – hard to write, maintain and read large DTD schemas
• ID/IDREF mechanism is too limited
  – sometimes want to specify a more restricted scope for an ID attribute than the whole instance document
  – also might want to use multiple attribute values or character data as keys rather than just a single attribute value
• DTDs do not support namespaces
XML Schema
• DTDs defined as part of the XML 1.0 specification (February 1998)
  – inherited from SGML
• Shortly afterwards, W3C initiated the XML Schema project to deal with problems in DTDs
• XML Schema Requirements (1999) specifies that XML Schema should be:
  – more expressive than XML DTD
  – a well-formed XML language
  – self-describing
    • i.e., it should be possible to describe the syntax of XML Schema using an XML Schema (since XML Schema is an XML language)
  – simple enough to implement with modest design and runtime resources (which limits expressiveness)
• XML Schema specification should be:
  – defined quickly to prevent competing schema languages gaining a foothold
  – precise, concise, human-readable and illustrated with examples
XML Schema technical requirements
• XML Schema should
  – contain a mechanism for constraining use of namespaces
  – allow creation of user-defined datatypes for describing character data and attribute values
  – enable inheritance for element, attribute and datatype definitions
  – support evolution of schemas
  – permit embedded structured documentation within schemas
XML Schema recommendation
• Official XML Schema specification published as W3C recommendation in 2001
  – in 2 parts:
    • XML Schema Part 1: Structures
      – Describes core XML Schema including, for example, element and attribute declarations
      – Most recent version: Second Edition, 28 October 2004
      – Available online at http://www.w3.org/TR/xmlschema-1/
    • XML Schema Part 2: Datatypes
      – Defines facilities for defining datatypes in XML Schema
      – Most recent version: Second Edition, 28 October 2004
      – Available online at http://www.w3.org/TR/xmlschema-2/
• Does not satisfy all original requirements:
  – not simple
    • Partly remedied by XML Schema Part 0: Primer
      – Provides easily readable description of the XML Schema facilities
      – Most recent version: 28 October 2004
      – Available online at http://www.w3.org/TR/xmlschema-0/
  – not fully self-describing
  – not sufficiently expressive
    • e.g., cannot express full syntax of RecipeML
XML Schema overview
• Contains a sophisticated type system like those in common programming languages
  – Facilitates re-use and improves schema structure
• The four central constructs in XML Schema are all based on types:
  – Simple type definition
    • Defines a family of Unicode text strings
    • Describes text without markup
  – Complex type definition
    • Defines validity requirements for attributes, sub-elements and character data in an element of that type
    • Describes text which may contain markup
  – Element declaration
    • Associates an element name with either a simple or a complex type
  – Attribute declaration
    • Associates an attribute name with a simple type
      – Attribute values are always unstructured text
An example schema written in XML Schema
• Schema at left shows
  – one element declaration:
    • student
  – two attribute declarations:
    • id, score
  – one complex type definition:
    • StudentType
  – one simple type definition:
    • Score
• XML Schema elements identified by namespace http://www.w3.org/2001/XMLSchema
  – Namespace prefix ("xsd") is arbitrary but conventional
• Root element in XML Schema document is named schema
  – usually contains targetNamespace attribute
    • defines the namespace being defined by the schema
    • also declare this namespace with a prefix so that definitions within the schema can be referred to
• Definitions create new types; declarations describe constituents of the instance document
• Definitions and declarations populate the target namespace
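The slide's own figure did not survive, so the following is only a sketch of a schema with the constituents listed above. The namespace URI http://example.org/students, the prefix s and the 0 to 100 range chosen for Score are assumptions made for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:s="http://example.org/students"
            targetNamespace="http://example.org/students">

  <!-- element declaration: associates the name student with a complex type -->
  <xsd:element name="student" type="s:StudentType"/>

  <!-- attribute declarations: associate the names id and score with simple types -->
  <xsd:attribute name="id" type="xsd:string"/>
  <xsd:attribute name="score" type="s:Score"/>

  <!-- complex type definition: a student has no content, only two attributes -->
  <xsd:complexType name="StudentType">
    <xsd:attribute ref="s:id" use="required"/>
    <xsd:attribute ref="s:score" use="required"/>
  </xsd:complexType>

  <!-- simple type definition: an integer restricted to an assumed range -->
  <xsd:simpleType name="Score">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="100"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```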
Syntax for element and attribute declarations
• Element declaration has the form <element name="name" type="type"/>
  – associates the simple or complex type named type with elements named name
• Attribute declaration has the form <attribute name="name" type="type"/>
  – associates the simple type named type with attributes named name
Simple student instance document
• Can avoid use of prefixes in attribute names
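A sketch of what such an instance document might look like, assuming a schema with the hypothetical target namespace http://example.org/students that declares student, id and score at the top level. Globally declared attributes belong to the target namespace, so they carry a prefix here; declaring the attributes locally inside the complex type is one way to avoid the prefixes. The attribute values are made up.

```xml
<!-- namespace URI, prefix and values are illustrative assumptions -->
<s:student xmlns:s="http://example.org/students"
           s:id="19970233"
           s:score="97"/>
```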
Business card example
• Instance doc at top left in language defined at bottom left
• Assume we own the domain businesscard.org
  – so no-one else uses this namespace
• Can fix it so that no need for prefix in uri attribute
• Compare DTD
Connecting instance documents and schemas
• Instance document can refer to a schema using the schemaLocation attribute from the namespace http://www.w3.org/2001/XMLSchema-instance
• Value of schemaLocation attribute has two parts, separated by whitespace:
  – target namespace of schema
  – URI of schema document
• schemaLocation indicates that the document is supposed to be valid with respect to the schema
• schemaLocation attributes may appear in any element
  – usually appear in root element
  – can also appear in another element to indicate that the schema applies to the subtree under that element
    • means XML languages can be combined at will
• schemaLocation attribute value is actually a sequence of "namespace URI" pairs
  – if more than one pair, all schemas apply independently
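A minimal sketch of a schemaLocation attribute on a root element. The namespace http://businesscard.org is the one from the business card example; the schema file name business_card.xsd, the prefix b and the element content are assumptions.

```xml
<b:card xmlns:b="http://businesscard.org"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://businesscard.org business_card.xsd">
  <!-- first part of the value: target namespace of the schema;
       second part: URI of the schema document (hypothetical file name) -->
  <b:name>John Doe</b:name>
</b:card>
```

Further "namespace URI" pairs could be appended to the same attribute value, separated by whitespace.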
More on schemaLocation
• All attributes defined in http://www.w3.org/2001/XMLSchema-instance are implicitly declared for all elements in the instance document
• schemaLocation attributes are optional
  – make instance documents self-describing
• Applications require documents to be valid relative to schemas decided by application developers, not schemas decided by document authors
• XML Schema does not directly enforce a particular root element
  – e.g., an XML Schema definition of XHTML cannot express that the root element must be html
  – means that the application must check the root element as well as carrying out XML validation
Simple types
• Simple type or datatype is a set of Unicode strings with a particular semantic interpretation
  – e.g., decimal datatype is a built-in XML Schema datatype which consists of all strings that represent decimal numbers (e.g., 3.1415)
    • 3.1415 is equal to 3.141500
    • 42 is less than 117
• XML Schema contains some primitive simple types with pre-defined meanings
• XML Schema also provides various mechanisms for deriving new types from existing ones
Simple Types (Datatypes) – Primitive
string         any Unicode string
boolean        true, false, 1, 0
decimal        3.1415
float          6.02214199E23
double         42E970
dateTime       2004-09-26T16:29:00-05:00
time           16:29:00-05:00
date           2004-09-26
hexBinary      48656c6c6f0a
base64Binary   SGVsbG8K
anyURI         http://www.brics.dk/ixwt/
QName          rcp:recipe, recipe
...
Some built-in derived simple types
• normalizedString
  – as string but whiteSpace facet is replace
• token
  – as string but whiteSpace facet is collapse
• language
  – "en", "da", "en-US", etc.
• NMTOKEN
  – e.g., "42", "my.form", "r103"
• NMTOKENS
  – e.g., "42 my.form r103"
• nonPositiveInteger
  – e.g., "-87", "0"
A simple type element declaration
• <element name="serialnumber" type="nonNegativeInteger"/>
  – assigns the built-in derived simple type nonNegativeInteger to elements named serialnumber
  – contents of a serialnumber element must match nonNegativeInteger (possibly with surrounding whitespace)
  – serialnumber element cannot contain child elements or attributes
Deriving new simple types by restriction
• Restriction of a simple type defines a new type by restricting possible values of a base type
  – restriction performed on facets of base type (see table above left)
  – restriction may contain multiple constraining facets
• Facet restrictions operate at semantic not syntactic level
  – e.g., <totalDigits value="3"/> allows 123, 0123 and 0123.0 but not 1234 and 123.05
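A sketch of two restrictions, one with a pair of range facets and one with the totalDigits facet discussed above. The type names score_0_to_100 and three_digit_decimal are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- two constraining facets in one restriction: integers from 0 to 100 -->
  <xsd:simpleType name="score_0_to_100">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="100"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- semantic facet: at most three significant digits,
       so 0123.0 is accepted and 123.05 is rejected -->
  <xsd:simpleType name="three_digit_decimal">
    <xsd:restriction base="xsd:decimal">
      <xsd:totalDigits value="3"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```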
Deriving new simple types by restriction
• enumeration facet restricts values to a finite set of possibilities (see above left)
• pattern facet allows values to be constrained to satisfy regular expressions (see above right)
  – symbols that have a special meaning within regular expressions can be escaped by prefixing with a backslash (e.g., \*)
• For most facets, restrictions may be changed in further derivations unless a fixed="true" attribute is added to the constraining facet
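A sketch of the enumeration and pattern facets. The type names and the permitted values are invented for illustration and are not taken from the slide's figures.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- enumeration: the value must be one of the listed strings -->
  <xsd:simpleType name="weekend_day">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="Saturday"/>
      <xsd:enumeration value="Sunday"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- pattern: three digits, a literal dot (escaped with a backslash), two digits -->
  <xsd:simpleType name="dotted_code">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\d{3}\.\d{2}"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```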
Deriving simple types using list and union
• Use the list element inside a simpleType definition to define a whitespace-separated string of values of a particular type (see above left)
  – e.g., "23 4 56 -7" is of type integerlist
• Use the union element inside a simpleType definition to specify that a value must be one of two or more types
  – e.g., "true" and "1.3" are both of type boolean_or_decimal
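The two type names mentioned above could be defined roughly as follows. This is a sketch written to match the examples in the bullets, not a copy of the slide's figure.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- list: whitespace-separated integers, e.g. "23 4 56 -7" -->
  <xsd:simpleType name="integerlist">
    <xsd:list itemType="xsd:integer"/>
  </xsd:simpleType>

  <!-- union: a value is valid if it is a boolean or a decimal, e.g. "true" or "1.3" -->
  <xsd:simpleType name="boolean_or_decimal">
    <xsd:union memberTypes="xsd:boolean xsd:decimal"/>
  </xsd:simpleType>
</xsd:schema>
```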
Complex types
• An element declaration may assign a complex type to an element name: <element name="card" type="b:card_type"/>
  – means that elements with the name card must satisfy all the requirements specified in the definition of the type card_type
  – complex type definition may specify attributes, child element types and ordering, and character data
• Complex type defined using the XML Schema element complexType
  – content of complexType element can be either complex or simple
Element reference
• Element reference takes the form <element ref="name"/>
  – name is the name of an element that has already been declared
• Note the difference between an element with a name attribute and one with a ref attribute!
sequence element
• Concatenation within the content of an element with a complex content model is expressed using the sequence element
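A sketch of a sequence in the business card language. The namespace http://businesscard.org and the names card and card_type come from the slides; the child elements name, title and email and their string types are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:b="http://businesscard.org"
            targetNamespace="http://businesscard.org">

  <xsd:element name="card" type="b:card_type"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="title" type="xsd:string"/>
  <xsd:element name="email" type="xsd:string"/>

  <!-- a card must contain name, then title, then email, in that order -->
  <xsd:complexType name="card_type">
    <xsd:sequence>
      <xsd:element ref="b:name"/>
      <xsd:element ref="b:title"/>
      <xsd:element ref="b:email"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```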
choice element
• Union (i.e., the '|' operator in a regular expression) corresponds to the choice element
• At left, each card element contains either an email element or zero or one phone elements but not both
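A sketch of a content model matching the description above: either one email element, or at most one phone element. The slide's figure is not available, so the element types and the overall shape of card_type are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:b="http://businesscard.org"
            targetNamespace="http://businesscard.org">

  <xsd:element name="card" type="b:card_type"/>
  <xsd:element name="email" type="xsd:string"/>
  <xsd:element name="phone" type="xsd:string"/>

  <xsd:complexType name="card_type">
    <xsd:choice>
      <!-- first alternative: exactly one email -->
      <xsd:element ref="b:email"/>
      <!-- second alternative: zero or one phone -->
      <xsd:element ref="b:phone" minOccurs="0"/>
    </xsd:choice>
  </xsd:complexType>
</xsd:schema>
```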
all element
• A content sequence matches an all expression if each constituent of the expression is matched somewhere in the content sequence and every element in the content sequence is matched by a constituent in the expression
• Essentially a variant of sequence in which order does not matter
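A sketch of an all expression, reusing the business card namespace from the slides. The child elements name and email are assumptions; under this type, a card may list them in either order.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:b="http://businesscard.org"
            targetNamespace="http://businesscard.org">

  <xsd:element name="card" type="b:card_type"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="email" type="xsd:string"/>

  <!-- both children are required, but their order is free -->
  <xsd:complexType name="card_type">
    <xsd:all>
      <xsd:element ref="b:name"/>
      <xsd:element ref="b:email"/>
    </xsd:all>
  </xsd:complexType>
</xsd:schema>
```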
any element
• The empty element any is a wildcard that matches any element
• The namespace attribute limits matching elements in various ways:
  – whitespace-separated list of URIs
  – ##targetNamespace
  – ##local
    • empty namespace
  – ##any
  – ##other
    • any namespace except the target namespace
any element
• Can be used to specify that a different language is used inside an element
  – e.g., XHTML used inside the info element in WidgetML (see above)
  – content must consist of one or more elements from the XHTML namespace
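A sketch of how an info element might admit XHTML content. The WidgetML namespace URI http://example.org/widget, the prefix w and the type name info_type are hypothetical, and processContents="lax" is an added assumption so that validation does not depend on an XHTML schema being available.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:w="http://example.org/widget"
            targetNamespace="http://example.org/widget">

  <xsd:element name="info" type="w:info_type"/>

  <!-- one or more elements, each from the XHTML namespace -->
  <xsd:complexType name="info_type">
    <xsd:sequence>
      <xsd:any namespace="http://www.w3.org/1999/xhtml"
               minOccurs="1" maxOccurs="unbounded"
               processContents="lax"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```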
Some restrictions
• all element may only contain element references
• sequence and choice elements cannot contain all elements
• complexType contents cannot consist of a single element or any declaration
  – need to wrap it in a sequence or choice element
Attribute references
• A complex type may optionally contain a number of attribute references of the form <attribute ref="name"/>
  – name is the name of an attribute that has been declared elsewhere
  – attribute reference must appear after the content model description of a complex type
  – attribute reference can contain an attribute named use which can take the values optional (default) or required
minOccurs and maxOccurs
• minOccurs and maxOccurs attributes can be used with
  – element, sequence, choice, all and any elements
  – define possible cardinalities of the element
  – values must be non-negative integers or, for maxOccurs, unbounded
  – by default, minOccurs and maxOccurs are 1
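A sketch showing the three most common cardinalities inside a sequence. The business card namespace comes from the slides; the particular child elements and their cardinalities are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:b="http://businesscard.org"
            targetNamespace="http://businesscard.org">

  <xsd:element name="card" type="b:card_type"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="email" type="xsd:string"/>
  <xsd:element name="phone" type="xsd:string"/>

  <xsd:complexType name="card_type">
    <xsd:sequence>
      <!-- defaults apply (minOccurs=1, maxOccurs=1): exactly once -->
      <xsd:element ref="b:name"/>
      <!-- one or more -->
      <xsd:element ref="b:email" maxOccurs="unbounded"/>
      <!-- zero or one -->
      <xsd:element ref="b:phone" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```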
mixed attribute
• complexType may optionally have an attribute mixed="true"
  – means arbitrary character data is permitted anywhere in the content in addition to the elements declared in the content model
  – without the mixed="true" attribute, only whitespace is allowed between elements in the content model
  – character data cannot be constrained if we also want to allow elements in the content
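A sketch of a mixed content type. The namespace http://example.org/doc and the element names para and em are hypothetical; with this type, an instance such as <d:para>Some <d:em>important</d:em> text</d:para> would be accepted.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:d="http://example.org/doc"
            targetNamespace="http://example.org/doc">

  <xsd:element name="para" type="d:para_type"/>
  <xsd:element name="em" type="xsd:string"/>

  <!-- mixed="true": character data may appear before, between and after em elements -->
  <xsd:complexType name="para_type" mixed="true">
    <xsd:sequence>
      <xsd:element ref="d:em" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```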