XML Extensible Markup Language DBI Representation and Management
XML – Extensible Markup Language DBI – Representation and Management of Data on the Internet 1
Part I: Background • What’s the difference between – The world of documents and information retrieval, and – Databases and query interfaces? 2
Documents vs. Databases Document World • Plenty of small documents • Usually static • Implicit structure: section, paragraph, toc • Tagging • Human friendly • Content: form/layout, annotation • Paradigms: “Save as”, wysiwyg • Meta-data: author name, date, subject Database World • A few large databases • Usually dynamic • Explicit structure (schema) • Records • Machine friendly • Content: schema, data, methods • Paradigms: Atomicity, Consistency, Isolation, Durability • Meta-data: schema description 3
Documents vs. Databases Document World • Plenty of small documents • Usually static • Implicit structure: section, paragraph, table of contents • Tagging Database World • A few large databases • Usually dynamic • Explicit structure: schema • Records 4
Documents vs. Databases (cont’d) Document World • Human friendly • Content: form/layout, annotation • Paradigms: “Save as”, Wysiwyg • Meta-data: author name, date, subject Database World • Machine friendly • Content: schema, data, methods • Paradigms: Atomicity, Consistency, Isolation, Durability • Meta-data: schema description 5
What Can Be Done with Them • Documents: editing, printing, spell-checking, counting words, retrieving (IR), searching, clustering • Databases: updating, cleaning, querying, adjusting, transforming 6
HTML • Hypertext Markup Language • Used for publishing hypertext on the World-Wide Web • Designed to describe how a Web browser should arrange text, images and push-buttons on a page • Easy to learn, but does not convey structure • Fixed tag set 7
HTML Example
<HTML>
  <HEAD><TITLE>Welcome to the DBI course</TITLE></HEAD>
  <BODY>
    <H1>Introduction</H1>
    <IMG SRC="dragon.gif" WIDTH="200" HEIGHT="150">
  </BODY>
</HTML>
Figure callouts: opening tag, closing tag, text (PCDATA), “bachelor” tag (IMG, which has no closing tag), attribute name (e.g., SRC), attribute value (e.g., "dragon.gif") 8
HTML • The World-Wide Web is constructed from HTML documents • We can apply information-retrieval techniques to a set of documents – For example, clustering as Google does • How can we apply database techniques to the Web? 9
HTML Pages • We can – Edit (and put on the Web) – Print (or view with a browser) – Spell-check – Count words – Retrieve (again, with a browser) – Search (with a search engine, for example) – Cluster 10
How can we Ask Queries? • How can we automatically find the cheapest flight from Israel to Micronesia, knowing the Web sites of all airlines that have flights to Micronesia? • How can we automatically find the phone numbers of people who advertised on the Web that they want to sell a car for a price that is not greater than 30,000 IS? • It can be useful to query data as we do in databases 11
Thin Red Line • The line between the document world and the database world is not clear • In some cases, both approaches are legitimate • An interesting middle ground is data formats – of which XML is an example 12
The Structure of XML • XML consists of tags and text • Tags come in pairs: <date>...</date> • They must be properly nested – good: <date>...<day>...</day>...</date> – bad: <date>...<day>...</date>...</day> (You can’t do <i>...<b>...</i>...</b> in HTML) 13
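The nesting rule can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the function name is_properly_nested and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_properly_nested(text: str) -> bool:
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents mirroring the good/bad patterns on the slide.
good = "<date>2000<day>Monday</day></date>"
bad = "<date>2000<day>Monday</date></day>"

print(is_properly_nested(good))
print(is_properly_nested(bad))
```

The parser rejects the second string because </date> arrives while <day> is still open.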
XML Text • XML has only one “basic” type: text • It is bounded by tags, e.g., <title> The Big Sleep </title> <year> 1935 </year> – 1935 is still text • XML text is called PCDATA (for parsed character data) • It uses Unicode (originally a 16-bit encoding); a character can be written as a numeric reference, e.g., &#x05DE; for the Hebrew letter Mem 14
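A short sketch of what "1935 is still text" means in practice, assuming Python's standard xml.etree.ElementTree; the <letter> element is a made-up wrapper used only to show a character reference being resolved.

```python
import xml.etree.ElementTree as ET

year = ET.fromstring("<year> 1935 </year>")
# The parser hands back a string; any numeric meaning is up to the application.
print(type(year.text).__name__, repr(year.text))
as_number = int(year.text)  # explicit conversion by the application

# A numeric character reference is resolved to the character itself.
letter = ET.fromstring("<letter>&#x05DE;</letter>")
print(repr(letter.text))
```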
XML Structure • Nesting tags can be used to express various structures, e.g., a tuple (record):
<person>
  <name> Lisa Simpson </name>
  <tel> 02-828-1234 </tel>
  <tel> 054-470-777 </tel>
  <email> lisa@cs.huji.ac.il </email>
</person> 15
XML Structure (cont’d) • We can represent a list by using the same tag repeatedly: <addresses> <person> … </person> … </addresses> 16
XML Structure (cont’d)
<addresses>
  <person>
    <name> Donald Duck </name>
    <tel> 04-828-1345 </tel>
    <email> donald@cs.technion.ac.il </email>
  </person>
  <person>
    <name> Miki Mouse </name>
    <tel> 03-426-1142 </tel>
    <email> miki@yahoo.com </email>
  </person>
</addresses> 17
Terminology • The segment of an XML document between an opening and a corresponding closing tag is called an element
<person>
  <name> Bart Simpson </name>
  <tel> 02 – 444 7777 </tel>
  <tel> 051 – 011 022 </tel>
  <email> bart@tau.ac.il </email>
</person>
Figure callouts: “element”, “a sub-element of”, “not an element” 18
An XML Document is a Tree [Figure: tree with root person and children name, tel, tel, email; leaves Bart Simpson, 02 – 444 7777, 051 – 011 022, bart@tau.ac.il] • Semistructured data models typically put the labels on the edges 19
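The tree view can be made concrete with a small traversal. This is a sketch assuming Python's standard xml.etree.ElementTree; the helper name show is an assumption of this example.

```python
import xml.etree.ElementTree as ET

doc = """<person>
  <name> Bart Simpson </name>
  <tel> 02 - 444 7777 </tel>
  <tel> 051 - 011 022 </tel>
  <email> bart@tau.ac.il </email>
</person>"""

def show(element, depth=0):
    """Return the tree as indented lines: the tag first, then its text as a leaf."""
    lines = ["  " * depth + element.tag]
    text = (element.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + repr(text))
    for child in element:
        lines.extend(show(child, depth + 1))
    return lines

root = ET.fromstring(doc)
print("\n".join(show(root)))
```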
Mixed Content An element may contain a mixture of subelements and PCDATA <airline> <name> British Airways </name> <motto> World’s <dubious> favorite</dubious> airline </motto> </airline> 20
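How mixed content looks to a program depends on the parser. As a sketch, Python's standard xml.etree.ElementTree stores the text before the first child in .text and the text after a child in that child's .tail; the sample uses a plain ASCII apostrophe.

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto> World's <dubious> favorite</dubious> airline </motto>"
)
dubious = motto.find("dubious")

# Three text pieces surround and fill the single sub-element.
pieces = [motto.text, dubious.text, dubious.tail]
print([piece.strip() for piece in pieces])

# itertext() walks all text in document order, ignoring the markup.
flat = " ".join("".join(motto.itertext()).split())
print(flat)
```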
The Need for Mixed Content • Mixed-content data is not typically generated from databases • It is needed for consistency with HTML • For example:
<html>
  <head></head>
  <body> Why can’t you find <i>dragons</i> in a restaurant? Because <b>smoking</b> is not allowed </body>
</html> 21
A Complete XML Document
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE addresses SYSTEM "http://www.cs.huji.ac.il/~dbi/dbi-addresses.dtd">
<addresses>
  <person>
    <name> Lisa Simpson </name>
    <tel> 02-828-1234 </tel>
    <tel> 054-470-777 </tel>
    <email> lisa@cs.huji.ac.il </email>
  </person>
</addresses> 22
The Header Tag • <?xml version="1.0" encoding="UTF-8" standalone="yes"?> (standalone is either "yes" or "no"; when both are present, encoding comes before standalone) • You can leave out the encoding attribute and the processor will use the UTF-8 default 23
Processing Instructions
<?xml version="1.0"?>
<?xml-stylesheet href="doc.xsl" type="text/xsl"?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc>Hello, world!<!-- Comment 1 --></doc>
<?pi-without-data?>
<!-- Comment 2 -->
<!-- Comment 3 --> 24
Two Ways of Representing a Relational Database in XML • projects: title, budget, managedBy • employees: name, ssn, age 25
Project and Employee relations in XML • Projects and employees are intermixed
<db>
  <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
  <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
  <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
  <project> <title> Auto guided vehicle </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
  :
</db> 26
Employees follow projects
<db>
  <projects>
    <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
    <project> <title> Auto guided vehicles </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
    :
  </projects>
  <employees>
    <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
    <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
    :
  </employees>
</db> 27
Or without “separator” tags …
<db>
  <projects>
    <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy>
    <title> Auto guided vehicles </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy>
    :
  </projects>
  <employees>
    <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age>
    <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age>
    :
  </employees>
</db>
• This can be done if it is clear where each employee and each project starts 28
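The "employees follow projects" layout can be produced from relational rows with a few lines of code. This is a sketch assuming Python's standard xml.etree.ElementTree; the helper rows_to_xml and the in-memory row dictionaries are assumptions of this example, not part of the course material.

```python
import xml.etree.ElementTree as ET

# Rows of the two relations, held as plain dictionaries for this sketch.
projects = [
    {"title": "Pattern recognition", "budget": "10000", "managedBy": "Joe"},
    {"title": "Auto guided vehicles", "budget": "70000", "managedBy": "Sandra"},
]
employees = [
    {"name": "Joe", "ssn": "344556", "age": "34"},
    {"name": "Sandra", "ssn": "2234", "age": "35"},
]

def rows_to_xml(table_tag, row_tag, rows):
    """Wrap each row in row_tag and each column value in an element named after the column."""
    table = ET.Element(table_tag)
    for row in rows:
        record = ET.SubElement(table, row_tag)
        for column, value in row.items():
            ET.SubElement(record, column).text = value
    return table

db = ET.Element("db")
db.append(rows_to_xml("projects", "project", projects))
db.append(rows_to_xml("employees", "employee", employees))
print(ET.tostring(db, encoding="unicode"))
```

Dropping the row_tag level would give the variant without separator tags, at the cost of relying on column order to find record boundaries.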
Attributes • An (opening) tag may contain attributes • These are typically used to describe the contents of an element
<entry>
  <word language="en"> cheese </word>
  <word language="fr"> fromage </word>
  <word language="ro"> branza </word>
  <meaning> A food made … </meaning>
</entry> 29
Attributes (cont’d) • Another common use for attributes is to express dimension or type
<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
  <data encoding="gif" compression="zip"> M 05 -. +C$@02!G 96 YE<FEC. . . </data>
</picture> 30
Well-Formed Documents A document that – obeys the “nested-tags” rule, and – does not repeat an attribute within a tag is said to be well-formed 31
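Both well-formedness rules are enforced by any conforming parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree; is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "ok": '<word language="en">cheese</word>',
    "repeated attribute": '<word language="en" language="fr">cheese</word>',
    "bad nesting": "<entry><word>cheese</entry></word>",
}
for label, text in samples.items():
    print(label, is_well_formed(text))
```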
Using Attributes
<addresses>
  <person friend="yes">
    <name> Jeff Cohen </name>
    <tel> 04-828-1345 </tel>
    <tel> 054-470-778 </tel>
    <email> jeffc@cs.technion.ac.il </email>
  </person>
  <person friend="no">
    <name> Irma Levy </name>
    <tel> 03-426-1142 </tel>
    <email> irmal@yourmail.com </email>
  </person>
</addresses> 32
When to Use Attributes • It’s not always clear when to use attributes • The same information can be an attribute, <person ssno="123 4589"> ... </person>, or a sub-element:
<person>
  <ssno> 123 4589 </ssno>
  <name> L. Simpson </name>
  <email> lisa@cs.huji.ac.il </email>
  ...
</person> 33
End of Lecture 4 34
Using IDs
<person id="jeff" friend="yes" knows="irma">
  <name> Jeff Cohen </name>
  <tel> 04-828-1345 </tel>
  <tel> 054-470-778 </tel>
  <email> jeffc@cs.technion.ac.il </email>
</person>
<person id="irma" friend="no" knows="jeff">
  <name> Irma Levy </name>
  <tel> 03-426-1142 </tel>
  <email> irmal@yourmail.com </email>
</person>
Figure callout: ID attributes 35
Using IDs
<family>
  <person id="lisa" mother="marge" father="homer"> <name> Lisa Simpson </name> </person>
  <person id="bart" mother="marge" father="homer"> <name> Bart Simpson </name> </person>
  <person id="marge" children="bart lisa"> <name> Marge Simpson </name> </person>
  <person id="homer" children="bart lisa"> <name> Homer Simpson </name> </person>
</family> 36
ODL Schema
class Movie ( extent Movies, key title ) {
  attribute string title;
  attribute string director;
  relationship set<Actor> casts inverse Actor::acted_In;
  attribute int budget;
};
class Actor ( extent Actors, key name ) {
  attribute string name;
  relationship set<Movie> acted_In inverse Movie::casts;
  attribute int age;
  attribute set<string> directed;
}; 37
class Movie ( extent Movies, key title ) { attribute string title; attribute string director; relationship set<Actor> casts inverse Actor::acted_In; attribute int budget; };
<db>
  <movie id="m1"> <title>Waking Ned Divine</title> <director>Kirk Jones III</director> <cast idrefs="a1 a3"></cast> <budget>100,000</budget> </movie>
  <movie id="m2"> <title>Dragonheart</title> <director>Rob Cohen</director> <cast idrefs="a2 a9 a21"></cast> <budget>110,000</budget> </movie>
  <movie id="m3"> <title>Moondance</title> <director>Dagmar Hirtz</director> <cast idrefs="a1 a8"></cast> <budget>90,000</budget> </movie>
  : 38
class Actor ( extent Actors, key name ) { attribute string name; relationship set<Movie> acted_In inverse Movie::casts; attribute int age; attribute set<string> directed; };
<db>
  :
  <actor id="a1"> <name>David Kelly</name> <acted_In idrefs="m1 m3 m78"></acted_In> </actor>
  <actor id="a2"> <name>Sean Connery</name> <acted_In idrefs="m2 m9 m11"></acted_In> <age>68</age> </actor>
  <actor id="a3"> <name>Ian Bannen</name> <acted_In idrefs="m1 m35"></acted_In> </actor>
  :
</db> 39
<db>
  <movie id="m1"> <title>Waking Ned Divine</title> <director>Kirk Jones III</director> <cast idrefs="a1 a3"></cast> <budget>100,000</budget> </movie>
  <movie id="m2"> <title>Dragonheart</title> <director>Rob Cohen</director> <cast idrefs="a2 a9 a21"></cast> <budget>110,000</budget> </movie>
  <movie id="m3"> <title>Moondance</title> <director>Dagmar Hirtz</director> <cast idrefs="a1 a8"></cast> <budget>90,000</budget> </movie>
  :
  <actor id="a1"> <name>David Kelly</name> <acted_In idrefs="m1 m3 m78"></acted_In> </actor>
  <actor id="a2"> <name>Sean Connery</name> <acted_In idrefs="m2 m9 m11"></acted_In> <age>68</age> </actor>
  <actor id="a3"> <name>Ian Bannen</name> <acted_In idrefs="m1 m35"></acted_In> </actor>
  :
</db> 40
Part II: Document Type Definitions (DTDs) Imposing Structure on XML Documents 41
Document Type Definitions • Document Type Definitions (DTDs) impose structure on an XML document • There is some relationship between a DTD and a schema, but it is not close – hence the need for additional “typing” systems • The DTD is a syntactic specification 42
Example: An Address Book
<person>
  <name> Homer Simpson </name>
  <greet> Dr. H. Simpson </greet>
  <addr> 1234 Springwater Road </addr>
  <addr> Springfield USA, 98765 </addr>
  <tel> (321) 786 2543 </tel>
  <fax> (321) 786 2544 </fax>
  <tel> (321) 786 2544 </tel>
  <email> homer@math.springfield.edu </email>
</person>
• Exactly one name • At most one greeting • As many address lines as needed (in order) • Mixed telephones and faxes • As many as needed 43
Specifying the Structure • name to specify a name element • greet? to specify an optional (0 or 1) greet element • name, greet? to specify a name followed by an optional greet 44
Specifying the Structure (cont) • addr* to specify 0 or more address lines • tel | fax to specify a tel or a fax element • (tel | fax)* to specify 0 or more repeats of tel or fax • email* to specify 0 or more email elements 45
Specifying the Structure (cont’d) • So the whole structure of a person entry is specified by name, greet?, addr*, (tel | fax)*, email* • This is known as a regular expression • Why is it important? 46
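One way to see why the regular-expression view matters: the sequence of child tags of a person can be checked with an ordinary regex engine. This is only a sketch, assuming Python's re and xml.etree.ElementTree; the encoding of tag names as space-terminated words and the names PERSON_MODEL and matches_person_model are assumptions of this example, not how a validating parser works internally.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*  over space-terminated tag names
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")

def matches_person_model(person) -> bool:
    """Check the order and multiplicity of the direct children of a person element."""
    child_tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None

homer = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/><email/></person>"
)
out_of_order = ET.fromstring("<person><greet/><name/></person>")

print(matches_person_model(homer))
print(matches_person_model(out_of_order))
```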
Regular Expressions • Each regular expression determines a corresponding finite state automaton • Let’s start with a simpler example: name, addr*, email [Figure: finite state automaton with transitions labeled name, addr, email] • This suggests a simple parsing program 47
Another Example • name, address*, (tel | fax)*, email* [Figure: finite state automaton with transitions labeled name, address, tel, fax, email] • Adding in the optional greet further complicates things 48
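The "simple parsing program" suggested by the automaton for name, addr*, email can be sketched as a transition table. The state names below are invented for this example; only the three tag names come from the slide.

```python
# Transition table for: name, addr*, email
TRANSITIONS = {
    ("start", "name"): "after_name",
    ("after_name", "addr"): "after_name",  # loop for addr*
    ("after_name", "email"): "done",
}
ACCEPTING = {"done"}

def accepts(tags) -> bool:
    """Run the automaton over a sequence of child tag names."""
    state = "start"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no transition: the sequence is rejected
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["addr", "name", "email"]))
```

Each input tag triggers exactly one table lookup, so checking takes time linear in the number of children.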
Internal DTD For the Address Book
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
• The name of the DTD is addressbook • “Internal” means that the DTD and the XML Document are in the same file 49
Rest of the Address Book
<addressbook>
  <person>
    <name> Jeff Cohen </name>
    <greet> Dr. Cohen </greet>
    <email> jc@penny.com </email>
  </person>
</addressbook> 50
Our Relational DB Revisited • projects: title, budget, managedBy • employees: name, ssn, age 51
Two DTDs for the Relational DB
<!DOCTYPE db [
  <!ELEMENT db (projects, employees)>
  <!ELEMENT projects (project*)>
  <!ELEMENT employees (employee*)>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>
<!DOCTYPE db [
  <!ELEMENT db (project | employee)*>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]> 52
Recursive DTDs
<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (name, dateOfBirth, person, person)>
  ...
]>
(In the person content model, the first nested person is the mother and the second is the father.) • What is the problem with this? • Each person should have a father and a mother. This leads to either infinite data or a person that is a descendant of himself. • A parser does notice it! 53
Recursive DTDs (cont’d)
<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (name, dateOfBirth, person?, person?)>
  ...
]>
(Again, the first nested person is the mother and the second is the father.) • What is now the problem with this? • If a person has only a father, how can you tell that he has a father and does not have a mother? 54
Some Things are Hard to Specify • Each employee element is to contain name, age and ssn elements in some order
<!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )>
• There are n! different orders of n elements • It is not even polynomial • Suppose there were many more fields! 55
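The blow-up is easy to reproduce. The sketch below, using only Python's standard itertools, spells out the "any order" content model by enumerating every permutation; the function name any_order_model is made up for this example.

```python
from itertools import permutations

def any_order_model(element, children):
    """Build an ELEMENT declaration listing every ordering of the children."""
    alternatives = ["(" + ", ".join(order) + ")" for order in permutations(children)]
    return "<!ELEMENT " + element + " (" + " | ".join(alternatives) + ")>"

decl = any_order_model("employee", ["name", "age", "ssn"])
print(decl)

# The number of alternatives grows as n!.
for n in (3, 6, 9):
    print(n, sum(1 for _ in permutations(range(n))))
```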
General Definitions of Element Content • ANY: the element can have any content • EMPTY: the element has no content 56
Summary of XML regular expressions • A: the tag A occurs • e1, e2: the expression e1 followed by e2 • e*: 0 or more occurrences of e • e?: optional, 0 or 1 occurrences • e+: 1 or more occurrences • e1 | e2: either e1 or e2 • (e): grouping 57
Deterministic Requirement • If element-type declarations are deterministic, it is easier to check a document against them • Formally, the Glushkov automaton of each content model is deterministic • The states of this automaton are the positions of the regular expression (semantic actions) • The transitions are based on the “follows set” 58
Deterministic Requirement (cont.) • The associated automata are succinct • A regular language may not have an associated deterministic grammar, e.g., <!ELEMENT ndeter ((movie|director)*, movie, (movie|director))> 59
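The determinism condition can be tested from the first and follow sets of the positions, without building the automaton explicitly. The following is a sketch under stated assumptions: content models are written as nested tuples (a mini-AST invented for this example), and the check reports a clash when two different positions carrying the same tag compete in a first set or in some follow set.

```python
# Mini-AST (hypothetical): ("sym", tag) | ("cat", a, b) | ("alt", a, b) | ("star", a) | ("opt", a)

def analyse(node, labels, follow):
    """Return (nullable, first, last) for node; record position labels and follow sets."""
    kind = node[0]
    if kind == "sym":
        position = len(labels)
        labels.append(node[1])
        follow.append(set())
        return False, {position}, {position}
    if kind in ("star", "opt"):
        _, first, last = analyse(node[1], labels, follow)
        if kind == "star":  # after the last position we may loop back to the first
            for position in last:
                follow[position] |= first
        return True, first, last
    nullable1, first1, last1 = analyse(node[1], labels, follow)
    nullable2, first2, last2 = analyse(node[2], labels, follow)
    if kind == "alt":
        return nullable1 or nullable2, first1 | first2, last1 | last2
    for position in last1:  # kind == "cat"
        follow[position] |= first2
    first = (first1 | first2) if nullable1 else first1
    last = (last1 | last2) if nullable2 else last2
    return nullable1 and nullable2, first, last

def has_clash(positions, labels):
    """True if two distinct positions in the set carry the same tag."""
    tags = [labels[p] for p in positions]
    return len(tags) != len(set(tags))

def is_deterministic(model):
    labels, follow = [], []
    _, first, _ = analyse(model, labels, follow)
    return not has_clash(first, labels) and not any(has_clash(f, labels) for f in follow)

def sym(tag):
    return ("sym", tag)

def seq(*parts):
    model = parts[0]
    for part in parts[1:]:
        model = ("cat", model, part)
    return model

# ((movie|director)*, movie, (movie|director))
ndeter = seq(("star", ("alt", sym("movie"), sym("director"))),
             sym("movie"),
             ("alt", sym("movie"), sym("director")))
# name, greet?, addr*, (tel | fax)*, email*
person = seq(sym("name"), ("opt", sym("greet")), ("star", sym("addr")),
             ("star", ("alt", sym("tel"), sym("fax"))), ("star", sym("email")))

print(is_deterministic(ndeter))
print(is_deterministic(person))
```

For ndeter, an initial movie could be the one inside the starred group or the mandatory one after it, which is exactly the clash the check finds.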
Specifying Attributes in the DTD
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
  dimension CDATA #REQUIRED
  accuracy CDATA #IMPLIED >
• The dimension attribute is required • The accuracy attribute is optional • CDATA is the “type” of the attribute: it means string, and may take any literal string as a value 60
Specifying ID and IDREF Attributes
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST person
    id ID #REQUIRED
    mother IDREF #IMPLIED
    father IDREF #IMPLIED
    children IDREFS #IMPLIED>
]>
• The attributes mother and father reference IDs of other elements. However, those are not necessarily person elements! • The mother attribute does not necessarily reference a female person. • References to IDs have no type 61
Some Conforming Data
<family>
  <person id="lisa" mother="marge" father="homer"> <name> Lisa Simpson </name> </person>
  <person id="bart" mother="marge" father="homer"> <name> Bart Simpson </name> </person>
  <person id="marge" children="bart lisa"> <name> Marge Simpson </name> </person>
  <person id="homer" children="bart lisa"> <name> Homer Simpson </name> </person>
</family> 62
Consistency of ID and IDREF Attribute Values • If an attribute is declared as ID – the associated values must all be distinct (no confusion) • If an attribute is declared as IDREF – the associated value must exist as the value of some ID attribute (no dangling “pointers”) • Similarly for all the values of an IDREFS attribute • ID and IDREF attributes are not typed 63
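The two consistency rules can be checked by hand when no validating parser is available. This sketch assumes Python's standard xml.etree.ElementTree (which does not validate against a DTD) and hard-codes the attribute typing of the family DTD; the function check_references and the sample with the unknown reference "mona" are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed typing, mirroring the family DTD: id is ID, mother/father are IDREF, children is IDREFS.
ID_ATTR = "id"
IDREF_ATTRS = ("mother", "father")
IDREFS_ATTRS = ("children",)

def check_references(root):
    """Return a list of problems: duplicate IDs and references to missing IDs."""
    problems = []
    seen = set()
    for element in root.iter():
        value = element.get(ID_ATTR)
        if value is None:
            continue
        if value in seen:
            problems.append("duplicate ID: " + value)
        seen.add(value)
    for element in root.iter():
        refs = [element.get(name) for name in IDREF_ATTRS]
        for name in IDREFS_ATTRS:
            refs.extend((element.get(name) or "").split())
        for ref in refs:
            if ref is not None and ref not in seen:
                problems.append("dangling reference: " + ref)
    return problems

family = ET.fromstring(
    '<family>'
    '<person id="lisa" mother="marge" father="homer"/>'
    '<person id="marge" children="lisa"/>'
    '<person id="homer" children="lisa"/>'
    '</family>'
)
broken = ET.fromstring('<family><person id="bart" mother="mona"/></family>')

print(check_references(family))
print(check_references(broken))
```

Note that the check says nothing about what kind of element an ID belongs to, which is the "not typed" limitation in the last bullet.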
A Useful Abbreviation • When an element has empty content we can use • <br/> for <br></br> • <hr width="10"/> for <hr width="10"></hr> • For example:
<family>
  <person id="lisa">
    <name> Lisa Simpson </name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
  ...
</family> 64
An Alternative Specification
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name, mother?, father?, children?)>
  <!ATTLIST person id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT mother EMPTY>
  <!ATTLIST mother idref IDREF #REQUIRED>
  <!ELEMENT father EMPTY>
  <!ATTLIST father idref IDREF #REQUIRED>
  <!ELEMENT children EMPTY>
  <!ATTLIST children idrefs IDREFS #REQUIRED>
]> 65
The Revised Data
<family>
  <person id="bart">
    <name> Bart Simpson </name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
  <person id="lisa">
    <name> Lisa Simpson </name>
  </person>
  <person id="marge">
    <name> Marge Simpson </name>
    <children idrefs="bart lisa"/>
  </person>
  <person id="homer">
    <name> Homer Simpson </name>
    <children idrefs="bart lisa"/>
  </person>
</family> 66
End of Lecture 5 67
ODL Schema
class Movie ( extent Movies, key title ) {
  attribute string title;
  attribute string director;
  relationship set<Actor> cast inverse Actor::acted_In;
  attribute int budget;
};
class Actor ( extent Actors, key name ) {
  attribute string name;
  relationship set<Movie> acted_In inverse Movie::cast;
  attribute int age;
  attribute set<string> directed;
}; 68
Schema.dtd
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE db [
  <!ELEMENT db (movie+, actor+)>
  <!ELEMENT movie (title, director, cast, budget)>
  <!ATTLIST movie id ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT cast EMPTY>
  <!ATTLIST cast idrefs IDREFS #REQUIRED>
  <!ELEMENT budget (#PCDATA)>
The DTD continues in the next slide 69
Schema.dtd (cont’d)
  <!ELEMENT actor (name, acted_In, age?, directed*)>
  <!ATTLIST actor id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT acted_In EMPTY>
  <!ATTLIST acted_In idrefs IDREFS #REQUIRED>
  <!ELEMENT age (#PCDATA)>
  <!ELEMENT directed (#PCDATA)>
]> 70
Data <db> <movie id="ohgod"> <title> Oh God!</title> <director> Woody Allen </director> <cast idrefs="burns"></cast> <budget> $2 M </budget> </movie> <actor id="burns"> <name> George Burns </name> <acted_In idrefs="ohgod" /> </actor> </db> 71
Constraints on IDs and IDREFs • ID stands for identifier – No two ID attributes may have the same value (each value must be a valid XML name) • IDREF stands for identifier reference – Every value associated with an IDREF attribute must exist as an ID attribute value • IDREFS specifies several (1 or more) identifiers, separated by whitespace 72
Adding a DTD to the Document • A DTD can be internal – The DTD is part of the document file • or external – The DTD and the document are in separate files – An external DTD may reside • In the local file system (where the document is) • In a remote file system 73
Connecting a Document with its DTD • An internal DTD: <?xml version="1.0"?> <!DOCTYPE db [<!ELEMENT ...> … ]> <db> ... </db> • A DTD from the local file system: <!DOCTYPE db SYSTEM "schema.dtd"> • A DTD from a remote file system: <!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd"> 74
Well-formed and Valid Documents • A document (with or without a DTD) is well-formed if it has – proper nesting of tags and unique attributes • A valid document conforms to the DTD, i.e., – the document conforms to the regular-expression grammar, – types of attributes are correct, and – constraints on references are satisfied 75
DTDs vs. Schemas (or Types) • DTDs are rather weak specifications by DB & programming-language standards – Only one base type – PCDATA – No useful “abstractions”, e. g. , sets – IDREFs are untyped – the type of the object being referenced is not known – No constraints, e. g. , child is inverse of parent – No methods – Tag definitions are global • Some extensions of XML impose a schema or types on an XML document We may see these later 76
Part III: Entities To Take Storage into Account 77
What are Entities? • An entity is a shortcut to a set of information • You might think of an entity as being a bit like a macro • Entities allow a document to be split across several storage units 78
Why Use Entities • Entities allow sharing data between documents • Entities save typing • Entities can reduce errors • Entities are easy to update • Entities can act as placeholders for TBD (to be determined) information 79
Defining Entities • Entities can be defined – in the local document as part of the DOCTYPE definition – with a link to external files that contain the entity data (this, too, is done through the DOCTYPE definition) – in an external DTD • Define locally when the entity is being used only in one particular document • Define by a link to an external file when the entity is being used in many documents 80
Kinds of Entities There are two kinds of entities: • General entities – For usage in documents • Parameter entities – For usage in declarations 81
General entities • The definition of a general entity in the DTD: <!ENTITY Name EntityDefinition> • The usage of the entity in the document is by &Name; 82
Example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mdb [
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title, director, cast?, budget)>
]>
<mdb>
  <movie id="ohgod" opinion="&bm;">
    <title> Oh God! </title>
    <director> Woody Allen </director>
    <budget> $2 M </budget>
  </movie>
</mdb> 83
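Entity expansion can be observed directly, since a parser replaces &bm; before the application sees the data. This is a sketch assuming Python's standard xml.etree.ElementTree, which expands internal general entities declared in the internal DTD subset but does not validate; the <remark> element is a hypothetical addition to show expansion inside element content as well.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE mdb [
  <!ENTITY bm "bad movie">
]>
<mdb>
  <movie id="ohgod" opinion="&bm;">
    <title>Oh God!</title>
    <remark>a &bm;</remark>
  </movie>
</mdb>"""

movie = ET.fromstring(doc).find("movie")
opinion = movie.get("opinion")        # entity expanded inside an attribute value
remark = movie.find("remark").text    # entity expanded inside element content
print(opinion)
print(remark)
```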
Browser View 84
Unparsed Entities
<!DOCTYPE mdb [
  <!NOTATION gif SYSTEM "c:\Program Files\Netscape\Communicator\Program\Netscape.exe">
  <!ENTITY starpicture SYSTEM "http://www.cs.huji.ac.il/~dbi/figures/star.gif" NDATA gif>
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title, director, budget)>
  <!ATTLIST movie
    id ID #REQUIRED
    opinion CDATA #IMPLIED
    starimage ENTITY #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT budget (#PCDATA)>
]>
• Types are defined (the NOTATION declaration) • Entities are defined (the ENTITY declarations) 85
Data
<mdb>
  <movie id="ohgod" opinion="&bm;" starimage="starpicture">
    <title> Oh God! </title>
    <director> Woody Allen </director>
    <budget> $2 M </budget>
  </movie>
</mdb> 86
Parameter Entities • Parameter entities are used only within DTDs • They carry information for use in the markup declaration – Internal entities: references are within the DTD – External entities: references draw information from outside files • Parameter entity declaration: <!ENTITY % Name EntityDefinition> 87
Parameter Entity Example
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY % essential "name, tel*">
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT person (%essential;, email, advisor?)>
<!ATTLIST person
  friend (yes | no) #IMPLIED
  id ID #REQUIRED
  knows IDREFS #IMPLIED>
<!ELEMENT advisor (person)>
<!ELEMENT addresses (person)*> 88
Entities Definition • Local Definition:
<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact legal@worldspins.com.">
]>
• Global Definition:
<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright SYSTEM "http://www.worldspins.com/legal/copyright.xml">
]> 89
Example
<?xml version="1.0"?>
<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact legal@worldspins.com.">
  <!ENTITY trademark SYSTEM "http://www.worldspins.com/legal/trademark.xml">
]> 90
Example (cont’d)
<PRESSRELEASE>
  <HEAD> Mini-globe revolutionizes keychain industry </HEAD>
  <LEAD> Today As The World Spins introduces a new approach to key chains. With the new MINI-GLOBE keys can be kept inside a chain, called for upon demand, and stored safely. Never more will consumers lose a key or stand at a door flipping through a stack of keys seeking the right one. </LEAD>
  <LEGAL>&trademark; &copyright; </LEGAL>
</PRESSRELEASE> 91
Name Spaces • Namespaces are a way of using elements from more than one DTD within the same XML document • An XML namespace is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names • Declaring the namespace – identifying the namespaces used in the document at the beginning of the document 92
Example • Defining the used namespace: <document xmlns:dbi='http://www.cs.huji.ac.il/dbischema'> • Using a tag from the namespace: <dbi:A>This is a text of an element A according to dbi’s definition</dbi:A> • Using a tag not from the namespace: <A>This will probably be understood as an anchor</A> 93
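A parser keeps the two A elements apart by attaching the namespace URI to the qualified one. The sketch below assumes Python's standard xml.etree.ElementTree, which writes such names as {uri}local; the element texts are shortened for this example.

```python
import xml.etree.ElementTree as ET

DBI = "http://www.cs.huji.ac.il/dbischema"
doc = f"""<document xmlns:dbi="{DBI}">
  <dbi:A>defined by dbi</dbi:A>
  <A>no namespace</A>
</document>"""

root = ET.fromstring(doc)
tags = [child.tag for child in root]
print(tags)

# Lookups take a prefix-to-URI mapping; the prefix used here is local to the query.
qualified = root.find("dbi:A", {"dbi": DBI})
plain = root.find("A")
print(qualified.text, "|", plain.text)
```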
DTD’s 94
The Data File 95
The Data File: shorthands 96
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<container xmlns:bi="www.cs.technion.ac.il/~oshmu/container.dtd">
  <bi:bdb xmlns:bi="www.cs.technion.ac.il/~oshmu/nss.dtd">
    <bi:book>
      <title> Godzila </title>
      <author> Jeff Cohen </author>
    </bi:book>
    <bk:book xmlns:bk="www.cs.technion.ac.il/~oshmu/namespaces.dtd">
      <title> A Suitable Boy </title>
      <price currency="US Dollar"> 22.95 </price>
    </bk:book>
  </bi:bdb>
</container> 97
Using CDATA
<HEAD1> Entering a Kennel Club Member </HEAD1>
<DESCRIPTION> Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this: </DESCRIPTION>
<EXAMPLE> <![CDATA[<NAME common="freddy" breed="springerspaniel">Sir Fredrick of Ledyard's End</NAME>]]> </EXAMPLE>
• We want to see the text as is, even though it includes tags 98
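The effect of a CDATA section is that markup inside it reaches the application as plain characters. A short sketch assuming Python's standard xml.etree.ElementTree, with the dog's name shortened for this example:

```python
import xml.etree.ElementTree as ET

example = ET.fromstring(
    '<EXAMPLE><![CDATA[<NAME common="freddy" breed="springerspaniel">'
    'Sir Fredrick</NAME>]]></EXAMPLE>'
)

# The NAME "tag" is just text: EXAMPLE has no child elements.
print(len(example))
print(example.text)
```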
Summary • XML is a new data format. Its main virtues: – widespread acceptance – the (important) ability to handle semistructured data (data without schema) • DTDs provide some useful syntactic constraints on documents. As schemas they are weak • How to store large XML documents? • How to query them? • How to map between XML and other representations? 100