SGML and XML Text Encoding and Markup Languages

Michael Popham
michael.popham@oucs.ox.ac.uk

Overview (Welcome to acronym hell)
  • The Oxford Text Archive and Arts and Humanities Data Service
  • Markup languages
  • SGML: development and features
  • XML
  • Activity at the W3C
  • Why does all this matter?

Arts & Humanities Data Service
  • AHDS Executive: KCL
  • ADS: York
  • HDS: Essex
  • OTA: Oxford
  • PADS: Glasgow
  • VADS: Surrey Inst.
http://ahds.ac.uk

Markup languages
  • A markup language is a set of conventions governing the use of markup
  • These rules typically state
      - what kinds of markup are allowed or required
      - where they are allowed or required
      - how they relate to each other
      - how to distinguish markup from content (the text itself)

Is all markup interchangeable?
    <C1>Loomings
    \chapter[1]{Loomings}
    :h1.1. Loomings
    .chapter Loomings
    .cp; .sp 6a; .ce.bd 1. ~x Loomings
    <div type=chapter n=1><head>Loomings</head>
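
The last notation on this slide uses angle brackets to separate markup from content. As a minimal sketch, assuming Python's lenient `html.parser` tokenizer as a stand-in for a real SGML parser (it knows nothing about DTDs), the split can be made visible like this. The class name `MarkupSplitter` is illustrative only.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Collect markup events and character content in separate lists."""

    def __init__(self):
        super().__init__()
        self.markup = []   # tags and their attributes
        self.content = []  # the text itself

    def handle_starttag(self, tag, attrs):
        self.markup.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.markup.append(("end", tag, {}))

    def handle_data(self, data):
        self.content.append(data)


splitter = MarkupSplitter()
splitter.feed("<div type=chapter n=1><head>Loomings</head>")
splitter.close()

print(splitter.markup)   # the div and head tags with their attributes
print(splitter.content)  # the chapter title only
```

The other notations on the slide would each need their own rules for telling markup from text, which is the point of the question in the slide title.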

SGML = ISO 8879
  • An ISO standard for the definition of markup languages
  • Markup
      - a method of making explicit (and therefore processable) interpretations of a text
  • Markup language
      - a set of defined codes and rules for specifying markup

An SGML document
  • SGML Declaration (techie stuff)
  • Document Type Definition (DTD)
  • Document instance (document)
      - Elements
      - Attributes
      - Entities

Putting it all together
  • SGML Declaration: intended for “human” readers
  • DOCTYPE Declaration: + optional, local extensions
  • Document Instance: the text itself (content + markup)

SGML is a metalanguage
  • SGML/XML: ISO/W3C
  • DTD: A. N. Other
  • docs: Users

SGML DTDs
  • SGML
      - HTML: docs
      - ISO 12083: docs
      - TEI: docs

A newspaper story
  • Elements
      - A story consists of data fields, followed by a headline, and then paragraphs containing sentences of character data, names etc.
  • Attributes
      - It also has an identifier, a date, section etc.
  • Entities
      - Represent boilerplate info., special characters etc.
NB: we’re saying nothing about what the elements look like, only what they are

A simple(!) SGML DTD
    <!ELEMENT story - o ((%data;), title, p+)>
    <!ATTLIST story
              id      ID    #REQUIRED
              date    CDATA #REQUIRED
              section CDATA #IMPLIED>
    <!ELEMENT title - - (#PCDATA)>
    <!ELEMENT p - o ((#PCDATA|q|name)+)>
    <!ELEMENT name - - (#PCDATA)>
    <!ATTLIST name
              type (person|place|org|any) any
              reg  CDATA #IMPLIED>
    <!ENTITY % data "(author+, location?, keywords)">
    <!ELEMENT author - - (surname, firstname?)>
    <!ELEMENT surname - - (#PCDATA)>
    <!ELEMENT firstname - - (#PCDATA)>
    <!ENTITY Man.U "Manchester United">
    <!ENTITY SAF "Sir Alex Ferguson">
    …
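
A content model is in effect a regular expression over child element names. As a rough illustration only, the Python sketch below is a hand translation of the `story` content model above, with `%data;` expanded by hand. It is not a DTD validator, and `STORY_MODEL` and `matches_story_model` are made-up names.

```python
import re

# ((%data;), title, p+) with %data; expanded to (author+, location?, keywords).
# Each element name is followed by one space so names match as whole tokens.
STORY_MODEL = re.compile(r"(author )+(location )?keywords title (p )+")


def matches_story_model(child_names):
    """Return True if the sequence of child element names fits the story model."""
    sequence = "".join(name + " " for name in child_names)
    return STORY_MODEL.fullmatch(sequence) is not None


print(matches_story_model(["author", "location", "keywords", "title", "p", "p"]))
print(matches_story_model(["author", "keywords", "p"]))  # no title
```

The occurrence indicators carry over directly: `+` is one or more, `?` is optional, and the comma is strict sequence.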

An SGML instance
    <story id=7809 date=2000-02-22 section=sport>
    <data>
    <author><surname>Taylor</surname> <firstname>Daniel</firstname></author>
    <location>Manchester</location>
    <keywords>Beckham, Posh Spice, Manchester United, childcare, Sir Alex Ferguson</keywords>
    </data>
    <title>&ellipsis; but the spin may not wash with Ferguson</title>
    <p><name type="person" reg="Beckham.D">David Beckham</name>’s advisers claimed yesterday that he had <q>been given no reason whatsoever</q> for being banished from training and dropped from <name type="org" reg="Man.U">&Man.U;</name>’s first-team after incurring the wrath of his manager <name type="person" reg="Ferguson.A">&SAF;</name></p>
    <p>As <name type="person" reg="Beckham.D">Beckham</name> attempted to focus on…</p>
    </story>
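
To show what the markup buys you, here is a minimal Python sketch that pulls the tagged names out of the first paragraph. It rests on some assumptions: Python's standard library has no SGML parser, so the fragment is re-expressed as well-formed XML (quoted attribute values, the two entities declared in an internal subset), and the story is shortened. `names_by_type` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# XML adaptation of part of the slide's instance; not the original SGML.
STORY_XML = """<!DOCTYPE story [
  <!ENTITY Man.U "Manchester United">
  <!ENTITY SAF "Sir Alex Ferguson">
]>
<story id="7809" date="2000-02-22" section="sport">
  <title>but the spin may not wash with Ferguson</title>
  <p><name type="person" reg="Beckham.D">David Beckham</name>’s advisers
  claimed yesterday that he had <q>been given no reason whatsoever</q> for
  being banished from training and dropped from
  <name type="org" reg="Man.U">&Man.U;</name>’s first-team after incurring
  the wrath of his manager
  <name type="person" reg="Ferguson.A">&SAF;</name></p>
</story>"""


def names_by_type(xml_text):
    """Group (reg, text) pairs of every <name> element by its type attribute."""
    root = ET.fromstring(xml_text)
    found = {}
    for name in root.iter("name"):
        found.setdefault(name.get("type"), []).append((name.get("reg"), name.text))
    return found


story_names = names_by_type(STORY_XML)
print(story_names)
```

The parser expands `&Man.U;` and `&SAF;` from their declarations, so the boilerplate text is written once in the DTD and reused in the instance.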

The formatted view

Defining an Element
    <!ELEMENT p    - o ((#PCDATA|q|name)+)>
    <!ELEMENT name - - (#PCDATA)>
  • element name or GI: p, name
  • omissibility: - o, - -
  • content model: ((#PCDATA|q|name)+), (#PCDATA)
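
In an SGML element declaration the two omissibility indicators refer to the start tag and the end tag in that order: `-` means the tag is required and `o` means it may be omitted. The Python sketch below splits a declaration into the three parts labelled on the slide. It is a simplified regular expression written for these two declarations, not a full SGML declaration parser, and `parse_element_decl` is a made-up name.

```python
import re

ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<gi>\S+)\s+(?P<start>[-oO])\s+(?P<end>[-oO])\s+(?P<model>.+?)\s*>",
    re.S,
)


def parse_element_decl(decl):
    """Split an element declaration into name, omissibility and content model."""
    match = ELEMENT_DECL.fullmatch(decl.strip())
    if match is None:
        raise ValueError("not a recognised element declaration: " + decl)
    return {
        "gi": match["gi"],
        "start_tag_omissible": match["start"].lower() == "o",
        "end_tag_omissible": match["end"].lower() == "o",
        "content_model": match["model"],
    }


p_decl = parse_element_decl("<!ELEMENT p - o ((#PCDATA|q|name)+)>")
name_decl = parse_element_decl("<!ELEMENT name - - (#PCDATA) >")
print(p_decl)
print(name_decl)
```

So `p` must be opened explicitly but may be closed implicitly, while `name` needs both tags.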

Elements may take attributes
  • Providing information other than type or context
  • Useful for identification of element occurrences
  • Limited data validation
    <P><NAME TYPE="person" REG="Beckham.D">David Beckham</name>’s advisers claimed yesterday that he had… </P>
  • attribute name: TYPE, REG
  • attribute value: "person", "Beckham.D"
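
The "limited data validation" here is the kind the `name` attribute list in the DTD allows: `type` must be one of an enumerated set and defaults to `any`, while `reg` is free text and optional. A small Python sketch of that check follows; the function and constant names are hypothetical and a real parser would do this from the DTD itself.

```python
# Values taken from the slide's ATTLIST for name: type (person|place|org|any) any
NAME_TYPES = ("person", "place", "org", "any")
NAME_TYPE_DEFAULT = "any"


def effective_name_attrs(attrs):
    """Apply the declared default for type and reject values outside the list."""
    result = dict(attrs)
    result.setdefault("type", NAME_TYPE_DEFAULT)
    if result["type"] not in NAME_TYPES:
        raise ValueError("type must be one of " + ", ".join(NAME_TYPES))
    return result


print(effective_name_attrs({"type": "person", "reg": "Beckham.D"}))
print(effective_name_attrs({"reg": "Man.U"}))  # type falls back to the default
```

Anything stricter, such as checking that `reg` follows a naming pattern, is beyond what a CDATA attribute can express.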

Documents: another view
  • Documents are made up of entities
  • Entities are named units of storage, using an associated notation
  • Entities can be…
      - A single character or symbol (or a string of these)
      - Another file (e.g. text, image, sound, video etc.)
      - Something on the Web

Like HTML, XML must...
  • Be usable on the net (but not restricted to it!)
  • Support a wide variety of applications
  • Be compatible with SGML
  • Be easy to process
  • Have few optional features (ideally none)
  • Be human-legible and reasonably clear
  • Be specified in a way that is both formal and concise

Unlike HTML...
  • XML is an extensible markup language
  • XML markup can be verified
  • XML markup reflects the meaning of your data, not its appearance
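
Because the markup records what a piece of text is rather than how it looks, the same fragment can feed quite different outputs. The Python sketch below takes a paragraph adapted from the story example (attribute values quoted so that it is well-formed XML) and produces both a plain reading text and a list of the people and organisations mentioned. The function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    '<p>As <name type="person" reg="Beckham.D">Beckham</name> '
    "attempted to focus on…</p>"
)


def as_plain_text(paragraph):
    """Reading view: drop the markup and keep the words."""
    return "".join(paragraph.itertext())


def as_name_index(paragraph):
    """Index view: the regularised form of every name in the paragraph."""
    return sorted({name.get("reg") for name in paragraph.iter("name")})


paragraph = ET.fromstring(FRAGMENT)
print(as_plain_text(paragraph))
print(as_name_index(paragraph))
```

Neither function needs to know anything about fonts or layout; presentation is decided later, by whatever consumes the data.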

XML cf. SGML: differences
  • No tag omission/minimization
  • Properly delimited comments
  • No inclusions/exclusions
  • Mixed content models
      - optional-repeatable OR-groups with #PCDATA first
  • No & in content model groups
  • Simpler rules for handling whitespace
  • Empty tags use new syntax <empty/>
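
Some of these differences are easy to see with any XML parser. The Python sketch below uses the standard library's `xml.etree.ElementTree` to test a few short, made-up fragments for well-formedness. The fragments are illustrations of the rules, not text from the slides, and `is_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


examples = {
    "omitted end tags": "<story><p>one<p>two</story>",
    "explicit end tags": "<story><p>one</p><p>two</p></story>",
    "unquoted attribute": "<name type=person>Beckham</name>",
    "quoted attribute": '<name type="person">Beckham</name>',
    "old-style empty tag": "<doc><empty></doc>",
    "new-style empty tag": "<doc><empty/></doc>",
}
results = {label: is_well_formed(text) for label, text in examples.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "rejected")
```

An SGML parser with a suitable DTD could accept the first of each pair; an XML parser needs no DTD to reject them, which is a large part of why XML is easier to process.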

How do they really differ?
  • Pre-/Post- the success of the Web
  • Ease-of-implementation and use
  • Greater raw computing power on the desktop
  • “XML is what SGML should have been”
  • More tools, more books, easier to learn

XML Activity at W3C
  • XML Applications
      - Resource Description Framework (RDF), Synchronized Multimedia Integration Language (SMIL), XHTML
  • Extensible Stylesheet Language (XSL)
      - XSL Transformation Language, XSL Formatting Objects
  • XML Linking Language (XLink) and XML Pointer Language (XPointer)
  • XML Schema, namespaces

Why does this matter?
  • The XML revolution (hype?)
  • XML = big names
  • XML means application independence for your data
  • XML means shareable, reusable data
  • Improved data longevity(?)

Further information
  • The SGML/XML web page
      - http://www.oasis-open.org/cover/
  • W3C’s XML web page
      - http://www.w3.org/XML/
  • The Text Encoding Initiative
      - http://www.tei-c.org/
  • …and even
      - “XML: the future of web markup?” by Elliott Pritchard at http://panizzi.shef.ac.uk/elecdiss/edl0003/index.html