INTRODUCTION TO TEI TOMAŽ ERJAVEC DEPT. OF KNOWLEDGE TECHNOLOGIES JOŽEF STEFAN INSTITUTE LJUBLJANA, SLOVENIA

Overview
1. Introduction to text markup
2. What is TEI
3. Some examples of usage

What’s in a text?

What’s in a text (2)

What’s in a text (3)

The ontology of text
• Where is the text?
  • in the shape of letters and their layout?
  • in the original from which this copy derives?
  • in the ideas it brings forth? in their format, or their intentions?
• Texts are abstractions conjured up by readers.
• Markup encodes those abstractions.

Encoding of texts
• Texts are more than sequences of encoded characters
  • they have structure and content
  • they also have multiple readings
• Encoding, or markup, is a way of making these things explicit
• Only that which is explicit can be reliably processed

Some definitions
• Markup makes explicit the distinctions we want to make when processing a string of bytes
• Markup is a way of naming and characterizing the parts of a text in a formalized way
• It’s (usually) more useful to mark up what things are than what they look like

What does markup capture?
Compare

  <head>Upon Julia’s Clothes</head>
  <lg>
    <l>Whenas in silks my <hi>Julia</hi> goes,</l>
    <l>Then, then (me thinks) how sweetly flowes</l>
    <l>That liquefaction of her clothes.</l>
  </lg>

and

  <s n="1" role="head">
    <w type="pp">Upon</w>
    <w type="np">Julia</w><w type="pos">’s</w>
    <w type="nn2">Clothes</w>
  </s>
  <s n="2" role="line">
    <w type="adv">Whenas</w>
    <w type="pp">in</w>
    <w type="nn2">silks</w>...
  </s>
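The contrast can be made concrete with a short Python sketch (standard library only; an illustration that is not part of the original slides). It assumes the spacing inside the samples is normalised, wraps the first sample in a hypothetical <div> so that it has a single root, and uses variable names chosen only for this example. Each encoding answers only the questions it makes explicit: the first yields verse lines, the second part-of-speech tags.

```python
import xml.etree.ElementTree as ET

# First encoding: verse structure. The <div> wrapper is an assumption,
# added only so the fragment has a single root element.
verse = ET.fromstring(
    "<div><head>Upon Julia’s Clothes</head>"
    "<lg><l>Whenas in silks my <hi>Julia</hi> goes,</l>"
    "<l>Then, then (me thinks) how sweetly flowes</l>"
    "<l>That liquefaction of her clothes.</l></lg></div>"
)

# Second encoding: word-level linguistic annotation of the title.
words = ET.fromstring(
    '<s n="1" role="head"><w type="pp">Upon</w> <w type="np">Julia</w>'
    '<w type="pos">’s</w> <w type="nn2">Clothes</w></s>'
)

# Verse lines are explicit only in the first encoding.
verse_lines = ["".join(line.itertext()) for line in verse.iter("l")]

# Part-of-speech tags are explicit only in the second encoding.
pos_tags = [(w.text, w.get("type")) for w in words.iter("w")]

print(verse_lines)
print(pos_tags)
```

Asking the first encoding for <w> elements, or the second for <l> elements, returns nothing: what was not marked up cannot be retrieved reliably.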

What is the point of markup?
• To make explicit (to a machine) what is implicit (to a person)
• To add value by supplying multiple annotations
• To facilitate re-use of the same material
  • in different formats
  • in different contexts
  • for different users

XML
• XML is structured data represented as strings of text:
  • XML is extensible
  • XML must be well-formed
  • XML can be validated
  • XML is application-, platform-, and vendor-independent
  • XML empowers the content provider and facilitates data integration
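As a minimal sketch of what "well-formed" means in practice (not from the original slides), the following Python function uses the standard library parser, which rejects any input whose tags are not properly nested. The function name is hypothetical. Validation against a schema is a separate step that needs a schema-aware tool and is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags: well-formed.
print(is_well_formed("<l>Whenas in silks my <hi>Julia</hi> goes,</l>"))

# Overlapping tags (<hi> closed after <l>): not well-formed.
print(is_well_formed("<l>Whenas in silks my <hi>Julia</l></hi>"))
```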

Schema languages
• XML schemas are used to:
  • define the element and attribute vocabularies for particular text types
  • define content models for elements
  • define data types of attributes (and elements)
• Schemas can be written in:
  • XML DTD language
  • W3C Schema language
  • ISO Relax NG schema language
  • (TEI mostly uses Relax NG)

Developing schemas
• For simple annotations, one can define a project-specific schema from scratch
• But if the markup will be complicated, it is better to take one of the standard schemas
• Using standard schemas means:
  • better documentation
  • better interchange
  • better tool support
• There are many schemas around, but only one initiative deals with encoding arbitrary texts for scholarly purposes

Text Encoding Initiative
The TEI provides a framework for the definition of multiple XML schemas
• it defines and names several hundred useful textual distinctions
• it provides a set of modules that can be used to define schemas making those distinctions
• it provides a customization mechanism for modifying and combining those definitions with new ones using the same conceptual model

Where did the TEI come from?
• Originally, a research project within the humanities
  • Sponsored by three professional associations
  • Funded 1990-1994 by US, EU
• Major influences
  • digital libraries and text collections
  • language corpora
  • scholarly datasets
• International consortium established June 1999 (see http://www.tei-c.org/)

Goals of the TEI
• better interchange and integration of scholarly data
• support for all texts, in all languages, from all periods
• guidance for the perplexed: what to encode; hence, a user-driven codification of existing best practice
• assistance for the specialist: how to encode; hence, a loose framework into which unpredictable extensions can be fitted
These apparently incompatible goals result in a flexible and modular environment

TEI Guidelines
• A set of recommendations for text encoding, covering both generic text structures and some highly specific areas, based on (but not limited by) existing practice
• A very large collection of element definitions with associated declarations for various schema languages
• A modular system for creating personalized schemas from the foregoing
For the full picture see http://www.tei-c.org/Guidelines/

Legacy of the TEI
• a way of looking at what ‘text’ really is
• a codification of current scholarly practice
• (crucially) a set of shared assumptions and priorities about the digital agenda:
  • focus on content and function (rather than presentation)
  • identify generic solutions (rather than application-specific ones)

Users of TEI
• Over 100 projects listed on the TEI project page
• Main areas of use:
  • digital libraries
  • text-critical editions
  • computer corpora
  • dictionaries

Versions of the Guidelines
• TEI P3 (1994), first public version:
  • SGML + book (1200 pp) and soon also on the Web.
• TEI P4 (2002):
  • provides equal support for XML and SGML applications using the TEI scheme;
  • error correction, while maintaining backward compatibility: documents conforming to TEI P3 will not become illegal when processed with TEI P4.
• TEI P5 (2007):
  • implements more fundamental changes to the schemas, in line with current practice and identified problems, e.g. uses namespaces
  • no longer backward compatible with P3, P4
  • Relax NG becomes the main schema language
  • continuous improvement...
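One practical consequence of the move to namespaces in P5 can be sketched in Python (standard library only; the tiny document below is invented for illustration and is not from the slides). TEI P5 elements live in the namespace http://www.tei-c.org/ns/1.0, so queries that use bare element names, as one would for a P4 document, find nothing.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used in queries; the prefix "tei" is an
# arbitrary choice made for this example.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A minimal, invented P5-style document.
doc = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt>"
    "<publicationStmt><p>Unpublished</p></publicationStmt>"
    "<sourceDesc><p>Born digital</p></sourceDesc></fileDesc></teiHeader>"
    "<text><body><p>Hello, TEI.</p></body></text></TEI>"
)

# A bare element name does not match namespaced elements.
unqualified = doc.findall(".//p")

# The namespace-qualified name matches all three paragraphs.
qualified = doc.findall(".//tei:p", TEI_NS)

print(len(unqualified), len(qualified))
```

Tools and stylesheets written for unqualified P4 element names therefore need updating before they can process P5 documents.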

TEI modules
• TEI is too general to be supported by a single schema
• Rather, TEI is composed of modules, and which modules the user selects is determined by the project needs
• Some examples of modules:
  • Transcription of spoken texts
  • Dictionaries and lexica
  • Varieties of linguistic annotation
  • Nonstandard characters and glyphs
  • Linking, alignment, non-hierarchic structures
  • Detailed metadata (the TEI Header)
  • Manuscript description
  • Text-critical apparatus

Support offered by TEI
• Web interface to make XML schemas from a TEI parametrisation
• A set of XSLT stylesheets to convert TEI/XML to HTML or PDF
• Mailing list tei-l
• Various tutorials available from the TEI pages
• Yearly conference and members’ meeting

Examples of applications
Mostly work done by me in collaboration with other people and institutions:
• Annotated corpora
• Machine readable dictionaries
• Text-critical editions
• Biographical databases

JOS corpus

  <s xml:id="F0203.557.2">
    <w xml:id="F0203.557.2.1" lemma="ta" msd="Zk-sei">To</w><S/>
    <w xml:id="F0203.557.2.2" lemma="biti" msd="Gp-ste-n">je</w><S/>
    <term type="sloWNet" sortKey="kraj" key="ENG20-08114200-n">
      <w xml:id="F0203.557.2.3" lemma="turističen" msd="Ppnmein">turističen</w><S/>
      <w xml:id="F0203.557.2.4" lemma="kraj" msd="Somei">kraj</w>
    </term>
    <c xml:id="F0203.557.2.5">.</c><S/>
  </s>
  <linkGrp type="syntax" targFunc="head argument" corresp="#F0203.557.2">
    <link type="ena" targets="#F0203.557.2.2 #F0203.557.2.1"/>
    <link type="modra" targets="#F0203.557.2.2"/>
    <link type="dol" targets="#F0203.557.2.4 #F0203.557.2.3"/>
    <link type="dol" targets="#F0203.557.2.2 #F0203.557.2.4"/>
    <link type="modra" targets="#F0203.557.2.5"/>
  </linkGrp>
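A hedged sketch of how such stand-off syntactic links might be resolved (Python standard library; not the project's own tooling). It assumes the identifiers are written without the stray spaces, wraps the sample in a hypothetical <div> root, and reads targFunc="head argument" as "first target is the head, second is the argument". Links with a single target are skipped here. All variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <div> wrapper is an assumption so the sample has a single root.
sample = """<div>
<s xml:id="F0203.557.2">
  <w xml:id="F0203.557.2.1" lemma="ta" msd="Zk-sei">To</w><S/>
  <w xml:id="F0203.557.2.2" lemma="biti" msd="Gp-ste-n">je</w><S/>
  <term type="sloWNet" sortKey="kraj" key="ENG20-08114200-n">
    <w xml:id="F0203.557.2.3" lemma="turističen" msd="Ppnmein">turističen</w><S/>
    <w xml:id="F0203.557.2.4" lemma="kraj" msd="Somei">kraj</w>
  </term>
  <c xml:id="F0203.557.2.5">.</c><S/>
</s>
<linkGrp type="syntax" targFunc="head argument" corresp="#F0203.557.2">
  <link type="ena" targets="#F0203.557.2.2 #F0203.557.2.1"/>
  <link type="modra" targets="#F0203.557.2.2"/>
  <link type="dol" targets="#F0203.557.2.4 #F0203.557.2.3"/>
  <link type="dol" targets="#F0203.557.2.2 #F0203.557.2.4"/>
  <link type="modra" targets="#F0203.557.2.5"/>
</linkGrp>
</div>"""

root = ET.fromstring(sample)

# Index words and punctuation by their xml:id.
tokens = {el.get(XML_ID): el for el in root.iter() if el.tag in ("w", "c")}

# Resolve each two-target link to (relation, head form, argument form).
relations = []
for link in root.iter("link"):
    targets = [t.lstrip("#") for t in link.get("targets").split()]
    if len(targets) == 2:
        head, argument = targets
        relations.append((link.get("type"), tokens[head].text, tokens[argument].text))

for relation in relations:
    print(relation)
```

Keeping the links outside the sentence element means the word-level annotation stays a simple hierarchy while the dependency structure, which is not hierarchical in the same way, is expressed through identifiers.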

jaSlo dictionary

  <entry id="jaslo.55">
    <form type="hw">
      <orth type="roma">ainiku</orth>
      <orth type="kana">あいにく</orth>
      <orth type="kanji">生憎</orth>
    </form>
    <gramGrp>
      <pos>N/Ana/Adv</pos>
    </gramGrp>
    <trans>
      <tr>nesrečen</tr>
      <tr>nepričakovan</tr>
      <tr>nesluten</tr>
      <tr>žal</tr>
    </trans>
    <eg>
      <q>おあいにくさまです。</q>
      <tr>Žal mi je za vas.</tr>
    </eg>
    <usg type="level">2</usg>
  </entry>

eZISS text-critical editions

  <l>Na
    <app>
      <lem wit="#Drobt_1846">cesti</lem>
      <rdg wit="#UKM_123 #UKM_553">zeſti</rdg>
    </app>
    popotnik
    <app>
      <lem wit="#Drobt_1846">zdihuje. –</lem>
      <rdg wit="#UKM_123">sdihuje. <add>–</add></rdg>
      <rdg wit="#UKM_553">sdihuje</rdg>
    </app>
  </l>

Three readings:
  Drobt_1846: Na cesti popotnik zdihuje. –
  UKM_123: Na zeſti popotnik sdihuje. –
  UKM_553: Na zeſti popotnik sdihuje
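How the three readings follow from the apparatus markup can be sketched as below (Python standard library; a simplified illustration, not the eZISS software). The sketch assumes each <app> holds exactly one <lem> or <rdg> per witness and that whitespace between the variants is not significant; the function name reading_for is hypothetical.

```python
import xml.etree.ElementTree as ET

# The line from the slide, with inter-element whitespace inside <app> removed.
line = ET.fromstring(
    '<l>Na <app><lem wit="#Drobt_1846">cesti</lem>'
    '<rdg wit="#UKM_123 #UKM_553">zeſti</rdg></app> popotnik '
    '<app><lem wit="#Drobt_1846">zdihuje. –</lem>'
    '<rdg wit="#UKM_123">sdihuje. <add>–</add></rdg>'
    '<rdg wit="#UKM_553">sdihuje</rdg></app></l>'
)


def reading_for(line_el, witness):
    """Rebuild the text of one witness from a line containing <app> elements."""
    parts = [line_el.text or ""]
    for child in line_el:
        if child.tag == "app":
            # Pick the <lem> or <rdg> whose wit attribute lists this witness.
            for variant in child:
                if witness in variant.get("wit", "").split():
                    parts.append("".join(variant.itertext()))
                    break
        else:
            parts.append("".join(child.itertext()))
        # Text following the child belongs to every witness.
        parts.append(child.tail or "")
    return "".join(parts)


for witness in ("#Drobt_1846", "#UKM_123", "#UKM_553"):
    print(witness, reading_for(line, witness))
```

A single encoded line thus carries all witnesses at once, and each reading is a projection of it.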

SBL biographical database

  <person>
    <sex value="1"/>
    <persName xml:lang="lat">
      <forename>Johannes</forename>
      <surname>Aquila de <placeName>Rakerspurga</placeName></surname>
    </persName>
    <persName>
      <forename>Janez</forename>
      <surname>Akvila iz <placeName>Radgone</placeName></surname>
    </persName>
    <persName>
      <forename xml:lang="hun">János</forename>
      <surname>Aquila</surname>
    </persName>
    <occupation>slikar</occupation>
    <floruit notAfter="1392" notBefore="1378">
      <placeName>
        <region>Prekmurje</region>
      </placeName>
    </floruit>
  </person>

Conclusions
• Gave a brief introduction to TEI
• For more, visit the TEI web pages!