Phonetic characters in digital editions
Tomaž Erjavec (1) & Matija Ogrin (2)
tomaz.erjavec@ijs.si, matija.ogrin@zrc-sazu.si
(1) Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
(2) Institute of Slovenian Literature and Literary Sciences, Scientific Research Centre of the Slovenian Academy of Sciences and Arts, Ljubljana
SloFon, 21 April 2006

Overview of the talk
1. IPA
2. PUA
3. TEI

The problem
- provide standardised encoding (XML) and Web viewing (HTML) of complex digital editions
- in particular, the Freising manuscripts (e-BS)
- work in progress in the project “Scholarly Digital Editions of Slovenian Literature”, http://nl.ijs.si/e-zrc/

Focus of the talk
- e-BS, a very complex document: facsimile, commentary, diplomatic and critical transcriptions, translations, dictionary, bibliography, name index, …
- but also:
  - phonetic transcription in IPA
  - (recording)

HTML representation of e-BS phonetic transcription

IPA
- International Phonetic Alphabet (International Phonetic Association)
- contains poorly supported characters, e.g. ɐ, ɕ, ɚ, ɷ
- heavy use of diacritics:
  - unusual diacritical marks: ˀ ˒ ˤ
  - more than one diacritic: ǡ
  - diacritics spanning digraphs:
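To make the "more than one diacritic" point concrete, the following minimal sketch decomposes the example character ǡ into its base letter and combining marks using Python's standard unicodedata module. The helper name describe is only illustrative and not part of the talk.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for the canonically decomposed text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in decomposed]


# U+01E1 is the precomposed character used as the example on the slide.
for code_point, name in describe("\u01E1"):
    print(code_point, name)
# U+0061 LATIN SMALL LETTER A
# U+0307 COMBINING DOT ABOVE
# U+0304 COMBINING MACRON
```

A single visible character thus corresponds to one base letter plus two stacked combining marks, which is the situation fonts have to render correctly.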

Computer representation of IPA
SAMPA (for HLT)
- transliteration to ASCII
- SAMPA for contemporary Slovenian:
  - http://www.phon.ucl.ac.uk/home/sampa/sloven-uni.htm
  - ZEMLJAK, Melita, KAČIČ, Zdravko, DOBRIŠEK, Simon, ŽGANEC GROS, Jerneja, WEISS, Peter. Računalniški simbolni fonetični zapis slovenskega govora. Slav. rev., apr.-jun. 2002, 50/2, 159-169.
UNICODE (for humans)
- universal character set, better and better supported
- contains “IPA Extensions”, “Combining diacritical marks”
- various good Unicode IPA fonts available, e.g. Doulos SIL
- for non-standardised characters: Private Use Area (PUA)
- not to be used lightly!
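The Unicode areas named above can be told apart by code point range. The sketch below is illustrative only: the function name unicode_area is an assumption, and it checks just the three ranges relevant here (the Basic Multilingual Plane Private Use Area U+E000 to U+F8FF, IPA Extensions U+0250 to U+02AF, and Combining Diacritical Marks U+0300 to U+036F).

```python
def unicode_area(ch):
    """Classify a single character into one of the areas mentioned on the slide."""
    cp = ord(ch)
    if 0xE000 <= cp <= 0xF8FF:
        return "Private Use Area"
    if 0x0250 <= cp <= 0x02AF:
        return "IPA Extensions"
    if 0x0300 <= cp <= 0x036F:
        return "Combining Diacritical Marks"
    return "other"


# U+E31B is one of the ZRCola PUA code points discussed later in the talk.
for ch in ["\uE31B", "\u0250", "\u0307", "a"]:
    print(f"U+{ord(ch):04X}", unicode_area(ch))
```

A check like this is a cheap way to find every PUA character in a transcription before deciding how to encode it.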

Unicode definitions

ZRCola
- developed at ZRC SAZU (Peter Weiss)
- Unicode input system for linguistic use in the WinWord program:
  - decomposed and composed characters
  - keyboard input
  - font which covers historical characters as well as IPA & (now) some specifics of e-BS
- ideal for use in e-BS

ZRCola and PUA

Why PUA?
ZRCola font uses PUA mostly for
- defining new Slovene (related) historical characters
- composed characters with diacritics (+ digraphs), for better diacritic placement
- Unicode offers Combining diacritical marks, but complex stacks can cause problems for font rendering

Some comparisons
Each PUA character was shown in the ZRCola font next to its mapped standard Unicode sequence rendered in Times NR, MS Tahoma and Doulos SIL. The glyph renderings do not survive in text form; the code points and mappings are:

PUA code   Mapping to standard Unicode    Base text
E31B       0105+0307                      ą
EB25       r+0300+0329                    r
E35E       00E6+0303+0300                 æ
EEC8       t+j+032E (approximate)         tj
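The mappings listed in the comparison above can be applied mechanically. The following is a minimal sketch, not the project's actual tooling: the names PUA_TO_STANDARD, to_standard and unmapped_pua are hypothetical, and the table contains only the four code points given on the slide.

```python
# PUA code point -> standard Unicode sequence, as listed on the slide.
PUA_TO_STANDARD = {
    "\uE31B": "\u0105\u0307",        # a with ogonek + combining dot above
    "\uEB25": "r\u0300\u0329",       # r + combining grave + vertical line below
    "\uE35E": "\u00E6\u0303\u0300",  # ae + combining tilde + combining grave
    "\uEEC8": "tj\u032E",            # approximate mapping: t + j + breve below
}


def to_standard(text, table=PUA_TO_STANDARD):
    """Replace every mapped PUA character with its standard Unicode sequence."""
    return "".join(table.get(ch, ch) for ch in text)


def unmapped_pua(text, table=PUA_TO_STANDARD):
    """List PUA characters in text for which the table has no mapping."""
    return sorted({ch for ch in text
                   if 0xE000 <= ord(ch) <= 0xF8FF and ch not in table})


sample = "x\uE31By\uE656"
print([f"U+{ord(c):04X}" for c in to_standard(sample)])
print([f"U+{ord(c):04X}" for c in unmapped_pua(sample)])
```

Characters without a mapping are left untouched and reported separately, so nothing is silently lost when a text is converted.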

Problem
PUA = Private Use Area, but
- e-ZRC = standardised & interchangeable
- How to retain the benefits of ZRCola, yet make e-BS interchangeable?
- How to enable reading e-BS on platforms without the ZRCola font?

Text Encoding Initiative
- e-ZRC editions encoded in XML
- using the Text Encoding Initiative Guidelines, TEI P4
- TEI P5 makes provisions for encoding PUA characters and glyphs
- in TEI P4 user extensions are necessary to achieve the same effect

PUA in TEI P5
- TEI P5 chapter 25. Representation of non-standard characters and glyphs
- markup in text to identify PUA characters or glyphs
- link these elements to their TEI header definition
- TEI header can give, for each new character:
  - a name (text description a la Unicode), e.g. LATIN SMALL LETTER A
  - mapping to standard Unicode
  - character properties
- rendering software (e.g. XSLT stylesheet for conversion to HTML) can then use the PUA version, or the standard version

Markup in the document
- text (� marks a ZRCola PUA character that cannot be displayed without the ZRCola font):
    b�:ʒɛ g�:spɔdi miłɔstíwi �:t�ɛ b�:ʒɛ tɛbǽ ispɔwǽdæ
- in XML:
    <line n="2" id="bs.PT.1.002">
      b<g corresp="zrcola.E656"/>:ʒɛ g<g corresp="zrcola.E656"/>:spɔdi miłɔstíwi <g corresp="zrcola.E656"/>:t<g corresp="zrcola.EECC"/>ɛ b<g corresp="zrcola.E656"/>:ʒɛ tɛbǽ ispɔwǽdæ
    </line>

Markup in the header
PUA characters are defined in teiHeader/encodingDesc:

    <charDesc>
      <desc>PUA characters as defined by <xref url="http://zrcola.zrc-sazu.si/">ZRCola</xref>
        Character descriptions taken from and based on The Unicode Standard 4.1 U41M050317.lst
      </desc>
      <char id="zrcola.E31B">
        <charName>LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE</charName>
        <charProp><localName>font</localName><value>ZRCola</value></charProp>
        <charProp><localName>mapping</localName><value>exact</value></charProp>
        <mapping type="PUA">&#xE31B;</mapping>
        <mapping type="standard">&#x0105;</mapping>
      </char>
    </charDesc>
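A header definition of this kind is easy to read with generic XML tools. The sketch below uses Python's standard xml.etree.ElementTree on a copy of the single char entry from the slide; the function name read_char_desc and the shape of the returned dictionary are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The <char> entry from the slide, without the <desc> element.
HEADER = """<charDesc>
  <char id="zrcola.E31B">
    <charName>LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE</charName>
    <charProp><localName>font</localName><value>ZRCola</value></charProp>
    <charProp><localName>mapping</localName><value>exact</value></charProp>
    <mapping type="PUA">&#xE31B;</mapping>
    <mapping type="standard">&#x0105;</mapping>
  </char>
</charDesc>"""


def read_char_desc(xml_text):
    """Build {char id: {"name", "props", "mappings"}} from a charDesc element."""
    table = {}
    for char in ET.fromstring(xml_text).iter("char"):
        props = {p.findtext("localName"): p.findtext("value")
                 for p in char.iter("charProp")}
        mappings = {m.get("type"): (m.text or "").strip()
                    for m in char.iter("mapping")}
        table[char.get("id")] = {
            "name": char.findtext("charName"),
            "props": props,
            "mappings": mappings,
        }
    return table


chars = read_char_desc(HEADER)
entry = chars["zrcola.E31B"]
print(entry["name"], entry["props"]["font"])
print({k: f"U+{ord(v):04X}" for k, v in entry["mappings"].items()})
```

The resulting table is keyed by the same identifiers that the g elements in the text point to through their corresp attribute.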

Standardisation of ZRCola PUA
- ZRCola very well documented “visually”, i.e. for humans
- but lacking machine processable meta-data:
  - Unicode compliant name
  - mapping to standard Unicode (identity, similarity)
- we only implemented 50+ characters that actually appear in e-BS
- substantial work to describe all PUA characters in ZRCola distribution
- maybe better to abandon the precomposed PUA characters that can be expressed in standard Unicode?

PUA display with ZRCola

PUA display without ZRCola

Documentation

Mapping to Unicode, Doulos SIL font

TEI to HTML

    <xsl:template match="g">
      <!-- pick the PUA or the standard mapping, depending on $ENCODING -->
      <xsl:variable name="glyph" select="id(@corresp)/mapping[@type=$ENCODING]"/>
      <SPAN>
        <!-- for standard output, record the mapping quality (e.g. exact) as the class -->
        <xsl:if test="$ENCODING = 'standard'">
          <xsl:attribute name="class">
            <xsl:value-of select="id(@corresp)/charProp[localName='mapping']/value"/>
          </xsl:attribute>
        </xsl:if>
        <!-- tooltip: font name, then the Unicode-style character name -->
        <xsl:attribute name="title">
          <xsl:value-of select="id(@corresp)/charProp[localName='font']/value"/>
          <xsl:text>: </xsl:text>
          <xsl:value-of select="id(@corresp)/charName"/>
        </xsl:attribute>
        <xsl:value-of select="$glyph"/>
      </SPAN>
    </xsl:template>
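For readers less familiar with XSLT, the same logic can be sketched in Python. This is an illustrative re-statement under stated assumptions, not the stylesheet used in the project: the CHARS table, the functions render_g and render_line, and the sample line are hypothetical, and the output uses a lowercase span element.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical lookup table holding the one character defined in the header slide.
CHARS = {
    "zrcola.E31B": {
        "name": "LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE",
        "font": "ZRCola",
        "mapping": "exact",
        "PUA": "\uE31B",
        "standard": "\u0105",
    },
}


def render_g(corresp, encoding, chars=CHARS):
    """Render one <g> element as an HTML span, in 'PUA' or 'standard' encoding."""
    entry = chars[corresp]
    attrs = ""
    if encoding == "standard":
        # As in the stylesheet, the class is only set for standard output.
        attrs += f' class="{html.escape(entry["mapping"])}"'
    title = f'{entry["font"]}: {entry["name"]}'
    attrs += f' title="{html.escape(title)}"'
    return f"<span{attrs}>{html.escape(entry[encoding])}</span>"


def render_line(xml_text, encoding):
    """Render a <line> with embedded <g> elements as an HTML fragment."""
    line = ET.fromstring(xml_text)
    parts = [html.escape(line.text or "")]
    for child in line:
        if child.tag == "g":
            parts.append(render_g(child.get("corresp"), encoding))
        parts.append(html.escape(child.tail or ""))
    return "".join(parts)


# Made-up sample line, not taken from e-BS.
print(render_line('<line>b<g corresp="zrcola.E31B"/>c</line>', "standard"))
```

Switching the encoding argument between "PUA" and "standard" corresponds to the $ENCODING parameter of the stylesheet: the same source yields either ZRCola output or output readable without the font.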

Conclusions
- introduced IPA, PUA & TEI
- showed how PUA characters can be, via TEI, made
  - interchangeable
  - documented
  - flexibly presented
- this does require investment of time by the designers of PUA characters