Types and Tokens Distribution in TITUS
Russian title: Распределение словоформ в корпусе TITUS ("Distribution of Word Forms in the TITUS Corpus")
Dr. Svetlana Ahlborn
Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main
E-Mail: l.ahlborn@em.uni-frankfurt.de
Outline
• TITUS Resource Data
• Peculiarities of TITUS texts
• Tokens and Types calculation in TITUS Resources
• Metadata for Tokens and Types distribution
Corpus Linguistics 2013 (Корпусная лингвистика 2013), 26.06.2013
Resource Data
• TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien): http://titus.uni-frankfurt.de
• TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens.
• A token is a concrete occurrence of a linguistic unit; a type groups together all tokens that are occurrences of the same unit.
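The token/type distinction can be sketched in a few lines of Python. This is only an illustration: it assumes that word forms are separated by whitespace and that two tokens belong to the same type when their spelling is identical. The function name and the Latin sample sentence are hypothetical and not taken from the TITUS tools.

```python
from collections import Counter
from typing import Tuple


def count_tokens_and_types(text: str) -> Tuple[int, int]:
    """Return (tokens, types) for whitespace-separated word forms."""
    tokens = text.split()      # every occurrence counts as one token
    types = Counter(tokens)    # identical word forms collapse into one type
    return len(tokens), len(types)


# Illustrative sample: "erat" and "verbum" each occur twice.
sample = "in principio erat verbum et verbum erat apud deum"
print(count_tokens_and_types(sample))  # (9, 7)
```

Whether forms that differ only in case or spelling variants should share a type is a separate decision that this sketch does not make.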
Data
http://www.clarin.eu/node/1512
Added by J. Gippert, R. Mittmann
Search Engine
• The TITUS Search Engine reports the number of citations of a word, not the number of tokens in a particular text.
Peculiarities of TITUS texts: Gothic
• The Biblia Gothica contains additional parallel passages in Latin and Greek.
Biblia Gothica (http://titus.uni-frankfurt.de/texte/etcs/germ/gotnt/gotnt.htm)
Peculiarities of TITUS texts: Old Church Slavonic
• Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet (the original form of the text) and in Cyrillic.
Codex Marianus (http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm)
Peculiarities of TITUS texts: Old Polish
• Old Polish texts display side by side editions that were produced at different times.
Kazania Świętokrzyskie (http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm)
Peculiarities of TITUS texts: Ossetian
• The Ossetian Nart epic is represented in Latin script and in extended Cyrillic.
Ossetian: Nart epic (http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm)
Peculiarities of TITUS texts: Russian-Low German
• Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language varieties.
Peculiarities of TITUS texts: Old Prussian
• The Old Prussian corpus consists of at least 21 different languages or language varieties (including Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, and Old High German).
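Because a single TITUS text can mix several languages or scripts, counts are more useful per language than per file. The following sketch assumes that every token has already been labelled with its language; how TITUS assigns these labels is not described on the slides, and the function name and sample pairs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def counts_by_language(
    tagged_tokens: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[int, int]]:
    """Map each language to (tokens, types) for (language, word form) pairs."""
    token_counts = defaultdict(int)
    type_sets = defaultdict(set)
    for language, form in tagged_tokens:
        token_counts[language] += 1      # every occurrence is a token
        type_sets[language].add(form)    # distinct forms per language are types
    return {lang: (token_counts[lang], len(type_sets[lang])) for lang in token_counts}


# Hypothetical language-tagged sample
sample = [
    ("Latin", "et"), ("Latin", "verbum"), ("Latin", "et"),
    ("Gothic", "jah"), ("Gothic", "waurd"),
]
print(counts_by_language(sample))  # {'Latin': (3, 2), 'Gothic': (2, 2)}
```

Types are kept per language here, so the same spelling in two languages counts as two types.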
Creation
• A digitized source consists not only of words in the source language; it also contains various information that did not originally belong to the document: numbers, tags, punctuation marks, edition information, etc. Such material is stripped before counting, for example:
# remove optional digits, whitespace, and the marker <†‡„>
$zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi; #<†‡„>
# remove optional digits, whitespace, and the conversion warning
$zeile =~ s/\d*\s+<\W<?ConvertCheck:\s+LevelNameTooLong>//g; #<?ConvertCheck: LevelNameTooLong>
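The same cleaning idea, sketched in Python for readers who do not use Perl. The three patterns are assumptions chosen for illustration (generic tags, digit runs, and anything that is neither a word character nor whitespace); they are not the substitution rules actually used for TITUS, which target specific markers such as those shown on the slide.

```python
import re
import unicodedata

# Hypothetical, deliberately crude patterns
TAG = re.compile(r"<[^>]*>")          # markup such as <b> or <?...>
NUMBER = re.compile(r"\d+")           # verse, line, or page numbers
PUNCTUATION = re.compile(r"[^\w\s]")  # everything that is not a letter, digit, "_" or space


def clean_line(line: str) -> str:
    """Strip tags, numbers, and punctuation, leaving space-separated word forms."""
    line = unicodedata.normalize("NFC", line)  # compose letters with diacritics where possible
    line = TAG.sub(" ", line)
    line = NUMBER.sub(" ", line)
    line = PUNCTUATION.sub(" ", line)
    return " ".join(line.split())              # collapse runs of whitespace


print(clean_line("12 <b>jah</b> qaþ: atta,"))  # jah qaþ atta
```

One caveat: combining diacritics that have no precomposed form are not word characters for Python's `re` module and would be removed by the last pattern, so texts that rely on them need a more careful rule.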
Examples: Gothic Bible. Old Testament Fragments.
Total: 1629 tokens and 893 types
Language   Tokens   Types
Gothic        420     240
Latin         572     325
Greek         627     319
Examples: Gothic Bible. New Testament Books.
Total: 170215 tokens and 28876 types
Language   Tokens   Types
Gothic      61167    9121
Latin       52648    9036
Greek       56400   10719
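The figures on the two Gothic Bible slides can be cross-checked with a short script. The numbers are those given in the presentation; the "General" row for the Old Testament fragments (10 tokens, 9 types) is taken from the XML metadata example later in the deck. The check suggests that the overall type count is the sum of the per-language type counts, although the slides do not state this explicitly.

```python
# Per-language (tokens, types) as given in the presentation
old_testament = {
    "General": (10, 9),   # from the XML metadata example
    "Gothic": (420, 240),
    "Latin": (572, 325),
    "Greek": (627, 319),
}
new_testament = {
    "Gothic": (61167, 9121),
    "Latin": (52648, 9036),
    "Greek": (56400, 10719),
}


def totals(rows):
    """Sum tokens and types over all languages."""
    tokens = sum(t for t, _ in rows.values())
    types = sum(ty for _, ty in rows.values())
    return tokens, types


print(totals(old_testament))  # (1629, 893)
print(totals(new_testament))  # (170215, 28876)
```

Without the "General" row, the three Old Testament languages add up to 1619 tokens and 884 types, which is why that row is included here.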
Examples: Tönnies Fenne's Manual (17th century)
The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.
Examples: further application
Metadata
• DC – Dublin Core
• TEI – Text Encoding Initiative
• CEI – Corpus Encoding Initiative
• IMDI – ISLE Meta Data Initiative
• OLAC – Open Language Archives Community
• CMDI – Component MetaData Infrastructure
CMDI – Component MetaData Infrastructure
http://www.clarin.eu/cmdi
Metadata: HTML Format
<HEAD>
<TITLE>TITUS Texts: Biblia gothica: Frame</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META NAME="Author" CONTENT="Jost Gippert">
<META NAME="Description" CONTENT="TITUS: Texts: Biblia gothica: Frame">
<META NAME="KeyWords" CONTENT="TITUS Texte Texts Biblia gothica">
</HEAD>
New Metadata Set for TITUS
* Name
* Author
* ProjectContactName
* ProjectContactAddress
* ProjectContactEmail
* ProjectContactOrganisation
* ProjectDescription
* ResourceLanguage
* ResourceLink
* ResourceAccessAvailability
* ResourceAccessDate
* ResourceAccessOwner
* ResourceAccessPublisher
* ResourcePublicationTimeOriginalManuscript
* ResourcePublicationTimeOriginalFacsimile
* ResourcePublicationTimeOriginalPublished
* ResourcePublicationTimeElectronic
* ResourceWordcountGeneralTokens
* ResourceWordcountGeneralTypes
* ResourceWordcountLanguageTokens
* ResourceWordcountLanguageTypes
* ResourceMetadataEncoding
Status markers shown beside the fields (the assignment to individual fields is not preserved in this transcript): existing, new, existing, existing, new, existing, existing, new, new, existing, new (CLARIN), new, new
Metadata Example for TITUS – XML CMDI
<ResourcePublicationTimeElectronic>16.6.2002</ResourcePublicationTimeElectronic>
<ResourceWordcountGeneral>
  <Tokens>1629 Tokens</Tokens>
  <Types>893 Types</Types>
</ResourceWordcountGeneral>
<ResourceWordcountTT>
  <Language></Language>
  <LanguageTokensTypes> Tokens | Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 1_General</Language>
  <LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 2_Gothic</Language>
  <LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 4_Latin</Language>
  <LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 5_Greek</Language>
  <LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes>
</ResourceWordcountTT>
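A metadata record of this shape can be read back with Python's standard library. The sketch below is hedged in several ways: the element names are written without the spaces found in the extracted slide text, the fragment is wrapped in an artificial `<root>` element because it has several top-level elements, and the "N Tokens | M Types" string format is inferred from the example rather than from a published schema.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the slide example; element spelling is an assumption
fragment = """
<ResourceWordcountTT>
  <Language></Language>
  <LanguageTokensTypes> Tokens | Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 2_Gothic</Language>
  <LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 4_Latin</Language>
  <LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes>
</ResourceWordcountTT>
"""

COUNTS = re.compile(r"(\d+)\s*Tokens\s*\|\s*(\d+)\s*Types")


def language_counts(xml_fragment: str) -> dict:
    """Return {language label: (tokens, types)}, skipping empty template entries."""
    root = ET.fromstring(f"<root>{xml_fragment}</root>")  # artificial wrapper element
    result = {}
    for entry in root.iter("ResourceWordcountTT"):
        language = (entry.findtext("Language") or "").strip()
        match = COUNTS.search(entry.findtext("LanguageTokensTypes") or "")
        if language and match:
            result[language] = (int(match.group(1)), int(match.group(2)))
    return result


print(language_counts(fragment))
# {'Language 2_Gothic': (420, 240), 'Language 4_Latin': (572, 325)}
```

Storing the two counts as separate numeric elements would make the regular expression unnecessary, but the sketch follows the string format shown in the example.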
Metadata for TITUS – Browser
Thank you for your attention!
Links
• ARBIL (metadata editor): http://tla.mpi.nl/tools/tla-tools/arbil/
• CLARIN: http://www.clarin.eu
• CMDI: http://www.clarin.eu/cmdi
• Dublin Core: http://dublincore.org/documents/dcmi-terms/
• IMDI: http://www.mpi.nl/IMDI/
• OLAC: http://www.language-archives.org/
• TEI: http://www.tei-c.org/index.xml
• TITUS: http://titus.uni-frankfurt.de
Old Prussian Corpus
Tokens General: 17662 tokens
Types General: 8390 types