The FIDA & MULTEXT-East language resources
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Gralis 2006, Institut für Slawistik der Universität Graz, 2006-05-09

Overview
1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East European languages
4. Other language resources for Slovene

Language Resources
• LRs comprise three layers of data:
  – corpora: mono- or multilingual, reference or specialised, … (variously annotated)
  – lexica: vocabularies, morphosyntactic, semantic, (ontologies)
  – standards: linguistic and technical encoding
• LRs, esp. corpora, are used for empirical language research:
  – linguistic studies: (annotated) corpus + (sophisticated) search engine
  – human language technology R&D: testing and training datasets

Part I. The FIDA corpus
• Slovene reference corpus for linguistic studies

FIDA http://www.fida.net/
Joint project (1997-2000) of
• Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
• Institut Jožef Stefan: Tomaž Erjavec
• DZS: Simon Krek
• Amebis: Peter Holozan, Miro Romih
Financed by industry partners

Characteristics of FIDA
• monolingual
• synchronic
• written language
• reference
  – representative
  – balanced
• annotated

Sizes
Total: 103,513,072 words, 29,177 texts
Avg. text length: 3,548 words
Largest texts: Leksikon DZS: 508,370 words; 69 texts > 100,000 words
Smallest texts: 2,648 texts < 100 words; 2 x <w>rezgrtshdrghgth 4</w>

Time Composition
• Oldest/most recent text: 1989/2000
• Average date: 1997-02
• Texts/words with unknown date: 3.94%/8.28%

FIDA taxonomy: publication types
…
Ft.P.P.O (published)
Ft.P.P.O.K (books)
Ft.P.P.O.P (periodicals)
Ft.P.P.O.P.C (newspaper)
Ft.P.P.O.P.C.D (daily)
Ft.P.P.O.P.C.T (weekly)
Ft.P.P.O.P.C.V (multi-weekly)
…
Percentages: 95.72%, 22.71%, 70.50%, 46.59%, 32.67%, 66.18%, 17.74%

FIDA taxonomy: text types
Ft.Z (text type)
Ft.Z.N (non-fiction)
Ft.Z.N.N (non-professional)
Ft.Z.N.S (professional)
Ft.Z.N.S.H (hum. & soc. sci.)
Ft.Z.N.S.N (nat. & tech. sci.)
Ft.Z.U (fiction)
Ft.Z.U.D (drama)
Ft.Z.U.P (poetry)
Ft.Z.U.R (prose)
Percentages: 99.47%, 93.57%, 75.14%, 18.37%, 10.57%, 6.04%, 5.90%, 0.17%, 5.12%

Markup of FIDA
• corpus elements annotated with metadata (bibliographic, taxonomy)
• text linguistically annotated
• encoded according to international standards and recommendations
  – technical: SGML, TEI P3
  – linguistic: MULTEXT-East (MULTEXT, EAGLES)

Linguistic annotation

Accessibility
Exploitation by partners:
  – DZS: new dictionaries
  – Amebis: development of HLT
  – Arts faculty: teaching
  – IJS: research on HLT
Availability to the public:
  – access via concordance engine by Amebis
  – free access, but displays only a few hits
  – possibility of academic licences
FIDA (web site) no longer maintained!

FIDA+ http://www.fidaplus.net/
• FIDA Plus project:
  – Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
  – DZS, Amebis
• Financed by the ministry + industry partners
• Extend the corpus with
  – Web materials
  – spoken component
• Better linguistic markup
• Free concordances: up to 100 lines
• Also possibility of licences

Concordancer

Output

Extended searches

Corpus “Nova Beseda” http://bos.zrc-sazu.si/
• being developed at the Institute for Slovene Language, ZRC SAZU (Primož Jakopin)
• Web concordancer with no hit limit
• now larger than FIDA
• but much less varied: fiction, Delo, DZ
• not linguistically annotated

Part II. MULTEXT-East
• multilingual morphosyntactic resources for HLT development

MULTEXT-East resources
• MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
• Based on the results of EU MULTEXT (~West)
• To produce a harmonised BLARK for six languages:
  – corpus encoding standardisation (TEI / CES)
  – multilingual parallel, comparable, speech corpora
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – (inflectional) lexicon
  – annotated corpus
  – language processing tools

History of MULTEXT-East resources
• First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
• Resources since 1998 available on the Web: http://nl.ijs.si/ME/
• Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
• Third release 2004: merge of first two releases, further languages
• Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects

The Languages of MULTEXT-East
Slavic:
• Russian (East Slavic)
• Czech (West Slavic)
• Slovene (South West Slavic)
• Resian (Slovene dialect)
• Croatian (South West Slavic)
• Serbian (South West Slavic)
• Bulgarian (South East Slavic)
Germanic:
• English
Romance:
• Romanian
Baltic:
  – Latvian
  – Lithuanian
Finno-Ugric:
  – Estonian
  – Hungarian
In progress:
• Macedonian
• Persian

Version 3
• Available on http://nl.ijs.si/ME/V3/
• Some parts completely free, others free for research (Web licence)
• Web pages give:
  – extensive documentation
  – bibliography list
  – web licence form
  – resource download

The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications
• Based on EAGLES / MULTEXT
• Define PoS, their attributes and values
• The specs are a document containing:
  – introduction
  – common tables
  – language particular sections
• Written in LaTeX; PDF & HTML
• Derived XML/TEI encoding as feature structures

Example common table

Example language specific section
• table (shows only categories actually used)
• notes
• combinations
• lexicon
• for Slovene (FIDA): localisation of category names

Morphosyntactic Complexity

2. The lexica
• Medium size morphosyntactic lexica
• Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
• ~ all word-forms of ca. 15,000 lemmas
• Lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base-form of the word
  – the morphosyntactic description (MSD)
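The three-field entry described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the fields are assumed to be tab-separated, an `=` in the lemma field is assumed to stand for "lemma identical to the word-form", and the names `LexEntry` and `parse_entry` are hypothetical. The sample lines are illustrative, not copied from the distributed lexicon.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    """One lexicon entry: inflected form, base form, morphosyntactic description."""
    wordform: str
    lemma: str
    msd: str


def parse_entry(line: str) -> LexEntry:
    """Split one lexicon line into its three fields (assumed tab-separated)."""
    wordform, lemma, msd = line.rstrip("\n").split("\t")
    # Assumption: "=" abbreviates a lemma that is identical to the word-form.
    if lemma == "=":
        lemma = wordform
    return LexEntry(wordform, lemma, msd)


if __name__ == "__main__":
    # Illustrative lines only.
    for raw in ["abeceda\t=\tNcfsn", "abecedah\tabeceda\tNcfdl"]:
        print(parse_entry(raw))
```

Keeping the entry as a named tuple makes it easy to group word-forms by lemma or to count distinct MSDs per language.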

Example: Slovene lexicon
Word-forms: abeceda, abecedah, abecedama, abecedami, abecede, abecedi, …
Lemma: abeceda (written "=" where the word-form equals the lemma)
MSDs: Ncfdg, Ncfpg, Ncfsn, Ncfdl, Ncfpd, Ncfdi, Ncfpa, Ncfpn, Ncfsg, Ncfda, Ncfdn
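An MSD such as Ncfsn is a positional code: the first character gives the part of speech and each following character the value of one attribute. The sketch below decodes noun MSDs only. The attribute table is a hand-written, partial mapping given here as an assumption (Type, Gender, Number, Case in that order); the authoritative tables are in the morphosyntactic specifications, and `decode_noun_msd` is a hypothetical helper name.

```python
# Partial, hand-written attribute table for nouns (an assumption, not the
# official specification): position 1 = Type, 2 = Gender, 3 = Number, 4 = Case.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a noun MSD into attribute/value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"PoS": "Noun"}
    # zip stops at the shorter sequence, so extra positions are ignored.
    for code, (attribute, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code != "-":  # "-" marks an attribute that does not apply
            # Unknown codes are kept verbatim rather than guessed.
            features[attribute] = values.get(code, code)
    return features


if __name__ == "__main__":
    print(decode_noun_msd("Ncfsn"))
```

Under this mapping Ncfsn reads as a common feminine noun in the nominative singular, and Ncfdl as the same kind of noun in the locative dual.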

Lexicon sizes

3. The “1984” corpus
• Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
• Structurally annotated
• Sentence aligned with English
• Words annotated with lemma and MSD
• Encoded in TEI P4 (XML)

Example linguistic encoding
Sentence alignment & context-disambiguated lemmas and MSDs
<text id="Osl." lang="sl">
 <body>
  <div type="part" id="Osl.1">
   <div type="chapter" id="Osl.1.2">
    <p id="Osl.1.2.2">
     <s id="Osl.1.2.2.1">
      <w lemma="biti" ana="Vcps-sma">Bil</w>
      <w lemma="biti" ana="Vcip3s--n">je</w>
      <w lemma="jasen" ana="Afpmsnn">jasen</w>
      <c>,</c>
      <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
      <w lemma="aprilski" ana="Aopmsn">aprilski</w>
      <w lemma="dan" ana="Ncmsn">dan</w>
      <w lemma="in" ana="Ccs">in</w>
      <w lemma="ura" ana="Ncfpn">ure</w>
      <w lemma="biti" ana="Vcip3p--n">so</w>
      <w lemma="biti" ana="Vmps-pfa">bile</w>
      <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
      <c>.</c>
     </s>
     …
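Annotation of this kind can be read with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; it assumes a single, well-formed `<s>` element as input (the sample is a shortened copy of the sentence above), and the function name `tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, self-contained copy of the example sentence.
SAMPLE = """<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""


def tokens(sentence_xml: str) -> list:
    """Return (form, lemma, MSD) triples; punctuation gets None for lemma and MSD."""
    sentence = ET.fromstring(sentence_xml)
    result = []
    for element in sentence:
        form = (element.text or "").strip()
        if element.tag == "w":    # word token with lemma and MSD attributes
            result.append((form, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":  # punctuation token
            result.append((form, None, None))
    return result


if __name__ == "__main__":
    for form, lemma, msd in tokens(SAMPLE):
        print(form, lemma, msd)
```

The resulting triples are the usual starting point for training or testing a PoS tagger or lemmatizer on the corpus.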

Quantifying the corpus

Utility of MULTEXT-East LRs
• Specifications became, for some, the “national” standard
• Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
• A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
• Teaching aid in HLT courses
• ~ 100 registered users
• As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

LRs @ JSI http://nl.ijs.si/nl.html#Resource
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS

Overview of Slovene LRs and services @ Slovenian Language Technologies Society
http://nl.ijs.si/sdjt/

Thank you!