The FIDA & MULTEXT-East language resources
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Gralis 2006, Institut für Slawistik der Universität Graz, 2006-05-09

Overview
1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East European languages
4. Other language resources for Slovene

Language Resources
• LRs comprise three layers of data:
  – corpora: mono- or multilingual, reference or specialised, … (variously annotated)
  – lexica: vocabularies, morphosyntactic, semantic (ontologies)
  – standards: linguistic and technical encoding
• LRs, esp. corpora, are used for empirical language research:
  – linguistic studies: (annotated) corpus + (sophisticated) search engine
  – human language technology R&D: testing and training datasets

Part I. The FIDA corpus
• Slovene reference corpus for linguistic studies

FIDA
http://www.fida.net/
Joint project (1997-2000) of:
• Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
• Institut Jožef Stefan: Tomaž Erjavec
• DZS: Simon Krek
• Amebis: Peter Holozan, Miro Romih
Financed by the industry partners

Characteristics of FIDA
• monolingual
• synchronous
• written language
• reference
  – representative
  – balanced
• annotated

Sizes
Total: 103,513,072 words in 29,177 texts
Avg. text length: 3,548 words
Largest texts: Leksikon DZS with 508,370 words; 69 texts > 100,000 words
Smallest texts: 2,648 texts < 100 words; 2 x <w>rezgrtshdrghgth 4</w>
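The average text length follows directly from the two totals on this slide. A minimal arithmetic check in Python (the variable names are illustrative, not part of the original slides):

```python
# Totals as reported on the "Sizes" slide.
total_words = 103_513_072
total_texts = 29_177

# Mean number of words per text, rounded to the nearest whole word.
average_length = round(total_words / total_texts)
print(average_length)  # 3548
```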

Time Composition
• Oldest/most recent text: 1989/2000
• Average date: 1997-02
• Texts/words with unknown date: 3.94%/8.28%

FIDA taxonomy: publication types
…
Ft.P.P.O (published)
Ft.P.P.O.K (books)
Ft.P.P.O.P (periodicals)
Ft.P.P.O.P.C (newspaper)
Ft.P.P.O.P.C.D (daily)
Ft.P.P.O.P.C.T (weekly)
Ft.P.P.O.P.C.V (multi-weekly)
…
Percentages listed on the slide: 95.72%, 22.71%, 70.50%, 46.59%, 32.67%, 66.18%, 17.74%

FIDA taxonomy: text types
Ft.Z (text type)
Ft.Z.N (non-fiction)
Ft.Z.N.N (non-professional)
Ft.Z.N.S (professional)
Ft.Z.N.S.H (hum. & soc. sci.)
Ft.Z.N.S.N (nat. & tech. sci.)
Ft.Z.U (fiction)
Ft.Z.U.D (drama)
Ft.Z.U.P (poetry)
Ft.Z.U.R (prose)
Percentages listed on the slide: 99.47%, 93.57%, 75.14%, 18.37%, 10.57%, 6.04%, 5.90%, 0.17%, 5.12%

Markup of FIDA
• corpus elements annotated with metadata (bibliographic, taxonomy)
• text linguistically annotated
• encoded according to international standards and recommendations
  – technical: SGML, TEI P3
  – linguistic: MULTEXT-East (MULTEXT, EAGLES)

Linguistic annotation
[slide graphic not preserved]

Accessibility
Exploitation by partners:
  – DZS: new dictionaries
  – Amebis: development of HLT
  – Arts faculty: teaching
  – IJS: research on HLT
Availability to the public:
  – access via concordance engine by Amebis
  – free access, but displays only a few hits
  – possibility of academic licences
FIDA (web site) no longer maintained!

FIDA+
http://www.fidaplus.net/
• FIDA Plus project:
  – Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
  – DZS, Amebis
• Financed by the ministry + industry partners
• Extend the corpus with
  – Web materials
  – spoken component
• Better linguistic markup
• Free concordances: up to 100 lines
• Also possibility of licences

Concordancer
[slide graphic not preserved]

Output
[slide graphic not preserved]

Extended searches
[slide graphic not preserved]

Corpus “Nova Beseda”
http://bos.zrc-sazu.si/
• being developed at the Institute for Slovene Language, ZRC SAZU (Primož Jakopin)
• Web concordancer with no hit limit
• now larger than FIDA
• but much less varied: fiction, Delo, DZ
• not linguistically annotated

Part II. MULTEXT-East
• multilingual morphosyntactic resources for HLT development

MULTEXT-East resources
• MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
• Based on the results of EU MULTEXT (~West)
• To produce a harmonised BLARK for six languages:
  – corpus encoding standardisation (TEI / CES)
  – multilingual parallel, comparable, speech corpora
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – (inflectional) lexicon
  – annotated corpus
  – language processing tools

History of MULTEXT-East resources
• First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
• Resources since 1998 available on the Web: http://nl.ijs.si/ME/
• Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
• Third release 2004: merge of first two releases, further languages
• Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bilateral projects

The Languages of MULTEXT-East
Slavic:
• Russian (East Slavic)
• Czech (West Slavic)
• Slovene (South West Slavic)
• Resian (Slovene dialect)
• Croatian (South West Slavic)
• Serbian (South West Slavic)
• Bulgarian (South East Slavic)
Others:
• Germanic: English
• Romance: Romanian
• Baltic:
  – Latvian
  – Lithuanian
• Finno-Ugric:
  – Estonian
  – Hungarian
In progress:
• Macedonian
• Persian

Version 3 Available on http: //nl. ijs. si/ME/V 3/ n Some parts completely free,

Version 3 Available on http: //nl. ijs. si/ME/V 3/ n Some parts completely free, others free for research Web licence n Web pages gives: n – extensive documentation – bibliography list – web licence form – resource download Gralis 2006 -05 -09 Tomaž Erjavec Dept. of Knowledge Technologies, Jožef Stefan Institute

The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications
• Based on EAGLES / MULTEXT
• Define PoS, their attributes and values
• The specs are a document containing:
  – introduction
  – common tables
  – language particular sections
• Written in LaTeX, rendered to PDF & HTML
• Derived XML/TEI encoding as feature structures

Example common table
[slide graphic not preserved]

Example language specific section
• table (shows only categories actually used)
• notes
• combinations
• lexicon
• for Slovene (FIDA): localisation of category names

Morphosyntactic Complexity
[slide graphic not preserved]

2. The lexica
• Medium size morphosyntactic lexica
• Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
• ~ all word-forms of ca. 15,000 lemmas
• Lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base-form of the word
  – the morphosyntactic description (MSD)
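The three-field entry maps naturally onto a small record type. A minimal sketch of reading such entries, assuming one whitespace-separated entry per line and assuming that "=" in the lemma field means the lemma is identical to the word-form; the `LexEntry` and `parse_entry` names and the sample lines are illustrative, not taken from the released lexica:

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    word_form: str  # the inflected form of the word
    lemma: str      # the base-form of the word
    msd: str        # the morphosyntactic description


def parse_entry(line: str) -> LexEntry:
    """Split one lexicon line into word-form, lemma and MSD."""
    word_form, lemma, msd = line.split()
    if lemma == "=":  # assumption: "=" abbreviates "same as the word-form"
        lemma = word_form
    return LexEntry(word_form, lemma, msd)


# Hypothetical sample lines in the assumed layout.
for raw in ["abeceda = Ncfsn", "abecede abeceda Ncfsg"]:
    print(parse_entry(raw))
```

Keeping the MSD as an opaque string at this stage leaves its interpretation to the morphosyntactic specifications.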

Example: Slovene lexicon
Word-forms: abecedah, abecedama, abecedami, abecede, abecedi, …
Lemmas: abeceda, =, abeceda, abeceda, abeceda
MSDs: Ncfdg, Ncfpg, Ncfsn, Ncfdl, Ncfpd, Ncfdi, Ncfpa, Ncfpn, Ncfsg, Ncfda, Ncfdn
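MSDs such as Ncfsn are positional: the first character gives the part of speech and each following character gives the value of one attribute. A minimal sketch of decoding noun MSDs into attribute-value pairs, assuming the attribute order Type, Gender, Number, Case; the table below is a hand-written excerpt for illustration only, and the specifications remain the authoritative source:

```python
# Partial, hand-written excerpt of noun attributes (an assumption for
# illustration); position i + 1 of the MSD selects a value from entry i.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> dict[str, str]:
    """Expand a noun MSD such as 'Ncfsn' into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"PoS": "Noun"}
    for (name, values), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code != "-":  # "-" marks an attribute that does not apply
            features[name] = values.get(code, code)
    return features


print(decode_noun_msd("Ncfsn"))
print(decode_noun_msd("Ncfdl"))
```

The resulting dictionary is a flat stand-in for the feature-structure encoding mentioned in the specifications.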

Lexicon sizes
[slide graphic not preserved]

3. The “1984” corpus
• Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
• Structurally annotated
• Sentence aligned with English
• Words annotated with lemma and MSD
• Encoded in TEI P4 (XML)

Example linguistic encoding
Sentence alignment & context disambiguated lemmas and MSDs

<text id="Osl." lang="sl">
<body>
<div type="part" id="Osl.1">
<div type="chapter" id="Osl.1.2">
<p id="Osl.1.2.2">
<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c>.</c>
</s>
…
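Because each word carries its lemma and MSD as attributes, the annotation can be read with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree` on a shortened copy of the sentence from this slide; the `read_sentence` helper is hypothetical, and a real corpus file would need the full document rather than a single `<s>` element:

```python
import xml.etree.ElementTree as ET

# Shortened, self-contained copy of the example sentence.
SENTENCE = """
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>
"""


def read_sentence(xml_text: str):
    """Return the sentence id and a list of (token, lemma, MSD) triples."""
    sentence = ET.fromstring(xml_text.strip())
    tokens = []
    for element in sentence:
        if element.tag == "w":    # word: has lemma and MSD attributes
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":  # punctuation: no linguistic annotation
            tokens.append((element.text, None, None))
    return sentence.get("id"), tokens


sent_id, tokens = read_sentence(SENTENCE)
print(sent_id)
for token in tokens:
    print(token)
```

The sentence id is what would link this sentence to its English counterpart in the alignment; how the alignment files are laid out is not shown on the slide.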

Quantifying the corpus
[slide graphic not preserved]

Utility of MULTEXT-East LRs
• Specifications became, for some, the “national” standard
• Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
• A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
• Teaching aid in HLT courses
• ~100 registered users
• As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

LRs @ JSI http: //nl. ijs. si/nl. html#Resource Also ours: VAYNA, GORE, slo. WNet

LRs @ JSI http: //nl. ijs. si/nl. html#Resource Also ours: VAYNA, GORE, slo. WNet Contributors to: FIDA, DSI, FDV, JRC-ACQUIS Gralis 2006 -05 -09 Tomaž Erjavec Dept. of Knowledge Technologies, Jožef Stefan Institute

Overview of Slovene LRs and services @ Slovenian Language Technologies Society
http://nl.ijs.si/sdjt/

Thank you!