The FIDA & MULTEXT-East language resources
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Gralis 2006, Institut für Slawistik der Universität Graz, 2006-05-09
Overview
1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East European languages
4. Other language resources for Slovene
Gralis 2006-05-09, Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute
Language Resources
- LRs comprise three layers of data:
  – corpora: mono- or multilingual, reference or specialised, … /variously annotated/
  – lexica: vocabularies, morphosyntactic, semantic, (ontologies)
  – standards: linguistic and technical encoding
- LRs, esp. corpora, are used for empirical language research:
  – linguistic studies: (annotated) corpus + (sophisticated) search engine
  – human language technology R&D: testing and training datasets
Part I. The FIDA corpus
- Slovene reference corpus for linguistic studies
FIDA http://www.fida.net/
Joint project (1997-2000) of
- Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
- Institut Jožef Stefan: Tomaž Erjavec
- DZS: Simon Krek
- Amebis: Peter Holozan, Miro Romih
Financed by the industry partners
Characteristics of FIDA
- monolingual
- synchronous
- written language
- reference
  – representative
  – balanced
- annotated
Sizes
- Total: 103,513,072 words, 29,177 texts
- Avg. text length: 3,548 words
- Largest texts: Leksikon DZS: 508,370 words; 69 texts > 100,000 words
- Smallest texts: 2,648 texts < 100 words; 2 x <w>rezgrtshdrghgth4</w>
Time Composition
- Oldest/most recent text: 1989/2000
- Average date: 1997-02
- Texts/words with unknown date: 3.94%/8.28%
FIDA taxonomy: publication types
…
Ft.P.P.O (published)
Ft.P.P.O.K (books)
Ft.P.P.O.P (periodicals)
Ft.P.P.O.P.C (newspaper)
Ft.P.P.O.P.C.D (daily)
Ft.P.P.O.P.C.T (weekly)
Ft.P.P.O.P.C.V (multi-weekly)
…
Percentages, in slide order: 95.72%, 22.71%, 70.50%, 46.59%, 32.67%, 66.18%, 17.74%
FIDA taxonomy: text types
Ft.Z (text type)
Ft.Z.N (non-fiction)
Ft.Z.N.N (non-professional)
Ft.Z.N.S (professional)
Ft.Z.N.S.H (hum. & soc. sci.)
Ft.Z.N.S.N (nat. & tech. sci.)
Ft.Z.U (fiction)
Ft.Z.U.D (drama)
Ft.Z.U.P (poetry)
Ft.Z.U.R (prose)
Percentages, in slide order: 99.47%, 93.57%, 75.14%, 18.37%, 10.57%, 6.04%, 5.90%, 0.17%, 5.12%
Markup of FIDA
- corpus elements annotated with metadata (bibliographic, taxonomy)
- text linguistically annotated
- encoded according to international standards and recommendations
  – technical: SGML, TEI P3
  – linguistic: MULTEXT-East (MULTEXT, EAGLES)
Linguistic annotation
Accessibility
Exploitation by partners:
  – DZS: new dictionaries
  – Amebis: development of HLT
  – Arts faculty: teaching
  – IJS: research on HLT
Availability to the public:
  – access via concordance engine by Amebis
  – free access, but displays only a few hits
  – possibility of academic licences
FIDA (web site) no longer maintained!
FIDA+ http://www.fidaplus.net/
- FIDA Plus project:
  – Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
  – DZS, Amebis
- Financed by the ministry + ind. partners
- Extend the corpus with
  – Web materials
  – spoken component
- Better linguistic markup
- Free concordances: up to 100 lines
- Also possibility of licences
Concordancer
Output
Extended searches
Corpus “Nova Beseda” http://bos.zrc-sazu.si/
- being developed at the Institute for Slovene language, ZRC SAZU (Primož Jakopin)
- Web concordancer with no hit limit
- now larger than FIDA
- but much less varied: fiction, Delo, DZ
- not linguistically annotated
Part II. MULTEXT-East
- multilingual morphosyntactic resources for HLT development
MULTEXT-East resources
- MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
- Based on the results of EU MULTEXT (~West)
- To produce a harmonised BLARK for six languages:
  – corpus encoding standardisation (TEI / CES)
  – multilingual parallel, comparable, speech corpora
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – (inflectional) lexicon
  – annotated corpus
  – language processing tools
History of MULTEXT-East resources
- First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
- Resources since 1998 available on the Web: http://nl.ijs.si/ME/
- Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
- Third release 2004: merge of first two releases, further languages
- Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects
The Languages of MULTEXT-East
Slavic:
- Russian (East Slavic)
- Czech (West Slavic)
- Slovene (South West Slavic)
- Resian (Slovene dialect)
- Croatian (South West Slavic)
- Serbian (South West Slavic)
- Bulgarian (South East Slavic)
Germanic:
- English
Romance:
- Romanian
Baltic:
- Latvian
- Lithuanian
Finno-Ugric:
- Estonian
- Hungarian
In progress:
- Macedonian
- Persian
Version 3
- Available on http://nl.ijs.si/ME/V3/
- Some parts completely free, others free for research under a Web licence
- Web pages give:
  – extensive documentation
  – bibliography list
  – web licence form
  – resource download
The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus
1. Morphosyntactic specifications
- Based on EAGLES / MULTEXT
- Define PoS, their attributes and values
- The specs are a document containing:
  – introduction
  – common tables
  – language particular sections
- Written in LaTeX (PDF & HTML)
- Derived XML/TEI encoding as feature structures
Example common table
Example language specific section
- table (shows only categories actually used)
- notes
- combinations
- lexicon
- for Slovene (FIDA): localisation of category names
Morphosyntactic Complexity
2. The lexica
- Medium size morphosyntactic lexica
- Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
- ~ all word-forms of ca. 15,000 lemmas
- Lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base-form of the word
  – the morphosyntactic description (MSD)
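The three-field entry described above can be read with a few lines of Python. This is a minimal sketch, not the project's own tooling: the names `LexEntry` and `parse_entry`, the whitespace-separated layout, and the reading of `=` in the lemma field as "lemma identical to the word-form" are assumptions, and the sample lines are illustrative rather than copied from the distributed lexicon files.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    form: str   # the inflected word-form
    lemma: str  # the base form of the word
    msd: str    # the morphosyntactic description


def parse_entry(line: str) -> LexEntry:
    """Split one lexicon line into word-form, lemma and MSD."""
    form, lemma, msd = line.split()
    if lemma == "=":  # assumed shorthand: lemma is the same as the word-form
        lemma = form
    return LexEntry(form, lemma, msd)


# Illustrative lines only (hypothetical layout, see the assumptions above).
sample = ["abeceda = Ncfsn", "abecede abeceda Ncfsg"]
entries = [parse_entry(line) for line in sample]

for entry in entries:
    print(entry.form, entry.lemma, entry.msd)
```

Keeping the entry as a named tuple makes it easy to group word-forms by lemma or to count distinct MSDs per language.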
Example: Slovene lexicon
Word-form column: abecedah, abecedama, abecedami, abecede, abecedi, …
Lemma column: abeceda, =, abeceda, abeceda, abeceda
MSD column: Ncfdg, Ncfpg, Ncfsn, Ncfdl, Ncfpd, Ncfdi, Ncfpa, Ncfpn, Ncfsg, Ncfda, Ncfdn
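An MSD such as Ncfsn is positional: the first character gives the part of speech and each following character the value of one attribute. The sketch below decodes the first five positions of noun MSDs like those on this slide. The attribute order and value codes are an assumed reading of the MULTEXT-East noun table and should be checked against the specifications; `decode_noun_msd` and `NOUN_ATTRIBUTES` are hypothetical names.

```python
# Assumed attribute order for nouns: type, gender, number, case.
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a noun MSD into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"category": "noun"}
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code != "-":  # "-" marks an attribute that does not apply
            # Unknown codes are kept as-is rather than guessed.
            features[name] = values.get(code, code)
    return features


for msd in ["Ncfsn", "Ncfdl", "Ncfpg"]:
    print(msd, decode_noun_msd(msd))
```

Positions beyond the fifth character are ignored here, so language-specific attributes would need their own entries in the table.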
Lexicon sizes
3. The “1984” corpus
- Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
- Structurally annotated
- Sentence aligned with English
- Words annotated with lemma and MSD
- Encoded in TEI P4 (XML)
Example linguistic encoding
Sentence alignment & context-disambiguated lemmas and MSDs
<text id="Osl." lang="sl">
 <body>
  <div type="part" id="Osl.1">
   <div type="chapter" id="Osl.1.2">
    <p id="Osl.1.2.2">
     <s id="Osl.1.2.2.1">
      <w lemma="biti" ana="Vcps-sma">Bil</w>
      <w lemma="biti" ana="Vcip3s--n">je</w>
      <w lemma="jasen" ana="Afpmsnn">jasen</w>
      <c>,</c>
      <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
      <w lemma="aprilski" ana="Aopmsn">aprilski</w>
      <w lemma="dan" ana="Ncmsn">dan</w>
      <w lemma="in" ana="Ccs">in</w>
      <w lemma="ura" ana="Ncfpn">ure</w>
      <w lemma="biti" ana="Vcip3p--n">so</w>
      <w lemma="biti" ana="Vmps-pfa">bile</w>
      <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
      <c>.</c>
     </s>
     …
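Word-level annotation of this kind can be read with Python's standard `xml.etree.ElementTree`. The sketch below is a minimal illustration, not part of the MULTEXT-East distribution: it uses only the one sentence shown on the slide (closed off so that it parses on its own), and `read_tokens` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The sentence from the slide, as a standalone well-formed fragment.
SENTENCE = """<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c>.</c>
</s>"""


def read_tokens(xml_text: str) -> list:
    """Return (form, lemma, msd) triples in document order.

    Punctuation (<c>) carries no annotation, so lemma and MSD are None.
    """
    root = ET.fromstring(xml_text)
    tokens = []
    for element in root.iter():
        if element.tag == "w":
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            tokens.append((element.text, None, None))
    return tokens


tokens = read_tokens(SENTENCE)
for form, lemma, msd in tokens:
    print(form, lemma, msd)
```

The resulting triples are the same word-form, lemma and MSD fields as in the lexicon, which is what makes the corpus usable as tagger and lemmatizer training data.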
Quantifying the corpus
Utility of MULTEXT-East LRs
- Specifications became, for some, the “national” standard
- Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
- A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
- Teaching aid in HLT courses
- ~100 registered users
- As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian
LRs @ JSI http://nl.ijs.si/nl.html#Resource
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS
Overview of Slovene LRs and services @ Slovenian Language Technologies Society http://nl.ijs.si/sdjt/
Thank you!