The MULTEXT-East multilingual language resources
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/

Overview
1. Introduction to Language Resources
2. MULTEXT-East: morphosyntactic resources for East-European languages
3. A tour of Slovene language resources
4. Conclusions
Graz Uni, January 27 2006. Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute

Introduction to Language Resources
- LRs comprise two types of data:
  – corpora: mono- or multilingual, reference or specialised, …, variously annotated
  – lexica: vocabularies, morphosyntactic, semantic (ontologies)
- LRs, esp. corpora, are used for empirical language research:
  – linguistic research: (annotated) corpus + (sophisticated) search engine
  – human language technology R&D: testing and training datasets

Characteristics of LRs
- Separate development for each language
- Costly to produce, so should be widely available, but:
  – "monopoly protection"
  – problems of copyright
  – lack of encoding standardisation
- Great variation in availability between languages
- Good side:
  – text is becoming increasingly easy to acquire (WWW)
  – un- & semi-supervised ML methods give increasingly good results
- Ideal: lots of different, large, high-quality, standardised, freely available, and supported LRs for all languages, multilingual and multimodal

History of LRs
- 70s: Chomskyan paradigm, no LRs
- 85-95: renaissance of empiricism (LR-based):
  – became accepted in academic circles: corpus linguistics / (statistical) machine learning
  – advances in standardisation: TEI, EAGLES
  – large EU funded HLT/LR projects: EAGLES, MULTEXT, …
  – EU Copernicus (1995, '97): MULTEXT-East, TELRI, …
  – LR brokers: LDC (1992), ELRA (1995)
- 95-05: established field ~ old hat:
  – LREC: bi-annual conferences (1998-), LRE journal (2005)
  – XML based standards: TEI, ISO, W3C
  – national initiatives
  – no more EU funding for LR collection or HLT R&D
  – EU funding for component multimodal / multilingual technologies, standardisation and research infrastructures

MULTEXT-East resources
- MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), "Multilingual Texts and Corpora for Eastern and Central European Languages"
- Based on the results of EU MULTEXT (~West)
- To produce a harmonised BLARK for six languages:
  – corpus encoding standardisation (TEI / CES)
  – multilingual parallel, comparable, speech corpora
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – (inflectional) lexicon
  – annotated corpus
  – language processing tools

History of MULTEXT-East resources
- First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
- Resources since 1998 available on the Web: http://nl.ijs.si/ME/
- Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
- Third release 2004: merge of first two releases, further languages
- Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects

The Languages of MULTEXT-East
- Slavic:
  – Russian (East Slavic)
  – Czech (West Slavic)
  – Slovene (South West Slavic)
  – Resian (Slovene dialect)
  – Croatian (South West Slavic)
  – Serbian (South West Slavic)
  – Bulgarian (South East Slavic)
- Germanic: English
- Romance: Romanian
- Baltic:
  – Latvian
  – Lithuanian
- Finno-Ugric:
  – Estonian
  – Hungarian
- In progress:
  – Macedonian
  – Persian

Version 3
- Available on http://nl.ijs.si/ME/V3/
- Some parts completely free, others free under a research licence
- Web pages give:
  – extensive documentation
  – bibliography list
  – web licence form
  – resources

The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications
- Based on EAGLES / MULTEXT
- Define PoS, their attributes and values
- The specs are a document containing:
  – introduction
  – common tables
  – language particular sections
- Written in LaTeX, with PDF & HTML versions
- Derived XML/TEI encoding as feature structures

Example common table [table image not included]

Example language specific table [table image not included]

Complexity [figure not included]

2. The lexica
- Medium size morphosyntactic lexica
- Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
- ~ all word-forms of approx. 15,000 lemmas
- Lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base-form of the word
  – the morphosyntactic description (MSD)

Example: Slovene lexicon (the row alignment of the three columns was lost in extraction)
- Word-forms: abecedah, abecedama, abecedami, abecede, abecedi, …
- Lemmas: abeceda, = , abeceda, abeceda, abeceda
- MSDs: Ncfdg, Ncfpg, Ncfsn, Ncfdl, Ncfpd, Ncfdi, Ncfpa, Ncfpn, Ncfsg, Ncfda, Ncfdn
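A minimal sketch of reading such three-field lexical entries, not the project's own tooling. It assumes whitespace-separated fields and that "=" in the lemma field means the lemma is identical to the word-form; the two sample rows are illustrative pairings assembled from values on the slide, and the names LexEntry and parse_entry are hypothetical.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    """One lexical entry: word-form, lemma, morphosyntactic description."""
    wordform: str
    lemma: str
    msd: str


def parse_entry(line: str) -> LexEntry:
    """Split one lexicon line into its three fields.

    Assumption: fields are separated by whitespace, and a lemma field
    of "=" stands for "same as the word-form".
    """
    wordform, lemma, msd = line.split()
    if lemma == "=":
        lemma = wordform
    return LexEntry(wordform, lemma, msd)


# Illustrative rows only; the exact pairing on the slide is not recoverable.
SAMPLE = """abeceda = Ncfsn
abecedah abeceda Ncfdl"""

entries = [parse_entry(line) for line in SAMPLE.splitlines()]

for entry in entries:
    print(entry.wordform, entry.lemma, entry.msd)
```

Keeping the entry as a named tuple makes the three fields explicit while staying cheap enough for a lexicon covering the word-forms of thousands of lemmas.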

Lexicon sizes [table image not included]

The specifications as TEI FS
<fLib type="Noun">
  <f id="N0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS">
    <sym value="Noun" />
  </f>
  <f id="N1.c" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type">
    <sym value="common" />
  </f>
  <f id="N1.p" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type">
    <sym value="proper" />
  </f>
  …
<fsLib type="Noun">
  <fs id="Nc" select="en et sr" feats="N0. N1.c" />
  <fs id="Nc---n" select="ro" feats="N0. N1.c N5.n" />
  <fs id="Nc--g" select="sr" feats="N0. N1.c N4.g" />
  <fs id="Nc-p" select="cs en" feats="N0. N1.c N3.p" />
  <fs id="Nc-p1" select="et" feats="N0. N1.c N3.p N4.1" />
  …
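The fs entries on this slide suggest a positional reading of an MSD: the first character gives the category, and each later non-hyphen character at position i contributes a feature ID of the form category + i + "." + code. The sketch below reproduces that mapping for the IDs shown (for example Nc-p gives N0. N1.c N3.p). It is an illustration inferred from the slide, not the project's conversion code, and the function name msd_to_feature_ids is hypothetical.

```python
from typing import List


def msd_to_feature_ids(msd: str) -> List[str]:
    """Expand a positional MSD string into feature IDs.

    Assumption (inferred from the fs examples on the slide): position 0
    is the category, "-" marks an unspecified attribute, and feature IDs
    are written as <category><position>.<code>, with "<category>0." for
    the part of speech itself.
    """
    category = msd[0]
    feature_ids = [f"{category}0."]
    for position, code in enumerate(msd[1:], start=1):
        if code != "-":  # hyphen = attribute not applicable / not set
            feature_ids.append(f"{category}{position}.{code}")
    return feature_ids


# The fs IDs listed on the slide
for msd in ["Nc", "Nc---n", "Nc--g", "Nc-p", "Nc-p1"]:
    print(msd, "->", " ".join(msd_to_feature_ids(msd)))
```

Each printed line can be compared against the feats attribute of the corresponding fs element.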

3. The "1984" corpus
- Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
- Structurally annotated
- Sentence aligned with English
- Words annotated with lemma and MSD
- Encoded in TEI P4 (XML)

Example linguistic encoding (context-disambiguated lemmas and MSDs)
<text id="Osl." lang="sl">
  <body>
    <div type="part" id="Osl.1">
      <div type="chapter" id="Osl.1.2">
        <p id="Osl.1.2.2">
          <s id="Osl.1.2.2.1">
            <w lemma="biti" ana="Vcps-sma">Bil</w>
            <w lemma="biti" ana="Vcip3s--n">je</w>
            <w lemma="jasen" ana="Afpmsnn">jasen</w>
            <c>,</c>
            <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
            <w lemma="aprilski" ana="Aopmsn">aprilski</w>
            <w lemma="dan" ana="Ncmsn">dan</w>
            <w lemma="in" ana="Ccs">in</w>
            <w lemma="ura" ana="Ncfpn">ure</w>
            <w lemma="biti" ana="Vcip3p--n">so</w>
            <w lemma="biti" ana="Vmps-pfa">bile</w>
            <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
            <c>.</c>
          </s>
          …
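A minimal sketch of reading the word-level annotation with Python's standard library, assuming a well-formed fragment. The sentence below is a shortened, self-closed excerpt of the slide's example rather than the full corpus file, and the function name read_tokens is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the sentence shown on the slide, closed so that it
# parses on its own; the real corpus wraps it in <text>, <body>, <div>, <p>.
SENTENCE = """<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""


def read_tokens(xml_text):
    """Return (token, lemma, MSD) triples for every <w> element.

    Punctuation is encoded as <c> and carries no lemma/ana, so it is
    skipped here.
    """
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


tokens = read_tokens(SENTENCE)
for token, lemma, msd in tokens:
    print(f"{token}\t{lemma}\t{msd}")
```

The resulting triples have the same three-field shape as the lexicon entries, which is what makes the corpus usable as training and testing data for taggers and lemmatizers.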

Quantifying the corpus [table image not included]

Utility of MULTEXT-East LRs
- Specifications became, for some, the "national" standard
- Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
- A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
- Teaching aid in HLT courses
- ~100 registered users
- As a BLARK "best practice" for new languages: Resian, Croatian, Macedonian, Persian

LRs @ JSI
- Also ours: VAYNA, GORE, sloWNet
- Contributors to: FIDA, DSI, FDV, JRC-ACQUIS
- Contractors for: Inxight
- Nice try: EU CULTACT

JSI know-how in corpus compilation
1. Encoding standardisation: XML, TEI, ISO
2. Up-conversion: character set, structure, meta-data
3. Linguistic annotation: token, lemma, MSD, alignment
4. Distribution via nl.ijs.si: concordancing, browsing, download
& teaching in these areas: ESSLLI, JSIPS, FF, NG

Slovene LRs @ SDJT [figure not included]

Conclusions
- Introduced language resources, MULTEXT-East and Slovene LRs
- Useful basis for empirical studies of the (Slovene) language
- Of course, more resources are needed, but we are working on it: SDT, sloWNet, jaSlo, ACQUIS, MULTEXT-East
- Further collaborations welcome…

Thank you!