The MULTEXT-East multilingual language resources
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Overview
1. Introduction to Language Resources
2. MULTEXT-East: morphosyntactic resources for East-European languages
3. A tour of Slovene language resources
4. Conclusions
Graz Uni January 27 2006 Tomaž Erjavec Dept. of Knowledge Technologies, Jozef Stefan Institute
Introduction to Language Resources
- LRs comprise two types of data:
  – corpora: mono- or multilingual, reference or specialised, …, variously annotated
  – lexica: vocabularies, morphosyntactic, semantic (ontologies)
- LRs, esp. corpora, are used for empirical language research:
  – linguistic research: (annotated) corpus + (sophisticated) search engine
  – human language technology R&D: testing and training datasets
Characteristics of LRs
- Separate development for each language
- Costly to produce, so should be widely available, but:
  – “monopoly protection”
  – problems of copyright
  – lack of encoding standardisation
- Great variation in availability between languages
- Good side:
  – text is becoming increasingly easy to acquire (WWW)
  – un- & semi-supervised ML methods give increasingly good results
- Ideal: lots of different, large, high-quality, standardised, freely available, and supported LRs for all languages, multilingual and multimodal
History of LRs
- 70s: Chomskyan paradigm – no LRs
- 85-95: renaissance of empiricism (LR-based):
  – became accepted in academic circles: corpus linguistics / (statistical) machine learning
  – advances in standardisation: TEI, EAGLES
  – large EU funded HLT/LR projects: EAGLES, MULTEXT, …
  – EU Copernicus (1995, ’97): MULTEXT-East, TELRI, …
  – LR brokers: LDC (1992), ELRA (1995)
- 95-05: established field ~ old hat
  – LREC: biennial conferences (1998-), LRE journal (2005)
  – XML based standards: TEI, ISO, W3C
  – national initiatives
  – no more EU funding for LR collection or HLT R&D
  – EU funding for component multimodal / multilingual technologies, standardisation and research infrastructures
MULTEXT-East resources
- MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
- Based on the results of EU MULTEXT (~West)
- To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
  – corpus encoding standardisation (TEI / CES)
  – multilingual parallel, comparable, speech corpora
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – (inflectional) lexicon
  – annotated corpus
  – language processing tools
History of MULTEXT-East resources
- First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
- Resources since 1998 available on the Web: http://nl.ijs.si/ME/
- Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
- Third release 2004: merge of first two releases, further languages
- Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects
The Languages of MULTEXT-East
- Slavic:
  – Russian (East Slavic)
  – Czech (West Slavic)
  – Slovene (South West Slavic)
  – Resian (Slovene dialect)
  – Croatian (South West Slavic)
  – Serbian (South West Slavic)
  – Bulgarian (South East Slavic)
- Germanic: English
- Romance: Romanian
- Baltic:
  – Latvian
  – Lithuanian
- Finno-Ugric:
  – Estonian
  – Hungarian
- In progress:
  – Macedonian
  – Persian
Version 3
- Available on http://nl.ijs.si/ME/V3/
- Some parts completely free, others under a free-for-research licence
- Web pages give:
  – extensive documentation
  – bibliography list
  – web licence form
  – resources
The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus
1. Morphosyntactic specifications
- Based on EAGLES / MULTEXT
- Define PoS, their attributes and values
- The specs are a document containing:
  – introduction
  – common tables
  – language particular sections
- Written in LaTeX, rendered to PDF & HTML
- Derived XML/TEI encoding as feature structures
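As a rough illustration of what "PoS, their attributes and values" means in practice, the sketch below decodes a positional MSD string into attribute-value pairs. The attribute order and value codes in `NOUN_ATTRIBUTES` are an illustrative assumption covering only a few noun features; they are not copied from the official specification tables, and `decode_msd` is a hypothetical helper name.

```python
# Illustrative subset only: the attribute order and the one-letter value codes
# below are assumptions for this sketch, not the official specification tables.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# First character of an MSD selects the category (here only nouns are covered).
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into a dict of attribute-value pairs."""
    pos_name, attributes = CATEGORIES[msd[0]]
    features = {"PoS": pos_name}
    # Each following character fills one attribute slot, in table order.
    # Positions beyond the (partial) table are silently ignored by zip().
    for (name, values), code in zip(attributes, msd[1:]):
        if code == "-":  # "-" marks an attribute that is not applicable
            continue
        features[name] = values[code]
    return features


if __name__ == "__main__":
    print(decode_msd("Ncfsn"))
    print(decode_msd("Nc-p"))
```

A compact string such as `Ncfsn` is convenient as a corpus tag, while the expanded form is easier to query by individual attribute.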
Example common table
Example language specific table
Complexity
2. The lexica
- Medium size morphosyntactic lexica
- Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
- ~ all word-forms of ca. 15,000 lemmas
- Lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base-form of the word
  – the morphosyntactic description (MSD)
Example: Slovene lexicon
- Word-forms: abecedah, abecedama, abecedami, abecede, abecedi, …
- Lemma: abeceda (written "=" where the word-form is identical to the lemma)
- MSDs: Ncfdg, Ncfpg, Ncfsn, Ncfdl, Ncfpd, Ncfdi, Ncfpa, Ncfpn, Ncfsg, Ncfda, Ncfdn
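The three-field entry (word-form, lemma, MSD) can be read with a few lines of code. This is a minimal sketch that assumes one entry per line with whitespace-separated fields, and that "=" in the lemma field stands for "same as the word-form", as the Slovene example suggests. The function names and the sample pairing of word-forms with MSDs are illustrative, not taken from the distributed lexicon files.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Parse 'word-form lemma MSD' lines into (word_form, lemma, msd) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        word_form, lemma, msd = line.split()
        if lemma == "=":  # assumed shorthand: lemma identical to word-form
            lemma = word_form
        entries.append((word_form, lemma, msd))
    return entries


def index_by_word_form(entries):
    """Group entries by word-form; an ambiguous form keeps all its readings."""
    index = defaultdict(list)
    for word_form, lemma, msd in entries:
        index[word_form].append((lemma, msd))
    return dict(index)


if __name__ == "__main__":
    # Illustrative sample lines, not a verbatim excerpt of the lexicon.
    sample = ["abeceda = Ncfsn", "abecedah abeceda Ncfdl"]
    for word_form, readings in index_by_word_form(parse_lexicon(sample)).items():
        print(word_form, readings)
```

Indexing by word-form gives the set of candidate (lemma, MSD) readings that a tagger would then have to disambiguate in context.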
Lexicon sizes
The specification as TEI FS
<fLib type="Noun">
  <f id="N0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS"> <sym value="Noun" /> </f>
  <f id="N1.c" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"> <sym value="common" /> </f>
  <f id="N1.p" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"> <sym value="proper" /> </f>
  …
<fsLib type="Noun">
  <fs id="Nc" select="en et sr" feats="N0. N1.c" />
  <fs id="Nc---n" select="ro" feats="N0. N1.c N5.n" />
  <fs id="Nc--g" select="sr" feats="N0. N1.c N4.g" />
  <fs id="Nc-p" select="cs en" feats="N0. N1.c N3.p" />
  <fs id="Nc-p1" select="et" feats="N0. N1.c N3.p N4.1" />
  …
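The feature-structure encoding can be used to expand an MSD for a given language by following the `feats` references from an `<fs>` to the `<f>` definitions. The sketch below is one possible way to do this with Python's standard `xml.etree.ElementTree`. The `<specs>` wrapper element, the shortened `select` lists, and the `N3.p` definition (Number = plural) are assumptions added to make the sample well-formed and self-contained; `load_features` and `expand_msd` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample. The <specs> wrapper and the N3.p definition
# are assumptions for this sketch; the excerpt on the slide is elided.
SAMPLE = """<specs>
<fLib type="Noun">
  <f id="N0." select="en ro sl cs" name="PoS"><sym value="Noun"/></f>
  <f id="N1.c" select="en ro sl cs" name="Type"><sym value="common"/></f>
  <f id="N3.p" select="en cs" name="Number"><sym value="plural"/></f>
</fLib>
<fsLib type="Noun">
  <fs id="Nc" select="en et sr" feats="N0. N1.c"/>
  <fs id="Nc-p" select="cs en" feats="N0. N1.c N3.p"/>
</fsLib>
</specs>"""


def load_features(root):
    """Map each feature id (e.g. 'N1.c') to its (name, value) pair."""
    features = {}
    for f in root.iter("f"):
        features[f.get("id")] = (f.get("name"), f.find("sym").get("value"))
    return features


def expand_msd(root, msd, lang):
    """Return the attribute-value dict for msd if it is valid for lang, else None."""
    features = load_features(root)
    for fs in root.iter("fs"):
        if fs.get("id") == msd and lang in fs.get("select").split():
            # feats is a space-separated list of references into the fLib
            return dict(features[ref] for ref in fs.get("feats").split())
    return None


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(expand_msd(root, "Nc-p", "en"))
    print(expand_msd(root, "Nc-p", "sl"))
```

The `select` attribute doubles as a validity check: the same MSD string may be defined for one language and undefined for another.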
3. The “1984” corpus
- Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
- Structurally annotated
- Sentence aligned with English
- Words annotated with lemma and MSD
- Encoded in TEI P4 (XML)
Example linguistic encoding
Context disambiguated lemmas and MSDs
<text id="Osl." lang="sl">
<body>
<div type="part" id="Osl.1">
<div type="chapter" id="Osl.1.2">
<p id="Osl.1.2.2">
<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c>.</c>
</s>
…
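Annotation of this kind is straightforward to read back programmatically. The following sketch, using Python's standard `xml.etree.ElementTree`, extracts (token, lemma, MSD) triples from a single `<s>` element. The sample is shortened to the first few tokens of the sentence, and `read_sentence` is a hypothetical helper name, not part of any MULTEXT-East tool.

```python
import xml.etree.ElementTree as ET

# First tokens of the example sentence, shortened for this sketch.
SAMPLE = """<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""


def read_sentence(s):
    """Return (text, lemma, msd) triples for the tokens of one <s> element."""
    tokens = []
    for el in s:
        if el.tag == "w":    # word: carries lemma and MSD annotation
            tokens.append((el.text, el.get("lemma"), el.get("ana")))
        elif el.tag == "c":  # punctuation: no lemma or MSD in this encoding
            tokens.append((el.text, None, None))
    return tokens


if __name__ == "__main__":
    sentence = ET.fromstring(SAMPLE)
    for text, lemma, msd in read_sentence(sentence):
        print(text, lemma, msd)
```

For a whole corpus file, the same function could be applied to every element returned by `root.iter("s")`, with the `id` attribute serving as the key for sentence alignment with the English text.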
Quantifying the corpus
Utility of MULTEXT-East LRs
- Specifications became, for some, the “national” standard
- Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
- A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
- Teaching aid in HLT courses
- ~100 registered users
- As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian
LRs @ JSI
- Also ours: VAYNA, GORE, sloWNet
- Contributors to: FIDA, DSI, FDV, JRC-ACQUIS
- Contractors for: Inxight
- Nice try: EU CULTACT
JSI know-how in corpus compilation
1. Encoding standardisation: XML, TEI, ISO
2. Up-conversion: character set, structure, meta-data
3. Linguistic annotation: token, lemma, MSD, alignment
4. Distribution via nl.ijs.si: concordancing, browsing, download
& teaching in these areas: ESSLLI, JSIPS, FF, NG
Slovene LRs @ SDJT
Conclusions
- Introduced language resources, MULTEXT-East and Slovene LRs
- Useful basis for empirical studies of the (Slovene) language
- Of course, more resources are needed, but we are working on it: SDT, sloWNet, jaSlo, ACQUIS, MULTEXT-East
- Further collaborations welcome…
Thank you!