
MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages

Tomaž Erjavec
http://nl.ijs.si/et/
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana, Slovenia

Dublin, April 3rd, 2009

Overview of the talk

1. Part-of-speech tagging, tagsets and interoperability
2. MULTEXT(-East) morphosyntactic specifications
3. Languages, formats, transformations
4. An application: JOS resources for Slovene
5. Conclusions

Erjavec: MULTEXT-East Version 4 Dublin, 4. 4. 2009

Part-of-speech tagging

• The task of assigning the correct PoS tag to each word in a running text, e.g.
  Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP …
• Important HLT infrastructure
• Very useful annotations for linguists
• Some applications:
    - pre-processing step for further analyses: lemmas, syntactic structure, etc.
    - text indexing, e.g. nouns are more useful than verbs
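The word/TAG notation in the example above is easy to read back into (word, tag) pairs. The following is a minimal sketch, assuming tokens are separated by whitespace and the tag follows the last slash of each token; the function name `parse_tagged` is only illustrative.

```python
def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs.

    The split is made at the LAST slash, so a token such as ',/,'
    yields the word ',' with the tag ','.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB"
tagged = parse_tagged(sample)
nouns = [word for word, tag in tagged if tag.startswith("NN")]
```

Filtering on the tag, as in the last line, is the kind of step a text-indexing application would take to keep only nouns.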

Methods of PoS tagging

• PoS tagging:
    - determine the ambiguity class of the word (saw → NN | VBD)
    - disambiguate to the correct tag in (local) context (“I saw/VBD a saw/NN”)
• Tagger training:
    - manually annotated corpus: source of probabilities for tags given a (local) context
    - + (lexicon: gives possible tags for each word-form)
• Popular taggers:
    - TnT (HMM tagger), TreeTagger (decision trees), TBL (transformation-based tagging)
• Tagging usefulness as well as accuracy crucially depends on the tagset
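The two steps (look up the ambiguity class, then disambiguate in context) can be illustrated with a toy example. This is only a sketch: the lexicon and the tag-bigram counts are invented for illustration, and the greedy left-to-right choice is a simplification, not the HMM decoding that a tagger such as TnT performs.

```python
# Hypothetical lexicon: possible tags (the ambiguity class) per word-form.
LEXICON = {"i": {"PRP"}, "saw": {"NN", "VBD"}, "a": {"DT"}}

# Hypothetical counts of (previous tag, tag) pairs, standing in for the
# statistics a tagger would learn from a manually annotated corpus.
BIGRAM_COUNTS = {("PRP", "VBD"): 5, ("PRP", "NN"): 1, ("DT", "NN"): 8}


def ambiguity_class(word):
    """Return the set of tags the lexicon allows for this word-form."""
    return LEXICON.get(word.lower(), set())


def greedy_tag(words):
    """Pick, for each word, the candidate tag seen most often after the
    previously chosen tag (greedy, left to right)."""
    tags = []
    prev = "<s>"  # sentence-start marker
    for word in words:
        candidates = sorted(ambiguity_class(word))
        if not candidates:
            raise KeyError(f"word not in lexicon: {word}")
        best = max(candidates, key=lambda t: BIGRAM_COUNTS.get((prev, t), 0))
        tags.append(best)
        prev = best
    return tags


result = greedy_tag(["I", "saw", "a", "saw"])
```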

English tagsets

• Tagging was first developed for English (Brown, CLAWS, PTB tagsets)
• English is an inflectionally very poor language → small tagsets, ~50 different tags
• Tags are typically “synthetic”, i.e. the tag does not transparently map to features, e.g.:
    - to/TO (PoS?)
    - Delmed/NNP (number?)
    - shares/NNS (number?)

Tagsets for other languages

• Other languages will often have many more morphosyntactic features associated with a word, so tagsets will be larger
• e.g. Slovene nouns:
    - type: common, proper
    - gender: masculine, feminine, neuter
    - number: singular, dual, plural
    - case: nom., gen., dat., acc., loc., ins.
    - (animacy: yes, no)
  = 104 “PoS” tags just for nouns
• Russian, Czech, Slovene: ~1000-2000 word-level syntactic tags

PoS tags vs. MSDs

• PoS tags:
    - used in corpora for corpus annotations / tagging
    - typically synthetic
• Morphosyntactic Descriptions (MSDs):
    - used in inflectional lexica for lexical annotations / morphological analysis
    - typically analytic
• Relation of PoS tagsets to MSD tagsets/features:
    - in general: |PoS| < |MSD|
    - but in most MULTEXT-East languages: [PoS] ≡ [MSD]

Developing a multilingual morphosyntactic framework

• Interoperability: tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented
• Best practice: languages that do not yet have a tagset could benefit from an operational framework in which to model it

So, wouldn’t it be nice to have:

• an open, standardised, documented, flexible model for MSD/PoS tagset design,
• that would be instantiated for lots of languages,
• and could be simply applied to any language?

EU standardisation efforts

• EAGLES: Expert Advisory Group for Language Engineering Standards (1993-1996)
• MULTEXT: Multilingual Text Tools and Corpora (1995)
• MULTEXT-East: MULTEXT for Central and Eastern European Languages:
    - Version 1: TELRI edition (1998)
    - Version 2: Concede edition (2002)
    - Version 3: TEI edition (2004)
    - Version 4: MondiLex edition (2009?)
• ...
• ISO / TC 37 / LMF / isoCat (2008)

MULTEXT-East morphosyntactic resources

• Basic Language Resource Kit:
    1. specifications: define features and MSDs
    2. lexica (~15,000 lemmas): triplets of word-form / lemma / MSD
    3. parallel corpus: MSD and lemma annotated
• Freely available for research: http://nl.ijs.si/ME/

[Figure: 1984, aligned and annotated]

[Figure: MULTEXT-East languages]

The MULTEXT(-East) morphosyntactic specifications

• They specify that e.g. “Ncmsn”
    - corresponds to the feature-structure [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative]
    - is a valid MSD for Slovene
• Specifications consist of
    - Front matter
    - Common part: common definitions for all languages (features)
    - Language particular parts: particulars for each language (MSD set)
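The correspondence between an MSD string and its feature-structure can be sketched as a positional lookup. The table below covers only the noun attributes named on the slides; the one-letter codes are assumed to be the first letter of each value (consistent with Ncmsn, Ncmsg and Ncmsd), and the names `NOUN_ATTRIBUTES` and `decode_msd` are illustrative rather than part of the specifications.

```python
CATEGORIES = {"N": "Noun"}

# Assumed positional layout for nouns: one (attribute, code -> value) per position.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_msd(msd):
    """Expand a noun MSD such as 'Ncmsn' into (category, features).

    A hyphen in a position means the attribute does not apply and is skipped.
    """
    category = CATEGORIES[msd[0]]
    features = {}
    for (attribute, codes), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code != "-":
            features[attribute] = codes[code]
    return category, features


category, features = decode_msd("Ncmsn")
```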

[Figure: V4 specs draft in HTML]

Specifications in Version 4

• Encoded in XML / teiLite (in Version 3: LaTeX)
  TEI = Text Encoding Initiative Guidelines P4
• Still “book-like” in form, to make authoring easier
• XSLT into other formats:
    - HTML
    - tabular mapping formats (e.g. MSD to features)
    - XML/TEI feature library
    - (OWL)

The common specifications

• Define categories (“parts-of-speech”)
• For each category define features, i.e. attributes and their values
• For each attribute-value specify for which languages it is appropriate
• Give positional mapping to MSDs:
    - each attribute assigned a position
    - each attribute-value assigned a one-character code
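The positional mapping can be sketched in the encoding direction: each attribute owns a position and each value a one-character code. The layout and codes below are assumptions restricted to the noun features mentioned in the talk, with first-letter codes; emitting a hyphen for an unspecified attribute and dropping trailing hyphens are likewise assumptions of this sketch.

```python
# Assumed positions for nouns: one (attribute, value -> code) per position.
NOUN_POSITIONS = [
    ("Type", {"common": "c", "proper": "p"}),
    ("Gender", {"masculine": "m", "feminine": "f", "neuter": "n"}),
    ("Number", {"singular": "s", "dual": "d", "plural": "p"}),
    ("Case", {"nominative": "n", "genitive": "g", "dative": "d",
              "accusative": "a", "locative": "l", "instrumental": "i"}),
]


def encode_noun_msd(features):
    """Build a noun MSD from an attribute -> value dict.

    Unspecified attributes become '-'; trailing hyphens are stripped.
    """
    codes = []
    for attribute, value_codes in NOUN_POSITIONS:
        value = features.get(attribute)
        codes.append(value_codes[value] if value is not None else "-")
    return "N" + "".join(codes).rstrip("-")


msd = encode_noun_msd({"Type": "common", "Gender": "masculine",
                       "Number": "singular", "Case": "genitive"})
```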

[Figure: Common table (HTML)]

[Figure: Common table (source XML/teiLite)]

Language particular sections

• Recap the feature definitions for the language
• Add “combinations”, i.e. feature co-occurrence restrictions
• Add “lexicon”, i.e. list of all valid MSDs for the language
• Possibly localise the features and codes
• Possibly give notes and examples
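One practical use of the per-language list of valid MSDs is checking lexicon triplets (word-form / lemma / MSD) against it. The sketch below uses a three-item excerpt of MSDs taken from the slides; the triplets for the Slovene noun "stol" and the deliberately bad last entry are made up for illustration, as is the function name.

```python
# Tiny excerpt of a valid-MSD list (the real list is far longer).
VALID_MSDS = {"Ncmsn", "Ncmsg", "Ncmsd"}

# Illustrative lexicon triplets: (word-form, lemma, MSD).
# The last entry carries an invented, invalid MSD to show the check firing.
LEXICON = [
    ("stol", "stol", "Ncmsn"),
    ("stola", "stol", "Ncmsg"),
    ("stolu", "stol", "Ncmsd"),
    ("stolx", "stol", "Ncmsx"),
]


def invalid_entries(lexicon, valid_msds):
    """Return the triplets whose MSD is not in the language's valid set."""
    return [entry for entry in lexicon if entry[2] not in valid_msds]


problems = invalid_entries(LEXICON, VALID_MSDS)
```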

[Figure: Combinations]

[Figure: Lexicon]

Jezikoslovno označevanje slovenščine (JOS: linguistic annotation of Slovene)
http://nl.ijs.si/jos

JOS as a bridge to MULTEXT-East Version 4

[Diagram with nodes: FidaPLUS corpus; MTE V3 slv specifications; JOS corpora; JOS (slv) specifications; MTE V4 specifications; MTE V4 (slv) specifications]


JOS specifications

• XML/teiLite + XSLT transforms
• Allow reordering of attribute positions (Vm-----d → Vmd)
• i18n / slv+eng:
    - translation: specifications
    - localisation: attributes, values, codes
    - localisation: TEI element names


MSD conversion tables

• Tabular UTF-8 files:
    - MSD-slv to -eng
    - MSD to features
    - Collating sequence
• e.g.

  01N010100      Somei  Ncmsn
  01N0101010200  Somer  Ncmsg
  01N0101010300  Somed  Ncmsd

  Ncmsn  Noun  Type=common  Gender=masculine  Number=singular  Case=nominative  Animacy=0
  Ncmsg  Noun  Type=common  Gender=masculine  Number=singular  Case=genitive    Animacy=0
  Ncmsd  Noun  Type=common  Gender=masculine  Number=singular  Case=dative      Animacy=0
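A tabular MSD-to-features file of this kind is straightforward to load into a lookup table. The sketch below assumes tab-separated columns (MSD, category, then attribute=value pairs); the actual column separator and file names in the distribution may differ, and `load_msd_features` is a hypothetical helper.

```python
import io

# Three rows from the slide, written here with tab separators (an assumption).
TABLE_TEXT = (
    "Ncmsn\tNoun\tType=common\tGender=masculine\tNumber=singular\tCase=nominative\tAnimacy=0\n"
    "Ncmsg\tNoun\tType=common\tGender=masculine\tNumber=singular\tCase=genitive\tAnimacy=0\n"
    "Ncmsd\tNoun\tType=common\tGender=masculine\tNumber=singular\tCase=dative\tAnimacy=0\n"
)


def load_msd_features(handle):
    """Read 'MSD <tab> category <tab> attr=value ...' lines into a dict."""
    table = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        msd, category, *pairs = fields
        table[msd] = (category, dict(pair.split("=", 1) for pair in pairs))
    return table


msd_table = load_msd_features(io.StringIO(TABLE_TEXT))
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8")`.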

Adding a new language

• XSLT scripts:
    - mtems-split.xsl: make a template for the language particular section of a new language
    - mtems-merge: merge a new language particular section into the common tables
• Maybe shortly to be tested on new Slavic languages in the scope of MondiLex

Critiques

• It’s just an exercise in encoding anyway
• Same is different, different is same
• The Procrustean bed of standards

• Policy change: from unification to harmonisation (hippy school)

Conclusions

• Presented work-in-progress on “standardisation” of multilingual morphosyntactic specifications
• Specifications are a de-facto standard for several languages (Romanian, Slovene, Croatian)
• Could serve as a “hub” encoding for multilingual applications, e.g. MT, and as a framework for new languages

Further work

• Finishing MTE V4!
• Distribution: LDC, ELDA
• Relation to ISO-TC 37 standards:
    - LMF, isoCat
• Connecting to the GOLD ontology
• Adding new languages:
    - Slavic completion
    - Western European: MULTEXT
    - Japanese: ChaSen tagset, jpWaC(-L2)
    - Irish? ☺