Morphological Normalization and Collocation Extraction
Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
jan.snajder@fer.hr, bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr
Seminar at the K.U.Leuven, Department of Computing Science
Leuven, 2008-05-08

Morphological Normalization
Jan Šnajder, Marko Tadić

Talk overview
§ who are we?
§ what are we doing?
§ morphological processing: normalization
§ lemmatization vs. stemming
§ Mollex: a system for normalization of Croatian
§ usage in document indexing and text classification
§ collocations as features
§ collocation extraction by co-occurrence measures
§ usage of genetic programming

Who are we?
§ University of Zagreb, Croatia
  § founded 1669, 52,500 undergraduate students
§ two faculties on the same mission
  § building systems that develop and enable the use of language resources and tools for Croatian

Who are we? (2)
§ Faculty of Humanities and Social Sciences
  § Institute / Department of Linguistics
  § dealing with basic computational linguistic tasks for Croatian
  § compiling and processing large-scale language resources
    § Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  § tagger, lemmatizer
  § chunker, parser
  § NERC system

Who are we? (3)
§ Faculty of Electrical Engineering and Computing
  § Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
§ the Knowledge Technologies Laboratory group deals with
  § text preprocessing techniques for Croatian for machine learning procedures
  § dimensionality reduction and document clustering in the vector space model + visualisation
  § automatic indexing of documents
  § intelligent, language-specific information retrieval and extraction

What are we doing?
§ working jointly on several research projects
  § AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
  § RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects)
    § Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić
    § Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić
  § CADIAL: Computer Aided Document Indexing for Accessing Legislation
    § joint Flemish-Croatian project
    § 2007-2009
    § prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić

Morphological processing
§ computational linguistic / NLP task
§ important for inflectionally rich languages, e.g.
  § a Croatian noun has 14 word-forms (7 cases, 2 numbers):
      Case:     N, G, D, A, V, L, I
      Singular: student, studenta, studentu, studenta, studentu, studentu, studentom
      Plural:   studenti, studenata, studentima, studente, studentima
  § unlike an English noun, which has 2 (3?) word-forms (2 numbers + possessive?):
      Sg:   student
      Pl:   students
      Poss: (student's)
§ present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...

Morphological processing 2
§ three basic subtasks in inflection processing
  1. generation of (all) word-forms (WFs) of a lexeme
  2. analysis of WFs, i.e. recognizing the values of the morphosyntactic categories of a WF in text
  3. recognizing which lexeme(s) a WF belongs to
§ the last one helps avoid the problem of data sparseness in many text processing tasks, e.g.
  § information retrieval, text mining, document indexing
§ normalization: conflating the morphological variants of a word to a single representative form
§ two main ways to do that
  1. linguistically motivated: lemmatization
  2. computationally motivated: stemming

Morphological processing 3
§ lemmatization
  § replacing the WF with its proper base WF, usually called the lemma
  § e.g. mapping a theoretical maximum of (e.g. 14) WFs to 1 lemma
  § lexicon-based
    § large lexicons of all (generated) WFs needed
    § preparation expensive in time and manpower
    § mostly realized by databases
  § algorithm-based
    § mostly FSTs (finite-state transducers): compact, efficient, fast
    § lexicon of lemmas and their inflectional patterns needed anyway
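
The lexicon-based variant can be pictured as a plain table lookup. The following is only a minimal sketch: the `LEXICON` dictionary and the `lemmatize` function are illustrative names, not part of any system described in this talk, and the entries are the word-forms of "student" listed earlier.

```python
# Hypothetical toy lexicon: every (generated) word-form is listed with its lemma.
# A real lexicon of this kind would hold hundreds of thousands of entries.
LEXICON = {
    "student": "student",
    "studenta": "student",
    "studentu": "student",
    "studentom": "student",
    "studenti": "student",
    "studenata": "student",
    "studentima": "student",
    "studente": "student",
}


def lemmatize(word_form):
    """Return the lemma of a word-form, or the word-form itself if it is unknown."""
    return LEXICON.get(word_form.lower(), word_form)


print(lemmatize("studentima"))  # student
print(lemmatize("Leuven"))      # Leuven (not covered by the toy lexicon)
```

The fallback for unknown word-forms is a design choice of this sketch; it makes the coverage problem visible, since anything missing from the lexicon passes through unnormalized.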

Morphological processing 4
§ stemming
  § reducing the WF from the end by truncating the possible endings
  § does not have to respect linguistic boundaries
      vuk+Ø  >  *vu+kØ
      vuk+a  >  *vu+ka
      vuč+e  >  *vu+če
  § reducing all the WFs to a common beginning
  § problems where there are many morphonological adaptations
      sla+ti   >  *?+slati
      šalj+em  >  *?+šaljem
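
The "common beginning" that truncation would have to reach can be computed directly. This is a small illustrative sketch (the function name `common_beginning` is an assumption, not a term from the talk); it reproduces the two cases on the slide.

```python
def common_beginning(word_forms):
    """Return the longest prefix shared by all word-forms (possibly empty)."""
    prefix = word_forms[0]
    for wf in word_forms[1:]:
        # Shorten the prefix until the current word-form starts with it.
        while not wf.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


print(repr(common_beginning(["vuk", "vuka", "vuče"])))  # 'vu'
print(repr(common_beginning(["slati", "šaljem"])))      # ''
```

For vuk / vuka / vuče the shared beginning is "vu", which cuts across the linguistic stem boundary; for slati / šaljem nothing is shared at all, so no amount of truncation conflates them.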

Morphological normalization
§ Croatian (like most Slavic languages) is morphologically complex
  § elaborate inflectional and derivational morphology
  § problematic for most NLP applications
  § requires the use of substantial linguistic knowledge
§ our lexicon-based approach to normalization is somewhere between lemmatization and stemming
§ suitable for other inflectionally complex languages

Croatian Morphology
1. high degree of affixation
  § word-forms are obtained by suffixation, prefixation, phonological alternations, stem extension
  § inflection
    § nouns: declension (7 cases, 2 numbers)
    § verbs: conjugation (tenses, persons, numbers, genders)
    § adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness
  § derivation
    § a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...

Croatian Morphology 2
§ inflection examples
  § adjective: brz, brza, brzima, brzih, brzoj, brze, brzim, brzoga, brza, brzom, brzomu, brži, bržeg, brža, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ...
  § noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini
  § adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinskog, brzinskoga, …
  § adverb: brzo, brže, najbrže, brzinski
§ derivation examples
  § brz > brzina > brzinski > …

Croatian Morphology 3
2. high degree of homography
  § vode = voda (water) | voditi (to lead) | vod (a platoon)
  § requires disambiguation (POS/MSD tagging)
3. affix ambiguity
  § many ambiguous suffixation rules
    § e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
    § e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a
  § possible mismatches at inflectional level
    § narančast / narančast-om vs. ruž / ruž-om (not ruža)
  § possible mismatches at derivational level
    § e.g. kralj / kralj-ica vs. stan / stan-ica
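
Affix ambiguity shows up as soon as suffix rules are applied in reverse. The sketch below is an illustration only: the `INVERSE_RULES` list is a hypothetical handful of rules, not the rule set used by the authors.

```python
# Hypothetical inverse suffix rules: (word-form ending, lemma ending).
INVERSE_RULES = [("om", ""), ("om", "a"), ("i", "a"), ("a", "")]


def candidate_lemmas(word_form):
    """Return every lemma that some inverse rule proposes for a word-form."""
    candidates = []
    for wf_ending, lemma_ending in INVERSE_RULES:
        if word_form.endswith(wf_ending):
            stem = word_form[: len(word_form) - len(wf_ending)]
            lemma = stem + lemma_ending
            if lemma not in candidates:
                candidates.append(lemma)
    return candidates


print(candidate_lemmas("narančastom"))  # ['narančast', 'narančasta']
print(candidate_lemmas("ružom"))        # ['ruž', 'ruža']
```

Both word-forms end in -om and receive two candidates each, yet the right answer differs (narančast, but ruža), so the rules alone cannot decide and further evidence is needed.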

Lexicon-based normalization
§ lexicon-based morphological normalisation
  § a morphological lexicon associates with each WF its morphological norm (lemma, stem, ...) and, optionally, an MSD
  § incorporates linguistic knowledge and thus avoids the aforementioned pitfalls
§ drawbacks
  § made by linguists, expensive and time-consuming
  § problems with coverage (neologisms, jargon, …)
§ our approach
  § rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora

Our approach
1. acquisition of inflectional lexicon
  § input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism
2. normalisation of word-forms
  § inflectional (lemmatization)
  § inflectional + derivational
    § comparable to stemming (but more precise)
§ advantages
  § can be used as both a lemmatizer (with MSD) and a stemmer (with variable degree of conflation)
  § provides good lexicon coverage
  § requires only limited linguistic expertise

Morphology representation
§ e.g. noun inflectional paradigm
  § vojnik (soldier)
      Case:     N, G, D, A, V, L, I
      Singular: vojnik-Ø, vojnik-a, vojnik-u, vojnik-a, vojnič-e, vojnik-u, vojnik-om
      Plural:   vojnic-i, vojnik-a, vojnic-ima, vojnik-e, vojnic-ima

Morphology representation 2
§ defines inflectional and derivational rules
§ uses functions as building blocks:
  § A) condition functions
  § B) string transformation functions
§ each defined using a higher-order function
§ e.g.
  § sfx('a')('vojnik') = 'vojnika'
  § sfx('e') ∘ alt(pal)
  § (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
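
The building blocks above can be mimicked in a few lines. This is a sketch in Python rather than the authors' Haskell implementation: the `compose` helper and the contents of the `PAL` table (palatalization k > č, g > ž, h > š, applied to the final consonant) are assumptions made for illustration.

```python
def sfx(suffix):
    """Higher-order function: returns a transformation that appends `suffix`."""
    return lambda s: s + suffix


def alt(table):
    """Returns a transformation that alternates the final character per `table`."""
    def transform(s):
        if s and s[-1] in table:
            return s[:-1] + table[s[-1]]
        return s
    return transform


def compose(*functions):
    """compose(f, g)(s) == f(g(s)): the rightmost function is applied first."""
    def composed(s):
        for f in reversed(functions):
            s = f(s)
        return s
    return composed


# Assumed palatalization table for the stem-final consonant.
PAL = {"k": "č", "g": "ž", "h": "š"}

print(sfx("a")("vojnik"))                         # vojnika
print(compose(sfx("e"), alt(PAL))("vojnik"))      # vojniče
```

Because every building block returns an ordinary string-to-string function, new transformations are obtained purely by composition, which is what makes the rule notation compact.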

Morphology representation 3
    Case:     N, G, D, A, V, L, I
    Singular: vojnik-Ø, vojnik-a, vojnik-u, vojnik-a, vojnič-e, vojnik-u, vojnik-om
    Plural:   vojnic-i, vojnik-a, vojnic-ima, vojnik-e, vojnic-ima
§ (λs. ends('k', 'g', 'h')(s) consGroup(s),
   {null, sfx('a'), sfx('u'), sfx('om'), sfx('e') ∘ alt(pal), sfx('i') ∘ alt(sib), sfx('ima') ∘ alt(sib), sfx('e')})
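
A paradigm of this shape, a condition paired with a set of transformations, can be applied to a stem to generate its word-forms. The sketch below is a simplified Python illustration, not the authors' code: the condition keeps only the `ends` check (the `consGroup` part is left out), and the `PAL` and `SIB` tables (k > č, g > ž, h > š and k > c, g > z, h > s) as well as the helper names `compose`, `null` and `word_forms` are assumptions.

```python
def sfx(suffix):
    return lambda s: s + suffix


def alt(table):
    # Alternate the final character if the table has an entry for it.
    return lambda s: s[:-1] + table[s[-1]] if s and s[-1] in table else s


def compose(f, g):
    return lambda s: f(g(s))  # g is applied first, then f


def ends(*chars):
    return lambda s: s.endswith(chars)  # condition function


def null(s):
    return s  # identity transformation (zero ending)


PAL = {"k": "č", "g": "ž", "h": "š"}  # assumed palatalization table
SIB = {"k": "c", "g": "z", "h": "s"}  # assumed sibilarization table

# Simplified paradigm: (condition, transformations).
PARADIGM = (
    ends("k", "g", "h"),
    [null, sfx("a"), sfx("u"), sfx("om"),
     compose(sfx("e"), alt(PAL)),
     compose(sfx("i"), alt(SIB)),
     compose(sfx("ima"), alt(SIB)),
     sfx("e")],
)


def word_forms(paradigm, stem):
    """Generate the distinct word-forms of `stem`, or [] if the condition fails."""
    condition, transformations = paradigm
    if not condition(stem):
        return []
    forms = []
    for transform in transformations:
        form = transform(stem)
        if form not in forms:
            forms.append(form)
    return forms


print(word_forms(PARADIGM, "vojnik"))
# ['vojnik', 'vojnika', 'vojniku', 'vojnikom', 'vojniče', 'vojnici', 'vojnicima', 'vojnike']
```

The eight generated strings are exactly the distinct word-forms in the table for vojnik; a stem such as "brod" fails the condition and yields nothing under this paradigm.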

Morphology representation 4
§ suitable also for more complex paradigms
  (c, {null, sfx('a'), sfx('u'), ..., sfx('ima')}
      {sfx('og'), sfx('om'), ..., sfx('ima')}
      {sfx('i') ∘ alt(jot), sfx('eg') ∘ alt(jot), ..., sfx('ima') ∘ alt(jot)}
      {sfx('i') ∘ alt(jot) ∘ pfx('naj'), ..., sfx('ima') ∘ alt(jot) ∘ pfx('naj')})

Morphology representation 5
§ advantages
  § resembles the morphology descriptions found in traditional grammar books
  § requires a minimum amount of linguistic knowledge
  § highly expressive: arbitrary higher-order functions (HOFs) can be defined
  § can be applied to other morphologically similar languages
§ implemented in Haskell
  § purely functional programming language
  § requires minimum programming skills

Lexicon acquisition
§ uses inflectional rules + raw corpora to extract lemmas and their paradigms
  § uses frequency counts of WFs attested in the corpus
§ much of the ambiguity is resolved by language-dependent heuristics
  § plausibility, priority
§ linguistic quality is not vital
  § word-form conflation rather than generation
§ human intervention is not required
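
One way to picture the acquisition step is as hypothesis generation followed by scoring against the corpus. The sketch below is an illustration under strong assumptions, not the authors' procedure: the two toy paradigms in `PARADIGMS` are hypothetical, and the score (number of paradigm forms attested in the corpus) merely stands in for the plausibility and priority heuristics named on the slide.

```python
from collections import Counter

# Hypothetical toy paradigms: name -> endings (the first ending forms the lemma).
PARADIGMS = {
    "masc-cons": ["", "a", "u", "om", "i", "ima", "e"],
    "fem-a": ["a", "e", "i", "u", "om", "ama"],
}

# Word-form frequency counts from a (tiny, made-up) raw corpus.
COUNTS = Counter("brzina brzine brzinu brzinom brzinama brzini brzina".split())


def candidates(word_form):
    """Yield (stem, paradigm name) pairs that could have produced word_form."""
    for name, endings in PARADIGMS.items():
        for ending in endings:
            if word_form.endswith(ending) and len(word_form) > len(ending):
                yield word_form[: len(word_form) - len(ending)], name


def score(stem, name, counts):
    """Assumed score: how many forms of the hypothesized paradigm are attested."""
    return sum(1 for ending in PARADIGMS[name] if counts[stem + ending] > 0)


def acquire(word_form, counts):
    """Return the (lemma, paradigm name) hypothesis with the best corpus support."""
    stem, name = max(candidates(word_form),
                     key=lambda c: score(c[0], c[1], counts))
    return stem + PARADIGMS[name][0], name


print(acquire("brzinom", COUNTS))  # ('brzina', 'fem-a')
```

For "brzinom" the ending -om fits both toy paradigms, but six forms of the feminine hypothesis are attested against five of the masculine one, so the corpus evidence selects the lemma brzina.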

Results
§ example lexicon
  § acquired from a 20 Mw (million-word) newspaper corpus
  § based on 90 inflectional and >300 derivational rules
  § contains ca. 42,000 lemmas associated with over 500,000 WFs
§ performance
  § linguistic quality F1 = 88% per type
  § coverage 96% per type and 98% per token
  § understemming = 7%, overstemming < 4%
§ can be improved further by manual editing
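
The difference between per-type and per-token coverage can be made concrete with a toy calculation. This sketch assumes that coverage means the share of corpus word-forms found in the lexicon, counted once per distinct form (type) or once per occurrence (token); the data are made up and the function name `coverage` is illustrative.

```python
from collections import Counter


def coverage(tokens, lexicon):
    """Return (per-type, per-token) coverage of `tokens` by `lexicon`."""
    counts = Counter(tokens)
    covered_types = [wf for wf in counts if wf in lexicon]
    per_type = len(covered_types) / len(counts)
    per_token = sum(counts[wf] for wf in covered_types) / len(tokens)
    return per_type, per_token


tokens = ["brzina", "brzine", "brzina", "brzinu", "leuven"]
lexicon = {"brzina", "brzine", "brzinu"}
print(coverage(tokens, lexicon))  # (0.75, 0.8)
```

Per-token coverage tends to be the higher of the two, as in the reported figures, because frequent word-forms are more likely to be in the lexicon and each of their occurrences counts.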

Derivational normalization
§ inflectional lexicon is partitioned into equivalence classes based on derivational rules
§ degree of normalisation depends on the number of derivational rules used
§ problem with semantics
  § context, degrees
  § derivation is not as semantically regular as inflection
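
Partitioning lemmas into derivational equivalence classes amounts to taking connected components. The sketch below uses a small union-find structure; it assumes the derivational rules have already been applied and have produced (base, derived) lemma pairs, and the function name `derivational_classes` is illustrative.

```python
def derivational_classes(lemmas, derived_pairs):
    """Group lemmas into classes connected by (base, derived) pairs."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(x):
        # Follow parent links to the class representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for base, derived in derived_pairs:
        if base in parent and derived in parent:
            parent[find(base)] = find(derived)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), []).append(lemma)
    return sorted(sorted(members) for members in classes.values())


lemmas = ["brz", "brzina", "brzinski", "vojnik"]
print(derivational_classes(lemmas, [("brz", "brzina"), ("brzina", "brzinski")]))
# [['brz', 'brzina', 'brzinski'], ['vojnik']]
print(derivational_classes(lemmas, [("brz", "brzina")]))
# [['brz', 'brzina'], ['brzinski'], ['vojnik']]
```

The second call uses fewer derivational links and yields finer classes, which illustrates how the number of rules in use controls the degree of normalisation.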

References and applications
§ Reference
  § Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008 (in press).
§ Applied in document indexing
  § projects AIDE & CADIAL, www.cadial.org
  § Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove & P. Popelier (eds). Brugge: Die Keure, 2008, pp. 107-117.
§ Applied in text classification
  § Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information Processing and Management, 44 (2008), 1; 325-339.

Thank you for your attention!