Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

Željko Agić, Marko Tadić and Zdravko Dovedan
University of Zagreb
{zagic, mtadic, zdovedan}@ffzg.hr

Introduction
• morphosyntactic tagging
  • assigning word categories and subcategories to words in sentence context
• issues
  • modelling sentence context
  • handling unknown words, dealing with sparse data
• common approaches
  • rule-based, stochastic, hybrid
  • data-driven models are predominant today
  • best performing taggers are based on SVM, CRF, HMM

Introduction
• data-driven tagging modules
  • the tagger and the data
  • data implies tagset encoding word (sub)categories
• a solved problem?
  • state-of-the-art accuracy on English is 97-98%
  • tagsets for English: max. 100 different tags
  • 1475 different morphosyntactic tags used in the Croatian Morphological Lexicon
  • accuracy for state-of-the-art taggers drops by ca. 10%

Tagging Croatian texts
• CroTag tagger
  • inspired by TnT and HunPos
  • trained on a 118 kw corpus manually annotated with MTE v3 tags
  • accuracy identical to these (96-97% EN, 85-86% HR)
  • all are highly dependent on unknown word counts
• improvements
  • using the inflectional lexicon to handle unknown words
  • tagger voting, hybridization?

From another perspective...
• goals of tagging
  • reaching perfect accuracy on full tagset or
  • making large-scale NLP systems perform better?
• specific requirements
  • users and systems always have them
  • example: named entity normalization in Croatian
    Is it Ivo (m.) or Iva (f.) Sanader?
  • specific tasks may require specific tagset design
  • keeping speed and memory footprint
  • reducing tagset size means raising accuracy

Reducing the tagset
• MulText East version 3
  • positional tagset, letters encode categories
  • example: Ncmsn = noun, common, masculine, etc.
• the subsets
  1 – strip non-inflective categories and numerals (800 tags)
  2 – strip verbs (739)
  3 – strip all but gender, number, case and noun type (243)
  4 – remove case category (48)
  5 – keep noun type category only (15)
  6 – maintain part-of-speech information only (13)
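
To illustrate why a positional tagset lends itself to this kind of reduction, here is a minimal Python sketch. It assumes the noun layout suggested by the example Ncmsn (category, type, gender, number, case). The helper names pos_only and mask_noun_attribute, the "-" placeholder for a dropped value and the toy tag list are assumptions made for illustration; this is not the authors' actual reduction procedure, and other word categories use different position layouts.

```python
# Assumed attribute positions for NOUN tags only (e.g. "Ncmsn").
# Other categories (verbs, adjectives, ...) have their own layouts.
NOUN_POSITIONS = {"category": 0, "type": 1, "gender": 2, "number": 3, "case": 4}


def pos_only(tag: str) -> str:
    """Keep only the part-of-speech letter (in the spirit of subset 6)."""
    return tag[0]


def mask_noun_attribute(tag: str, attribute: str) -> str:
    """Replace one attribute of a noun tag with '-' (e.g. drop case)."""
    if not tag.startswith("N"):
        return tag  # this sketch only knows the noun layout
    i = NOUN_POSITIONS[attribute]
    if i >= len(tag):
        return tag  # attribute not present in this tag
    return tag[:i] + "-" + tag[i + 1:]


# Hypothetical toy tagset, not taken from the corpus.
toy_tags = ["Ncmsn", "Ncmsg", "Ncmsa", "Ncfsn", "Ncfpg", "Npmsn"]

full = set(toy_tags)
no_case = {mask_noun_attribute(t, "case") for t in toy_tags}
pos = {pos_only(t) for t in toy_tags}

# Distinct tags before and after each reduction.
print(len(full), len(no_case), len(pos))  # prints: 6 4 1
```

Each reduction merges tags that differ only in the dropped attribute, so the number of distinct labels the tagger has to choose from shrinks, which is the effect the subset sizes above describe.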

Results

More results
• adjectives, nouns and pronouns
  • the categories most difficult to tag for Croatian
  • combination of frequency and tags used
  • maybe these are most important to tag accurately?

F1-measures on adjectives, nouns and pronouns

type        subset 0     subset 4     subset 5
Adjective   0.64±0.04    0.74±0.05    0.92±0.02
Noun        0.79±0.03    0.86±0.03    0.95±0.01
Pronoun     0.76±0.03    0.87±0.04    0.99±0.01
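
The slides do not say how the per-category F1-measures were computed. The sketch below uses the standard definition (harmonic mean of precision and recall) and assumes that a token counts as correct for a category only when the whole tag matches. The function name f1_for_category and the toy gold and predicted tags are hypothetical.

```python
def f1_for_category(gold, predicted, category):
    """F1 for one part-of-speech letter, with full-tag match as 'correct'."""
    # True positives: tag fully correct and belonging to the category.
    tp = sum(1 for g, p in zip(gold, predicted) if g == p and g[0] == category)
    gold_n = sum(1 for g in gold if g[0] == category)       # tokens of the category in gold
    pred_n = sum(1 for p in predicted if p[0] == category)  # tokens the tagger assigned to it
    if tp == 0:
        return 0.0
    precision = tp / pred_n
    recall = tp / gold_n
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only.
gold = ["Ncmsn", "Afpmsn", "Vmip3s", "Ncfsg"]
predicted = ["Ncmsn", "Afpmsa", "Vmip3s", "Ncfsn"]

for letter in ("N", "A", "V"):
    print(letter, f1_for_category(gold, predicted, letter))
```

On this toy data a case error alone makes a noun or adjective tag count as wrong, which shows why removing the case category (subset 4) can raise the scores in the table above.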

Conclusions
• results are as expected
  • reducing tagset size raises tagging accuracy
  • sacrificing information for efficiency
• reductions are illustrative
  • careful tagset design is required, guided by the requirements of the task
• further work
  • as mentioned: reaching perfect accuracy on full tagset or making large-scale NLP systems perform better?

Your questions?
Computational Linguistic Models and Language Technologies for Croatian
rmjt.ffzg.hr | hml.ffzg.hr | hnk.ffzg.hr
