CMSC 723: Intro to Computational Linguistics February 11, 2003 Lecture 3: Finite-State Morphology Prof. Bonnie J. Dorr and Dr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot
Plan for Today’s Lecture • Morphology: Definitions and Problems – What is Morphology? – Topology of Morphologies • Approaches to Computational Morphology – Lexicons and Rules – Computational Morphology Approaches • Assignment 2
Morphology
• The study of the way words are built up from smaller meaning units called morphemes
• Abstract versus Realized: HOP +PAST → hop +ed → hopped /hapt/
  – Syntax: Lexeme/Inflected Lexeme (Grammars)
  – Morphology: Morpheme/Allomorph (Morphotactics)
  – Phonology: Phoneme/Allophone (Phonotactics)
• Context
  – Syntax: sentences
  – Morphology: words
  – Phonology: letters
Phonology and Morphology • Phonology vs. Orthography • Historical spelling – night, nite – attention, mission, fish • Script Limitations – Spoken English has 14 vowels • heed hid hayed head hoed hood who’d hide how’d taught Tut toy enough – English Alphabet has 5 • Use vowel combinations: far fair fare • Consonantal doubling (hopping vs. hoping)
Syntax and Morphology
• Phrase-level agreement
  – Subject-Verb: John studies hard (STUDY+3SG)
  – Noun-Adjective: Las vacas hermosas
• Sub-word phrasal structures
  – שבספרינו
  – נו + ים + ספר + ב + ש
  – That+in+book+PL+Poss:1PL
  – Morpheme types: conj, prep, article, noun, plural, poss
  – “Which are in our books”
Topology of Morphologies • Concatenative vs. Templatic • Derivational vs. Inflectional • Regular vs. Irregular
Concatenative Morphology
• Morpheme+Morpheme+…
• Stems: also called lemma, base form, root, lexeme
  – hope+ing → hoping, hop+ing → hopping
• Affixes
  – Prefixes: Antidisestablishmentarianism
  – Suffixes: Antidisestablishmentarianism
  – Infixes: hingi (borrow) – humingi (borrower) in Tagalog
  – Circumfixes: sagen (say) – gesagt (said) in German
• Agglutinative Languages
  – uygarlaştıramadıklarımızdanmışsınızcasına
  – uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
  – Behaving as if you are among those whom we could not cause to become civilized
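Templatic Morphology
• Roots and Patterns
  – Root K-T-B: Arabic ﻙ ﺕ ﺏ, Hebrew כ ת ב
  – Patterns: root slots (?) combined with ﻭ (Arabic) or ו (Hebrew)
  – Arabic: ﻣﻜﺘﻮﺏ maktuub “written”
  – Hebrew: כתוב ktuuv “written”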
Templatic Morphology: Root Meaning
• KTB: writing “stuff”
• Arabic: ﻛﺘﺎﺏ book, ﻣﻜﺘﺒﺔ library, ﻣﻜﺘﺐ office, ﻛﺘﺐ write, ﻣﻜﺘﻮﺏ letter, ﻛﺎﺗﺐ writer
• Hebrew: כתב write, מכתב letter, כתיב spelling, כתובת address
Inflectional vs. Derivational • Word Classes – Parts of speech: noun, verb, adjectives, etc. – Word class dictates how a word combines with morphemes to form new words
Derivational morphology • Nominalization: computerization, appointee, killer, fuzziness • Formation of adjectives: computational, clueless, embraceable • CatVar: Categorial Variation Database http://clipdemos.umiacs.umd.edu/catvar/
Inflectional morphology • Adds: Tense, number, person, mood, aspect • Word class doesn’t change • Word serves new grammatical role • Five verb forms in English • Other languages have (lots more)
Nouns and Verbs (in English) • Nouns have simple inflectional morphology – cat+s, cat+’s • Verbs have more complex morphology
Regulars and Irregulars • Nouns – Cat/Cats – Mouse/Mice, Ox/Oxen, Goose/Geese • Verbs – Walk/Walked – Go/Went, Fly/Flew
Regular (English) Verbs: Morphological Form Classes, Regularly Inflected Verbs
• Stem: walk, merge, try, map
• -s form: walks, merges, tries, maps
• -ing form: walking, merging, trying, mapping
• Past form or -ed participle: walked, merged, tried, mapped
Irregular (English) Verbs: Morphological Form Classes, Irregularly Inflected Verbs
• Stem: eat, catch, cut
• -s form: eats, catches, cuts
• -ing form: eating, catching, cutting
• Past form: ate, caught, cut
• -ed participle: eaten, caught, cut
“To love” in Spanish
Computational Morphology • Finite State Morphology – Finite State Transducers (FST) • Input/Output • Analysis/Generation
Computational Morphology: WORD → STEM (+FEATURES)*
• cats → cat +N +PL
• cat → cat +N +SG
• cities → city +N +PL
• geese → goose +N +PL
• ducks → (duck +N +PL) or (duck +V +3SG)
• merging → merge +V +PRES-PART
• caught → (catch +V +PAST-PART) or (catch +V +PAST)
Computational Morphology • The Rules and the Lexicon – General versus Specific – Regular versus Irregular – Accuracy, speed, space – The Morphology of a language • Approaches – Lexicon only – Rules only – Lexicon and Rules • Finite-state Automata • Finite-state Transducers
Lexicon-only Morphology
• The lexicon lists all surface level and lexical level pairs
• No rules …?
• Analysis/Generation is easy
• Very large for English
• What about Arabic or Turkish?
• Chinese?
• Sample entries (surface form → lexical form)
  – acclaim → acclaim $N$
  – acclaim → acclaim $V+0$
  – acclaimed → acclaim $V+ed$
  – acclaimed → acclaim $V+en$
  – acclaiming → acclaim $V+ing$
  – acclaims → acclaim $N+s$
  – acclaims → acclaim $V+s$
  – acclamation → acclamation $N$
  – acclamations → acclamation $N+s$
  – acclimate → acclimate $V+0$
  – acclimated → acclimate $V+ed$
  – acclimated → acclimate $V+en$
  – acclimates → acclimate $V+s$
  – acclimating → acclimate $V+ing$
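A lexicon-only analyzer can be sketched as two lookup tables built from the same list of pairs. This is a minimal illustration, not the course's implementation: the entries and the names `SURFACE_TO_LEXICAL`, `analyze`, and `generate` are hypothetical, and the tags follow the analysis examples used elsewhere in the lecture.

```python
from collections import defaultdict

# Hypothetical toy lexicon: every surface form is listed with its lexical form.
SURFACE_TO_LEXICAL = [
    ("cat", "cat +N +SG"),
    ("cats", "cat +N +PL"),
    ("geese", "goose +N +PL"),
    ("ducks", "duck +N +PL"),
    ("ducks", "duck +V +3SG"),
    ("caught", "catch +V +PAST"),
    ("caught", "catch +V +PAST-PART"),
]

# Index the same pairs in both directions.
analyses = defaultdict(list)
generations = defaultdict(list)
for surface, lexical in SURFACE_TO_LEXICAL:
    analyses[surface].append(lexical)
    generations[lexical].append(surface)


def analyze(word):
    """Return every lexical form listed for a surface word (empty if unlisted)."""
    return analyses.get(word, [])


def generate(lexical):
    """Return every surface word listed for a lexical form (empty if unlisted)."""
    return generations.get(lexical, [])
```

Both directions are a single lookup, which is why analysis and generation are easy here; the cost is that any word not listed, such as "dogs" in this toy lexicon, gets no analysis at all.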
Lexicon and Rules: FSA Inflectional Morphology
• English Noun Lexicon
  – reg-noun: fox, cat, dog
  – irreg-pl-noun: geese, sheep, mice
  – irreg-sg-noun: goose, sheep, mouse
• English Noun Rule
  – plural: -s
FSA English Verb Inflectional Morphology
• reg-verb-stem: walk, fry, talk, impeach
• Irregular stems and past forms: cut, speak, spoken, sing, sang, caught, ate, eaten
• Affixes: past -ed, past-part -ed, pres-part -ing, 3sg -s
FSA for Derivational Morphology: Adjectival Formation
More Complex Derivational Morphology
Using FSAs for Recognition: English Nouns and their Inflection
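A recognizer for the noun lexicon can be sketched as a small automaton whose arcs are labeled with morpheme classes. This is an illustrative sketch only: the state names `q0`, `q1`, `q2` and the function `accepts` are assumptions, and the word lists are the ones from the English noun lexicon slide.

```python
REG_NOUN = {"fox", "cat", "dog"}
IRREG_SG_NOUN = {"goose", "sheep", "mouse"}
IRREG_PL_NOUN = {"geese", "sheep", "mice"}

# state -> list of (morphemes allowed on the arc, next state)
TRANSITIONS = {
    "q0": [(REG_NOUN, "q1"), (IRREG_SG_NOUN, "q2"), (IRREG_PL_NOUN, "q2")],
    "q1": [({"s"}, "q2")],  # regular plural -s
    "q2": [],
}
FINAL_STATES = {"q1", "q2"}


def accepts(word, state="q0"):
    """Return True if some path through the automaton consumes the whole word."""
    if word == "":
        return state in FINAL_STATES
    for morphemes, next_state in TRANSITIONS[state]:
        for morpheme in morphemes:
            if word.startswith(morpheme) and accepts(word[len(morpheme):], next_state):
                return True
    return False
```

Because the automaton only concatenates morphemes, it accepts "cats" and "geese" and rejects "gooses", but it also accepts "foxs" and rejects "foxes". Handling that requires spelling rules on top of the morphotactics.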
Morphological Parsing • Finite-state automata (FSA) – Recognizer – One-level morphology • Finite-state transducers (FST) – Two-level morphology • PC-Kimmo (Koskenniemi 83) – input-output pair
Terminology for PC-Kimmo • Upper = lexical tape • Lower = surface tape • Characters correspond to pairs, written a:b • If “a:a”, write “a” for shorthand • Two-level lexical entries • # = word boundary • ^ = morpheme boundary • Other = “any feasible pair that is not in this transducer”
Four-Fold View of FSTs • As a recognizer • As a generator • As a translator • As a set relater
Nominal Inflection FST
Lexical and Intermediate Tapes
Spelling Rules (Name: Rule Description; Example)
• Consonant doubling: 1-letter consonant doubled before -ing/-ed; beg/begging
• E-deletion: silent e dropped before -ing and -ed; make/making
• E-insertion: e added after s, z, x, ch, sh before s; watch/watches
• Y-replacement: -y changes to -ie before -s, -i before -ed; try/tries
• K-insertion: verbs ending with vowel + -c add -k; panic/panicked
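Three of these rules can be sketched as regular-expression rewrites over an intermediate form such as `make^ing#`, where `^` marks a morpheme boundary and `#` the end of the word. This is a simplified illustration rather than a two-level rule compiler: the rule list `RULES`, the function `to_surface`, and the consonant-before-y condition are assumptions, and the E-deletion pattern treats every stem-final e as silent.

```python
import re

# (pattern, replacement) pairs applied in order to the intermediate form.
RULES = [
    # E-deletion: stem-final e dropped before -ing and -ed
    (r"e\^(?=(ing|ed)#)", "^"),
    # Y-replacement: consonant + y becomes ie before -s ...
    (r"(?<=[^aeiou])y\^(?=s#)", "ie^"),
    # ... and i before -ed
    (r"(?<=[^aeiou])y\^(?=ed#)", "i^"),
    # K-insertion: vowel + c adds k before -ed and -ing
    (r"(?<=[aeiou]c)\^(?=(ed|ing)#)", "k^"),
]


def to_surface(intermediate):
    """Apply the spelling rules, then erase the boundary symbols ^ and #."""
    for pattern, replacement in RULES:
        intermediate = re.sub(pattern, replacement, intermediate)
    return intermediate.replace("^", "").replace("#", "")
```

For example, `to_surface("try^s#")` gives "tries" and `to_surface("panic^ed#")` gives "panicked", while forms that match no rule, such as `walk^ed#`, are simply concatenated.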
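Chomsky and Halle Notation • E-insertion rule: ε → e / {x, s, z} ^ __ s # • Read as: insert e after a morpheme-final x, s, or z (followed by the morpheme boundary ^) and before a word-final s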
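The E-insertion rule in this notation can be sketched as one regular-expression substitution over the intermediate tape. This is an assumption-laden illustration: the function name `e_insertion` is hypothetical, and only the x, s, z contexts of the rule as written are covered (not the ch and sh cases from the spelling-rule table).

```python
import re


def e_insertion(intermediate):
    """Apply ε → e / {x, s, z} ^ __ s # and erase the boundary symbols."""
    # The lookbehind checks the left context, the lookahead the right context;
    # only the boundary itself is rewritten, to "^e".
    with_e = re.sub(r"(?<=[xsz])\^(?=s#)", "^e", intermediate)
    return with_e.replace("^", "").replace("#", "")
```

With this sketch, `e_insertion("fox^s#")` yields "foxes", while `e_insertion("cat^s#")` falls outside the rule's context and yields "cats".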
Intermediate-to-Surface Transducer
State Transition Table
Two-Level Morphology
Sample Run
FST Properties • Inversion – T⁻¹ = inversion of T – Input/Output switched • Composition – T1 maps I1 to O1 – T2 maps I2 to O2 – T1 ∘ T2 maps I1 to O2
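Both properties can be illustrated by modeling a transducer as a finite set of (input, output) string pairs. This is a deliberate simplification, since a real FST encodes a possibly infinite relation with states and arcs; the names `invert`, `compose`, and the two toy relations are hypothetical.

```python
def invert(relation):
    """Swap input and output in every pair of the relation."""
    return {(out, inp) for (inp, out) in relation}


def compose(t1, t2):
    """Pair each input of t1 with each output of t2 reachable through a shared middle string."""
    return {(a, c) for (a, b) in t1 for (b2, c) in t2 if b == b2}


# Toy relations: lexical -> intermediate, and intermediate -> surface.
LEX_TO_INTERMEDIATE = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
INTERMEDIATE_TO_SURFACE = {("fox^s#", "foxes"), ("cat^s#", "cats")}

# Composition gives a generator; inverting the generator gives an analyzer.
GENERATE = compose(LEX_TO_INTERMEDIATE, INTERMEDIATE_TO_SURFACE)
ANALYZE = invert(GENERATE)
```

Here `GENERATE` maps "fox +N +PL" directly to "foxes", and `ANALYZE` maps "foxes" back to "fox +N +PL", which is the sense in which one transducer serves both generation and analysis.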
FSTs and ambiguity
• Kimmo Demo
• Parse Example 1: unionizable
  – union +ize +able
  – un+ ion +ize +able
• Parse Example 2: assess
  – assess (V)
  – ass (N) + ess (N)
• Parse Example 3: tender
  – tender (AJ)
  – ten (Num) + d (AJ) + er (CMP)
What to do about Global Ambiguity? • Accept first successful structure • Run parser through all possible paths • Bias the search in some manner
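The second option, running the parser through all possible paths, can be sketched as an exhaustive segmentation of the surface string. The morpheme inventory and the function `segmentations` are hypothetical; the inventory lists surface pieces, so -ize appears as "iz" (its form after e-deletion before -able).

```python
# Hypothetical inventory of surface morphs for the "unionizable" example.
MORPHS = {"union", "un", "ion", "iz", "able"}


def segmentations(word):
    """Return every way to split word into a sequence of known morphs."""
    if word == "":
        return [[]]  # one way to segment the empty string: no morphs
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in MORPHS:
            for rest in segmentations(word[i:]):
                results.append([prefix] + rest)
    return results
```

For "unionizable" this returns both `["un", "ion", "iz", "able"]` and `["union", "iz", "able"]`. Accepting the first successful structure would keep only one of them, and biasing the search would amount to ranking the list.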
Lexicon-Free Morphology: Porter Stemmer • Lexicon-Free FST Approach • By Martin Porter (1980) http://www.tartarus.org/%7Emartin/PorterStemmer/ • Cascade of substitutions given specific conditions: GENERALIZATIONS → GENERALIZATION → GENERALIZE → GENERAL → GENER • Porter Stemmer Game
Porter Stemmer Definitions
• C = consonant = any letter other than A, E, I, O, U, and other than Y preceded by a consonant
• V = vowel = not C
• M = Measure: every word has the form C*(V+C+){M}V*
  – M=0: TR, EE, TREE, Y, BY
  – M=1: TROUBLE, OATS, TREES, IVY
  – M=2: TROUBLES, PRIVATE, OATEN, ORRERY
• Conditions
  – *S: stem ends with S
  – *v*: stem contains a V
  – *d: stem ends with double C (e.g. -DD, -ZZ)
  – *o: stem ends CVC, where the second C is not W, X or Y (e.g. -WIL, -SOB)
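The measure can be computed by labeling each letter C or V and counting how often a V is immediately followed by a C. This is a minimal sketch of the definitions above, not Porter's reference code; the helper names `is_consonant` and `measure` are assumptions.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under the Porter definitions."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y counts as a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return M, the number of VC sequences in C*(V+C+){M}V*."""
    word = word.lower()
    pattern = "".join(
        "C" if is_consonant(word, i) else "V" for i in range(len(word))
    )
    # Each vowel run followed by a consonant run contributes exactly one V->C boundary.
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")
```

On the slide's examples this gives 0 for TREE and BY, 1 for TROUBLE and IVY, and 2 for PRIVATE and ORRERY.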
Porter Stemmer
Step 1: Plural Nouns and Third Person Singular Verbs
• SSES → SS: caresses → caress
• IES → I: ponies → poni, ties → ti
• SS → SS: caress → caress
• S → ε: cats → cat
Step 2a: Verbal Past Tense and Progressive Forms
• (m>0) EED → EE: feed → feed, agreed → agree
• i. (*v*) ED → ε: plastered → plaster, bled → bled
• ii. (*v*) ING → ε: motoring → motor, sing → sing
Step 2b: If 2a.i or 2a.ii is successful, Cleanup
• AT → ATE: conflat(ed) → conflate
• BL → BLE: troubl(ed) → trouble
• IZ → IZE: siz(ed) → size
• (*d and not (*L or *S or *Z)) → single letter: hopp(ing) → hop, tann(ed) → tan, hiss(ing) → hiss, fizz(ed) → fizz
• (m=1 and *o) → E: fail(ing) → fail, fil(ing) → file
Step 3: Y → I
• (*v*) Y → I: happy → happi, sky → sky
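Step 1 can be sketched as an ordered list of suffix rewrites in which the first matching rule wins. The names `STEP1_RULES` and `step1` are hypothetical, and this covers only the plural and third-person rules, not the rest of the cascade.

```python
# Longest suffixes first, so that "sses" is tried before "ss" and "s".
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Apply the first Step 1 rule whose suffix matches; otherwise return word unchanged."""
    for suffix, replacement in STEP1_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

The SS → SS rule looks redundant, but it is what stops "caress" from reaching the final S → ε rule and losing its last letter.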
Porter Stemmer Step 4: Derivational Morphology I: Multiple Suffixes (all rules require m>0)
• ATIONAL → ATE: relational → relate
• TIONAL → TION: conditional → condition, rational → rational
• ENCI → ENCE: valenci → valence
• ANCI → ANCE: hesitanci → hesitance
• IZER → IZE: digitizer → digitize
• ABLI → ABLE: conformabli → conformable
• ALLI → AL: radicalli → radical
• ENTLI → ENT: differentli → different
• ELI → E: vileli → vile
• OUSLI → OUS: analogousli → analogous
• IZATION → IZE: vietnamization → vietnamize
• ATION → ATE: predication → predicate
• ATOR → ATE: operator → operate
• ALISM → AL: feudalism → feudal
• IVENESS → IVE: decisiveness → decisive
• FULNESS → FUL: hopefulness → hopeful
• OUSNESS → OUS: callousness → callous
• ALITI → AL: formaliti → formal
• IVITI → IVE: sensitiviti → sensitive
• BILITI → BLE: sensibiliti → sensible
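The m>0 condition is checked on the stem that remains after the suffix is removed, which is why "rational" survives while "relational" is rewritten. The sketch below uses a small subset of the Step 4 table; `measure`, `STEP4_RULES`, and `step4` are hypothetical names, and `measure` follows the C/V definitions of the stemmer.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y is a vowel only after a consonant)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences in the stem."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")


# Subset of the Step 4 table; "ational" must be tried before "tional".
STEP4_RULES = [
    ("ational", "ate"),
    ("ization", "ize"),
    ("ousness", "ous"),
    ("tional", "tion"),
    ("biliti", "ble"),
    ("enci", "ence"),
]


def step4(word):
    """Rewrite the first matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in STEP4_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Once a suffix matches, no other rule is tried for this word.
            return stem + replacement if measure(stem) > 0 else word
    return word
```

In this sketch "relational" has the stem "rel" (m=1) and becomes "relate", whereas "rational" matches the same suffix with the stem "r" (m=0) and is left unchanged.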
Porter Stemmer Step 5: Derivational Morphology II: More Multiple Suffixes (all rules require m>0)
• ICATE → IC: triplicate → triplic
• ATIVE → ε: formative → form
• ALIZE → AL: formalize → formal
• ICITI → IC: electriciti → electric
• ICAL → IC: electrical → electric
• FUL → ε: hopeful → hope
• NESS → ε: goodness → good
Porter Stemmer Step 6: Derivational Morphology III: Single Suffixes (all rules require m>1; each suffix → ε)
• AL: revival → reviv
• ANCE: allowance → allow
• ENCE: inference → infer
• ER: airliner → airlin
• IC: gyroscopic → gyroscop
• ABLE: adjustable → adjust
• IBLE: defensible → defens
• ANT: irritant → irrit
• EMENT: replacement → replac
• MENT: adjustment → adjust
• ENT: dependent → depend
• (m>1 and (*S or *T)) ION: adoption → adopt
• OU: homologou → homolog
• ISM: communism → commun
• ATE: activate → activ
• ITI: angulariti → angular
• OUS: homologous → homolog
• IVE: effective → effect
• IZE: bowdlerize → bowdler
Porter Stemmer Step 7a: Cleanup
• (m>1) E → ε: probate → probat, rate → rate
• (m=1 and not *o) E → ε: cease → ceas
Step 7b: More Cleanup
• (m>1 and *d and *L) → single letter: controll → control, roll → roll
Porter Stemmer
• Errors of Omission (related words that are not conflated)
  – European / Europe
  – analysis / analyzes
  – matrices / matrix
  – noise / noisy
  – explain / explanation
• Errors of Commission (unrelated words that are conflated)
  – organization / organ
  – doing / doe
  – generalization / generic
  – numerical / numerous
  – university / universe
Readings for next time • J&M Chapter 6
Assignment 2 Due Date is Midnight 2/25/2004 Text from Assignment here…