Morphology
See:
• Harald Trost, “Morphology”, Chapter 2 of R Mitkov (ed.), The Oxford Handbook of Computational Linguistics, Oxford (2004): OUP
• D Jurafsky & JH Martin, Speech and Language Processing, Upper Saddle River NJ (2000): Prentice Hall, Chapter 3 [quite technical]

Morphology - reminder
• Internal analysis of word forms
• morpheme
  – allomorphic variation
• Words usually consist of a root plus affix(es), though some words can have multiple roots, and some can be single morphemes
• lexeme
  – abstract notion of group of word forms that ‘belong’ together
  – lexeme ~ root ~ stem ~ base form ~ dictionary (citation) form

Role of morphology
• Commonly made distinction: inflectional vs derivational
• Inflectional morphology is grammatical
  – number, tense, case, gender
• Derivational morphology concerns word building
  – part-of-speech derivation
  – words with related meaning

Inflectional morphology
• Grammatical in nature
• Does not carry meaning, other than grammatical meaning
• Highly systematic, though there may be irregularities and exceptions
  – Simplifies lexicon, only exceptions need to be listed
  – Unknown words may be guessable
• Language-specific and sometimes idiosyncratic
• (Mostly) helpful in parsing

Derivational morphology
• Lexical in nature
• Can carry meaning
• Fairly systematic, and predictable up to a point
  – Simplifies description of lexicon: regularly derived words need not be listed
  – Unknown words may be guessable
• But …
  – Apparent derivations have specialised meaning
  – Some derivations missing
• Languages often have parallel derivations which may be translatable

Morphological processes
• Affixes: prefix, suffix, infix, circumfix
• Vowel change (umlaut, ablaut)
• Gemination, (partial) reduplication
• Root and pattern
• Stress (or tone) change
• Sandhi

Morphophonemics
• Morphemes and allomorphs
  – eg {plur}: +(e)s, vowel change, y → ies, f → ves, um → a, ...
• Morphophonemic variation
  – Affixes and stems may have variants which are conditioned by context
    • eg +ing in lifting, swimming, boxing, raining, hopping
  – Rules may be generalisable across morphemes
    • eg +(e)s in cats, boxes, tomatoes, matches, dishes, buses
    • Applies to both {plur} (nouns) and {3rd sing pres} (verbs)
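The +(e)s variation listed above can be sketched as a small spelling rule. This is only an orthographic approximation, not a phonological analysis: the function name `add_s` and the short list of o-final stems are assumptions made for illustration, and words such as "stomach" would be handled wrongly.

```python
import re

# Stems ending in a sibilant spelling take +es (boxes, matches, dishes, buses).
SIBILANT = re.compile(r"(s|x|z|ch|sh)$")
# A final y preceded by a consonant letter becomes ies (spy -> spies, but toy -> toys).
CONSONANT_Y = re.compile(r"[^aeiou]y$")
# Assumption: o-final stems that take +es must be listed, since "pianos" etc. do not.
O_STEMS_TAKING_ES = {"tomato", "potato"}


def add_s(stem):
    """Attach the +(e)s morpheme ({plur} or {3rd sing pres}) to a stem."""
    if SIBILANT.search(stem) or stem in O_STEMS_TAKING_ES:
        return stem + "es"
    if CONSONANT_Y.search(stem):
        return stem[:-1] + "ies"
    return stem + "s"


for word in ["cat", "box", "tomato", "match", "dish", "bus", "spy", "toy"]:
    print(word, "->", add_s(word))
```

The same function serves for nouns and verbs, which is the point of the generalisation: the rule is stated once for the morpheme shape, not once per grammatical category.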

Morphology in NLP
• Analysis vs synthesis
  – what does dogs mean? vs what is the plural of dog?
• Analysis
  – Need to identify lexeme
    • Tokenization
    • To access lexical information
  – Inflections (etc) carry information that will be needed by other processes (eg agreement is useful in parsing; inflections can carry meaning, eg tense, number)
  – Morphology can be ambiguous
    • May need other process to disambiguate (eg German –en)
• Synthesis
  – Need to generate appropriate inflections from underlying representation

Morphology in NLP
• String-handling programs can be written
• More general approach
  – Formalism to write rules which express correspondence between surface and underlying form (eg dogs = dog +{plur})
  – Computational algorithm (program) which can apply those rules to actual instances
  – Especially of interest if the rules (though not the program) are independent of direction: analysis or synthesis

Role of lexicon in morphology
• Rules interact with the lexicon
  – Obviously category information
    • eg rules that apply to nouns
  – Note also morphology-related subcategories
    • eg “er” verbs in French, rules for gender agreement
  – Other lexical information can impact on morphology
    • eg all fish have two forms of the plural (+s and ∅)
    • eg in Slavic languages case inflections differ for inanimate and animate nouns

Problems with rules
• Exceptions have to be covered
  – Including systematic irregularities
  – May be a trade-off between treating something as a small group of irregularities or as a list of unrelated exceptions (eg French irregular verbs, English f → ves)
• Rules must not over/under-generate
  – Must cover all and only the correct cases
  – May depend on what order the rules are applied in

Tokenization
• The simplest form of analysis is to reduce different word forms into tokens
• Also called “normalization”
• For example, if you want to count how many times a given ‘word’ occurs in a text
• Or you want to search for texts containing certain ‘words’ (e.g. Google)
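As a minimal sketch of the counting use case, the snippet below normalizes only by lower-casing and stripping punctuation. The function name `count_words` and the sample text are assumptions for illustration.

```python
import re
from collections import Counter


def count_words(text):
    """Lower-case the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


counts = count_words("The dog saw the Dog. Dogs bark.")
print(counts["the"], counts["dog"], counts["dogs"])
```

Note that "dog" and "dogs" are still counted separately here: case folding alone does not relate inflected forms, which is where morphological analysis comes in.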

Morphological processing
• Stemming
• String-handling approaches
  – Regular expressions
  – Mapping onto finite-state automata
• 2-level morphology
  – Mapping between surface form and lexical representation

Stemming
• Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem
• (Recall our discussion of stem ~ base form ~ dictionary form ~ citation form)
• Stemming algorithms are basic string-handling algorithms, which depend on rules which identify affixes that can be stripped
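A suffix-stripping stemmer of the kind described can be sketched in a few lines. The rule list, the `min_stem` threshold and the function name `stem` are illustrative assumptions, not a published algorithm; the first matching rule wins.

```python
# Ordered (suffix, replacement) rules; earlier rules take priority.
RULES = [("ies", "y"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("s", "")]


def stem(word, min_stem=2):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["dogs", "spies", "lifting", "swimming", "sing"]:
    print(w, "->", stem(w))
```

The output for "swimming" is "swimm", not "swim": pure affix stripping knows nothing about gemination, which illustrates why rules of this kind both over- and under-generate.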

Finite state automata
• A finite state automaton is a simple and intuitive formalism with straightforward computational properties (so easy to implement)
• A bit like a flow chart, but can be used for both recognition (analysis) and generation
• FSAs have a close relationship with “regular expressions”, a formalism for describing sets of strings, mainly used for searching texts, or stipulating patterns of strings

Finite state automata
• A bit like a flow chart, but can be used for both recognition and generation
• “Transition network”
• Unique start point
• Series of states linked by transitions
• Transitions represent input to be accounted for, or output to be generated
• Legal exit-point(s) explicitly identified

Example (Jurafsky & Martin, Figure 2.10)
[Figure: FSA with states q0 to q4 and arcs labelled b, a, a, a, !]
• Loop on q3 means that it can account for strings of unbounded length
• “Deterministic” because in any state, its behaviour is fully predictable
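A deterministic FSA can be run from a simple transition table. The sketch below assumes the figure is the "sheep talk" automaton from Jurafsky & Martin, which accepts b followed by two or more a's and a final "!"; the table layout and the function name `accepts` are illustrative choices.

```python
# (state, input symbol) -> next state; at most one entry per pair, so deterministic.
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",  # the loop on q3
    ("q3", "!"): "q4",
}
FINAL_STATES = {"q4"}


def accepts(string, start="q0"):
    """Return True if the automaton ends in a final state after reading string."""
    state = start
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no legal transition: reject
            return False
    return state in FINAL_STATES


for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, accepts(s))
```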

Non-deterministic FSA (Jurafsky & Martin, Figures 2.18 and 2.19)
[Figures: FSAs with states q0 to q4 and arcs labelled b, a, a, a, !; the second adds an ε arc]
• At state q2 with input “a” there is a choice of transitions
• We can also have “jump” arcs (or empty transitions), which also introduce non-determinism
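Both kinds of non-determinism can be handled by tracking a set of possible states instead of a single one. The two machines below are assumptions about what the figures show (one with two "a" arcs leaving q2, one with an ε arc from q3 back to q2); the function names are hypothetical.

```python
def eps_closure(states, eps):
    """All states reachable from `states` using only empty (jump) arcs."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(string, delta, eps, start="q0", final=frozenset({"q4"})):
    """Simulate an NFA by following every possible transition in parallel."""
    current = eps_closure({start}, eps)
    for symbol in string:
        reached = set()
        for q in current:
            reached |= delta.get((q, symbol), set())
        current = eps_closure(reached, eps)
        if not current:
            return False
    return bool(current & final)


# Assumed machine 1: a choice of transitions at q2 on input "a".
DELTA_CHOICE = {("q0", "b"): {"q1"}, ("q1", "a"): {"q2"},
                ("q2", "a"): {"q2", "q3"}, ("q3", "!"): {"q4"}}
# Assumed machine 2: a jump (ε) arc from q3 back to q2.
DELTA_JUMP = {("q0", "b"): {"q1"}, ("q1", "a"): {"q2"},
              ("q2", "a"): {"q3"}, ("q3", "!"): {"q4"}}
EPS_JUMP = {"q3": {"q2"}}

print(accepts("baaa!", DELTA_CHOICE, {}))
print(accepts("baaa!", DELTA_JUMP, EPS_JUMP))
```

Tracking state sets avoids backtracking; under these assumptions both machines accept the same strings as the deterministic version.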

An FSA to handle morphology
[Figure: FSA with states q0 to q7 and arcs labelled c, o, f, x, r, e, s, i, y]
• Spot the deliberate mistake: overgeneration

Finite State Transducers
• A “transducer” defines a relationship (a mapping) between two things
• Typically used for “two-level morphology”, but can be used for other things
• Like an FSA, but each state transition stipulates a pair of symbols, and thus a mapping

Finite State Transducers
• Three functions:
  – Recognizer (verification): takes a pair of strings and verifies if the FST is able to map them onto each other
  – Generator (synthesis): can generate a legal pair of strings
  – Translator (transduction): given one string, can generate the corresponding string
• Mapping usually between levels of representation
  – spy+s : spies
  – Lexical:intermediate foxNPs : fox^s
  – Intermediate:surface fox^s : foxes

Some conventions
• Transitions are marked by “:”
• A non-changing transition “x:x” can be shown simply as “x”
• Wild-cards are shown as “@”
• Empty string shown as “ε”

An example (based on Trost p. 42)
[Figure: FST diagram. Recoverable arc labels: #:ε, s, p, y:i, +:e, s, #:ε; t, o; h, w, y, e, i, l, f:v, +:0]
• #spy+s# : spies
• #toy+s# : toys
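The recognizer, generator and translator roles of an FST can be illustrated with a toy transducer covering just the two pairs #spy+s# : spies and #toy+s# : toys. The state numbering, the arc table and the function names are assumptions for this sketch; the empty string stands for ε, and the network is acyclic so the search always terminates.

```python
# state -> list of (lexical symbol, surface symbol, next state); "" is the empty string.
ARCS = {
    0: [("#", "", 1)],
    1: [("s", "s", 2), ("t", "t", 8)],
    2: [("p", "p", 3)],
    3: [("y", "i", 4)],   # y:i before +:e
    4: [("+", "e", 5)],
    5: [("s", "s", 6)],
    6: [("#", "", 7)],
    8: [("o", "o", 9)],
    9: [("y", "y", 10)],
    10: [("+", "", 5)],   # morpheme boundary deleted
}
START, FINAL = 0, {7}


def transduce(string, invert=False):
    """Translator: map lexical -> surface, or surface -> lexical if invert."""
    results, stack = set(), [(START, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if state in FINAL and pos == len(string):
            results.add(out)
        for lex, surf, nxt in ARCS.get(state, []):
            src, dst = (surf, lex) if invert else (lex, surf)
            if src == "":
                stack.append((nxt, pos, out + dst))
            elif string.startswith(src, pos):
                stack.append((nxt, pos + len(src), out + dst))
    return results


def recognize(lexical, surface):
    """Recognizer: does the FST relate this pair of strings?"""
    return surface in transduce(lexical)


def generate(state=START, lex="", surf=""):
    """Generator: enumerate every legal (lexical, surface) pair."""
    if state in FINAL:
        yield lex, surf
    for l, s, nxt in ARCS.get(state, []):
        yield from generate(nxt, lex + l, surf + s)


print(transduce("#spy+s#"), transduce("spies", invert=True))
print(recognize("#toy+s#", "toys"), sorted(generate()))
```

Because the arcs are stored as symbol pairs, the same table serves analysis and synthesis; only the direction in which each pair is read changes.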

Using wild cards and loops
• #:0 s p y:i +:e s #:0
• #:0 t o y +:0 s #:0
Can be collapsed into a single FST:
[Figure: FST with a wild-card loop @ and arcs #:0, y:i, y, +:e, +:0, s, #:0]

Another example (J&M Fig. 3.9, p. 74)
[Figure: lexical:intermediate FST with states q0 to q7. Recoverable arc labels: “fox, cat, dog”; “sheep”; “g o:e o:e s e”; “m o:i u:ε s:c e”; N:ε; S:#; P:#; P:^ s #]

[Figure: the single arc labelled “fox, cat, dog” from q0 to q1 expanded letter by letter (f-o-x, c-a-t, d-o-g) through intermediate states s1 to s6]

• foxNPs# : fox^s#
  [0] f:f o:o x:x [1] N:ε [4] P:^ s:s #:# [7]
• foxNS : fox#
  [0] f:f o:o x:x [1] N:ε [4] S:# [7]
• catNPs# : cat^s#
  [0] c:c a:a t:t [1] N:ε [4] P:^ s:s #:# [7]
• sheepNS : sheep#
  [0] s:s h:h e:e e:e p:p [2] N:ε [5] S:# [7]
• gooseNP : geese#
  [0] g:g o:e o:e s:s e:e [3] N:ε [5] P:# [7]
[Figure: the lexical:intermediate FST of J&M Fig. 3.9, repeated]

Lexical:surface mapping (J&M Fig. 3.14, p. 78)
• foxNPs# : fox^s#
• catNPs# : cat^s#
• Rule: ε → e / {x s z} ^ __ s #
[Figure: FST for the rule, with states q0 to q5 and arcs labelled ^:ε, ε:e, s, #, “z, s, x” and “other”]

• fox^s# : foxes#
  [0] f:f [0] o:o [0] x:x [1] ^:ε [2] ε:e [3] s:s [4] #:# [0]
• cat^s# : cats#
  [0] c:c [0] a:a [0] t:t [0] ^:ε [0] s:s [0] #:# [0]
[Figure: the FST of J&M Fig. 3.14, repeated]
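The effect of the e-insertion rule (ε → e after x, s or z, at a morpheme boundary ^ before s #) can be checked with a regular-expression substitution. This swaps the transducer for a regex rewrite purely for illustration; the function name `intermediate_to_surface` is an assumption.

```python
import re


def intermediate_to_surface(form):
    """Apply e-insertion, then delete any remaining morpheme boundaries."""
    # Replace ^ by e only when preceded by x, s or z and followed by "s#".
    form = re.sub(r"(?<=[xsz])\^(?=s#)", "e", form)
    # Elsewhere the boundary symbol maps to the empty string (^:ε).
    return form.replace("^", "")


for form in ["fox^s#", "cat^s#"]:
    print(form, ":", intermediate_to_surface(form))
```

Unlike the FST, this version only works in the synthesis direction; recovering fox^s# from foxes# would need a separate analysis rule.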

FST
• But you don’t have to draw all these FSTs
• They map neatly onto rule formalisms
• What is more, these can be generated automatically
• Therefore, slightly different formalism

FST compiler
http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html

Input:
  [d o g N P .x. d o g s ] | [c a t N P .x. c a t s ] | [f o x N P .x. f o x e s ] | [g o o s e N P .x. g e e s e]

Compiled network:
  s0: c -> s1, d -> s2, f -> s3, g -> s4.
  s1: a -> s5.
  s2: o -> s6.
  s3: o -> s7.
  s4: <o:e> -> s8.
  s5: t -> s9.
  s6: g -> s9.
  s7: x -> s10.
  s8: <o:e> -> s11.
  s9: <N:s> -> s12.
  s10: <N:e> -> s13.
  s11: s -> s14.
  s12: <P:0> -> fs15.
  s13: <P:s> -> fs15.
  s14: e -> s16.
  fs15: (no arcs)
  s16: <N:0> -> s12.

[Figure: start of the compiled network, with arcs c, d, f, g leading from s0 to s1, s2, s3, s4]
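To see that the compiled network really maps lexical strings onto surface forms, its state listing can be transcribed into a table and run. The dictionary layout and the function name `lexical_to_surface` are assumptions for this sketch and are not part of the compiler; 0 in the listing is rendered as the empty string.

```python
# state -> {lexical symbol: (surface output, next state)}, transcribed from the listing.
NETWORK = {
    "s0": {"c": ("c", "s1"), "d": ("d", "s2"), "f": ("f", "s3"), "g": ("g", "s4")},
    "s1": {"a": ("a", "s5")},
    "s2": {"o": ("o", "s6")},
    "s3": {"o": ("o", "s7")},
    "s4": {"o": ("e", "s8")},
    "s5": {"t": ("t", "s9")},
    "s6": {"g": ("g", "s9")},
    "s7": {"x": ("x", "s10")},
    "s8": {"o": ("e", "s11")},
    "s9": {"N": ("s", "s12")},
    "s10": {"N": ("e", "s13")},
    "s11": {"s": ("s", "s14")},
    "s12": {"P": ("", "fs15")},
    "s13": {"P": ("s", "fs15")},
    "s14": {"e": ("e", "s16")},
    "s16": {"N": ("", "s12")},
    "fs15": {},  # final state, no arcs
}
FINAL = {"fs15"}


def lexical_to_surface(lexical, start="s0"):
    """Follow the lexical side of each arc and collect the surface side."""
    state, output = start, []
    for symbol in lexical:
        arc = NETWORK.get(state, {}).get(symbol)
        if arc is None:  # string not in the lexicon
            return None
        out, state = arc
        output.append(out)
    return "".join(output) if state in FINAL else None


for word in ["dogNP", "catNP", "foxNP", "gooseNP"]:
    print(word, "->", lexical_to_surface(word))
```

Note how the compiler has shifted the output symbols along the path (N:s then P:0 for dog, N:e then P:s for fox); the pairing of individual symbols is arbitrary as long as the whole strings correspond.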