Morphological Parsing
CS 4705

Parsing
• Taking a surface input and analyzing its components and underlying structure
• Morphological parsing: taking a word or string of words as input and identifying the stems and affixes (and possibly interpreting these)
  – E.g.:
    • goose → goose +N +SG or goose +V
    • geese → goose +N +PL
    • gooses → goose +V +3SG
  – Bracketing: indecipherable → [in [[de [cipher]] able]]

Why ‘parse’ words?
• To find stems
  – Simple key to word similarity
  – Yellow, yellowish, yellows, yellowed, yellowing…
• To find affixes and the information they convey
  – ‘ed’ signals a verb
  – ‘ish’ an adjective
  – ‘s’?
• Morphological parsing provides information about a word’s semantics and the syntactic role it plays in a sentence

Some Practical Applications
• For spell-checking
  – Is muncheble a legal word?
• To identify a word’s part-of-speech (POS)
  – For sentence parsing, for machine translation, …
• To identify a word’s stem
  – For information retrieval
• Why not just list all word forms in a lexicon?

What do we need to build a morphological parser?
• Lexicon: list of stems and affixes (with corresponding parts of speech)
• Morphotactics of the language: model of how and which morphemes can be affixed to a stem
• Orthographic rules: spelling modifications that may occur when affixation occurs
  – in- → il- in the context of l (in- + legal → illegal)
• Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes

Using FSAs to Represent English Plural Nouns
• English nominal inflection
  [FSA diagram: states q0, q1, q2; arcs labeled reg-n, plural (-s), irreg-pl-n, irreg-sg-n]
• Inputs: cats, geese, goose
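The nominal-inflection FSA can be approximated in a few lines of Python. This is only a sketch: the sub-lexicon contents (apart from cat, goose, geese) and the function name `accepts_noun` are illustrative assumptions, and the automaton is collapsed into set lookups rather than an explicit state table.

```python
# Hypothetical sub-lexicons; only cat, goose and geese come from the slide.
REG_N = {"cat", "dog", "fox"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese", "mice"}


def accepts_noun(word: str) -> bool:
    """Return True if the word is accepted by the nominal-inflection sketch."""
    # Irregular singular and plural forms are accepted as listed.
    if word in IRREG_SG_N or word in IRREG_PL_N:
        return True
    # A bare regular noun is accepted (the reg-n arc).
    if word in REG_N:
        return True
    # A regular noun followed by the plural -s is accepted.
    return word.endswith("s") and word[:-1] in REG_N
```

Note that this recognizer also accepts the misspelling foxs, since it knows nothing about spelling changes; those are handled by separate orthographic rules.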

• Derivational morphology: adjective fragment
  [FSA diagram: states q0 to q5; arcs labeled un-, adj-root1, adj-root2; adj-root1 is followed by -er, -ly, -est; adj-root2 by -er, -est]
• Adj-root1: clear, happi, real (clearly)
• Adj-root2: big, red (*bigly)
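A rough Python rendering of the adjective fragment follows. It assumes, as a reading of the diagram, that un- attaches only to adj-root1 stems; the helper names are hypothetical and no spelling rules are applied (so the stem happi is used as listed).

```python
ADJ_ROOT1 = {"clear", "happi", "real"}
ADJ_ROOT2 = {"big", "red"}

# Suffixes allowed after each root class; "" means the bare stem.
ROOT1_SUFFIXES = ("", "er", "ly", "est")
ROOT2_SUFFIXES = ("", "er", "est")


def _matches(form, roots, suffixes):
    """True if form is exactly some root followed by some allowed suffix."""
    return any(form == root + suffix for root in roots for suffix in suffixes)


def accepts_adjective(word: str) -> bool:
    if _matches(word, ADJ_ROOT1, ROOT1_SUFFIXES):
        return True
    if _matches(word, ADJ_ROOT2, ROOT2_SUFFIXES):
        return True
    # Assumption: the un- prefix is only followed by adj-root1 stems.
    return word.startswith("un") and _matches(word[2:], ADJ_ROOT1, ROOT1_SUFFIXES)
```

Splitting the roots into two classes is what blocks *bigly while still allowing clearly.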

FSAs can also represent the Lexicon
• Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj_root2 = {big, red}) and then expand each of these stems into its letters (e.g. red → r e d) to get a recognizer for adjectives
  [FSA diagram: states q0 to q6 with letter arcs b, i, g, r, e, d and ε]
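Expanding stems into letters amounts to building a letter-level automaton from a word list. The sketch below builds a deterministic one (a trie); the function names and the integer state numbering are assumptions for illustration, not the numbering used in the diagram.

```python
def build_letter_fsa(words):
    """Build a deterministic letter-level FSA (a trie) from a word list."""
    transitions = {}   # (state, letter) -> next state
    finals = set()
    next_state = 1     # state 0 is the start state
    for word in words:
        state = 0
        for letter in word:
            key = (state, letter)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)
    return transitions, finals


def accepts(fsa, word):
    """Follow one arc per letter; accept only if we end in a final state."""
    transitions, finals = fsa
    state = 0
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:
            return False
    return state in finals


# adj_root2 stems with their allowed suffixes, spelled out letter by letter.
ADJ_FSA = build_letter_fsa(
    [root + suffix for root in ("big", "red") for suffix in ("", "er", "est")]
)
```

Because every word form is spelled out, the number of states grows with the lexicon.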

But…
• Covering the whole lexicon this way will require very large FSAs, with consequent search and maintenance problems
  – Adding new items to the lexicon means recomputing the whole FSA
  – Non-determinism
  – Some stems require modification when they acquire affixes
• FSAs tell us whether a word is in the language or not, but usually we want to know more:
  – What is the stem?
  – What are the affixes and what sort are they?
  – We used this information to recognize the word: why can’t we store it?

Parsing with Finite State Transducers
• cats → cat +N +PL (a plural NP)
• Kimmo Koskenniemi’s two-level morphology
  – Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography)
  – Morphological parsing: find the mapping (transduction) between lexical and surface levels
    lexical: c a t +N +PL
    surface: c a t s

Finite State Transducers can represent this mapping
• FSTs map between one set of symbols and another using an FSA whose alphabet is composed of pairs of symbols from input and output alphabets
• In general, FSTs can be used for
  – Translators (Hello:Ciao)
  – Parser/generators (Hello:How may I help you?)
  – As well as Kimmo-style morphological parsing

• An FST is a 5-tuple consisting of
  – Q: a set of states {q0, q1, q2, q3, q4}
  – Σ: an alphabet of complex symbols, each an i/o pair i:o such that i ∈ I (an input alphabet) and o ∈ O (an output alphabet); Σ ⊆ I × O
  – q0: a start state
  – F: a set of final states in Q, {q4}
  – δ(q, i:o): a transition function mapping Q × Σ to Q
  [FST diagram labeled ‘Quizzical Cow’ and ‘Emphatic Sheep’: states q0 to q4; arcs labeled m:b, o:a, o:a, ?:!]
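The cow/sheep transducer can be written down directly from the 5-tuple. The sketch below assumes the arcs form a simple chain q0 → q1 → q2 → q3 → q4 (the arc order is an assumption, since only the labels are given), and the dictionary layout and the name `transduce` are illustrative choices.

```python
# Assumed linear arrangement of the arcs m:b, o:a, o:a, ?:! over q0..q4.
COW_TO_SHEEP = {
    "start": 0,
    "finals": {4},
    # delta maps (state, input symbol) -> (next state, output symbol)
    "delta": {
        (0, "m"): (1, "b"),
        (1, "o"): (2, "a"),
        (2, "o"): (3, "a"),
        (3, "?"): (4, "!"),
    },
}


def transduce(fst, text):
    """Return the output string for text, or None if text is not accepted."""
    state = fst["start"]
    output = []
    for symbol in text:
        arc = fst["delta"].get((state, symbol))
        if arc is None:
            return None
        state, emitted = arc
        output.append(emitted)
    return "".join(output) if state in fst["finals"] else None
```

Under these assumptions, `transduce(COW_TO_SHEEP, "moo?")` yields `"baa!"`, and any input that does not reach the final state yields `None`.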

FST for a 2-level Lexicon
• E.g.
  Reg-n: c a t (cat)
  Irreg-pl-n: g o:e o:e s e
  Irreg-sg-n: g o o s e (goose)
  [FST diagram: states q0 to q7; letter arcs include c:c, a:a, t:t, g, o:e, o:e, s, e]

FST for English Nominal Inflection
  [FST diagram: states q0 to q7]
  – reg-n: q0 → q1, then +N:ε, then +SG:# or +PL:^s#
  – irreg-n-sg: q0 → q2, then +N:ε, then +SG:#
  – irreg-n-pl: q0 → q3, then +N:ε, then +PL:#
  Example: c a t +N +PL (surface: c a t s)
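The lexical/surface relation for nominal inflection can be illustrated with a small finite relation. This is a lookup-based sketch, not a compiled transducer: the lexicon entries (other than cat and goose/geese) and the names `parse` and `generate` are assumptions, and spelling rules are ignored.

```python
REG_N = {"cat", "dog"}
IRREG_N = {"goose": "geese", "mouse": "mice"}  # singular -> plural


def build_relation():
    """List (lexical form, surface form) pairs for the toy lexicon."""
    pairs = []
    for stem in REG_N:
        pairs.append((f"{stem} +N +SG", stem))
        pairs.append((f"{stem} +N +PL", stem + "s"))
    for singular, plural in IRREG_N.items():
        pairs.append((f"{singular} +N +SG", singular))
        pairs.append((f"{singular} +N +PL", plural))
    return pairs


RELATION = build_relation()


def parse(surface):
    """Surface form -> all lexical analyses."""
    return sorted(lex for lex, surf in RELATION if surf == surface)


def generate(lexical):
    """Lexical form -> all surface forms (the same relation, read backwards)."""
    return sorted(surf for lex, surf in RELATION if lex == lexical)
```

Reading the same set of pairs in either direction is the property that lets one transducer serve for both parsing and generation.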

Useful Operations on Transducers
• Cascade: running 2+ FSTs in sequence
• Intersection: represent the common transitions in FST1 and FST2 (ASR: finding pronunciations)
• Composition: apply the FST2 transition function to the result of the FST1 transition function
• Inversion: exchanging the input and output alphabets (recognize and generate with the same FST)
• cf. the AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley
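Inversion and composition can be sketched for very simple transducers. The code below assumes deterministic, epsilon-free FSTs stored as dictionaries; the helpers `chain`, `invert`, `compose` and `transduce` are hypothetical names for this illustration and are not the API of any FSM toolkit.

```python
def chain(inp, out):
    """Build a linear FST that maps the string inp to the equal-length string out."""
    assert len(inp) == len(out)
    delta = {(i, a): (i + 1, b) for i, (a, b) in enumerate(zip(inp, out))}
    return {"start": 0, "finals": {len(inp)}, "delta": delta}


def transduce(fst, text):
    """Return the output for text, or None if text is not accepted."""
    state, output = fst["start"], []
    for symbol in text:
        arc = fst["delta"].get((state, symbol))
        if arc is None:
            return None
        state, emitted = arc
        output.append(emitted)
    return "".join(output) if state in fst["finals"] else None


def invert(fst):
    """Swap input and output labels (assumes the result stays deterministic)."""
    delta = {(q, o): (r, i) for (q, i), (r, o) in fst["delta"].items()}
    return {"start": fst["start"], "finals": set(fst["finals"]), "delta": delta}


def compose(f1, f2):
    """Product construction: f1's output symbol must match f2's input symbol."""
    delta = {}
    for (q1, a), (r1, b) in f1["delta"].items():
        for (q2, b2), (r2, c) in f2["delta"].items():
            if b == b2:
                delta[((q1, q2), a)] = ((r1, r2), c)
    finals = {(x, y) for x in f1["finals"] for y in f2["finals"]}
    return {"start": (f1["start"], f2["start"]), "finals": finals, "delta": delta}


COW_TO_SHEEP = chain("moo", "baa")
SHOUT = chain("baa", "BAA")
```

Running the two machines one after the other (a cascade) and running the single composed machine give the same output here; composition just builds that machine ahead of time.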

Orthographic Rules and FSTs
• Define additional FSTs to implement rules such as consonant doubling (beg → begging), ‘e’ deletion (make → making), ‘e’ insertion (watch → watches), etc.
  Lexical: f o x +N +PL
  Intermediate: f o x ^ s #
  Surface: f o x e s
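The ‘e’ insertion rule can be approximated with a regular-expression rewrite from the intermediate level to the surface level. This sketch assumes ^ marks a morpheme boundary and # the end of the word, and that insertion is triggered after x, s, z, ch or sh; the function name is hypothetical.

```python
import re


def e_insertion(intermediate: str) -> str:
    """Map an intermediate form such as 'fox^s#' to its surface spelling."""
    # Insert e after x, s, z, ch or sh when the boundary is followed by s#.
    surface = re.sub(r"(ch|sh|[xsz])\^(?=s#)", r"\1e", intermediate)
    # Drop any remaining boundary and end-of-word markers.
    return surface.replace("^", "").replace("#", "")
```

Forms that do not match the context, such as cat^s#, pass through unchanged apart from marker removal.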

Porter Stemmer (1980)
• Used for tasks in which you only care about the stem
  – IR, modeling given/new distinction, topic detection, document similarity
• Lexicon-free morphological analysis
• Cascades rewrite rules (e.g. misunderstanding --> misunderstand --> …)
• Easily implemented as an FST with rules, e.g.
  – ATIONAL → ATE
  – ING → ε
• Not perfect…
  – Doing → doe
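A cascade of suffix rewrite rules can be sketched as follows. This is a toy with only the two rules named above and is not the Porter algorithm itself, which has many more rules and conditions on when each may apply; the names `RULES` and `toy_stem` are assumptions.

```python
# (suffix, replacement) pairs, applied in order; "" plays the role of ε.
RULES = [
    ("ational", "ate"),
    ("ing", ""),
]


def toy_stem(word: str) -> str:
    """Apply each rewrite rule in turn to the end of the word."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            word = word[: -len(suffix)] + replacement
    return word
```

With no lexicon and no conditions on the remaining stem, the toy over-stems short words: sing is reduced to s.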

• Policy → police
• Does stemming help?
  – IR, little
  – Topic detection, more

Summing Up
• FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology
• But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
• Next time:
  – Read Ch 5: 1-8
• HW 1 assigned (read the assignment)