Finite State Morphology Alexander Fraser Luisa Berlanda frasercis
Finite State Morphology Alexander Fraser & Luisa Berlanda fraser@cis. uni-muenchen. de CIS, Ludwig-Maximilians-Universität München Computational Morphology and Electronic Dictionaries So. Se 2016 -25 -05
SFST • programming language for developing finite-state transducers • compiler which translates programs to transducers • tools for – applying transducers – printing transducers – comparing transducers
SFST Example Session > echo "Hello World!" > test. fst storing a small test program > fst-compiler test. fst test. a calling the compiler test. fst: 2 > fst-mor test. a reading transducer. . . finished. analyze> Hello World! analyze> Hello World no result for Hello World analyze> q interactive transducer usage transducer is loaded input recognised another input not recognised terminate program
Transducer Variables $Vroot$ = walk | talk | bark $Vinfl$ = <V>: <> ( [<inf><n 3 s>]: <> | {<3 s>}: {s} | {<ger>}: {ing} | {<past>}: {ed}) % list of verbs with regular inflection % regular verbal inflection $Nroot$ = hat | head | trick $Ninfl$ = <N>: <> ( {<sg>}: {} | {<pl>}: {s}) % list of nouns with regular inflection % regular nominal inflection $Vroot$ $Vinfl$ | $Nroot$ $Ninfl$ % combine stems and inflectional endings
Homework • Write a pipeline that maps all letters to lowercase and orders them backwards • Family Huber has three children. Their first child is called Mia, the next one Toni and the last one Pia. • Family Band has three children as well. Michael, Paul and Pia. • Write a program, that can tell us the following details about the children: which family does he/she belong to, is it a son or a daugter, was he/she the first, second or third child. • Output format: • <family><family name><first, second child><gender>
Solution • Write a pipeline that maps all letters to lowercase and orders them backwards • [a-z]: [A-Z]* || [ZYXWVUTSRQPONMLKJIHFECBA]: [A-Z]*
Solution {<family><huber>}: {} ({<1><daughter>}: {mia} |{<1><son>}: {toni} |{<2><daughter>}: {pia}) | {<family><band>}: {} ({<1><son>}: {michael} |{<2><son>}: {paul} |{<1><daughter>}: {pia})
Lexicon Files $Vroot$ = “verb. lex“ $Nroot$ = “noun. lex“ The command “filename“ reads the respective file line by line and forms the disjunction of all lines. Only the symbols : < > % are treated as operators.
Lexicon Files $Vroot$ = “verb. lex“ $Vinfl$ = <V>: <> ( [<inf><n 3 s>]: <> | {<3 s>}: {s} | {<ger>}: {ing} | {<past>}: {ed}) $Nroot$ = “noun. lex“ $Ninfl$ = <N>: <> ( {<sg>}: {} | {<pl>}: {s}) $Vroot$ $Vinfl$ | $Nroot$ $Ninfl$
Symbol Set Variables #cons# = bcdfghjklmnpqrstvwxzß #CONS# = BCDFGHJKLMNPQRSTVWXZß #Cons# = #cons# #CONS# #vowel# = aeiouäöü #VOWEL# = AEIOUÄÖÜ #Vowel# = #vowel# #VOWEL# #letter# = #vowel# #cons# #LETTER# = #VOWEL# #CONS# [#LETTER# #letter#]: [#letter# #LETTER#]* What would you get for Hallo and Ruß?
Solution • What would you get for Hallo and Ruß? • h. ALLO • r. Uß
Alphabet The alphabet defines the set of available symbol pairs which is relevant for the wildcard symbol „. “, the negation operator „!“ and the replacement operators (introduced later). ALPHABET = [A-Z]: [a-z] The expression on the right-hand side is compiled into a transducer and the set of character pairs is extracted from its transitions. A: . is here identical to A: [Aa]. is identical to. : . [^A-Z] all characters appearing in the alphabet which are not uppercase letters, i. e. the set of lowercase letters. . * maps mixed letter sequences to all uppercase letter sequences (analysis)
Alphabet fullform. lex: house<N><sg> house<>: s<N><pl> walk<V><inf> walk<>: i<>: n<>: g<V><ger> emorph. fst: ALPHABET = [a-z. A-Z] [<V><N><sg><pl><ger><inf>]: <> “fullform. lex“ ||. * reads the lexicon and deletes the grammatical markers on the surface side.
Orthographic Rules Replace operator: t ^─> (l _ r) applies the mapping implemented in the transducer t in the left context l and right context r. l and r are automata (i. e. transducers mapping strings to themselves. ) e-elision: bake<V>ing → baking $Morph$ = bake<V>{<ger>}: {ing} ALPHABET = [A-Za-z] <V> $e-elision$ = e: <> ^─> (_ _<V> [ei] ) $Morph$ = $Morph$ || $e-elision$ ALPHABET = [A-Za-z] <V>: <> $Morph$ ||. * % delete e before <V>e or <V>i % apply the rule % delete the <V> marker
Orthographic Rules What does this program? $Morph$ = bake<V>{<ger>}: {ing} | crash<V>{<3><sg>}: s | happy<ADJ>{<comp>}: {er} | fly<V>{<3><sg>}: s ALPHABET = [A-Za-z] <V><N><ADJ> $e-elision$ = e: <> ^-> (__<V> [ei]) $e-epenthesis$ = (<V> <>: e) ^-> ([sh]__ s) $y 2 i$ = y: i ^-> ([^ae] __[<ADJ><V>] e) $y 2 ie$ = y: {ie} ^-> ([^ae] __ [<V><N>] s) $Morph$ = $Morph$ || $e-elision$ || $e-epenthesis$ || $y 2 ie$ ALPHABET = [A-Za-z] [<V><N><ADJ>]: <> $Morph$ ||. *
Solution • e-epenthesis: crash<V>s → crashes • y to i: happy<ADJ>er→ happier fly<V>s → flies • only in the analyze mode!
Agreement Variables $Morph$ = big<ADJ>{<comp>}: {er} | fat<ADJ>{<comp>}: {er} $cons$ = [bcdfghjklmnpqrstvwxz] $vowel$ = [aeiouy] #=g# = bdglmnpt $g$ = [#=g#] <>: [#=g#] ALPHABET = [A-Za-z] <V><N><ADJ> $gemination$ = $g$ ^-> ($cons$ $vowel$ __<ADJ> e) $Morph$ = $Morph$ || $gemination$ ALPHABET = [A-Za-z] [<V><N><ADJ>]: <> $Morph$ ||. *
Solution • analyze> bigger big<ADJ><comp> • analyze> fater no result for fater • analyze> fatter fat<ADJ><comp>
• Thank you for your attention
- Slides: 19