Morphological Recognition • We take each sub-lexicon of each stem class and expand each arc (e.g. the reg-noun arc) with all the morphemes that make up the set of stems in the reg-noun word class. • This creates an FSA that can be used for morphological recognition.
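The expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stems, class names, and state numbers are hypothetical, and the recognizer matches whole morphemes instead of building one state per letter as a real expanded FSA would.

```python
# Hypothetical toy sub-lexicons; the stems are illustrative only.
CLASSES = {
    "reg-noun": {"cat", "dog", "fox"},
    "irreg-sg-noun": {"goose", "mouse"},
    "irreg-pl-noun": {"geese", "mice"},
    "plural": {"s"},
}

# Morphotactic FSA over word classes: (state, class) -> next state.
ARCS = {
    (0, "reg-noun"): 1,
    (0, "irreg-sg-noun"): 2,
    (0, "irreg-pl-noun"): 2,
    (1, "plural"): 2,
}
FINAL = {1, 2}


def recognize(word, state=0):
    """Return True if `word` can be split into morphemes along the FSA arcs."""
    if not word:
        return state in FINAL
    for (src, cls), dst in ARCS.items():
        if src != state:
            continue
        # "Expanding" the class arc: try every morpheme listed for that class.
        for morpheme in CLASSES[cls]:
            if word.startswith(morpheme) and recognize(word[len(morpheme):], dst):
                return True
    return False
```

With this toy lexicon, recognize("cats") and recognize("geese") succeed while recognize("gooses") fails. Because no spelling rules are involved yet, "foxs" is accepted and "foxes" is not.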
Two-level Morphology • Ideally, for morphological parsing we would like to input a word and get as output its stem together with morphological information, e.g. cats -> cat + N + PL • Two-level morphology represents a word as the correspondence between the lexical level and the surface level.
Finite State Transducer (FST) • An FST is an automaton that we use to perform the mapping between the two levels. • An FST is an automaton with two tapes that recognizes or generates pairs of strings; it therefore defines a relation between strings. • Another view of an FST is as a machine that reads one string and generates another string.
Formal FST definition • Extension of the FSA definition – Q: a finite set of states (q0, q1, q2, …) – Σ: a finite alphabet of complex symbols i:o, where i is a symbol from the input alphabet and o a symbol from the output alphabet (ε may be part of both the input and output alphabets) – q0: the start state – F: the set of final states (a subset of Q) – δ(q, i:o): the transition function from states and complex symbols to states. Given a state q and a complex symbol i:o, it returns a new state q'. • e.g. Σ = {a:a, b:b, !:!, a:b, b:a, a:ε, ε:!}
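The five components of the definition map directly onto a small data structure. The sketch below is one possible encoding, not a standard API: the class name, field names, and the toy machine are all assumptions made for illustration, and ε is written as the empty string.

```python
from dataclasses import dataclass

EPS = ""  # assumption: epsilon is encoded as the empty string


@dataclass
class FST:
    states: set      # Q
    alphabet: set    # Sigma: a set of (i, o) pairs
    start: int       # q0
    finals: set      # F, a subset of Q
    delta: dict      # (state, (i, o)) -> next state

    def accepts(self, pairs):
        """Check whether a sequence of i:o pairs leads from q0 to a final state."""
        q = self.start
        for pair in pairs:
            q = self.delta.get((q, pair))
            if q is None:  # no transition defined for this pair
                return False
        return q in self.finals


# Hypothetical toy machine: rewrites every a as b, then copies a final "!".
toy = FST(
    states={0, 1},
    alphabet={("a", "b"), ("!", "!")},
    start=0,
    finals={1},
    delta={(0, ("a", "b")): 0, (0, ("!", "!")): 1},
)

pairs = [("a", "b"), ("a", "b"), ("!", "!")]
upper = "".join(i for i, _ in pairs)  # input-side string
lower = "".join(o for _, o in pairs)  # output-side string
```

Here toy.accepts(pairs) holds, and the accepted pair sequence relates the string "aa!" to the string "bb!".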
Useful FST Properties • Inversion: The inversion of a transducer simply switches the input and output labels of the transducer (the two tapes). It is therefore very easy to turn an FST from a parser into a generator. • Composition: Given two FSTs, T1 that maps from I to C and T2 that maps from C to O, their composition is a new transducer T1 ∘ T2 that maps from I to O. Therefore, if we have a number of FSTs that run serially, it is possible to build a single new FST that maps from the initial input to the final output.
Finite State Transducers • It is convenient to view an FST as having two tapes. – The upper or lexical tape – The lower or surface tape • Each symbol a:b in the FST alphabet expresses how a symbol from one tape is mapped to a symbol on the other tape. • Symbols such as a:a are called default pairs and are represented simply as a.
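Both properties can be illustrated on a small arc-list representation. The following is a minimal sketch that assumes transducers without ε arcs (handling ε in composition needs extra care); the dictionary layout and the function names invert, compose, and transduce are hypothetical choices for this example.

```python
def invert(t):
    """Swap the input and output label on every arc."""
    return {
        "start": t["start"],
        "finals": set(t["finals"]),
        "arcs": [(src, o, i, dst) for (src, i, o, dst) in t["arcs"]],
    }


def compose(t1, t2):
    """Product construction: an arc a:b in t1 joins an arc b:c in t2 to give a:c."""
    arcs = [
        ((p, q), a, c, (p2, q2))
        for (p, a, b, p2) in t1["arcs"]
        for (q, b2, c, q2) in t2["arcs"]
        if b == b2
    ]
    finals = {(f1, f2) for f1 in t1["finals"] for f2 in t2["finals"]}
    return {"start": (t1["start"], t2["start"]), "finals": finals, "arcs": arcs}


def transduce(t, s):
    """Return the set of all output strings that t relates to input s."""
    results = set()

    def walk(state, pos, out):
        if pos == len(s):
            if state in t["finals"]:
                results.add(out)
            return
        for (src, i, o, dst) in t["arcs"]:
            if src == state and i == s[pos]:
                walk(dst, pos + 1, out + o)

    walk(t["start"], 0, "")
    return results


# Hypothetical toy transducers: T1 maps a and b to b, T2 maps b to c.
T1 = {"start": 0, "finals": {0}, "arcs": [(0, "a", "b", 0), (0, "b", "b", 0)]}
T2 = {"start": 0, "finals": {0}, "arcs": [(0, "b", "c", 0)]}
```

With these machines, transduce(compose(T1, T2), "ab") gives {"cc"} in one pass, the same result as running T1 and then T2. Inverting T1 shows why a transducer defines a relation and not necessarily a function: transduce(invert(T1), "bb") returns four strings, "aa", "ab", "ba", and "bb".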
FST Morphotactics FST for English plural formation. ^ marks a morpheme boundary and # a word boundary.
FST Lexicon
Combining FST Lexicon and Morphotactics • The two FSTs for lexicon and morphotactics can be cascaded, i.e. the input is run through the lexicon FST and then the output is run through the morphotactics FST. • Based on the composition property, it is possible to compose these two FSTs into a single FST that maps directly from the lexical to the surface level (without any reference to word classes).
Orthographic Rules • The previous FST will accept the word foxs and reject the word foxes. • We need a way to deal with the spelling changes that often take place at morpheme boundaries. This is done by introducing orthographic rules. E.g. for English – e is inserted after -s, -z, -x, -ch, -sh before -s. – -y becomes -ie before -s. • Formal rule notation: a -> b / c __ d means “rewrite a as b when it occurs between c and d”. – ε -> e / {x, s, z}^ __ s#
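To make the rule notation concrete, the e-insertion rule can be read as a single regular-expression rewrite over the intermediate string. This is a plain regex sketch, not a transducer; the function names are hypothetical, and it covers only the {x, s, z} contexts written in the formal rule, not -ch or -sh.

```python
import re


def e_insertion(intermediate):
    """Apply eps -> e / {x, s, z}^ __ s# to a string such as 'fox^s#'."""
    # Match x, s or z followed by the morpheme boundary, but only when the
    # remainder is exactly 's#'; then re-emit the match with an added e.
    return re.sub(r"([xsz]\^)(?=s#)", r"\1e", intermediate)


def to_surface(intermediate):
    """Insert e where the rule applies, then drop the boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")
```

For example, to_surface("fox^s#") yields "foxes", while to_surface("cat^s#") yields "cats" because the left context of the rule is not satisfied.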
Orthographic Rules and FST • The spelling rule can be seen as taking a simple concatenation of morphemes (intermediate level) and producing the surface form of the word.
Orthographic Rules and FST • The previous orthographic rule can be represented as an FST.
Orthographic Rules and FST • Transition table for the previous FST (a colon after a state name marks an accepting state; - means no transition).
State/Input  s:s  x:x  z:z  ^:ε  ε:e  #  other
q0:          1    1    1    0    -    0  0
q1:          1    1    1    2    -    0  0
q2:          5    1    1    0    3    0  0
q3           4    -    -    -    -    -  -
q4           -    -    -    -    -    0  -
q5           1    1    1    2    -    -  0
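The transition table can be run directly as a small nondeterministic search, shown here in the generation direction (intermediate level to surface level). Several points are assumptions of this sketch: rows q3 and q4 are read as having only the transitions s:s to q4 and # to q0, q0, q1 and q2 are taken as the accepting states, "other" is taken to mean any symbol apart from s, x, z, ^ and #, and # is copied to the output as a default pair. The names TABLE and generate are hypothetical.

```python
EPS_E = "eps:e"  # label for the arc that reads nothing and writes e

TABLE = {
    0: {"s": 1, "x": 1, "z": 1, "^": 0, "#": 0, "other": 0},
    1: {"s": 1, "x": 1, "z": 1, "^": 2, "#": 0, "other": 0},
    2: {"s": 5, "x": 1, "z": 1, "^": 0, EPS_E: 3, "#": 0, "other": 0},
    3: {"s": 4},   # assumed: only s:s is defined in q3
    4: {"#": 0},   # assumed: only # is defined in q4
    5: {"s": 1, "x": 1, "z": 1, "^": 2, "other": 0},
}
ACCEPTING = {0, 1, 2}


def generate(intermediate):
    """Return every surface string the table produces for `intermediate`."""
    results = set()

    def step(state, pos, out):
        row = TABLE[state]
        # Try the eps:e arc first: it consumes no input and emits e.
        if EPS_E in row:
            step(row[EPS_E], pos, out + "e")
        if pos == len(intermediate):
            if state in ACCEPTING:
                results.add(out)
            return
        ch = intermediate[pos]
        column = ch if ch in "sxz^#" else "other"
        if column in row:
            # ^:eps deletes the boundary; every other pair copies its symbol.
            step(row[column], pos + 1, out + ("" if ch == "^" else ch))

    step(0, 0, "")
    return results
```

On the input "fox^s#" the search explores two paths from q2: the path through q5 dies on # (this is what blocks foxs), and the path through q3 and q4 yields "foxes#". On "cat^s#" the machine never leaves the q0/q1 region and returns "cats#".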
Combining FST Lexicon and Rules • First, the lexicon FST maps between the lexical level and the intermediate level, which is just a concatenation of morphemes. • Then a number of spelling-rule FSTs run in parallel (or as a cascade), mapping from the intermediate level to the surface level. • The lexicon FST and the orthographic-rule FSTs form a cascade. This can be run top-down (generation) or bottom-up (parsing).
FST Parsing • Parsing is more complicated than generation because of ambiguity. E.g. foxes may be parsed both as fox+V+3SG and as fox+N+PL. Disambiguation cannot be performed at the lexical level, so the FST should return both parses. • Ambiguities also occur during parsing because of ε arcs or multiple possible paths. This is similar to the situation for NFSAs, and similar search techniques must be employed.
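The ambiguity can be seen with a small generate-and-test sketch, which stands in for a real bottom-up run of the cascade: it enumerates lexical forms from a toy lexicon, maps each one to the surface level, and keeps every form that matches the input word. The lexicon, the tag inventory, and the function names are hypothetical and deliberately incomplete, and the spelling step is a regex approximation of the e-insertion rule.

```python
import re

# Hypothetical toy lexicon: stem -> word classes it belongs to.
STEMS = {"fox": ["N", "V"], "cat": ["N"]}

# Word class -> (lexical tags, intermediate-level ending) pairs.
SUFFIXES = {
    "N": [("+N+SG", ""), ("+N+PL", "^s")],
    "V": [("+V+3SG", "^s")],
}


def surface_of(intermediate):
    """Intermediate level -> surface level (e-insertion, then drop ^ and #)."""
    with_e = re.sub(r"([xsz]\^)(?=s#)", r"\1e", intermediate)
    return with_e.replace("^", "").replace("#", "")


def parse(surface):
    """Return all lexical analyses whose surface form equals `surface`."""
    analyses = []
    for stem, classes in STEMS.items():
        for cls in classes:
            for tags, ending in SUFFIXES[cls]:
                if surface_of(stem + ending + "#") == surface:
                    analyses.append(stem + tags)
    return sorted(analyses)
```

Here parse("foxes") returns both fox+N+PL and fox+V+3SG, parse("cats") returns the single analysis cat+N+PL, and parse("foxs") returns nothing. Choosing between the two analyses of foxes would need context beyond the word itself.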