Writing Lexical Transducers Using xfst
• Overview of Transduction
• Review of xfst Rules
• Creating Two-Level Lexicons
• Putting it All Together
Beesley 2000

Theory-Neutral Morphological Analysis
[Figure: a Black-Box Morphological Analyzer sitting between Analyses and Words]

Finite-State Transducers (FSTs)
• An FST encodes a Regular Relation, i.e. a relation between two regular languages.
• FSTs can be used for morphological analysis, if
  – The set of surface words (strings) to be analyzed is a regular language, and
  – The “analyses” are also defined to be a regular language, i.e. just another set of strings
[Figure: Analysis String Language, FST, Surface String Language]

What Do the Two Languages Look Like?
• In commercial natural-language processing
  – The surface language (e.g. French words written in the standard French orthography) is usually a given.
  – Periodic official spelling reforms may require fixes to your analyzer.
  – You may have to worry about national variations.
• In contrast, the analysis-language strings must be designed by the linguist. In the most common Xerox convention, each analysis string consists of the traditional dictionary-citation baseform followed by multicharacter-symbol “tags”.
    cantar+Verb+PInd+1P+Sg
    canto+Noun+Masc+Sg
    alto+Adj+Fem+Pl
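As a rough illustration of a relation between an analysis language and a surface language, here is a plain-Python sketch (not xfst). It stores a tiny relation as a set of string pairs; the surface forms `canto` and `altas` and the function names `apply_up` and `apply_down` are assumptions chosen for this example, and a real FST would encode the relation as a network rather than a list.

```python
# A toy finite relation standing in for an FST: (analysis, surface) pairs.
# The pairs are illustrative only.
RELATION = {
    ("cantar+Verb+PInd+1P+Sg", "canto"),
    ("canto+Noun+Masc+Sg", "canto"),
    ("alto+Adj+Fem+Pl", "altas"),
}


def apply_up(surface):
    """Return every analysis string related to the given surface string."""
    return sorted(a for a, s in RELATION if s == surface)


def apply_down(analysis):
    """Return every surface string related to the given analysis string."""
    return sorted(s for a, s in RELATION if a == analysis)


print(apply_up("canto"))              # an ambiguous word has two analyses
print(apply_down("alto+Adj+Fem+Pl"))
```

The same relation is used in both directions, which is the property the black-box view of analysis and generation relies on.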

Non-Commercial (Lesser-Studied) Languages
1. All normal human beings speak a natural language, but there is nothing necessary or natural about reading and writing.
2. An orthography is a set of symbols, and conventions for using them, for “making language visible”.
3. Orthographies are technologies, like agriculture or metalworking.
4. Most languages have never been written, i.e. there is no standard orthography; or linguists and governments may have proposed several competing orthographies.
5. When working with lesser-studied languages, you may have to choose (or devise) a surface orthography for use in your morphological analyzer.

Two Main Tasks to Morphology
• Morphotactics
  – Describe the structure/grammar of words
  – Classic finite-state operations required
      – Concatenation of one morpheme to the next
      – Union of morphemes within classes
  – Some languages require other finite-state operations
      – Arabic stems require intersection
      – Malay requires special algorithms for reduplication
• Phonological/Orthographical Alternation
  – Union and concatenation by themselves tend to build abstract morphophonemic strings
  – Use finite-state rules to map from underlying (or “lexical”) morphophonemic strings to surface strings

Describing Morphotactics Using Regular Expressions
Some very simple morphotactics can be described using just union, concatenation and perhaps optionality.

Simple Esperanto Verbs
    Opt. Prefix    Req. Root    Opt. Aspect    Req. Verb Ending
    ne             don          ad             as
    mal            dir                         is
                   pens                        os
                   ir                          us
                   ...                         u
                                               i

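Esperanto Verb Morphotactics
    xfst[]: read regex (n e | m a l) [d o n | d i r | p e n s | i r] (a d) [a s | i s | o s | u s | u | i] ;
• Each morpheme class is a unioned list of morphemes.
• Optional classes are surrounded with parentheses.
• Then morpheme classes are concatenated together, in the right order.
• Symbols are separated by spaces so that each letter is its own symbol; an unspaced sequence such as ne would be read as a single multicharacter symbol.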
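The union, optionality and concatenation in this regular expression can be mimicked with a small Python sketch (not xfst). Each morpheme class is a list, an optional class includes the empty string, and concatenation is the Cartesian product of the classes. The variable names are assumptions made for this illustration.

```python
from itertools import product

# Each class is a union (list) of morphemes; "" models an optional class.
PREFIX = ["", "ne", "mal"]
ROOT = ["don", "dir", "pens", "ir"]
ASPECT = ["", "ad"]
VSUFF = ["as", "is", "os", "us", "u", "i"]

# Concatenating the classes in order enumerates the whole (finite) language.
WORDS = {"".join(parts) for parts in product(PREFIX, ROOT, ASPECT, VSUFF)}

print(len(WORDS))
print("maldonadis" in WORDS)
print("donmalis" in WORDS)  # wrong morpheme order, so not in the language
```

Because the classes here are finite and the morphemes do not overlap, the language has 3 × 4 × 2 × 6 = 144 words.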

Esperanto Verb Morphotactics, Version 2 (xfst script)
    define Prefix n e | m a l ;
    define Root d o n | d i r | p e n s | i r ;
    define Aspect a d ;
    define VSuff a s | i s | o s | u s | u | i ;
    read regex (Prefix) Root (Aspect) VSuff ;

Morphophonological/Orthographical Alternations
• If simple concatenation doesn’t produce valid words, then we need to handle alternations.
• In today’s exercises, we will use Replace Rules, e.g. if Spanish pluralization is done by concatenating [ %+ s ] to a noun, we will need to fix cases like the following:
    pez+s
    .o.
    z %+ -> c e || _ s .#.
[Figure: pez+s goes into the FST and peces comes out]
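The effect of the rule z %+ -> c e || _ s .#. on a single string can be approximated with a Python regular-expression substitution. This is only a sketch of the rule's downward behavior, not an xfst transducer; the function name `z_plus_rule` is an assumption for this example.

```python
import re


def z_plus_rule(upper):
    """Emulate  z %+ -> c e || _ s .#.  in the downward direction.

    "z+" is rewritten as "ce" only when it is followed by a word-final "s".
    """
    return re.sub(r"z\+(?=s\Z)", "ce", upper)


print(z_plus_rule("pez+s"))    # the context matches
print(z_plus_rule("libro+s"))  # no "z+", so the string passes through unchanged
```

Strings that do not contain the designated upper string in the designated context are mapped to themselves, which is why libro+s still shows its boundary symbol after this one rule.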

The Simplest Xerox Replace Rules
Schema:
    upper -> lower || left _ right
where upper, lower, left and right are regular expressions denoting regular languages (not relations!)
• Remember to use regular-expression syntax. Replace Rules are regular expressions!
• The overall Replace Rule denotes a relation. E.g.
    s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ]
• A context can be left empty, which is equivalent to a context of ?* E.g.
    s -> z || _ m
    p -> m || m _
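For single-symbol rules like these, the downward behavior can be sketched in Python with lookbehind and lookahead assertions, which test the context without consuming it. This is an approximation for illustration, not xfst; the function names are assumptions.

```python
import re

VOWEL = "[aeiou]"


def s_to_z_between_vowels(upper):
    # s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ]
    return re.sub(rf"(?<={VOWEL})s(?={VOWEL})", "z", upper)


def p_to_m_after_m(upper):
    # p -> m || m _      (empty right context, i.e. anything may follow)
    return re.sub(r"(?<=m)p", "m", upper)


print(s_to_z_between_vowels("casa"))   # s is between vowels
print(s_to_z_between_vowels("casta"))  # s is followed by t, so no change
print(p_to_m_after_m("mpp"))           # only the first p has m to its left
```

The lookarounds inspect the original input string, which mirrors the fact that contexts of a `||` rule are matched on the upper side: in `mpp` the second p is preceded by an upper-side p, not m, so it is left alone.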

The Simplest Replace Rules II
• Referring to the beginning or the end of a word:
    z -> s || _ .#.
    e -> i || _ (s) .#.
    e -> i || .#. p _ r
• A rule may be unconditioned, with no context at all
    c h -> %$
    s s -> s
• Do not write “ss” or “ch” in regular expressions unless you want them to be treated as single symbols.
• Remember to “unspecialize” special symbols when you want a literal dollar sign, etc.

Rule Abbreviations
• Instead of two rules:
    e -> i || _ (s) .#.
    o -> u || _ (s) .#.
  You can write:
    e -> i , o -> u || _ (s) .#.
  a comma separates the “left-hand sides” of the rule
• Instead of two rules:
    e -> i || _ (s) .#.
    e -> i || .#. p _ r
  You can write:
    e -> i || _ (s) .#. , .#. p _ r
  a comma separates the “right-hand sides” of the rule
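The first abbreviation, where two replacements share one context, can be sketched in Python as a single substitution driven by a lookup table. This is an illustrative emulation of the downward direction only; the names `RAISE` and `raise_final_vowel` are assumptions.

```python
import re

# e -> i , o -> u || _ (s) .#.
RAISE = {"e": "i", "o": "u"}


def raise_final_vowel(upper):
    """Rewrite e or o when it is word-final or followed by a word-final s."""
    return re.sub(r"[eo](?=s?\Z)", lambda m: RAISE[m.group()], upper)


for word in ["pelo", "peros", "nene"]:
    print(word, "->", raise_final_vowel(word))
```

Both replacements are applied in one pass over the same input, so the result does not depend on which of the two original rules would have been listed first.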

Simple Replace-Rule Semantics
    upper -> lower || leftcontext _ rightcontext
• The overall rule denotes a finite-state relation (not an algorithm)
• The upper-side language of a -> relation is the universal language (?*)
• By default, all symbols on the upper side are mapped to the same symbol on the lower side
• But IF a string on the upper side contains a designated “upper” string, in the designated context, then it is mapped to a string (or strings) on the lower side where the matched substring is replaced by the designated “lower” string.
• The context must “match” on the upper side string
• A right-arrow -> rule has a downward orientation.

Understanding Replace Rules
    xfst> read regex a -> b ;
    xfst> apply down aaa
    xfst> apply down dog
    xfst> apply up bbb
    xfst> apply up dog

    xfst> read regex a:b ;
    xfst> apply down aaa
    xfst> apply up bbb
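The contrast between the rule a -> b and the symbol pair a:b can be worked through with a small Python emulation (not xfst). The rule relates every string to a version with each a replaced by b, while a:b relates only the one-symbol string a to b. The four function names are assumptions for this sketch.

```python
from itertools import product


def rule_down(upper):
    # a -> b : every "a" on the upper side is realized as "b" on the lower side.
    return [upper.replace("a", "b")]


def rule_up(lower):
    # Inverse direction: a lower "b" may come from an upper "a" or "b";
    # a lower "a" cannot arise, because every upper "a" is rewritten.
    if "a" in lower:
        return []
    choices = [("a", "b") if ch == "b" else (ch,) for ch in lower]
    return ["".join(p) for p in product(*choices)]


def pair_down(upper):
    # a:b relates exactly the string "a" to the string "b", nothing else.
    return ["b"] if upper == "a" else []


def pair_up(lower):
    return ["a"] if lower == "b" else []


print(rule_down("aaa"), rule_down("dog"))
print(rule_up("bbb"))   # several upper strings map down to "bbb"
print(rule_up("dog"))
print(pair_down("aaa"), pair_up("bbb"))  # both empty: no such pair in a:b
```

Applying the rule upward to bbb yields eight strings, since each of the three b symbols has two possible upper-side sources.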

Review of Notations for Transducers
• The cross-product operator:
    [ u p p e r .x. l o w e r ]
• In general, for any two regular expressions A and B denoting languages:
    A .x. B
• For convenience, we can also write
    a:b                 equivalent to    [ a .x. b ]
    %+Tag:{ing}         equivalent to    [ %+Tag .x. i n g ]
    {upper}:{lower}     equivalent to    [ u p p e r .x. l o w e r ]

Esperanto Verb Morphotactics, Version 3; A Lexicon with Two Levels
    xfst[]: define Prefix Neg%+:{ne} | Op%+:{mal} ;
    xfst[]: define Root d o n | d i r | p e n s | i r ;
    xfst[]: define Aspect %+Cont:{ad} ;
    xfst[]: define VSuff %+Pres:{as} | %+Past:{is} | %+Fut:{os} | %+Cond:{us} | %+Subj:u | %+Inf:i ;
    xfst[]: read regex (Prefix) Root (Aspect) VSuff ;
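The two-level lexicon can be sketched in Python (not xfst) by giving every morpheme an upper and a lower string and concatenating both sides in parallel. The list names and the `apply_up` and `apply_down` helpers are assumptions for this illustration; tags are written as plain substrings rather than multicharacter symbols.

```python
from itertools import product

# Each entry is an (upper, lower) pair; ("", "") models an optional class.
PREFIX = [("", ""), ("Neg+", "ne"), ("Op+", "mal")]
ROOT = [(r, r) for r in ("don", "dir", "pens", "ir")]
ASPECT = [("", ""), ("+Cont", "ad")]
VSUFF = [("+Pres", "as"), ("+Past", "is"), ("+Fut", "os"),
         ("+Cond", "us"), ("+Subj", "u"), ("+Inf", "i")]

# Concatenate upper sides and lower sides along each path through the classes.
LEXICON = [
    ("".join(u for u, _ in path), "".join(l for _, l in path))
    for path in product(PREFIX, ROOT, ASPECT, VSUFF)
]


def apply_up(surface):
    """Analyses (upper strings) paired with a surface word."""
    return [u for u, l in LEXICON if l == surface]


def apply_down(analysis):
    """Surface words (lower strings) paired with an analysis."""
    return [l for u, l in LEXICON if u == analysis]


print(apply_up("malpensadus"))
print(apply_down("Op+don+Cont+Past"))
```

Listing all pairs only works because this toy lexicon is finite; a transducer represents the same relation compactly and also handles infinite ones.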

Esperanto Verb Transducer
[Figure: network diagram of the Esperanto verb transducer; arcs pair upper-side symbols, including the tags Neg+, Op+, Cont+, Pres+, Past+, Fut+, Cond+, Subj+ and Inf+, with lower-side letters or 0 (epsilon)]
Apply up: malpensadus

The Usual Strategy: Define a dictionary and alternation rules
    Upper: Op+don+Cont+Past
    Lower: maldonadis
• Dictionary Transducer .o. Alternation Rules yields the Final FST
• As necessary, apply alternation rules via composition
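The dictionary-plus-rules strategy can be sketched in Python (not xfst) by treating each deterministic rule as a function applied to the lower side of every dictionary pair. The Spanish entries, the +Noun, +Sg and +Pl tags, and the boundary-deleting rule `drop_boundary` are hypothetical, introduced only for this example; real composition operates on relations, not on lists of pairs.

```python
import re

# Hypothetical dictionary transducer: (analysis, intermediate string) pairs.
DICTIONARY = [
    ("pez+Noun+Sg", "pez"),
    ("pez+Noun+Pl", "pez+s"),
    ("libro+Noun+Pl", "libro+s"),
]


def z_plus_rule(s):
    # z %+ -> c e || _ s .#.
    return re.sub(r"z\+(?=s\Z)", "ce", s)


def drop_boundary(s):
    # Assumed cleanup rule, roughly  %+ -> 0 : delete leftover boundaries.
    return s.replace("+", "")


def compose(dictionary, rules):
    """Apply the rules, in order, to the lower side of every dictionary pair."""
    final = []
    for upper, lower in dictionary:
        for rule in rules:
            lower = rule(lower)
        final.append((upper, lower))
    return final


FINAL = compose(DICTIONARY, [z_plus_rule, drop_boundary])
for upper, lower in FINAL:
    print(upper, "<->", lower)
```

The order of the rules matters here: the boundary must still be present when the z rule looks for it, so the cleanup rule comes last.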

The Bambona Language
• Review the Xerox regular-expression syntax.
• Review the difference between
  – regular expression file
      – contains a single regular expression, ends with a semicolon and newline
      – xfst[]: read regex < myfile.regex
  – script file
      – contains a list of commands to xfst (including perhaps “define” and “read regex” commands)
      – xfst[]: source myfile.script
• Read the description carefully (not just the final test data).
• Describe the morphotactics using union and concatenation.
• Handle the variations using replace rules.