Xerox Finite-State Programming Course or How to Write a Morphological Analyzer using lexc and xfst











































Xerox Finite-State Programming Course or How to Write a Morphological Analyzer using lexc and xfst
20-23 September 2004
Xerox Research Centre Europe, Grenoble Laboratory
6, chemin de Maupertuis, 38240 MEYLAN, France
Kenneth BEESLEY
ken.beesley@xrce.xerox.com
Ken Beesley: Brief Introduction
• B.A., Linguistics and Computer Science, Brigham Young University, 1978
• Diploma, Linguistics and Phonetics, Univ. of Glasgow, Scotland, 1979
• D.Phil., Epistemics (later called “Cognitive Science”, then “Informatics”), Univ. of Edinburgh, Scotland, 1983
• ALPNET, commercial computer-assisted translation, 1984-1990
• 1988-1990: Arabic morphology project; exposure to Finite-State Morphology from Dr. Lauri Karttunen at COLING 1988
• Microlytics (Xerox spinoff), 1990-1993
• Xerox Corporation, 1993-present
Ken Beesley: Brief Introduction II
• Morphology projects: Arabic, Spanish, Portuguese, Italian, Dutch, (Malay), (Aymara)
• Taught finite-state programming in France, Ireland, Turkey, Malta, South Africa
• “Students” and associates working on Turkish, Greek, Irish, Hebrew, Sámi, Nahuatl, Korean, Zulu, Xhosa, Ndebele, Tswana, Northern Sotho, …
• Kenneth R. Beesley and Lauri Karttunen, Finite State Morphology, CSLI Publications, 2003. Affectionately known as “The Book”.
• I enjoy writing morphological analyzers, and helping others to write them. Morphological analyzers are often a perfect first step in computational linguistics.
Some Preliminaries
Don’t Panic!
This is not a contest. Some will do better than others.
Making mistakes seems to be part of the learning process.
This is going to be a long week for us all … but hopefully not unpleasant.
Lecture Schedule
Days: Monday, Tuesday, Wednesday, Thursday, (Friday)
Session times: 9:30, 10:00 (first day), 13:30
Topics: Gentle Introduction; Introduction to xfst; Lexical Transducers in xfst; Introduction to lexc; Welcome; A More Formal Introduction; Composition; Testing with the Finite-State software; Advanced Filtering, Flag Diacritics
The Big Picture of Low-Level NLP
A running text in your favorite language
  ↓ Tokenizer
A tokenized text (divided into “words”)
  ↓ Morphological Analyzer
Tokens with their analyses (often ambiguous)
• Future steps
  • Disambiguator (“tagger”)
  • Shallow parser (“chunker”)
  • Syntactic parser
  • Semantic analysis, information extraction
• Ultimate applications
  • Spelling checking, indexing, aid to corpus analysis, lexicography
  • Dictionary lookup aids, language teaching, spelling correction
  • Text-to-speech systems
  • Question answering, machine translation, etc.
Some computer basics …
• Installing the Xerox finite-state software on your machine
  • Contact Dr. Lauri Karttunen, karttunen@parc.com, for the latest versions
• Bring up a terminal/command window
• For Unix/Linux
  • cd (takes you to your home directory)
  • pwd (print the current “working” directory)
  • ls (list files and directories in the current directory)
  • mkdir nameofnewdir (create a new subdirectory)
  • cd nameofnewdir (change/move to indicated subdirectory)
  • cd dir1/dir2/dir3 (change/move down several levels)
  • cd .. (change/move UP to the mother directory)
  • rm nameoffile (remove a file)
  • rmdir (remove a directory)
Using a text editor
• Bring up a text editor (e.g. vi, pico, emacs, notepad)
• Beware of word processors (like Microsoft Word) that insert format codes
  • If you use one, you need to save your files as plain text (Latin-1 for this course, or UTF-8)
• Try to create a small test file (using your text editor)
  • Type in something like “this is a test”
  • Save the file as testfile.txt
• Display the file to the screen; for Unix-like systems
  • cat testfile.txt
  • more testfile.txt
• Delete the file
  • rm testfile.txt
Technical Course Goals
• Understand “Regular” (also called “Rational”) Languages and Relations.
• Understand the mathematical operations that can be performed on Regular Languages and Relations.
• Understand how Languages, Relations, Regular Expressions, and Networks are interrelated.
• Learn how to compute with them using Xerox Finite-State Technology
  • xfst interface
    • Regular-Expression Compiler
    • Access to Finite-State Algorithms
  • lexc language
    • Used mainly for lexicons and describing morphotactics in natural language
Practical Linguistic Goals
• Develop the “finite-state mindset”; know what’s possible, what is hard and what is easy
• Learn how to apply finite-state modeling and computing techniques to natural languages like French, English, Finnish, Zulu, Amharic, Sámi, etc.
• In particular, learn how to write morphological analyzer/generators
• Flush out questions and mistakes during the Course
• So you can return home and start building significant natural-language processing systems
Developing the Finite-State Mind-Set
Some of us will have to overcome instincts from previous training in algorithmic/procedural programming.
All of us need to learn the Xerox notations, appreciate the possibilities and learn a few idioms and tricks.
Why is Finite State Computing So Interesting?
• Finite-state systems are mathematically elegant, flexible and modifiable.
• The mathematical elegance translates directly to computational efficiency. Finite-state systems are typically very fast.
• Finite-state systems are typically very compact for typical natural-language tasks.
• Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.
Why is Finite State Computing So Interesting?
• The programming we linguists do is declarative. We write grammars and rules. We describe the facts of our natural language; we do not write code.
• There is runtime code, which applies our systems to linguistic input; but it is already written and it is completely language-independent.
• Xerox has competitors who are also into finite-state computing, for exactly the same reasons: AT&T, Teragram, Grail, Groningen Univ., …
What is Finite-State Computing Good For?
• Mostly “lower-level” natural language processing
  • Tokenization
  • Spelling checking/correction
  • Phonology
  • Morphological Analysis/Generation
  • Part-of-Speech Tagging
  • “Shallow” Syntactic Parsing and “Chunking”
Finite-state techniques cannot do everything; but for tasks where they do apply, they are extremely attractive.
Where is Xerox Finite-State Technology Used?
• Xerox Research
• PARC Inc.
• Xerox Business Units, Partners and Spinoffs
  • Inxight
  • Temis
  • Documentum
• Universities and Research Groups, over 70 licensees
• An unknown number of non-commercial “book” users
The Gentle Introduction
• Chapter 1
  • Physical Finite-State Machines (Automata)
  • Linguistic Finite-State Machines
    • Symbol
    • Alphabet
    • Language
  • Lookup and Generation
  • Quick Review of Set Theory
  • Languages, Relations and Transducers
Physical Machines with Finite States
The Light-Switch Machine
[Diagram: two states, OFF and ON. PUSH UP leads from OFF to ON; PUSH DOWN leads from ON to OFF.]
Physical Machines with Finite States
The Light-Switch Toggle Machine
[Diagram: two states, OFF and ON. PUSH leads from OFF to ON, and PUSH leads from ON back to OFF.]
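Physical Machines with Finite States
The Fan in Ken’s Old Car
[Diagram: four states, OFF, LOW, MED and HI. Arcs labeled R lead from OFF to LOW, LOW to MED, and MED to HI; arcs labeled L lead back from HI to MED, MED to LOW, and LOW to OFF.]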
Physical Machines with Finite States
Three-Way Light-Switch
[Diagram: four states, OFF, LOW, MED and HI, connected by four arcs labeled R.]
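The physical machines above can be written down as transition tables. The following Python sketch is only an illustration, not part of the course software. It assumes that the fan lever cannot move past OFF or HI (an input with no arc leaves the state unchanged) and that the fourth R arc of the three-way switch leads from HI back to OFF; both are readings of the diagrams, not facts stated on the slides. The names `FAN`, `THREE_WAY` and `run` are hypothetical.

```python
# Transition tables: (current state, input) -> next state.
FAN = {
    ("OFF", "R"): "LOW", ("LOW", "R"): "MED", ("MED", "R"): "HI",
    ("HI", "L"): "MED", ("MED", "L"): "LOW", ("LOW", "L"): "OFF",
}

# Assumption: the fourth R arc wraps around from HI to OFF.
THREE_WAY = {
    ("OFF", "R"): "LOW", ("LOW", "R"): "MED",
    ("MED", "R"): "HI", ("HI", "R"): "OFF",
}


def run(table, state, inputs):
    """Follow one arc per input symbol.

    Assumption: an input with no matching arc leaves the state unchanged.
    """
    for symbol in inputs:
        state = table.get((state, symbol), state)
    return state


print(run(FAN, "OFF", "RRL"))         # right, right, left
print(run(THREE_WAY, "OFF", "RRRR"))  # four turns of the switch
```

Each machine has a finite number of states, and the whole behavior is captured by the table; nothing else needs to be remembered between inputs.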
Description of the Cola Machine
• Need to enter 25 cents (USA) to get a drink
• Accepts the following coins:
  • Nickel = 5 cents
  • Dime = 10 cents
  • Quarter = 25 cents
• For simplicity, our machine needs exact change
• We will model only the coin-accepting mechanism that decides when enough money has been entered
• This machine will have a Start state, intermediate states, and a Final or “Accepting” state
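Physical Machines with Finite States
The Cola Machine
[Diagram: six states labeled 0, 5, 10, 15, 20 and 25. State 0 is the Start State and state 25 is the Final/Accept State. Arcs labeled N advance by 5, arcs labeled D advance by 10, and a single arc labeled Q leads from 0 to 25.]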
The Cola Machine Language
• List of all the sequences of coins accepted by the Cola Machine
  • Q
  • DDN
  • DND
  • NDD
  • DNNN
  • NDNN
  • NNDN
  • NNND
  • NNNNN
And Here is Where We Make the Step from Physical Machines to Linguistic Machines:
• Think of the coins as SYMBOLS or CHARACTERS
• The set of symbols accepted is the ALPHABET of the machine
• Think of accepted sequences of coins as WORDS or “strings”
• The set of words accepted by the machine is its LANGUAGE
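The Cola Machine and its language can be checked with a few lines of Python. This is a minimal sketch, assuming the states are simply the number of cents entered so far and that a coin which would overshoot 25 cents has no arc (the exact-change rule). The names `COINS`, `step`, `accepts` and `language` are hypothetical and not part of any Xerox tool.

```python
# States are the number of cents entered so far; 25 is the only final state.
COINS = {"N": 5, "D": 10, "Q": 25}  # the ALPHABET of the machine
START, FINAL = 0, 25


def step(state, coin):
    """Return the next state, or None if the coin would overshoot 25 cents."""
    total = state + COINS[coin]
    return total if total <= FINAL else None


def accepts(word):
    """Apply a sequence of coins to the machine; accept only in the final state."""
    state = START
    for coin in word:
        if coin not in COINS:
            return False
        state = step(state, coin)
        if state is None:
            return False
    return state == FINAL


def language(state=START):
    """Enumerate every coin sequence that leads from `state` to the final state."""
    if state == FINAL:
        return [""]
    words = []
    for coin in COINS:
        nxt = step(state, coin)
        if nxt is not None:
            words += [coin + rest for rest in language(nxt)]
    return words


print(sorted(language(), key=lambda w: (len(w), w)))
print(accepts("DDN"), accepts("DD"))
```

Because the machine has no loops, its language is finite, and the enumeration yields exactly the nine words listed on the slide.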
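Linguistic Machines
[Diagram: a network whose arcs are labeled with letters (a, c, e, g, i, m, n, o, r, s, t). The input string “mesa” is “applied” to the network.]
Either accepts or rejects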
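Applying a word to a network can be sketched in Python as follows. The word list (canto, tigre, mesa) is an assumption suggested by the letters in the diagram, and the network built here is a simple letter tree that shares only prefixes, whereas a compiled finite-state network would typically share suffixes too. The function names `build_network` and `apply` are hypothetical.

```python
def build_network(words):
    """Build a letter network (a trie): arcs[state][symbol] -> next state."""
    arcs, finals = {0: {}}, set()
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in arcs[state]:
                new_state = len(arcs)      # states are numbered 0, 1, 2, ...
                arcs[new_state] = {}
                arcs[state][symbol] = new_state
            state = arcs[state][symbol]
        finals.add(state)                  # the word ends in a final state
    return arcs, finals


def apply(network, word):
    """Return True if the network accepts the word, False if it rejects it."""
    arcs, finals = network
    state = 0
    for symbol in word:
        if symbol not in arcs[state]:
            return False                   # no arc for this symbol: reject
        state = arcs[state][symbol]
    return state in finals                 # accept only in a final state


# Assumed word list; the slide shows only the letters on the arcs.
NETWORK = build_network(["canto", "tigre", "mesa"])

print(apply(NETWORK, "mesa"))
print(apply(NETWORK, "mes"))
```

The word “mes” is rejected even though every letter finds an arc, because the path ends in a state that is not final.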
More Linguistic Machines
[Diagram: a network with arcs labeled c, l, e, a, r, v, e, e.]
A Transducer
“Apply Down”: mesa+Noun+Fem+Pl
Upper side: m e s a +Noun +Fem +Pl
Lower side: m e s a 0 0 s
“Apply Up”: mesas
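The transducer path on this slide can be imitated in Python to show why the same network works in both directions. This is a sketch under stated assumptions: the network contains only the single path shown (mesa+Noun+Fem+Pl paired with mesas), the state numbers are invented, and `apply_up` and `apply_down` are hypothetical helper names that merely echo the xfst terminology.

```python
EPSILON = ""  # the slide writes the empty string as 0

# Each arc is (upper symbol, lower symbol, next state). State 7 is final.
ARCS = {
    0: [("m", "m", 1)],
    1: [("e", "e", 2)],
    2: [("s", "s", 3)],
    3: [("a", "a", 4)],
    4: [("+Noun", EPSILON, 5)],
    5: [("+Fem", EPSILON, 6)],
    6: [("+Pl", "s", 7)],
}
FINALS = {7}


def _walk(text, direction, state=0):
    """Match `text` against one side of the arcs and collect the other side."""
    results = []
    if text == "" and state in FINALS:
        results.append("")
    for upper, lower, nxt in ARCS.get(state, []):
        match, out = (upper, lower) if direction == "down" else (lower, upper)
        # An epsilon on the matching side consumes no input.
        if text.startswith(match):
            for rest in _walk(text[len(match):], direction, nxt):
                results.append(out + rest)
    return results


def apply_down(analysis):
    """Generation: match the upper side, return the lower-side strings."""
    return _walk(analysis, "down")


def apply_up(surface):
    """Analysis: match the lower side, return the upper-side strings."""
    return _walk(surface, "up")


print(apply_down("mesa+Noun+Fem+Pl"))
print(apply_up("mesas"))
```

Nothing in the arc table is specific to analysis or to generation; the direction is chosen only when the network is applied.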
A Morphological Analyzer
[Diagram: Analysis Word Language ↔ Transducer ↔ Surface Word Language]
A Portuguese Morphological Analyzer
[Diagram: cantar[Verb][Pres][Indic][1P][Sg] ↔ Finite-State Transducer ↔ canto]
A Quick Review of Set Theory
A set is a collection of objects.
[Diagram: a set containing A, B, D and E]
We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }.
There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.
Uniqueness of Elements
You cannot have two or more ‘A’ elements in the same set.
[Diagram: a set containing A, B, D and E]
{ A, A, D, B, E } is just a redundant specification of the set { A, D, B, E }.
Cardinality of Sets
• The Empty Set
• A Finite Set: Norway, Denmark, Sweden
• An Infinite Set: e.g. the set of all positive integers
Operations on Sets: Union
[Diagram: Set 1 and Set 2, two disjoint sets with the elements A, B, C, D and E between them]
Union of Set 1 and Set 2: { A, B, C, D, E }
Operations on Sets (2): Union
Set 1: { A, B, C }
Set 2: { C, D }
Union of Set 1 and Set 2: { A, B, C, D }
Operations on Sets (3): Intersection
Set 1: { A, B, C }
Set 2: { C, D }
Intersection of Set 1 and Set 2: { C }
Operations on Sets (4): Subtraction
Set 1: { A, B, C }
Set 2: { C, D }
Set 1 minus Set 2: { A, B }
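The set operations reviewed on these slides map directly onto Python's built-in `set` type, which makes them easy to try out. The sketch below uses the sets from the slides; the variable names are arbitrary.

```python
set1 = {"A", "B", "C"}
set2 = {"C", "D"}

# Order is not significant, and repeated elements collapse into one.
same_order = {"A", "D", "B", "E"} == {"E", "A", "D", "B"}
no_duplicates = {"A", "A", "D", "B", "E"} == {"A", "D", "B", "E"}

union = set1 | set2          # elements in either set
intersection = set1 & set2   # elements in both sets
difference = set1 - set2     # elements of set1 that are not in set2

print(same_order, no_duplicates)
print(sorted(union), sorted(intersection), sorted(difference))
print(len(set()), len(set1))  # cardinality of the empty set and of set1
```

An infinite set such as the positive integers cannot be stored as a Python `set`, which is one reason a finite description such as a network or a regular expression is needed for infinite languages.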
Formal Languages
Very Important Concept in Formal Language Theory: A Language is just a Set of Words.
• We use the terms “word” and “string” interchangeably.
• A Language can be empty, have finite cardinality, or be infinite in size.
• You can union, intersect and subtract languages, just like any other sets.
Union of Languages (Sets)
[Diagram: Language 1 and Language 2, two disjoint languages with the words dog, cat, rat, elephant and mouse between them]
Union of Language 1 and Language 2: { dog, cat, rat, elephant, mouse }
Intersection of Languages (Sets)
[Diagram: Language 1 and Language 2, two disjoint languages with the words dog, cat, rat, elephant and mouse between them]
Intersection of Language 1 and Language 2: the empty language
Intersection of Languages (Sets)
Language 1: { dog, cat, rat }
Language 2: { rat, mouse }
Intersection of Language 1 and Language 2: { rat }
Subtraction of Languages (Sets)
Language 1: { dog, cat, rat }
Language 2: { rat, mouse }
Language 1 minus Language 2: { dog, cat }
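Since a language is just a set of words, the same operations apply to sets of strings. In this short Python sketch, `language1` and `language2` follow the slides with the shared word “rat”; `other_language` is an assumed split of the words on the disjoint slides, where the diagram does not show which word belongs to which language.

```python
language1 = {"dog", "cat", "rat"}
language2 = {"rat", "mouse"}

# Assumption: one possible second language for the disjoint case.
other_language = {"elephant", "mouse"}

print(sorted(language1 | other_language))  # union
print(sorted(language1 & other_language))  # intersection: no shared words
print(sorted(language1 & language2))       # intersection: one shared word
print(sorted(language1 - language2))       # subtraction
```

The intersection of two languages with no word in common is the empty language, which is itself a perfectly good language.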
Formal Languages
• A language is a set of words (= strings).
• Words (strings) are composed of symbols (letters) that are “concatenated” together.
• At another level, words are composed of “morphemes”.
• In most natural languages, we concatenate morphemes together to form whole words.
For sets consisting of words (i.e. for Languages), the operation of concatenation is very important.
Concatenation of Languages
Root Language: work, talk, walk
Suffix Language: 0, ing, ed, s
0 or ε denotes the empty string
The concatenation of the Suffix language after the Root language:
working, worked, works, talking, talked, talks, walking, walked, walks
Concatenation of Languages II
Prefix Language: re, out, 0
Root Language: work, talk, walk
Suffix Language: 0, ing, ed, s
The concatenation of the Prefix language, Root language, and the Suffix language:
reworking, reworked, reworks, retalking, retalked, retalks, rewalking, rewalked, rewalks
outworking, outworked, outworks, outtalking, outtalked, outtalks, outwalking, outwalked, outwalks
working, worked, works, talking, talked, talks, walking, walked, walks
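Concatenation of languages pairs every word of the first language with every word of the next. The Python sketch below is only an illustration with finite sets; `concatenate` is a hypothetical helper, and the empty string `""` stands for the 0 (or ε) of the slides.

```python
def concatenate(*languages):
    """Concatenate languages in order: every word of one with every word of the next."""
    result = {""}  # the language containing only the empty string
    for language in languages:
        result = {left + right for left in result for right in language}
    return result


PREFIXES = {"re", "out", ""}
ROOTS = {"work", "talk", "walk"}
SUFFIXES = {"", "ing", "ed", "s"}

root_suffix = concatenate(ROOTS, SUFFIXES)
full = concatenate(PREFIXES, ROOTS, SUFFIXES)

print(len(root_suffix), sorted(root_suffix))
print(len(full), "outwalked" in full, "reworks" in full)
```

Because the prefix and suffix languages both contain the empty string, the bare roots and the unprefixed forms are members of the result as well.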
Languages and Networks
[Diagram: Network/Language 1, Network/Language 2 and Network/Language 3, each drawn with letter-labeled arcs, followed by the concatenation of Networks 1, 2 and 3, in that order]
Tasks/Exercises
• Read chapter 1, at least up to page 28
• Do Exercises 1.10.1 and 1.10.2, pp. 37-41
• For more rigor, read the rest of Chapter 1 and Chapter 2. Do the graphing exercise in Appendix A, p. 457.