LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing Dan Jurafsky 9/16/2021 LING 138/238 Autumn 2004 1
Today 9/30 Week 1
• Finite State Automata
• Deterministic Recognition of FSAs
• Non-Determinism (NFSAs)
• Recognition of NFSAs
• Proof that regular expressions = FSAs
• Very brief sketch: Morphology, FSAs, FSTs
• Why finite-state machines are so great

Three Views
• Three equivalent formal ways to look at what we’re up to (thanks to Martin Kay)
  – Regular Expressions
  – Finite State Automata
  – Regular Languages

Finite State Automata
• Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata
• Regular expressions are one way of specifying the structure of finite-state automata.
• FSAs and their close relatives are at the core of most algorithms for speech and language processing.

Intuition: FSAs as Graphs
• Let’s start with the sheep language from the text
  – /baa+!/

Sheep FSA
• We can say the following things about this machine
  – It has 5 states
  – At least b, a, and ! are in its alphabet
  – q0 is the start state
  – q4 is an accept state
  – It has 5 transitions

But note
• There are other machines that correspond to this language
• More on this one later
More Formally: Defining an FSA
• You can specify an FSA by enumerating the following things.
  – The set of states: Q
  – A finite alphabet: Σ
  – A start state: q0
  – A set of accepting/final states: F ⊆ Q
  – A transition function δ(q, i) that maps Q × Σ to Q

Yet Another View
• State-transition table
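As a rough illustration of the definition above, the sheep FSA can be written down as plain Python data. This is only a sketch: the names `STATES`, `ALPHABET`, `START`, `ACCEPT`, `DELTA`, and `is_well_formed` are made up for this example, and states q0 to q4 are encoded as the integers 0 to 4. A missing entry in `DELTA` stands for an empty cell in the state-transition table.

```python
# Hypothetical encoding of the sheep-language FSA (/baa+!/).
STATES = {0, 1, 2, 3, 4}        # Q
ALPHABET = {"b", "a", "!"}      # Sigma
START = 0                       # q0
ACCEPT = {4}                    # F, a subset of Q

# State-transition table: DELTA[(state, symbol)] -> next state.
# A missing key means there is no legal transition (an empty table cell).
DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop that implements the "+" in /baa+!/
    (3, "!"): 4,
}


def is_well_formed(states, alphabet, start, accept, delta):
    """Check the conditions listed in the formal definition."""
    return (
        start in states
        and accept <= states                      # F is a subset of Q
        and all(
            q in states and symbol in alphabet and nxt in states
            for (q, symbol), nxt in delta.items()
        )
    )


if __name__ == "__main__":
    print(is_well_formed(STATES, ALPHABET, START, ACCEPT, DELTA))
```

Storing the table as a dictionary keeps the machine separate from whatever code interprets it.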
Recognition
• Recognition is the process of determining if a string should be accepted by a machine
• Or… it’s the process of determining if a string is in the language we’re defining with the machine
• Or… it’s the process of determining if a regular expression matches a string

Recognition
• Traditionally (Turing’s idea), this process is depicted with a tape.

Recognition
• Start in the start state
• Examine the current input
• Consult the table
• Go to a new state and update the tape pointer
• Repeat the last three steps until you run out of tape

D-Recognize

Tracing D-Recognize
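The recognition steps above can be sketched as a short table-driven loop. This is a minimal sketch and not a transcription of the slide's D-Recognize pseudocode: the function name `d_recognize`, the `trace` flag, and the dictionary encoding of the sheep table are all assumptions made for illustration.

```python
# Hypothetical transition table for the sheep FSA (/baa+!/).
SHEEP_DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
SHEEP_START = 0
SHEEP_ACCEPT = {4}


def d_recognize(tape, delta, start, accept, trace=False):
    """Deterministic recognition: one table lookup per input symbol."""
    state = start
    for index, symbol in enumerate(tape):
        next_state = delta.get((state, symbol))   # consult the table
        if trace:
            print(f"index={index} symbol={symbol!r} q{state} -> {next_state}")
        if next_state is None:                    # empty cell: reject
            return False
        state = next_state                        # move and advance the tape
    # Out of tape: accept only if we stopped in a final state.
    return state in accept


if __name__ == "__main__":
    for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
        print(s, d_recognize(s, SHEEP_DELTA, SHEEP_START, SHEEP_ACCEPT))
    d_recognize("baaa!", SHEEP_DELTA, SHEEP_START, SHEEP_ACCEPT, trace=True)
```

Calling it with `trace=True` prints one line per tape position, which is one way to follow the machine through an input by hand.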
Key Points
• Deterministic means that at each point in processing there is always one unique thing to do (no choices).
• D-recognize is a simple table-driven interpreter
• The algorithm is universal for all unambiguous regular languages.
  – To change the machine, you change the table.

Key Points
• Crudely therefore… matching strings with regular expressions (à la Perl) is a matter of
  – translating the expression into a machine (table) and
  – passing the table to an interpreter
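From the user's side, the "translate, then interpret" picture looks roughly like the following, using Python's standard `re` module. Note the hedge: Python's `re` is a backtracking matcher rather than a literal table-driven DFA interpreter, so this shows only the compile-then-match workflow, not the internals described above. The names `SHEEP` and `is_sheep_talk` are made up for this example.

```python
import re

# "Translate the expression": compile the pattern once into a matcher object.
SHEEP = re.compile(r"baa+!")


def is_sheep_talk(text):
    """Return True if the whole string is in the sheep language."""
    # fullmatch requires the pattern to cover the entire string.
    return SHEEP.fullmatch(text) is not None


if __name__ == "__main__":
    for s in ["baa!", "baaaa!", "ba!", "xbaa!"]:
        print(s, is_sheep_talk(s))
```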
Recognition as Search
• You can view this algorithm as state-space search.
• States are pairings of tape positions and state numbers.
• Operators are compiled into the table
• Goal state is a pairing with the end of tape position and a final accept state

Generative Formalisms
• Formal Languages are sets of strings composed of symbols from a finite set of symbols.
• Finite-state automata define formal languages (without having to enumerate all the strings in the language)
• The term Generative is based on the view that you can run the machine as a generator to get strings from the language.

Generative Formalisms
• FSAs can be viewed from two perspectives:
  – Acceptors that can tell you if a string is in the language
  – Generators to produce all and only the strings in the language
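To make the generator view concrete, here is a small sketch that walks the sheep FSA breadth-first and collects every accepted string up to a length bound. The bound is needed because the language is infinite. The function name `generate`, the `max_len` parameter, and the table encoding are assumptions for illustration.

```python
from collections import deque

# Hypothetical transition table for the sheep FSA (/baa+!/).
SHEEP_DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}


def generate(delta, start, accept, max_len):
    """List all strings of length <= max_len that the machine accepts."""
    results = []
    queue = deque([(start, "")])          # (current state, string so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            results.append(prefix)
        if len(prefix) == max_len:        # do not grow past the bound
            continue
        for (src, symbol), dst in sorted(delta.items()):
            if src == state:
                queue.append((dst, prefix + symbol))
    return results


if __name__ == "__main__":
    print(generate(SHEEP_DELTA, 0, {4}, max_len=6))
```

The same table serves both the acceptor and the generator; only the code that walks it changes.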
Dollars and Cents

Summary
• Regular expressions are just a compact textual representation of FSAs
• Recognition is the process of determining if a string/input is in the language defined by some machine.
  – Recognition is straightforward with deterministic machines.

Three Views
• Three equivalent formal ways to look at what we’re up to (thanks to Martin Kay)
  – Regular Expressions
  – Finite State Automata
  – Regular Languages
Regular Languages
• More on these in a couple of weeks
  S → b a a A
  A → a A
  A → !
Non-Determinism

Non-Determinism cont.
• Yet another technique
  – Epsilon (ε) transitions
  – Key point: these transitions do not examine or advance the tape during recognition

Equivalence
• Non-deterministic machines can be converted to deterministic ones with a fairly simple construction
• That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones
• It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one.
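The slides do not spell out the conversion, so the following is a sketch of the standard subset construction, in which each deterministic state is a set of non-deterministic states. Everything named here (`EPS`, `epsilon_closure`, `determinize`, `run_dfa`, and the example table `NFSA_DELTA`) is a hypothetical choice for illustration. The example machine is a non-deterministic sheep FSA with two `a` arcs leaving state 2.

```python
EPS = ""  # assumed marker for epsilon arcs

# Hypothetical non-deterministic sheep FSA: (state, symbol) -> set of states.
NFSA_DELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the non-deterministic choice
    (3, "!"): {4},
}


def epsilon_closure(states, delta):
    """All states reachable from `states` using epsilon arcs only."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)


def determinize(delta, start, accept, alphabet):
    """Subset construction: returns (dfa_delta, dfa_start, dfa_accept)."""
    dfa_start = epsilon_closure({start}, delta)
    dfa_delta, dfa_accept = {}, set()
    seen, todo = {dfa_start}, [dfa_start]
    while todo:
        current = todo.pop()
        if current & accept:                  # contains an NFSA accept state
            dfa_accept.add(current)
        for symbol in alphabet:
            moved = set()
            for q in current:
                moved |= delta.get((q, symbol), set())
            if not moved:
                continue                      # no arc: leave the cell empty
            target = epsilon_closure(moved, delta)
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return dfa_delta, dfa_start, dfa_accept


def run_dfa(tape, dfa_delta, dfa_start, dfa_accept):
    """Deterministic recognition over the constructed machine."""
    state = dfa_start
    for symbol in tape:
        state = dfa_delta.get((state, symbol))
        if state is None:
            return False
    return state in dfa_accept


if __name__ == "__main__":
    dfa = determinize(NFSA_DELTA, 0, {4}, {"b", "a", "!"})
    print(run_dfa("baaa!", *dfa), run_dfa("ba!", *dfa))
```

For this small machine the deterministic result has the same number of states as the original. In general the number of subsets can grow exponentially.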
Non-Deterministic Recognition
• In an ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine.
• But not all paths through the machine for an accepted string lead to an accept state.
• No paths through the machine lead to an accept state for a string not in the language.

Non-Deterministic Recognition
• So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept state.
• Failure occurs when none of the possible paths lead to an accept state.
Example
[Figure: non-deterministic sheep FSA with states q0, q1, q2, q3, q4 and arcs labeled b, a, a, a, !]
Key Points
• States in the search space are pairings of tape positions and states in the machine.
• By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

ND-Recognize Code
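The slide's code is not reproduced in this text, so here is a minimal agenda-based sketch of the idea: search states are (machine state, tape position) pairs, and unexplored pairs wait on an agenda. The names `nd_recognize`, `EPS`, and `NFSA_DELTA` are hypothetical. The `visited` set is an addition of this sketch, included so that epsilon cycles cannot trap the search; it is not claimed to be part of the original algorithm.

```python
EPS = ""  # assumed marker for epsilon arcs

# Hypothetical non-deterministic sheep FSA with an epsilon arc from 3 back to 2.
NFSA_DELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, EPS): {2},
    (3, "!"): {4},
}


def nd_recognize(tape, delta, start, accept):
    """Search over (state, tape index) pairs until an accepting one is found."""
    agenda = [(start, 0)]          # unexplored search states
    visited = set()                # added guard against revisiting a pair
    while agenda:
        node = agenda.pop()        # LIFO agenda: depth-first exploration
        if node in visited:
            continue
        visited.add(node)
        state, index = node
        if index == len(tape) and state in accept:
            return True            # end of tape in an accept state
        # Epsilon arcs change the state without advancing the tape.
        for nxt in delta.get((state, EPS), ()):
            agenda.append((nxt, index))
        # Ordinary arcs consume the current symbol.
        if index < len(tape):
            for nxt in delta.get((state, tape[index]), ()):
                agenda.append((nxt, index + 1))
    return False                   # every path was tried and none accepted


if __name__ == "__main__":
    for s in ["baa!", "baaa!", "ba!"]:
        print(s, nd_recognize(s, NFSA_DELTA, 0, {4}))
```

Swapping the list for a queue would turn the same code into breadth-first search; the choice of agenda discipline sets the exploration order, not the answer.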
Infinite Search
• If you’re not careful such searches can go into an infinite loop.
• How?

Why Bother?
• Non-determinism doesn’t get us more formal power and it causes headaches, so why bother?
  – Non-deterministic machines are often more natural solutions
  – Deterministic machines produced by the conversion construction can be too big
Regular languages
• The class of languages characterizable by regular expressions
• Given alphabet Σ, the reg. lgs. over Σ are:
  – The empty set ∅ is a regular language
  – For each a ∈ Σ ∪ {ε}, {a} is a regular language
  – If L1 and L2 are regular lgs, then so are:
    • L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 & L2
    • L1 ∪ L2, the union of L1 and L2
    • L1*, the Kleene closure of L1
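The three operations in the definition above can be tried out on small finite languages represented as Python sets of strings. This is a sketch under stated assumptions: the function names are made up, and since the Kleene closure is an infinite set, `kleene` takes a hypothetical `max_len` bound and returns only the strings up to that length.

```python
def concat(l1, l2):
    """L1 . L2 = {xy | x in L1, y in L2}."""
    return {x + y for x in l1 for y in l2}


def union(l1, l2):
    """L1 U L2."""
    return l1 | l2


def kleene(l1, max_len):
    """Strings of L1* with length <= max_len (the full closure is infinite)."""
    result = {""}                  # zero repetitions gives the empty string
    frontier = {""}
    while frontier:
        frontier = {
            x + y for x in frontier for y in l1 if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


if __name__ == "__main__":
    # Sheep language /baa+!/ rebuilt as "baa" . a* . "!", truncated at two extra a's.
    sheep = concat(concat({"baa"}, kleene({"a"}, 2)), {"!"})
    print(sorted(sheep, key=len))
```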
Going from regexp to FSA
• Since all regular lgs meet the above properties
• And reg lgs are the lgs characterizable by regular expressions
• All regular expression operators can be implemented by combinations of concatenation, disjunction (union), and closure
  – Counters (*, +) are repetition plus closure
  – Anchors are individual symbols
  – [] and () and . are kinds of disjunction

Going from regexp to FSA
• So if we could just show how to turn closure/union/concat from regexps to FSAs, this would give an idea of how FSA compilation works.
• The actual proof that reg lgs = FSAs has 2 parts
  – An FSA can be built for each regular lg
  – A regular lg can be built for each automaton
• So I’ll give the intuition of the first part:
  – Take any regular expression and build an automaton
  – Intuition: induction
    • Base case: build an automaton for a single symbol (say ‘a’)
    • Inductive step: show how to imitate the 3 regexp operations in automata
Union
• Accept a string in either of two languages

Concatenation
• Accept a string consisting of a string from language L1 followed by a string from language L2.
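One common way to imitate the regexp operations in automata is to glue machines together with epsilon arcs. The sketch below follows that idea and is not taken from the slide figures, which are not reproduced in this text. All names (`NFA`, `literal`, `concat`, `union`, `closure`, `accepts`) are hypothetical, and each machine is assumed to have exactly one accept state to keep the gluing simple.

```python
from dataclasses import dataclass
from itertools import count

EPS = ""            # assumed marker for epsilon arcs
_fresh = count()    # source of unused state numbers


@dataclass
class NFA:
    start: int
    accept: int     # single accept state (a simplifying assumption)
    arcs: dict      # (state, label) -> set of next states


def _add(arcs, src, label, dst):
    arcs.setdefault((src, label), set()).add(dst)


def _merge(m1, m2):
    arcs = {key: set(targets) for key, targets in m1.arcs.items()}
    for key, targets in m2.arcs.items():
        arcs.setdefault(key, set()).update(targets)
    return arcs


def literal(a):
    """Base case: a two-state machine accepting the single symbol a."""
    s, f = next(_fresh), next(_fresh)
    arcs = {}
    _add(arcs, s, a, f)
    return NFA(s, f, arcs)


def concat(m1, m2):
    """Epsilon arc from m1's accept state to m2's start state."""
    arcs = _merge(m1, m2)
    _add(arcs, m1.accept, EPS, m2.start)
    return NFA(m1.start, m2.accept, arcs)


def union(m1, m2):
    """New start branches into both machines; both accepts feed a new accept."""
    s, f = next(_fresh), next(_fresh)
    arcs = _merge(m1, m2)
    for m in (m1, m2):
        _add(arcs, s, EPS, m.start)
        _add(arcs, m.accept, EPS, f)
    return NFA(s, f, arcs)


def closure(m):
    """Kleene star: allow skipping m entirely or looping back to repeat it."""
    s, f = next(_fresh), next(_fresh)
    arcs = {key: set(targets) for key, targets in m.arcs.items()}
    _add(arcs, s, EPS, m.start)
    _add(arcs, s, EPS, f)                 # zero repetitions
    _add(arcs, m.accept, EPS, m.start)    # another repetition
    _add(arcs, m.accept, EPS, f)
    return NFA(s, f, arcs)


def accepts(m, tape):
    """Agenda search over (state, tape index) pairs."""
    agenda, visited = [(m.start, 0)], set()
    while agenda:
        node = agenda.pop()
        if node in visited:
            continue
        visited.add(node)
        state, i = node
        if i == len(tape) and state == m.accept:
            return True
        for nxt in m.arcs.get((state, EPS), ()):
            agenda.append((nxt, i))
        if i < len(tape):
            for nxt in m.arcs.get((state, tape[i]), ()):
                agenda.append((nxt, i + 1))
    return False


if __name__ == "__main__":
    # /baa+!/ assembled as b . a . a . a* . !
    sheep = concat(
        concat(concat(literal("b"), literal("a")), literal("a")),
        concat(closure(literal("a")), literal("!")),
    )
    print(accepts(sheep, "baaa!"), accepts(sheep, "ba!"))
```

Each call to `literal` allocates fresh state numbers, so a sub-machine should be built anew each time it is used rather than shared between two operands.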
FSAs and Computational Morphology
• An important use of FSAs is for morphology, the study of word parts
• We’ll just have time for a quick overview today
• This is the exact topic of LING 239 F, being offered this quarter!

English Morphology
• Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
• We can usefully divide morphemes into two classes
  – Stems: The core meaning bearing units
  – Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

Nouns and Verbs (English)
• Nouns are simple (not really)
  – Markers for plural and possessive
• Verbs are only slightly more complex
  – Markers appropriate to the tense of the verb

Regulars and Irregulars
• Ok so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules)
  – Mouse/mice, goose/geese, ox/oxen
  – Go/went, fly/flew
• The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.

Regular and Irregular Nouns and Verbs
• Regulars…
  – Walk, walks, walking, walked
  – Table, tables
• Irregulars
  – Eat, eats, eating, ate, eaten
  – Catch, catches, catching, caught
  – Cut, cuts, cutting, cut
  – Goose, geese

Compute
• Many paths are possible…
• Start with compute
  – Computer -> computerize -> computerization
  – Computation -> computational
  – Computer -> computerize -> computerizable
  – Compute -> computee
Why care about morphology?
• ‘Stemming’ in information retrieval
  – Might want to search for “aardvark” and find pages with both “aardvark” and “aardvarks”
• Morphology in machine translation
  – Need to know that the Spanish words quiero and quieres are both related to querer ‘want’
• Morphology in spell checking
  – Need to know that misclam and antiundoggingly are not words despite being made up of word parts

Can’t just list all words
• Turkish
• Uygarlastiramadiklarimizdanmissinizcasina
• ‘(behaving) as if you are among those whom we could not civilize’
• Uygar ‘civilized’ + las ‘become’ + tir ‘cause’ + ama ‘not able’ + dik ‘past’ + lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’

What we want
• Something to automatically do the following kinds of mappings:
  – Cats → cat +N +PL
  – Cat → cat +N +SG
  – Cities → city +N +PL
  – Merging → merge +V +Present-participle
  – Caught → catch +V +past-participle

FSAs and the Lexicon
• This will actually require a kind of FSA we won’t be studying: the Finite State Transducer (FST)
• But we’ll give a quick overview anyhow
• First we’ll capture the morphotactics
  – The rules governing the ordering of affixes in a language.
• Then we’ll add in the actual words
Simple Rules

Adding the Words

Derivational Rules
Parsing/Generation vs. Recognition
• Recognition is usually not quite what we need.
  – Usually if we find some string in the language we need to find the structure in it (parsing)
  – Or we have some structure and we want to produce a surface form (production/generation)
• Example
  – From “cats” to “cat +N +PL”

Finite State Transducers
• The simple story
  – Add another tape
  – Add extra symbols to the transitions
  – On one tape we read “cats”, on the other we write “cat +N +PL”

Transitions
c:c  a:a  t:t  +N:ε  +PL:s
• c:c means read a c on one tape and write a c on the other
• +N:ε means read a +N symbol on one tape and write nothing on the other
• +PL:s means read +PL and write an s
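The arcs listed above can be sketched as a tiny transducer that runs in both directions. This is an illustration only: the table `ARCS`, the state numbering, the functions `generate` and `parse`, and the extra `+SG:ε` arc are assumptions added for this example and do not come from the slide.

```python
EPS = ""  # assumed marker for "write nothing"

# Arcs: (state, lexical symbol) -> (surface output, next state).
ARCS = {
    (0, "c"): ("c", 1),
    (1, "a"): ("a", 2),
    (2, "t"): ("t", 3),
    (3, "+N"): (EPS, 4),
    (4, "+PL"): ("s", 5),
    (4, "+SG"): (EPS, 5),   # assumed arc, not on the slide
}
FINAL = {5}


def generate(lexical):
    """Lexical symbols in, surface string out (or None if rejected)."""
    state, output = 0, []
    for symbol in lexical:
        if (state, symbol) not in ARCS:
            return None
        written, state = ARCS[(state, symbol)]
        output.append(written)
    return "".join(output) if state in FINAL else None


def parse(surface):
    """Run the same arcs in reverse: surface string in, lexical analyses out."""
    analyses = []
    agenda = [(0, 0, [])]               # (state, surface index, symbols so far)
    while agenda:
        state, i, lexical = agenda.pop()
        if i == len(surface) and state in FINAL:
            analyses.append(lexical)
        for (src, symbol), (written, dst) in ARCS.items():
            # An arc applies if its output matches the surface at position i.
            if src == state and surface.startswith(written, i):
                agenda.append((dst, i + len(written), lexical + [symbol]))
    return analyses


if __name__ == "__main__":
    print(generate(["c", "a", "t", "+N", "+PL"]))
    print(parse("cats"))
```

Because this machine has no cycles, the reverse search needs no visited set; a transducer with loops of ε-output arcs would need one.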
Lexical to Intermediate Level
Some on-line demos
• Finite state automata demos
  – http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
• Finite state morphology
  – http://www.xrce.xerox.com/competencies/content-analysis/demos/english
• Some other downloadable FSA tools:
  – http://www.research.att.com/sw/tools/fsm/