Regular Expressions and Automata Chapter 2 Regular Expressions
- Slides: 31
Regular Expressions
• Standard notation for characterizing text sequences
• Used in all kinds of text processing and information extraction tasks
• Over time, the RE dialects used in various tools and languages (grep, Emacs, Python, Ruby, Java, …) have become very similar
11/5/2020 Speech and Language Processing - Jurafsky and Martin
Regular Expressions
• We’ll look at a few examples [in lecture], make a note about types of errors, and then move toward automata
Example
• Find all the instances of the word “the” in a text.
  ◦ /the/
  ◦ /[tT]he/
  ◦ /\b[tT]he\b/
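
The three patterns can be compared on a small input. The following is a minimal sketch using Python's `re` module; the sample sentence and the helper name `count_matches` are assumptions made for illustration, not part of the slides.

```python
import re

# Hypothetical sample text, chosen so that each pattern behaves differently.
TEXT = "The cat saw the other bathers there."


def count_matches(pattern: str, text: str = TEXT) -> int:
    """Count the non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))


# /the/ misses the capitalized "The" and also fires inside
# "other", "bathers", and "there".
naive = count_matches(r"the")

# /[tT]he/ recovers "The" but still fires inside longer words.
either_case = count_matches(r"[tT]he")

# /\b[tT]he\b/ requires a word boundary on both sides.
whole_word = count_matches(r"\b[tT]he\b")

print(naive, either_case, whole_word)  # prints: 4 5 2
```

Only the last pattern returns exactly the two standalone occurrences of the word in this sample.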
Errors
• We fixed two kinds of errors
  ◦ Matching strings that we should not have matched
    ▪ False positives (Type I)
  ◦ Not matching things that we should have matched
    ▪ False negatives (Type II)
Errors
• We’ll see the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts:
  ◦ Increasing precision (minimizing false positives)
  ◦ Increasing coverage, or recall (minimizing false negatives)
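
Precision and recall can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below assumes a hand-labeled list of gold spans for the word "the" in a made-up sentence; `precision_recall`, `TEXT`, and `GOLD` are illustrative names, and returning 1.0 when a denominator is empty is a convention chosen here.

```python
import re

# Hypothetical sample text and hand-labeled (start, end) spans of the word "the".
TEXT = "The cat saw the other bathers there."
GOLD = [(0, 3), (12, 15)]


def precision_recall(pattern, text, gold_spans):
    """Return (precision, recall) of a pattern against gold spans."""
    found = {m.span() for m in re.finditer(pattern, text)}
    gold = set(gold_spans)
    tp = len(found & gold)   # correct matches
    fp = len(found - gold)   # false positives: matched but should not have
    fn = len(gold - found)   # false negatives: should have matched but did not
    precision = tp / (tp + fp) if found else 1.0
    recall = tp / (tp + fn) if gold else 1.0
    return precision, recall


for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    print(pattern, precision_recall(pattern, TEXT, GOLD))
```

On this sample, `/the/` has low precision and incomplete recall, `/[tT]he/` fixes recall but not precision, and `/\b[tT]he\b/` fixes both.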
Formal Languages and Models
• Language: a (possibly infinite) set of strings made up of symbols from a finite alphabet
• Model of a language: can recognize and generate all and only the strings from the language
  ◦ Serves as a definition of the formal language
Chomsky Hierarchy
• Regular language
  ◦ Model: regular expressions, finite state automata
• Context free language
• Context sensitive language
• Unrestricted language
  ◦ Model: Turing machine
Regular Expressions and Languages
• A regular expression pattern can be mapped to a set of strings
• A regular expression pattern defines a language (in the formal sense); any language that can be defined this way is called a regular language
Finite State Automata
• FSAs and their probabilistic relatives are at the core of much of what we’ll be doing all semester.
• They also capture significant aspects of what linguists say we need for morphology and parts of syntax.
• They are formally equivalent to regular expressions
Formal Definition of a Finite Automaton
1. Finite set of states, typically Q.
2. Alphabet of input symbols, typically Σ.
3. One state is the start/initial state, typically q0 // q0 ∈ Q
4. Zero or more final/accepting states; the set is typically F. // F ⊆ Q
5. A transition function, typically δ. This function
   • Takes a state and input symbol as arguments.
   • Returns a state.
   • One “rule” would be written δ(q, a) = p, where q and p are states, and a is an input symbol.
   • Intuitively: if the FA is in state q, and input a is received, then the FA goes to state p (note: q = p OK).
6. A FA is represented as the five-tuple: A = (Q, Σ, δ, q0, F). Here, F is a set of accepting states.
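
The five-tuple maps naturally onto a small data structure. The following is a sketch under assumptions: the class name `FiniteAutomaton`, its field names, and the toy machine are all hypothetical, and δ is stored as a dictionary from (state, symbol) pairs to states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiniteAutomaton:
    states: frozenset     # Q
    alphabet: frozenset   # Σ
    delta: dict           # δ, stored as {(q, a): p}
    start: str            # q0
    accepting: frozenset  # F

    def is_well_formed(self) -> bool:
        """Check q0 ∈ Q, F ⊆ Q, and that δ only mentions known states and symbols."""
        if self.start not in self.states:
            return False
        if not self.accepting <= self.states:
            return False
        return all(
            q in self.states and a in self.alphabet and p in self.states
            for (q, a), p in self.delta.items()
        )


# Toy machine (an illustration only): accepts one or more a's.
toy = FiniteAutomaton(
    states=frozenset({"q0", "q1"}),
    alphabet=frozenset({"a"}),
    delta={("q0", "a"): "q1", ("q1", "a"): "q1"},  # note δ(q1, a) = q1, so q = p is fine
    start="q0",
    accepting=frozenset({"q1"}),
)
print(toy.is_well_formed())
```

Storing δ as a partial dictionary means a missing entry is read as "no transition", which is a design choice of this sketch rather than part of the formal definition.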
A Simple Example
• Language: “Sheepish”
• Any string that starts with the letter b, followed by two or more a’s, and ending in !
• {“baa!”, “baaaa!”, “baaaaa!”, …}
• Regular expression for this?
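
One regular expression that fits this description is `baa+!`. Here is a minimal check in Python; the helper name `is_sheepish` is an assumption, and `fullmatch` is used so that the whole string, not just a substring, must match.

```python
import re

# b, then at least two a's (one literal a plus a+), then !
SHEEPISH = re.compile(r"baa+!")


def is_sheepish(s: str) -> bool:
    """True if the entire string s is in the Sheepish language."""
    return SHEEPISH.fullmatch(s) is not None


for s in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
    print(repr(s), is_sheepish(s))
```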
One Possible Sheepish FSA
• Formal definition of this FSA?
FSA as a Recognizer
• Does a string belong to its language?
1. Place the input string on a tape, and point at the start of the tape
2. Initialize the current state to q0
3. Iteratively check the next letter on the tape
   a. From the current state, if an outgoing arc label matches the new letter, move to the new state
   b. If stuck, REJECT
4. If the end of the tape is reached in a final state, ACCEPT; otherwise, REJECT
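
The recognition steps can be sketched as a short loop over the tape. The five-state Sheepish machine below is an assumption (one layout consistent with the language definition, not necessarily the one drawn on the slide), and `d_recognize` is an illustrative name.

```python
def d_recognize(tape, delta, start, accepting):
    """Deterministic recognition: follow one arc per input symbol."""
    state = start                      # step 2: initialize to q0
    for symbol in tape:                # step 3: read the tape left to right
        nxt = delta.get((state, symbol))
        if nxt is None:                # stuck: no outgoing arc matches
            return False               # REJECT
        state = nxt
    return state in accepting          # step 4: ACCEPT only in a final state


# Assumed Sheepish machine: 0 -b-> 1 -a-> 2 -a-> 3 -!-> 4, with an a-loop on 3.
SHEEP_DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
SHEEP_FINAL = {4}

for s in ["baa!", "baaaaa!", "ba!", "baa"]:
    print(repr(s), d_recognize(s, SHEEP_DELTA, 0, SHEEP_FINAL))
```

Note that "baa" is rejected even though the machine never gets stuck: the tape ends in state 3, which is not final.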
Recognition
• Traditionally (following Turing’s notion), this process is depicted with a tape.
FSA as Generator
• An FSA can also produce strings in the language it represents
1. Start from q0
2. Pick an outgoing arc to a new state (for now, assume picking randomly) and print the symbol on the arc
3. Follow the arc to the new state
4. Repeat from step 2 until reaching a final state
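
The generation procedure can be sketched as a random walk over the arcs. As before, the Sheepish transition table is an assumed layout and `generate` is an illustrative name; the sketch also assumes every non-final state has at least one outgoing arc.

```python
import random


def generate(delta, start, accepting, rng):
    """Random walk from the start state until a final state is reached."""
    state, output = start, []
    while state not in accepting:
        # Collect the arcs leaving the current state.
        arcs = [(sym, dst) for (src, sym), dst in delta.items() if src == state]
        symbol, state = rng.choice(arcs)   # step 2: pick an arc at random
        output.append(symbol)              # "print" the symbol on the arc
    return "".join(output)


# Assumed Sheepish machine: 0 -b-> 1 -a-> 2 -a-> 3 -!-> 4, with an a-loop on 3.
SHEEP_DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

rng = random.Random(0)
samples = [generate(SHEEP_DELTA, 0, {4}, rng) for _ in range(5)]
print(samples)
```

Because the walk stops at the first final state it reaches, a machine whose final states have outgoing arcs would need a different stopping rule to generate every string in its language.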
FSAs and Regular Expressions
• These are formally equivalent.
• Both of these classes of models recognize/generate exactly the class of regular languages
• Interesting proofs: constructive! Given any regular expression, create an equivalent FSA; given any FSA, create an equivalent regular expression
Note on Practical Regular Expression Utilities
• NOTE: additional features added to practical regular expression utilities can make them more powerful than formal regular expressions; think of features that require memory of what was matched earlier
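
Backreferences are one such memory feature: the pattern can demand a repeat of whatever an earlier group matched, which goes beyond what a finite automaton can express in general. A small sketch with Python's `re` module follows; the helper name `doubled_word` and the sample sentences are assumptions.

```python
import re

# \1 must repeat exactly the text captured by the first group.
DOUBLED = re.compile(r"\b(\w+) \1\b")


def doubled_word(text: str):
    """Return the first immediately repeated word in text, or None."""
    match = DOUBLED.search(text)
    return match.group(1) if match else None


print(doubled_word("this is the the example"))  # prints: the
print(doubled_word("this is fine"))             # prints: None
```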
About Alphabets
• Don’t take the term alphabet too narrowly; it just means we need a finite set of symbols in the input.
• These symbols can and will stand for bigger objects that can have internal structure.
Often there is more than one FSA for a given language
• E.g., here is another FSA for “Sheepish”
Yet Another View
• The guts of FSAs can ultimately be represented as tables
• If you’re in state 1 and you’re looking at an a, go to state 2

State   b    a      !    ε
0       1    -      -    -
1       -    2      -    -
2       -    2, 3   -    -
3       -    -      4    -
4       -    -      -    -
(- = no transition)
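
A state-transition table translates directly into a nested dictionary. The sketch below assumes the cell layout that the flattened slide text suggests (state 2 goes to both 2 and 3 on an a); `TABLE` and `next_states` are illustrative names.

```python
# Rows are states, columns are input symbols, cells are sets of next states.
# Missing entries mean "no transition".
TABLE = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},   # two choices on the same symbol
    3: {"!": {4}},
    4: {},
}


def next_states(state, symbol):
    """Look up the cell for (state, symbol); empty set if there is no arc."""
    return TABLE[state].get(symbol, set())


# "If you're in state 1 and you're looking at an a, go to state 2"
print(next_states(1, "a"))
print(next_states(2, "a"))
print(next_states(0, "a"))
```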
Deterministic versus Non-Deterministic FSAs
• Deterministic means that at each point in processing there is always one unique thing to do (no choices).
• Non-deterministic means there are choices
• Go back and look at previous DFA
• How do deterministic and nondeterministic FSAs compare?
Non-Deterministic FSAs
• May include
  ◦ Epsilon transitions
  ◦ Key point: these transitions do not examine or advance the tape during recognition
ND Recognition
• Two basic approaches
1. Either take a non-deterministic machine, convert it to a deterministic machine, and then do recognition with that.
2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).
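
The first approach is commonly done with the subset construction, in which each deterministic state stands for a set of non-deterministic states. The sketch below names that technique explicitly (the slides do not specify one), assumes the machine has no epsilon transitions, and uses an assumed non-deterministic Sheepish table; `determinize` and `accepts_dfa` are illustrative names.

```python
def determinize(table, start, accepting, alphabet):
    """Subset construction for an NFA without epsilon transitions."""
    start_set = frozenset([start])
    dfa_delta, dfa_accepting = {}, set()
    todo, seen = [start_set], {start_set}
    while todo:
        current = todo.pop()
        if current & accepting:            # contains an NFA final state
            dfa_accepting.add(current)
        for symbol in alphabet:
            target = frozenset(
                nxt for q in current for nxt in table.get(q, {}).get(symbol, ())
            )
            if not target:
                continue                   # no arc on this symbol
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return dfa_delta, start_set, dfa_accepting


def accepts_dfa(tape, dfa_delta, start_set, dfa_accepting):
    """Run the deterministic machine produced by determinize."""
    state = start_set
    for symbol in tape:
        state = dfa_delta.get((state, symbol))
        if state is None:
            return False
    return state in dfa_accepting


# Assumed non-deterministic Sheepish machine: state 2 has two arcs on "a".
ND_SHEEP = {0: {"b": {1}}, 1: {"a": {2}}, 2: {"a": {2, 3}}, 3: {"!": {4}}}
DFA = determinize(ND_SHEEP, 0, {4}, ["b", "a", "!"])
print(accepts_dfa("baaa!", *DFA), accepts_dfa("ba!", *DFA))
```

In general the deterministic machine can have exponentially more states than the original, although for this small example it has the same number.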
Non-Deterministic Recognition: Search
• In a non-deterministic FSA there exists at least one path through the machine for a string that is in the language defined by the machine.
• But not all paths through the machine for an accepted string lead to an accept state.
• If a string is not in the language, there are no paths through the machine that lead to an accept state
Non-Deterministic Recognition
• So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept state.
• Failure occurs when all of the possible paths for a given string lead to failure.
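
Recognition as search can be sketched with an agenda of (state, tape position) pairs: success means some pair reaches the end of the tape in a final state, and failure means the agenda runs empty. The table format, the use of the empty string `""` as the epsilon label, and the name `nd_recognize` are assumptions of this sketch.

```python
def nd_recognize(tape, table, start, accepting):
    """Depth-first search over (state, position) pairs."""
    agenda = [(start, 0)]
    seen = set()
    while agenda:
        state, pos = agenda.pop()
        if (state, pos) in seen:
            continue                      # already explored this search state
        seen.add((state, pos))
        if pos == len(tape) and state in accepting:
            return True                   # one successful path is enough
        arcs = table.get(state, {})
        for nxt in arcs.get("", ()):      # epsilon arcs do not advance the tape
            agenda.append((nxt, pos))
        if pos < len(tape):
            for nxt in arcs.get(tape[pos], ()):
                agenda.append((nxt, pos + 1))
    return False                          # every path led to failure


# Assumed non-deterministic Sheepish machine: state 2 has two arcs on "a".
ND_SHEEP = {0: {"b": {1}}, 1: {"a": {2}}, 2: {"a": {2, 3}}, 3: {"!": {4}}}

for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(repr(s), nd_recognize(s, ND_SHEEP, 0, {4}))
```

Using a stack for the agenda gives depth-first search; swapping it for a queue would give breadth-first search without changing which strings are accepted.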
Example (figure not reproduced in this transcript)
Why Non-Determinism?
• Non-determinism doesn’t get us more formal power, and it causes headaches, so why bother?
  ◦ More natural (understandable) solutions
Compositional Machines
• Formal languages are just sets of strings
• Therefore, we can talk about various set operations (intersection, union, concatenation)
• We’ll just do a couple
Union (figure not reproduced in this transcript)
Concatenation (figure not reproduced in this transcript)
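
Union and concatenation can both be built by gluing two machines together with epsilon transitions. The sketch below is one common construction, written under assumptions: machines are (table, start, accepting) triples, `""` labels an epsilon arc, states are renamed with a tag to keep the two machines apart, and all function names and the two toy machines are hypothetical.

```python
def rename(nfa, tag):
    """Prefix every state with tag so two machines cannot share state names."""
    table, start, accepting = nfa
    new_table = {
        (tag, q): {sym: {(tag, p) for p in dests} for sym, dests in arcs.items()}
        for q, arcs in table.items()
    }
    return new_table, (tag, start), {(tag, q) for q in accepting}


def union(n1, n2):
    """New start state with epsilon arcs to both original start states."""
    t1, s1, f1 = rename(n1, "L")
    t2, s2, f2 = rename(n2, "R")
    return {**t1, **t2, "start": {"": {s1, s2}}}, "start", f1 | f2


def concat(n1, n2):
    """Epsilon arcs from each final state of n1 to the start state of n2."""
    t1, s1, f1 = rename(n1, "L")
    t2, s2, f2 = rename(n2, "R")
    table = {**t1, **t2}
    for q in f1:
        table.setdefault(q, {}).setdefault("", set()).add(s2)
    return table, s1, f2


def accepts(tape, nfa):
    """Search-based recognition that follows epsilon arcs without reading input."""
    table, start, accepting = nfa
    agenda, seen = [(start, 0)], set()
    while agenda:
        state, pos = agenda.pop()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(tape) and state in accepting:
            return True
        arcs = table.get(state, {})
        agenda.extend((nxt, pos) for nxt in arcs.get("", ()))
        if pos < len(tape):
            agenda.extend((nxt, pos + 1) for nxt in arcs.get(tape[pos], ()))
    return False


# Toy machines (illustrations only): AB accepts "ab", C accepts "c".
AB = ({0: {"a": {1}}, 1: {"b": {2}}}, 0, {2})
C = ({0: {"c": {1}}}, 0, {1})

print(accepts("c", union(AB, C)), accepts("abc", concat(AB, C)))
```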