LING 6932 Topics in Computational Linguistics Hana Filip

Lecture 2: Regular Expressions, Finite State Automata
LING 6932 Spring 2007

Regular Expressions
  Formulas for specifying text strings
  How can we search for any of these strings?
    woodchucks
    Woodchucks
  Figure from Dorr/Monz slides

Regular Expressions
  Basic patterns of regular expressions
  Perl-based syntax (slightly different from other notations for regular expressions, as used in UNIX, for example)
  /Woodchuck/ matches any string containing the substring Woodchuck (for example, if your search application returns entire lines)
  The '/' notation is used by Perl; it is NOT part of the RE
  Google: Woodchuck Draft Cider
    Producers of Woodchuck Draft Cider in Springfield, VT. www.woodchuck.com/ - 17k - Cached - Similar pages
  Slide from Dorr/Monz

Regular Expressions
  Regular expressions are CASE SENSITIVE
    The pattern /woodchuck/ will not match the string Woodchuck
  Disjunction
    /[wW]oodchuck/
  Slide from Dorr/Monz

Regular Expressions
  Ranges
    [A-Z]
  Slide from Dorr/Monz

Regular Expressions
  Negation
    /[^a]/    ^ (caret): 'match any single character except a'
  Slide from Dorr/Monz
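The bracket patterns so far can be tried out directly. The sketch below uses Python's re module rather than Perl (an assumption of this example, so the surrounding slashes are dropped); the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# Patterns are case sensitive: lowercase w does not match uppercase W.
case_sensitive = matches(r"woodchuck", "Woodchuck")
# Bracket disjunction: either w or W.
either_case = matches(r"[wW]oodchuck", "Woodchuck")
# Range: any single uppercase letter.
has_capital = matches(r"[A-Z]", "we saw Woodchucks")
# Negation: the first character that is not an a.
first_non_a = re.search(r"[^a]", "aardvark").group()

print(case_sensitive, either_case, has_capital, first_non_a)
```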

Regular Expressions
  Operators ?, * and +
    ?  (0 or 1)       /woodchucks?/   woodchuck or woodchucks
                      /colou?r/       color or colour
    *  (0 or more)    /oo*h!/         oh! or ooooh!
    +  (1 or more)    /o+h!/          oh! or ooooh!
  * and + are named after Stephen Cole Kleene (Kleene star, Kleene plus)
  Each operator applies to the immediately preceding character or regular expression
  Wild card .
    /beg.n/           begin or began or begun: any character between beg and n (except a carriage return)
  Slide from Dorr/Monz
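A small sketch of the four operators, again assuming Python's re module in place of Perl. The helper full is hypothetical; it uses re.fullmatch so that each test word has to match as a whole.

```python
import re

def full(pattern, word):
    """Return True if the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None

# ? makes the preceding s optional.
optional_s = [w for w in ["woodchuck", "woodchucks", "woodchuckss"] if full(r"woodchucks?", w)]
# oo* is one o followed by zero or more o's.
star = [w for w in ["h!", "oh!", "ooooh!"] if full(r"oo*h!", w)]
# o+ is one or more o's, so it accepts the same strings as oo*.
plus = [w for w in ["h!", "oh!", "ooooh!"] if full(r"o+h!", w)]
# . stands for exactly one character between beg and n.
wildcard = [w for w in ["begin", "began", "begun", "begn"] if full(r"beg.n", w)]

print(optional_s, star, plus, wildcard)
```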

Regular Expressions
  Anchors ^ and $
    ^  start of line    /^[A-Z]/     "Ramallah, Palestine"
                        /^[^A-Z]/    "¿verdad?" "really?"
    $  end of line      /\.$/        "It is over."
                        /.$/         ?
  Boundaries \b and \B
    \b  (boundary)        /\bon\b/   matches "on my way", but not "Monday"
    \B  (non-boundary)    /\Bon\b/   matches "automaton"
  Slide from Dorr/Monz
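The anchor and boundary examples can be checked with a short sketch, assuming Python's re module, whose ^, $, \b and \B behave like Perl's for these cases. The variable names are made up for illustration.

```python
import re

# ^ anchors the match at the start of the line.
starts_upper = bool(re.search(r"^[A-Z]", "Ramallah, Palestine"))
starts_non_upper = bool(re.search(r"^[^A-Z]", "¿verdad?"))

# \.$ requires a literal period at the end of the line.
ends_with_period = bool(re.search(r"\.$", "It is over."))
# An unescaped . before $ accepts any final character, not only a period.
ends_with_anything = bool(re.search(r".$", "It is over?"))

# \b demands a word boundary on both sides of "on".
word_on = [bool(re.search(r"\bon\b", s)) for s in ["on my way", "Monday"]]
# \B demands that "on" is NOT at the start of a word.
suffix_on = bool(re.search(r"\Bon\b", "automaton"))

print(starts_upper, starts_non_upper, ends_with_period, ends_with_anything, word_on, suffix_on)
```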

Disjunction, Grouping, Precedence
  Disjunction |
    /yours|mine/      "it is either yours or mine"
    /gupp(y|ies)/     "guppy" or "guppies"
  Column 1 Column 2 Column 3 … How do we express this?
    /Column␣[0-9]␣*/
    /(Column␣[0-9]␣*)*/
  ␣ stands for a 'space'; it is NOT a RE character
  The second pattern matches the word Column, followed by a space and one digit, followed by zero or more spaces, the whole pattern repeated any number of times (zero or more times)
  Slide from Dorr/Monz
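A sketch of grouping with parentheses, assuming Python's re module and a literal space where the slide writes ␣. Without the parentheses the pattern covers only the first column heading; with them, the final * repeats the whole group.

```python
import re

header = "Column 1 Column 2 Column 3"

# Without grouping: one heading plus its trailing spaces.
single = re.match(r"Column [0-9] *", header).group()
# With grouping: the parenthesized unit is repeated by the outer *.
repeated = re.match(r"(Column [0-9] *)*", header).group()

# Disjunction inside a group applies only to the suffix.
guppies = [w for w in ["guppy", "guppies", "guppys"] if re.fullmatch(r"gupp(y|ies)", w)]

print(repr(single), repr(repeated), guppies)
```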

Disjunction, Grouping, Precedence
  Operator Precedence Hierarchy (highest to lowest)
    Parenthesis              ()
    Counters                 * + ?
    Sequences and anchors    the ^my end$
    Disjunction              |
  REs are greedy! They always match the largest string they can
  Slide from Dorr/Monz
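The precedence hierarchy and greedy matching can be illustrated with a few probes. This is a sketch assuming Python's re module, which follows the same precedence order; the test strings are invented for the example.

```python
import re

# Counters bind tighter than sequences: the* is "th" plus zero or more e's.
counter_binds_to_e = re.fullmatch(r"the*", "theee") is not None
sequence_repeated = re.fullmatch(r"the*", "thethe") is not None
# Parentheses are needed to repeat the whole sequence.
grouped_repeat = re.fullmatch(r"(the)*", "thethe") is not None

# Sequences bind tighter than disjunction: the|any means "the" or "any".
alternation = [w for w in ["the", "any", "thany"] if re.fullmatch(r"the|any", w)]

# Greedy matching: each pattern takes the longest string it can.
greedy_word = re.search(r"[a-z]*", "once upon a time").group()
greedy_dot = re.search(r"a.*z", "a1z2z3").group()

print(counter_binds_to_e, sequence_repeated, grouped_repeat, alternation, greedy_word, greedy_dot)
```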

Example
  Find me all instances of the word "the" in a text.
    /the/                                Misses capitalized examples
    /[tT]he/                             Returns "other" or "theology"
    /\b[tT]he\b/                         Matches "the" or "The"
    /[^a-zA-Z][tT]he[^a-zA-Z]/
    /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/       Matches "the_" or "the25"
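The refinement steps can be replayed on a handful of test words. This is a sketch assuming Python's re module; the candidate list and the carrier sentence "say ... now" are invented for the example, so that every candidate has text on both sides.

```python
import re

PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]
CANDIDATES = ["the", "The", "other", "theology", "the_", "the25"]

def hits(pattern):
    """Candidates in which the pattern finds a match, each embedded in a sentence."""
    return [w for w in CANDIDATES if re.search(pattern, f"say {w} now")]

results = {p: hits(p) for p in PATTERNS}

# Only the last pattern also copes with "the" at the very start of a line.
line_initial = [bool(re.search(p, "the end")) for p in PATTERNS[3:]]

for pattern, found in results.items():
    print(pattern, found)
print(line_initial)
```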

Errors
  The process we just went through was based on fixing two kinds of errors:
    Not matching things that we should have matched (The): false negatives
    Matching strings that we should not have matched (there, then, other): false positives

Errors cont.
  We'll be telling the same story for many tasks
  Reducing the error rate for an application often involves two antagonistic efforts:
    Increasing accuracy (minimizing false positives)
    Increasing coverage (minimizing false negatives)
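The two error kinds can be counted mechanically once there is a labelled test set. The sketch below assumes Python's re module; the tiny gold standard GOLD and the function name errors are hypothetical, built from the words mentioned on the slides.

```python
import re

# Hypothetical gold standard: should this test word count as the word "the"?
GOLD = {"the": True, "The": True, "there": False, "then": False, "other": False}

def errors(pattern):
    """Return (false negatives, false positives) of a pattern on the gold set."""
    false_neg = [w for w, wanted in GOLD.items() if wanted and not re.search(pattern, w)]
    false_pos = [w for w, wanted in GOLD.items() if not wanted and re.search(pattern, w)]
    return false_neg, false_pos

for pattern in [r"the", r"[tT]he", r"\b[tT]he\b"]:
    print(pattern, errors(pattern))
```

On this toy set, adding [tT] removes the false negative (better coverage) and adding \b removes the false positives (better accuracy).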

More complex RE example
  Regular expressions for prices
    /\$[0-9]+/                            Doesn't deal with fractions of dollars
    /\$[0-9]+\.[0-9][0-9]/                Doesn't allow $199, not at a word boundary
    /(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/     A plain leading \b would fail here, because $ is not a word character
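A sketch of the price pattern, assuming Python's re module. The dollar sign is escaped because an unescaped $ is the end-of-line anchor, and the sample sentence is invented. The last line shows why a leading \b is a trap: there is no word boundary between a space and $.

```python
import re

# Dollars, optional two-digit cents, then a word boundary.
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")

text = "Tickets cost $199, or $24.99 with a coupon, not $5."
prices = [m.group() for m in PRICE.finditer(text)]

# A leading \b fails when the price follows a space.
leading_boundary = re.search(r"\b\$[0-9]+", "costs $199")

print(prices, leading_boundary)
```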

Advanced operators
  Regular expression operators for counting
    RE        Match
    {n}       exactly n occurrences of the previous character or expression
    {n,m}     from n to m occurrences of the previous character or expression
    {n,}      at least n occurrences of the previous character or expression
  /a\.{24}z/  a followed by 24 dots followed by z
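The counters can be probed by building strings of known length. A sketch assuming Python's re module, which uses the same {n}, {n,m} and {n,} syntax; the variable names are made up.

```python
import re

# {24} wants exactly 24 literal dots between a and z.
exactly = bool(re.fullmatch(r"a\.{24}z", "a" + "." * 24 + "z"))
too_few = bool(re.fullmatch(r"a\.{24}z", "a" + "." * 23 + "z"))

# Which run lengths of o (0 to 5) are accepted by each counter?
ranged = [n for n in range(6) if re.fullmatch(r"o{2,4}", "o" * n)]
at_least = [n for n in range(6) if re.fullmatch(r"o{3,}", "o" * n)]

print(exactly, too_few, ranged, at_least)
```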

Advanced operators
  To refer to characters that are themselves special, precede them with a backslash
    RE    Match                  Example strings matched
    \*    an asterisk "*"        "K*A*P*L*A*N"
    \.    a period "."           "Dr. Livingston, I presume."
    \?    a question mark "?"    "Would you light my candle?"
    \n    a newline
    \t    a tab

Advanced operators
  Slide from Dorr/Monz

Substitutions and Memory
  Substitution operator (UNIX, Perl)
    s/regexp1/regexp2/
    s/colour/color/g      Substitute as many times as possible!
    s/colour/color/i      Case insensitive matching
  Slide from Dorr/Monz
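For comparison, a sketch of the same substitutions with Python's re.sub (an assumption of this example; the slide itself shows Perl/UNIX syntax). Note one difference: re.sub replaces every match by default, so count=1 is needed to mimic a Perl substitution without the g flag. The sample text is invented.

```python
import re

text = "Colour me surprised: colour, colour everywhere."

# Like s/colour/color/ : first match only.
first_only = re.sub(r"colour", "color", text, count=1)
# Like s/colour/color/g : every match.
all_matches = re.sub(r"colour", "color", text)
# Like s/colour/color/gi : every match, ignoring case.
ignore_case = re.sub(r"colour", "color", text, flags=re.IGNORECASE)

print(first_only)
print(all_matches)
print(ignore_case)
```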

Substitutions and Memory
  Substitutions
    "the Xer they were, the Xer they will be"
    Constrain the two X's to be the same string
  Using numbered memories or registers: \1, \2, etc. refer back to matches inside the pattern ($1, $2, etc. in Perl after the match)
  An extended feature of regular expressions
    /the (.*)er they were, the \1er they will be/
    /the (.*)er they (.*), the \1er they \2/
  Slide from Dorr/Monz
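A sketch of numbered memories, assuming Python's re module, where \1 and \2 inside a pattern refer back to the first and second parenthesized group. The test sentences are invented.

```python
import re

ONE_MEMORY = r"the (.*)er they were, the \1er they will be"
TWO_MEMORIES = r"the (.*)er they (.*), the \1er they \2"

# Both X's are "bigg", so the back-reference succeeds.
same = re.search(ONE_MEMORY, "the bigger they were, the bigger they will be")
# "bigg" and "fast" differ, so there is no match.
different = re.search(ONE_MEMORY, "the bigger they were, the faster they will be")
# Two registers: \1 must repeat "fast" and \2 must repeat "ran".
two = re.search(TWO_MEMORIES, "the faster they ran, the faster they ran")

print(same.group(1), different, two.groups())
```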

Eliza [Weizenbaum, 1966]
  User:  Men are all alike
  ELIZA: IN WHAT WAY
  User:  They're always bugging us about something or other
  ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
  User:  Well, my boyfriend made me come here
  ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
  User:  He says I'm depressed much of the time
  ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Eliza-style regular expressions
  Step 1: replace first-person references with second-person references
    s/\bI('m| am)\b/YOU ARE/g
    s/\bmy\b/YOUR/g
    s/\bmine\b/YOURS/g
  Step 2: use substitutions that look for relevant patterns in the input and create an appropriate output (reply)
    s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/
    s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/
    s/.* all.*/IN WHAT WAY/
    s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
  Step 3: use scores to rank possible transformations
  Slide from Dorr/Monz
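The first two steps can be sketched in a few lines, assuming Python's re module. This is not Weizenbaum's program: the rule lists, the function name respond and the fallback reply are hypothetical, word boundaries are added around the keywords, and step 3 (scoring) is replaced by simply taking the first rule that fits.

```python
import re

# Step 1: swap first-person references for second-person ones.
PERSON_SWAPS = [
    (r"\bI('m| am)\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bmine\b", "YOURS"),
]

# Step 2: (pattern, reply template) pairs, tried in order.
REPLY_RULES = [
    (r".*\bYOU ARE (depressed|sad)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    """Return an Eliza-style reply to one user utterance."""
    text = utterance
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text)
    for pattern, template in REPLY_RULES:
        match = re.fullmatch(pattern, text)
        if match:
            # expand() fills \1 in the template with the captured word.
            return match.expand(template).upper()
    return "PLEASE GO ON"  # hypothetical fallback, not from the slides

print(respond("Men are all alike"))
print(respond("He says I'm depressed much of the time"))
```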

Summary on REs so far
  Regular expressions are perhaps the single most useful tool for text manipulation
    Dumb but ubiquitous
  Eliza: you can do a lot with simple regular-expression substitutions

Three Views
  Three equivalent formal ways to look at what we're up to (thanks to Martin Kay)
    Regular Expressions
    Finite State Automata
    Regular Languages

Finite State Automata
  Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata
  Regular expressions are one way of specifying the structure of finite-state automata.
  FSAs and their close relatives are at the core of most algorithms for speech and language processing.

Finite-state Automata (Machines)
  baa! baaaa! baaaaa! ...
  /^baa+!$/
  [Diagram: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with a self-loop on q3 labelled a; the figure labels a state, a transition, and the final state q4]
  Slide from Dorr/Monz

Sheep FSA
  We can say the following things about this machine:
    It has 5 states
    At least b, a, and ! are in its alphabet
    q0 is the start state
    q4 is the final (= accept) state
    It has 5 transitions

More Formally: Defining an FSA
  You can specify an FSA by enumerating the following things:
    a finite set of states: Q
    a finite alphabet of symbols: Σ
    the start state: q0
    the set of accepting/final states: F, such that F ⊆ Q
    a transition function δ(q, i) that maps Q × Σ to Q
  Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q, i) returns a new state q' ∈ Q.
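As an illustration, the sheep machine can be written out as exactly these five components. The sketch below is in Python; the names Q, SIGMA, START, FINAL, DELTA and is_well_formed are choices made for this example. One assumption to note: the dictionary is a partial function, so a missing (state, symbol) pair means there is no transition.

```python
# The sheep-language FSA as the five components of the definition.
Q = {"q0", "q1", "q2", "q3", "q4"}          # finite set of states
SIGMA = {"b", "a", "!"}                      # finite alphabet
START = "q0"                                 # start state
FINAL = {"q4"}                               # accepting states, a subset of Q
DELTA = {                                    # transition function: (state, symbol) -> state
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def is_well_formed(states, alphabet, start, final, delta):
    """Check the side conditions: start in Q, F a subset of Q, delta within Q x Sigma -> Q."""
    return (
        start in states
        and final <= states
        and all(q in states and i in alphabet and r in states
                for (q, i), r in delta.items())
    )

print(is_well_formed(Q, SIGMA, START, FINAL, DELTA), len(Q), len(DELTA))
```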

Yet Another View
  State-transition table

Recognition
  Recognition is the process of determining if a string should be accepted by a machine
  Or… it's the process of determining if a string is in the language we're defining with the machine
  Or… it's the process of determining if a regular expression matches a string

Recognition
  Traditionally (Turing's idea, 1936), this process is depicted with a tape.
  http://www.cs.princeton.edu/introcs/75turing/

Recognition - Execution
  Start in the start state
  Examine the current input in the active cell
  Consult the table: a finite table of instructions (a state transition diagram) that specifies exactly what action the machine takes at each step
  Go to a new state and update the tape pointer
  Repeat until you run out of tape
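The steps above can be sketched as a short table-driven loop. This is a minimal Python sketch of deterministic recognition; the function name d_recognize, its parameters and the dictionary encoding of the sheep table are assumptions of the example.

```python
# Transition table for the sheep language /baa+!/ (missing entries mean "no transition").
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def d_recognize(tape, table, start, accept_states):
    """Return True if the machine accepts the tape, False otherwise."""
    state = start                                # start in the start state
    for symbol in tape:                          # examine the active cell, then advance
        state = table.get((state, symbol))       # consult the table
        if state is None:                        # no legal move: reject
            return False
    return state in accept_states                # out of tape: accept only in a final state

for tape in ["baaa!", "aba!b", "baa"]:
    print(tape, d_recognize(tape, SHEEP_TABLE, "q0", {"q4"}))
```

Changing the machine only means passing a different table; the loop itself stays the same.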

Input Tape
  Tape:    b  a  a  a  !
  States:  q0 q1 q2 q3 q3 q4
  Result:  ACCEPT
  Slide from Dorr/Monz

Input Tape
  Tape:    a  b  a  !  b
  State:   q0 (there is no transition from q0 on a)
  Result:  REJECT
  Slide from Dorr/Monz

Adding a failing state
  [Diagram: the sheep FSA (q0 to q4) extended with a fail state qF; every input symbol that has no transition in the original machine leads to qF]
  Slide from Dorr/Monz

Tracing D-Recognize

Key Points
  Deterministic means that at each point in processing there is always one unique thing to do (no choices).
  D-recognize is a simple table-driven interpreter
  The algorithm is universal for all unambiguous regular languages.
    To change the machine, you change the table.

Key Points
  Deterministic Pattern Example: Consider a set of traffic lights; the sequence of lights is red - red/amber - green - amber - red.
  The sequence can be pictured as a state machine, where the different states of the traffic lights follow each other.
  Each state is dependent solely on the previous state, so if the lights are green, an amber light will always follow; that is, the system is deterministic.
  Deterministic systems are relatively easy to understand and analyse, once the transitions are fully known.

Key Points
  Crudely, therefore: matching strings with regular expressions (à la Perl) is a matter of
    translating the expression into a machine (table) and
    passing the table to an interpreter

Recognition as Search
  You can view this algorithm as state-space search.
  States are pairings of tape positions and state numbers.
  Operators are compiled into the table
  Goal state is a pairing with the end of tape position and a final accept state

Generative Formalisms
  A formal language is a set of strings; each string is composed of symbols from a finite set of symbols (alphabet)
  A model m which can both generate and recognize all and only the strings of a formal language acts as a definition of that language
  L(m): 'the formal language L characterized by the model m'
  Finite-state automata define formal languages (without having to enumerate all the strings in the language)
  The term Generative is based on the view that you can run the machine as a generator to get strings from the language.

Generative Formalisms
  FSAs can be viewed from two perspectives:
    Acceptors that can tell you if a string is in the language (recognition)
    Generators to produce all and only the strings in the language (production)
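The generator view can be sketched by walking the same transition table breadth-first and collecting every string that ends in a final state. This is a Python sketch under stated assumptions: the function name generate and the max_length cut-off are inventions of the example, and the cut-off is needed only because the sheep language is infinite.

```python
from collections import deque

SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

def generate(table, start, accept_states, max_length):
    """List all accepted strings of at most max_length symbols, shortest first."""
    results = []
    queue = deque([(start, "")])                 # pairs of (current state, string so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept_states:
            results.append(prefix)
        if len(prefix) < max_length:
            # Follow every transition that leaves the current state.
            for (source, symbol), target in table.items():
                if source == state:
                    queue.append((target, prefix + symbol))
    return results

print(generate(SHEEP_TABLE, "q0", {"q4"}, 6))
```

The same table therefore serves both perspectives: an acceptor reads symbols and follows transitions, while a generator follows transitions and writes symbols.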

Summary
  Regular expressions are just a compact textual representation of FSAs
  Recognition is the process of determining if a string/input is in the language defined by some machine.
  Recognition is straightforward with deterministic machines.