Dan Jurafsky Natural Language Processing NLP Lecture No

























Dan Jurafsky. Natural Language Processing (NLP), Lecture No. 4. Institute of Southern Punjab, Multan, Department of Computer Science

Basic Text Processing: Regular Expressions

Basic Process of NLU (pipeline diagram)
- Spoken input (for speech understanding) → phonological / morphological analyser, using phonological & morphological rules → sequence of words
- Syntactic component, using grammatical knowledge → syntactic structure (parse tree)
- Semantic interpreter, using semantic rules and lexical semantics → logical form
- Contextual reasoner, using pragmatic & world knowledge → meaning representation
- Annotations in the diagram: indicating relations (e.g., mod) between words; thematic roles; selectional restrictions

NLP Representations
- State Machines
  - FSAs, FSTs, HMMs, ATNs, RTNs
- Rule Systems
  - CFGs, Unification Grammars, Probabilistic CFGs
- Logic-based Formalisms
  - 1st Order Predicate Calculus, Temporal and other Higher Order Logics
- Models of Uncertainty
  - Bayesian Probability Theory

Regular Expressions
- Can be viewed as a way to specify:
  - Search patterns over text strings
  - The design of a particular kind of machine, called a Finite State Automaton (FSA)
- These two views are really equivalent
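The equivalence can be checked on a small case. The sketch below is an illustration only: the transition table is a hypothetical hand-built automaton for the pattern `baa+`, and the names `TRANSITIONS`, `dfa_accepts`, and `regex_accepts` are assumptions, not part of the lecture.

```python
import re

# Hypothetical hand-built deterministic FSA for the pattern baa+.
# States: 0 = start, 1 = saw "b", 2 = saw "ba", 3 = saw "baa" (accepting).
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: any further a's stay in the accepting state
}
START, ACCEPTING = 0, {3}


def dfa_accepts(text: str) -> bool:
    """Run the automaton over the whole string."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


def regex_accepts(text: str) -> bool:
    """Same language, specified as a regular expression."""
    return re.fullmatch(r"baa+", text) is not None


samples = ["baa", "baaaa", "ba", "b", "", "baab", "abaa"]
agreement = all(dfa_accepts(s) == regex_accepts(s) for s in samples)
print(agreement)
```

Both functions accept and reject exactly the same sample strings, which is what "equivalent" means here: the pattern and the machine describe the same set of strings.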

Regular Expressions
- Regular Expression: a formula in algebraic notation for specifying a set of strings
- String: any sequence of alphanumeric characters
  - Letters, numbers, spaces, tabs, punctuation marks
- Regular Expression Search
  - Pattern: specifies the set of strings we want to search for
  - Corpus: the texts we want to search through

Regular Expressions: the “foundational” operations
- Concatenation: `abc` matches abc
- Disjunction: `a|b` matches a, b; `(a|bb)d` matches ad, bbd
- Kleene star: `a*` matches ε (the empty string), a, aa, aaa, ...; `c(a|bb)*` matches ca, cbba, ...
- Regular expressions / finite-state automata are “closed under these operations”
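The three operations can be tried directly. This is a minimal sketch using Python's `re` module; the helper name `accepts` is an assumption made for the example, and `re.fullmatch` is used so that the whole string must belong to the set the pattern specifies.

```python
import re


def accepts(pattern: str, text: str) -> bool:
    """True when the entire string is in the set described by the pattern."""
    return re.fullmatch(pattern, text) is not None


concatenation = accepts(r"abc", "abc")
disjunction = [accepts(r"(a|bb)d", s) for s in ("ad", "bbd", "bd")]
kleene_star = [accepts(r"a*", s) for s in ("", "a", "aaa", "b")]
combined = [accepts(r"c(a|bb)*", s) for s in ("c", "ca", "cbba", "cb")]

print(concatenation)  # True
print(disjunction)    # [True, True, False]
print(kleene_star)    # [True, True, True, False]
print(combined)       # [True, True, True, False]
```

Note that `a*` accepts the empty string, and `c(a|bb)*` accepts a bare `c` because the starred group may repeat zero times.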

Practical Applications of Regular Expressions
- Web search
- Word processing: find, substitute
- Validating fields in a database (dates, email addresses, URLs)
- Searching a corpus for linguistic patterns and gathering statistics

Regular expressions
- A formal language for specifying text strings
- How can we search for any of these?
  - woodchuck
  - woodchucks
  - Woodchuck
  - Woodchucks
![Regular Expressions: Disjunctions](https://slidetodoc.com/presentation_image_h/db41e42752fbfe18e1ee2a5a9e427272/image-10.jpg)
Regular Expressions: Disjunctions
- Letters inside square brackets `[]`

| Pattern | Matches |
| --- | --- |
| `[wW]oodchuck` | Woodchuck, woodchuck |
| `[1234567890]` | Any digit |

- Ranges, e.g. `[A-Z]`

| Pattern | Matches | Example text |
| --- | --- | --- |
| `[A-Z]` | An upper case letter | Drenched Blossoms |
| `[a-z]` | A lower case letter | my beans were impatient |
| `[0-9]` | A single digit | Chapter 1: Down the Rabbit Hole |
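A bracket expression matches exactly one character from the listed set. The sketch below, which assumes Python's `re` module and a hypothetical helper `first_match`, reports the first character each pattern finds in the slide's example texts.

```python
import re


def first_match(pattern: str, text: str):
    """Return the first substring matched by the pattern, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None


print(first_match(r"[wW]oodchuck", "A woodchuck met a Woodchuck"))  # woodchuck
print(first_match(r"[A-Z]", "Drenched Blossoms"))                   # D
print(first_match(r"[a-z]", "my beans were impatient"))             # m
print(first_match(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))     # 1
```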
![Regular Expressions: Negation in Disjunction](https://slidetodoc.com/presentation_image_h/db41e42752fbfe18e1ee2a5a9e427272/image-11.jpg)
Regular Expressions: Negation in Disjunction
- Negations: `[^Ss]`
- The caret means negation only when it is the first character inside `[]`

| Pattern | Matches | Example text |
| --- | --- | --- |
| `[^A-Z]` | Not an upper case letter | Oyfn pripetchik |
| `[^Ss]` | Neither ‘S’ nor ‘s’ | I have no exquisite reason |
| `[^e^]` | Neither e nor ^ | Look here |
| `a^b` | The pattern a caret b | Look up a^b now |
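The negated classes behave the same way in Python, with one caveat worth flagging: in Python's `re` module a caret outside brackets is always an anchor, so the literal pattern "a caret b" has to be written with an escaped caret. The helper name `first_match` is an assumption made for this sketch.

```python
import re


def first_match(pattern: str, text: str):
    """Return the first substring matched by the pattern, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None


print(first_match(r"[^A-Z]", "Oyfn pripetchik"))           # y
print(first_match(r"[^Ss]", "I have no exquisite reason"))  # I
print(first_match(r"[^e^]", "Look here"))                   # L

# In Python, an unescaped ^ outside [] anchors to the start of the string,
# so r"a^b" can never match; escape the caret to match it literally.
print(first_match(r"a^b", "Look up a^b now"))    # None
print(first_match(r"a\^b", "Look up a^b now"))   # a^b
```

Other regex dialects (for example POSIX basic regular expressions) treat a mid-pattern caret as a literal, which is the behaviour the table describes.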

Regular Expressions: More Disjunction
- Woodchuck is another name for groundhog!
- The pipe `|` for disjunction
  - `groundhog|woodchuck` matches groundhog, woodchuck
  - `yours|mine` matches yours, mine
  - `a|b|c` = `[abc]`
  - `[gG]roundhog|[Ww]oodchuck` matches groundhog, Groundhog, woodchuck, Woodchuck
- Photo: D. Fletcher
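A short sketch of the pipe operator in Python's `re` module; the sample sentences are made up for the illustration.

```python
import re

text = "Woodchuck is another name for groundhog"
animals = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)
print(animals)  # ['Woodchuck', 'groundhog']

# a|b|c and [abc] pick out the same single characters.
word = "cabbage"
with_pipe = re.findall(r"a|b|c", word)
with_brackets = re.findall(r"[abc]", word)
print(with_pipe == with_brackets)  # True
```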

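Regular Expressions: ? * + .

| Pattern | Matches | Examples |
| --- | --- | --- |
| `colou?r` | Optional previous char | color, colour |
| `oo*h!` | 0 or more of previous char | oh! oooh! ooooh! |
| `o+h!` | 1 or more of previous char | oh! oooh! ooooh! |
| `baa+` | b followed by two or more a's | baaaaa |
| `beg.n` | Any single character between beg and n | begin, begun, beg3n |

- Kleene *, Kleene + (named after Stephen C. Kleene)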
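The four operators can be exercised side by side. This is a small sketch with made-up test strings, using `re.findall` from Python's standard library.

```python
import re

optional = re.findall(r"colou?r", "color colour colouur")
star = re.findall(r"oo*h!", "oh! ooh! oooh!")
plus = re.findall(r"o+h!", "oh! ooh! oooh!")
sheep = re.findall(r"baa+", "ba baa baaaaa")
wildcard = re.findall(r"beg.n", "begin begun beg3n begn")

print(optional)  # ['color', 'colour']
print(star)      # ['oh!', 'ooh!', 'oooh!']
print(plus)      # ['oh!', 'ooh!', 'oooh!']
print(sheep)     # ['baa', 'baaaaa']
print(wildcard)  # ['begin', 'begun', 'beg3n']
```

`oo*h!` and `o+h!` describe the same strings: one `o` followed by zero or more, versus one or more `o`. The period requires exactly one character, so `begn` is rejected.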
![Regular Expressions: Anchors ^ $](https://slidetodoc.com/presentation_image_h/db41e42752fbfe18e1ee2a5a9e427272/image-14.jpg)
Regular Expressions: Anchors ^ $

| Pattern | Matches |
| --- | --- |
| `^[A-Z]` | Palo Alto |
| `^[^A-Za-z]` | 1 “Hello” |
| `.$` | The end? The end! |
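A minimal sketch of the anchors in Python. The escaped-period pattern `\.$` is added here as a contrast and is an assumption of this example: an unescaped period matches any final character, while the escaped one matches only a literal full stop.

```python
import re

starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_non_letter = re.search(r"^[^A-Za-z]", '1 "Hello"').group()
last_char = re.search(r".$", "The end?").group()
literal_dot_miss = re.search(r"\.$", "The end?")
literal_dot_hit = re.search(r"\.$", "The end.").group()

print(starts_upper)       # P
print(starts_non_letter)  # 1
print(last_char)          # ?
print(literal_dot_miss)   # None
print(literal_dot_hit)    # .
```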

Some Examples

RE, description, and uses
- `/a*/`: zero or more a’s. Uses: optional doubled modifiers
- `/a+/`: one or more a’s. Uses: non-optional...
- `/a?/`: zero or one a’s. Uses: optional...
- `/cat|dog/`: ‘cat’ or ‘dog’. Uses: words modifying pets
- `/^cat\.$/`: a line that contains only “cat.” (`^` anchors the beginning, `$` anchors the end of the line). Uses: ??
- `/\bun\B/`: beginnings of longer strings. Uses: words prefixed by ‘un’
- `/pupp(y|ies)/`: morphological variants of ‘puppy’
- `/ (.+)ier and \1ier /`: happier and happier, fuzzier and fuzzier
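Three of these patterns are less obvious than the rest, so here is a hedged Python sketch of them. Two adaptations are assumptions of the example: `\w*` is appended to the `un` pattern to show the whole word, and a non-capturing group `(?:...)` is used so that `re.findall` returns whole matches.

```python
import re

# \b requires a word boundary before "un"; \B requires that no boundary follows.
un_words = re.findall(r"\bun\B\w*", "unhappy sun un under")
print(un_words)  # ['unhappy', 'under']

# Morphological variants of "puppy".
puppies = re.findall(r"pupp(?:y|ies)", "one puppy, two puppies")
print(puppies)  # ['puppy', 'puppies']

# \1 must repeat exactly the text captured by the first group.
same_stem = re.search(r"(.+)ier and \1ier", "happier and happier")
mixed_stem = re.search(r"(.+)ier and \1ier", "happier and fuzzier")
print(same_stem.group(1))  # happ
print(mixed_stem)          # None
```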

Example
- Find me all instances of the word “the” in a text.
  - `the`: misses capitalized examples
  - `[tT]he`: incorrectly returns other or theology
  - `[^a-zA-Z][tT]he[^a-zA-Z]`
![Optionality and Repetition](https://slidetodoc.com/presentation_image_h/db41e42752fbfe18e1ee2a5a9e427272/image-18.jpg)
Optionality and Repetition
- `/[Ww]oodchucks?/` matches woodchucks, Woodchucks, woodchuck, Woodchuck
- `/colou?r/` matches color or colour
- `/he{3}/` matches heee
- `/(he){3}/` matches hehehe
- `/(he){3,}/` matches a sequence of at least 3 he’s
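The difference between a counter applied to one character and a counter applied to a group can be sketched as follows; the helper name `accepts` is an assumption of this example.

```python
import re


def accepts(pattern: str, text: str) -> bool:
    """True when the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


woodchucks = re.findall(
    r"[Ww]oodchucks?", "woodchucks Woodchucks woodchuck Woodchuck"
)
print(woodchucks)  # ['woodchucks', 'Woodchucks', 'woodchuck', 'Woodchuck']

print(accepts(r"he{3}", "heee"))          # True: {3} applies to "e" only
print(accepts(r"he{3}", "hehehe"))        # False
print(accepts(r"(he){3}", "hehehe"))      # True: {3} applies to the group
print(accepts(r"(he){3,}", "hehehehe"))   # True: at least three
print(accepts(r"(he){3,}", "hehe"))       # False: only two
```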

Operator Precedence Hierarchy
1. Parentheses: `()`
2. Counters: `* + ? {}`
3. Sequences and anchors: `the ^my end$`
4. Disjunction: `|`

Examples: `/moo+/`, `/try|ies/`, `/and|or/`
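Precedence explains why the example patterns behave as they do. In the sketch below, the parenthesized variants `tr(y|ies)` and `(moo)+` are not from the slide; they are added as contrasts, and the helper `accepts` is an assumed name.

```python
import re


def accepts(pattern: str, text: str) -> bool:
    """True when the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Counters bind tighter than sequences: + applies to the last "o" only.
print(accepts(r"moo+", "moooo"))     # True
print(accepts(r"moo+", "moomoo"))    # False
print(accepts(r"(moo)+", "moomoo"))  # True

# Sequences bind tighter than disjunction: try|ies means "try" or "ies".
print(accepts(r"try|ies", "try"))      # True
print(accepts(r"try|ies", "tries"))    # False
print(accepts(r"tr(y|ies)", "tries"))  # True
```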

A Simple Exercise
- Write a regular expression to find all instances of the determiner “the”:
  - `/the/`
  - `/[tT]he/`
  - `/\b[tT]he\b/`
  - `/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/`
- Test sentence: The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.
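Running the four candidate patterns over the test sentence shows how each refinement changes the result. This is a sketch in Python; the last pattern uses a non-capturing group `(?:...)` instead of plain parentheses, an adaptation so that `re.findall` returns whole matches.

```python
import re

SENTENCE = (
    "The recent attempt by the police to retain their current rates of pay "
    "has not gathered much favor with the southern factions."
)

PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"(?:^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]

counts = [len(re.findall(p, SENTENCE)) for p in PATTERNS]
for pattern, count in zip(PATTERNS, counts):
    print(f"{pattern!r}: {count} matches")

# The first two patterns also hit "their", "gathered", and "southern".
print(re.findall(r"\w*the\w*", SENTENCE))
```

The sentence contains three determiners. The first pattern finds five substrings and misses the capitalized one, the second finds six, and the last two find exactly three.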

Errors
- The process we just went through was based on fixing two kinds of errors
  - Matching strings that we should not have matched (there, then, other)
    - False positives (Type I)
  - Not matching things that we should have matched (The)
    - False negatives (Type II)

Errors cont.
- In NLP we are always dealing with these kinds of errors.
- Reducing the error rate for an application often involves two antagonistic efforts:
  - Increasing accuracy or precision (minimizing false positives)
  - Increasing coverage or recall (minimizing false negatives)
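Precision and recall can be computed for the patterns from the exercise. The sketch below makes a loud assumption: the matches of `\b[tT]he\b` are treated as the gold standard for this one sentence, and the helper names `spans` and `precision_recall` are invented for the illustration.

```python
import re

SENTENCE = (
    "The recent attempt by the police to retain their current rates of pay "
    "has not gathered much favor with the southern factions."
)


def spans(pattern: str, text: str) -> set:
    """Character spans of every match of the pattern."""
    return {m.span() for m in re.finditer(pattern, text)}


# Assumption for this sketch: whole-word matches are the correct answers.
GOLD = spans(r"\b[tT]he\b", SENTENCE)


def precision_recall(pattern: str) -> tuple:
    predicted = spans(pattern, SENTENCE)
    true_positives = len(predicted & GOLD)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(GOLD) if GOLD else 0.0
    return precision, recall


print(precision_recall(r"the"))      # 2 of 5 matches correct, 2 of 3 found
print(precision_recall(r"[tT]he"))   # 3 of 6 matches correct, 3 of 3 found
```

Moving from `the` to `[tT]he` removes the false negative (recall rises to 1.0), but the false positives remain, so precision only reaches 0.5.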

Summary
- Regular expressions play a surprisingly large role
  - Sophisticated sequences of regular expressions are often the first model for any text processing task
- For many hard tasks, we use machine learning classifiers
  - But regular expressions are used as features in the classifiers
  - Can be very useful in capturing generalizations
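One way regular expressions can serve as classifier features is to turn each pattern into a binary indicator. The feature names and patterns below are hypothetical, chosen only to illustrate the idea; no particular classifier or library is implied.

```python
import re

# Hypothetical feature set: each regex becomes one 0/1 feature.
FEATURE_PATTERNS = {
    "starts_capitalized": r"^[A-Z]",
    "contains_digit": r"[0-9]",
    "ends_in_ing": r"ing$",
}


def regex_features(token: str) -> dict:
    """Map a token to a dictionary of binary regex-based features."""
    return {
        name: int(re.search(pattern, token) is not None)
        for name, pattern in FEATURE_PATTERNS.items()
    }


print(regex_features("Running"))
print(regex_features("beg3n"))
```

A feature dictionary like this could then be passed to whatever classifier the task calls for.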