Dan Jurafsky
Natural Language Processing (NLP), Lecture No. 4
Institute of Southern Punjab, Multan
Department of Computer Science

Basic Text Processing: Regular Expressions

Basic Process of NLU
Spoken input (for speech understanding)
  -> Phonological / morphological analyser [phonological & morphological rules]
  -> Sequence of words
  -> SYNTACTIC COMPONENT [grammatical knowledge]
  -> Syntactic structure (parse tree), indicating relns (e.g., mod) between words
  -> SEMANTIC INTERPRETER [semantic rules, lexical semantics]
  -> Logical form (thematic roles, selectional restrictions)
  -> CONTEXTUAL REASONER [pragmatic & world knowledge]
  -> Meaning representation

NLP Representations
• State Machines
  • FSAs, FSTs, HMMs, ATNs, RTNs
• Rule Systems
  • CFGs, Unification Grammars, Probabilistic CFGs
• Logic-based Formalisms
  • 1st Order Predicate Calculus, Temporal and other Higher Order Logics
• Models of Uncertainty
  • Bayesian Probability Theory

Regular Expressions
• Can be viewed as a way to specify:
  • Search patterns over text strings
  • The design of a particular kind of machine, called a Finite State Automaton (FSA)
• These are really equivalent
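The equivalence between a search pattern and an FSA can be sketched by building a small automaton by hand and checking that it agrees with a regex engine. The pattern `ab*c`, the transition table, and the name `fsa_accepts` are illustrative assumptions, not taken from the lecture.

```python
import re

# Hypothetical example pattern (not from the lecture):
# an "a", then zero or more "b", then a "c".
PATTERN = re.compile(r"ab*c")

# Hand-built deterministic FSA for the same set of strings.
# States: 0 = start, 1 = seen "a" (loops on "b"), 2 = accepting.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
ACCEPTING = {2}


def fsa_accepts(text):
    """Run the FSA over the whole string; reject on any missing transition."""
    state = 0
    for ch in text:
        if (state, ch) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING


samples = ["ac", "abc", "abbbc", "a", "abca", "bc", ""]
# The regex (matched against the whole string) and the FSA should agree.
agreement = all(fsa_accepts(s) == bool(PATTERN.fullmatch(s)) for s in samples)
print(agreement)
```

Each regex operator corresponds to a piece of the machine: concatenation is a chain of states, and the star is a loop back to the same state.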

Regular Expressions
• Regular Expression: Formula in algebraic notation for specifying a set of strings
• String: Any sequence of alphanumeric characters
  • Letters, numbers, spaces, tabs, punctuation marks
• Regular Expression Search
  • Pattern: specifying the set of strings we want to search for
  • Corpus: the texts we want to search through

Regular Expressions
The "foundational" operations

Operation       Pattern     Matches
Concatenation   abc         abc
Disjunction     a|b         a, b
                (a|bb)d     ad, bbd
Kleene star     a*          ε (the empty string), a, aa, aaa, ...
                c(a|bb)*    ca, cbba, ...

Regular expressions / finite-state automata are "closed under these operations".
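A minimal sketch of the three foundational operations using Python's `re` module. The helper `matches` and the candidate list are assumptions made up for this illustration; `re.fullmatch` is used so that a pattern must cover the entire string.

```python
import re


def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


# Hypothetical test strings, chosen to include matches and non-matches.
candidates = ["", "a", "b", "aa", "ab", "abc", "ad", "bbd", "c", "ca", "cbba", "cab"]

concatenation = matches(r"abc", candidates)    # a, then b, then c
disjunction = matches(r"a|b", candidates)      # either a or b
grouped = matches(r"(a|bb)d", candidates)      # (a or bb), then d
star = matches(r"a*", candidates)              # zero or more a's
combined = matches(r"c(a|bb)*", candidates)    # c, then any mix of a and bb

for name, result in [("abc", concatenation), ("a|b", disjunction),
                     ("(a|bb)d", grouped), ("a*", star),
                     ("c(a|bb)*", combined)]:
    print(name, result)
```

Because the star allows zero repetitions, `a*` also matches the empty string and `c(a|bb)*` also matches a bare `c`.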

Practical Applications of RegEx's
• Web search
• Word processing: find, substitute
• Validating fields in a database (dates, email addresses, URLs)
• Searching a corpus for linguistic patterns, and gathering stats...

Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks

Regular Expressions: Disjunctions
• Letters inside square brackets []

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit

• Ranges [A-Z]

Pattern   Matches                 Example
[A-Z]     An upper case letter    Drenched Blossoms
[a-z]     A lower case letter     my beans were impatient
[0-9]     A single digit          Chapter 1: Down the Rabbit Hole
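The bracket patterns above can be tried directly in Python's `re` module. This is a small sketch; the variable names are assumptions for illustration, and `re.search` returns the first (leftmost) match in each example string.

```python
import re

# A bracket set matches exactly one character out of those listed.
both_cases = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# Ranges are shorthand for listing every character in between.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

# [1234567890] and [0-9] describe the same set of characters.
digit_listed = re.search(r"[1234567890]", "Chapter 1: Down the Rabbit Hole").group()

print(both_cases, first_upper, first_lower, first_digit, digit_listed)
```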

Regular Expressions: Negation in Disjunction
• Negations [^Ss]
  • Caret means negation only when first in []

Pattern   Matches                    Example
[^A-Z]    Not an upper case letter   Oyfn pripetchik
[^Ss]     Neither 'S' nor 's'        I have no exquisite reason
[^e^]     Neither e nor ^            Look here
a^b       The pattern a caret b      Look up a^b now
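A short sketch of negated bracket sets in Python's `re` module, with assumed variable names. One engine-specific caveat: in Python a caret outside brackets is an anchor, so the literal pattern "a caret b" has to be written with an escaped caret.

```python
import re

# First character that is NOT an upper case letter.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

# First character that is neither 'S' nor 's'.
not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()

# Inside brackets a caret that is not first is an ordinary character,
# so this set means "neither e nor ^".
not_e_or_caret = re.search(r"[^e^]", "Look here").group()

# Outside brackets, Python treats ^ as a start-of-string anchor,
# so the literal sequence a^b needs an escaped caret.
literal_caret = re.search(r"a\^b", "Look up a^b now").group()
unescaped = re.search(r"a^b", "Look up a^b now")  # finds nothing

print(not_upper, not_s, not_e_or_caret, literal_caret, unescaped)
```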

Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck
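The pipe can be sketched in Python as follows. The example sentences are made up for this illustration; the second half checks the slide's claim that `a|b|c` and `[abc]` pick out the same single characters.

```python
import re

# Either spelling, with either capitalization of the first letter.
animal = re.compile(r"[gG]roundhog|[Ww]oodchuck")
found = animal.findall("Groundhog is another name for woodchuck")

# a|b|c and [abc] describe the same set of single characters.
text = "abcabd"  # hypothetical sample string
with_pipe = re.findall(r"a|b|c", text)
with_brackets = re.findall(r"[abc]", text)

print(found)
print(with_pipe == with_brackets)
```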

Regular Expressions: ? * + .

Pattern   Meaning                       Matches
colou?r   Optional previous char        color, colour
oo*h!     0 or more of previous char    oh! oooh! ooooh!
o+h!      1 or more of previous char    oh! oooh! ooooh!
baa+                                    baaaaa
beg.n                                   begin, begun, beg3n

Kleene *, Kleene + (named after Stephen C. Kleene)
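A sketch of the four operators in Python. The helper `full_matches` and the word lists are assumptions for illustration; each list includes at least one string that should be rejected so the effect of the operator is visible.

```python
import re


def full_matches(pattern, words):
    """Keep only the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]


# ? makes the previous character optional (zero or one u).
optional_u = full_matches(r"colou?r", ["color", "colour", "colouur"])

# oo*h! needs one o, then zero or more further o's.
star = full_matches(r"oo*h!", ["h!", "oh!", "oooh!"])

# o+h! says the same thing more compactly: one or more o's.
plus = full_matches(r"o+h!", ["h!", "oh!", "oooh!"])

# baa+ needs "ba" followed by at least one more a.
sheep = full_matches(r"baa+", ["ba", "baa", "baaaaa"])

# . stands for exactly one character of any kind.
dot = full_matches(r"beg.n", ["begin", "begun", "beg3n", "begn"])

print(optional_u, star, plus, sheep, dot)
```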

Regular Expressions: Anchors ^ $

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1, “Hello”
.$            The end? The end!
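The anchors can be sketched in Python as below, using the example strings from the table. The last two lines go slightly beyond the slide: they contrast the unescaped dot with an escaped one (`\.`), which matches only a literal period.

```python
import re

# ^ outside brackets anchors the match to the start of the string.
starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()

# ^ as anchor, followed by a negated set: a first character that is not a letter.
starts_digit = re.search(r"^[^A-Za-z]", "1").group()
starts_quote = re.search(r"^[^A-Za-z]", "“Hello”").group()

# $ anchors to the end; the unescaped . matches any final character.
last_question = re.search(r".$", "The end?").group()
last_bang = re.search(r".$", "The end!").group()

# An escaped dot only accepts a literal period at the end.
period_only = re.search(r"\.$", "The end?")       # no match
period_found = re.search(r"\.$", "The end.").group()

print(starts_upper, starts_digit, starts_quote, last_question, last_bang)
```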

Some Examples

RE                     Description                              Uses?
/a*/                   Zero or more a's                         Optional doubled modifiers
/a+/                   One or more a's                          Non-optional...
/a?/                   Zero or one a's                          Optional...
/cat|dog/              'cat' or 'dog'                           Words modifying pets
/^cat\.$/              A line that contains only "cat."         ??
                       (^ anchors beginning, $ anchors
                       end of line)
/\bun\B/               Beginnings of longer strings             Words prefixed by 'un'
/pupp(y|ies)/          Morphological variants of 'puppy'
/ (.+)ier and \1ier /  happier and happier,
                       fuzzier and fuzzier
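A few rows of the table rely on backslash sequences (`\.`, `\b`, `\B`, `\1`), so a short Python sketch may help. The example sentences are assumptions made up for this illustration, and `\w*` is added to the `un` pattern only to show the whole word that was found.

```python
import re

# \. is a literal period, so only the exact line "cat." matches.
only_cat = bool(re.search(r"^cat\.$", "cat."))
not_cats = bool(re.search(r"^cat\.$", "cats"))

# \b is a word boundary and \B a non-boundary: "un" must start a word
# and be followed by more word characters.
un_words = re.findall(r"\bun\B\w*", "unhappy and undone, un momento")

# Parentheses plus | cover both morphological variants.
puppies = [w for w in ["puppy", "puppies", "pupp"]
           if re.fullmatch(r"pupp(y|ies)", w)]

# \1 repeats whatever the first group matched.
repeat = re.compile(r"(.+)ier and \1ier")
same_stem = bool(repeat.search("happier and happier"))
different_stem = bool(repeat.search("happier and fuzzier"))

print(only_cat, not_cats, un_words, puppies, same_stem, different_stem)
```

Note that `\bun\B` is purely orthographic: it would also match a word such as "under", where "un" is not a prefix.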

Example
• Find me all instances of the word "the" in a text.

Pattern                     Problem
the                         Misses capitalized examples
[tT]he                      Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]

Optionality and Repetition
• /[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck
• /colou?r/ matches color or colour
• /he{3}/ matches heee
• /(he){3}/ matches hehehe
• /(he){3,}/ matches a sequence of at least 3 he's
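The counters can be checked with a small Python sketch. The variable names and the extra test strings (`hehehehe`, `hehe`) are assumptions added to show both sides of the "at least 3" bound.

```python
import re

# The ? applies only to the final s.
woodchucks = [w for w in ["woodchucks", "Woodchucks", "woodchuck", "Woodchuck"]
              if re.fullmatch(r"[Ww]oodchucks?", w)]

# Without parentheses the counter applies to the single character e.
single_char = bool(re.fullmatch(r"he{3}", "heee"))

# With parentheses the counter applies to the whole group "he".
grouped = bool(re.fullmatch(r"(he){3}", "hehehe"))

# {3,} means three or more repetitions.
at_least_three = bool(re.fullmatch(r"(he){3,}", "hehehehe"))
too_few = bool(re.fullmatch(r"(he){3,}", "hehe"))

print(woodchucks, single_char, grouped, at_least_three, too_few)
```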

Operator Precedence Hierarchy (highest to lowest)
1. Parentheses              ()
2. Counters                 * + ? {}
3. Sequences and anchors    the ^my end$
4. Disjunction              |

Examples
• /moo+/
• /try|ies/
• /and|or/
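The hierarchy explains how the example patterns are read, which can be sketched in Python. The parenthesized variants `(moo)+` and `tr(y|ies)` are not on the slide; they are added here as assumptions to show how parentheses override the default reading.

```python
import re

# Counters bind tighter than sequences: + applies to the last o only.
moo_long = bool(re.fullmatch(r"moo+", "mooooo"))
moo_twice = bool(re.fullmatch(r"moo+", "moomoo"))
grouped_moo = bool(re.fullmatch(r"(moo)+", "moomoo"))

# Sequences bind tighter than |: this reads as "try" or "ies".
tries_plain = bool(re.fullmatch(r"try|ies", "tries"))
ies_plain = bool(re.fullmatch(r"try|ies", "ies"))

# Parentheses limit the disjunction to the suffix.
tries_grouped = bool(re.fullmatch(r"tr(y|ies)", "tries"))
try_grouped = bool(re.fullmatch(r"tr(y|ies)", "try"))

print(moo_long, moo_twice, grouped_moo)
print(tries_plain, ies_plain, tries_grouped, try_grouped)
```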

A Simple Exercise
• Write a regular expression to find all instances of the determiner "the":
  • /the/
  • /[tT]he/
  • /\b[tT]he\b/
  • /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/

Test sentence: The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.
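The four candidate patterns can be compared by counting their matches in the test sentence. This is a sketch with assumed names (`SENTENCE`, `CANDIDATES`, `count_matches`); `re.finditer` is used rather than `re.findall` so that the capturing group in the last pattern does not change what is counted.

```python
import re

SENTENCE = ("The recent attempt by the police to retain their current rates "
            "of pay has not gathered much favor with the southern factions.")

CANDIDATES = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]


def count_matches(pattern):
    """Count non-overlapping matches of the pattern in the test sentence."""
    return sum(1 for _ in re.finditer(pattern, SENTENCE))


counts = {pattern: count_matches(pattern) for pattern in CANDIDATES}
for pattern, n in counts.items():
    print(f"/{pattern}/ -> {n}")
```

The first two patterns overcount because they also fire inside "their", "gathered", and "southern"; the last two find exactly the three determiners.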

Errors
• The process we just went through was based on fixing two kinds of errors
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)

Errors cont.
• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives)
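Precision and recall can be made concrete on the exercise sentence. In this sketch the gold standard is an assumption: every standalone token "the" or "The", located by character span. The function name `precision_recall` is also hypothetical.

```python
import re

SENTENCE = ("The recent attempt by the police to retain their current rates "
            "of pay has not gathered much favor with the southern factions.")

# Assumed gold standard: character spans of the three standalone determiners.
GOLD = {m.span() for m in re.finditer(r"\b[tT]he\b", SENTENCE)}


def precision_recall(pattern):
    """Score a pattern's matches against the gold spans."""
    found = {m.span() for m in re.finditer(pattern, SENTENCE)}
    true_positives = len(found & GOLD)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(GOLD) if GOLD else 0.0
    return precision, recall


# /the/ has false positives (their, gathered, southern)
# and one false negative (The).
p_the, r_the = precision_recall(r"the")

# /[tT]he/ removes the false negative but adds no precision.
p_caps, r_caps = precision_recall(r"[tT]he")

print(f"/the/     precision={p_the:.2f} recall={r_the:.2f}")
print(f"/[tT]he/  precision={p_caps:.2f} recall={r_caps:.2f}")
```

Adding `[tT]` raises recall; adding the word boundaries is what then raises precision.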

Summary
• Regular expressions play a surprisingly large role
  • Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
  • But regular expressions are used as features in the classifiers
  • Can be very useful in capturing generalizations
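One way regular expressions can feed a classifier is as binary features. The sketch below is only an illustration: the feature names, the patterns chosen, and the function `regex_features` are all assumptions, and no particular classifier or library is implied by the lecture.

```python
import re

# Hypothetical feature set: each pattern becomes one 0/1 feature.
FEATURE_PATTERNS = {
    "has_digit": re.compile(r"[0-9]"),
    "starts_capitalized": re.compile(r"^[A-Z]"),
    "ends_with_punct": re.compile(r"[.?!]$"),
    "mentions_woodchuck": re.compile(r"[Ww]oodchucks?"),
}


def regex_features(text):
    """Map a text to a dict of binary features, one per pattern."""
    return {name: int(bool(pattern.search(text)))
            for name, pattern in FEATURE_PATTERNS.items()}


features = regex_features("Chapter 1: Down the Rabbit Hole")
print(features)
```

A dictionary like this could then be turned into a feature vector for whatever learning algorithm is in use.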