Natural Language Processing
Regular Expressions, Basic Regular Expression Patterns
3rd Class
8/20/2018, Dr. Isma Farah Siddiqui, isma.farah@faculty.muet.edu.pk

Regular Languages
• There are several formalisms for specifying tokens
• Regular languages are the most popular
• Simple and useful theory
• Easy to understand
• Efficient implementations

Languages
Defined as: Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.

Examples of Languages
• Alphabet = English characters
• Alphabet = ASCII
• Language = English sentences
• Language = C programs
• Not every string of English characters is an English sentence
• Note: the ASCII character set is different from the English character set

Notation
• Languages are sets of strings.
• We need some notation for specifying which sets we want
• Three equivalent formal ways to describe regular languages: regular expressions, finite state automata, and regular grammars

Regular Expression

Atomic Regular Expressions
• Single character: 'c' = {"c"}
• Epsilon: ε = {""}
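
The two atomic expressions can be checked with Python's `re` module. This is only a sketch: the helper `language_of` and the candidate strings are made up for illustration, and the empty pattern `""` stands in for ε.

```python
import re

def language_of(pattern, candidates):
    """Return the candidate strings that the pattern matches in full."""
    return {s for s in candidates if re.fullmatch(pattern, s) is not None}

# Hypothetical candidate strings, chosen only for this illustration.
candidates = ["", "c", "cc", "d"]

single_char = language_of("c", candidates)  # the atomic expression 'c'
epsilon = language_of("", candidates)       # the empty pattern plays the role of ε

print(single_char)
print(epsilon)
```

Each atomic expression denotes a language with exactly one string: "c" for the single character and the empty string for ε.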

Compound Regular Expressions

Regular Expression Quick Guide
^         Matches the beginning of a line
$         Matches the end of the line
.         Matches any character
\s        Matches whitespace
\S        Matches any non-whitespace character
*         Repeats a character zero or more times
*?        Repeats a character zero or more times (non-greedy)
+         Repeats a character one or more times
+?        Repeats a character one or more times (non-greedy)
[aeiou]   Matches a single character in the listed set
[^XYZ]    Matches a single character not in the listed set
[a-z0-9]  The set of characters can include a range
(         Indicates where string extraction is to start
)         Indicates where string extraction is to end
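
Several entries of the quick guide can be tried out with Python's `re` module. This is a rough sketch; the sample line and the address in it are invented for the example and do not come from the slides.

```python
import re

# Hypothetical input line, invented for this illustration.
line = "From: alice@example.org  Sat Jan  5 09:14:16 2008"

# ^ anchors the match at the beginning of the line.
starts_with_from = re.search(r"^From:", line) is not None

# $ anchors the match at the end of the line.
trailing_digits = re.search(r"[0-9]+$", line).group()

# [aeiou] matches one character from the listed set.
vowels = re.findall(r"[aeiou]", "regex")

# \S matches a non-whitespace character; + repeats it one or more times.
non_space_runs = re.findall(r"\S+", "a  bc d")

# + is greedy, +? is non-greedy.
greedy = re.search(r"<.+>", "<b>bold</b>").group()
non_greedy = re.search(r"<.+?>", "<b>bold</b>").group()

# Parentheses mark the part of the match to extract.
extracted = re.findall(r"^From: (\S+)", line)

print(starts_with_from, trailing_digits, vowels, non_space_runs)
print(greedy, non_greedy, extracted)
```

The greedy pattern grabs the longest possible match (`<b>bold</b>`), while the non-greedy one stops at the first closing bracket (`<b>`).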

Regular Expressions

Syntax vs. Semantics

Segue (an uninterrupted transition)
• Regular expressions are simple, almost trivial
• But they are useful!
• Reconsider informal token descriptions...

Example: Keyword
Keyword: "else" or "if" or "begin" or …
'else' + 'if' + 'begin' + . . .
Note: 'else' abbreviates 'e' 'l' 's' 'e'
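
In Python's `re` syntax the union written `+` on the slide is spelled `|`. A minimal sketch, assuming only the three keywords named above (the elided rest of the list is not filled in, and `is_keyword` is a name made up for the example):

```python
import re

# Only the keywords named on the slide; the full list is not given there.
KEYWORDS = ["else", "if", "begin"]

# 'else' + 'if' + 'begin' becomes else|if|begin in Python's notation.
keyword_re = re.compile("|".join(re.escape(k) for k in KEYWORDS))

def is_keyword(token):
    """True if the whole token is one of the listed keywords."""
    return keyword_re.fullmatch(token) is not None

print(is_keyword("else"), is_keyword("elseif"))
```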

Example: Integers
Integer: a non-empty string of digits
digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
integer = digit digit*
Abbreviation: A+ = AA*, so integer = digit+
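
The integer definition and the A+ = AA* abbreviation can be checked in Python. This is a sketch; the constant and function names are chosen for the example, and `[0-9]` is used as shorthand for the ten-way union of digits.

```python
import re

DIGIT = "[0-9]"                    # '0' + '1' + ... + '9'
INTEGER = f"{DIGIT}{DIGIT}*"       # digit digit*
INTEGER_PLUS = f"{DIGIT}+"         # the same language via A+ = AA*
DIGIT_STAR = f"{DIGIT}*"           # also accepts the empty string

def is_integer(s, pattern=INTEGER):
    """True if s is a non-empty string of digits."""
    return re.fullmatch(pattern, s) is not None

samples = ["2018", "0", "", "12a"]
for s in samples:
    # Both spellings should agree on every sample.
    print(repr(s), is_integer(s), is_integer(s, INTEGER_PLUS))
```

Note that `digit*` on its own would also accept the empty string, which is why the non-empty requirement needs `digit digit*` (or `digit+`).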

Example: Identifier
Identifier: strings of letters or digits, starting with a letter
letter = 'A' + . . . + 'Z' + 'a' + . . . + 'z'
identifier = letter (letter + digit)*
Is (letter* + digit*) the same?
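
The identifier definition translates directly into Python. The sketch below also probes the closing question, under the assumption that it compares (letter + digit)* with (letter* + digit*) read literally; the helper `matches` and the test strings are invented for the example.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

IDENTIFIER = f"{LETTER}({LETTER}|{DIGIT})*"   # letter (letter + digit)*
MIXED_STAR = f"({LETTER}|{DIGIT})*"           # (letter + digit)*
SEPARATE_STARS = f"({LETTER}*|{DIGIT}*)"      # (letter* + digit*), read literally

def matches(pattern, s):
    """True if the pattern matches the whole string s."""
    return re.fullmatch(pattern, s) is not None

print(matches(IDENTIFIER, "x1"), matches(IDENTIFIER, "1x"))
print(matches(MIXED_STAR, "a1"), matches(SEPARATE_STARS, "a1"))
```

Under this reading the two are not the same: (letter* + digit*) accepts strings that are all letters or all digits, but not a mix such as "a1".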

Example: Whitespace
Whitespace: a non-empty sequence of blanks, newlines, and tabs
(' ' + '\n' + '\t')+
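
A non-empty sequence of blanks, newlines, and tabs can be written in Python as follows. This is a small sketch; `is_whitespace` is a name chosen for the example.

```python
import re

# Union of blank, newline, and tab, repeated one or more times.
WHITESPACE = r"( |\n|\t)+"

def is_whitespace(s):
    """True if s is a non-empty run of blanks, newlines, and tabs."""
    return re.fullmatch(WHITESPACE, s) is not None

print(is_whitespace(" \n\t "), is_whitespace(""), is_whitespace(" a "))
```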

Example: Phone Numbers
• Regular expressions are all around you!
• Consider (02)-0760-0000
Σ = digits ∪ {-, (, )}
exchange = digit^4
phone = digit^4
area = digit^2
phone_number = '(' area ')-' exchange '-' phone
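
The phone number definition can be sketched in Python by building the pattern from its named parts. The digit counts below (two for the area code, four each for exchange and phone) are read off the sample number (02)-0760-0000 and should be treated as an assumption.

```python
import re

AREA = r"[0-9]{2}"       # two digits, as in "02"
EXCHANGE = r"[0-9]{4}"   # four digits, as in "0760"
PHONE = r"[0-9]{4}"      # four digits, as in "0000"

# '(' area ')-' exchange '-' phone; parentheses are escaped because
# they are literal characters here, not grouping operators.
PHONE_NUMBER = rf"\({AREA}\)-{EXCHANGE}-{PHONE}"

def is_phone_number(s):
    """True if s has the form (dd)-dddd-dddd."""
    return re.fullmatch(PHONE_NUMBER, s) is not None

print(is_phone_number("(02)-0760-0000"), is_phone_number("02-0760-0000"))
```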

Example: Email Addresses
Consider anyone@faculty.muet
Σ = letters ∪ {., @}
name = letter+
address = name '@' name '.' name
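
The simplified address definition can be written in Python as below. This is a sketch of the slide's toy grammar only, not a validator for real email addresses; `is_address` is a name chosen for the example.

```python
import re

NAME = "[A-Za-z]+"                    # name = letter+
ADDRESS = rf"{NAME}@{NAME}\.{NAME}"   # name '@' name '.' name

def is_address(s):
    """True if s fits the toy grammar name@name.name."""
    return re.fullmatch(ADDRESS, s) is not None

print(is_address("anyone@faculty.muet"))
print(is_address("isma.farah@faculty.muet.edu.pk"))
```

The second call returns False: the toy grammar allows no dot before the '@' and exactly one dot after it, so longer real-world addresses would need a richer expression.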
Optionality and Repetition
• /[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck
• /colou?r/ matches color or colour
• /he{3}/ matches heee
• /(he){3}/ matches hehehe
• /(he){3,}/ matches a sequence of at least 3 he's
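
These patterns behave the same way in Python's `re` module once the surrounding slashes are dropped. A short sketch, with the helper `full` named only for this example:

```python
import re

def full(pattern, s):
    """True if the pattern matches the whole string s."""
    return re.fullmatch(pattern, s) is not None

# ? makes the preceding character optional.
woodchuck_forms = ["woodchucks", "Woodchucks", "woodchuck", "Woodchuck"]
all_woodchucks = all(full(r"[Ww]oodchucks?", w) for w in woodchuck_forms)
both_spellings = full(r"colou?r", "color") and full(r"colou?r", "colour")

# {3} repeats only the preceding item: a single character or a group.
char_repeat = full(r"he{3}", "heee")
group_repeat = full(r"(he){3}", "hehehe")

# {3,} means three or more repetitions.
at_least_three = full(r"(he){3,}", "hehehehe")
too_few = full(r"(he){3,}", "hehe")

print(all_woodchucks, both_spellings, char_repeat, group_repeat)
print(at_least_three, too_few)
```

The difference between `he{3}` and `(he){3}` comes down to what the counter applies to: without the parentheses it binds only to the `e`.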

A Simple Exercise
• Write a regular expression to find all instances of the determiner "the":
/the/
/[tT]he/
/\b[tT]he\b/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.
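
The four candidate patterns can be compared by counting their matches in the sample sentence. This is a sketch using Python's `re` module; the helper `count_matches` is a name made up for the example.

```python
import re

text = ("The recent attempt by the police to retain their current rates "
        "of pay has not gathered much favor with the southern factions.")

def count_matches(pattern, s):
    """Count non-overlapping matches of pattern in s."""
    return sum(1 for _ in re.finditer(pattern, s))

# Misses the capitalised "The"; also hits their, gathered, southern.
plain = count_matches(r"the", text)

# Catches "The" too, but still hits the word-internal cases.
either_case = count_matches(r"[tT]he", text)

# \b word boundaries keep only the standalone word.
bounded = count_matches(r"\b[tT]he\b", text)

# Same idea without \b: a non-letter (or line start) on the left,
# a non-letter on the right.
no_letters_around = count_matches(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]", text)

print(plain, either_case, bounded, no_letters_around)  # 5 6 3 3
```

The first two patterns over-match (false positives inside other words), and the first also misses the sentence-initial "The"; the last two find exactly the three determiners.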

Summary
• Regular expressions are a cryptic but powerful language for matching strings and extracting elements from those strings
• Regular expressions have special characters that indicate intent