Advanced Algorithms 240-426, Semester 1, 2021-2022

11. Regular Expressions
§ Regular expressions (REs, regexs) in Java.
§ There's a "Regex Extras" folder in the "Background" folder on the web site.
Adv. Algs: 11. Regexs

https://xkcd.com/208/

Overview
§ 1. What are Regular Expressions?
§ 2. Basic RE Operators
§ 3. Builtin Character Classes
§ 4. Boundary Matchers
§ 5. (Greedy) Quantifiers
§ 6. Three Types of Quantifiers
§ 7. Capturing Groups
§ 8. Escaping Metachars
§ 9. Regexs in String Methods
§ 10. Look-ahead & Behind
§ 11. Implementing grep
§ 12. More Info.

1. What are Regular Expressions?
§ A regular expression (RE; regex) is a pattern used to search through text.
§ REs bring enormous power to string search (and editing).
§ Look back at my "Discrete Math" notes on REs and UNIX grep.

A Test Rig

    import java.util.regex.*;

    public class TestRegex
    {
      public static void main(String[] args)
      {
        if (args.length != 2) {
          System.out.println("Usage: java TestRegex string regExp");
          System.exit(0);
        }
        System.out.println("Input: \"" + args[0] + "\"");
        System.out.println("Regular expression: \"" + args[1] + "\"\n");

        Pattern p = Pattern.compile(args[1]);
        Matcher m = p.matcher(args[0]);
        int count = 0;
        while (m.find()) {
          // m.start() is inclusive, m.end() is exclusive
          System.out.println("Match \"" + m.group() + "\" at positions " +
                                 m.start() + "-" + (m.end()-1));
          count++;
        }
        if (count == 0)
          System.out.println("No matches found");
      }
    }  // end of TestRegex class

    Text:      N  o  w  ☐  i  s  ...
    Position:  0  1  2  3  4  5

§ m.end() is the position after the end of the match

2. Basic RE Operators
    abc      exactly this sequence of three letters
    a|b|c    a, b, or c
    [abc]    any one of the letters a, b, or c
    .        any one character (except '\n')
    a*       a 0 or more times
    a+       a 1 or more times
§ .* means 0 or more of any chars
§ .+ means 1 or more of any chars

    [a-z]          any one character from a through z
    [a-zA-Z0-9]    any one letter or digit
    [^abc]         any character except one of the letters a, b, or c
§ The set of characters defined by [ ] is called a character class.
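A minimal sketch of the operators and character classes above, using java.util.regex directly. The class name OperatorDemo, the helper findAll() and the sample text are made up for illustration; they are not part of the course code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OperatorDemo {
  // Collect every non-overlapping match of regex in text, left to right.
  static List<String> findAll(String regex, String text) {
    List<String> found = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find())
      found.add(m.group());
    return found;
  }

  public static void main(String[] args) {
    String text = "abc cab 42";                      // hypothetical input
    System.out.println(findAll("abc", text));        // exact sequence
    System.out.println(findAll("a|b", text));        // alternation
    System.out.println(findAll("[abc]+", text));     // class, 1 or more times
    System.out.println(findAll("[^abc ]", text));    // negated class
    System.out.println(findAll("[a-z0-9]+", text));  // ranges inside a class
  }
}
```

Here [abc]+ should pick out the two words and [^abc ] the two digits, since the negated class excludes the letters a, b, c and the space.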

Examples

§ m.end()-1 is wrong when the match is empty
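A small sketch of why this happens: for an empty match, start() and end() are equal, so end()-1 is one less than the start position. The class EmptyMatchDemo and its helper firstSpan() are illustrative names, not from the course code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
  // Describe the first match as "start-end", where end is exclusive.
  static String firstSpan(String regex, String text) {
    Matcher m = Pattern.compile(regex).matcher(text);
    if (!m.find())
      return "none";
    return m.start() + "-" + m.end();
  }

  public static void main(String[] args) {
    // a* can match zero characters, so it matches the empty string before 'b'
    Matcher m = Pattern.compile("a*").matcher("baa");
    while (m.find()) {
      boolean empty = (m.start() == m.end());
      System.out.println("\"" + m.group() + "\" start=" + m.start() +
                         " end=" + m.end() + " end-1=" + (m.end() - 1) +
                         (empty ? "  (empty match)" : ""));
    }
  }
}
```

One possible fix in a test rig is to check m.start() == m.end() and report the match as empty instead of printing a position range.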


3. Builtin Character Classes
    \d    a digit: [0-9]
    \D    a non-digit: [^0-9]
    \s    a whitespace character: [ \t\n\x0B\f\r]   (notice the space)
    \S    a non-whitespace character: [^\s]
    \w    a word character: [a-zA-Z_0-9]
    \W    a non-word character: [^\w]
§ \r (carriage return), \f (formfeed) and \x0B (vertical tab) are leftovers from the days of teletype writers

Examples

Using "\" in code
§ Inside Java code you need to "double escape" the RE backslashes: \\d \\D \\s \\S \\w \\W

    // match against a digit followed by a word
    Pattern p = Pattern.compile("\\d+\\w+");
    Matcher m = p.matcher("this is the 1st test string");
    if (m.find())
      System.out.println("matched [" + m.group() + "] from " +
                             m.start() + " to " + m.end());
    else
      System.out.println("didn't match");

    matched [1st] from 12 to 15

4. Some Boundary Matchers
    ^     the beginning of a line
    $     the end of a line
    \b    a word boundary
    \B    not a word boundary
§ written as \\b, \\B inside Java code

Start and End Line Examples
§ The "^" and the "$" do not appear in the text.

Word Boundary Examples
§ A word boundary is zero length.
§ The examples distinguish the word "man" from the text "man" in "superman".
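A minimal sketch of the boundary matchers, counting matches in a made-up sentence. The class BoundaryDemo, the helper count() and the sample text are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
  // Count the non-overlapping matches of regex in text.
  static int count(String regex, String text) {
    Matcher m = Pattern.compile(regex).matcher(text);
    int n = 0;
    while (m.find())
      n++;
    return n;
  }

  public static void main(String[] args) {
    String text = "a man and superman";                // hypothetical input
    System.out.println(count("man", text));            // the text "man" anywhere
    System.out.println(count("\\bman\\b", text));      // only the word "man"
    System.out.println(count("\\Bman\\b", text));      // "man" ending a longer word
    System.out.println(count("^a", text));             // "a" at the start of the line
    System.out.println(count("man$", text));           // "man" at the end of the line
  }
}
```

Because the boundaries are zero length, the matched text is still just "man" in each case; the boundary only restricts where the match may occur.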

5. (Greedy) Quantifiers
§ X represents some pattern:
    X?        X occurs 0 or 1 time
    X*        X occurs zero or more times
    X+        X occurs one or more times
    X{n}      X occurs exactly n times
    X{n,}     X occurs n or more times
    X{n,m}    X occurs at least n but not more than m times

Example of Basic Quantifiers
§ "or more" is greedy in the sense that the regex matches the most

Numerical Quantifiers
§ greedy in the sense that the regex matches the most
§ why only 1 match?
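A short sketch of greedy numerical quantifiers on a made-up input. It shows one way a pattern can end up with a single match: a greedy X{n,m} takes as many repetitions as it can, which may leave too little text for a second match. The class QuantifierDemo and helper findAll() are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
  // Collect every non-overlapping match of regex in text.
  static List<String> findAll(String regex, String text) {
    List<String> found = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find())
      found.add(m.group());
    return found;
  }

  public static void main(String[] args) {
    String text = "aaaa";                            // hypothetical input
    System.out.println(findAll("a{2}", text));       // exactly 2: two matches
    System.out.println(findAll("a{2,3}", text));     // takes 3, leaving one 'a'
    System.out.println(findAll("a{2,}", text));      // takes all 4
    System.out.println(findAll("ab?", "ab a abb"));  // optional b
  }
}
```

With a{2,3} the first match consumes three of the four a's, and the single a that remains is too short to match again.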

6. Three Types of Quantifiers
§ 1. A greedy quantifier will match as much as it can, and back off if it needs to
  § see examples on previous slides
§ 2. A reluctant quantifier will match as little as possible, then take more if it needs to
  § you make a quantifier reluctant by adding a ?:
    X??   X*?   X+?   X{n}?   X{n,m}?

§ 3. A possessive quantifier will match as much as it can, and never lets it go
  § you make a quantifier possessive by appending a +:
    X?+   X*+   X++   X{n}+   X{n,m}+

Searching for an aardvark: greedy, reluctant, possessive
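A small sketch comparing the three quantifier types on the text "aardvark", using the patterns a*ardvark, a*?ardvark and a*+ardvark. The class AardvarkDemo and helper found() are illustrative names, not from the course code.

```java
import java.util.regex.Pattern;

class AardvarkDemo {
  // True if regex matches somewhere in text.
  static boolean found(String regex, String text) {
    return Pattern.compile(regex).matcher(text).find();
  }

  public static void main(String[] args) {
    String text = "aardvark";
    // greedy: a* takes "aa", then gives one 'a' back so "ardvark" can match
    System.out.println("greedy     a*ardvark  : " + found("a*ardvark", text));
    // reluctant: a*? starts with nothing, then takes one 'a' when it has to
    System.out.println("reluctant  a*?ardvark : " + found("a*?ardvark", text));
    // possessive: a*+ takes "aa" and never gives anything back
    System.out.println("possessive a*+ardvark : " + found("a*+ardvark", text));
  }
}
```

The greedy and reluctant patterns both succeed, by different routes; the possessive one fails because nothing it has taken can be returned to the rest of the pattern.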

Aardvark Explained
§ The text is "aardvark".

"greedy with backtracking"
§ 1. Using the pattern a*ardvark (a* is greedy)
  § the a* will first match aa, but then ardvark won't match
  § the a* then "backs off" and matches only a single a, allowing the rest of the pattern (ardvark) to succeed

"not greedy with backtracking"
§ 2. Using the pattern a*?ardvark (a*? is reluctant)
  § the a*? will first match zero characters (the null string), but then ardvark won't match
  § the a*? then extends and matches the first a, allowing the rest of the pattern (ardvark) to succeed

"greedy with no backtracking"
§ 3. Using the pattern a*+ardvark (a*+ is possessive)
  § the a*+ will match the aa, and will not back off, so ardvark never matches and the pattern match fails

7. Capturing Groups
§ Parentheses are used for grouping, but they also capture (keep for later use) anything matched by that part of the pattern.
§ Example: ([a-zA-Z]*)([0-9]*) matches any number of letters followed by any number of digits
§ If the match succeeds:
  § \1 holds the matched letters
  § \2 holds the matched digits
  § \0 holds everything matched by the entire pattern

Example

    import java.util.regex.*;

    public class Groups
    {
      public static void main(String args[])
      {
        String line = "James Bond is 007, or is he?";
        String pattern = "(\\D+)(\\d+)(\\D+)";

        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(line);

        int groupCount = m.groupCount();
        System.out.println("Number of groups = " + groupCount);

        if (m.find()) {
          // group 0 is the whole match; groups 1..groupCount are the captures
          for (int i = 0; i <= groupCount; i++)
            System.out.println("Group " + i + ": " + m.group(i));
        }
        else
          System.out.println("No matches found");
      }
    }

Using Groups in a Regex
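One way a captured group can be reused inside the same regex is a back-reference: \1 (written "\\1" in Java) matches whatever group 1 captured. The sketch below looks for a repeated word and also uses $1 in a replacement string. The class BackRefDemo, its method names and the sample sentence are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackRefDemo {
  // A word, some whitespace, then the same word again.
  static final String REPEAT = "\\b(\\w+)\\s+\\1\\b";

  // Return the first word that appears twice in a row, or null if none does.
  static String firstRepeatedWord(String text) {
    Matcher m = Pattern.compile(REPEAT).matcher(text);
    return m.find() ? m.group(1) : null;
  }

  // Replace each doubled word with a single copy; $1 refers to group 1.
  static String collapseRepeats(String text) {
    return text.replaceAll(REPEAT, "$1");
  }

  public static void main(String[] args) {
    String text = "this is is a test";   // hypothetical input
    System.out.println(firstRepeatedWord(text));
    System.out.println(collapseRepeats(text));
  }
}
```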

8. Escaping Metacharacters
§ A lot of special characters (parentheses, brackets, braces, stars, the plus, etc.) are used in REs
  § they are called metacharacters
§ To switch off a metacharacter, use "\"
  § e.g. "\+"
§ But inside Java code use "\\"
  § e.g. "\\+"

Confusion
§ The problem is that '\' is used in two different ways inside Java.
§ 1. Java strings use '\' as an escape character to switch off the special meaning of the next character
  § e.g. \+ means '+'
§ 2. Inside regexs '\' is used to prefix special character classes
  § e.g. \d means "a digit character"
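A small sketch of the two levels of escaping. The Java compiler processes the string literal first, so the regex engine only sees what is left: the literal "\\d" is the two characters \d. Note that "\+" on its own is not a legal escape in a Java string literal, which is why the regex \+ has to be written "\\+". The class EscapeDemo and helper wholeMatch() are illustrative names; Pattern.quote() is the standard library call for escaping a whole string.

```java
import java.util.regex.Pattern;

class EscapeDemo {
  // True if regex matches the entire text.
  static boolean wholeMatch(String regex, String text) {
    return Pattern.matches(regex, text);
  }

  public static void main(String[] args) {
    // After the compiler handles the literal, the regex is the 2 chars \d
    System.out.println("\\d".length());

    // Unescaped: + is a quantifier, so "1+1" means one or more 1s, then a 1
    System.out.println(wholeMatch("1+1", "111"));
    System.out.println(wholeMatch("1+1", "1+1"));

    // Escaped: \+ is a literal plus sign
    System.out.println(wholeMatch("1\\+1", "1+1"));

    // Pattern.quote() switches off every metacharacter in its argument
    System.out.println(wholeMatch(Pattern.quote("1+1"), "1+1"));
  }
}
```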

9. Regexs in String Methods
§ The String class contains some methods that make it unnecessary to create Pattern and Matcher objects (as in TestRegex.java):
  § matches()
  § split()
  § replaceAll() and replaceFirst()

matches()

    public class Matching
    {
      public static void main(String args[])
      {
        String line = new String("The cat sat on the mat");
        System.out.println("Line: \"" + line + "\"");

        String pat = "cat";
        System.out.println("Pattern: \"" + pat + "\" matches: " + line.matches(pat));

        pat = ".*cat.*";
        System.out.println("Pattern: \"" + pat + "\" matches: " + line.matches(pat));

        pat = ".*dog.*";
        System.out.println("Pattern: \"" + pat + "\" matches: " + line.matches(pat));
      }
    }

§ tricky since matches() is NOT grep, and so the pattern must specify the entire line
§ .* means 0 or more of any character

split()
§ One common use is to split a sentence into an array of words (tokens).

    public class Splitter
    {
      public static void main(String args[])
      {
        String line = "James Bond is 007, or is he?";

        String[] tokens = line.split("\\s+");      // first approach
        // String[] tokens = line.split("\\W+");   // second

        System.out.println("Number of tokens = " + tokens.length);
        for (String tok : tokens)
          System.out.println(tok);
      }
    }

Example
§ Input: "James Bond is 007, or is he?"
§ first: "\\s+"
§ second: "\\W+"
§ Not the same!! Look for the punctuation
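A minimal sketch that runs both split patterns on the same sentence so the difference in punctuation can be seen side by side. The class SplitDemo and its two method names are made up for illustration.

```java
import java.util.Arrays;

class SplitDemo {
  // Split on runs of whitespace: punctuation stays attached to the words.
  static String[] bySpace(String line) {
    return line.split("\\s+");
  }

  // Split on runs of non-word characters: punctuation is treated as a separator.
  static String[] byNonWord(String line) {
    return line.split("\\W+");
  }

  public static void main(String[] args) {
    String line = "James Bond is 007, or is he?";
    System.out.println(Arrays.toString(bySpace(line)));
    System.out.println(Arrays.toString(byNonWord(line)));
  }
}
```

Both calls give seven tokens here, but only the first keeps "007," and "he?" with their punctuation. The trailing "?" does not produce an extra empty token in the second case because split() drops trailing empty strings.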

Replacing Text
§ In the String class:
  § String replaceAll(String regex, String new-text);
  § String replaceFirst(String regex, String replacement)

Example

    public class Replacer
    {
      public static void main(String[] args)
      {
        String line = "Java provides java tutorials";
        System.out.println("String: \"" + line + "\"");

        String line1 = line.replaceAll("[Jj]ava", "JAVA");   // pattern, new text
        System.out.println("Modified string: \"" + line1 + "\"");
      }  // end of main()
    }  // end of Replacer class

10. Look-ahead & Look-behind
§ A look-ahead expression looks forward, starting from its location in the pattern.
§ A look-behind expression looks backward, ending at its location in the pattern.
§ These patterns do not capture values.
§ They only succeed or fail, depending on whether a match is possible or not.

Positive & Negative Look-ahead
§ (?=X)   positive look-ahead for X
§ (?!X)   negative look-ahead for X (i.e. look for no X)
§ Examples:
  § q(?=u)
    § Is there a "q" that is followed by a "u"?
    § "u" is not part of the match.
  § q(?!u)
    § Is there a "q" not followed by a "u"?
    § "u" is not part of the match.
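A short sketch of the two look-ahead patterns q(?=u) and q(?!u), reporting where each one matches in a made-up string. The class LookAheadDemo, the helper starts() and the sample text are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookAheadDemo {
  // Return the start index of every match of regex in text.
  static List<Integer> starts(String regex, String text) {
    List<Integer> positions = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find())
      positions.add(m.start());
    return positions;
  }

  public static void main(String[] args) {
    String text = "quit qatar iraq";                  // hypothetical input
    System.out.println(starts("q(?=u)", text));       // q followed by u
    System.out.println(starts("q(?!u)", text));       // q not followed by u

    // The look-ahead is zero length: only the "q" itself is matched
    Matcher m = Pattern.compile("q(?=u)").matcher(text);
    if (m.find())
      System.out.println("matched \"" + m.group() + "\"");
  }
}
```

The negative look-ahead also succeeds for the final "q", since at the end of the text there is no "u" after it.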

Examples

Look-ahead (?=X) Further

Positive & Negative Look-behind
§ (?<=X)   positive look-behind for X
§ (?<!X)   negative look-behind for X (i.e. look for no X)
§ Examples:
  § (?<=a)b
    § Is there a "b" that is preceded by an "a"?
    § "a" is not part of the match.
  § (?<!a)b
    § Is there a "b" not preceded by an "a"?
    § "a" is not part of the match.
§ Note the ordering has changed
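A matching sketch for the look-behind patterns (?<=a)b and (?<!a)b. As before, the class LookBehindDemo, the helper starts() and the sample text are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookBehindDemo {
  // Return the start index of every match of regex in text.
  static List<Integer> starts(String regex, String text) {
    List<Integer> positions = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find())
      positions.add(m.start());
    return positions;
  }

  public static void main(String[] args) {
    String text = "cab bob abba";                     // hypothetical input
    System.out.println(starts("(?<=a)b", text));      // b preceded by a
    System.out.println(starts("(?<!a)b", text));      // b not preceded by a
  }
}
```

Each reported index is the position of the "b" itself, because the look-behind part contributes no characters to the match.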

Examples

Look-behind (?<=X) Further

11. Implementing grep
§ Implementing grep is simple if we use Java's java.util.regex package (or String.matches()).
§ But it is quite easy to implement a small set of regex operators by only testing characters.
  § based on sgrep.c described in 'The Practice of Programming', ch 9.2, by Brian W. Kernighan and Rob Pike
  § the code and an extract of this section (grepPop.pdf) are on the course website

§ SGrep.java implements a regular expression matcher that supports the operators:
    a    matches the character a
    .    matches any character
    ^    matches the beginning of the input string
    $    matches the end of the input string
    *    matches 0 or more occurrences of the previous character (e.g. a*)

Test file: data.txt

Using SGrep.java

Code Extracts

    private static void readLines(String regexp, BufferedReader br) throws IOException
    // apply regexp to each line of the input file
    {
      String line;
      while ((line = br.readLine()) != null) {
        if (match(regexp, 0, line, 0))
          System.out.println(line);
      }
    }  // end of readLines()

§ loop through text by incrementing tIdx

§ deal with the operator in regexp at position rIdx

§ a* is "greedy with backtracking". So tIdx-- is used to reduce the size of the match made in the earlier while loop
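A rough sketch of how such a matcher could be put together, following the structure of Kernighan and Pike's matcher but with a greedy, backtracking star. Only the call shape match(regexp, 0, line, 0) and the names tIdx and rIdx come from these notes; the class name SGrepSketch and all the method bodies are assumptions, not the course's SGrep.java.

```java
class SGrepSketch {
  // Search for regexp anywhere in text, trying each start position in turn.
  static boolean match(String regexp, int rIdx, String text, int tIdx) {
    if (rIdx < regexp.length() && regexp.charAt(rIdx) == '^')
      return matchHere(regexp, rIdx + 1, text, tIdx);   // anchored at the start
    do {   // loop through text by incrementing tIdx (runs even for empty text)
      if (matchHere(regexp, rIdx, text, tIdx))
        return true;
      tIdx++;
    } while (tIdx <= text.length());
    return false;
  }

  // Does regexp (from rIdx) match text starting exactly at tIdx?
  static boolean matchHere(String regexp, int rIdx, String text, int tIdx) {
    if (rIdx == regexp.length())
      return true;                                      // pattern used up
    if (rIdx + 1 < regexp.length() && regexp.charAt(rIdx + 1) == '*')
      return matchStar(regexp.charAt(rIdx), regexp, rIdx + 2, text, tIdx);
    if (regexp.charAt(rIdx) == '$' && rIdx + 1 == regexp.length())
      return tIdx == text.length();                     // must be at end of text
    if (tIdx < text.length() &&
        (regexp.charAt(rIdx) == '.' || regexp.charAt(rIdx) == text.charAt(tIdx)))
      return matchHere(regexp, rIdx + 1, text, tIdx + 1);
    return false;
  }

  // Match c* greedily, then back off one character at a time.
  static boolean matchStar(char c, String regexp, int rIdx, String text, int tIdx) {
    int start = tIdx;
    while (tIdx < text.length() && (c == '.' || text.charAt(tIdx) == c))
      tIdx++;                                           // take as many c's as possible
    while (tIdx >= start) {
      if (matchHere(regexp, rIdx, text, tIdx))
        return true;
      tIdx--;                                           // shrink the c* match and retry
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(match("a*b", 0, "xaab", 0));
    System.out.println(match("^ab", 0, "cab", 0));
  }
}
```

The second loop in matchStar() is where the backtracking happens: if the rest of the pattern fails after the longest run of c, the run is shortened by one character and the rest is tried again.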

Execution Examples
§ Look at "11. sgrep.pdf" for three short examples of the execution of the match() methods.
§ To save on writing in the notes:
  § m() means match()
  § mh() means matchHere()
  § ms() means matchStar()

12. More Information
§ There's a "Regex Extras" folder in the "Background" folder on the web site.
§ I explained REs in the "Discrete Maths" subject (using grep).
§ Very useful test site: https://regex101.com/
§ The Java tutorial on REs is very good:
  § https://docs.oracle.com/javase/tutorial/essential/regex/
§ Other tutorials:
  § http://ocpsoft.com/opensource/guide-to-regular-expressions-in-java-part-1/
  § and part-2

§ There are two cheat sheets on regexs on the course website.
§ The standard text on REs in different languages (including Java):
  § Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly, 2006

Regular Expression Puzzle (Optional)
§ Find a 10-letter word that uses English letters from the top row of the QWERTY keyboard.
§ Hint: my answer does not use 'q', but uses 't' twice.