Advanced Algorithms 240-426, Semester 1, 2021-2022





11. Regular Expressions
Regular expressions (REs, regexs) in Java.
There's a "Regex Extras" folder in the "Background" folder on the web site.
Adv. Algs: 11. Regexs

https://xkcd.com/208/

Overview
§ 1. What are Regular Expressions?
§ 2. Basic RE Operators
§ 3. Builtin Character Classes
§ 4. Boundary Matchers
§ 5. (Greedy) Quantifiers
§ 6. Three Types of Quantifiers
§ 7. Capturing Groups
§ 8. Escaping Metachars
§ 9. Regexs in String Methods
§ 10. Look-ahead & Behind
§ 11. Implementing grep
§ 12. More Info.

1. What are Regular Expressions?
§ A regular expression (RE; regex) is a pattern used to search through text.
§ REs bring enormous power to string search (and editing).
§ Look back at my "Discrete Math" notes on REs and UNIX grep.
![A Test Rig](https://slidetodoc.com/presentation_image_h2/161ae6d85a116c69a8cfb315a390c3ab/image-5.jpg)
A Test Rig

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TestRegex {
      public static void main(String[] args) {
        if (args.length != 2) {
          System.out.println("Usage: java TestRegex string regExp");
          System.exit(0);
        }
        System.out.println("Input: \"" + args[0] + "\"");
        System.out.println("Regular expression: \"" + args[1] + "\"\n");

        Pattern p = Pattern.compile(args[1]);
        Matcher m = p.matcher(args[0]);
        int count = 0;
        while (m.find()) {   // step through every match in the input
          System.out.println("Match \"" + m.group() + "\" at positions "
                              + m.start() + "-" + (m.end()-1));
          count++;
        }
        if (count == 0)
          System.out.println("No matches found");
      }
    } // end of TestRegex class

Positions in the text "Now is...": N=0, o=1, w=2, space=3, i=4, s=5.
m.end() is the position after the end of the match.
![2. Basic RE Operators](https://slidetodoc.com/presentation_image_h2/161ae6d85a116c69a8cfb315a390c3ab/image-6.jpg)
2. Basic RE Operators

    abc      exactly this sequence of three letters
    a|b|c    a, b, or c
    [abc]    any one of the letters a, b, or c
    .        any one character (except '\n')
    a*       a 0 or more times
    a+       a 1 or more times

.* means 0 or more of any chars
.+ means 1 or more of any chars
![Character classes](https://slidetodoc.com/presentation_image_h2/161ae6d85a116c69a8cfb315a390c3ab/image-7.jpg)

    [a-z]          any one character from a through z
    [a-zA-Z0-9]    any one letter or digit
    [^abc]         any character except one of the letters a, b, or c

§ The set of characters defined by [ ] is called a character class.
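The operators and character classes above can be tried out with a short sketch like the one below. The class name BasicOps, the helper findAll() and the sample text are assumptions made for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BasicOps {
    // Collect every non-overlapping match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find())
            found.add(m.group());
        return found;
    }

    public static void main(String[] args) {
        String text = "abc cab b9";   // hypothetical sample text
        System.out.println(findAll("abc", text));           // [abc]
        System.out.println(findAll("a|b|c", text));         // [a, b, c, c, a, b, b]
        System.out.println(findAll("[a-z]+", text));        // [abc, cab, b]
        System.out.println(findAll("[a-zA-Z0-9]+", text));  // [abc, cab, b9]
        System.out.println(findAll("[^abc ]+", text));      // [9]
        System.out.println(findAll("c.b", text));           // [cab]
    }
}
```

Note that a|b|c and [abc] find the same single characters; the character class is simply the shorter way to write it.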

Examples

m.end()-1 is wrong when the match is empty: for a zero-length match m.end() equals m.start(), so m.end()-1 is one less than the start position.
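A small sketch of the empty-match case, reporting each match as start-end with the end position exclusive. The class name EmptyMatches, the helper spans() and the sample text "bab" are illustrative assumptions, not part of the course code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatches {
    // Report each match as "start-end", where end is one past the last matched char.
    static List<String> spans(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find())
            out.add(m.start() + "-" + m.end());
        return out;
    }

    public static void main(String[] args) {
        // a* may match zero characters, so it also "matches" where there is no a
        System.out.println(spans("a*", "bab"));   // [0-0, 1-2, 2-2, 3-3]
        // a+ needs at least one a, so only the real a is reported
        System.out.println(spans("a+", "bab"));   // [1-2]
    }
}
```

For the span 0-0, printing m.end()-1 would give -1, which is why a rig that prints inclusive end positions looks wrong on empty matches.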

![3. Builtin Character Classes](https://slidetodoc.com/presentation_image_h2/161ae6d85a116c69a8cfb315a390c3ab/image-11.jpg)
3. Builtin Character Classes

    \d    a digit: [0-9]
    \D    a non-digit: [^0-9]
    \s    a whitespace character: [ \t\n\x0B\f\r]    (notice the space)
    \S    a non-whitespace character: [^\s]
    \w    a word character: [a-zA-Z_0-9]
    \W    a non-word character: [^\w]

\r (carriage return), \f (formfeed) and \x0B (vertical tab) are leftovers from the days of teletype writers.

Examples

Using "\" in code
§ Inside Java code you need to "double escape" the RE backslashes: \\d \\D \\s \\S \\w \\W

    // match one or more digits followed by one or more word characters
    Pattern p = Pattern.compile("\\d+\\w+");
    Matcher m = p.matcher("this is the 1st test string");
    if (m.find())
      System.out.println("matched [" + m.group() + "] from " +
                          m.start() + " to " + m.end());
    else
      System.out.println("didn’t match");

Output:

    matched [1st] from 12 to 15

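4. Some Boundary Matchers

    ^     the beginning of a line
    $     the end of a line
    \b    a word boundary
    \B    not a word boundary

\b and \B are written as \\b and \\B inside Java code. By default Java treats the whole input as one line for ^ and $; they match at every line only when Pattern.MULTILINE is set.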

Start and End Line Examples
The "^" and the "$" match positions, not characters, so they do not appear in the matched text.

Word Boundary Examples
§ A word boundary is zero length.
§ The examples distinguish the word "man" from the text "man" in "superman".
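The difference between the word "man" and the text "man" inside "superman" can be sketched as below. The class name WordBoundary, the helper starts() and the sample sentence are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundary {
    // Return the start position of every match of regex in text.
    static List<Integer> starts(String regex, String text) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find())
            out.add(m.start());
        return out;
    }

    public static void main(String[] args) {
        String text = "superman is a man";   // hypothetical sample text
        System.out.println(starts("man", text));         // [5, 14]  both occurrences
        System.out.println(starts("\\bman\\b", text));   // [14]     only the word "man"
        System.out.println(starts("\\Bman", text));      // [5]      only "man" inside a word
        System.out.println(starts("^superman", text));   // [0]      anchored at the start
        System.out.println(starts("man$", text));        // [14]     anchored at the end
    }
}
```

Because boundaries are zero length, the reported start positions are the same with or without \b; the boundary only decides whether a match is allowed there.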

5. (Greedy) Quantifiers
X represents some pattern:

    X?        X occurs 0 or 1 time
    X*        X occurs zero or more times
    X+        X occurs one or more times
    X{n}      X occurs exactly n times
    X{n,}     X occurs n or more times
    X{n,m}    X occurs at least n but not more than m times

Example of Basic Quantifiers
"or more" is greedy in the sense that the regex matches as much of the text as it can.

Numerical Quantifiers
These are also greedy in the sense that the regex matches as much of the text as it can.
Why only 1 match?
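One way to see why a greedy numerical quantifier can produce a single match is to run a few quantifiers over the same text. This is only a sketch: the class name Quantifiers, the helper findAll() and the text of six a's are assumptions, not the example from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Quantifiers {
    // Collect every non-overlapping match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find())
            found.add(m.group());
        return found;
    }

    public static void main(String[] args) {
        String text = "aaaaaa";   // six a's (hypothetical sample text)
        System.out.println(findAll("a{3}", text));     // [aaa, aaa]
        System.out.println(findAll("a{2,4}", text));   // [aaaa, aa]
        System.out.println(findAll("a{3,}", text));    // [aaaaaa]
        System.out.println(findAll("a+", text));       // [aaaaaa]
    }
}
```

With a{3,} the first match greedily consumes all six characters, so nothing is left for a second match.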

6. Three Types of Quantifiers
§ 1. A greedy quantifier will match as much as it can, and back off if it needs to
  § see examples on previous slides
§ 2. A reluctant quantifier will match as little as possible, then take more if it needs to
  § you make a quantifier reluctant by appending a ?:

    X??   X*?   X+?   X{n}?   X{n,m}?

§ 3. A possessive quantifier will match as much as it can, and never lets it go
  § you make a quantifier possessive by appending a +:

    X?+   X*+   X++   X{n}+   X{n,m}+

Searching for an aardvark: greedy, reluctant, possessive
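The three searches can be reproduced with a few lines of Java. The class name Aardvark and the helper found() are illustrative assumptions; the patterns are a greedy, a reluctant and a possessive a* in front of ardvark.

```java
import java.util.regex.Pattern;

class Aardvark {
    // True if regex matches somewhere in text (the same test as Matcher.find()).
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "aardvark";
        System.out.println(found("a*ardvark", text));    // true:  greedy a* backs off one a
        System.out.println(found("a*?ardvark", text));   // true:  reluctant a*? extends to one a
        System.out.println(found("a*+ardvark", text));   // false: possessive a*+ keeps both a's
    }
}
```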

Aardvark Explained
§ The text is "aardvark".

§ 1. Using the pattern a*ardvark (a* is greedy): "greedy with backtracking"
  § the a* will first match aa, but then ardvark won’t match
  § the a* then "backs off" and matches only a single a, allowing the rest of the pattern (ardvark) to succeed

§ 2. Using the pattern a*?ardvark (a*? is reluctant): "not greedy with backtracking"
  § the a*? will first match zero characters (the null string), but then ardvark won’t match
  § the a*? then extends and matches the first a, allowing the rest of the pattern (ardvark) to succeed

§ 3. Using the pattern a*+ardvark (a*+ is possessive): "greedy with no backtracking"
  § the a*+ will match the aa, and will not back off, so ardvark never matches and the pattern match fails

7. Capturing Groups
§ Parentheses are used for grouping, but they also capture (keep for later use) anything matched by that part of the pattern.
§ Example: ([a-zA-Z]*)([0-9]*) matches any number of letters followed by any number of digits
§ If the match succeeds:
  § group 1 (\1, or m.group(1) in Java) holds the matched letters
  § group 2 (\2, or m.group(2)) holds the matched digits
  § group 0 (m.group(0)) holds everything matched by the entire pattern
![Example: Groups](https://slidetodoc.com/presentation_image_h2/161ae6d85a116c69a8cfb315a390c3ab/image-27.jpg)
Example

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Groups {
      public static void main(String args[]) {
        String line = "James Bond is 007, or is he?";
        // non-digits, then digits, then non-digits
        String pattern = "(\\D+)(\\d+)(\\D+)";

        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(line);
        int groupCount = m.groupCount();   // number of groups in the pattern
        System.out.println("Number of groups = " + groupCount);

        if (m.find()) {
          for (int i = 0; i <= groupCount; i++)   // group 0 is the whole match
            System.out.println("Group " + i + ": " + m.group(i));
        }
        else
          System.out.println("No matches found");
      }
    }

Using Groups in a Regex
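One common way to use a group inside the regex itself is a backreference such as \1, which matches the same text that group 1 captured. The sketch below assumes that is the kind of use meant here; the class name RepeatedWord, the method firstRepeatedWord() and the sample sentences are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWord {
    // A word, some whitespace, then the same word again (\1 refers back to group 1).
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Return the first word that is immediately repeated, or null if there is none.
    static String firstRepeatedWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("Paris in the the spring"));   // the
        System.out.println(firstRepeatedWord("this is is a test"));         // is
        System.out.println(firstRepeatedWord("no repeats here"));           // null
    }
}
```

The \b at each end stops the pattern from matching the "is" inside "this" followed by the word "is".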

8. Escaping Metacharacters
§ A lot of special characters – parentheses, brackets, braces, stars, the plus, etc. – are used in REs
  § they are called metacharacters
§ To switch off a metacharacter, use "\"
  § e.g. "\+"
§ But inside Java code, use "\\"
  § e.g. "\\+"

Confusion
§ The problem is that '\' is used in two different ways inside regexs, and Java string literals use it as well.
§ 1. As an escape character, '\' switches off the special meaning of the next character
  § e.g. \+ means '+'
§ 2. Inside regexs '\' is also used to prefix special character classes
  § e.g. \d means "a digit character"
§ Java string literals treat '\' as their own escape character, which is why each regex backslash must be doubled in Java code.
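The two roles of the backslash can be checked with a short sketch. The class name Escaping and the helper countLiteral() are assumptions for illustration; Pattern.quote() is the standard library method that escapes a whole string so that none of its characters act as metacharacters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Escaping {
    // Count occurrences of a literal string, letting Pattern.quote() do the escaping.
    static int countLiteral(String literal, String text) {
        Matcher m = Pattern.compile(Pattern.quote(literal)).matcher(text);
        int n = 0;
        while (m.find())
            n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println("axb".matches("a.b"));     // true:  . matches any character
        System.out.println("axb".matches("a\\.b"));   // false: \. matches only a real dot
        System.out.println("a.b".matches("a\\.b"));   // true
        System.out.println("1+1=2".replaceAll("\\+", " plus "));   // 1 plus 1=2
        System.out.println(countLiteral("+", "1+1+1"));            // 2
    }
}
```

Pattern.quote() is convenient when the text to search for comes from user input and may contain metacharacters.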

9. Regexs in String Methods
§ The String class contains some methods that make it unnecessary to create Pattern and Matcher objects (as in TestRegex.java):
  § matches()
  § split()
  § replaceAll() and replaceFirst()
![matches()](https://slidetodoc.com/presentation_image_h2/161ae6d85a116c69a8cfb315a390c3ab/image-32.jpg)
matches()

    public class Matching {
      public static void main(String args[]) {
        String line = "The cat sat on the mat";
        System.out.println("Line: \"" + line + "\"");

        String pat = "cat";   // does not cover the whole line
        System.out.println("Pattern: \"" + pat + "\" matches: " + line.matches(pat));

        pat = ".*cat.*";      // covers the whole line
        System.out.println("Pattern: \"" + pat + "\" matches: " + line.matches(pat));

        pat = ".*dog.*";
        System.out.println("Pattern: \"" + pat + "\" matches: " + line.matches(pat));
      }
    }

This is tricky since matches() is NOT grep, and so the pattern must specify the entire line.
.* means 0 or more of any character.
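The difference between whole-line matching and grep-style searching can be sketched by putting String.matches() next to Matcher.find(). The class name MatchVsFind and the helper grepLike() are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

class MatchVsFind {
    // grep-style test: does regex match anywhere inside line?
    static boolean grepLike(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "The cat sat on the mat";
        System.out.println(line.matches("cat"));       // false: "cat" is not the entire line
        System.out.println(line.matches(".*cat.*"));   // true:  the pattern covers the entire line
        System.out.println(grepLike("cat", line));     // true:  find() searches inside the line
        System.out.println(grepLike("dog", line));     // false
    }
}
```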

split()
One common use is to split a sentence into an array of words (tokens).

    public class Splitter {
      public static void main(String args[]) {
        String line = "James Bond is 007, or is he?";
        String[] tokens = line.split("\\s+");     // first approach: split on whitespace
        // String[] tokens = line.split("\\W+");  // second: split on non-word characters
        System.out.println("Number of tokens = " + tokens.length);
        for (String tok : tokens)
          System.out.println(tok);
      }
    }

Example
§ The line is "James Bond is 007, or is he?"
§ first: "\\s+"
§ second: "\\W+"
§ Not the same!! Look for the punctuation: the first approach leaves "," and "?" attached to the tokens, the second removes them.
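The two approaches can be compared side by side with a sketch like this one. The class name SplitDemo and the helper tokens() are illustrative assumptions; the last call shows a related detail of String.split(), namely that a separator at the very start of the text produces an empty first token.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Split line on the separator regex and return the tokens as a list.
    static List<String> tokens(String line, String separatorRegex) {
        return Arrays.asList(line.split(separatorRegex));
    }

    public static void main(String[] args) {
        String line = "James Bond is 007, or is he?";
        System.out.println(tokens(line, "\\s+"));   // [James, Bond, is, 007,, or, is, he?]
        System.out.println(tokens(line, "\\W+"));   // [James, Bond, is, 007, or, is, he]

        // Leading whitespace yields an empty first token; trim() first to avoid it.
        System.out.println(tokens("  hello world", "\\s+").size());          // 3
        System.out.println(tokens("  hello world".trim(), "\\s+").size());   // 2
    }
}
```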

Replacing Text
§ In the String class:
  § String replaceAll(String regex, String replacement)
  § String replaceFirst(String regex, String replacement)
![Example: Replacer](https://slidetodoc.com/presentation_image_h2/161ae6d85a116c69a8cfb315a390c3ab/image-36.jpg)
Example

    public class Replacer {
      public static void main(String[] args) {
        String line = "Java provides java tutorials";
        System.out.println("String: \"" + line + "\"");

        String line1 = line.replaceAll("[Jj]ava", "JAVA");   // pattern, new text
        System.out.println("Modified string: \"" + line1 + "\"");
      } // end of main()
    } // end of Replacer class
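As a follow-up sketch, replaceFirst() changes only the first match, and the replacement text may refer to captured groups as $1, $2, and so on. The class name ReplaceDemo, the method swapNames() and the "Bond, James" text are assumptions for illustration.

```java
class ReplaceDemo {
    // Turn "Surname, Forename" into "Forename Surname" using group references.
    static String swapNames(String text) {
        return text.replaceAll("(\\w+), (\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        String line = "Java provides java tutorials";
        System.out.println(line.replaceAll("[Jj]ava", "JAVA"));     // JAVA provides JAVA tutorials
        System.out.println(line.replaceFirst("[Jj]ava", "JAVA"));   // JAVA provides java tutorials
        System.out.println(swapNames("Bond, James"));               // James Bond
    }
}
```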

10. Look-ahead & Look-behind
§ A Look-ahead expression looks forward, starting from its location in the pattern.
§ A Look-behind expression looks backward, ending at its location in the pattern.
§ These patterns do not capture values.
§ They only succeed or fail, depending on whether a match is possible or not.

Positive & Negative Look-ahead
§ (?=X)  positive look-ahead for X
§ (?!X)  negative look-ahead for X (i.e. look for no X)
§ Examples:
  § q(?=u)
    § Is there a "q" that is followed by a "u"?
    § "u" is not part of the match.
  § q(?!u)
    § Is there a "q" not followed by a "u"?
    § "u" is not part of the match.
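The q examples can be checked with a sketch that prints each match as start-end, with the end position exclusive, so it is visible that the "u" is tested but not consumed. The class name LookAhead, the helper spans() and the text "quit Iraq" are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookAhead {
    // Report each match as "start-end", where end is one past the last matched char.
    static List<String> spans(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find())
            out.add(m.start() + "-" + m.end());
        return out;
    }

    public static void main(String[] args) {
        String text = "quit Iraq";   // hypothetical sample text
        System.out.println(spans("qu", text));       // [0-2]  the u is part of the match
        System.out.println(spans("q(?=u)", text));   // [0-1]  the u is checked but not matched
        System.out.println(spans("q(?!u)", text));   // [8-9]  the q with no u after it
    }
}
```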

Examples

Look-ahead (?=X) Further

Positive & Negative Look-behind
§ (?<=X)  positive look-behind for X
§ (?<!X)  negative look-behind for X (i.e. look for no X)
§ Examples:
  § (?<=a)b
    § Is there a "b" that is preceded by an "a"?
    § "a" is not part of the match.
  § (?<!a)b
    § Is there a "b" not preceded by an "a"?
    § "a" is not part of the match.
§ Note the ordering has changed: the look-behind is written before the character being matched.
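The b examples can be checked in the same way. The class name LookBehind, the helper spans() and the text "abc cb b" are assumptions for illustration; each match is printed as start-end with the end position exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookBehind {
    // Report each match as "start-end", where end is one past the last matched char.
    static List<String> spans(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find())
            out.add(m.start() + "-" + m.end());
        return out;
    }

    public static void main(String[] args) {
        String text = "abc cb b";   // hypothetical sample text
        System.out.println(spans("ab", text));        // [0-2]       the a is part of the match
        System.out.println(spans("(?<=a)b", text));   // [1-2]       only the b after an a
        System.out.println(spans("(?<!a)b", text));   // [5-6, 7-8]  the b's not after an a
    }
}
```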

Examples

Look-behind (?<=X) Further

11. Implementing grep
§ Implementing grep is simple if we use Java's java.util.regex package (or String.matches()).
§ But it is also quite easy to implement a small set of regex operators by only testing characters.
  § based on sgrep.c described in 'The Practice of Programming', ch. 9.2, by Brian W. Kernighan and Rob Pike
  § the code and an extract of this section (grepPop.pdf) are on the course website

§ SGrep.java implements a regular expression matcher that supports the operators:

    a    matches the character a
    .    matches any character
    ^    matches the beginning of the input string
    $    matches the end of the input string
    *    matches 0 or more occurrences of the previous character (e.g. a*)

Test file: data.txt

Using SGrep.java

Code Extracts

    private static void readLines(String regexp, BufferedReader br) throws IOException {
      // apply regexp to each line of the input file
      String line;
      while ((line = br.readLine()) != null) {
        if (match(regexp, 0, line, 0))
          System.out.println(line);
      }
    } // end of readLines()

match(): loop through text by incrementing tIdx

matchHere(): deal with the operator in regexp at position rIdx

matchStar(): a* is "greedy with backtracking". So tIdx-- is used to reduce the size of the match made in the earlier while loop.
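The three methods described above could look roughly like the sketch below. This is a reconstruction that follows the structure of the Kernighan and Pike matcher, not the course's SGrep.java: the class name MiniGrep is hypothetical, and match() here takes only the regexp and the text, which differs from the four-argument call in readLines().

```java
class MiniGrep {
    // Search for regexp anywhere in text.
    static boolean match(String regexp, String text) {
        if (regexp.startsWith("^"))
            return matchHere(regexp, 1, text, 0);   // anchored: only try position 0
        // loop through text by incrementing tIdx (tIdx == length allows empty matches)
        for (int tIdx = 0; tIdx <= text.length(); tIdx++)
            if (matchHere(regexp, 0, text, tIdx))
                return true;
        return false;
    }

    // Does regexp from rIdx onwards match text starting exactly at tIdx?
    static boolean matchHere(String re, int rIdx, String text, int tIdx) {
        if (rIdx == re.length())
            return true;                            // pattern used up: success
        if (rIdx + 1 < re.length() && re.charAt(rIdx + 1) == '*')
            return matchStar(re.charAt(rIdx), re, rIdx + 2, text, tIdx);
        if (re.charAt(rIdx) == '$' && rIdx + 1 == re.length())
            return tIdx == text.length();           // $ only matches at the end
        if (tIdx < text.length()
                && (re.charAt(rIdx) == '.' || re.charAt(rIdx) == text.charAt(tIdx)))
            return matchHere(re, rIdx + 1, text, tIdx + 1);
        return false;
    }

    // Match c* followed by the rest of the regexp (from rIdx), starting at tIdx.
    static boolean matchStar(char c, String re, int rIdx, String text, int tIdx) {
        int start = tIdx;
        while (tIdx < text.length() && (c == '.' || text.charAt(tIdx) == c))
            tIdx++;                                 // greedy: take as many c's as possible
        do {
            if (matchHere(re, rIdx, text, tIdx))
                return true;
        } while (tIdx-- > start);                   // back off one character at a time
        return false;
    }

    public static void main(String[] args) {
        System.out.println(match("a*ardvark", "aardvark"));   // true
        System.out.println(match("^abc$", "xabc"));           // false
        System.out.println(match("b.*d", "abcde"));           // true
    }
}
```

The do-while in matchStar() tries the longest run first and then shorter ones, which is the "greedy with backtracking" behaviour seen earlier with a*ardvark.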

Execution Examples
§ Look at "11. sgrep.pdf" for three short examples of the execution of the match() methods.
§ To save on writing in the notes:
  § m() means match()
  § mh() means matchHere()
  § ms() means matchStar()

12. More Information
There's a "Regex Extras" folder in the "Background" folder on the web site.
§ I explained REs in the "Discrete Maths" subject (using grep).
§ Very useful test site: https://regex101.com/
§ The Java tutorial on REs is very good:
  § https://docs.oracle.com/javase/tutorial/essential/regex/
§ Other tutorials:
  § http://ocpsoft.com/opensource/guide-to-regular-expressions-in-java-part-1/
  § and part-2

§ There are two cheat sheets on regexs on the course website.
§ The standard text on REs in different languages (including Java):
  § Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly, 2006

Regular Expression Puzzle (Optional)
Find a 10-letter word that uses only English letters from the top row of the QWERTY keyboard. Hint: my answer does not use 'q', but uses 't' twice.