Regular Expression (1)

Learning Objectives:
1. To understand the concept of regular expressions
2. To learn commonly used operations involving regular expressions / pattern matching
3. To learn the special cases that occur in regular expressions / pattern matching

COMP 111 Lecture 16 / Slide 2

Simple Uses of Regular Expressions

§ In Perl, we can make Shakespeare a regular expression by enclosing it in slashes:

    if(/Shakespeare/){
        print $_;
    }

§ What is tested in the if-statement? Answer: $_.
§ Can you write an even shorter statement using &&?

Simple Uses of Regular Expressions

    if(/Shakespeare/){
        print $_;
    }

§ The previous example tests only one line, and prints the line if it contains Shakespeare.
§ To work on all lines, add a loop:

    while(<>){
        if(/Shakespeare/){
            print;
        }
    }

Simple Uses of Regular Expressions

§ What if we are not sure how to spell Shakespeare?
§ The first part, Shak, is certainly easy, and there must be an r near the end.
§ How can we express our idea?

    grep:
        grep "Shak.*r" movie > result

    Perl:
        while(<>){
            if(/Shak.*r/){
                print;
            }
        }

§ .* means “zero or more of any character”.

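To see what /Shak.*r/ accepts, it can help to run it over a few candidate spellings. The following is only a sketch: the matches_shak helper and the list of spellings are made up for illustration and do not come from the movie file.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains Shak, then any
# characters (possibly none), then an r.
sub matches_shak {
    my ($text) = @_;
    return $text =~ /Shak.*r/ ? 1 : 0;
}

# Made-up candidate spellings.
my @spellings = ("Shakespeare", "Shakspere", "Shakr", "Shakes", "shakespeare");

for my $s (@spellings) {
    printf "%-12s %s\n", $s, matches_shak($s) ? "matches" : "no match";
}
```

Note that Shakr matches because .* may match zero characters, while shakespeare does not because the pattern is case sensitive.
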
Single-Character Patterns

§ The dot “.” matches any single character except the newline (\n).
§ For example, the pattern /a./ matches any two-character sequence that starts with a and is not “a\n”.
§ Use \. if you really want to match the period.

    $ cat test
    hi
    hi bob.
    $ cat sub3
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(/\./){
            print;
        }
    }
    $ sub3 test
    hi bob.
    $

Single-Character Groups (1)

§ If you want to specify one out of a group of characters to match, use [ ]:

    /[abcde]/

  This matches a string containing any one of the first 5 lowercase letters, while:

    /[aeiouAEIOU]/

  matches any of the 5 vowels in either upper or lower case.

Single-Character Groups (2)

§ If you want ] in the group, put a backslash before it, or put it as the first character in the list:

    /[abcde]]/    # matches [abcde] + ]
    /[abcde\]]/   # okay
    /[]abcde]/    # also okay

§ Use - for ranges of characters (like a through z):

    /[0123456789]/   # any single digit
    /[0-9]/          # same

§ If you want - in the list, put a backslash before it, or put it at the beginning/end:

    /[X-Z]/    # matches X, Y, Z
    /[X\-Z]/   # matches X, -, Z
    /[XZ-]/    # matches X, Z, -
    /[-XZ]/    # matches -, X, Z

Single-Character Groups (3)

§ More range examples:

    /[0-9\-]/        # match 0-9, or minus
    /[0-9a-z]/       # match any digit or lowercase letter
    /[a-zA-Z0-9_]/   # match any letter, digit, underscore

§ There is also a negated character group, which starts with a ^ immediately after the left bracket. This matches any single character not in the list.

    /[^0123456789]/   # match any single non-digit
    /[^0-9]/          # same
    /[^aeiouAEIOU]/   # match any single non-vowel
    /[^\^]/           # match any single character except ^

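One way to check a character group is to count how many characters of a string it matches. This is a small sketch under assumptions: count_in_group and the sample string are invented for this example, and the /g option (match repeatedly) is not covered in these slides.

```perl
use strict;
use warnings;

# Hypothetical helper: count the characters of $text matched by the group.
sub count_in_group {
    my ($text, $group) = @_;
    my $count = () = $text =~ /$group/g;   # /g finds every match in turn
    return $count;
}

my $sample = "Room 42-B";   # made-up sample string

print "digits:          ", count_in_group($sample, qr/[0-9]/), "\n";
print "non-digits:      ", count_in_group($sample, qr/[^0-9]/), "\n";
print "digits or minus: ", count_in_group($sample, qr/[0-9\-]/), "\n";
print "word characters: ", count_in_group($sample, qr/[a-zA-Z0-9_]/), "\n";
```

A group and its negation always split the string between them, so the first two counts add up to the length of the string.
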
Single-Character Groups (4)

§ For convenience, some common character groups are predefined:

    Predefined Group                   Negated Group
    \d (a digit)      [0-9]            \D (non-digit)   [^0-9]
    \w (word char)    [a-zA-Z0-9_]     \W (non-word)    [^a-zA-Z0-9_]
    \s (space char)   [ \t\n]          \S (non-space)   [^ \t\n]

§ \d matches any digit
§ \w matches any letter, digit, underscore
§ \s matches any space, tab, newline
§ You can use these predefined groups in other groups:

    /[\da-fA-F]/   # match any hexadecimal digit

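The hexadecimal group can be combined with negation to validate a whole string: if no character falls outside [\da-fA-F], every character is a hexadecimal digit. A minimal sketch, assuming a made-up helper name is_hex and made-up test strings:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains no character outside the
# hexadecimal digits. An empty string also passes this check.
sub is_hex {
    my ($text) = @_;
    return $text =~ /[^\da-fA-F]/ ? 0 : 1;
}

for my $candidate ("1A3f", "00ff", "xyz", "12 34") {
    print "'$candidate': ", (is_hex($candidate) ? "hex" : "not hex"), "\n";
}
```
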
Split (1)

§ The split function allows you to break a string into fields.
§ split takes a regular expression and a string, and breaks up the string wherever the pattern occurs.

    $ cat split1
    #!/usr/local/bin/perl5 -w
    $line = "Bill Shakespeare in love with Bill Gates";
    @fields = split(/ /, $line);   # split $line using space as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split1
    Bill love Gates
    $

Split (2)

§ You can use $_ with split.
§ By default, split works on $_ and looks for whitespace delimiters.

    $ cat split2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    @fields = split;   # split $_ using whitespace (default) as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split2
    Bill love Gates
    $

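The two forms of split give the same result on the sentence above, but they can differ when the input has leading or repeated spaces. The sketch below uses a made-up input string; split(' ', $line) is written out explicitly because it applies the same whitespace rule as a bare split on $_.

```perl
use strict;
use warnings;

# Made-up input with a leading space and a double space.
my $line = " Bill  Gates";

my @by_space   = split(/ /, $line);   # every single space is a delimiter
my @by_default = split(' ', $line);   # default rule: runs of whitespace, leading ones skipped

print scalar(@by_space), " fields with / /\n";
print scalar(@by_default), " fields with the default rule\n";
```

With / /, the leading space and the double space each produce an empty field, so the field positions shift; the default rule returns only Bill and Gates.
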
Pattern Memory (1)

§ How would we match a pattern that starts and ends with the same letter or word?
§ For this, we need to remember the pattern.
§ Use ( ) around any pattern to put that part of the string into memory (it has no effect on the pattern itself).
§ To recall memory, include a backslash followed by an integer.

    /Bill(.)Gates\1/

Pattern Memory (2)

§ Example:

    /Bill(.)Gates\1/

  This example matches Bill, followed by any single non-newline character, followed by Gates, followed by that same single character.
§ So, it matches:

    Bill!Gates!
    Bill-Gates-

  but not:

    Bill?Gates!
    Bill-Gates_

  (Note that /Bill.Gates./ would match all four.)

Pattern Memory (3)

§ More examples:

    /a(.)b(.)c\2d\1/

§ This example matches an a, a single character (#1), followed by b, another single character (#2), c, the character #2, d, and the character #1.
§ So it matches: a-b!c!d-.

Pattern Memory (4)

§ The reference part can have more than a single character.
§ For example:

    /a(.*)b\1c/

§ This example matches an a, followed by any number of characters (even zero), followed by b, followed by the same sequence of characters, followed by c.
§ So it matches aBillbBillc and abc, but not aBillbBillGatesc.

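The pattern-memory examples can be checked mechanically. The sketch below pairs each pattern with strings from the slides; the check helper and the use of qr// to store a pattern in a variable are additions for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text matches the stored pattern, else 0.
sub check {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? 1 : 0;
}

# Each entry: a label, the pattern, and a string to test.
my @cases = (
    [ 'Bill(.)Gates\1',  qr/Bill(.)Gates\1/,  "Bill!Gates!" ],
    [ 'Bill(.)Gates\1',  qr/Bill(.)Gates\1/,  "Bill?Gates!" ],
    [ 'a(.)b(.)c\2d\1',  qr/a(.)b(.)c\2d\1/,  "a-b!c!d-" ],
    [ 'a(.*)b\1c',       qr/a(.*)b\1c/,       "aBillbBillc" ],
    [ 'a(.*)b\1c',       qr/a(.*)b\1c/,       "abc" ],
    [ 'a(.*)b\1c',       qr/a(.*)b\1c/,       "aBillbBillGatesc" ],
);

for my $case (@cases) {
    my ($label, $pattern, $text) = @$case;
    printf "/%s/ on %-18s %s\n", $label, $text,
        check($pattern, $text) ? "matches" : "no match";
}
```

The abc case matches because (.*) can remember an empty string, and \1 then also matches nothing.
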
Or

§ How do we pick from a set of alternatives when the patterns have more than one character?
§ The following example matches either Gates or Clinton or Shakespeare:

    /Gates|Clinton|Shakespeare/

§ For single character alternatives, /[abc]/ is the same as /a|b|c/.

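Alternation works well as a filter over a list. A brief sketch, assuming a made-up list of names (Bill Murray is included only as a non-matching example):

```perl
use strict;
use warnings;

# Made-up list of names to filter.
my @names = ("Bill Gates", "Bill Clinton", "Bill Shakespeare", "Bill Murray");

# grep tests each element (in $_) against the pattern and keeps the matches.
my @found = grep { /Gates|Clinton|Shakespeare/ } @names;

print "$_\n" for @found;
```
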
Anchoring Patterns

§ Anchors require that the pattern be at the beginning or end of the line.
§ ^ matches the beginning of the line. In Perl 5, ^ remains an anchor wherever it appears in the pattern, so put a backslash before it to match a literal ^:

    /^Bill/    # match lines that begin with Bill
    /^Gates/   # match lines that begin with Gates
    /Bill\^/   # match lines containing Bill^ somewhere
    /\^/       # match lines containing ^

§ $ matches the end of the line when it is the last character of the pattern. A $ followed by a name is read as a scalar variable, and \$ matches a literal $:

    /Bill$/    # match lines that end with Bill
    /Gates$/   # match lines that end with Gates
    /$Bill/    # match with contents of scalar $Bill
    /\$/       # match lines containing $

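The difference between an anchor and an escaped character is easiest to see side by side. The sketch below reports which of four patterns match each line; describe_line and the sample lines are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: list which anchored or escaped patterns match $line.
sub describe_line {
    my ($line) = @_;
    my @hits;
    push @hits, "begins with Bill" if $line =~ /^Bill/;
    push @hits, "ends with Bill"   if $line =~ /Bill$/;
    push @hits, "contains \$"      if $line =~ /\$/;
    push @hits, "contains ^"       if $line =~ /\^/;
    return @hits ? join(", ", @hits) : "none";
}

# Made-up sample lines.
for my $line ("Bill Gates", "Gates, Bill", "Price: \$5", "Bill^2") {
    print "$line: ", describe_line($line), "\n";
}
```
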
Using =~ (1)

§ What if you want to match a different variable than $_?
§ Answer: Use =~.
§ Examples:

    $name = "Bill Shakespeare";
    $name =~ /^Bill/;    # true
    $name =~ /(.)\1/;    # also true (matches ll)
    if($name =~ /(.)\1/){
        print "$name\n";
    }

Using =~ (2)

§ An example using =~ to match <STDIN>:

    $ cat match1
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^[yY]/){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1
    Quit (y/n)? y
    Quitting
    $

Ignoring Case

§ In the previous example, we used a group such as [yY] (or [nN]) to match either upper or lower case.
§ Perl has an “ignore case” option for pattern matching: /somepattern/i

    $ cat match1a
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^y/i){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1a
    Quit (y/n)? Y
    Quitting
    $

Slash and Backslash

§ If your pattern has a slash character (/), you must precede each with a backslash (\):

    $ cat slash1
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ /^\/usr\/local\/bin/){
        print "Path is /usr/local/bin\n";
    }
    $ slash1
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Different Pattern Delimiters

§ If your pattern has lots of slash characters (/), you can also use a different pattern delimiter with the form: m#somepattern#
§ The # can be any non-alphanumeric, non-whitespace character.

    $ cat slash1a
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ m#^/usr/local/bin#){
    # if($path =~ m@^/usr/local/bin@){   # also works
        print "Path is /usr/local/bin\n";
    }
    $ slash1a
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Special Read-Only Variables (1)

§ After a successful pattern match, the variables $1, $2, $3, … are set to the same values as \1, \2, \3, …
§ You can use $1, $2, $3, … later in your program.

    $ cat read1
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    /(\w+)\W+(\w+)/;   # match first two words
    # $1 is now "Bill" and $2 is now "Shakespeare"
    print "The first name of $2 is $1\n";
    $ read1
    The first name of Shakespeare is Bill

Special Read-Only Variables (2)

§ You can also obtain the values of $1, $2, $3, … by placing the match in a list context:

    $ cat read2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    ($first, $last) = /(\w+)\W+(\w+)/;
    print "The first name of $last is $first\n";
    $ read2
    The first name of Shakespeare is Bill

Special Read-Only Variables (3)

§ Other read-only variables:
§ $& is the part of the string that matched the pattern
§ $` is the part of the string before the match
§ $' is the part of the string after the match

    $ cat read3
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    / in /;
    print "Before: $`\n";
    print "Match: $&\n";
    print "After: $'\n";
    $ read3
    Before: Bill Shakespeare
    Match:  in
    After: Love

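The read-only variables and the list-context form can be combined to pull a setting out of a longer string. This is an illustrative sketch only: parse_setting and the "name=value" input are hypothetical and not part of the lecture examples.

```perl
use strict;
use warnings;

# Hypothetical helper: return the two remembered parts of "name=value"
# by evaluating the match in list context.
sub parse_setting {
    my ($text) = @_;
    my ($name, $value) = $text =~ /(\w+)=(\w+)/;
    return ($name, $value);
}

my $input = "set color=blue now";   # made-up input

my ($name, $value) = parse_setting($input);
print "name: $name, value: $value\n";

# The same match, inspected through the special variables.
my ($before, $matched, $after);
if ($input =~ /(\w+)=(\w+)/) {
    ($before, $matched, $after) = ($`, $&, $');
    print "before: [$before] match: [$matched] after: [$after]\n";
}
```

Copying $`, $& and $' into ordinary variables right after the match is a cautious habit, since the next successful match overwrites them.
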
Repeat {n}

§ /(fred){5,15}/
  Match from five to fifteen repetitions of “fred”.
§ /a{5,}/
  Match five or more repetitions of “a”.
§ /\w{8}/
  Match exactly 8 word characters.

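The repeat counts can be tried on short strings. In the sketch below, the check helper and the test strings are made up, and /(fred){2,3}/ uses smaller counts than the slide only to keep the strings short.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text matches the stored pattern, else 0.
sub check {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? 1 : 0;
}

print check(qr/(fred){2,3}/, "fred"), "\n";        # one repetition only
print check(qr/(fred){2,3}/, "fredfred"), "\n";    # two repetitions
print check(qr/a{5,}/, "aaaa"), "\n";              # four a's
print check(qr/a{5,}/, "aaaaaaa"), "\n";           # seven a's
print check(qr/\w{8}/, "pass word"), "\n";         # longest run is 4 word characters
print check(qr/\w{8}/, "password1"), "\n";         # 9 word characters
print check(qr/^\w{8}$/, "password1"), "\n";       # anchored: the whole line must be 8
```

Without anchors, /\w{8}/ also succeeds on longer runs, because 8 consecutive word characters can be found inside them; adding ^ and $ restricts the match to lines of exactly 8 word characters.
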