Programming Language Concepts
Perl Regular Expressions
Adapted by Carl Reynolds from materials by Sean P. Strout
5/19/2021 PLC - Perl Regular Expressions (v 2.0)

What are Regular Expressions?
• Regular expressions are a tiny language used for fast, flexible, and reliable string handling
• Where are regexes found?
  – Text editors like vi and emacs
  – Unix commands like grep, sed, awk and procmail
  – Web-based search engines, email clients, etc.
• Perl uses regexes for pattern matching, extraction and replacement on a given input string
• The theory is simple:
  1. There are an infinite number of possible text strings
  2. A given pattern divides this set into two groups: the ones that match and the ones that don't

What are Regular Expressions?
    % grep 'flint.*stone' some_file
    A piece of flint, a stone which may be used to start…
    Found obsidian, flint, granite, and small stones of…
    A flintlock rifle in poor condition. The sandstone mantle…
• Don't confuse a regex with a glob; globs are often used for directory operations
    my @java_src = glob "*.java";
    foreach my $file (@java_src) {
        system "javac $file";
    }

Using Simple Patterns
• To compare a pattern against the contents of $_
    $_ = "yabba doo";
    if (/abba/) {
        print "It matched!\n";
    }
• The regex is enclosed in forward slashes
• Spaces inside the slashes are significant; they are not ignored
• The evaluation returns true if the four-letter string "abba" is found in $_
• Backslash escapes are available in patterns (just like double-quoted strings)
    /coke\tsprite/

Metacharacters
• There are a number of characters which have a special meaning in a regex
• The first is the dot, . , known as the wildcard character, which matches any single character except a newline
    /bet.y/    # true  = betty betsy bet=y bet.y
               # false = bety betsey
• The second metacharacter is the backslash, \ , which makes any metacharacter non-special
• How do I match against these strings?
    3.14159    /3\.14159/
    b\t        /b\\t/

Simple Quantifiers
• Repetition in a pattern can be specified using the following three simple quantifiers
  – The star, * , matches the preceding item zero or more times
    /fred\t*barney/    # fred\tbarney fred\t\tbarney fredbarney
    /fred.*barney/     # fred + any old junk + barney
  – The plus, + , matches the preceding item one or more times
    /fred +barney/     # fred and barney separated by only spaces
  – The question mark, ? , matches the preceding item zero or one times
    /bamm-?bamm/       # bamm-bamm or bammbamm

More Metacharacters
• Use parentheses, () , to group a collection of items so that a quantifier applies to the whole group
    /fred+/      # freddd
    /(fred)+/    # fredfred
    /(fred)*/    # What does this match?
• Use the vertical bar, | , to match either the left side or the right side
    /fred|barney|betty/    # fred barney betty
    /fred( |\t)+barney/    # "fred barney" "fred \tbarney"
• Rewrite the last pattern to match only if the characters between fred and barney are all spaces or all tabs
    /fred( +|\t+)barney/

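The difference between the two fred/barney patterns above can be checked with a short script. This is only a sketch: the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ("fred barney", "fred\t\tbarney", "fred \tbarney", "fredbarney");

# ( |\t)+   accepts any mix of spaces and tabs between the names.
# ( +|\t+)  accepts a run of only spaces or a run of only tabs.
my (@mixed, @uniform);
for my $s (@samples) {
    push @mixed,   $s if $s =~ /fred( |\t)+barney/;
    push @uniform, $s if $s =~ /fred( +|\t+)barney/;
}

print scalar(@mixed), " strings match the mixed form, ",
      scalar(@uniform), " match the uniform form\n";
```

The string with a space followed by a tab is the one that separates the two patterns.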
Pattern Testing
• Use the Unix command egrep to test a regular expression against a file
  – some.txt: barney or fred and barney
  – % egrep 'fred (and|or) barney' some.txt
    fred and barney
• /usr/local/pub/sps/courses/plc/examples/perl/pattern_test
• Write a pattern that matches any string containing:
  – fred or Fred                                                 /fred|Fred/
  – At least one a followed by any number of b's                 /a+b*/
  – At least one backslash followed by one or zero asterisks     /\\+\*?/

Character Classes
• A character class is a list of possible characters inside square brackets, [] , matching any single character from within the class
    [abcwxyz]       # matches one of those 7 chars
• The minus, - , is used to specify a range of characters
    [a-cw-z]        # same, uses hyphen for range
    [a-zA-Z]        # 52 upper/lowercase letters
    [\x41-\x5A]     # A-Z, written with character codes (65-90 decimal)

    $_ = "The HAL-9000 requires authorization to continue.";
    if (/HAL-[0-9]+/) {
        print "The string matches some model of HAL.\n";
    }

Character Classes
• The caret, ^ , is used to negate a character class. This is a way to specify characters not to match against
    [^def]      # match any char except d, e, f
• Since minus is special inside a character class, it must be backslashed to make it non-special
    [^n\-z]     # match any char except n, -, z
• Write a regular expression that matches 3-character strings:
    1st character: a, b or c
    2nd character: -
    3rd character: not a, b or c
    /[abc]-[^abc]/

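As a quick check of the exercise pattern, the sketch below filters a handful of made-up strings. Outside a character class the hyphen is an ordinary character, so it needs no backslash there.

```perl
use strict;
use warnings;

# Exercise pattern: one of a/b/c, a literal hyphen, then anything except a/b/c.
my $re = qr/[abc]-[^abc]/;

# Hypothetical test strings.
my @candidates = ("a-z", "b-a", "c-9", "d-x", "ab");
my @hits = grep { $_ =~ $re } @candidates;

print "matched: @hits\n";
```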
Character Class Shortcuts
• Some character classes appear so frequently that they have shortcuts
    \d    # a single digit, [0-9]
    \w    # a "word" character, [A-Za-z0-9_]
    \s    # a "space" character, [ \f\t\n\r]
• Add quantifiers to beef up your pattern testing
    \w+   # match a series of one or more word chars
    \s*   # any amount of whitespace (including none)
    \s+   # one or more whitespace characters

Negating Character Class Shortcuts
• To do the opposite of a character class shortcut, negate it with ^
    [^\d]    # non-digit character, also \D
    [^\w]    # non-word character, also \W
    [^\s]    # non-space character, also \S
• What do these match?
    1. /[\dA-Fa-f]+/    hex number
    2. /[\d\D]/         any digit or non-digit: any character, even newline
    3. /[^\d\D]/        anything that's not a digit or a non-digit: nothing
    4. /{\S+,\S+}/      strings such as {1.22,5,8.8}

General Quantifiers
• We've already seen the 3 simple quantifiers, *, + and ?
• Use curly braces, {} , to specify the minimum and maximum repetitions allowed
    /a{5,15}/       # match from 5 to 15 a's
    /(fred){3,}/    # match 3 or more fred in a row
    /\w{8}/         # match exactly 8 word chars
• The simple quantifiers, therefore, are just shortcuts for this general form. Can you come up with them?

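One way to answer the question above is to compare each shortcut with its general form on a few inputs. The sketch below assumes the equivalences * = {0,}, + = {1,} and ? = {0,1}; the sample strings are invented.

```perl
use strict;
use warnings;

# Each pair is (shortcut, general form); the two should agree on every sample.
my @pairs = (
    [ qr/^ab*c$/, qr/^ab{0,}c$/  ],   # *  behaves like {0,}
    [ qr/^ab+c$/, qr/^ab{1,}c$/  ],   # +  behaves like {1,}
    [ qr/^ab?c$/, qr/^ab{0,1}c$/ ],   # ?  behaves like {0,1}
);
my @samples = ("ac", "abc", "abbc", "abbbc");

my $disagreements = 0;
for my $pair (@pairs) {
    my ($short, $general) = @$pair;
    for my $s (@samples) {
        my $hit_short   = ($s =~ $short)   ? 1 : 0;
        my $hit_general = ($s =~ $general) ? 1 : 0;
        $disagreements++ if $hit_short != $hit_general;
    }
}
print "disagreements: $disagreements\n";
```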
Anchors
• The caret anchor, ^ , marks the beginning of a string
• Yes, you've seen this used before, inside character classes for negation. The meaning is different outside a character class
• The dollar sign, $ , marks the end of the string (or the position just before a trailing newline)
    /^fred/     # matches fred only at start of string
    /rock$/     # matches rock only at end of string
    /^fred$/    # only matches fred, alone as a string
    /^\s*$/     # What does this match? A blank line (empty, or whitespace only)

Word Anchors
• The word boundary, \b , matches at either end of a word
    /\bfred\b/    # match the word fred but not frederick,
                  # alfred or manfred mann
    /\bhunt/      # matches hunt, hunting or hunter
                  # but not shunt
    /stone\b/     # matches sandstone, flintstone
                  # but not capstones

Word Anchors
• The word boundary matches at the start or end of a group of \w characters. Quotes and apostrophes don't change the word boundaries
• The non-word boundary, \B , matches wherever \b would not
    /\bsearch\B/    # match searches, searching, searched
                    # but not search or researching

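The \b and \B behaviour described above can be tried on the slide's own word list. This is a small sketch; only the variable names are new.

```perl
use strict;
use warnings;

# Words taken from the slide's comment.
my @words = qw(searches searching searched search researching);

# \b requires "search" to start a word; \B requires more word characters after it.
my @hits = grep { /\bsearch\B/ } @words;

print "matched: @hits\n";
```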
Memory Parentheses
• Parentheses are also used for remembering the substring matched by the enclosed part of the pattern
• Backreferences refer back to an earlier saved memory within the same regex
    /(.)\1/    # match any char, followed by same char
• Each pair of parentheses equals one regex memory
  – Memories are numbered by opening parenthesis: outer to inner, left to right
    /((a|b) (c)) \1/    # matches a c a c or b c b c
    /((a|b) (c)) \2/    # matches a c a or b c b
    /((a|b) (c)) \3/    # matches a c c or b c c
• Soon we will see how we can extract these memories into memory variables for future use

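A short sketch of the memory numbering above. The anchors ^ and $ are added here (they are not on the slide) so that each test string must match in full; the word list is made up.

```perl
use strict;
use warnings;

# (.)\1 finds any character immediately followed by itself.
my @doubled = grep { /(.)\1/ } qw(letter banana bookkeeper perl);

# Memory 1 is the outer group, memory 2 is (a|b), memory 3 is (c).
my $outer  = ("a c a c" =~ /^((a|b) (c)) \1$/) ? 1 : 0;
my $second = ("b c b"   =~ /^((a|b) (c)) \2$/) ? 1 : 0;
my $third  = ("a c c"   =~ /^((a|b) (c)) \3$/) ? 1 : 0;
my $wrong  = ("a c c"   =~ /^((a|b) (c)) \2$/) ? 1 : 0;   # \2 is "a" here, not "c"

print "doubled letters in: @doubled\n";
print "outer=$outer second=$second third=$third wrong=$wrong\n";
```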
Precedence
• There are four levels of precedence, from high to low
  1. Parentheses - used for grouping and memory
  2. Quantifiers - the repeat operators
  3. Anchors and Sequence - string and word boundaries, and sequences of items
  4. Alternation - cuts the pattern into pieces
    /^fred|barney$/        # fred at start or barney at end
    /^(fred|barney)$/      # the whole string is fred or barney
    /(wilma|pebbles?)/     # wilma, pebbles, or pebble (as part of a string)
    /^(\w+)\s+(\w+)$/      # What does this match?

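Because alternation binds loosest, the first two patterns above accept different strings. The sketch below counts matches over a few invented samples.

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my @samples = ("fred", "frederick", "big barney", "barney rubble");

# Without parentheses: (^fred) OR (barney$).
my @loose = grep { /^fred|barney$/ } @samples;

# With parentheses: the anchors apply to both alternatives.
my @exact = grep { /^(fred|barney)$/ } @samples;

print "loose: ", scalar(@loose), ", exact: ", scalar(@exact), "\n";
```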
Regex Cheat Sheet

    Metacharacters   Meaning
    .                Match a single character (except \n)
    \                Make a metacharacter non-special
    ()               Group items for matching or memory
    |                Alternative for matching either right or left side
    []               Character class for matching a single character in the class

    Anchors          Meaning
    ^                Anchor to start of string
    $                Anchor to end of string
    \b               Word boundary
    \B               Non-word boundary

    Quantifiers      Meaning
    *                Match preceding item 0 or more times
    +                Match preceding item 1 or more times
    ?                Match preceding item 0 or 1 times
    {,}              General form for range specification

Regex Cheat Sheet

    Character Classes []   Meaning
    -                      Specify a range from item on left to item on right
    ^                      Negate the characters in the class (don't match on them)
    \d                     Match a single digit character, [0-9]
    \w                     Match a single word character, [A-Za-z0-9_]
    \s                     Match a single space character, [ \f\t\n\r]
    \D                     Match a single non-digit character
    \W                     Match a single non-word character
    \S                     Match a single non-space character

    Memory Variables       Meaning
    \#                     Backreference to previous saved memory in parentheses, #=1..n
    $#                     Memory variable to previous saved memory in parentheses, #=1..n

Testing Your Understanding
• Use egrep and write regular expressions to match on the dictionary file: ~jdb/public_html/plc/perlLab/dictionary.txt
  1. Words starting with a
  2. Words that start with a or A
  3. Words with exactly four letters and begins with a
  4. Words with four or more letters
  5. Words that contain a capital letter
  6. Words that start with non-capital and include a capital
  7. Words with more than one capital
  8. Words that neither begin nor end with a
  9. Words that begin and end with a vowel
  10. Words that neither begin nor end with a vowel
  11. Words that contain the vowel sequence a-e-i-o-u
  12. Words that contain your initials in order
  13. Words that contain only the letters of your first name

Matches with m//
• Evaluating a regex, /fred/ , is actually a shortcut for the pattern match operator, m/fred/
• Any pair of delimiters may be used to quote the contents
• The leading m may be omitted only when the delimiter is the forward slash
• Matching a URL might be easier with a different delimiter
    m%^http://%

Option Modifiers
• Option modifiers are activated by appending a flag to the closing delimiter
• Case-insensitive matching is achieved with /i
    print "Would you like to play a game? ";
    chomp($_ = <STDIN>);
    if (/\byes\b/i) {
        print "1. Global Thermonuclear War";
        # …
    }
• The dot matches any character (including newline) with /s
    $_ = "I saw Barney\nbowling\nwith Fred\nlast night.\n";
    if (/\bBarney\b.*\bFred\b/s) {
        print "The string mentions Fred after Barney!\n";
    }

Combining Option Modifiers
• Concatenate flags together to apply multiple options to the same pattern
    if (/\bbarney\b.*\bfred\b/si) {    # both /s and /i
        print "The string mentions Fred after Barney!\n";
    }
• Use man perlop to get the complete list of options

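The effect of /s and /i can be seen by running the same text through the pattern with and without the flags. A minimal sketch using the multi-line string from the option modifier examples; the result variable names are assumptions.

```perl
use strict;
use warnings;

my $text = "I saw Barney\nbowling\nwith Fred\nlast night.\n";

# Without /s the dot stops at each newline, so .* cannot reach Fred.
my $without_s = ($text =~ /\bBarney\b.*\bFred\b/)   ? 1 : 0;

# With /s the dot also matches newlines.
my $with_s    = ($text =~ /\bBarney\b.*\bFred\b/s)  ? 1 : 0;

# /s and /i together: lowercase names in the pattern still match.
my $with_si   = ($text =~ /\bbarney\b.*\bfred\b/si) ? 1 : 0;

print "without /s: $without_s, with /s: $with_s, with /si: $with_si\n";
```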
The Binding Operator
• Matching against $_ is the default. To match against an arbitrary string on the left, use =~
    my $some_other = "I dream of betty rubble.";
    if ($some_other =~ /\brub/) {
        print "Aye, there's the rub.\n";
    }
• This is not an assignment; it is an evaluation that returns the result of the test
    my $likes_perl = <STDIN> =~ /\byes\b/i;
    if ($likes_perl) {
        print "Why don't you marry it?";
    }

Interpolating into Patterns
• A regex is interpolated like a double-quoted string
    #!/usr/bin/perl -w
    my $what = "larry";
    while (<>) {
        if (/^($what)/) {    # pattern anchored at start
            print "Saw $what at beginning of $_";
        }
    }

Match Variables
• When using memories (parentheses), the results are stored in scalar variables with names like $1 and $2
• These variables follow the order of the parentheses pairs in the regex and allow us to pull out parts of the string
    $_ = "Hello there, neighbor";
    if (/\s(\w+),/) {    # memorize word between space and comma
        print "the word was $1\n";    # got the word "there"
    }
    if (/(\S+) (\S+), (\S+)/) {
        print "words were $1 $2 $3\n";    # $1=Hello
    }                                     # $2=there
                                          # $3=neighbor

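The same extraction can be written against a named variable with the binding operator, copying the memories out as soon as the match succeeds. A sketch; $greeting and @words are illustrative names.

```perl
use strict;
use warnings;

my $greeting = "Hello there, neighbor";

# Copy the three memories into an ordinary array right after a successful match.
my @words;
if ($greeting =~ /(\S+) (\S+), (\S+)/) {
    @words = ($1, $2, $3);
}

print "words were @words\n";
```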
Match Variables
• Repeatedly match an expression against a string to find multiple occurrences of the same pattern
    $_ = "word1 word2 word3 word4";
    while (/(\S+)(.*)/) {
        print "$1\n";
        $_ = $2;
    }
• A memory variable may contain an empty string
    $dino = "A million years.";
    if ($dino =~ /(\d*) years/) {
        # $1 is the empty string; $2, $3, … are undef
    }

The Persistence of Memory
• The match variables generally stay around until the next successful pattern match
• An unsuccessful match leaves the previous match intact
    $wilma =~ /(\w+)/;    # unchecked: the match may fail
    print "Wilma's word was $1… or was it?\n";
• For maintainability, don't use a match variable more than a few lines after the pattern match. Copy it into an ordinary variable
    if ($wilma =~ /(\w+)/) {
        my $wilma_word = $1;
    }

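The stale-memory problem can be reproduced in a few lines. This sketch uses invented inputs: $fred matches, $wilma deliberately does not.

```perl
use strict;
use warnings;

my $fred = "fred flintstone";
$fred =~ /(\w+)/;                 # succeeds, so $1 is "fred"

my $wilma = "!!!";                # hypothetical input with no word characters
$wilma =~ /(\w+)/;                # fails, so $1 is left untouched
my $stale = $1;                   # still the value from the earlier match

# Safer: read $1 only inside the branch guarded by the match itself.
my $wilma_word = "(no match)";
if ($wilma =~ /(\w+)/) {
    $wilma_word = $1;
}

print "stale=$stale checked=$wilma_word\n";
```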
Automatic Match Variables
• There are three more free variables from a pattern match
  – Whatever came before the match is in $`
  – Whatever matched is stored in $&
  – Whatever came after the match is in $'
    if ("Hello there, neighbor" =~ /\s(\w+),/) {
        print "That was ($`)($&)($').\n";
    }
    # (Hello)( there,)( neighbor)
    # $1=there (note the difference between $1 and $&)
• Revisit the pattern_test program

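To make the difference between $1 and $& concrete, the sketch below copies all four values into ordinary variables (the variable names are illustrative).

```perl
use strict;
use warnings;

my ($before, $matched, $after, $word);
if ("Hello there, neighbor" =~ /\s(\w+),/) {
    # $& is the whole match (space, word, comma); $1 is only the word.
    ($before, $matched, $after, $word) = ($`, $&, $', $1);
}

print "($before)($matched)($after) word=$word\n";
```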
Substitution with s///
• Replace whatever part of a variable matches a pattern with a replacement string
    $_ = "Hey, Barney!";
    s/Barney/Fred/;
    print "$_\n";    # Hey, Fred!
• If the match fails, nothing happens and the variable is untouched
    s/Wilma/Betty/;    # $_="Hey, Fred!"
• Memory variables can also be used
    s/Hey, (\w+)/Yo, $1/;    # $_="Yo, Fred!"

Substitution with s///
• Follow through with this series of substitutions
    $_ = "green scaly dino";
    s/(\w+) (\w+)/$2, $1/;    # "scaly, green dino"
    s/^/huge, /;              # "huge, scaly, green dino"
    s/,.*een//;               # "huge dino" (greedy match)
    s/green/red/;             # false: "huge dino"
    s/\w+$/($`!)$&/;          # "huge (huge !)dino"
    s/\s+(!\W+)/$1 /;         # "huge (huge!) dino"
    s/huge/gigantic/;         # "gigantic (huge!) dino"
• The return value is true if the substitution was successful

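The series above can be replayed while recording $_ after each successful step. A sketch; @steps and $changed are names introduced here for illustration.

```perl
use strict;
use warnings;

my @steps;
$_ = "green scaly dino";

s/(\w+) (\w+)/$2, $1/;   push @steps, $_;   # swap the first two words
s/^/huge, /;             push @steps, $_;   # insert at the start
s/,.*een//;              push @steps, $_;   # greedy: first comma through the last "een"
my $changed = s/green/red/;                 # nothing left to replace, returns false
s/\w+$/($`!)$&/;         push @steps, $_;   # $` is the text before the final word
s/\s+(!\W+)/$1 /;        push @steps, $_;   # move the space after "!)"
s/huge/gigantic/;        push @steps, $_;   # only the first "huge" is replaced

print "$_\n" for @steps;
```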
Global Replacement with /g
• The /g modifier tells s/// to make all possible non-overlapping replacements
    $_ = "home, sweet home!";
    s/home/cave/g;    # $_="cave, sweet cave!"
• Collapsing whitespace
    $_ = "Input  data\t may have   extra whitespace.";
    s/\s+/ /g;        # "Input data may have extra whitespace."
• Stripping whitespace
    s/^\s+//;    # Replace leading whitespace with nothing
    s/\s+$//;    # Replace trailing whitespace with nothing

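Collapsing and stripping are often applied together. The sketch below does so on a named variable through the binding operator; the input string is made up.

```perl
use strict;
use warnings;

# Hypothetical input with leading, trailing and repeated whitespace.
my $text = "  Input  data\t may have   extra whitespace.  \n";

$text =~ s/\s+/ /g;    # collapse every run of whitespace to one space
$text =~ s/^\s+//;     # strip what is left at the front
$text =~ s/\s+$//;     # strip what is left at the end

print "[$text]\n";
```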
The Binding Operator and Case Shifting
• Choose a different target for s/// by using the binding operator
    $file = "/usr/bin/perl";
    $file =~ s#^.*/##s;    # $file=perl
                           # (s means dot matches any character)
• Use \U and \L to force what follows to uppercase and lowercase
    $_ = "I saw Barney with Fred.";
    s/(fred|barney)/\U$1/gi;    # "I saw BARNEY with FRED."
    s/(fred|barney)/\L$1/gi;    # "I saw barney with fred."
• Turn off case shifting with \E

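A sketch of \U, \L and \E side by side. Each substitution works on its own copy of the sentence; the third pattern is an illustration added here to show \E ending the shift partway through the replacement.

```perl
use strict;
use warnings;

my $line = "I saw Barney with Fred.";

# Copy first, then substitute on the copy.
(my $upper = $line) =~ s/(fred|barney)/\U$1/gi;
(my $lower = $line) =~ s/(fred|barney)/\L$1/gi;

# \E stops the shift, so only the first name in the replacement is uppercased.
(my $mixed = $line) =~ s/(\w+) with (\w+)/\U$2\E with $1/;

print "$upper\n$lower\n$mixed\n";
```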
The split Operator
• Use split to break a string according to a separator
    @fields = split /:/, "abc:def:g:h";
    # gives ("abc", "def", "g", "h")
    @fields = split /:/, "abc:def::g:h";
    # gives ("abc", "def", "", "g", "h")
• Leading empty fields are returned but trailing ones are not
    @fields = split /:/, ":::a:b:c:::";
    # gives ("", "", "", "a", "b", "c")

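The empty-field rules are easiest to see by counting what split returns. A minimal sketch with the three strings from the slide; the array names are illustrative.

```perl
use strict;
use warnings;

my @plain = split /:/, "abc:def:g:h";     # no empty fields
my @inner = split /:/, "abc:def::g:h";    # the empty field between "::" is kept
my @edges = split /:/, ":::a:b:c:::";     # leading empties kept, trailing ones dropped

printf "%d %d %d fields\n", scalar(@plain), scalar(@inner), scalar(@edges);
```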
The split Operator
• With /\s+/ as the separator, a run of whitespace counts as a single separator
    my $some_input = "This is a \t test.\n";
    my @args = split /\s+/, $some_input;
    # ("This", "is", "a", "test.")
• The default for split is to break up $_ on whitespace
    my @fields = split;    # like: split /\s+/, $_;
                           # except that leading whitespace is skipped

The join Function
• The join function doesn't use patterns, but it is considered the opposite of split
    my $x = join ":", 4, 6, 8, 10, 12;    # $x="4:6:8:10:12"
    my $y = join "foo", "bar";
    # gives just "bar" (foo not needed as glue)
    my @empty;
    my $empty = join "baz", @empty;
    # no items, it's an empty string
    my @values = split /:/, $x;    # @values=(4, 6, 8, 10, 12)
    my $z = join "-", @values;     # $z="4-6-8-10-12"

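The split/join round trip above, gathered into one runnable sketch so the intermediate values can be inspected. $single stands in for the slide's $y.

```perl
use strict;
use warnings;

my $x      = join ":", 4, 6, 8, 10, 12;   # glue goes between items only
my @values = split /:/, $x;               # back to five separate items
my $z      = join "-", @values;           # same items, different glue

my $single = join "foo", "bar";           # one item, so the glue is never used
my $empty  = join "baz", ();              # no items gives an empty string

print "$x -> $z\n";
```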
Testing Your Understanding
• Make a pattern that will match three consecutive copies of whatever is currently contained in $what. That is, if $what is fred, your pattern should match fredfredfred. If $what is fred|barney, your pattern should match fredfredbarney or barneyfredfred or barneybarneybarney or many other variations.

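One possible answer, offered as a sketch rather than the intended solution: interpolate $what inside parentheses and apply the {3} quantifier to the group. The parentheses keep the alternation in $what from spilling into the rest of the pattern.

```perl
use strict;
use warnings;

my $what = "fred|barney";

# $what is interpolated first, giving (fred|barney){3}.
my $three = qr/($what){3}/;

my $mixed_ok = ("fredfredbarney"     =~ $three) ? 1 : 0;
my $same_ok  = ("barneybarneybarney" =~ $three) ? 1 : 0;
my $too_few  = ("fredbarney"         =~ $three) ? 1 : 0;

print "mixed=$mixed_ok same=$same_ok too_few=$too_few\n";
```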
Revision History
  – v 1.00, 10/19/2003 11:27 PM, sps: Initial revision.
  – v 1.01, 1/26/2004 3:12 PM, sps: 20032 updates.
  – v 2.0, 1/17/2005, chr