Software Tools Regular Expressions

  • Slides: 45

Slide 2 What is a Regular Expression?
• A regular expression is a pattern to be matched against a string. For example, the pattern Bill.
• Matching either succeeds or fails.
• Sometimes you may want to replace a matched pattern with another string.
• Regular expressions are used by many other Unix commands and programs, such as grep, sed, awk, vi, emacs, and even some shells.

Slide 3 Simple Uses of Regular Expressions
• If we are looking for all the lines in a file that contain the string Shakespeare, we could use the grep command:
    $ grep Shakespeare movie > result
• Here, Shakespeare is the regular expression that grep looks for in the file movie.
• Lines that match are redirected to result.

Slide 4 Simple Uses of Regular Expressions
• In Perl, we can make Shakespeare a regular expression by enclosing it in slashes:
    if(/Shakespeare/){
        print $_;
    }
• What is tested in the if-statement? Answer: $_.
• When a regular expression is enclosed in slashes, $_ is tested against the regular expression, returning true if there is a match, false otherwise.

Slide 5 Simple Uses of Regular Expressions
    if(/Shakespeare/){
        print $_;
    }
• The previous example tests only one line, and prints out the line if it contains Shakespeare.
• To work on all lines, add a loop:
    while(<>){
        if(/Shakespeare/){
            print;
        }
    }

Slide 6 Simple Uses of Regular Expressions
• What if we are not sure how to spell Shakespeare? Certainly the first part is easy (Shak), and there must be an r near the end.
• How can we express our idea?
  grep:
    grep "Shak.*r" movie > result
  Perl:
    while(<>){
        if(/Shak.*r/){
            print;
        }
    }
• .* means “zero or more of any character”.

Slide 7 Simple Uses of Regular Expressions
  grep:
    grep "Shak.*r" movie > result
• The double quotes in this grep example are needed to prevent the shell from interpreting * as “all files”.
• Since Shakespeare ends in “e”, shouldn’t it be:
    Shak.*r.*
  Answer: No need. Any character can come before or after the pattern. Shak.*r is the same as .*Shak.*r.*
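The equivalence on this slide can be checked directly. The following is only a small sketch: the sample lines, including the misspelling and the unrelated "Shaken" line, are made up for illustration, and Perl's built-in grep function stands in for the grep command.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this illustration.
my @lines = (
    "Shakespeare in Love",
    "Shakspere in Love",
    "Titanic",
    "Shaken, not stirred",
);

# An unanchored pattern may match anywhere inside the line ...
my @short = grep { /Shak.*r/ } @lines;

# ... so wrapping it in .* on both sides selects exactly the same lines.
my @long = grep { /.*Shak.*r.*/ } @lines;

print "$_\n" for @short;
```

Note that the loose pattern also accepts "Shaken, not stirred", because .* is free to skip over any characters between Shak and a later r.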

Slide 8 Substitution
• Another simple use of regular expressions is the substitute operator.
• It replaces the part of a string that matches the regular expression with another string.
    s/Shakespeare/Bill Gates/;
• $_ is matched against the regular expression (Shakespeare).
• If the match is successful, the part of the string that matched is discarded and replaced by the replacement string (Bill Gates).
• If the match is unsuccessful, nothing happens.

Slide 9 Substitution
• The program:
    $ cat movie
    Titanic
    Saving Private Ryan
    Shakespeare in Love
    Life is Beautiful
    $ cat sub1
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(/Shakespeare/){
            s/Shakespeare/Bill Gates/;
            print;
        }
    }
    $ sub1 movie
    Bill Gates in Love
    $

Slide 10 Substitution
• An even shorter way to write it:
    $ cat sub2
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(s/Shakespeare/Bill Gates/){
            print;
        }
    }
    $ sub2 movie
    Bill Gates in Love
    $

Slide 11 Patterns
• A regular expression is a pattern.
• Some parts of the pattern match a single character (a).
• Other parts of the pattern match multiple characters (.*).

Slide 12 Single-Character Patterns
• The dot “.” matches any single character except the newline (\n).
• For example, the pattern /a./ matches any two-character sequence that starts with a and is not “a\n”.
• Use \. if you really want to match the period.
    $ cat test
    hi
    hi bob.
    $ cat sub3
    #!/usr/local/bin/perl5 -w
    while(<>){
        if(/\./){
            print;
        }
    }
    $ sub3 test
    hi bob.
    $

Slide 13 Single-Character Groups
• If you want to specify one out of a group of characters to match, use [ ]:
    /[abcde]/
  This matches a string containing any one of the first 5 lowercase letters, while:
    /[aeiouAEIOU]/
  matches any of the 5 vowels in either upper or lower case.

Slide 14 Single-Character Groups
• If you want ] in the group, put a backslash before it, or put it as the first character in the list:
    /[abcde]]/     # matches [abcde] + ]
    /[abcde\]]/    # okay
    /[]abcde]/     # also okay
• Use - for ranges of characters (like a through z):
    /[0123456789]/ # any single digit
    /[0-9]/        # same
• If you want - in the list, put a backslash before it, or put it at the beginning/end:
    /[X-Z]/        # matches X, Y, Z
    /[X\-Z]/       # matches X, -, Z
    /[XZ-]/        # matches X, Z, -
    /[-XZ]/        # matches -, X, Z

Slide 15 Single-Character Groups
• More range examples:
    /[0-9\-]/         # match 0-9, or minus
    /[0-9a-z]/        # match any digit or lowercase letter
    /[a-zA-Z0-9_]/    # match any letter, digit, underscore
• There is also a negated character group, which starts with a ^ immediately after the left bracket. This matches any single character not in the list.
    /[^0123456789]/   # match any single non-digit
    /[^0-9]/          # same
    /[^aeiouAEIOU]/   # match any single non-vowel
    /[^\^]/           # match any single character except ^

Slide 16 Single-Character Groups
• For convenience, some common character groups are predefined:
    Predefined Group              Negated Group
    \d (a digit)      [0-9]           \D (non-digit)   [^0-9]
    \w (word char)    [a-zA-Z0-9_]    \W (non-word)    [^a-zA-Z0-9_]
    \s (space char)   [ \t\n]         \S (non-space)   [^ \t\n]
  - \d matches any digit
  - \w matches any letter, digit, underscore
  - \s matches any space, tab, newline
• You can use these predefined groups in other groups:
    /[\da-fA-F]/    # match any hexadecimal digit
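As a quick check of the table above, the sketch below filters a list of single characters through the predefined groups. The test characters are hypothetical and limited to plain ASCII, which is where the spelled-out groups and the shorthands are expected to agree.

```perl
use strict;
use warnings;

# Hypothetical test characters (plain ASCII only), chosen for illustration.
my @chars = ("0", "9", "a", "f", "A", "F", "g", "Z", "_");

# \d inside a bracketed group adds the digits to the letters a-f and A-F.
my @hex = grep { /[\da-fA-F]/ } @chars;

# \w and its spelled-out group accept the same characters from this list.
my @word_short = grep { /\w/ } @chars;
my @word_long  = grep { /[a-zA-Z0-9_]/ } @chars;

print "hex digits: @hex\n";
print "word chars: @word_short\n";
```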

Slide 17 Multipliers
• Multipliers allow you to say “one or more of these” or “up to four of these.”
• * means zero or more of the immediately previous character (or character group).
• + means one or more of the immediately previous character (or character group).
• ? means zero or one of the immediately previous character (or character group).

Slide 18 Multipliers
• Example:
    /Ga+te?s/
  matches a G followed by one or more a’s followed by t, followed by an optional e, followed by s.
• *, +, and ? are greedy, and will match as many characters as possible:
    $_ = "Bill xxxxx Gates";
    s/x+/Cheap/;
    # gives: Bill Cheap Gates
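To see the multipliers at work, here is a minimal sketch. The candidate spellings are invented for illustration, and it borrows two features that the slides introduce later (parentheses to capture the matched text, and =~ to match a named variable).

```perl
use strict;
use warnings;

# Hypothetical candidate strings for /Ga+te?s/.
my @candidates = ("Gates", "Gats", "Gaaates", "Gts", "Gatees");
my @matched = grep { /Ga+te?s/ } @candidates;
print "matched: @matched\n";

# Greedy matching: x+ takes the whole run of x's, not just the first one.
my $text = "Bill xxxxx Gates";
my ($run) = $text =~ /(x+)/;            # capture what x+ actually matched
(my $changed = $text) =~ s/x+/Cheap/;   # substitute on a copy of $text

print "run: $run\n";
print "changed: $changed\n";
```

"Gts" fails because + demands at least one a, and "Gatees" fails because ? allows at most one e before the s.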

Slide 19 General Multiplier
• How do you say “five to ten x’s”?
    /xxxxxx?x?x?x?x?/    # works, but ugly
    /x{5,10}/            # nicer
• How do you say “five or more x’s”?
    /x{5,}/
• How do you say “exactly five x’s”?
    /x{5}/
• How do you say “up to five x’s”?
    /x{0,5}/

Slide 20 General Multiplier
• How do you say “c followed by any 5 characters (which can be different) and ending with d”?
    /c.{5}d/
• * is the same as {0,}
• + is the same as {1,}
• ? is the same as {0,1}
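The counts can be tested with a short sketch. One assumption to flag: the patterns here are wrapped in the anchors ^ and $ (covered on a later slide), because an unanchored /x{5,10}/ would also succeed on eleven x's by matching a ten-character substring.

```perl
use strict;
use warnings;

# Try runs of 4, 5, 10, and 11 x's against "five to ten x's".
my %result;
for my $n (4, 5, 10, 11) {
    my $s = "x" x $n;    # build a string of $n x's
    $result{$n} = ($s =~ /^x{5,10}$/) ? "match" : "no match";
    print "$n: $result{$n}\n";
}

# c, then exactly five arbitrary characters, then d.
my $five = ("c12345d" =~ /^c.{5}d$/) ? 1 : 0;
my $four = ("c1234d"  =~ /^c.{5}d$/) ? 1 : 0;
print "five in between: $five, four in between: $four\n";
```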

Slide 21 Pattern Memory
• How would we match a pattern that starts and ends with the same letter or word?
• For this, we need to remember the pattern.
• Use ( ) around any pattern to put that part of the string into memory (it has no effect on the pattern itself).
• To recall memory, include a backslash followed by an integer.
    /Bill(.)Gates\1/

Slide 22 Pattern Memory
• Example:
    /Bill(.)Gates\1/
• This example matches Bill, followed by any single non-newline character, followed by Gates, followed by that same single character.
• So, it matches:
    Bill!Gates!
    Bill-Gates-
  but not:
    Bill?Gates!
    Bill-Gates_
  (Note that /Bill.Gates./ would match all four)

Slide 23 Pattern Memory
• More examples:
    /a(.)b(.)c\2d\1/
• This example matches a string starting with a, a character (#1), followed by b, another single character (#2), c, the character #2, d, and the character #1.
• So it matches: a-b!c!d-.

Slide 24 Pattern Memory
• The reference part can have more than a single character.
• For example:
    /a(.*)b\1c/
• This example matches an a, followed by any number of characters (even zero), followed by b, followed by the same sequence of characters, followed by c.
• So it matches:
    aBillbBillc and abc,
  but not:
    aBillbBillGatesc.
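The pattern-memory examples from the last few slides can be run as a small table of test strings. This is just a sketch that reuses the strings from the slides and uses =~ (covered on a later slide) to match a named variable.

```perl
use strict;
use warnings;

# \1 must repeat whatever single character the (.) remembered.
my %same_char;
for my $s ("Bill!Gates!", "Bill-Gates-", "Bill?Gates!", "Bill-Gates_") {
    $same_char{$s} = ($s =~ /Bill(.)Gates\1/) ? 1 : 0;
    print $s, " => ", ($same_char{$s} ? "match" : "no match"), "\n";
}

# Here \1 must repeat a whole remembered sequence (possibly empty).
my %same_seq;
for my $s ("aBillbBillc", "abc", "aBillbBillGatesc") {
    $same_seq{$s} = ($s =~ /a(.*)b\1c/) ? 1 : 0;
    print $s, " => ", ($same_seq{$s} ? "match" : "no match"), "\n";
}
```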

Slide 25 Alternation
• How about picking from a set of alternatives when there is more than one character in the patterns?
• The following example matches either Gates or Clinton or Shakespeare:
    /Gates|Clinton|Shakespeare/
• For single character alternatives, /[abc]/ is the same as /a|b|c/.

Slide 26 Anchoring Patterns
• Anchors require that the pattern be at the beginning or end of the line.
• ^ matches the beginning of the line (only if ^ is the first character of the pattern):
    /^Bill/     # match lines that begin with Bill
    /^Gates/    # match lines that begin with Gates
    /Bill\^/    # match lines containing Bill^ somewhere
    /\^/        # match lines containing ^
• $ matches the end of the line (only if $ is the last character of the pattern):
    /Bill$/     # match lines that end with Bill
    /Gates$/    # match lines that end with Gates
    /$Bill/     # match with contents of scalar $Bill
    /\$/        # match lines containing $
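A short sketch of the anchors in action follows. The three sample lines are hypothetical; single quotes keep Perl from treating the $ in the last one as a variable.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this illustration.
my @lines = ('Bill Gates', 'Gates, Bill', 'Price: $5');

my @begins = grep { /^Bill/ } @lines;   # lines that begin with Bill
my @ends   = grep { /Bill$/ } @lines;   # lines that end with Bill
my @dollar = grep { /\$/ }    @lines;   # lines containing a literal $

print "begins with Bill: @begins\n";
print "ends with Bill:   @ends\n";
print "contains \$:       @dollar\n";
```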

Slide 27 Precedence
• So what happens with the pattern:
    a|b*
  Is this (a|b)* or a|(b*)?
• Precedence of patterns from highest to lowest:
    Name                    Representation
    Parentheses             ( )
    Multipliers             ? + * {m,n}
    Sequence & anchoring    abc ^ $
    Alternation             |
• By the table, * has higher precedence than |, so it is interpreted as a|(b*).

Slide 28 Precedence
• What if we want the other interpretation in the previous example? Answer: Simple, just use parentheses:
    (a|b)*
• Use parentheses in ambiguous cases to improve clarity, even if not strictly needed.
• When you use parentheses for precedence, they also go into memory (\1, \2, \3).

Slide 29 Precedence
• More precedence examples:
    abc*                               # matches ab, abc, abcc, abccc, …
    (abc)*                             # matches "", abc, abcabc, abcabcabc, …
    ^a|b                               # matches a at beginning of line, or b anywhere
    ^(a|b)                             # matches either a or b at the beginning of line
    a|bc|d                             # a, or bc, or d
    (a|b)(c|d)                         # ac, ad, bc, or bd
    (Bill Gates)|(Bill Clinton)        # Bill Gates, Bill Clinton
    Bill (Gates|Clinton)               # Bill Gates, Bill Clinton
    (Mr\. Bill)|(Bill (Gates|Clinton)) # Mr. Bill, Bill Gates, Bill Clinton
    (Mr\. )?Bill( Gates| Clinton)?     # Bill, Mr. Bill, Bill Gates, Bill Clinton,
                                       # Mr. Bill Gates, Mr. Bill Clinton
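Two of the precedence rules above are easy to confirm with a sketch like the following. The test strings "cab" and "abcc" are hypothetical, picked so that each pair of patterns gives different answers.

```perl
use strict;
use warnings;

# ^ binds tighter than |, so in ^a|b only the a is anchored.
my $s = "cab";
my $loose = ($s =~ /^a|b/)   ? 1 : 0;   # succeeds: there is a b somewhere
my $tight = ($s =~ /^(a|b)/) ? 1 : 0;   # fails: the string starts with c

# * binds tighter than sequence, so abc* repeats only the c.
my $t = "abcc";
my $star_c   = ($t =~ /^abc*$/)   ? 1 : 0;   # succeeds: ab followed by two c's
my $star_abc = ($t =~ /^(abc)*$/) ? 1 : 0;   # fails: not whole copies of abc

print "^a|b on cab: $loose, ^(a|b) on cab: $tight\n";
print "abc* on abcc: $star_c, (abc)* on abcc: $star_abc\n";
```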

Slide 30 =~
• What if you want to match a different variable than $_? Answer: Use =~.
• Examples:
    $name = "Bill Shakespeare";
    $name =~ /^Bill/;     # true
    $name =~ /(.)\1/;     # also true (matches ll)
    if($name =~ /(.)\1/){
        print "$name\n";
    }

Slide 31 =~
• An example using =~ to match <STDIN>:
    $ cat match1
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^[yY]/){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1
    Quit (y/n)? y
    Quitting
    $

Slide 32 =~
• Another example using =~ to match <STDIN>:
    $ cat match2
    #!/usr/local/bin/perl5 -w
    print "Wakeup (y/n)? ";
    while(<STDIN> =~ /^[nN]/){
        print "Sleeping\n";
        print "Wakeup (y/n)? ";
    }
    $ match2
    Wakeup (y/n)? n
    Sleeping
    Wakeup (y/n)? N
    Sleeping
    Wakeup (y/n)? y
    $

Slide 33 Ignoring Case
• In the previous examples, we used [yY] and [nN] to match either upper or lower case.
• Perl has an “ignore case” option for pattern matching:
    /somepattern/i

    $ cat match1a
    #!/usr/local/bin/perl5 -w
    print "Quit (y/n)? ";
    if(<STDIN> =~ /^y/i){
        print "Quitting\n";
        exit;
    }
    print "Continuing\n";
    $ match1a
    Quit (y/n)? Y
    Quitting
    $

Slide 34 Slash and Backslash
• If your pattern has a slash character (/), you must precede each with a backslash (\):
    $ cat slash1
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ /^\/usr\/local\/bin/){
        print "Path is /usr/local/bin\n";
    }
    $ slash1
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Slide 35 Different Pattern Delimiters
• If your pattern has lots of slash characters (/), you can also use a different pattern delimiter with the form:
    m#somepattern#
• The # can be any non-alphanumeric character.
    $ cat slash1a
    #!/usr/local/bin/perl5 -w
    print "Enter path: ";
    $path = <STDIN>;
    if($path =~ m#^/usr/local/bin#){
    # if($path =~ m@^/usr/local/bin@){    # also works
        print "Path is /usr/local/bin\n";
    }
    $ slash1a
    Enter path: /usr/local/bin
    Path is /usr/local/bin
    $

Slide 36 Special Read-Only Variables
• After a successful pattern match, the variables $1, $2, $3, … are set to the same values as \1, \2, \3, …
• You can use $1, $2, $3, … later in your program.
    $ cat read1
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    /(\w+)\W+(\w+)/;    # match first two words
    # $1 is now "Bill" and $2 is now "Shakespeare"
    print "The first name of $2 is $1\n";
    $ read1
    The first name of Shakespeare is Bill

Slide 37 Special Read-Only Variables
• You can also get the values of $1, $2, $3, … by placing the match in a list context:
    $ cat read2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    ($first, $last) = /(\w+)\W+(\w+)/;
    print "The first name of $last is $first\n";
    $ read2
    The first name of Shakespeare is Bill

Slide 38 Special Read-Only Variables
• Other read-only variables:
  - $& is the part of the string that matched the pattern
  - $` is the part of the string before the match
  - $' is the part of the string after the match
    $ cat read3
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in Love";
    / in /;
    print "Before: $`\n";
    print "Match: $&\n";
    print "After: $'\n";
    $ read3
    Before: Bill Shakespeare
    Match:  in
    After: Love

Slide 39 More on Substitution
• If you want to replace all matches instead of just the first match, use the g option for substitution:
    $ cat sub3
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    s/Bill/William/;
    print "Sub1: $_\n";
    $_ = "Bill Shakespeare in love with Bill Gates";
    s/Bill/William/g;
    print "Sub2: $_\n";
    $ sub3
    Sub1: William Shakespeare in love with Bill Gates
    Sub2: William Shakespeare in love with William Gates
    $
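A related detail that the slides rely on when s/// appears inside an if: the substitution returns the number of replacements it made, which is why it can serve as a true/false test. The sketch below shows this with the same sentence; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = "Bill Shakespeare in love with Bill Gates";

# Work on copies so the original sentence stays available for comparison.
(my $first_only = $line) =~ s/Bill/William/;     # replaces the first Bill only
(my $all        = $line) =~ s/Bill/William/g;    # replaces every Bill

# In scalar context, s///g returns how many substitutions were made.
my $copy  = $line;
my $count = ($copy =~ s/Bill/William/g);

print "first only: $first_only\n";
print "all:        $all\n";
print "count:      $count\n";
```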

Slide 40 More on Substitution
• You can use variable interpolation in substitutions:
    $ cat sub4
    #!/usr/local/bin/perl5 -w
    $find = "Bill";
    $replace = "William";
    $_ = "Bill Shakespeare in love with Bill Gates";
    s/$find/$replace/g;
    print "$_\n";
    $ sub4
    William Shakespeare in love with William Gates
    $

Slide 41 More on Substitution
• Pattern characters in the regular expression allow patterns to be matched, not just fixed characters:
    $ cat sub5
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    s/(\w+)/<$1>/g;
    print "$_\n";
    $ sub5
    <Bill> <Shakespeare> <in> <love> <with> <Bill> <Gates>
    $

Slide 42 More on Substitution
• Substitution also allows you to:
  - ignore case
  - use alternate delimiters
  - use =~
    $ cat sub6
    #!/usr/local/bin/perl5 -w
    $line = "Bill Shakespeare in love with bill Gates";
    $line =~ s#bill#William#gi;
    $line =~ s@Shakespeare@Gates@gi;
    print "$line\n";
    $ sub6
    William Gates in love with William Gates
    $

Slide 43 split
• The split function allows you to break a string into fields.
• split takes a regular expression and a string, and breaks up the string wherever the pattern occurs.
    $ cat split1
    #!/usr/local/bin/perl5 -w
    $line = "Bill Shakespeare in love with Bill Gates";
    @fields = split(/ /, $line);    # split $line using space as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split1
    Bill love Gates
    $

Slide 44 split
• You can use $_ with split.
• split defaults to look for space delimiters.
    $ cat split2
    #!/usr/local/bin/perl5 -w
    $_ = "Bill Shakespeare in love with Bill Gates";
    @fields = split;    # split $_ using space (default) as delimiter
    print "$fields[0] $fields[3] $fields[6]\n";
    $ split2
    Bill love Gates
    $

Slide 45 join
• The join function allows you to glue strings in a list together.
    $ cat join1
    #!/usr/local/bin/perl5 -w
    @list = qw(Bill Shakespeare dislikes Bill Gates);
    $line = join(" ", @list);
    print "$line\n";
    $ join1
    Bill Shakespeare dislikes Bill Gates
    $
• Note that the glue string is not a regular expression, just a normal string.
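Because split takes a regular expression while join takes a plain string, the two can be combined to tidy up a delimited line. The following is a sketch only: the comma-separated input and the " | " glue are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with inconsistent spacing around the commas.
my $line = "Titanic , Saving Private Ryan,Shakespeare in Love";

# The split pattern is a regular expression: a comma plus any spaces around it.
my @titles = split(/\s*,\s*/, $line);

# The join glue is an ordinary string, so "." here is a literal dot.
my $joined = join(" | ", @titles);
my $dotted = join(".", "a", "b", "c");

print scalar(@titles), " titles\n";
print "$joined\n";
print "$dotted\n";
```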