Patterns, Patterns and More Patterns Exploiting Perl's built-in regular expression technology
Pattern Basics

What is a regular expression?

    /even/

Matches:

    eleven          # matches at end of word
    eventually      # matches at start of word
    even Stevens    # matches twice: an entire word and within a word

Does not match:

    heaven          # 'a' breaks the pattern
    Even            # uppercase 'E' breaks the pattern
    EVEN            # all uppercase breaks the pattern
    eveN            # uppercase 'N' breaks the pattern
    leave           # not even close!
    Steve not here  # space between 'Steve' and 'not' breaks the pattern
What makes regular expressions so special?

    my $pattern = "even";
    my $string  = "do the words heaven and eleven match?";

    if ( find_it( $pattern, $string ) ) {
        print "A match was found.\n";
    }
    else {
        print "No match was found.\n";
    }

find_it the Perl way

    my $string = "do the words heaven and eleven match?";

    if ( $string =~ /even/ ) {
        print "A match was found.\n";
    }
    else {
        print "No match was found.\n";
    }

Maxim 7.1: Use a regular expression to specify what you want to find, not how to find it
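The slides call find_it without ever defining it. As a rough illustration of the "how" that a pattern hides, here is a minimal sketch of what such a subroutine could look like. The body is an assumption made for this example, not the author's code.

```perl
use strict;
use warnings;

# Hypothetical find_it: the slides use this name but never show a body.
# It slides a window along $string and compares each substring to $pattern.
sub find_it {
    my ( $pattern, $string ) = @_;
    my $last_start = length($string) - length($pattern);
    for my $start ( 0 .. $last_start ) {
        return 1 if substr( $string, $start, length($pattern) ) eq $pattern;
    }
    return 0;
}

my $string = "do the words heaven and eleven match?";

# The hand-written search and the regular expression agree.
print find_it( "even", $string ) ? "A match was found.\n" : "No match was found.\n";
print $string =~ /even/          ? "A match was found.\n" : "No match was found.\n";
```

The regular expression version states only what to find; the loop, the offsets and the comparison are left to Perl.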
Introducing The Pattern Metacharacters
The + repetition metacharacter

    /T+/

Matches:

    T
    TTTTTT
    TT

Does not match:

    t
    this and that
    hello
    ttttt
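More repetition

    /ela+/          matches: elation, elaaaa
    /(ela)+/        matches: elaela, ela
    /\(ela\)+/      matches: (ela))))))   does not match: (ela(ela

In /(ela)+/ the parentheses group 'ela' so that + repeats the whole group. In /\(ela\)+/ the parentheses are escaped, so they are literal characters and + repeats only the closing parenthesis.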
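The repetition examples can be checked with a few lines of Perl. This is a small sketch, not part of the original slides; the `matches` helper is a name invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $regex, 0 otherwise.
sub matches {
    my ( $regex, $text ) = @_;
    return $text =~ $regex ? 1 : 0;
}

my @checks = (
    [ qr/T+/,       'TTTTTT'     ],
    [ qr/T+/,       'ttttt'      ],
    [ qr/ela+/,     'elaaaa'     ],
    [ qr/(ela)+/,   'elaela'     ],
    [ qr/\(ela\)+/, '(ela))))))' ],
    [ qr/\(ela\)+/, '(ela(ela'   ],
);

for my $check (@checks) {
    my ( $regex, $text ) = @$check;
    printf "%-16s %-12s %s\n", $regex, $text,
        matches( $regex, $text ) ? 'match' : 'no match';
}
```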
The | alternation metacharacter

    /0|1|2|3|4|5|6|7|8|9/

Matches:

    0123456789
    there's a 0 in here somewhere
    My telephone number is: 212-555-1029

The same idea works for letters:

    /a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z/
    /A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z/
Metacharacter shorthand and character classes

    /0|1|2|3|4|5|6|7|8|9/    is the same as    /[0123456789]/
    /a|e|i|o|u/              is the same as    /[aeiou]/
    /[^aeiou]/               any character that is not a lowercase vowel
    /[0123456789]/           is the same as    /[0-9]/
    /[a-z]/    /[A-Z]/    /[-A-Z]/

    /[BCFHST][aeiou][mty]/

Matches:

    Bat   Hit   Tot   Cut   Say

Does not match:

    Hog   Can   May   bat

    /[BbCcFfHhSsTt][aeiou][mty]/
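A quick way to see the two three-letter character-class patterns at work is to filter the slide's word list with `grep`. This is a minimal sketch; the variable names are chosen for the example.

```perl
use strict;
use warnings;

my @words = qw( Bat Hit Tot Cut Say Hog Can May bat );

# Uppercase first letter only.
my @strict = grep { /[BCFHST][aeiou][mty]/ } @words;

# First letter in either case.
my @relaxed = grep { /[BbCcFfHhSsTt][aeiou][mty]/ } @words;

print "strict:  @strict\n";     # Bat Hit Tot Cut Say
print "relaxed: @relaxed\n";    # Bat Hit Tot Cut Say bat
```

'Hog' and 'Can' fail on the third class, 'May' fails on the first, and 'bat' passes only once the first class accepts lowercase.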
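More metacharacter shorthand

    /[0-9]/           is the same as    /\d/
    /[a-zA-Z0-9_]/    is the same as    /\w/
    /[ \t\n\r\f]/     is the same as    /\s/
    /[^ \t\n\r\f]/    is the same as    /\S/
    /[^0-9]/          is the same as    /\D/

    /[0-9][ \t\n\r\f][a-zA-Z0-9_][a-zA-Z0-9_][^0-9]/

    is the same as

    /\d\s\w\w\D/

Maxim 7.2: Use regular expression shorthand to reduce the risk of error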
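More repetition

    /\w+/               one or more word characters
    /\d\s\w+\D/
    /\d\s\w{2}\D/       exactly two word characters
    /\d\s\w{2,4}\D/     two, three or four word characters
    /\d\s\w{2,}\D/      two or more word characters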
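To see how much text each repetition count consumes, the sketch below wraps the `\w` part in parentheses and prints what it grabbed. The parentheses, the `word_part` helper and the sample strings are additions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text captured by the first set of
# parentheses in $regex, or undef when $text does not match at all.
sub word_part {
    my ( $regex, $text ) = @_;
    return $text =~ $regex ? $1 : undef;
}

my $text = "1 abcdef!";

my $any      = word_part( qr/\d\s(\w+)\D/,     $text );    # as many as possible
my $two      = word_part( qr/\d\s(\w{2})\D/,   $text );    # exactly two
my $two_four = word_part( qr/\d\s(\w{2,4})\D/, $text );    # at most four
my $two_plus = word_part( qr/\d\s(\w{2,})\D/,  $text );    # two or more
my $too_few  = word_part( qr/\d\s(\w{2})\D/,   "1 a!" );   # only one word character

print "\\w+     took '$any'\n";
print "\\w{2}   took '$two'\n";
print "\\w{2,4} took '$two_four'\n";
print "\\w{2,}  took '$two_plus'\n";
print "\\w{2} against '1 a!': ", defined $too_few ? "match" : "no match", "\n";
```

Note that `\D` matches any non-digit, letters included, which is why `\w{2}` can stop after 'ab' and still succeed.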
The ? and * optional metacharacters

    /[Bb]art?/      matches: bar, Bar, bart, Bart
    /[Bb]art*/      matches: bar, Bart, barttt, Bartttttttttt!!!
    /p*/            matches zero or more p's, so it matches any string

The any character metacharacter

    /[Bb]ar./       matches: barb, barking, embarking, barn, Bart, Barry
    /[Bb]ar.?/
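The difference between `?`, `*` and `.` is easiest to see by printing the text each pattern actually matched. The following is a small sketch; `matched_text` is a helper invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the portion of $text matched by $regex,
# or undef when there is no match.
sub matched_text {
    my ( $regex, $text ) = @_;
    return $text =~ /($regex)/ ? $1 : undef;
}

my @checks = (
    [ qr/[Bb]art?/, 'Barttt'    ],    # at most one 't'
    [ qr/[Bb]art*/, 'Barttt'    ],    # any number of 't'
    [ qr/[Bb]ar./,  'embarking' ],    # 'bar' plus any one character
    [ qr/[Bb]ar./,  'bar'       ],    # nothing follows 'bar'
    [ qr/[Bb]ar.?/, 'bar'       ],    # the extra character is optional
    [ qr/p*/,       'hello'     ],    # zero p's is still a match
);

for my $check (@checks) {
    my ( $regex, $text ) = @$check;
    my $found = matched_text( $regex, $text );
    printf "%-18s %-10s %s\n", $regex, $text,
        defined $found ? "matched '$found'" : 'no match';
}
```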
Anchors
The \b word boundary metacharacter

    /\bbark\b/

Matches:

    That dog sure has a loud bark, doesn't it?

Does not match:

    That dog's barking is driving me crazy!

The inverse, \B, matches anywhere that is not a word boundary:

    /\Bbark\B/

The ^ start-of-line metacharacter

    /^Bioinformatics/

Matches:

    Bioinformatics, Biocomputing and Perl is a great book.

Does not match:

    For a great introduction to Bioinformatics, see Moorhouse, Barry (2004).

The $ end-of-line metacharacter

    /Perl$/

Matches:

    My favourite programming language is Perl

Does not match:

    Is Perl your favourite programming language?

A pattern that matches an empty line:

    /^$/
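The anchors can be tried out against a handful of lines. In this sketch the 'embarking' line and the `matching_lines` helper are additions for illustration; the other lines come from the slides.

```perl
use strict;
use warnings;

my @lines = (
    "That dog sure has a loud bark, doesn't it?",                # 0
    "That dog's barking is driving me crazy!",                   # 1
    "The ship is embarking now.",                                # 2 (invented)
    "Bioinformatics, Biocomputing and Perl is a great book.",    # 3
    "My favourite programming language is Perl",                 # 4
    "",                                                          # 5
);

# Hypothetical helper: returns the indexes of the lines that match $regex,
# joined with commas.
sub matching_lines {
    my ($regex) = @_;
    return join ',', grep { $lines[$_] =~ $regex } 0 .. $#lines;
}

print "whole word 'bark':      ", matching_lines(qr/\bbark\b/),        "\n";
print "'bark' inside a word:   ", matching_lines(qr/\Bbark\B/),        "\n";
print "starts with the title:  ", matching_lines(qr/^Bioinformatics/), "\n";
print "ends with 'Perl':       ", matching_lines(qr/Perl$/),           "\n";
print "empty line:             ", matching_lines(qr/^$/),              "\n";
```

'barking' fails both bark tests: it has a word boundary before the 'b' (so `\B` fails there) and none after the 'k' (so `\b` fails there).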
The Binding Operators

    #!/usr/bin/perl -w

    # The 'simplepat' program - simple regular expression example.

    while ( <> ) {
        print "Got a blank line.\n" if /^$/;
        print "Line has a curly brace.\n" if /[}{]/;
        print "Line contains 'program'.\n" if /\bprogram\b/;
    }

Results from simplepat...

    $ perl simplepat
    Got a blank line.
    Line contains 'program'.
    Got a blank line.
    Line has a curly brace.
    Line contains 'program'.
    Line has a curly brace.

To Match or Not To Match...

    if ( $line =~ /^$/ )    # true when $line matches the pattern
    if ( $line !~ /^$/ )    # true when $line does not match the pattern
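As a small illustration of the two binding operators, the sketch below counts blank and non-blank lines in an in-memory list. The sample lines are invented for the example.

```perl
use strict;
use warnings;

# Invented sample input: two lines of text and two blank lines.
my @lines = ( "first\n", "\n", "second\n", "\n" );

# =~ is true when the line matches; !~ is true when it does not.
my $blank     = grep { $_ =~ /^$/ } @lines;
my $non_blank = grep { $_ !~ /^$/ } @lines;

print "Blank lines: $blank\n";
print "Non-blank lines: $non_blank\n";
```

`/^$/` still matches a line consisting only of a newline, because `$` is allowed to match just before a trailing newline.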
Remembering What Was Matched

    /(ela)+/

Parentheses group, and they also remember what was matched:

    #!/usr/bin/perl -w

    # The 'grouping' program - demonstrates the effect
    #                          of parentheses.

    while ( my $line = <> ) {
        $line =~ /\w+ (\w+) \w+ (\w+)/;

        print "Second word: '$1' on line $..\n" if defined $1;
        print "Fourth word: '$2' on line $..\n" if defined $2;
    }
Results from grouping...

The test.group.data file contains:

    This is a sample file for use with the grouping program that is included with the Patterns and More Patterns chapter from Bioinformatics, Biocomputing and Perl.

    $ perl grouping test.group.data
    Second word: 'is' on line 1.
    Fourth word: 'sample' on line 1.
    Second word: 'grouping' on line 2.
    Fourth word: 'that' on line 2.
    Second word: 'and' on line 4.
    Fourth word: 'Patterns' on line 4.
The grouping2 program

    #!/usr/bin/perl -w

    # The 'grouping2' program - demonstrates the effect of
    #                           more parentheses.

    while ( my $line = <> ) {
        $line =~ /\w+ ((\w+) \w+ (\w+))/;

        print "Three words: '$1' on line $..\n" if defined $1;
        print "Second word: '$2' on line $..\n" if defined $2;
        print "Fourth word: '$3' on line $..\n" if defined $3;
    }

Results from grouping2...

    Three words: 'is a sample' on line 1.
    Second word: 'is' on line 1.
    Fourth word: 'sample' on line 1.
    Three words: 'grouping program that' on line 2.
    Second word: 'grouping' on line 2.
    Fourth word: 'that' on line 2.
    Three words: 'and More Patterns' on line 4.
    Second word: 'and' on line 4.
    Fourth word: 'Patterns' on line 4.

Maxim 7.3: When working with nested parentheses, count the opening parentheses, starting with the leftmost, to determine which parts of the pattern are assigned to which after-match variables
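The counting rule can be checked on a single line of text. This sketch applies the nested-parentheses pattern to the first line of the sample file and assigns the captures to named variables instead of reading $1, $2 and $3 directly; the variable names are chosen for the example.

```perl
use strict;
use warnings;

my $line = "This is a sample file for use with";

# Opening parentheses, left to right:
#   1st -> ((\w+) \w+ (\w+))   three words
#   2nd ->  (\w+)              second word of the line
#   3rd ->             (\w+)   fourth word of the line
my ( $three_words, $second_word, $fourth_word ) =
    $line =~ /\w+ ((\w+) \w+ (\w+))/;

print "Three words: '$three_words'\n";
print "Second word: '$second_word'\n";
print "Fourth word: '$fourth_word'\n";
```

In list context a successful match returns the captured groups in the same left-to-right order as the after-match variables.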
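Greedy By Default

    /(.+), Bart/

Applied to:

    Get over here, now, Bart! Do you hear me, Bart?

the parentheses remember as much as possible:

    Get over here, now, Bart! Do you hear me

Adding ? after the + makes the repetition non-greedy:

    /(.+?), Bart/

which remembers as little as possible:

    Get over here, now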
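The greedy and non-greedy forms can be compared side by side on the slide's sentence. This is a minimal sketch with variable names chosen for the example.

```perl
use strict;
use warnings;

my $text = "Get over here, now, Bart! Do you hear me, Bart?";

# Greedy: .+ takes as much as it can, so the match runs to the last ', Bart'.
my ($greedy) = $text =~ /(.+), Bart/;

# Non-greedy: .+? takes as little as it can, so the match stops at the first ', Bart'.
my ($non_greedy) = $text =~ /(.+?), Bart/;

print "Greedy:     '$greedy'\n";
print "Non-greedy: '$non_greedy'\n";
```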
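Alternative Pattern Delimiters

Matching a path such as /usr/bin/perl with the default delimiters means escaping every slash:

    /\/\w+\/\w+/
    /\/(\w+)\/(\w+)/

Putting m in front of the pattern allows a different delimiter, and the escapes disappear:

    m#/\w+/\w+#
    m#/(\w+)/(\w+)#

Bracketing delimiters are used in pairs:

    m{ }    m< >    m[ ]    m( )

The m is optional only with the default delimiters:

    /even/ is the same as m/even/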
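The sketch below runs the same two-capture pattern against /usr/bin/perl with three different delimiters, to show that only the amount of escaping changes. The array names are chosen for the example.

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";

# Default delimiters: every slash inside the pattern must be escaped.
my @with_slashes = $path =~ /\/(\w+)\/(\w+)/;

# Alternative delimiters: the slashes are ordinary characters.
my @with_hashes = $path =~ m#/(\w+)/(\w+)#;
my @with_braces = $path =~ m{/(\w+)/(\w+)};

print "slashes: @with_slashes\n";
print "hashes:  @with_hashes\n";
print "braces:  @with_braces\n";
```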
Another Useful Utility

    sub biodb2mysql {
        #
        # Given:  a date in DD-MMM-YYYY format.
        # Return: a date in YYYY-MM-DD format.
        #

        my $original = shift;

        $original =~ /(\d\d)-(\w\w\w)-(\d\d\d\d)/;

        my ( $day, $month, $year ) = ( $1, $2, $3 );
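biodb2mysql subroutine, cont.

        $month = '01' if $month eq 'JAN';
        $month = '02' if $month eq 'FEB';
        $month = '03' if $month eq 'MAR';
        $month = '04' if $month eq 'APR';
        $month = '05' if $month eq 'MAY';
        $month = '06' if $month eq 'JUN';
        $month = '07' if $month eq 'JUL';
        $month = '08' if $month eq 'AUG';
        $month = '09' if $month eq 'SEP';
        $month = '10' if $month eq 'OCT';
        $month = '11' if $month eq 'NOV';
        $month = '12' if $month eq 'DEC';

        return $year . '-' . $month . '-' . $day;
    }

Alternate biodb2mysql patterns

    /(\d{2})-(\w{3})-(\d{4})/
    /(\d+)-(\w+)-(\d+)/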
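One possible variation on the date conversion, offered as a sketch and not as the author's version, replaces the twelve month tests with a hash lookup and returns nothing when the input does not look like DD-MMM-YYYY. The name `biodb2mysql_lookup`, the `%month_number` hash and the anchored pattern are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical lookup table from month abbreviation to month number.
my %month_number = (
    JAN => '01', FEB => '02', MAR => '03', APR => '04',
    MAY => '05', JUN => '06', JUL => '07', AUG => '08',
    SEP => '09', OCT => '10', NOV => '11', DEC => '12',
);

# Hypothetical variant of biodb2mysql.
# Given:  a date in DD-MMM-YYYY format.
# Return: a date in YYYY-MM-DD format, or undef if the input is not usable.
sub biodb2mysql_lookup {
    my $original = shift;

    my ( $day, $month, $year ) = $original =~ /^(\d{2})-(\w{3})-(\d{4})$/
        or return;

    my $number = $month_number{ uc $month };
    return unless defined $number;

    return "$year-$number-$day";
}

print biodb2mysql_lookup('14-MAR-2004'), "\n";
```

Checking the result of the match before using the captures avoids building a date from stale or undefined values.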
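Substitutions: Search And Replace

    s/these/those/

    Give me some of these, these and these. Thanks.

becomes:

    Give me some of those, these and these. Thanks.

Adding g replaces every occurrence:

    s/these/those/g

    Give me some of those, those and those. Thanks.

Adding i also ignores case:

    s/these/those/gi

Substituting for whitespace

    s/^\s+//        # remove leading whitespace
    s/\s+$//        # remove trailing whitespace
    s/\s+/ /g       # squeeze each run of whitespace to a single space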
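The three whitespace substitutions are often applied together. A minimal sketch, with `tidy` as a name invented for the example:

```perl
use strict;
use warnings;

# Hypothetical helper that applies the three whitespace substitutions in turn.
sub tidy {
    my $text = shift;
    $text =~ s/^\s+//;      # leading whitespace
    $text =~ s/\s+$//;      # trailing whitespace
    $text =~ s/\s+/ /g;     # runs of whitespace inside the text
    return $text;
}

print "'", tidy("  too   many \t spaces \n"), "'\n";
```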
Finding A Sequence

[EMBL sequence listing: lowercase bases in blocks of ten, six blocks per line, each line ending with a running base count (60, 120, 180 ... 6660, 6720, 6780, 6838). The third line of the listing reads:]

    tcatgcacct gatgaacgtg caaaaccaca gtcaagccat gacaaccccg atctacagtt       180

Searching the data for a sequence:

    if ( $sequence =~ /acttaaatttgtacgtg/ )

The spaces, base counts and line breaks have to be removed first:

    s/\s*\d+$//
    s/\s*//g
The prepare_embl program

    #!/usr/bin/perl -w

    # The 'prepare_embl' program - getting embl.data
    #                              ready for use.

    while ( <> ) {
        s/\s*\d+$//;
        s/\s*//g;

        print;
    }

    $ perl prepare_embl embl.data > embl.data.out
    $ wc embl.data.out
          0       1    6838 embl.data.out
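The effect of the two substitutions can be seen on a couple of in-memory rows. The rows below are invented stand-ins for EMBL data lines (they are not taken from the real entry), and `prepare_rows` is a name chosen for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the prepare_embl substitutions to each row
# and joins the results into one long sequence string.
sub prepare_rows {
    my @rows     = @_;
    my $sequence = '';
    for my $row (@rows) {
        $row =~ s/\s*\d+$//;    # drop the trailing base count
        $row =~ s/\s*//g;       # drop all remaining whitespace, newline included
        $sequence .= $row;
    }
    return $sequence;
}

# Invented rows in the same layout: blocks of bases, then a running count.
my @rows = (
    "gattacagat tacagattac        20\n",
    "agattaca        28\n",
);

my $sequence = prepare_rows(@rows);
print "$sequence\n";
print length($sequence), " characters\n";
```

Because the newline is removed along with the spaces, the output is a single line, which is consistent with wc reporting 0 lines and 1 word for embl.data.out.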
The match_embl program

    #!/usr/bin/perl -w

    # The 'match_embl' program - check a sequence against
    #                            the EMBL database entry stored in the
    #                            embl.data.out data-file.

    use constant TRUE => 1;

    open EMBLENTRY, "embl.data.out"
        or die "No data-file: have you executed prepare_embl?\n";

    my $sequence = <EMBLENTRY>;

    close EMBLENTRY;

    print "Length of sequence is: ", length $sequence, " characters.\n";

    while ( TRUE ) {
The match_embl program, cont.

        print "\nPlease enter a sequence to check.\nType 'quit' to end: ";

        my $to_check = <>;

        chomp( $to_check );

        $to_check = lc $to_check;

        if ( $to_check =~ /^quit$/ ) {
            last;
        }

        if ( $sequence =~ /$to_check/ ) {
            print "The EMBL data extract contains: $to_check.\n";
        }
        else {
            print "No match found for: $to_check.\n";
        }
    }
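One thing worth knowing about /$to_check/ is that the typed text is treated as a pattern, so a metacharacter such as . keeps its special meaning. If only literal text should match, Perl's \Q...\E escapes can be wrapped around the variable. The sketch below contrasts the two; the helper names and the short sample sequence are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: the typed text is used as a pattern (as in match_embl).
sub contains_pattern {
    my ( $sequence, $to_check ) = @_;
    return $sequence =~ /$to_check/ ? 1 : 0;
}

# Hypothetical helper: \Q...\E escapes metacharacters, so the typed text is
# matched literally.
sub contains_literal {
    my ( $sequence, $to_check ) = @_;
    return $sequence =~ /\Q$to_check\E/ ? 1 : 0;
}

my $sequence = "acgtacgt";    # invented sample data

print "pattern 'a.g': ", contains_pattern( $sequence, "a.g" ), "\n";
print "literal 'a.g': ", contains_literal( $sequence, "a.g" ), "\n";
print "literal 'gta': ", contains_literal( $sequence, "gta" ), "\n";
```

Whether pattern input is a feature or a surprise depends on the program; for a sequence checker it could be either.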
Results from match_embl...

    $ perl match_embl
    Length of sequence is: 6838 characters.

    Please enter a sequence to check.
    Type 'quit' to end: aaatttgggccc
    No match found for: aaatttgggccc.

    Please enter a sequence to check.
    Type 'quit' to end: caGGGGGgg
    No match found for: caggggggg.

    Please enter a sequence to check.
    Type 'quit' to end: tcatgcacctgatgaacgtgcaaaaccacagtcaagccatga
    The EMBL data extract contains: tcatgcacctgatgaacgtgcaaaaccacagtcaagccatga.

    Please enter a sequence to check.
    Type 'quit' to end: quit
Where To From Here