Advanced Perl for Bioinformatics Lecture 5 Regular expressions

Regular expressions - review
• You can put the pattern you want to match between //, bind it to a variable with =~, then use the match in a conditional:
    if ($dna =~ /GAATTC/) { print "EcoRI\n"; }
• Square brackets within the pattern allow alternative characters at one position:
    if ($dna =~ /CAG[AT]CAG/) { ... }
• A vertical bar means "or"; it lets you look for either of two completely different patterns:
    if ($dna =~ /GAAT|ATTC/) { ... }
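As a quick check of the three pattern types reviewed above, here is a minimal sketch. The sequence in $dna and the labels pushed onto @found are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical test sequence; it contains GGATCC and CAGTCAG
my $dna = "TTGGATCCAACAGTCAGTT";

my @found;
push @found, "literal"     if $dna =~ /GGATCC/;      # exact pattern
push @found, "char class"  if $dna =~ /CAG[AT]CAG/;  # A or T in the middle
push @found, "alternation" if $dna =~ /GAAT|ATTC/;   # either whole pattern

print "Matched: @found\n";
```

Only the first two patterns occur in this sequence, so the alternation test adds nothing to @found.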

Reading and writing files, review
• Open a file for reading:
    open INPUT, "/home/class30/input.txt";
• Or for writing:
    open OUTPUT, ">/home/class30/output.txt";
• Make sure you can open it!
    open INPUT, "input.txt" or die "Can't open file\n";
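The open calls above can be tried without touching any real path. This sketch assumes a scratch file from the core File::Temp module, and it uses the three-argument form of open with lexical filehandles ($out, $in) instead of the bareword INPUT and OUTPUT handles on the slide. Both forms work; the file contents are invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so the example does not depend on any real path
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Open for writing: the '>' mode creates or truncates the file
open(my $out, '>', $path) or die "Can't open $path for writing: $!\n";
print $out "ATGC\n";
print $out "GGCC\n";
close $out;

# Open for reading and collect each line without its trailing newline
open(my $in, '<', $path) or die "Can't open $path for reading: $!\n";
my @lines;
while (my $line = <$in>) {
    chomp $line;
    push @lines, $line;
}
close $in;

print scalar(@lines), " lines read: @lines\n";
```

Including $! in the die message reports why the open failed, which helps when a path is mistyped.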

Test time Last one…

Hashes
Perl has another very useful data structure called a hash, for want of a better name. A hash is an associative array, i.e. a collection of values in which each value is stored under, and looked up by, a unique key rather than a numeric position.

Making a hash of it
• You can think of a hash as a set of questions (keys) and answers (values):
    my %matthash = ("first_name" => "Matt",
                    "surname"    => "Hudson",
                    "age"        => "secret",
                    "height"     => 187,      # cm
                    "hairstyle"  => "D minus");

Making a hash of it
• Pseudocode: create an associative array where these keys are associated with these values:
    Key          Value
    first_name   Matt
    surname      Hudson
    age          secret
    height       187 (a comment in the code notes the unit: cm)
    hairstyle    D minus

Getting the hash back
    my %matthash = ("first_name" => "Matt",
                    "surname"    => "Hudson",
                    "age"        => "secret",
                    "height"     => 187,      # cm
                    "hairstyle"  => "D minus");
    print "my name is ", $matthash{first_name};
    print " ", $matthash{surname}, "\n";
You can store a lot of information and recover it easily and quickly without knowing in what order you added it, unlike an array.

Getting the hash back
• Pseudocode:
    Output the text "my name is "
    Then the value for key "first_name" in matthash
    Then a space
    Then the value for key "surname" in matthash
    Then a newline character

Hashes as an array
• You can get the "keys" of the hash and use them like an array:
    foreach my $info (keys %matthash) {
        print "$info = $matthash{$info}\n";
    }
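One thing to be aware of when looping over keys: Perl returns them in no particular order. A minimal sketch, using a shortened copy of the hash, that sorts the keys so the listing is the same on every run (the @report array is just an illustrative name):

```perl
use strict;
use warnings;

my %matthash = (
    "first_name" => "Matt",
    "surname"    => "Hudson",
    "height"     => 187,    # cm
);

# keys returns the keys in no guaranteed order; sort makes the listing repeatable
my @report;
foreach my $info (sort keys %matthash) {
    push @report, "$info = $matthash{$info}";
}
print "$_\n" for @report;
```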

Why are hashes useful? Exercise.
• Many of you might have noticed in the exercise on restriction sites that there was no way to keep track of which sites were which using arrays
• Modify your script using a hash like this one:
    my %enzymehash = ("EcoRI"   => "GAATTC",
                      "BamHI"   => "GGATCC",
                      "HindIII" => "AAGCTT");

(an) answer
    foreach my $name (keys %enzymehash) {
        if ($sequence =~ /$enzymehash{$name}/) {
            print "I found a site for $name, $enzymehash{$name}\n";
        }
    }

pseudocode
    For every key in the hash %enzymehash:
        If the sequence in $sequence contains the value for that key:
            print "I found a site for (key), (value in %enzymehash for key)"
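The search above can be run end to end as in the sketch below. The test sequence is made up, EcoRI's site is written as GAATTC, and reporting the match position through the built-in @- array is an extra that goes beyond the exercise.

```perl
use strict;
use warnings;

my %enzymehash = (
    "EcoRI"   => "GAATTC",
    "BamHI"   => "GGATCC",
    "HindIII" => "AAGCTT",
);

# Hypothetical sequence containing an EcoRI site and a HindIII site
my $sequence = "CCGAATTCTTAAGCTTGG";

my @hits;
foreach my $name (sort keys %enzymehash) {
    my $site = $enzymehash{$name};
    if ($sequence =~ /$site/) {
        # $-[0] is the 0-based offset where the match started
        push @hits, "$name ($site) at position " . ($-[0] + 1);
    }
}
print "$_\n" for @hits;
```

Because each site is stored under the enzyme name, the report can say which enzyme cut, which was the point of moving from an array to a hash.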

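Putting data in a hash
    my %hash;
    while (<FILE>) {
        # only use $1 and $2 if the match succeeded
        if (/stuff(important stuff) more stuff (best stuff)/) {
            $hash{$1} = $2;
        }
    }
Or…
    while (my $line = <FILE>) {
        chomp $line;                    # remove the trailing newline
        my @tmp = split /\t/, $line;    # split the line on tabs
        $hash{$tmp[0]} = $tmp[1];
    }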

pseudocode
    Create an empty hash %hash
    For every line in the file FILE:
        If the line matches the regex stuff(important stuff) more stuff (best stuff):
            store (important stuff) as a hash key and (best stuff) as the value for that key
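To try the split version without a real file, this sketch reads made-up tab-separated data through an in-memory filehandle. The data, the $fh handle and the check for two columns are all assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical tab-separated data, read through an in-memory filehandle
my $data = "EcoRI\tGAATTC\nBamHI\tGGATCC\nnot a valid line\n";
open(my $fh, '<', \$data) or die "Can't open in-memory data: $!\n";

my %hash;
while (my $line = <$fh>) {
    chomp $line;
    my @tmp = split /\t/, $line;
    next unless @tmp >= 2;        # skip lines that lack a second column
    $hash{$tmp[0]} = $tmp[1];
}
close $fh;

print "$_ => $hash{$_}\n" for sort keys %hash;
```

The third line has no tab, so it is skipped rather than stored with an undefined value.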

Advanced regex
• The fun isn't over yet.
• You can match:
    • Precise numbers of characters
    • Any number of characters
    • Positions in a line
    • Precise formatting (spaces, tabs etc.)
• You can get bits of the string you matched out and store them in variables
• You can use regexes to substitute or to translate
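Each item in the list above corresponds to a small piece of regex syntax. This is a minimal sketch with one example per feature; the sequence and the tab-separated line are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sequence: start codon, six A's, CCG, stop codon
my $dna = "ATGAAAAAACCGTAG";

# Precise number of characters: six A's in a row
my $six_a = ($dna =~ /A{6}/) ? "yes" : "no";

# Any number of characters (.*) between two positions in the line (^ start, $ end)
my $orf_like = ($dna =~ /^ATG.*TAG$/) ? "yes" : "no";

# Precise formatting: \t is a tab, \s+ is one or more whitespace characters
my $line = "gene1\t42   kb";
my ($name, $size) = $line =~ /^(\w+)\t(\d+)\s+kb$/;

# Substitute: change every T to U in a copy of the sequence
my $rna = $dna;
$rna =~ s/T/U/g;

# Translate: tr swaps characters one for one; applied to the reversed
# string this gives the reverse complement
my $revcomp = reverse $dna;
$revcomp =~ tr/ACGT/TGCA/;

print "six A's: $six_a, ORF-like: $orf_like\n";
print "$name is $size kb\n";
print "RNA: $rna\n";
print "Reverse complement: $revcomp\n";
```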

Grabbing bits of the regex
• The fun isn't over yet.
    my $blastline = "Query= AT1g34399 gene CDS";
    $blastline =~ /Query= (.+) gene/;
    my $atgnumber = $1;
    print "The accession number is $atgnumber\n";
Whatever matches the part of the regex inside parentheses (round brackets) is stored in the special variable $1, which you can then use for other things. If you add a second pair of parentheses, its contents are stored in $2.
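A sketch of the two-capture case described above. The second pair of parentheses and the $feature variable are additions for illustration. The match is tested in an if because $1 and $2 only hold fresh values after a successful match.

```perl
use strict;
use warnings;

my $blastline = "Query= AT1g34399 gene CDS";

# First pair of parentheses goes to $1, second pair to $2
my ($atgnumber, $feature);
if ($blastline =~ /Query= (\S+) gene (\w+)/) {
    ($atgnumber, $feature) = ($1, $2);
    print "Accession $atgnumber, feature $feature\n";
}
else {
    print "Line did not match\n";
}
```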

Using modules
• You can use other people's modules, including those that come with Perl. These provide extra commands, or change the way your Perl script behaves. E.g.
    use strict;
    use warnings;
    use Bio::Perl;
You will see these stacked up at the beginning of more complicated Perl scripts. Some modules come with Perl (strict, warnings; see man perlmod); others you need to download and install yourself.

Using strict
• We have talked about using "my" the first time you use a variable
• I recommend you always have
    use strict;
  at the top of your script. That way, if you declare your variables with my and later mistype a variable name, Perl will refuse to run the script and will report the undeclared name.
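To see what strict actually reports, this sketch compiles a tiny script held in a string so that the error can be caught and printed instead of stopping the program. The misspelled variable $sequnce is deliberate, and the string eval is only a device for the demonstration.

```perl
use strict;
use warnings;

# A tiny script held in a string: $sequence is declared with my,
# then misspelled as $sequnce on the next line
my $script = q{
    use strict;
    my $sequence = "ATGC";
    $sequnce = "GGCC";   # typo: never declared with my
    1;
};

# eval compiles and runs the string; on failure it returns undef and sets $@
my $ok    = eval $script;
my $error = $@;
print "strict caught it: $error" unless $ok;
```

Without use strict inside the string, the typo would silently create a second, unrelated variable.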

A last exercise?...
• So: how might hashes help you solve this?
• Open up a BLAST output file
• Spit out the name of the query sequence, the top hit, and how many hits there were.
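One possible starting point for the exercise is sketched below. It works on a hypothetical, heavily simplified report embedded in the script, so the hit names, scores and layout are all invented. Real BLAST output varies between versions and the patterns will need adjusting. Two hashes keyed by query name hold the top hit and the hit count.

```perl
use strict;
use warnings;

# Hypothetical, heavily simplified BLAST-style report (real files differ)
my $report = <<'END';
Query= AT1g34399

Sequences producing significant alignments:

hitA  hypothetical protein   200  1e-50
hitB  hypothetical protein   150  1e-30
hitC  hypothetical protein    90  1e-10

>hitA
END

open(my $fh, '<', \$report) or die "Can't open in-memory report: $!\n";

my (%top_hit, %hit_count);
my ($query, $in_hits) = ('', 0);
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^Query= (\S+)/) {
        $query = $1;                 # start of a new query
        $hit_count{$query} = 0;
        $in_hits = 0;
    }
    elsif ($line =~ /^Sequences producing significant alignments/) {
        $in_hits = 1;                # the hit summary list follows
    }
    elsif ($line =~ /^>/) {
        $in_hits = 0;                # alignments start, the list is over
    }
    elsif ($in_hits and $line =~ /^(\S+)/) {
        $top_hit{$query} = $1 unless exists $top_hit{$query};
        $hit_count{$query}++;
    }
}
close $fh;

foreach my $q (sort keys %hit_count) {
    my $top = exists $top_hit{$q} ? $top_hit{$q} : 'none';
    print "$q: top hit $top, $hit_count{$q} hits\n";
}
```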

Programming projects
• Now it's time to think of your programming projects.
• Hopefully you have an idea; we'll discuss how feasible they are in the time available
• If not, here are some suggestions

Suggested program functions
• Translate a cDNA into protein, and then check it against the Pfam database for HMM hits.
• Make a real restriction map of a DNA sequence, with predicted fragment sizes.
• Align proteins of a favorite family, open the alignment and find residues that are totally conserved.
• Perform BLAST against the latest version of the database files for a particular organism, checking whether the user has the latest files and downloading them if not.
• Design PCR primers, to make a fragment size chosen by the user, for a sequence input from a FASTA file.
• Check whether primer sites are unique in a sequenced, or partially sequenced, genome, and give an "electronic PCR" result.
• Output an XML formatted version of a BLAST or HMMER text file.
• Analyze codon usage in a protein coding DNA sequence and calculate the Ka/Ks ratio.