Perl for Bioinformatics Lecture 4 Variables review A
Perl for Bioinformatics Lecture 4
Variables - review • • • A variable name starts with a $ It contains a number or a text string Use my to define a variable Use = to assign a value Use to stop the variable being interpolated • Take care with variable names and with changing the contents of variables
Conditional Blocks, review • An if test can be used to control a command in a conditional block, according to the outcome of a decision made by comparing variables. • It’s important to keep track of whether variables are strings or numbers. Numbers are compared with ==, strings with eq. • It’s usual to indent the block to make it easier to read the code
Arrays • An array can store multiple pieces of data. • They are essential for the most useful functions of Perl. They can store data such as: – the lines of a text file (e. g. primer sequences) – a list of numbers (e. g. BLAST e values) • Arrays are designated with the symbol @ my @bases = (“A”, “C”, “G”, “T”);
Converting a variable to an array splits a variable into parts and puts them in an array. my $dnastring = "ACGTGCTA"; my @dnaarray = split //, $dnastring; @dnaarray is now (A, C, G, T, G, C, T, A) @dnaarray = split /T/, $dnastring; @dnaarray is now (ACG, GC, A)
Converting an array to a variable • join combines the elements of an array into a single scalar variable (a string) $dnastring = join('', @dnaarray); spacer (empty here) which array
Loops • A loop repeats a bunch of functions until it is done. The functions are placed in a BLOCK – some code delimited with curly brackets {} • Loops are really useful with arrays. • The “foreach” loop is probably the most useful of all: foreach my $base (@dnaarray) { print "$base “; }
Comparing strings • String comparison (is the text the same? ) • eq (equal ) • ne (not equal ) There are others but beware of them!
Getting part of a string • substr takes characters out of a string $letter = substr($dnastring, $position, 1) which string where in the string how many letters to take
Combining strings • Strings can be concatenated (joined). • Use the dot. operator $seq 1= “ACTG”; $seq 2= “GGCTA”; $seq 3= $seq 1. $seq 2; print $seq 3; ACTGGGCTA
Making Decisions - review • The if operator is generally used together with numerical or string comparison operators, inside an (expression). numerical: strings: ==, !=, >, <, ≥, ≤ eq, ne • You can make decisions on each member of an array using a loop which puts each part of the array through the test, one at a time
More healthy exercise • Write a program that asks the user for a DNA restriction site, and then tells them whether that particular sequence matches the site for the restriction enzyme Eco. RI, or Bam HI, or Hind III. • Site for Eco. R 1: GAATTC • Bam H 1: GGATCC • Hind III: AAGCTT
pseudocode • • • Read in restriction site to variable Remove newline character Check if variable contains “GAATTC” Check if variable contains “GGATCC”. . etc.
What about longer sequences? • Read in sequence to variable • Remove newline character • Split sequence in variable to array using “GAATTC”. • Count and report number of fragments. • Measure length of fragments and report site positions, adding six for missing sites
Arrays and loops - review • An array starts with @. It contains multiple bits of data in a list-like format. • @bases = (“A”, “C”, “G”, “T”); • You can make decisions on each member of an array using a foreach loop which puts each part of the array through the test, one at a time
Test time, again • Remember – keep track of what’s in a variable don’t over-write a variable with another value, unless you intend to syntax and case are critical lines end with a semicolon brackets and quotes must match.
Opening and closing files • So we can input large amounts of data, Perl has to read data out of files, and write results into output files • This is done in two steps • First, you must give the file a name within the script - this is known as a filehandle • Use the open command: open MYFILE, ‘exampleprotein. txt’;
Reading a file • Once the file is open, you can read from it, line by line, using the readline <> operator again – (put the filehandle between the angle brackets) • Perl reads files one line at a time, each time you input data from the file, the next line is read: open FILE 1, ’exampleprotein. txt’; $line 1 = <FILE 1>; chomp $line 1; $line 2 = <FILE 1>;
Using loops to read in a file • The while loop just keeps doing an expression while it’s true. So it will keep reading lines from the file until it runs out. my $longsequence; open FILE, ‘exampleprotein. txt’; while (my $line = <FILE>){ chomp $line; $longsequence = $longsequence. $line; } close FILE; • This reads the whole file, and puts each line into the variable $longsequence one at a time. • Think about what happens to the ID line….
Now More Fun Excercises • Read a DNA sequence from a fasta format file • Calculate the GC content. • What about the non-DNA characters in the file? >header lines with the name of the sequence carriage returns !! You know this one. blank spaces N’s or X’s or unexpected letters
Pseudocode • Open file • Load ID line and sequence into different variables • Split the sequence into an array of individual letters • For each letter in the array: – if A, increment “A” counting number variable – If T, increment “T” counting variable – …etc
Writing to a File • Writing to a file is similar to reading from it • Use the > operator to open a file for writing: open OUTPUT, ‘>/home/class 30/output. txt’; • This creates a new file with that name, or overwrites an existing file • Use >> to append text to an existing file • print to the file using the filehandle: print OUTPUT $myoutputdata;
Some more stuff you need to know else • Instead of just letting the script go on if it fails an if test, you can get it to execute a second block of code if the statement in brackets isn’t true.
elsif • You can string a lot of “if”s together using elsif if ($site eq “GAATTC” { print “Eco. R 1 siten”; } elsif ($site eq “CCATGG” { print “Bam. HI siten”; } elsif ($site eq “AAGCTT”) { print “Hind. III siten”; } else { #only happens if none of the preceeding are true die “I can’t find any of the sites I known”; }
Subscripts • Bioinformatics data often can be made into array format: – multi-line sequence files – Microarray or statistics data in “tab delimited” format • You can address part of the array as if it was a variable using a subscript @numbers = (8, 8, 8, 23984092, 8); print “$numbers[3]n”; Please note – the first element is number zero! Second is 1!
Regular Expressions • Sounds odd, doesn’t it? It means a pattern that the computer can match, in a standard format. • Very useful in bioinformatics work • DNA patterns • restriction sites • promoters/transcription factor binding sites • intron splice site • Protein patterns • conserved domains (motifs) • active sites • structural motifs (membrane spanning, signal peptide, etc. )
The Binding and Match Operators: =~ / / • The =~ operator binds functions together • The // operator matches things to patterns It can be translated as “contains” • The forward slashes contain the pattern to be matched, like this: if ($dnaseq =~ /GAATTC/) {print “Eco. RI site foundn”}
A regular expression. . is a joy forever. And a pattern to match: can be just a text string, such as: /GATC/ it can have alternative characters: /G[AT]TC/ or contain a wildcard that matches any character: /G. TC/ Or be something bizzare: //[^/]*/. . /
Perl Regular Expressions • It never ceases to amaze me what people can do with regular expressions, but you can match pretty much anything you can think of and a lot you can’t: #man perlrequick
Alternative Characters • Square brackets within the match expression allow for alternative characters: if ($dna =~ /CAG[AT]CAG/) • This will match an DNA string that starts with CAG; has A or T in the 4 th position, followed by another CAG. • A vertical line within the /expression/ means “or”; it allows you to look for either of two completely different patterns: if ($dna =~ /GAAT|ATTC/)
Special characters • Perl has a large set of special characters to use in regular expressions: – the dot (. ) matches any character – d matches any digit (a number from 0 -9) – w matches any “word” character (a letter or a number, not punctuation or space) – s matches white space (any amount) – t matches a tab (useful for tab delimited files) – ^ matches the beginning of a line – $ matches the end of a line – Knowing this makes you lots of fun at parties.
“Special” characters • What if you need to match text that contains a special character? • Aren’t there dots at the end of sentences? • Now you have to use a backslash () to “escape” the special meaning of that character: if $onewordsentence =~ /w+ . / -This would match any text that has one or more text characters, followed by a dot. Finished.
Bringing it together • So now, when you think about it, you can: Open a file Check whether each line of the file contains a particular pattern Recover part of that line Write it out to another file So. . isn’t that what you wanted to know? But really, it’s very useful combined with the UNIX command line.
A last exercise? . . . • Now we’re getting up to speed with Perl, lets try something more fun: • Open up a BLAST output file • Spit out the name of the query sequence, the top hit, and how many hits there were.
Only the beginning • Sadly, there is much, much more than this to the Power of Perl. • You can make, create and download other people’s websites • Make Linux and Windows graphical programs • Do almost anything on the internet • Interact with databases • And much more
Why won’t I teach you more stuff? • Whoa! • Programming takes time to learn properly • You’ve got the tools now to get started on a programming project • We will go through some more Perl functions in the later classes, especially modules such as Bioperl.
Practice makes perfect • You can now practice your Perl skills and understand a lot of the books and help files, which are probably more useful. #man perlintro #man perlrequick #perldoc bioperl • Also, check out Radhika’s resource page http: //compbio. sph. harvard. edu/chb/training-resources
- Slides: 37