Subroutines and Files Bioinformatics Ellen Walker Hiram College
Why Subroutines? • Save typing • Avoid copy/paste errors • Collect a common algorithm in one place for reuse
Built-In Subroutines • Provide common useful functions, e.g. – index – length – substr • Call with arguments: – index($string, $pat) # $string and $pat are arguments • Different arguments produce different results
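A minimal sketch of the three built-ins named above, called with different arguments. The sequence and pattern are made-up values chosen only for illustration.

```perl
use strict;
use warnings;

my $string = "ATGGCGTACGCTTAA";   # hypothetical example sequence
my $pat    = "GCG";               # hypothetical pattern to search for

my $pos = index($string, $pat);   # 0-based position of the first match, or -1 if absent
my $len = length($string);        # number of characters in $string
my $sub = substr($string, 0, 3);  # 3 characters starting at position 0

print "index:  $pos\n";
print "length: $len\n";
print "substr: $sub\n";
```

Note that Perl names are case-sensitive, so the built-ins must be written in lowercase.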
Finding Predefined Subroutines • Textbooks (Safari Online has several) • Google (include "Perl" in your search string) • Online documentation – http://www.gotapi.com/perl is nicely searchable
How a Subroutine Works • The caller passes "ACA" to the subroutine: – my $code = "ACA"; – print length($code); – print "goodbye\n"; • The subroutine runs and returns 3 to the caller: – sub length – my $string = shift(@_); – my $length = 0; – …code to count … – return $length;
Key Components • sub name – Declares this as a subroutine and names it • shift @_ – Pulls the arguments (the values in parentheses at the call) out of the list one at a time, left to right – Example: somesub("ACT", 1) – $a = shift @_ ($a is "ACT") – $b = shift @_ ($b is 1) • return value – Ends the subroutine and gives it a value
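The three components can be seen together in a short sketch. The body of somesub is invented for illustration (the slide only shows the call), and the variables are named $seq and $count rather than $a and $b, because Perl's sort reserves $a and $b for its own use.

```perl
use strict;
use warnings;

# Hypothetical subroutine: repeats a sequence a given number of times.
sub somesub {
    my $seq   = shift @_;   # first argument, e.g. "ACT"
    my $count = shift @_;   # second argument, e.g. 1
    return $seq x $count;   # ends the subroutine and gives the call its value
}

print somesub("ACT", 1), "\n";
print somesub("ACT", 2), "\n";
```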
Example (p. 122)
# find all GC-rich 4-7-mers and determine their complements
my $GCmatch;
while ($someDNA =~ m/([GC]{4,7})/g) {
    $GCmatch = $1;
    print "5' $GCmatch 3'\n\n";
    my $compl = complement($GCmatch);
    print "3' $compl 5'\n";
}
Subroutine (p. 123)
# book version has good documentation
sub complement {
    my $dna = shift(@_);    # get first arg
    my $anti = $dna;
    $anti =~ tr/ACGTacgt/TGCAtgca/;
    return $anti;
}
Download These (Ch. 7) • Counting nucleotides – countNucleotides($str, "C"); – countNucleotides($str, "[CG]"); • Printing sequences with fixed line width – printSequence($str, 80);
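To show what calls like these might do, here is a rough sketch of both subroutines. It is not the textbook's code: the bodies, the case-insensitive matching, and the sample sequence are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Sketch only (not the book's version): count matches of a pattern in a string.
sub countNucleotides {
    my $str     = shift @_;
    my $pattern = shift @_;
    my $count   = 0;
    $count++ while $str =~ /$pattern/gi;   # assumption: ignore upper/lower case
    return $count;
}

# Sketch only (not the book's version): print at most $width characters per line.
sub printSequence {
    my $str   = shift @_;
    my $width = shift @_;
    for (my $pos = 0; $pos < length($str); $pos += $width) {
        print substr($str, $pos, $width), "\n";
    }
}

my $str = "ATGCGCGTTA";                        # hypothetical sequence
print countNucleotides($str, "C"), "\n";       # counts C only
print countNucleotides($str, "[CG]"), "\n";    # counts C or G
printSequence($str, 4);                        # 4 characters per line
```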
Variable Scope • Variables exist from when they are declared (“my”) until the end of the block (closing brace). • Variables in subroutines exist only during the subroutine • Each call to a subroutine re-initializes the variables
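The scope rules above can be checked with a small sketch. The subroutine name countCall and the variable names are hypothetical.

```perl
use strict;
use warnings;

sub countCall {
    my $calls = 0;   # created fresh on every call
    $calls++;
    return $calls;
}

my $first  = countCall();
my $second = countCall();
print "first: $first, second: $second\n";   # both are 1: nothing carries over

{
    my $inner = "only visible inside this block";
    print "$inner\n";
}
# $inner no longer exists here; under "use strict", using it would not compile.
```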
Files and Programs • Files are stored on the computer’s hard drive and maintained by the operating system. • Programs are connected to files via special subroutines – “open” creates a file handle – “close” releases the file (important!)
Basic File Manipulation • Open a file and read – my $HANDLE; – open($HANDLE, '<', $filename); – $line = <$HANDLE>; • Open a file and write – my $HANDLE; – open($HANDLE, '>', $filename); – print $HANDLE "Hello world!"; • Close a file – close($HANDLE);
Allowing for Errors • If you try to read a file that doesn't exist, or write a file where you lack permission, the open() command will return false. (Opening an existing file with '>' does not fail; it overwrites the file.) • The rest of your program won't work. • To catch this, add: or die("some message $file : $!") to the end of the command ($! contains the system error message)
Complete Open Examples
open($HANDLE, '<', $filename) or die("Cannot open file: $filename: $!");
open($HANDLE, '>', $filename) or die("Cannot write file: $filename: $!");
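Putting open, print, read, and close together, a complete round trip might look like the sketch below. It assumes a temporary directory (from the core File::Temp module) in place of your own directory, and the file names are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumption: a temporary directory stands in for your working directory.
my $dir      = tempdir(CLEANUP => 1);
my $filename = "$dir/example.txt";   # hypothetical file name

# Write one line, then close so the data reaches the disk.
open(my $OUT, '>', $filename) or die("Cannot write file: $filename: $!");
print $OUT "Hello world!\n";
close($OUT);

# Read the line back.
open(my $IN, '<', $filename) or die("Cannot open file: $filename: $!");
my $line = <$IN>;
close($IN);
print $line;

# Opening a file that does not exist fails: open returns false and sets $!.
my $missing = "$dir/no_such_file.txt";
my $ok = open(my $BAD, '<', $missing);
print "open failed: $!\n" if !$ok;
```

The last three lines check the return value by hand instead of using "or die", only so that the failure can be shown without stopping the program.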
Reading lines • Subroutine chomp removes the '\n' character at the end of each line • $line = <$HANDLE> puts the next line in $line • When there are no more lines, the result is false • Example: put the whole file in one sequence
while ($line = <$HANDLE>) {
    chomp $line;
    $seq = $seq . $line;
}
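A self-contained version of the loop above is sketched here. To avoid needing a real file, it assumes an in-memory file handle (opening a reference to a string), and the three input lines are made up.

```perl
use strict;
use warnings;

# Assumption: an in-memory "file" so the example runs without a real file.
my $text = "ATGC\nGGTA\nCC\n";
open(my $HANDLE, '<', \$text) or die("Cannot open in-memory file: $!");

my $seq = "";
while (my $line = <$HANDLE>) {
    chomp $line;            # drop the trailing newline
    $seq = $seq . $line;    # append this line to the sequence
}
close($HANDLE);

print "$seq\n";
```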
Printing to a file • The print commands (print and printf) can optionally be followed with a file handle before the string to print • Examples: – print $HANDLE "Hello\n"; – printf $HANDLE "GC percent is %.1f\n", $GCcount * 100.0 / $total;
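The printf example can be run end to end with the sketch below. The sequence, the use of tr to count G and C, and the temporary report file are assumptions added for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $seq     = "ATGCGCGTTA";          # hypothetical sequence
my $GCcount = ($seq =~ tr/GCgc//);   # tr returns how many characters matched
my $total   = length($seq);

# Assumption: write the report into a temporary directory.
my $dir      = tempdir(CLEANUP => 1);
my $filename = "$dir/report.txt";    # hypothetical file name

open(my $HANDLE, '>', $filename) or die("Cannot write file: $filename: $!");
printf $HANDLE "GC percent is %.1f\n", $GCcount * 100.0 / $total;
close($HANDLE);

# Read the report back to confirm what was written.
open(my $IN, '<', $filename) or die("Cannot open file: $filename: $!");
my $report = <$IN>;
close($IN);
print $report;
```

Note that there is no comma between the file handle and the format string.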
ReadInDNA • Subroutine to read FASTA formatted file (p. 141) • Returns sequence as one long string • Removes whitespace, lines that begin with # (comments), and all digits
FASTA File Format • One header line, begins with > • Many lines of text, sometimes capitalized, sometimes with spaces after every n characters • (ReadInDNA handles these variations)
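To make the format concrete, here is a rough sketch of a reader that handles the variations listed above. It is not the textbook's ReadInDNA: the name readFastaSketch, the choice to uppercase the result, and the sample record are all assumptions.

```perl
use strict;
use warnings;

# Sketch only (not the book's ReadInDNA): read one FASTA record from an open handle.
sub readFastaSketch {
    my $HANDLE = shift @_;
    my $seq = "";
    while (my $line = <$HANDLE>) {
        chomp $line;
        next if $line =~ /^>/;   # skip the header line
        next if $line =~ /^#/;   # skip comment lines
        $line =~ s/[\s\d]//g;    # remove whitespace and digits
        $seq .= uc($line);       # assumption: normalize to uppercase
    }
    return $seq;
}

# Assumption: a made-up record held in memory instead of a downloaded file.
my $fasta = ">example header (made up)\natgc gcgt\n# a comment\nTTAA 10\n";
open(my $IN, '<', \$fasta) or die("Cannot open in-memory file: $!");
my $dna = readFastaSketch($IN);
close($IN);

print "$dna\n";
```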
Getting a FASTA File • Go to NCBI http://www.ncbi.nlm.nih.gov/ • Search for what you want and download the file to your current machine • Send the file to your directory on cs.hiram.edu (Demo to be provided)
Assignment • Using subroutines from your text, determine the GC content of the given genomes. (Examples to be provided)