BMB 6216 – Algorithms for Biology – Class 1
Andy Kudlicki
Office: BSB 547
Phone: 772-2253, 771-1011 (cell)
a.kudlicki@utmb.edu

Welcome!
Imagine doing science without computers. Almost all of it can be done:
– Paper file folders
– Xeroxing
– Photographs on film
– Actually going to the library to browse journals
– Abstract collections
– Telephone, snail mail, telegrams
– Typewriters

The one exception: science is quantitative, and always has been.

This course:
– Using computers for computing.
– Aspects useful in biology / bioinformatics
  • Simple tasks (2 * 71.12 = ?): spreadsheets
  • Simple repetitive tasks (few or many repetitions)
  • Somewhat complicated tasks
  • Typical problems of high complexity (solved, software available)
    – BLAST, genome assembly, motif discovery, ...

Course Overview
Class 1   Introduction to the course and to the Perl programming language
Class 2   Computational complexity and numerical stability of algorithms
Class 3   Data structures and containers in PERL and other languages
          1. Tables, lists, queues, hashes and when to use them
          2. When PERL is not enough: a quick look at R and C++
Class 4   Matrix operations; Principal Component Analysis; ICA
Class 5   Network / graph algorithms
          1. Interaction networks
          2. Regulation networks
          3. Graphs for enumerating hypotheses
Class 6   Strings and regular expressions
          1. In silico enzyme digestion
          2. Gene translation
Class 7   Randomization and Monte Carlo simulations
          1. Randomization by permutation
          2. Modeling the null-hypothesis probability distribution
Class 8   Custom vector graphics: generating SVG from your data
          Create and re-create the killer graph for your paper
Class 9   Visualization of multidimensional data
Class 10  Web tools
          1. The components of a web page, elements of HTML
          2. Extracting data from webpages and other documents
          3. Connecting to GenBank using BioPerl
Class 11  Cgi-bin: creating dynamic web-based tools for data analysis
Class 12  Relational databases and SQL
          1. Relational model, normalization
          2. Basic SQL
          3. Examples: Experimental results,
Class 13  Databases and WWW
Class 14  Clustering
          1. Hierarchical
          2. K-means
          3. Friends-of-friends
Class 15  Timecourses and spectral analysis; convolution

Format: Mixed – lecture with hands-on assignments.
Computer environment: Linux
Perl, also C/C++, R, shell, awk, sed, ..., when needed
Supplementary reading:
– Larry Wall et al.: Programming Perl
– Wing-Kin Sung: Algorithms in Bioinformatics
– James Tisdall: Beginning Perl for Bioinformatics
– James Tisdall: Mastering Perl for Bioinformatics
– Stroustrup: The C++ Programming Language
Special requests: Welcome!

Computer environment: Linux
* Rich in standard tools, mostly open-source
* Industry standard
* Very similar to MacOS, Android, iOS, BSD, ChromeOS, etc.
– Has many flavors created for specific purposes

Using your laptop in class:
To get a *nix environment:
* Linux laptop (or Unix console on Mac)
  – Live CD distribution
* Cygwin
* virtual machine
* remote session (preferred, guaranteed to work)

Remote session:
– From win*: use "Remote Desktop Connection", server: 129.109.88.185
– From Mac: install "Remote Desktop Connection Client for Mac"
– From Linux: "rdesktop 129.109.88.185"
Also works from off campus
• (mycitrix.utmb.edu -> remote desktop session)
Other options:
– ssh (PuTTY on Windows), no graphics though, only on-campus
– NX NoMachine

Login to: 129.109.54.80
Username:
Password:

Unix / Linux shell / command line:
– List files: ls, ls -a, ls -1, ls -l, ls -lrt
– Directory: cd, pwd
– Copy, move, delete, link: cp, mv, rm, ln
– Machine status: ps, w, /sbin/ifconfig, date, uptime, top, df, du
– Text editors: joe, nano, emacs (C-x C-f), vi
– Pager: more, less; also: cat, head, tail, tac
– Misc: echo, tr, sed, man, whoami, wc, chmod

Simple data flow / spreadsheet-like
• Find in file: grep [grep -v; grep -f; egrep]
• Select top/bottom lines from file: head, tail
• Select columns: awk '{print $2, $3, $5+$6}'
• Merge lines: cat
• Merge columns: paste
• Sort
• Data flow: > >> < | tee tac

Exercise:
The file /data/students/classes/remastercycle.csv contains gene expression data arranged as time series in columns (affy-id, name, gene-id, data*36).
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is it above average?
• What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W.)
• List the 200 named genes that have the highest (t7+t19+t31)(t1+t13+t25)

Log in to your account (on 129.109.88.185)
– Make a fresh directory, e.g.
    mkdir bmb6216
    cd bmb6216
    mkdir class_1; cd class_1
    cp /data/students/classes/hello.pl .
* Cat it.
* Less it.
* Run it.
• Backup: cp hello.pl hello-0.pl
• Edit it: vi hello.pl

Editing with vi
– I / i (insert)
– A / a (append)
– X / x / dd (delete)
– R(eplace) / r(eplace 1 character)
– {n} W / w / B / b / hjkl – move around
– [ESC] – back from insert to command
– ZZ / :w / :q / :wq / :x / :q! – exit / save / quit
– xp – swap chars; ddp – swap lines

Exercise:
The file /home/students/classes/Class_1/remastercycle.csv contains gene expression data arranged as time series in columns (affy-id, name, gene-id, data*36).
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is it above average?
• What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W; named genes also have a common name in column 2.)
• List the 200 named genes that have the highest (t7+t19+t31)(t1+t13+t25)

PERL
Why PERL?
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister
• Versatile, portable
• Widely used in bioinformatics and web applications
• There's more than one way to do it
• Not the most elegant language, great for dirty hacks
• Easily integrated with anything

Warning: PERL 6 ain't PERL

PERL HELLO WORLD:
    print "Hello\n";

Typed into the interpreter (end input with ^D):
    > perl
    print "Hello\n";
    ^D

As a one-liner:
    > perl -e 'print "Hello\n";'

As a script:
hello.pl
=========
#!/usr/bin/perl
print "Hello\n";
=========
    > perl hello.pl
Or
    > ./hello.pl    (after chmod +x hello.pl)

VARIABLES:
Scalar:
    $dna = 'ATTTGCCCATT';
    $mouse_tail_inches = 2.13;
    $RNA = "GGGUUCAAUAUAUGGC";
    $seven = -6;
Default variable: $_
    If not specified, $_ is assumed.
No need to declare variables.

VARIABLES:
No need to declare variables. Risky though:
    $my_variable = 51;
    $something = $my_variable + 3;
    $something_else = $myvariable + 4;    # typo: $myvariable is a different, empty variable
Safer:
    use strict;

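To see what "use strict" buys you, here is a minimal sketch. The helper name strict_error is hypothetical (not from the slides); it compiles a snippet under "use strict" inside a string eval so the error message can be inspected instead of stopping the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strict_error() is a hypothetical helper for this illustration only.
# It compiles and runs a snippet under "use strict" in a string eval,
# so a compile-time error ends up in $@ instead of aborting the script.
sub strict_error {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1;";
    return $ok ? '' : $@;
}

# The misspelled name from the slide: $myvariable instead of $my_variable.
my $typo = 'my $my_variable = 51; my $something_else = $myvariable + 4';
my $fine = 'my $my_variable = 51; my $something = $my_variable + 3';

print "with typo: ", strict_error($typo);
print "correct:   ", (strict_error($fine) || "no error\n");
```

Under strict, every variable has to be declared (usually with my), so the misspelled name is rejected before the script runs rather than silently treated as empty.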
OPERATIONS:
String:
    $dna = "ATAGAGGTA" . "CATATC";
    $at_repeat = "AT" x 50;
    substr()    sub-string
    length()
    chop (last char)
    chomp (end of line)
Binding:
    print $dna if $dna =~ /ATA/;
Special characters: \t \n

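A short sketch that runs the string operations listed above. The sequence in $line is made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna       = "ATAGAGGTA" . "CATATC";   # concatenation with "."
my $at_repeat = "AT" x 50;                # repetition with "x"

print length($dna), "\n";                 # number of characters
print length($at_repeat), "\n";
print substr($dna, 0, 3), "\n";           # 3 characters starting at offset 0

# Binding operator: true when the pattern occurs in the string.
print "found ATA\n" if $dna =~ /ATA/;

# chomp removes a trailing newline (if there is one);
# chop removes the last character, whatever it is.
my $line = "GATTACA\n";                   # made-up input line
chomp $line;
my $shorter = $line;
chop $shorter;
print "$line $shorter\n";
```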
The different quotations
    $x = 6;
    print "x= $x \n";
    print 'x= $x \n';

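The difference between the two kinds of quotes can be checked directly. In this sketch the strings are stored in variables (names chosen for the example) so their lengths can be compared.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = 6;

# Double quotes: $x is replaced by its value and \n becomes a newline.
my $double = "x= $x\n";

# Single quotes: everything is taken literally, including $x and \n.
my $single = 'x= $x\n';

print $double;
print $single, "\n";
print "lengths: ", length($double), " and ", length($single), "\n";
```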
OPERATIONS:
Arithmetic:
    $a + $b
    $a - $b
    $a * $b
    $a % $b
    $a ** $b

OPERATIONS:
Incrementation (C-like)
    $a++;
    $a *= 4;
    $repeat = 'AT';
    $repeat x= 36;

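A small sketch of the arithmetic and C-like assignment operators. The variable names $p, $q and $n and the numbers are chosen for the example; $a and $b are avoided here only because Perl's sort uses them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($p, $q) = (7, 3);

my $sum        = $p + $q;
my $difference = $p - $q;
my $product    = $p * $q;
my $remainder  = $p % $q;    # modulo
my $power      = $p ** $q;   # exponentiation

print "$sum $difference $product $remainder $power\n";

my $n = 7;
$n++;          # same as $n = $n + 1
$n *= 4;       # same as $n = $n * 4

my $repeat = 'AT';
$repeat x= 36; # same as $repeat = $repeat x 36

print "$n ", length($repeat), "\n";
```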
LISTS/TABLES:
    @a = (4, 6, 3.21, 7, 'cat', "dog");
    $a[0] = 6;
    $#a       index of last element
    @a + 0    size of array
OPERATIONS:
* join / split
* push / pop / shift / unshift

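The list operations named above, run on the array from the slide. The added elements 'mouse' and 'fish' are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @a = (4, 6, 3.21, 7, 'cat', "dog");
$a[0] = 6;                         # overwrite the first element

print "last index: ", $#a, "\n";
print "size: ", @a + 0, "\n";      # an array in numeric context gives its size

push @a, 'mouse';                  # add to the end
my $last  = pop @a;                # remove from the end
my $first = shift @a;              # remove from the front
unshift @a, 'fish';                # add to the front

my $joined = join ',', @a;         # list -> string
my @fields = split /,/, $joined;   # string -> list

print "$joined\n";
print scalar(@fields), " fields\n";
```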
HASHES:
The most important data type in biology!
    $expression{"RPS16"} = 4.65;
    %expression = (
        RPL12 => 1.23,
        CDC28 => 5.31,
        STAT1 => "experiment gone south"
    );

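A minimal sketch of typical hash use with the values from the slide: fill the hash, add one key, look keys up, and loop over all of them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %expression = (
    RPL12 => 1.23,
    CDC28 => 5.31,
    STAT1 => "experiment gone south",
);
$expression{"RPS16"} = 4.65;           # add or overwrite a single entry

# keys returns the keys in no particular order, so sort them for display.
for my $gene (sort keys %expression) {
    print "$gene\t$expression{$gene}\n";
}

# exists tests for a key without creating it.
print "CDC28 was measured\n" if exists $expression{CDC28};
print "ACT1 was not measured\n" unless exists $expression{ACT1};

my $count = keys %expression;          # keys in scalar context: number of entries
print "$count genes\n";
```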
FLOW CONTROL:
    if ( $a > 4 ) { print sqrt($a), "\n"; };
    while ( $x > 0 ) { print --$x, "\n" };
    $x > 0 or $x = 6;
    for $z (1..333) { print $z, ' '; };
    for ($i=0; $i<=1000; ++$i) { next unless $a[$i] > 0 };

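The same constructs in a small runnable sketch. The data in @values and the loop bounds are made up so the results are easy to check by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @values = (3, -1, 0, 8, -5, 2);    # made-up data

# C-style for loop with "next unless": count the positive entries.
my $positives = 0;
for (my $i = 0; $i < @values; ++$i) {
    next unless $values[$i] > 0;
    $positives++;
}

# while loop: --$x decrements first, then the new value is stored.
my $x = 3;
my @countdown;
while ($x > 0) {
    push @countdown, --$x;
}

# "or" as flow control: the assignment runs only when the test is false.
$x > 0 or $x = 6;

# Loop over a range.
my $sum = 0;
for my $z (1 .. 10) {
    $sum += $z;
}

if ($sum > 4) {
    print sqrt($sum), "\n";
}

print "$positives positive; countdown: @countdown; x = $x; sum = $sum\n";
```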
TRUE or FALSE
False strings:
– "0"
– ""
Every other string is true!
– "0.00" is true
– "0.00" + 0 is false
– if ( 'Elvis is alive' ) { print 4+5, "\n"; };
– undef() is false

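These rules are easy to test. The helper truth() below is hypothetical (not part of Perl); it only reports how a value behaves in a condition.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# truth() is a hypothetical helper: it reports how Perl evaluates
# a value in boolean context.
sub truth {
    my ($value) = @_;
    return $value ? 'true' : 'false';
}

# Each case is a label for display plus the value to test.
my @cases = (
    [ q{"0"},              '0'              ],
    [ q{""},               ''               ],
    [ q{"0.00"},           '0.00'           ],
    [ q{"0.00" + 0},       '0.00' + 0       ],   # numeric 0 after the addition
    [ q{"Elvis is alive"}, 'Elvis is alive' ],
    [ q{undef},            undef            ],
);

for my $case (@cases) {
    my ($label, $value) = @$case;
    printf "%-16s is %s\n", $label, truth($value);
}
```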
SUBROUTINES
    sub addit {
        my ($x1, $x2) = @_;
        return $x1 + $x2;
    };

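A sketch of how such a subroutine is called. addit is the one from the slide; add_all is a hypothetical variant added here to show that @_ can hold any number of arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# From the slide: arguments arrive in @_ and are copied into "my" variables.
sub addit {
    my ($x1, $x2) = @_;
    return $x1 + $x2;
}

# Hypothetical variant: sum however many arguments were passed.
sub add_all {
    my $total = 0;
    $total += $_ for @_;
    return $total;
}

print addit(2, 71.12), "\n";

my @timepoints = (1.5, 2.5, 4);    # made-up values
print add_all(@timepoints), "\n";  # an array is flattened into the argument list
```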
Input / Output:
    while (<>) {
        chomp;
        $sum += $_;
    };

Input:
    open BLABLA, "data.csv";
    $firstline = <BLABLA>;
    @headers = split "\t", $firstline;
    while (<BLABLA>) {something};
    close BLABLA;

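The same reading pattern as a self-contained sketch. Two things here are assumptions and not from the slides: the data are made up and read from a string (an in-memory file) so the example runs without data.csv, and a lexical filehandle with three-argument open is used in place of the bareword BLABLA.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up tab-separated data standing in for a real file.
my $data = "gene\tt1\tt2\nRPL12\t1.5\t2.5\nCDC28\t3.0\t1.0\n";

# Opening a reference to a string reads from memory instead of disk.
open my $in, '<', \$data or die "Cannot open input: $!";

# The first line holds the column headers.
my $firstline = <$in>;
chomp $firstline;
my @headers = split /\t/, $firstline;

# Every following line is one record.
my ($rows, $sum_t1) = (0, 0);
while (my $line = <$in>) {
    chomp $line;
    my @fields = split /\t/, $line;
    $sum_t1 += $fields[1];          # column "t1"
    $rows++;
}
close $in;

print "columns: @headers\n";
print "rows: $rows, sum of t1: $sum_t1\n";
```

With a real file, only the open line changes: the file name goes where \$data is. Checking the result of open with "or die" gives an immediate message when the file is missing.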
Output:
– print $x, "\n";
– printf "format", $x;
– print + join " ", @list;

    open BLABLA, ">outdata.csv";
    print BLABLA $x, $y, "\n";    # no comma after the filehandle!!!
    close BLABLA;

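A runnable sketch of writing to a filehandle. As an assumption for the example, output goes to a string (an in-memory file) through a lexical filehandle, so nothing is written to disk; the values are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @list = (1.5, 2.25, 3);    # made-up values

# Opening a reference to a string collects the output in memory.
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open output: $!";

# No comma between the filehandle and the list of things to print.
print  $out join("\t", @list), "\n";
printf $out "%.2f\n", 3.14159;      # printf takes a format, then the values

close $out;

print $buffer;
```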
Exercises:
1. Repeat in PERL the awk/sort exercise from last hour.
2. a-S_cer_TANAY_1000upstream.fasta contains the sequences of UTRs of genes. What is the correlation between the position of the GATGAGA sequence and the average expression of the gene?

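A possible starting point for exercise 1, not a full solution. It is a sketch under stated assumptions: fields are separated by commas, column 2 holds the common name and is empty for unnamed genes, and column 4 is timepoint 1. The subroutine name summarize and the four data rows are made up; the real file would be read line by line instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# summarize() is a hypothetical helper. It takes CSV lines laid out as
# (affy-id, name, gene-id, t1, t2, ...) and returns:
#   number of named rows, mean of t1, number of rows with t1 above the mean.
sub summarize {
    my @lines = @_;
    my $named = 0;
    my @t1;
    for my $line (@lines) {
        chomp $line;
        my @f = split /,/, $line, -1;   # -1 keeps trailing empty fields
        $named++ if $f[1] ne '';        # assumption: column 2 empty = unnamed
        push @t1, $f[3];                # assumption: column 4 = timepoint 1
    }
    my $sum = 0;
    $sum += $_ for @t1;
    my $mean  = @t1 ? $sum / @t1 : 0;
    my $above = grep { $_ > $mean } @t1;   # grep in scalar context counts
    return ($named, $mean, $above);
}

# Made-up rows, only to show the layout.
my @sample = (
    "p1,GENE_A,YLR405W,4.0,1.0",
    "p2,,YAL001C,2.0,3.0",
    "p3,GENE_B,YBR160W,6.0,2.0",
    "p4,,,0.0,5.0",
);

my ($named, $mean, $above) = summarize(@sample);
print "named: $named, mean t1: $mean, above mean: $above\n";
```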
C / C++ -> for total control
============== Hello.C ======
#include <iostream>
using namespace std;
int main ()
{
    cout << "Hello :) " << 5+4 << endl;
};