Programming for Engineers in Python Biopython 1 Classes

Programming for Engineers in Python Biopython 1

Classes class <classname>: statement_1. . statement_n The methods of a class get the instance as the first parameter traditionally named self The method __init__ is called upon object construction (if available) 2

Classes Reminder: type = data representation + behavior. Classes are user-defined types. class <classname>: statement_1. . statement_n Like a mini-program: • Variables. • Function Definitions. • Even arbitrary commands. Objects of a class are called class instances. 3

Classes – Attributes and Methods class Vector 2 D: def __init__ (self, x, y): self. x, self. y = x, y Instance def size (self): return (self. x ** 2 + self. y ** 2) ** 0. 5 Methods 4 Attributes (each instance has its own copy)

Classes – Instantiate and Use >>> v = Vector 2 D(3, 4) # Make instance. >>> v <__main__. Vector 2 D object at 0 x 0000031 A 2828> >>> v. size() # Call method on instance. 5. 0

Example – Multimap A dictionary with more than one value for each key We already needed it once or twice and used: >>> lst = d. get(key, [ ]) >>> lst. append(value) >>> d[key] = lst We will now write a new class that will be a wrapper around a dict The class will have methods that allow us to keep multiple values for each key 6

Multimap. partial code class Multimap: def __init__(self): '''Create an empty Multimap''' self. inner = inner def get(self, key): '''Return list of values associated with key''' return self. inner. get(key, []) def put(self, key, value): '''Adds value to the list of values associated with key''' value_list = self. get(key) if value not in value_list: value_list. append(value) self. inner[key] = value_list 7

Multimap put_all and remove def put_all(self, key, values): for v in values: self. put(key, v) def remove(self, key, value): value_list = self. get(key) if value in value_list: value_list. remove(value) self. inner[key] = value_list return True return False 8

Multimap. Partial code def __len__(self): '''Returns the number of keys in the map''' return len(self. inner) def __str__(self): '''Converts the map to a string''' return str(self. inner) def __cmp__(self, other): '''Compares the map with another map''' return self. inner. cmp(other) def __contains__(self, key): '''Returns True if key exists in the map''' return self. has_key(k) 9

Multimap Use case – a dictionary of countries and their 10 cities: >>> m = Multimap() >>> m. put('Israel', 'Tel-Aviv') >>> m. put('Israel', 'Jerusalem') >>> m. put('France', 'Paris') >>> m. put_all('England', ('London', 'Manchester', 'Moscow')) >>> m. remove('England', 'Moscow') >>> print m. get('Israel') ['Tel-Aviv', 'Jerusalem']

11

Biopython An international association of developers of freely available Python (http: //www. python. org) tools for computational molecular biology Provides tools for Parsing files (fasta, clustalw, Gen. Bank, …) Interface to common softwares Operations on sequences Simple machine learning applications BLAST And many more 12

Installing Biopython Go to http: //biopython. org/wiki/Download Windows Unix Select python 2. 7 Num. Py is required 13

Seq. IO The standard Sequence Input/Output interface for Bio. Python Provides a simple uniform interface to input and output assorted sequence file formats Deals with sequences as Seq. Record objects There is a sister interface Bio. Align. IO for working directly with sequence alignment files as Alignment objects 14

Parsing a FASTA file # Parse a simple fasta file from Bio import Seq. IO for seq_record in Seq. IO. parse("ls_orchid. fasta", "fasta"): print seq_record. id print repr(seq_record. seq) print len(seq_record) Why repr and not str? 15

16

Gen. Bank files # genbank files from Bio import Seq. IO for seq_record in Seq. IO. parse("ls_orchid. gbk", "genbank"): print seq_record # added to print just one record example break 17

Gen. Bank files from Bio import Seq. IO for seq_record in Seq. IO. parse("ls_orchid. gbk", "genbank"): print seq_record. id print repr(seq_record. seq) print len(seq_record) 18

Sequence objects Support similar methods as standard strings Provide additional methods Translate Reverse complement Support different alphabets AGTAGTTAAA can be DNA Protein 19

Sequences and alphabets Bio. Alphabet. IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions For example: Adding ambiguous symbols Adding special new characters 20

Example – generic alphabet >>> from Bio. Seq import Seq >>> my_seq = Seq("AGTACACTGGT") >>> my_seq Seq('AGTACACTGGT', Alphabet()) Non-specific >>> my_seq. alphabet Alphabet() 21

Example – specific sequences >>> from Bio. Seq import Seq >>> from Bio. Alphabet import IUPAC >>> my_seq = Seq("AGTACACTGGT", IUPAC. unambiguous_dna) >>> my_seq Seq('AGTACACTGGT', IUPACUnambiguous. DNA()) >>> my_seq. alphabet IUPACUnambiguous. DNA() >>> from Bio. Seq import Seq >>> from Bio. Alphabet import IUPAC >>> my_prot = Seq("AGTACACTGGT", IUPAC. protein) >>> my_prot Seq('AGTACACTGGT', IUPACProtein()) >>> my_prot. alphabet IUPACProtein() 22

Sequences act like strings Access elements >>> print my_seq[0] #first letter G >>> print my_seq[2] #third letter T >>> print my_seq[-1] #last letter G Count without overlaps >>> from Bio. Seq import Seq >>> "AAAA". count("AA") 2 >>> Seq("AAAA"). count("AA") 2 23

Calculate GC content >>> from Bio. Seq import Seq >>> from Bio. Alphabet import IUPAC >>> from Bio. Seq. Utils import GC >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC. unambiguous_dna) >>> GC(my_seq) 46. 875 24

Slicing Simple slicing >>> from Bio. Seq import Seq >>> from Bio. Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC. unambiguous_dna) >>> my_seq[4: 12] Seq('GATGGGCC', IUPACUnambiguous. DNA()) Start, stop, stride >>> my_seq[0: : 3] Seq('GCTGTAGTAAG', IUPACUnambiguous. DNA()) >>> my_seq[1: : 3] Seq('AGGCATC', IUPACUnambiguous. DNA()) >>> my_seq[2: : 3] Seq('TAGCTAAGAC', IUPACUnambiguous. DNA()) 25

Concatenation Simple addition as in Python But, alphabets must fit >>> from Bio. Alphabet import IUPAC >>> from Bio. Seq import Seq >>> protein_seq = Seq("EVRNAK", IUPAC. protein) >>> dna_seq = Seq("ACGT", IUPAC. unambiguous_dna) >>> protein_seq + dna_seq Traceback (most recent call last): … 26

Changing case >>> from Bio. Seq import Seq >>> from Bio. Alphabet import generic_dna >>> dna_seq = Seq("acgt. ACGT", generic_dna) >>> dna_seq Seq('acgt. ACGT', DNAAlphabet()) >>> dna_seq. upper() Seq('ACGT', DNAAlphabet()) >>> dna_seq. lower() Seq('acgt', DNAAlphabet()) 27

Changing case Case is important for matching >>> "GTAC" in dna_seq False >>> "GTAC" in dna_seq. upper() True IUPAC names are upper case >>> from Bio. Seq import Seq >>> from Bio. Alphabet import IUPAC >>> dna_seq = Seq("ACGT", IUPAC. unambiguous_dna) >>> dna_seq Seq('ACGT', IUPACUnambiguous. DNA()) >>> dna_seq. lower() Seq('acgt', DNAAlphabet()) 28

Reverse complement >>> from Bio. Seq import Seq >>> from Bio. Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC. unambiguous_dna) >>> my_seq. complement() Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguous. DNA()) >>> my_seq. reverse_complement() Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguous. DNA()) 29

Transcription >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC. unambiguous_dna) >>> template_dna = coding_dna. reverse_complement() >>> messenger_rna = coding_dna. transcribe() >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguous. RNA()) As you can see, all this does is switch T → U, and adjust the alphabet. 30

Translation Simple example >>> from Bio. Seq import Seq >>> from Bio. Alphabet import IUPAC >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC. unambiguous_rna) >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguous. RNA()) >>> messenger_rna. translate() Seq('MAIVMGR*KGAR*', Has. Stop. Codon(IUPACProtein(), '*')) 31 Stop codon!

Translation from the DNA >>> from Bio. Seq import Seq >>> from Bio. Alphabet import IUPAC >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC. unambiguous_dna) >>> coding_dna Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguous. DNA()) >>> coding_dna. translate() Seq('MAIVMGR*KGAR*', Has. Stop. Codon(IUPACProtein(), '*')) 32

Using different translation tables In several cases we may want to use different translation tables Translation tables are given IDs in Gen. Bank (standard=1) Vertebrate Mitochondrial is table 2 More details in http: //www. ncbi. nlm. nih. gov/Taxonomy/Utils/wprintgc. cgi 33

Using different translation tables >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC. unambiguous_dna) >>> coding_dna. translate() Seq('MAIVMGR*KGAR*', Has. Stop. Codon(IUPACProtein(), '*')) >>> coding_dna. translate(table="Vertebrate Mitochondrial") Seq('MAIVMGRWKGAR*', Has. Stop. Codon(IUPACProtein(), '*')) >>> coding_dna. translate(table=2) Seq('MAIVMGRWKGAR*', Has. Stop. Codon(IUPACProtein(), '*')) 34

Translation tables in biopython 35

Translate up to the first stop in frame >>> coding_dna. translate() Seq('MAIVMGR*KGAR*', Has. Stop. Codon(IUPACProtein(), '*')) >>> coding_dna. translate(to_stop=True) Seq('MAIVMGR', IUPACProtein()) >>> coding_dna. translate(table=2) Seq('MAIVMGRWKGAR*', Has. Stop. Codon(IUPACProtein(), '*')) >>> coding_dna. translate(table=2, to_stop=True) Seq('MAIVMGRWKGAR', IUPACProtein()) 36

Comparing sequences Standard “==“ comparison is done by comparing the references (!), hence: >>> seq 1 = Seq("ACGT", IUPAC. unambiguous_dna) >>> seq 2 = Seq("ACGT", IUPAC. unambiguous_dna) >>> seq 1==seq 2 Warning (from warnings module): … Future. Warning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception)… please use str(seq 1)==str(seq 2) to make your code explicit and to avoid this warning. False >>> seq 1==seq 1 True 37

Mutable vs. Immutable Like strings standard seq objects are immutable If you want to create a mutable object you need to write it by either: Use the “tomutable()” method Use the mutable constructor mutable_seq = Mutable. Seq("GCCATTGTAATGGGCCGCTGAAAG GGTGCCCGA", IUPAC. unambiguous_dna) 38

Unknown sequences example In many biological cases we deal with unknown sequences >>> from Bio. Seq import Unknown. Seq >>> from Bio. Alphabet import IUPAC >>> unk_dna = Unknown. Seq(20, alphabet=IUPAC. ambiguous_dna) >>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC. unambiguous_dna) >>> unk_dna+my_seq Seq('NNNNNNNNNNGCCATTGTAATGGGCCGCT GAAAGGGTGCCCGA', IUPACAmbiguous. DNA()) 39

MS A 40

Read MSA Use Bio. Align. IO. read(file, format) File – the file path Format support: “stockholm” “fasta” “clustal” … Use help(Align. IO) for details 41

Example We want to parse this file from PFAM 42

Example from Bio import Align. IO alignment = Align. IO. read("PF 05371. sth", "stockholm") print alignment 43

Alignment object example >>> from Bio import Align. IO >>> alignment = Align. IO. read("PF 05371_seed. sth", "stockholm") >>> print alignment[1] ID: Q 9 T 0 Q 8_BPIKE/1 -52 Name: Q 9 T 0 Q 8_BPIKE Description: Q 9 T 0 Q 8_BPIKE/1 -52 Number of features: 0 /start=1 /end=52 /accession=Q 9 T 0 Q 8. 1 Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKK FVSRA', Single. Letter. Alphabet()) 44

Alignment object example >>> print "Alignment length %i" % alignment. get_alignment_length() Alignment length 52 >>> for record in alignment: print "%s - %s" % (record. seq, record. id) AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA COATB_BPIKE/30 -81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA Q 9 T 0 Q 8_BPIKE/1 -52 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA COATB_BPI 22/32 -83 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA COATB_BPM 13/24 -72 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA COATB_BPZJ 2/1 -49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA Q 9 T 0 Q 9_BPFD/1 -49 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA COATB_BPIF 1/22 -73 45

Cross-references example Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure? >>> for record in alignment: if record. dbxrefs: print record. id, record. dbxrefs COATB_BPIKE/30 -81 ['PDB; 1 ifl ; 1 -52; '] COATB_BPM 13/24 -72 ['PDB; 2 cpb ; 1 -49; ', 'PDB; 2 cps ; 1 -49; '] Q 9 T 0 Q 9_BPFD/1 -49 ['PDB; 1 nh 4 A; 1 -49; '] COATB_BPIF 1/22 -73 ['PDB; 1 ifk ; 1 -50; '] 46

Comments Remember that almost all MSA formats are supported When you have more than one MSA in your files use Align. IO. parse() Common example is PHYLIP’s output Use Align. IO. parse("resampled. phy", "phylip") The result is an iterator object that contains all MSAs 47

Write alignment to file from Bio. Alphabet import generic_dna Bio. Seq import Seq Bio. Seq. Record import Seq. Record Bio. Align import Multiple. Seq. Alignment align 1 = Multiple. Seq. Alignment([ Seq. Record(Seq("ACTGCTAG", generic_dna), id="Alpha"), Seq. Record(Seq("ACT-CTAG", generic_dna), id="Beta"), Seq. Record(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"), ]) from Bio import Align. IO. write(align 1, "my_example. phy", "phylip") 48 3 12 Alpha ACTGCTAGCT AG Beta ACT-CTAGCT AG Gamma ACTGCTAGDT AG 39 Delta GTCAGC-AG Epislon GACAGCTAG Zeta GTCAGCTAG 3 13 Eta ACTAGTACAG CTG Theta ACTAGTACAG CT-

Slicing Alignments work like numpy matrices >>> print alignment[2, 6] T # You can pull out a single column as a string like this: >>> print alignment[: , 6] TTT---T >>> print alignment[3: 6, : 6] Single. Letter. Alphabet() alignment with 3 rows and 6 columns AEGDDP COATB_BPM 13/24 -72 AEGDDP COATB_BPZJ 2/1 -49 AEGDDP Q 9 T 0 Q 9_BPFD/1 -49 49 >>> print alignment[: , : 6] Single. Letter. Alphabet() alignment with 7 rows and 6 columns AEPNAA COATB_BPIKE/30 -81 AEPNAA Q 9 T 0 Q 8_BPIKE/1 -52 DGTSTA COATB_BPI 22/32 -83 AEGDDP COATB_BPM 13/24 -72 AEGDDP COATB_BPZJ 2/1 -49 AEGDDP Q 9 T 0 Q 9_BPFD/1 -49 FAADDA COATB_BPIF 1/22 -73

External applications How do we call MSA algorithms on unaligned set of sequences? Biopython provides wrappers The idea: Create a command line object with the algorithm options Invoke the command (Python uses subprocesses) Bio. Align. Applications module: >>> import Bio. Align. Applications >>> dir(Bio. Align. Applications) ['Clustalw. Commandline', 'Dialign. Commandline', 'Mafft. Commandline', 50 'Muscle. Commandline', 'Prank. Commandline', 'Probcons. Commandline', 'TCoffee. Commandline' ]

Clustal. W example First step: download Clustal. W from ftp: //ftp. ebi. ac. uk/pub/software/clustalw 2/2. 1/ Second step: install Third step: look for clustal exe files Now you can run Clustal. W from your Python code 51

Run example >>> >>> >>> import os from Bio. Align. Applications import Clustalw. Commandline clustalw_exe = r"C: Program Filesnew clustalclustalw 2. exe" clustalw_cline = Clustalw. Commandline(clustalw_exe, infile="opuntia. fasta") assert os. path. isfile(clustalw_exe), "Clustal W executable missing" stdout, stderr = clustalw_cline() The command line is actually a function we can run! 52

Clustal. W >>> from Bio import Align. IO >>> align = Align. IO. read("opuntia. aln", "clustal") >>> print align Single. Letter. Alphabet() alignment with 7 rows and 906 columns TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273285|gb|AF 191659. 1|AF 191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273284|gb|AF 191658. 1|AF 191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273287|gb|AF 191661. 1|AF 191 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273286|gb|AF 191660. 1|AF 191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273290|gb|AF 191664. 1|AF 191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273289|gb|AF 191663. 1|AF 191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273291|gb|AF 191665. 1|AF 191 53

Clustal. W - tree In case you are interested, the opuntia. dnd file Clustal. W creates is just a standard Newick tree file, and Bio. Phylo can parse these: >>> from Bio import Phylo >>> tree = Phylo. read("opuntia. dnd", "newick") >>> Phylo. draw_ascii(tree) 54

BLAST 55

Running BLAST over the internet We use the function qblast() in the Bio. Blast. NCBIWWW module. This has three nonoptional arguments: The blast program to use for the search, as a lower case string: works with blastn, blastp, blastx, tblast and tblastx. The databases to search against. The options for this are available on the NCBI web pages at http: //www. ncbi. nlm. nih. gov/BLAST/blast_databases. s html. A string containing your query sequence. This can either be the sequence itself, the sequence in fasta format, or an identifier like a GI number. 56

qblast additional parameters qblast can receive other parameters, analogous to the parameters of the actual server Important examples: format_type: "HTML", "Text", "ASN. 1", or "XML". The default is "XML", as that is the format expected by the parser (see next examples) expect sets the expectation or e-value threshold. 57

Step 1: call BLAST >>> from Bio. Blast import NCBIWWW # Option 1 - Use GI ID >>> result_handle = NCBIWWW. qblast("blastn", "nt", "8332116") # Option 2 – read a fasta file >>> fasta_string = open("m_cold. fasta"). read() >>> result_handle = NCBIWWW. qblast("blastn", "nt", fasta_string) # option 3 – parse file to seq object >>> record = Seq. IO. read(open("m_cold. fasta"), format="fasta") >>> result_handle = NCBIWWW. qblast("blastn", "nt", record. seq) 58

Step 2: parse the results >>> from Bio. Blast import NCBIXML >>> blast_record = NCBIXML. read(result_handle) Read can be used only once! blast_record object keeps the actual results 59

Remarks Basically, Biopython supports reading BLAST results from HTMLs and text files. These methods are not stable and sometimes fail because the servers change the format. XML is stable You can save XML files In the server From result_handle objects (next slide) 60

Save results as XML >>> save_file = open("my_blast. xml", "w") >>> save_file. write(result_handle. read()) >>> save_file. close() Read can be used only once! >>> result_handle. close() 61

BLAST records A BLAST Record contains everything you might ever want to extract from the BLAST output. Example: 62 >>> E_VALUE_THRESH = 0. 04 >>> for alignment in blast_record. alignments: for hsp in alignment. hsps: if hsp. expect < E_VALUE_THRESH: print '****Alignment****' print 'sequence: ', alignment. title print 'length: ', alignment. length print 'e value: ', hsp. expect print hsp. query[0: 75] + '' print hsp. match[0: 75] + '' print hsp. sbjct[0: 75] + ''

BLAST records 63

More functions We cover here very basic functions To get more details use >>> import Bio. Blast. Record >>> help(Bio. Blast. Record) Help on module Bio. Blast. Record in Bio. Blast: NAME Bio. Blast. Record - Record classes to hold BLAST output. FILE d: python 27libsite-packagesbioblastrecord. py 64 DESCRIPTION Classes: Blast Holds all the information from a blast search. PSIBlast Holds all the information from a psi-blast search. Header Holds information from the header. Description Holds information about one hit description. Alignment Holds information about one alignment hit. HSP Holds information about one HSP. Multiple. Alignment Holds information about a multiple alignment. Database. Report Holds information from the database report. Parameters Holds information from the parameters.

Accessing NCBI’s Entrez Databases 65

Bio. Entrez Module for programmatic access to Entrez Example: search Pub. Med or download Gen. Bank records from within a Python script Makes use of the Entrez Programming Utilities http: //www. ncbi. nlm. nih. gov/entrez/utils/ Makes sure that the correct URL is used for the queries, and that not more than one request is made every three seconds, as required by NCBI Note! If the NCBI finds you are abusing their systems, they can and will ban your access! 66

ESearch example >>> handle = Entrez. esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND mat. K[Gene]") >>> record = Entrez. read(handle) # Each of the IDs is a Gen. Bank identifier. >>> print (record["Id. List"]) ['126789333', '442591189', '442591187', '442591185', '442591183', '442591181', '442591179', '442591177', '442591175', '442591173', '442591171', '442591169', '442591167', '442591165', '442591163', '442591161', '442591159', '442591157', '442591155', '442591153'] 67

Explanation Entrez. read Transforms the actual results (retrieved as XML) to a usable object of type Bio. Entrez. Parser. Dictionary. Element >>> record {u'Count': '158', u'Ret. Max': '20', u'Id. List': ['126789333', '442591189', '442591187', '442591185', '442591183', '442591181', '442591179', '442591177', '442591175', '442591173', '442591171', '442591169', '442591167', '442591165', '442591163', '442591161', '442591159', '442591157', '442591155', '442591153'], u'Translation. Stack': [{u'Count': '2482', u'Field': 'Organism', u'Term': '"Cypripedioideae"[Organism]', u'Explode': 'Y'}, {u'Count': '71514', u'Field': 'Gene', u'Term': 'mat. K[Gene]', u'Explode': 'N'}, 'AND'], u'Translation. Set': [{u'To': '"Cypripedioideae"[Organism]', u'From': 'Cypripedioideae[Orgn]'}], u'Ret. Start': '0', u'Query. Translation': '"Cypripedioideae"[Organism] AND mat. K[Gene]'} 68

Database options 'pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest', 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap', 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene', 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound', 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists' 69

Download a full record >>> from Bio import Entrez # Always tell NCBI who you are >>> Entrez. email = A. N. Other@example. com # rettype: get a Gen. Bank record >>> handle = Entrez. efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text") >>> print handle. read() 70

71

Change ‘gb’ to ‘fasta’ 72

Read directly to Seq. IO object >>> from Bio import Entrez, Seq. IO >>> handle = Entrez. efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text") >>> record = Seq. IO. read(handle, "genbank") >>> handle. close() >>> print record ID: EU 490707. 1 Name: EU 490707 Description: Selenipedium aequinoctiale maturase K (mat. K) gene, partial cds; chloroplast. Number of features: 3. . . Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCT AGTTTAGTA. . . GAA', IUPACAmbiguous. DNA()) 73

Download directly from a URL Suppose we know how the database URLs look like Example: GEO (gene expression omnibus) "http: //www. ncbi. nlm. nih. gov/geo/download/? ac c=GSE 6609&format=file" 74

Use the urlib 2 module >>> import urllib 2 >>> u = urllib 2. urlopen('http: //www. ncbi. nlm. nih. gov/geo/downloa d/? acc=GSE 6609&format=file') >>> local. File = open('gse 6609_raw. tar', 'w') >>> for x in u: local. File. write(x) >>> local. File. close() 75

More details We covered only a few concepts For more details on Biopython options, including dealing with specialized parsers, see http: //biopython. org/DIST/docs/tutorial/Tutorial. ht ml#sec: parsing-blast Chapter 9 Look at the urllib 2 manual http: //docs. python. org/2/library/urllib 2. html 76

Sequence Motifs 77

Gene expression regulation Transcription is regulated mainly by transcription factors (TFs) - proteins that bind to DNA subsequences, called binding sites (BSs) TFBSs are located mainly in the gene’s promoter – the DNA sequence upstream the gene’s transcription start site (TSS) TFs can promote or repress transcription Other regulators: micro-RNAs (mi. RNAs)

Ab-initio motif discovery You are given a set of strings You want to find a motif that is significantly represented in the strings For example: TFmi. RNA binding site 79

TFBS models The BSs of a particular TF share a common pattern, or motif, which is often modeled using: Degenerate string GGWATB (W={A, T}, B={C, G, T}) ATCGGAATTCTGCAG GGCAATTCGGGAATG AGGTATTCTCAGATTA PWM = Position weight matrix 1 2 3 4 5 6 A 0. 1 0. 8 0 0. 7 0. 2 0 C 0 0. 1 0. 5 0. 1 0. 4 0. 6 G 0 0 0. 5 0. 1 0. 4 0. 1 T 0. 9 0. 1 0 0. 3 ØCutoff = 0. 009 AGCTACACCCATTTAT 0. 06 AGTAGAGCCTTCGTG 0. 06 CGATTCTACAATATGA 0. 01

Motif discovery: The typical two-step pipeline Promoter/3’UTR sequences Co-regulated gene set Cluster I Gene expression microarrays Clustering Cluster III Location analysis (Ch. IP-chip, …) Functional group (e. g. , GO term) Motif discovery

Motif discovery: Goals and challenges Goal: Reverse-engineer the transcriptional regulatory network Challenges: BSs are short and degenerate (non-specific) Promoters are long + complex (hard to model) Search space is huge (motif and sequence) Data is noisy What to look for? (enriched? , localized? , conserved? ) Problem is still considered very difficult despite extensive research

Biopython motif objects from Bio import motifs from Bio. Seq import Seq instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("TACCC"), Seq("AATGC"), Seq("AATGC")] m = motifs. create(instances) print m TACAA TACGC TACAC TACCC AATGC 83

Biopython motif objects >>> print m. counts 0 1 2 3 4 A: 3. 00 7. 00 0. 00 2. 00 1. 00 C: 0. 00 5. 00 2. 00 6. 00 G: 0. 00 3. 00 0. 00 T: 4. 00 0. 00 2. 00 0. 00 84

Biopython motif objects >>> m. consensus Seq('TACGC', IUPACUnambiguous. DNA()) #The anticonsensus sequence, corresponding to the smallest values in the columns of the. counts matrix: >>> m. anticonsensus Seq('GGGTG', IUPACUnambiguous. DNA()) 85

Motif database (http: //jaspar. genereg. net/) 86

87

88

89

90

Read records from Bio import motifs arnt = motifs. read(open("Arnt. sites"), "sites") print arnt. counts 0 1 2 3 4 5 A: 4. 00 19. 00 0. 00 C: 16. 00 0. 00 20. 00 G: 0. 00 1. 00 0. 00 20. 00 T: 0. 00 20. 00 91

MEME is a tool for discovering motifs in a group of related DNA or protein sequences. It takes as input a group of DNA or protein sequences and outputs as many motifs as requested. Therefore, in contrast to JASPAR files, MEME output files typically contain multiple motifs. 92

Assumptions The number of motifs is known Assume this number is 1 The size of the motif is known Biologically, we have estimates for the size for TFs and mi. RNA Missing information PWM of the motif PWM of the background Motif locations 93

Assumptions Given a sequence X and a PWM Y, of the same length we can calculate P(X|Y) Assume independence of motif positions 94

Assumptions Given a sequence X and a PWM Y, of the same length we can calculate P(X|Y) Assume independence of motif positions Given a PWM we can now calculate for each position K in each sequence J the probability the motif starts at K in the sequence J. 95

Expectation Maximization (EM) Algorithm • Start with initial guess for the PWMs • The EM algorithm consists of the two steps, which are repeated consecutively. • Step 1, estimate the probability of finding the site at any position in each of the sequences. These probabilities are used to provide new information as to expected base or aa distribution for each column in the site. • Step 2, the maximization step, the new counts for bases or aa for each position in the site found in the step 1 are substituted for the previous set.

Expectation Maximization (EM) Algorithm OOOOOOOOXXXXOOOOOOOO o o o o o o OOOOOOOOXXXXOOOOOOOO IIIIIIII Columns defined by a preliminary alignment of the sequences provide initial estimates of frequencies of aa in each motif column IIIIIII Columns not in motif provide background frequencies Bases Background Site column 1 Site column 2 …… G 0. 27 0. 4 0. 1 …… C 0. 25 0. 4 0. 1 …… A 0. 25 0. 2 0. 1 …… T 0. 23 0. 2 0. 7 …… Total 1. 00 ……

Expectation Maximization (EM) Algorithm XXXXOOOOOOOO XXXX A IIIIIIIIII OXXXXOOOOOOOO XXXX B IIII I IIIIIIII Use previous estimates of aa or nucleotide frequencies for each column in the motif to calculate probability of motif in this position, and multiply by……. . X …background frequencies in the remaining positions. The resulting score gives the likelihood that the motif matches positions A, B or other in seq 1. Repeat for all other positions and find most likely locator. Then repeat for the remaining seq’s.

EM Algorithm 2 nd optimisation step: calculations • The site probabilities for each seq calculated at the 1 st step are then used to create a new table of expected values for base counts for each of the site positions using the site probabilities as weights. • Suppose that P (site 1 in seq 1) = Psite 1, seq 1 / (Psite 1, seq 1 + Psite 2, seq 1 + …+ Psite 78, seq 1 ) = 0. 01 and P (site 2 in seq 1) = 0. 02. • Then this values are added to the previous table as shown in the table below. • This procedure is repeated for every other possible first columns in seq 1 and then the process continues for all other sequences resulting in a new version of the table. • The expectation and maximization steps are repeated until the estimates of base frequencies do not change. Bases Background Site column 1 Site column 2 …… G 0. 27 + … 0. 4 + … 0. 1 + … …… C 0. 25 + … 0. 4 + … 0. 1 + … …… A 0. 25 + … 0. 2 + 0. 01 0. 1 + … …… T 0. 23 + … 0. 2 + … 0. 7 + 0. 02 …… Total/ weighted 1. 00 ……

Run MEME ( 100 http: //meme. nbcr. net/meme/cgi-bin/meme. cgi )

Results 101

Parse results >>> handle = open("meme. dna. oops. txt") >>> record = motifs. parse(handle, "meme") >>> handle. close() >>> len(record) 2 >>> motif = record[0] >>> print motif. consensus TTCACATGCCGC >>> print motif. degenerate_consensus TTCACATGSCNC 102

Motif attributes >>> 7 >>> 12 >>> 0. 2 >>> motif. num_occurrences motif. length evalue = motif. evalue print "%3. 1 g" % evalue motif. name 'Motif 1' 103

Where the motif was found >>> motif = record['Motif 1'] # Each motif has an attribute. instances with the sequence instances in which the motif was found, providing some information on each instance >>> len(motif. instances) 7 >>> motif. instances[0] Instance('TTCACATGCCGC', IUPACUnambiguous. DNA()) >>> motif. instances[0]. start 620 >>> motif. instances[0]. strand '-' >>> motif. instances[0]. length 12 >>> pvalue = motif. instances[0]. pvalue >>> print "%5. 3 g" % pvalue 1. 85 e-08 104

Amadeus Advanced algorithms improve upon MEME This is an algorithm for motif finding Appears to be one of the top algorithms in many tests Java based tool Easy to use GUI Supports analysis of TFs and mi. RNAs Developed here in TAU 105

Amadeus A Motif Algorithm for Detecting Enrichment in m. Ultiple Species Supports diverse motif discovery tasks: 1. 2. 3. Finding over-represented motifs in one or more given sets of genes. Identifying motifs with global spatial features given only the genomic sequences. Simultaneous inference of motifs and their associated expression profiles given genome-wide expression datasets. How? A general pipeline architecture for enumerating motifs. Different statistical scoring schemes of motifs for different motif discovery tasks.

Input: ~350 genes expressed in the human G 2+M cell-cycle phases [Whitfield et al. ’ 02] Pairs analysis CHR NF-Y (CCAATbox)

Clustering analysis 108

Clustering - reminder Cluster analysis is the grouping of items into clusters based on the similarity of the items to each other. Bio. Cluster module Kmeans SOM Hierarchical clustering PCA 109

K-means clustering Mac. Queen, 65 Input: a set of observations (x 1, x 2, …, xn) For example, each observation is a gene, and x is the values Goal: partition the observation to K clusters S = {S 1, S 2, …, Sk} Objective function: 110

K-means clustering Mac. Queen, 65 Initialize an arbitrary partition P into k clusters C 1 , …, C k. For cluster Cj, element i Cj, EP(i, Cj) = cost of soln. if i is moved to cluster Cj. Pick EP(r, Cs) if the new partition is better Repeat until no improvement possible Requires knowledge of k 111

K-means variations Compute a centroid cp for each cluster Cp, e. g. , gravity center = average vector Solution cost: clusters p i in cluster pd(vi, cp) Parallel version: move each to the cluster with the closest centroid simultaneously Sequential version: one at a time “moving centers” approach Objective = homogeneity only (k fixed) 112

113

114

Data representation The data to be clustered are represented by a n × m Numerical Python array data. Within the context of gene expression data clustering, typically the rows correspond to different genes whereas the columns correspond to different experimental conditions. The clustering algorithms in Bio. Cluster can be applied both to rows (genes) and to columns (experiments). 115

DistanceSimilarity functions 'e': Euclidean distance 'c': Pearson correlation coefficient 'a': Absolute value of the Pearson correlation coefficient 'u': cosine of the angle between two data vectors 'x': Absolute uncentered Pearson correlation 's': Spearman’s rank correlation 116

Calculating distance matrices >>> from Bio. Cluster import distancematrix >>> matrix = distancematrix(data) data - required Additional options: transpose (default: 0) Determines if the distances between the rows of data are to be calculated (transpose==0), or between the columns of data (transpose==1). dist (default: 'e', Euclidean distance) 117

Distancematrix To save space Biopython keeps only the lowerupper triangle of the matrix 118

Partitioning algorithms Algorithms that receive the number of clusters K as an argument Kmeans Kmedians Often referred to as EM variations 119

Analysis example 120

Analysis example # Read the data import csv file = open('ge_data_example. txt', 'rb') data = csv. reader(file, delimiter='t') table = [row for row in data] >>> len(table) 100 >>> table[1][1] '9. 412' >>> table[0][0] 'sample' >>> len(table[1]) 17 121

Analysis example # Transform the data to numpy matrix from numpy import * mat = matrix(table[1: ], dtype='float') print len(mat) # Create the distance matrix from Bio. Cluster import distancematrix dist_matrix = distancematrix(mat) # Cluster from Bio. Cluster import kclusterid, error, nfound = kcluster(mat) 122

Analysis example # Cluster from Bio. Cluster import kclusterid, error, nfound = kcluster(mat) Clusterid: array with cluster assignments Error: the within cluster sum of distances Nfound: the number of times the returned solution was found 123

Analysis example 124 >>> clusterid array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) >>> error 15988. 118370804612 >>> nfound 1

Kcluster: other options nclusters (default: 2): the number of clusters k. transpose (default: 0): Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered. npass (default: 1): the number of times the k-means/-medians clustering algorithm is performed method (default: a): describes how the center of a cluster is found: method=='a': arithmetic mean (k-means clustering); method=='m': median (k-medians clustering). dist (default: 'e', Euclidean distance) initialid (default: None) Specifies the initial clustering to be used for the algorithm. 125

Hierarchical clustering from Bio. Cluster import treecluster tree 1 = treecluster(mat) # Can be applied to a precalculated distance matrix tree 2 = treecluster(distancematrix=dist_matrix) # Get the cluster assignments clusterid = tree 1. cut(3) 126

Hierarchical clustering using Sci. Py Better visualizations! # Create a distance matrix X=mat D = scipy. zeros([len(x), len(x)]) for i in range(len(x)): for j in range(len(x)): D[i, j] = sum(abs(x[i] - x[j])) 127

Hierarchical clustering using Sci. Py # Compute and plot first dendrogram. fig = pylab. figure(figsize=(8, 8)) # Add an axes at position rect [left, bottom, width, height] where all quantities are in fractions of figure width and height. ax 1 = fig. add_axes([0. 09, 0. 1, 0. 2, 0. 6]) # Clustering analysis Y = sch. linkage(D, method='centroid') Z 1 = sch. dendrogram(Y, orientation='right') ax 1. set_xticks([]) ax 1. set_yticks([]) 128

Hierarchical clustering using Sci. Py # Plot distance matrix. axmatrix = fig. add_axes([0. 3, 0. 1, 0. 6]) idx 1 = Z 1['leaves'] D = D[idx 1, : ] im = axmatrix. matshow(D, aspect='auto', origin='lower', cmap=pylab. cm. Yl. Gn. Bu) axmatrix. set_xticks([]) axmatrix. set_yticks([]) 129

Hierarchical clustering using Sci. Py # Plot colorbar. axcolor = fig. add_axes([0. 91, 0. 02, 0. 6]) pylab. colorbar(im, cax=axcolor) fig. show() 130

Phylogeneti c trees 131

Remember the Newick format? Simple example without branch length (((A, B), (C, D)), (E, F, G)) 132

Visualizing trees >>> >>> 133 local. File. close() from Bio import Phylo tree = Phylo. read("simple. dnd", "newick") print tree Tree(weight=1. 0, rooted=False) Clade(branch_length=1. 0) Clade(branch_length=1. 0, name='A') Clade(branch_length=1. 0, name='B') Clade(branch_length=1. 0, name='C') Clade(branch_length=1. 0, name='D') Clade(branch_length=1. 0, name='E') Clade(branch_length=1. 0, name='F') Clade(branch_length=1. 0, name='G')

Visualizing trees 134

Use matplotlib >>> import matplotlib >>> tree. rooted = True >>> Phylo. draw(tree) 135

Phylo IO Phylo. read() reads a tree with exactly one tree If you have many trees use a loop over the returned object of Phylo. parse() Write to file using Phylo. write(tree. Obj, format) Popular formats: “nwk”, “xml” Convert tree formats using Phylo. convert("tree 1. xml", "phyloxml", "tree 1. dnd", "newick") 136