Sequence File Parsing using Biopython BCHB 524 Lecture

Sequence File Parsing using Biopython BCHB 524 Lecture 12 BCHB 524 - Edwards

Review l Modules in the standard-python library: l l l sys, os. path – access files, program environment zipfile, gzip – access compressed files directly urllib – access web-resources (URLs) as files csv – read delimited line based records from files Plus lots, lots more. BCHB 524 - Edwards 2

Bio. Python l Additional modules that make many common bioinformatics tasks easier l l Have to install separately l l File parsing (many formats) & web-retrieval Formal biological alphabets, codon tables, etc Lots of other stuff… Not part of standard python, or Enthought biopython. org BCHB 524 - Edwards 3

Biopython: Fasta format l l Most common biological sequence data format Header/Description line l l Multi-accession sometimes represented l l >accession description accession 1|accession 2|accession 3 lots of variations, no standardization No prescribed format for the description Other lines l l sequence, one chunk per line. Usually all lines, except the last, are the same length. BCHB 524 - Edwards 4

Bio. Python: Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "fasta"): # Print out the various elements of the Seq. Record print "n------NEW SEQRECORD------n" print "seq_record. id: nt", seq_record. id print "seq_record. description: nt", seq_record. description print "seq_record. seq: nt", seq_record. seqfile. close() BCHB 524 - Edwards 5

Biopython: Other formats l Genbank format l l Uni. Prot/Swiss. Prot flat-file format l l From Uni. Prot for Swiss. Prot and Tr. EMBL Uni. Prot-XML format: l l From NCBI, also format for Ref. Seq sequence From Uni. Prot for Swiss. Prot and Tr. EMBL Use the gzip module to handle compressed sequence databases BCHB 524 - Edwards 6

Bio. Python: Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "genbank"): # Print out the various elements of the Seq. Record print "n------NEW SEQRECORD------n" print "seq_record. id: nt", seq_record. id print "seq_record. description: nt", seq_record. description print "seq_record. seq: nt", seq_record. seqfile. close() BCHB 524 - Edwards 7

Bio. Python: Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "swiss"): # Print out the various elements of the Seq. Record print "n------NEW SEQRECORD------n" print "seq_record. id: nt", seq_record. id print "seq_record. description: nt", seq_record. description print "seq_record. seq: nt", seq_record. seqfile. close() BCHB 524 - Edwards 8

Bio. Python: Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "uniprot-xml"): # Print out the various elements of the Seq. Record print "n------NEW SEQRECORD------n" print "seq_record. id: nt", seq_record. id print "seq_record. description: nt", seq_record. description print "seq_record. seq: nt", seq_record. seqfile. close() BCHB 524 - Edwards 9

Bio. Python: Bio. Seq. IO and gzip import Bio. Seq. IO import sys import gzip # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = gzip. open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "fasta"): # Print out the various elements of the Seq. Record print "n------NEW SEQRECORD------n" print "seq_record. id: nt", seq_record. id print "seq_record. description: nt", seq_record. description print "seq_record. seq: nt", seq_record. seqfile. close() BCHB 524 - Edwards 10

What about the other "stuff" l Bio. Python makes it easy to get access to non -sequence information stored in "rich" sequence databases l l Annotations Cross-References Sequence Features Literature BCHB 524 - Edwards 11

Bio. Python: Bio. Seq. IO import sys import gzip # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = gzip. open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "uniprot-xml"): # What else is available in the Seq. Record? print "n------NEW SEQRECORD------n" print "repr(seq_record)nt", repr(seq_record) print "dir(seq_record)nt", dir(seq_record) break seqfile. close() BCHB 524 - Edwards 12

Bio. Python: Bio. Seq. Record import Bio. Seq. IO import sys import gzip # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = gzip. open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "uniprot-xml"): # Print out the various elements of the Seq. Record print "n------NEW SEQRECORD------n" print "seq_record. annotationsnt", seq_record. annotations print "seq_record. featuresnt", seq_record. features print "seq_record. dbxrefsnt", seq_record. dbxrefs print "seq_record. format('fasta')n", seq_record. format('fasta') break seqfile. close() BCHB 524 - Edwards 13

Bio. Python: Random access l Sometimes you want to access the sequence records "randomly"… l l Why not make a dictionary, with accessions as keys, and Seq. Record values? l l …to pick out the ones you want (by accession) Use Seq. IO. to_dict(…) What if you don't want to hold it all in memory l Use Seq. IO. index(…) BCHB 524 - Edwards 14

Bio. Python: Bio. Seq. IO. to_dict(…) import Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the sequence database seqfile = open(seqfilename) # Use to_dict to make a dictionary of sequence records sprot_dict = Bio. Seq. IO. to_dict(Bio. Seq. IO. parse(seqfile, "uniprot-xml")) # Close the file seqfile. close() # Access and print a sequence record print sprot_dict['Q 6 GZV 8'] BCHB 524 - Edwards 15

Bio. Python: Bio. Seq. IO. index(…) import Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Use index to make an out of core dict of seq records sprot_index = Bio. Seq. IO. index(seqfilename, "uniprot-xml") # Access and print a sequence record print sprot_index['Q 6 GZV 8'] BCHB 524 - Edwards 16

Exercises 0. Read through and try the examples from Chapters 2 -5 of Bio. Python's Tutorial. 1 a. Download human proteins from Ref. Seq and compute amino-acid frequencies for the (Ref. Seq) human proteome. l Which amino-acid occurs the most? The least? l Hint: access Ref. Seq human proteins in human. protein. fasta. gz from the course data-folder. 1 b. Download human proteins from Swiss. Prot and compute amino-acid frequencies for the Swiss. Prot human proteome. l Which amino-acid occurs the most? The least? l Hint: access Swiss. Prot human proteins from http: //www. uniprot. org/downloads -> “Taxonomic divisions” 1 c. How similar are the human amino-acid frequencies of in Ref. Seq and Swiss. Prot? BCHB 524 - Edwards 17

Homework 7 l Due Monday, October 22 rd. l Exercise 1 from Lecture 12 BCHB 524 - Edwards 18