Sequence File Parsing using Biopython BINF 524 Lecture

Sequence File Parsing using Biopython BINF 524 Lecture 12 BINF 524 - Edwards

Review l Modules in the standard-python library: l l l sys, os. path – access files, program environment zipfile, gzip – access compressed files directly urllib – access web-resources (URLs) as files csv – read delimited line based records from files Plus lots, lots more. BINF 524 - Edwards 2

Bio. Python l Additional modules that make many common bioinformatics tasks easier l l Have to install separately l l l File parsing (many formats) & web-retrieval Formal biological alphabets, codon tables, etc Lots of other stuff… Not part of standard python, or Anaconda python To install: use pip, or built in package manager biopython. org BINF 524 - Edwards 3

Biopython: Fasta format l l Most common biological sequence data format Header/Description line l l Multi-accession sometimes represented l l >accession description accession 1|accession 2|accession 3 lots of variations, no standardization No prescribed format for the description Other lines l l sequence, one chunk per line. Usually all lines, except the last, are the same length. BINF 524 - Edwards 4

Bio. Python: Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print("Please provide a sequence file", file=sys. stderr) sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "fasta"): # Print out the various elements of the Seq. Record print("n------NEW SEQRECORD------n") print("seq_record. id: nt", seq_record. id) print("seq_record. description: nt", seq_record. description) print("seq_record. seq: nt", seq_record. seq) seqfile. close() BINF 524 - Edwards 5

Biopython: Other formats l Genbank format l l Uni. Prot/Swiss. Prot flat-file format l l From Uni. Prot for Swiss. Prot and Tr. EMBL Uni. Prot-XML format: l l From NCBI, also format for Ref. Seq sequence From Uni. Prot for Swiss. Prot and Tr. EMBL Use the gzip module to handle compressed sequence databases BINF 524 - Edwards 6

Bio. Python: Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print("Please provide a sequence file", file=sys. stderr) sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "genbank"): # Print out the various elements of the Seq. Record print("n------NEW SEQRECORD------n") print("seq_record. id: nt", seq_record. id) print("seq_record. description: nt", seq_record. description) print("seq_record. seq: nt", seq_record. seq) seqfile. close() BINF 524 - Edwards 7

Bio. Python: Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print("Please provide a sequence file", file=sys. stderr) sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "swiss"): # Print out the various elements of the Seq. Record print("n------NEW SEQRECORD------n") print("seq_record. id: nt", seq_record. id) print("seq_record. description: nt", seq_record. description) print("seq_record. seq: nt", seq_record. seq) seqfile. close() BINF 524 - Edwards 8

Bio. Python: Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print("Please provide a sequence file", file=sys. stderr) sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "uniprot-xml"): # Print out the various elements of the Seq. Record print("n------NEW SEQRECORD------n") print("seq_record. id: nt", seq_record. id) print("seq_record. description: nt", seq_record. description) print("seq_record. seq: nt", seq_record. seq) seqfile. close() BINF 524 - Edwards 9

Bio. Python: Bio. Seq. IO import sys import gzip # Check the input if len(sys. argv) < 2: print("Please provide a sequence file", file=sys. stderr) sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences # import io # seqfile = io. Text. IOWrapper(gzip. open(seqfilename), encoding="utf-8") seqfile = gzip. open(seqfilename, mode="rt") for seq_record in Bio. Seq. IO. parse(seqfile, "fasta"): # Print out the various elements of the Seq. Record print("n------NEW SEQRECORD------n") print("seq_record. id: nt", seq_record. id) print("seq_record. description: nt", seq_record. description) print("seq_record. seq: nt", seq_record. seq) BINF 524 - Edwards seqfile. close() 10

What about the other "stuff" l Bio. Python makes it easy to get access to non -sequence information stored in "rich" sequence databases l l Annotations Cross-References Sequence Features Literature BINF 524 - Edwards 11

Bio. Python: Bio. Seq. IO import sys import gzip # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = gzip. open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "uniprot-xml"): # What else is available in the Seq. Record? print("n------NEW SEQRECORD------n") print("repr(seq_record)nt", repr(seq_record)) print("dir(seq_record)nt", dir(seq_record)) break seqfile. close() BINF 524 - Edwards 12

Bio. Python: Bio. Seq. Record import Bio. Seq. IO import sys import gzip # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the FASTA file and iterate through its sequences seqfile = gzip. open(seqfilename) for seq_record in Bio. Seq. IO. parse(seqfile, "uniprot-xml"): # Print out the various elements of the Seq. Record print("n------NEW SEQRECORD------n") print("seq_record. annotationsnt", seq_record. annotations) print("seq_record. featuresnt", seq_record. features) print("seq_record. dbxrefsnt", seq_record. dbxrefs) print("seq_record. format('fasta')n"+seq_record. format('fasta')) break seqfile. close() BINF 524 - Edwards 13

Bio. Python: Random access l Sometimes you want to access the sequence records "randomly"… l l Why not make a dictionary, with accessions as keys, and Seq. Record values? l l …to pick out the ones you want (by accession) Use Seq. IO. to_dict(…) What if you don't want to hold it all in memory l Use Seq. IO. index(…) BINF 524 - Edwards 14

Bio. Python: Bio. Seq. IO. to_dict(…) import Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Open the sequence database seqfile = open(seqfilename) # Use to_dict to make a dictionary of sequence records sprot_dict = Bio. Seq. IO. to_dict(Bio. Seq. IO. parse(seqfile, "uniprot-xml")) # Close the file seqfile. close() # Access and print a sequence record print(sprot_dict['Q 6 GZV 8']) BINF 524 - Edwards 15

Bio. Python: Bio. Seq. IO. index(…) import Bio. Seq. IO import sys # Check the input if len(sys. argv) < 2: print >>sys. stderr, "Please provide a sequence file" sys. exit(1) # Get the sequence filename seqfilename = sys. argv[1] # Use index to make an out of core dict of seq records sprot_index = Bio. Seq. IO. index(seqfilename, "uniprot-xml") # Access and print a sequence record print(sprot_index['Q 6 GZV 8']) BINF 524 - Edwards 16

Exercises 0. Read through and try the examples from Chapters 2 -5 of Bio. Python's Tutorial. 1 a. Download human proteins from Ref. Seq and compute amino-acid frequencies for the (Ref. Seq) human proteome. l Which amino-acid occurs the most? The least? l Hint: access Ref. Seq human proteins in human. protein. fasta. gz from the course data-folder. 1 b. Download human proteins from Swiss. Prot and compute amino-acid frequencies for the Swiss. Prot human proteome. l Which amino-acid occurs the most? The least? l Hint: access Swiss. Prot human proteins from http: //www. uniprot. org/downloads -> “Taxonomic divisions” 1 c. How similar are the human amino-acid frequencies of in Ref. Seq and Swiss. Prot? BINF 524 - Edwards 17

Homework 6 l Due Monday, October 5 th. l Exercise 1 from Lecture 11 l Exercise 1 from Lecture 12 BINF 524 - Edwards 18