Computational Biology Part 6 Sequence File Formats and

Sequence file formats
Two characteristics of file formats
  text or binary
  minimal or annotated
Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
Binary files are usually readable only by the program that created them (e.g., MacVector)
Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)

Sequence file formats
ASCII (text)
  Minimal
    Line, Plain Text
    Staden
    FASTA
    Bionet (allows comments)
  Annotated
    GenBank
    GCG
Binary (usually annotated)
  MacVector

Examples of ASCII sequence file formats
Line (MacVector), Plain Text (AssemblyLIGN)
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC

Examples of ASCII sequence file formats
FASTA (Entrez)
>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
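Because FASTA is only a ">" header line followed by wrapped sequence lines, reading it takes very little code. The following is a minimal Python sketch, not the parser used by any particular program; the function name parse_fasta is hypothetical, and the example record is truncated to the first 60 bases of the rat obese sequence.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A ">" line starts a new record; the rest of the line is the header.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap; join them and drop any internal whitespace.
            records[header] += "".join(line.split()).upper()
    return records


# Truncated example record (first 60 bases only).
example = """>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAG
ACCCCTGTGCCGGTTCCTGT
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name.split()[0], len(seq))
```

The header is kept whole as the dictionary key; splitting it on "|" to recover the gi number or accession is left to the caller.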

Examples of ASCII sequence file formats
GenBank (Entrez, MacVector)
LOCUS       RATOBESE      539 bp ss-mRNA            ROD       23-SEP-1995
DEFINITION  Rat mRNA for obese.
ACCESSION   D49653
KEYWORDS    .
SOURCE      Rattus norvegicus (strain OLETF, LETO and Zucker,) differentiated
            adipose cDNA to mRNA.
  ORGANISM  Rattus norvegicus
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia;
            Sciurognathi; Myomorpha; Muridae; Murinae; Rattus.
REFERENCE   1  (bases 1 to 539)
  AUTHORS   Murakami, T. and Shima, K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats
  JOURNAL   Biochem. Biophys. Res. Commun. 209, 944-952 (1995)
  STANDARD  full automatic
COMMENT     Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495.
[continued]

Examples of ASCII sequence file formats
GenBank [continued]
            NCBI gi: 995614
FEATURES             Location/Qualifiers
     source          1..539
                     /organism="Rattus norvegicus"
                     /strain="OLETF, LETO and Zucker"
                     /dev_stage="differentiated"
                     /sequenced_mol="cDNA to mRNA"
                     /tissue_type="adipose"
     CDS             30..533
                     /partial
                     /note="NCBI gi: 995615"
                     /codon_start=1
                     /product="obese"
                     /translation="MCWRPLCRFLWLWSYLSYVQAVPIHKVQDDTKTLIKTIVTRIND
                     ISHTQSVSARQRVTGLDFIPGLHPILSLSKMDQTLAVYQQILTSLPSQNVLQIAHDLE
                     NLRDLLHLLAFSKSCSLPQTRGLQKPESLDGVLEASLYSTEVVALSRLQGSLQDILQQ
                     LDLSPEC"
BASE COUNT      121 a    167 c    133 g    118 t
ORIGIN
        1 ccaagaagaa gaagacccca gcgaggaaaa tgtgctggag acccctgtgc cggttcctgt
       61 ggctttggtc ctatctgtcc tatgttcaag ctgtgcctat ccacaaagtc caggatgaca
      121 ccaaaaccct catcaagacc attgtcacca ggatcaatga catttcacac acgcagtcgg
      181 tatccgccag gcagagggtc accggtttgg acttcattcc cgggcttcac cccattctga
      241 gtttgtccaa gatggaccag accctggcag tctatcaaca gatcctcacc agcttgcctt
      301 cccaaaacgt gctgcagata gctcatgacc tggagaacct gcgagacctc ctccatctgc
      361 tggccttctc caagagctgc tccctgccgc agacccgtgg cctgcagaag ccagagagcc
      421 tggatggcgt cctggaagcc tcgctctact ccacagaggt ggtggctctg agcaggctgc
      481 agggctctct gcaggacatt cttcaacagt tggaccttag ccctgaatgc tgaggtttc
//
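In a GenBank flat file the sequence itself sits between the ORIGIN line and the "//" terminator, as numbered rows of ten-base blocks. A minimal Python sketch of pulling the bare sequence back out is shown here; the function name genbank_sequence is hypothetical, the annotation above ORIGIN is simply skipped, and the example is truncated to the first 80 bases.

```python
from typing import Iterable, List


def genbank_sequence(lines: Iterable[str]) -> str:
    """Concatenate the bases between the ORIGIN line and the // terminator."""
    in_origin = False
    chunks: List[str] = []
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break  # end of record
        if in_origin:
            # Each row is a position number followed by blocks of bases.
            fields = line.split()
            chunks.extend(fields[1:])
    return "".join(chunks).upper()


# Truncated excerpt of the record (first 80 bases only).
flat_file = """ORIGIN
        1 ccaagaagaa gaagacccca gcgaggaaaa tgtgctggag acccctgtgc cggttcctgt
       61 ggctttggtc ctatctgtcc
//
""".splitlines()

sequence = genbank_sequence(flat_file)
print(len(sequence), sequence[:10])
```

The result is the same string a minimal format such as Line or FASTA would hold, which is why converting from an annotated to a minimal format loses only the annotation.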

Examples of ASCII sequence file formats
GCG (MacVector, GCG)
LOCUS       RATOBESE.G    539 BP    SS-RNA    ENTERED  09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION
KEYWORDS
SOURCE      Rattus norvegicus; Norway rat
ORGANISM    Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia;
            Sciurognathi; Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
AUTHORS     Murakami, T. & Shima, K.
TITLE       Cloning of rat obese cDNA and its expression in obese rats.
JOURNAL     Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
COMMENT     Database Reference: DDBJ RATOBESE Accession: D49653
            ------
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495
[continued]

Examples of ASCII sequence file formats
GCG [continued]
FEATURES    From    To/Span    Description
pept        30      533        obese
? ?         1       539        source; /organism=Rattus norvegicus;
                               /strain=OLETF, LETO and Zucker;
                               /dev_stage=differentiated;
                               /sequenced_mol=cDNA to mRNA;
                               /tissue_type=adipose
BASE COUNT  121 A    167 C    133 G    118 T    0 OTHER
ORIGIN      ?
RATOBESE.G  Length: 539  Jan 30, 1996 - 05:32 PM  Check: 5797 ..
        1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
       61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
      121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
      181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
      241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
      301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
      361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
      421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
      481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//

Sequence file format tips
When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
When retrieving from a database or exchanging between programs, use an annotated text format such as GCG or GenBank
When using a sequence again with the same program, use that program’s annotated binary format (or annotated text if binary is not available)

Sequence assembly
Goal: Assemble pieces of sequence into a single, continuous sequence
An early commercial system for sequence assembly was the GCG GelOverlap/GelAssemble suite (VMS, Unix)
We will use AssemblyLIGN (Macintosh), a companion to MacVector

Sequence assembly: Terms
project
  collection of fragments, templates and contigs
fragments
  pieces of sequence entered by user or read from files
contigs
  lists of aligned fragments generated (normally) by program

Sequence assembly: Terms
templates
  any sequence to be searched for
  can be entered by user
  can be read from system files
  most common example is the sequence of a vector
  sequences in templates are NOT included in assembled sequences unless they are ALSO present in a fragment (and have not been removed)

Sequence assembly: File organization
AssemblyLIGN keeps all information (including sequences) in a single project document
GCG keeps all information in a directory (and its subdirectories), with each fragment in a separate file

Sequence assembly: Steps
Data entry/import (fragments, templates)
Removal of unwanted sequence
Automated creation of contigs
Manual editing/confirmation
Export

Automated creation of contigs
Steps
1. Finding pairwise overlaps
2. Resolving overlaps
3. Improving alignment
4. Final assembly and consensus generation

1. Finding pairwise overlaps
Compare each fragment (and its complement) with each other fragment
Generate list of regions of similarity that meet the criteria below
Parameters
  minimum overlap length (default 20)
  stringency (% of bases that must match, default 70)
  minimum repeat length (default 30)
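The pairwise search can be illustrated with a small Python sketch. This is a simplification under stated assumptions, not the AssemblyLIGN or GCG algorithm: it only considers ungapped overlaps between the end of one fragment and the start of another, it reads "complement" as the reverse complement, and it ignores the minimum repeat length parameter. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def suffix_prefix_overlap(a: str, b: str, min_overlap: int = 20,
                          stringency: float = 70.0) -> Optional[Tuple[int, float]]:
    """Longest ungapped overlap between the end of a and the start of b.

    Returns (overlap_length, percent_identity), or None if nothing qualifies.
    """
    for length in range(min(len(a), len(b)), max(min_overlap, 1) - 1, -1):
        tail, head = a[-length:], b[:length]
        matches = sum(x == y for x, y in zip(tail, head))
        identity = 100.0 * matches / length
        if identity >= stringency:
            return length, identity
    return None


def find_pairwise_overlaps(fragments: List[str], min_overlap: int = 20,
                           stringency: float = 70.0) -> List[tuple]:
    """List (i, j, strand, length, identity) for every qualifying ordered pair.

    strand is "+" if fragment j was used as given, "-" if reverse-complemented.
    """
    hits = []
    for i, a in enumerate(fragments):
        for j, b in enumerate(fragments):
            if i == j:
                continue
            for strand, candidate in (("+", b), ("-", reverse_complement(b))):
                hit = suffix_prefix_overlap(a, candidate, min_overlap, stringency)
                if hit:
                    hits.append((i, j, strand, hit[0], hit[1]))
    return hits


# Bases 1-40 and 21-60 of the rat obese sequence; the second fragment is
# supplied on the opposite strand, as a sequencing read might be.
frag_a = "CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAG"
frag_b = "GCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGT"
hits = find_pairwise_overlaps([frag_a, reverse_complement(frag_b)])
for hit in hits:
    print(hit)
```

Lowering min_overlap or stringency makes more pairs qualify, at the cost of more chance matches; that trade-off is what the default values are balancing.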

1. Finding pairwise overlaps
Each fragment may appear in more than one overlap
[Figure: pairwise overlaps among fragments 1 to 9]

2. Resolving overlaps
Build larger pieces by combining overlaps
[Figure, seven animation frames: the pairwise overlaps among fragments 1 to 9 are combined step by step into larger pieces]
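One simple way to picture "combining overlaps" is a greedy merge: repeatedly join the two pieces with the longest overlap until no qualifying overlap is left. The Python sketch below is an illustration under that assumption, not the strategy any particular assembler is known to use; it requires exact matches and a single strand to stay short, and the names exact_overlap and greedy_merge are hypothetical.

```python
from typing import List


def exact_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest exact match between the end of a and the start of b."""
    for length in range(min(len(a), len(b)), max(min_overlap, 1) - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_merge(fragments: List[str], min_overlap: int = 20) -> List[str]:
    """Merge the best-overlapping pair repeatedly; return the remaining pieces."""
    pieces = list(fragments)
    while True:
        best = (0, -1, -1)  # (overlap length, index of left piece, index of right piece)
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    length = exact_overlap(a, b, min_overlap)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            return pieces  # nothing left to combine
        merged = pieces[i] + pieces[j][length:]
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)] + [merged]


# First 80 bases of the rat obese sequence, cut into three overlapping fragments.
reference = (
    "CCAAGAAGAA" "GAAGACCCCA" "GCGAGGAAAA" "TGTGCTGGAG"
    "ACCCCTGTGC" "CGGTTCCTGT" "GGCTTTGGTC" "CTATCTGTCC"
)
f1, f2, f3 = reference[:40], reference[20:60], reference[40:]
contigs = greedy_merge([f3, f1, f2], min_overlap=20)
print(len(contigs), contigs[0] == reference)
```

Fragments that overlap nothing else simply remain in the returned list as separate pieces, which mirrors a fragment being left out of a contig.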

3. Improving alignment
Introduce gaps in sequences if they will improve overlaps
Alignment parameters
  gap creation penalty (default 2.0)
  gap extension penalty (default 0.1)
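The two gap parameters can be read as an affine gap cost. The exact formula used by the program is not stated here, so the Python sketch below assumes one common convention: a gap pays the creation penalty once, plus the extension penalty for every gapped position. The function name gap_penalty is hypothetical.

```python
def gap_penalty(gap_length: int, creation: float = 2.0, extension: float = 0.1) -> float:
    """Affine gap cost (assumed form): creation + extension * gap_length."""
    if gap_length <= 0:
        return 0.0  # no gap, no cost
    return creation + extension * gap_length


for n in (1, 5, 10):
    print(n, round(gap_penalty(n), 2))

# One 10-base gap versus ten separate 1-base gaps.
single_long_gap = gap_penalty(10)
many_short_gaps = 10 * gap_penalty(1)
print(single_long_gap < many_short_gaps)
```

With a creation penalty much larger than the extension penalty, the alignment prefers a few long gaps over many scattered short ones.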

4. Final assembly and consensus generation
Mark fragments that are now part of a contig (they no longer appear by themselves)
Form consensus for each contig by “reading” along aligned sequences and converting to IUB codes by consensus rules
Consensus parameter
  base designation threshold (% of all bases at a given position that must be the same for that base to be assigned to the consensus; otherwise, a less specific IUB code is used; default 80%)
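The base designation threshold can be sketched in Python as follows. The rule for picking the "less specific" code is an assumption made for illustration: when no base reaches the threshold, the sketch uses the IUB code that covers every base observed in the column. Gaps ("-") are ignored, the fragments are assumed to be already aligned and padded to equal length, and the function names are hypothetical.

```python
from collections import Counter
from typing import List

# IUB ambiguity codes keyed by the set of bases each one stands for.
IUB_CODES = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus_base(column: str, threshold: float = 80.0) -> str:
    """Consensus symbol for one alignment column."""
    bases = [b for b in column.upper() if b in "ACGT"]  # ignore gaps
    if not bases:
        return "-"
    counts = Counter(bases)
    base, count = counts.most_common(1)[0]
    if 100.0 * count / len(bases) >= threshold:
        return base  # one base meets the threshold
    # Assumed fallback: the code covering every base seen in this column.
    return IUB_CODES[frozenset(counts)]


def consensus(aligned: List[str], threshold: float = 80.0) -> str:
    """Read along equal-length aligned sequences, column by column."""
    return "".join(consensus_base("".join(col), threshold) for col in zip(*aligned))


aligned = ["ACGT", "ACGT", "ACAT", "ACGT"]
print(consensus(aligned))                  # default threshold of 80%
print(consensus(aligned, threshold=70.0))  # relaxed threshold
```

In the third column G accounts for 3 of 4 bases (75%), so it is called only when the threshold is lowered to 75% or below; at the default the column becomes R (A or G).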

Manual consensus editing
Crucial to verify alignment and resolve ambiguities (e.g., sequencing errors)
Program keeps an “edit history” that tracks all changes made to the original sequences: essential to be able to “retrace your steps” from original sequencing gels (e.g., in case of conflicts with sequences in database)

AssemblyLIGN Tutorial
Open “demo π” project

AssemblyLIGN Tutorial
Goal: Eliminate vector sequence
Double-click vector
Select all fragments
Drop on vector

AssemblyLIGN Tutorial
“vector Alignments” window shows that frag 8 contains vector sequence
Click on ‘shadow’ to edit frag 8 and display highlighted vector sequence

AssemblyLIGN Tutorial
Highlighted sequence doesn’t look like sequence in “vector” window

AssemblyLIGN Tutorial
Click on “vector” window
Choose Select All (Edit Menu)
Choose Reverse & Complement (Edit Menu)
Now highlighted sequence in frag 8 matches that in “vector” window

AssemblyLIGN Tutorial
Click on “frag 8” window
Delete highlighted sequence
Then close “frag 8” window
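The vector-removal steps can be mimicked in a few lines of Python. This is a sketch under simplifying assumptions, not what AssemblyLIGN does internally: it looks for an exact copy of the vector segment on either strand, whereas the program aligns the template against the fragments. The vector string, the insert and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def strip_vector(fragment: str, vector: str) -> str:
    """Remove the first exact occurrence of vector (either strand) from fragment."""
    for probe in (vector, reverse_complement(vector)):
        start = fragment.find(probe)
        if start != -1:
            return fragment[:start] + fragment[start + len(probe):]
    return fragment  # no vector found; leave the fragment untouched


# Hypothetical vector segment, present in the fragment on the opposite strand.
vector = "GGATCCTCTAGA"
insert = "CCAAGAAGAAGAAGACCCCAGCG"
fragment = reverse_complement(vector) + insert

print(fragment.find(vector))             # -1: not visible on the given strand
print(strip_vector(fragment, vector) == insert)
```

This is why the highlighted region in frag 8 only matches the “vector” window after Reverse & Complement: the same bases are present, but read from the other strand.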

AssemblyLIGN Tutorial
Choose Select All (Edit Menu)
Choose Assemble (Project Menu)

AssemblyLIGN Tutorial
All but one fragment (frag 14) combined into Untitled Contig 1

AssemblyLIGN Tutorial
Goal: Try relaxing assembly parameters to merge frag 14 into the contig
Choose Assembly Options (Project Menu)
Reduce minimum overlap length to 5

AssemblyLIGN Tutorial
Now all fragments are merged
Double-click Untitled Contig 2 to see alignment and consensus

AssemblyLIGN Tutorial
Map shows gross alignments of fragments
Click on Magnifying glass ‘A’ and select region of map to view

AssemblyLIGN Tutorial
Positions that do not match at/above the Base Designation Threshold are highlighted in the consensus or the original sequences

AssemblyLIGN Tutorial
The Base Designation Threshold can be decreased to reduce ‘uncalled’ bases

Reading for Next Class
Baxevanis & Ouellette, Chapter 7
Sequence Analysis Primer, pp. 90-124
“Similarity versus Homology” and “Dot Matrix Methods” (on web page)
(03-510) Durbin et al., pp. 12-17

Summary, Part 6
A variety of sequence file formats are currently in use.
Files can be either text or binary, and can consist only of sequence or also include annotation information.
The choice of file format is dictated by the requirements of the desired analysis and by the subset of formats that both the “writing” and the “reading” programs support.

Summary, Part 6
Sequence assembly requires the ability to compare sequences to find regions of high similarity.
Pieces of sequence are assembled by “connecting” them via regions of overlap.
A consensus sequence can be generated from the connected pieces (using user-specified rules to resolve ambiguity).

Sequence comparisons using BLAST server Web Page
Main BLAST web page URL: http://www.ncbi.nlm.nih.gov/BLAST/
Links to Basic and Advanced Search Pages
Two main BLAST programs
  blastn - compares user nucleotide sequence to nucleotide sequences in database
  blastp - compares user peptide sequence to peptide sequences in database