Sequence file formats and format conversion tools
Dinesh Gupta (ICGEB, New Delhi, India), 9/15/2020

Why different formats?
• Organised sequence information
• Database integration

Main file formats used in bioinformatics
• ASN.1
• EMBL/Swiss-Prot
• FASTA
• GCG
• GenBank/GenPept
• Nexus
• PHYLIP
• NBRF and PIR

ASN.1: Abstract Syntax Notation 1, used by NCBI

    Seq-entry ::= set {
      class phy-set ,
      descr {
        pub {
          article {
            title {
              name "Cross-species infection of blood parasites between
                    resident and migratory songbirds in Africa" } ,
            authors {
              names std {
                { name { last "Waldenstroem" , first "Jonas" , initials "J." } } ,
                { name { last "Bensch" , first "Staffan" , initials "S." } } ,
                { name { last "Kiboi" , first "Sam" , initials "S." } } ,
                { name { last "Hasselquist" , first "Dennis" , initials "D." } } ,
                { name {
    ...

EMBL/Swiss-Prot (http://www.ebi.ac.uk/help/formats_frame.html)
• The first line of each sequence entry is the ID line, which contains the entry name, data class, molecule, division and sequence length.
• XX lines contain no data; they are only separators.
• The AC line lists the accession number.
• The DE lines give a description of the sequence.
• The FT lines give precise annotation (features) for the sequence.
• The sequence header line has SQ in the first two spaces.
• The sequence itself begins on the line after the SQ line (the fifth line of a minimal ID/AC/DE/SQ entry).
• The last line of each sequence entry in the file is a terminator line, which has the two characters // in the first two spaces.

    ID   AA03518    standard; DNA; FUN; 237 BP.
    XX
    AC   U03518;
    XX
    DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and
    DE   18S rRNA and 5.8S rRNA genes, partial
    DE   sequence.
    RX   MEDLINE; 94303342.
    RX   PUBMED; 8030378.
    XX
    FT   rRNA            <1..20
    FT                   /product="18S ribosomal RNA"
    FT   misc_RNA        21..205
    FT                   /standard_name="Internal transcribed spacer 1 (ITS1)"
    FT   rRNA            206..>237
    FT                   /product="5.8S ribosomal RNA"
    SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
         aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
         tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
         ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
         tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
    //
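The line-code layout described above lends itself to very simple parsing. The following is a minimal Python sketch, not how any particular tool does it: the function name `parse_embl_entry` is made up for illustration, and it assumes a single entry whose sequence lines contain only letters, blanks and a running count.

```python
from collections import defaultdict


def parse_embl_entry(text):
    """Split one EMBL-style entry into {line code: [lines]} and the sequence.

    Hypothetical helper for illustration; assumes one entry per call.
    """
    fields = defaultdict(list)
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue                      # ignore blank lines
        if line.startswith("//"):         # terminator line ends the entry
            break
        if in_sequence:
            # Keep letters only: drops the blanks and the running count.
            residues.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, rest = line[:2], line[2:].strip()
        if code == "XX":                  # separator line, carries no data
            continue
        fields[code].append(rest)
        if code == "SQ":                  # sequence starts on the next line
            in_sequence = True
    return dict(fields), "".join(residues)


entry = """\
ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1)
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
"""

fields, sequence = parse_embl_entry(entry)
print(fields["AC"], len(sequence))
```

Comparing `len(sequence)` with the length stated on the ID line is a cheap sanity check after any conversion.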

FASTA
• A sequence in FASTA format begins with a single-line description,
• followed by lines of sequence data.
• The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
• It is recommended that all lines of text be shorter than 80 characters in length.

    >U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
    AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
    TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
    CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
    TTCAACAATGGATCTCTTGGTTCCGGC
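Because FASTA has only two kinds of line, reading and writing it takes a few lines of code. The Python sketch below is illustrative only; `read_fasta` and `write_fasta` are hypothetical names, and the default line width of 70 is simply one choice that respects the "shorter than 80 characters" recommendation.

```python
def read_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith(">"):              # description line: new record
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:              # sequence line of current record
            chunks.append(line)
    if header is not None:                    # emit the last record
        yield header, "".join(chunks)


def write_fasta(header, sequence, width=70):
    """Return one FASTA record with sequence lines wrapped at `width`."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


text = ">seq1 first toy record\nACGTAC\nGT\n>seq2\nGGCC\n"
for description, seq in read_fasta(text):
    print(description, len(seq))
```

Since `read_fasta` is a generator, a file with many records can be processed one record at a time.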

GCG
• Exactly one sequence
• Begins with annotation lines
• Start of the sequence is marked by a line ending with ".."
• This line also contains the sequence identifier, the sequence length and a checksum

    ID   AA03518    standard; DNA; FUN; 237 BP.
    XX
    AC   U03518;
    XX
    DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and
    DE   18S rRNA and 5.8S rRNA genes, partial sequence.
    XX
    SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;

    AA03518  Length: 237  Check: 4514  ..

       1  aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
      61  tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
     121  ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
     181  tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
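The ".." divider line is what separates annotation from sequence, so a reader only has to find it. Below is a minimal Python sketch under stated assumptions: `parse_gcg` and the `DIVIDER` pattern are made up for illustration, the pattern only expects the identifier, `Length:` and `Check:` fields named above, and the checksum is read but not recomputed.

```python
import re

# Divider line: identifier, "Length: N", optional other fields, "Check: N", "..".
DIVIDER = re.compile(
    r"^\s*(?P<name>\S+)\s+Length:\s*(?P<length>\d+)\s+.*?"
    r"Check:\s*(?P<check>\d+)\s+\.\.\s*$"
)


def parse_gcg(text):
    """Return annotation, identifier, stated length, checksum and sequence."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = DIVIDER.match(line)
        if match:
            # Everything after the divider is sequence; keep letters only,
            # which drops the position numbers and the blanks.
            body = "".join(lines[i + 1:])
            sequence = "".join(ch for ch in body if ch.isalpha())
            return {
                "annotation": [ln for ln in lines[:i] if ln.strip()],
                "name": match.group("name"),
                "length": int(match.group("length")),
                "check": int(match.group("check")),
                "sequence": sequence,
            }
    raise ValueError("no divider line ending in '..' found")


gcg_text = """\
ID   AA03518    standard; DNA; FUN; 237 BP.
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;

AA03518  Length: 237  Check: 4514  ..

   1  aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
  61  tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
 121  ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
 181  tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
"""

record = parse_gcg(gcg_text)
print(record["name"], record["length"] == len(record["sequence"]))
```

Keeping only letters is a simplification: it would also discard any gap or punctuation characters, so this sketch suits plain nucleotide or protein sequences only.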

GenBank/GenPept
The nucleotide (GenBank) and protein (GenPept) database entries are available from Entrez in this format.
• Can contain several sequences
• Each entry starts with: "LOCUS"
• The sequence starts with: "ORIGIN"
• The sequence ends with: "//"

    LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
    DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and
                18S rRNA and 5.8S rRNA genes, partial sequence.
    ACCESSION   U03518
    BASE COUNT       41 a     77 c     67 g     52 t
    ORIGIN
            1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
           61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
          121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
          181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
    //
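The three markers LOCUS, ORIGIN and // are enough to pull the sequences out of a file with several entries. The Python sketch below illustrates this; `read_genbank` is a hypothetical name, and the function deliberately ignores every annotation line other than LOCUS.

```python
def read_genbank(text):
    """Yield (locus_name, sequence) for each entry in GenBank-style text."""
    name, residues, in_origin = None, [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):          # start of a new entry
            parts = line.split()
            name = parts[1] if len(parts) > 1 else ""
            residues, in_origin = [], False
        elif line.startswith("ORIGIN"):       # sequence starts on the next line
            in_origin = True
        elif line.startswith("//"):           # end of the entry
            if name is not None:
                yield name, "".join(residues)
            name, in_origin = None, False
        elif in_origin:
            # Keep letters only: drops position numbers and blanks.
            residues.append("".join(ch for ch in line if ch.isalpha()))


genbank_text = """\
LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and
            18S rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
"""

for locus, seq in read_genbank(genbank_text):
    # Printing in FASTA style shows the simplest possible format conversion.
    print(">" + locus)
    print(seq.upper())
```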

NBRF (protein or nucleic acid) file format
The first line of each sequence entry begins with a greater-than symbol, >. This is immediately followed by the two-character sequence type specifier. Space four must contain a semicolon. Beginning in space five is the sequence name or identification code for the NBRF database. The code is from four to six letters and numbers.

    Specifier   Sequence type
    P1          protein, complete
    F1          protein, fragment
    DL          DNA, linear
    DC          DNA, circular
    RL          RNA, linear
    RC          RNA, circular
    N1          functional RNA, other than tRNA
    N3          tRNA

The second line of each sequence entry contains two kinds of information. First is the sequence name, which is separated from the organism or organelle name by the three-character sequence blank space, dash, blank space, " - ". There is no special character marking the beginning of this line.
Either the amino acid or nucleic acid sequence begins on line three and can begin in any space, including the first. The sequence is free format and may be interrupted by blanks for ease of reading. Protein sequences may contain special punctuation to indicate various indeterminacies in the sequence.
In the NBRF data files all lines may be up to 500 characters long. However, some PSC programs currently have a limit of 130 characters per line (including blanks), and BitNet will not accept lines of over eighty characters.
The last character in the sequence must be an asterisk, *.
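The rules above (type specifier, semicolon in space four, " - " separator, terminating asterisk) can be checked mechanically. The Python sketch below is illustrative only: `parse_nbrf` is a hypothetical name, it handles a single entry, and the sample entry is a made-up toy record built from the first 30 bases of the U03518 sequence.

```python
SEQUENCE_TYPES = {
    "P1": "protein, complete",
    "F1": "protein, fragment",
    "DL": "DNA, linear",
    "DC": "DNA, circular",
    "RL": "RNA, linear",
    "RC": "RNA, circular",
    "N1": "functional RNA, other than tRNA",
    "N3": "tRNA",
}


def parse_nbrf(text):
    """Parse a single NBRF entry into a dict (one entry per call)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        raise ValueError("an NBRF entry needs at least three lines")
    header, title = lines[0], lines[1]
    # ">" in space one, specifier in spaces two and three, ";" in space four.
    if not header.startswith(">") or len(header) < 5 or header[3] != ";":
        raise ValueError("first line is not an NBRF header")
    specifier = header[1:3]
    name, _, organism = title.partition(" - ")
    body = "".join(lines[2:])
    if "*" not in body:
        raise ValueError("sequence is not terminated by an asterisk")
    # Text up to the asterisk is sequence; blanks are only for readability.
    sequence = "".join(body.split("*")[0].split())
    return {
        "type": SEQUENCE_TYPES.get(specifier, "unknown"),
        "code": header[4:].strip(),
        "name": name.strip(),
        "organism": organism.strip(),
        "sequence": sequence,
    }


# Made-up toy entry, for illustration only.
toy_entry = """\
>DL;U03518
ITS1 region - Aspergillus awamori
aacctgcgga aggatcatta
ccgagtgcgg*
"""

print(parse_nbrf(toy_entry))
```

Only whitespace is removed from the sequence, so any special punctuation in a protein sequence is preserved as written.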

ReadSeq
Don Gilbert, software@bio.indiana.edu, May 2001
Bioinformatics group, Biology Department & Cntr. Genomics & Bioinformatics, Indiana University, Bloomington, Indiana
WWW:
http://bioportal.bic.nus.edu.sg/readseq.html
http://www-bimas.cit.nih.gov/molbio/readseq/
http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html

Seqret
• A program in the EMBOSS suite

The Readseq package can read most common formats; examples of all these formats are included in the readseq directory. The formats include:
• IG/Stanford, used by Intelligenetics and others
• GenBank/GB, GenBank flatfile format
• NBRF format (SAM modifications cause this to break when sequences do not have a terminating asterisk)
• EMBL, EMBL flatfile format
• GCG, single sequence format of GCG software
• DNAStrider, for common Mac program
• Fitch format, limited use
• Pearson/Fasta, a common format used by Fasta programs and others
• Zuker format, limited use. Input only.
• Olsen, format printed by Olsen VMS sequence editor. Input only.
• Phylip 3.2, sequential format for Phylip programs
• Plain/Raw, sequence data only (no name, document, numbering)
• MSF, multi-sequence format used by GCG software
• PAUP's multiple sequence (NEXUS) format
• PIR/CODATA format used by PIR

Installation and running
Download Readseq from the software directory on the course homepage, then:

    mkdir readseq
    cd readseq
    tar xvf readseq.tar
    ./readseq --help
    ./readseq <INPUT1> <INPUT2> -format=genbank -output=output.gb

Exercises:
• Go to the NCBI site and download a FASTA-formatted file (via Entrez: Nucleotide).
• Convert the file into EMBL, Phylip, MSF, etc. formats.
• Download more files: can you work on multiple files?