Bioinformatics Course, Day 4. Perl Extensions: BioPerl and Ensembl API

BioPerl
● Collection of Perl scripts and modules
● Facilitates development of Perl scripts for bioinformatics applications – not ready-to-use programs!
● Object-oriented Perl code
● Written by biologists, bioinformaticians, and computer scientists
● Different levels of complexity

What are modules?
● Perl extensions
● .pm file ending
● loaded via a 'use' or 'require' statement
● can be nested:
  – Bio::DB::GenBank
  – /sw/lib/perl5/5.8.6/Bio/DB/GenBank.pm
● contain code and documentation
● perldoc Bio::DB::GenBank

Applications for BioPerl
● Sequence retrieval and manipulation
● Data re-formatting
● Run and parse output from bioinformatics programs:
  – Blast
  – ClustalW
  – Tcoffee
  – GenScan
  – HMMER
  – ...

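What is an Object?
● Example: Bio::Seq object
    use Bio::Perl;                                       # load module: new functions available
    $seqobj = get_sequence('swissprot', 'TLR4_HUMAN');   # generates object
● The object provides methods:
    $sequence_part = $seqobj->subseq(1, 100);
    $translation   = $seqobj->translate();
● Combinations are possible:
    $trans_trunc_rev = $seqobj->trunc(100, 200)->revcom->translate();
● Note: revcom and translate are only meaningful for nucleotide sequences. TLR4_HUMAN is a protein entry, so read these calls as illustrations of the syntax.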
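
The method calls above can be tried without a network connection by building the object locally. This is a minimal sketch, assuming BioPerl is installed. The DNA string and the id 'demo_dna' are made up for illustration and are not part of the course material.

```perl
use strict;
use warnings;
use Bio::Seq;

# Hypothetical demo sequence (an assumption, not taken from the slides).
my $seqobj = Bio::Seq->new(
    -seq      => 'ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG',
    -id       => 'demo_dna',
    -alphabet => 'dna',
);

# subseq returns a plain string; translate returns a new sequence object.
my $sequence_part = $seqobj->subseq(1, 9);
my $translation   = $seqobj->translate;

# Each of trunc, revcom and translate returns an object, so calls can be chained.
my $chained = $seqobj->trunc(1, 9)->revcom->translate;

print "part:        $sequence_part\n";
print "translation: ", $translation->seq, "\n";
print "chained:     ", $chained->seq, "\n";
```

Because subseq returns a string while trunc returns an object, only trunc can start a chain of further method calls.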

BioPerl's Objects
● Sequence objects (representation of various types of sequences):
  – Seq
  – PrimarySeq
  – LocatableSeq
  – RelSegment
  – LiveSeq
  – LargeSeq
  – RichSeq
  – SeqWithQuality

BioPerl's Objects
● Location objects:
  – Where is a feature on a sequence?
● Alignment objects
● Blast reports
● ...

Objects vs Functions
● Different access method
  – Example: reading an EMBL file
    Function:
        $seq = read_sequence($file, 'embl');
    Object:
        $seqio  = Bio::SeqIO->new(-format => 'embl', -file => $file);
        $seqobj = $seqio->next_seq();
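
The object-based route pays off when a file holds more than one record, because next_seq can be called in a loop. The following is a sketch under stated assumptions: BioPerl is installed, and a small FASTA file with two invented records (demo1, demo2) is written to a temporary file so the example is self-contained. The course example uses an EMBL file instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Bio::SeqIO;

# Write a tiny two-record FASTA file (invented demo data).
my ($fh, $file) = tempfile(SUFFIX => '.fa');
print {$fh} ">demo1 first record\nATGGCCATT\n>demo2 second record\nGGGTTTAAA\n";
close $fh;

# Open a sequence stream and iterate until next_seq returns undef.
my $seqio = Bio::SeqIO->new(-file => $file, -format => 'fasta');
my @ids;
while (my $seqobj = $seqio->next_seq) {
    push @ids, $seqobj->display_id;
    printf "%s\t%d\n", $seqobj->display_id, $seqobj->length;
}

unlink $file;
```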

Bio::Perl example
● Load the module
    use Bio::Perl;
● Get the sequence
    $seqobj = get_sequence('swissprot', 'TLR4_HUMAN');
● Now you have a Bio::Seq object!

Bio::Perl is not BioPerl!
● BioPerl:
  – whole collection of bio-related Perl extensions
● Bio::Perl:
  – just one of many modules
  – “Easy first time access to BioPerl via functions”
  – “Functional access to BioPerl for people who don't know objects”
  – limited functionality
  – nice starter

DB entry to Objects
● Conversion of parts of the data into objects:

    ID   TLR4_HUMAN     STANDARD;      PRT;   839 AA.
    AC   O00206; Q9UK78; Q9UM57;
         => Bio::PrimarySeq

    OS   Homo sapiens (Human).
    OC   Eukaryota; Metazoa; Chordata; Craniata; ...
         => Bio::Species

    FT   REPEAT       52     76       LRR 1.
    FT   REPEAT       77    100       LRR 2.
         => Bio::SeqFeatureI

Bio::Seq – Formats

    AB1, ABI    ABI tracefile format
    ALF         ALF tracefile format
    CTF         CTF tracefile format
    EMBL        EMBL format
    EXP         Staden tagged experiment tracefile format
    Fasta       FASTA format
    Fastq       Fastq format
    GCG         GCG format
    GenBank     GenBank format
    PIR         Protein Information Resource format
    PLN         Staden plain tracefile format
    SCF         SCF tracefile format
    ZTR         ZTR tracefile format
    ace         ACeDB sequence format
    game        GAME XML format
    locuslink   LocusLink annotation (LL_tmpl format only)
    phd         phred output
    qual        Quality values (get a sequence of quality scores)
    raw         Raw format (one sequence per line, no ID)
    swiss       Swissprot format

Access through Bio::Seq object
perldoc Bio::Seq (methods returning strings):

    $seqobj->seq();               # string of sequence
    $seqobj->subseq(5, 10);       # part of the sequence as a string
    $seqobj->accession_number();  # when there, the accession number
    $seqobj->alphabet();          # one of 'dna', 'rna', or 'protein'
    $seqobj->seq_version();       # when there, the version
    $seqobj->keywords();          # when there, the Keywords line
    $seqobj->length();            # length
    $seqobj->desc();              # description
    $seqobj->display_id();        # the human readable id of the sequence

Derived Bio::Seq objects
perldoc Bio::Seq (methods returning new sequence objects):

    $seqobj->trunc(5, 10)   # truncation from 5 to 10 as new object
    $seqobj->revcom         # reverse complements sequence
    $seqobj->translate      # translation of the sequence

● Example:
    $seqobj = read_sequence($file);
    # only nucleotide sequences can be reverse complemented
    if ($seqobj->alphabet eq 'dna' or $seqobj->alphabet eq 'rna') {
        $revcom = $seqobj->revcom;
        # first argument is the output file, given as ">filename"
        write_sequence('', 'fasta', $revcom);
    }

Other Bio::Perl features
● Remote Blast:
    $seqobj = read_sequence($file);
    $blast_report = blast_sequence($seqobj);
    write_blast(">blast.out", $blast_report);
● Also possible to run stand-alone Blast

Other Bio::Perl features
● Generate alignments:
    # load module
    use Bio::Tools::Run::Alignment::Clustalw;
    # define parameters
    @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
    # build a clustalw alignment factory
    $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
    # pass the factory the sequences to be aligned (here: a FASTA file)
    $aln = $factory->align('TLRs.fa');
    # $aln is a SimpleAlign object

Other Bio::Perl features
● Work with alignments:
    $aln->length
    $aln->no_residues
    $aln->is_flush
    $aln->no_sequences
    $aln->score
    $aln->percentage_identity
    $aln->consensus_string(75)

    ???L?LS?N?I?????L?L??N?????F???L?L??N????L?L??????????????F??L?L????????????????L????????????????????

● Mostly leucines are conserved
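
Some of these alignment methods can be explored without running ClustalW by assembling a small alignment by hand. This is only a sketch, assuming BioPerl is installed. The three short sequences and their ids are invented, and the consensus threshold of 60 is an arbitrary choice for this toy data rather than the 75 used on the slide.

```perl
use strict;
use warnings;
use Bio::LocatableSeq;
use Bio::SimpleAlign;

# Invented, already-aligned demo sequences of equal length.
my %demo = (
    seq_a => 'ACGTACGT',
    seq_b => 'ACGTACGA',
    seq_c => 'ACGTTCGA',
);

my $aln = Bio::SimpleAlign->new();
for my $id (sort keys %demo) {
    $aln->add_seq(Bio::LocatableSeq->new(
        -id    => $id,
        -seq   => $demo{$id},
        -start => 1,
        -end   => length($demo{$id}),
    ));
}

printf "columns:   %d\n",     $aln->length;
printf "flush:     %s\n",     $aln->is_flush ? 'yes' : 'no';
printf "identity:  %.1f%%\n", $aln->percentage_identity;
# Columns whose most common residue falls below the threshold are shown as '?'.
printf "consensus: %s\n",     $aln->consensus_string(60);
```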

Other Features
● sequence statistics (chemical description, residue count, word frequency)
● finding restriction enzyme sites
● finding amino acid cleavage sites
● show and add sequence annotation
● gene detection
● manipulate phylogenetic trees
● statistics for population genetics

BioPerl Documentation

    NAME
        Bio::Perl - Functional access to BioPerl for people who don't know objects

    SYNOPSIS
        use Bio::Perl;

        # will guess file format from extension
        $seq_object = read_sequence($filename);

        # forces genbank format
        $seq_object = read_sequence($filename, 'genbank');

        # reads an array of sequences
        @seq_object_array = read_all_sequences($filename, 'fasta');

        # sequences are Bio::Seq objects, so the following methods work
        # for more info see Bio::Seq, or do 'perldoc Bio/Seq.pm'
        print "Sequence name is ", $seq_object->display_id, "\n";
        print "Sequence acc is ", $seq_object->accession_number, "\n";

More Info
● perldoc Bio::Perl
● www.bioperl.org
● BioPerl tutorial: less `locate bptutorial` (or perl -d)
● BioPerl course: www.pasteur.fr/recherche/unites/sis/formation/bioperl/

Ensembl
● joint project between EMBL-EBI and the Sanger Centre
● automatic annotation on selected eukaryotic genomes
● free access to all the data and software
● www.ensembl.org

Ensembl Organisms
● Human
● mouse
● fly
● worm
● chicken
● cow
● rat
● dog
● chimp
● zebrafish
● pufferfish
● mosquito
● honey bee
● yeast
Other organisms through local installations (e.g. Arabidopsis)

Ensembl Website

Ensembl Pipeline
● Perl-based scripts
● run programs to detect and annotate genes
● compare genomes
● provide Web graphics
● all data stored in a MySQL database

Ensembl release cycle
● Data sets and software are updated approximately ten times a year
● Versions for web code and databases
● All older versions (back to 2004) accessible
● Registry allows easy switch between versions

Ensembl database
● very rich data set
● complex database layout
● user-friendly Web interface
● DB components:
  – Core
  – Compara
● abstraction layers through API (Perl, Java)
● public MySQL access: user anonymous at ensembldb.ensembl.org
  – e.g. database homo_sapiens_core_38_36

Ensembl core
● Genome sequences and annotation info
  – Gene transcripts
  – Protein models
● Assembly information
● cDNA and protein alignments
● External references
● Markers
● Repeat regions

Other Ensembl data sets
● EST databases
● Variation databases
● Both with an application programming interface (API), e.g. Perl modules

Ensembl compara
● Multi-species database
● Genome-wide species comparison
● Re-calculated for each release
● Pair-wise whole genome alignments
● Synteny sets
● Orthologue predictions
● Protein family clusters

Compara: Genome comparison
● Which methods are available?

    mysql> SELECT * FROM method_link;
    +----------------+----------------------+
    | method_link_id | type                 |
    +----------------+----------------------+
    |              1 | BLASTZ_NET           |
    |              2 | BLASTZ_NET_TIGHT     |
    |              3 | BLASTZ_RECIP_NET     |
    |              4 | PHUSION_BLASTN       |
    |              5 | PHUSION_BLASTN_TIGHT |
    |              6 | TRANSLATED_BLAT      |
    |              7 | BLASTZ_GROUP         |
    |              8 | BLASTZ_GROUP_TIGHT   |
    |            101 | SYNTENY              |
    |            201 | ENSEMBL_ORTHOLOGUES  |
    |            202 | ENSEMBL_PARALOGUES   |
    |            301 | FAMILY               |
    +----------------+----------------------+

Compara: Genome comparison
● Which genomes were compared?

    mysql> SELECT * FROM method_link_species_set
        -> WHERE method_link_species_set_id = 71;
    +----------------------------+----------------+--------------+
    | method_link_species_set_id | method_link_id | genome_db_id |
    +----------------------------+----------------+--------------+
    |                         71 |              1 |            1 |
    |                         71 |              1 |           11 |
    +----------------------------+----------------+--------------+

BLASTZ_NET (method_link_id = 1) has been used for linking all the species of this set: Human (genome_db_id = 1) and Chicken (genome_db_id = 11).

More Info
● www.ensembl.org/info/software
  – Tutorials
  – Database schema outline
● Man pages, e.g. perldoc Bio::EnsEMBL::DBSQL::DBAdaptor