Sequence Analysis with Artemis and Artemis Comparison Tool

Sequence Analysis with Artemis and Artemis Comparison Tool (ACT)
Caribbean Bioinformatics Workshop, 18th-29th January 2010

atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt tttaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcacttccaattttatattccg cagtacatcgaattctaaaaaataaataatatataatatataataaataatatataatatataataatatataatatataataaataatatataatatataatactttggaaagattattt atatgaatatatacacctttaataggatacacacatcatatttatatacatataaatattccataaatatttatacaacctcaaataaaca tacatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat tatcatttatgtccttatcaaaattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa ttcaatcttaactccttcactcattttatatattccttaatttttactatgtttattaacatataaacaaatatgtcactaa taatatatatatatatattataaatgttttactctattttcacatcttgtcctttttaaaaatcccaattcttattcat taaataataatgtattttttttttttattattatgttactgttttattatatacactcttaatcatatatttatatatatatattattcccttttcatgttttaaacaagaaaactaaaaaaataataaaatatatttttataacatatgt attattaaaatgtataaaaatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaata tatatatatacatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga atttaattaaatataattacatctaatattattatataataagttttccaaatagaatacttatatatatatattcttccataaaaagaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttcctaattttttg taaaatatatttaaaaatgtaatagattatgtattaaataatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc aaaagatacaggtaaaaaaataaagtaaaacaaaacaaaaaaaaaaaaatgacatgttataatataa taaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataacattcata 
tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaaatgttattacaatacagatctgtagttgatttcctttttaatgagaaaaataagaatcttattgtt ttagggtaatgaaatatagatttatatttttatttattatatattattttttaatttttcttttatatattttatttagtgtataaaa tgatatcctttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatgtatatttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa ttcatatgtatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttatcctacctcagagaatct ataataaaagcatataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa

Extracting information & interpreting (text overlaid on the raw DNA sequence)
• What's there?
• Where are the genes?
• Which genes?
• How to find them?
SEQUENCE ANNOTATION
Sequencing is just the beginning of the process

Strategies for sequence annotation
• Predictive methods: interpretation of the DNA sequence into genes according to rules
• Comparative methods: interpretation of the DNA sequence into genes according to similarities with other sequences
• Experimental methods: interpretation of the DNA sequence into genes according to experimental results (e.g. cDNA)

EST Blast Hit

Gene prediction programs: ORFs and CDSs
• ORFs are not equivalent to CDSs.
• Not all open reading frames are coding sequences.
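The distinction between ORFs and CDSs can be seen with a small sketch. It assumes an ORF is simply a stretch of codons uninterrupted by a stop codon, and scans all six reading frames; the names find_orfs and min_codons are hypothetical and not taken from any of the tools in this talk.

```python
# Illustrative sketch only: find_orfs and min_codons are hypothetical names.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """List stop-free stretches of at least min_codons codons.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and given on the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, pos))
                    start = pos + 3
            # Stretch that runs to the end of the sequence without a stop.
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - start) // 3 >= min_codons:
                orfs.append((strand, frame, start, end))
    return orfs
```

Even a 15-base sequence such as ATGAAACCCGGGTAA yields stop-free stretches in several frames on both strands, although at most one of them could be a real coding sequence. This is why an ORF on its own is weak evidence for a gene.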

Gene prediction: Orpheus, PHAT, GeneMark, Glimmer, Genefinder

Gene finding programs
• Gene-finding software packages use Hidden Markov Models (HMMs).
• They predict coding, intergenic and intron sequences.
• They need to be trained on a specific organism.
• They are never perfect!
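A minimal sketch of the idea behind an HMM gene finder, assuming a toy model with only two hidden states (coding and noncoding). All probabilities below are invented for illustration; in a real package they are what training on a specific organism estimates, and the models are far richer. The function name viterbi is an assumption and is not taken from any of the programs named above.

```python
import math

STATES = ("coding", "noncoding")
# All probabilities are invented for illustration, not trained values.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},      # GC-richer
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},   # AT-richer
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy numbers, viterbi("AAAAAAAAGGGGGGGG") labels the A-run as noncoding and the G-run as coding. Changing the emission table changes the prediction, which is the practical meaning of "needs to be trained on a specific organism".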

Gene prediction programs: Problems
• ORFs are not equivalent to CDSs.
• Gene prediction programs find new genes that share properties with a given set of genes.
• They can be confounded by:
  – Sequence constraints (ribosomal proteins etc.)
  – Sequence biases
  – Different sets of genes
  – Horizontal gene transfer
  – Non-coding DNA

Gene prediction programs: Problems Different gene training sets: Plasmodium falciparum (figure panels: original annotation, updated annotation)

Gene prediction programs: Problems Non-protein coding regions: S. typhi ribosomal RNA genes (annotation tracks: final, genefinder, orpheus, glimmer)

Gene prediction programs: Problems Non-protein coding regions: N. meningitidis DNA repeats (annotation tracks: final, orpheus, glimmer)

Gene prediction programs: Problems Pseudogenes: M. leprae

Gene prediction programs: Problems Pseudogenes: M. leprae Glimmer

Gene prediction programs: Problems Pseudogenes: M. leprae ORPHEUS

Gene prediction programs: Problems Pseudogenes: M. leprae WUBLASTX vs. M. tuberculosis

Gene prediction programs: Problems Pseudogenes: M. leprae Final annotation

The Gene Prediction Process
Diagram labels: ESTs; ANALYSIS SOFTWARE; DNA SEQUENCE; FASTA; BlastX; Gene finders; Codon Usage; AT content; Annotator; Useful CDS Prediction

Eukaryotic gene
Gene: 5' UTR, Exon I (ATG), intron (GT ... AG), Exon II, intron (GT ... AG), Exon III (stop), 3' UTR
mRNA: CAP ... AAAAAAAAAA
cDNA: TTTTT
EST: TTTTT
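The gene structure on this slide can be sketched in code: exons are joined to give the spliced sequence, and each intron between them is expected to begin with GT and end with AG. This is a simplified illustration; the function name splice and the coordinate convention (0-based, end-exclusive) are assumptions made for the example.

```python
def splice(gene, exons):
    """Join exon slices of gene and check GT...AG intron boundaries.

    exons is a list of (start, end) pairs, 0-based and end-exclusive,
    in order along the gene. Raises ValueError if an intron does not
    start with GT and end with AG.
    """
    gene = gene.upper()
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = gene[left_end:right_start]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(
                f"intron {left_end}-{right_start} lacks GT...AG boundaries"
            )
    return "".join(gene[start:end] for start, end in exons)
```

For example, with gene = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA" and exons = [(0, 6), (18, 24)], splice returns "ATGGCCGGATAA", the two exons with the GT...AG intron removed.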

AT content
• In AT-rich genomes, coding regions have a higher GC content than non-coding regions.
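A minimal sketch of how this signal can be measured, assuming a simple sliding window over the sequence. The function names and the default window and step sizes are illustrative choices, not values used by Artemis.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=120, step=60):
    """Return (window_start, gc_fraction) pairs along seq."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

In an AT-rich genome, windows whose GC fraction rises clearly above the background are candidate coding regions, to be checked against other evidence.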

AT content

CODON USAGE
• Codon bias is different for each organism.
• Sequence composition is constrained in coding regions, but not in non-coding regions.
• The codon usage of any particular gene can influence its expression.

Codon usage
• All organisms have a preferred set of codons.
Valine codon usage (fraction of each synonymous codon):
Organism      GUU    GUC    GUA    GUG
Malaria       0.41   0.06   0.42   0.11
Trypanosoma   0.28   0.19   0.14   0.39
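Fractions like those for the four valine codons (which sum to 1 for each organism) can be computed from any in-frame coding sequence. The following is a minimal sketch; the function name synonymous_fractions is hypothetical, and the input is assumed to be a DNA coding sequence starting in frame.

```python
from collections import Counter

VALINE_CODONS = ("GUU", "GUC", "GUA", "GUG")


def synonymous_fractions(cds, codons=VALINE_CODONS):
    """Fraction of each codon among the given set of synonymous codons.

    cds is an in-frame DNA (or RNA) coding sequence; T is read as U.
    """
    rna = cds.upper().replace("T", "U")
    counts = Counter(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}
```

Run over all annotated genes of an organism, the same counting gives a codon usage table that can then be used to score candidate ORFs.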

Codon Usage
• http://www.kazusa.or.jp/codon/

Codon Usage in Artemis (plot panels: forward frames, reverse frames)

Codon usage & gene finding in Leishmania

GC frame plot
• Plots the third-position GC content of each frame of a DNA sequence.
• In coding DNA the GC content of the 3rd base is often higher.
• A good predictor of coding regions in malaria and trypanosomes.
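A minimal sketch of the calculation behind such a plot, assuming a sliding window and the three forward frames only. It is not the Artemis implementation; the function name gc3_by_frame and the default window and step are assumptions made for the example.

```python
def gc3_by_frame(seq, window=120, step=30):
    """Third-position GC fraction per forward frame, window by window.

    Returns (window_start, (frame0, frame1, frame2)) pairs. Frames are
    defined relative to the start of seq, so a frame keeps its identity
    even when step is not a multiple of three.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions = []
        for frame in range(3):
            # Index in chunk of the first base at a third codon position.
            first = (frame + 2 - start) % 3
            thirds = chunk[first::3]
            gc = thirds.count("G") + thirds.count("C")
            fractions.append(gc / len(thirds) if thirds else 0.0)
        rows.append((start, tuple(fractions)))
    return rows
```

Where one frame's value stays well above the other two over a long stretch, that frame is the likely coding frame for the region.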

GC frame plot of tubulin gene cluster on T. brucei Chr 1

Homology Data
• Coding regions are more conserved than non-coding regions due to selective pressure.
• Comparing all possible translations against all known proteins will give clues to known genes.
• BLASTX
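"All possible translations" means the six reading frames: three on each strand. BLASTX performs this translation internally before searching a protein database; the sketch below only illustrates the translation step, using the standard genetic code. The function names and the frame labels (+1 to -3) are choices made for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate seq codon by codon; unknown codons become X, stops *."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to proteins."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, s in (("+", seq), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

For the nine-base sequence ATGGCCTAA, frame +1 translates to "MA*" and frame -1 to "LGH"; only frames whose translation resembles a known protein are informative.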

Gene finding: using ACT
TBLASTX comparisons between P. yoelii, P. falciparum and P. knowlesi

Gene finding by RNA-Seq (transcriptional landscape of Neospora caninum tachyzoites)
Tracks: Day 3 tachyzoites (RNAseq); Day 4 tachyzoites (RNAseq)

Transcriptome sequencing in Neospora (RNAseq is useful for predicting/confirming UTR boundaries)
Tracks: Day 3 tachyzoites (RNAseq); Day 4 tachyzoites (RNAseq); N. caninum Chr 08 and T. gondii Chr 08, with TBLASTX matches visualised in ACT; 5' UTR and 3' UTR marked
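One simple way to turn RNA-Seq coverage into candidate UTR boundaries is to extend outwards from an annotated CDS for as long as read depth stays above a threshold. This is a rough sketch under that assumption, not the method used for the Neospora annotation; the function name extend_to_utr and the min_depth default are hypothetical.

```python
def extend_to_utr(coverage, cds_start, cds_end, min_depth=5):
    """Extend a CDS to the edges of continuous RNA-Seq coverage.

    coverage is a list of per-base read depths; cds_start and cds_end
    are 0-based, end-exclusive. Returns (transcript_start, transcript_end)
    in the same coordinates.
    """
    start = cds_start
    while start > 0 and coverage[start - 1] >= min_depth:
        start -= 1
    end = cds_end
    while end < len(coverage) and coverage[end] >= min_depth:
        end += 1
    return start, end
```

With coverage [0, 0, 8, 9, 12, 15, 14, 13, 7, 6, 1, 0] and a CDS at (4, 8), the function returns (2, 10), suggesting two bases of UTR on each side. Boundaries found this way still need checking by eye, for example where neighbouring transcripts overlap.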

RNA-Seq: correcting gene models (figure panels: before and after, each with a %GC plot; legend: 16 hr, 32 hr, 48 hr)