Computational Analysis of the Taxonomical Classification of Short 16S rRNA Sequences
Christel Chehoud
Mentor: Brian Haas

Overview
• Human Microbiome Project
• 16S rRNA
• Reference and Test Sets
• Classifiers
• Accuracy of Classifications
• Results

Human Microbiome Project (HMP)
• Microorganism communities
  • Human development
  • Physiology
  • Immunity
  • Disease
  • Nutrition
• Core Microbiome
http://nihroadmap.nih.gov/hmp/

16S rRNA
• 16S
• Ribosomal RNA
• Large RNA component of the small subunit of the ribosome
• Phylogenetic Markers
• Species Identification
• 1542 bp

Using 16S for Species Identification
Sequence → Classifier → Predicted Classification
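
The sequence-in, label-out flow on this slide can be illustrated with a toy classifier. This is a minimal sketch, not the method used by RDP, kmerRank, or BLAST: it assigns a query to whichever reference shares the most k-mers with it. The function names, the choice of k = 4, and the reference sequences are all hypothetical.

```python
from collections import Counter


def kmer_profile(seq, k=4):
    """Count every overlapping k-mer in a sequence (k=4 is an arbitrary choice)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(profile_a, profile_b):
    """Number of k-mer occurrences the two profiles have in common."""
    return sum((profile_a & profile_b).values())


def classify(query, references, k=4):
    """Return the label of the reference sharing the most k-mers with the query.

    `references` maps a taxon label to a reference sequence (toy data here).
    """
    query_profile = kmer_profile(query, k)
    return max(
        references,
        key=lambda label: shared_kmers(query_profile, kmer_profile(references[label], k)),
    )


# Hypothetical reference set, for illustration only.
references = {
    "GenusA": "ACGTACGTACGTACGT",
    "GenusB": "TTTTGGGGCCCCAAAA",
}
predicted = classify("ACGTACGT", references)
```

Real classifiers add confidence estimates and a full taxonomy; the sketch only shows the shape of the interface.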

Project Goal
• New Sequencing Technology
• Evaluate the accuracy of the classification of the 16S rRNA across different:
  • Classifiers
  • Regions of the sequence
  • Phylogeny

Reference Dataset
• RDP Core Set
  • Trusted Taxonomies
  • 6,621 sequences
  • Phylum: 27
  • Class: 43
  • Order: 97
  • Family: 258
  • Genus: 1352

GreenGenes's Full Collection of Sequences
• Full Collection used by GreenGenes
• High phylogenetic diversity
• 188,073 sequences

Comparison of Taxonomy Predictions by Method
• Classified GreenGenes Core Set Using:
  • RDP (Naïve Bayesian)
  • kmerRank
  • Blast
• All Match: 188,073 → 135,269 sequences
  • Phylum: 27
  • Class: 43
  • Order: 96
  • Family: 257
  • Genus: 1335

[Figure: agreement among BLAST, RDP, and kmerRank predictions. All match: 135,269. None match: 19,588. Other counts shown: 32,334; 4,934; 15,949.]
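
The "All Match" filter can be sketched as follows. This is a minimal illustration under assumptions: the data layout (a dictionary of per-method labels keyed by sequence ID), the function names, and the example labels are hypothetical, and the slides do not say at which taxonomic rank agreement was required.

```python
def all_match(predictions):
    """Return IDs of sequences for which every method predicts the same label.

    `predictions` maps sequence ID -> {method name: predicted label}.
    """
    return [
        seq_id
        for seq_id, by_method in predictions.items()
        if len(set(by_method.values())) == 1
    ]


def none_match(predictions):
    """Return IDs of sequences for which no two methods agree."""
    return [
        seq_id
        for seq_id, by_method in predictions.items()
        if len(set(by_method.values())) == len(by_method)
    ]


# Hypothetical predictions, for illustration only.
predictions = {
    "seq1": {"BLAST": "Bacillus", "RDP": "Bacillus", "kmerRank": "Bacillus"},
    "seq2": {"BLAST": "Bacillus", "RDP": "Listeria", "kmerRank": "Bacillus"},
    "seq3": {"BLAST": "Vibrio", "RDP": "Listeria", "kmerRank": "Bacillus"},
}
agreed = all_match(predictions)
disagreed = none_match(predictions)
```

Keeping only the `all_match` sequences gives a test set whose labels do not depend on any single classifier.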

CD-hit: Normalizing Genus Representation
• 3% difference between genera
• 188,073 → 135,269 → 21,179 sequences
  • Phylum: 27
  • Class: 43
  • Order: 96
  • Family: 235
  • Genus: 1241
(Li, 2006)

Sliding Window: Producing our Localized Regions
• Sliding Window Approach
  • 300 bp window
  • 25 bp overlap
• Sanger vs. 454-XLR = Full-length vs. localized region
(Van de Peer, 1996)
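
The windowing step can be sketched as below. This is an illustration under assumptions, not the original pipeline: it reads "25 bp overlap" as the number of bases shared by consecutive windows (so the step is 275 bp), and it drops any trailing fragment shorter than a full window. If the slide instead meant a 25 bp step, set `overlap=275`. The function name is hypothetical.

```python
def sliding_windows(seq, window=300, overlap=25):
    """Return (start, subsequence) pairs for full-length windows along seq.

    Consecutive windows share `overlap` bases; a trailing fragment shorter
    than `window` is dropped (an assumption, not stated in the slides).
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    last_start = len(seq) - window
    return [(start, seq[start:start + window]) for start in range(0, last_start + 1, step)]


# A placeholder sequence with the 1542 bp length quoted for 16S rRNA.
windows = sliding_windows("A" * 1542)
starts = [start for start, _ in windows]
```

Under these assumptions a 1542 bp sequence yields five windows, starting at positions 0, 275, 550, 825, and 1100.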

Overall Accuracy of the Three Different Classifiers
• Average
  • BLASTN: 0.843
  • kmerRank: 0.830
  • RDP: 0.831
• Standard Deviation
  • BLASTN: 0.031
  • kmerRank: 0.030
  • RDP: 0.017
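
Summary statistics of this kind can be computed from a list of accuracies, one per sequence window. The sketch below is an assumption about the procedure: the slides do not say whether the sample or the population standard deviation was used (the sample form, `statistics.stdev`, is used here), and the input values are hypothetical rather than the study's data.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Return the mean and sample standard deviation of a list of accuracies."""
    return {"mean": mean(accuracies), "sd": stdev(accuracies)}


# Hypothetical per-window accuracies for one classifier, for illustration only.
summary = summarize([0.80, 0.85, 0.90])
```

A lower standard deviation means the classifier's accuracy changes less from one region of the sequence to another.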

Genus Prediction Accuracy (per Phylum)
• Average
  • BLASTN: 0.843
  • kmerRank: 0.830
  • RDP: 0.831
• Standard Deviation
  • BLASTN: 0.107
  • kmerRank: 0.153
  • RDP: 0.142
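
Per-phylum genus accuracy can be sketched as a grouped hit rate. This is a minimal illustration: the record layout (phylum, true genus, predicted genus), the function name, and the example records are assumptions, not the study's data or code.

```python
from collections import defaultdict


def genus_accuracy_by_phylum(records):
    """Return {phylum: fraction of sequences whose genus was predicted correctly}.

    `records` is an iterable of (phylum, true_genus, predicted_genus) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for phylum, true_genus, predicted_genus in records:
        totals[phylum] += 1
        if true_genus == predicted_genus:
            hits[phylum] += 1
    return {phylum: hits[phylum] / totals[phylum] for phylum in totals}


# Hypothetical records, for illustration only.
records = [
    ("Firmicutes", "Bacillus", "Bacillus"),
    ("Firmicutes", "Clostridium", "Bacillus"),
    ("Proteobacteria", "Escherichia", "Escherichia"),
]
per_phylum = genus_accuracy_by_phylum(records)
```

Averaging these per-phylum values, rather than pooling all sequences, keeps heavily sampled phyla from dominating the result.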

Finding the 16S Region Providing the Most Reliable Prediction Accuracy

Clustering Phyla and Methods by Prediction Accuracy
• Best method is Phylum-dependent
• Variation in accuracy impacted by depth of species coverage

Summary
• Central region of 16S is the most accurate, on average
• Of the methods examined, BLAST is most accurate across all 16S regions and all phyla, on average
• RDP-bayes is least variable across short sequence regions
• Best short sequence classification method is phylum-dependent

Acknowledgements
• Genome Sequencing and Analysis Program
  • Brian Haas
  • Dirk Gevers
  • Michael Feldgarden
  • Doyle Ward
  • Chad Nusbaum
  • Bruce Birren
• Administration
  • Shawna Young
  • Lucia Vielma
  • Maura Silverstein