Comparative genomics, genome context and genome annotation
“Nothing in (computational) biology makes sense except in the light of evolution” (after Theodosius Dobzhansky, 1970)

Genome context analysis and genome annotation
Using information other than homologous relationships between individual genes/proteins for functional prediction (guilt by association).
Types of context analysis:
  • phyletic patterns
  • domain fusion (“Rosetta Stone” proteins)
  • gene order conservation
  • co-expression
  • …

Goals:
  • Using gene sets from complete genomes, delineate families of orthologs and paralogs: Clusters of Orthologous Groups (of genes) (COGs)
  • Using COGs, develop an engine for functional annotation of new genomes
  • Apply COGs for analysis of phylogenetic patterns

COG: a group of homologous proteins such that all proteins from different species are orthologs (all proteins from the same species in a COG are paralogs)

CONSTRUCTION OF COGs FOR 8 COMPLETE GENOMES
Input: complete set of proteins from the analyzed genomes
  1. Full self-comparison (BLASTPGP, no cut-off)
  2. Collapse obvious paralogs
  3. Detect all interspecies Best Hits (BeTs) between individual proteins or groups of paralogs
  4. Detect all triangles of consistent BeTs
  5. Merge triangles with common edges
  6. Detect groups with multidomain proteins and isolate domains
Repeat steps 3-5
Output: COGs
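
Steps 3 to 5 of the procedure above (best hits, triangles, merging) can be illustrated with a minimal Python sketch. It is not the original COG software. It assumes that paralogs are already collapsed, that a "consistent" BeT means a reciprocal best hit (a simplification), and that similarity scores arrive as a dictionary. The function names and the data layout are hypothetical.

```python
from itertools import combinations


def best_hits(scores):
    """Return {(query, target_genome): best-scoring subject}.

    `scores` maps (query, subject) -> similarity score; each protein is a
    (genome, name) tuple. Hits within one genome are skipped because
    paralogs are assumed to be collapsed already.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query[0] == subject[0]:
            continue
        key = (query, subject[0])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: subject for key, (subject, _) in best.items()}


def bet_edges(scores):
    """Assumption: keep only reciprocal best hits as BeT edges."""
    best = best_hits(scores)
    edges = set()
    for (query, _genome), subject in best.items():
        if best.get((subject, query[0])) == query:
            edges.add(frozenset((query, subject)))
    return edges


def triangles(edges):
    """Find triples of proteins from three genomes linked by three BeT edges."""
    nodes = sorted({protein for edge in edges for protein in edge})
    found = []
    for a, b, c in combinations(nodes, 3):
        if len({a[0], b[0], c[0]}) == 3 and all(
            frozenset(pair) in edges for pair in ((a, b), (a, c), (b, c))
        ):
            found.append(frozenset((a, b, c)))
    return found


def merge_triangles(tris):
    """Merge triangles that share an edge (two proteins) into clusters."""
    clusters = []  # each entry: (set of proteins, set of triangle edges)
    for tri in tris:
        members = set(tri)
        tri_edges = {frozenset(pair) for pair in combinations(sorted(tri), 2)}
        kept = []
        for c_members, c_edges in clusters:
            if c_edges & tri_edges:
                members |= c_members
                tri_edges |= c_edges
            else:
                kept.append((c_members, c_edges))
        kept.append((members, tri_edges))
        clusters = kept
    return [members for members, _ in clusters]
```

Brute-force triangle enumeration is cubic in the number of proteins, so this form is only suitable for toy inputs. A real pipeline would index the edges by genome pair.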

A TRIANGLE OF BeTs IS A MINIMAL, ELEMENTARY COG

A RELATIVELY SIMPLE COG PRODUCED BY MERGING ADJACENT TRIANGLES

A COMPLEX COG WITH MULTIPLE PARALOGS

Current status of the COGs
Prokaryotes: 11 Archaea + 1 unicellular eukaryote + 46 bacteria = 58 complete genomes
  • 149,321 proteins
  • 105,861 proteins in 4075 COGs (71%)
Eukaryotes: 4 animals + 1 plant + 2 fungi + 1 microsporidium = 8 complete genomes
  • 142,498 proteins
  • 74,093 proteins in 4822 COGs (52%)

COGnitor...

…IN ACTION

The Universal COGs

Search for genomic determinants of hyperthermophily

Search for unique archaeo-eukaryotic genes

A complementary pattern: search for unique bacterial genes

Essential function… but holes in the phyletic pattern
Strict complementary pattern
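
A phyletic pattern can be represented as one presence/absence flag per genome, and a search for complementary patterns then reduces to comparing such vectors. The sketch below is illustrative only. The genome list, COG names and function names are hypothetical, and the `max_overlap` parameter is an assumed way to relax the strict criterion; the slides do not define the relaxed pattern.

```python
from itertools import combinations


def phyletic_pattern(cog_members, genomes):
    """Return a tuple of 0/1 flags, one per genome, for a COG.

    `cog_members` is an iterable of (genome, protein) tuples.
    """
    present = {genome for genome, _ in cog_members}
    return tuple(int(g in present) for g in genomes)


def is_universal(pattern):
    """A COG is universal if it is represented in every genome."""
    return all(pattern)


def complementary_pairs(patterns, max_overlap=0):
    """Find pairs of COGs that together cover every genome.

    With max_overlap=0 the patterns must be strictly complementary: each
    genome has exactly one of the two COGs. A larger max_overlap (an
    assumption made for this sketch) tolerates that many genomes that
    carry both COGs.
    """
    pairs = []
    for x, y in combinations(sorted(patterns), 2):
        p, q = patterns[x], patterns[y]
        covered = all(a or b for a, b in zip(p, q))
        overlap = sum(a and b for a, b in zip(p, q))
        if covered and overlap <= max_overlap:
            pairs.append((x, y))
    return pairs


# Hypothetical toy data: four genomes and four COGs.
genomes = ["aful", "mjan", "ecoli", "bsub"]
patterns = {
    "COG_X": (1, 1, 0, 0),
    "COG_Y": (0, 0, 1, 1),
    "COG_Z": (1, 0, 1, 1),
    "COG_U": (1, 1, 1, 1),
}
strict = complementary_pairs(patterns)
relaxed = complementary_pairs(patterns, max_overlap=1)
```

A pair found this way is only a candidate for non-orthologous displacement of an essential function; the two COGs still have to be checked for a plausible shared role.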

Relaxed complementary pattern

Relaxed complementary pattern with extra restrictions

Conservation of gene order in bacterial species of the same genus
M. genitalium vs M. pneumoniae
[Figure: gene order comparison plot; axes show gene positions]

Conservation of gene order in closely related bacterial genera
C. trachomatis vs C. pneumoniae
[Figure: gene order comparison plot; axes show gene positions]

Lack of gene order conservation, even in “closely related” bacteria of the same Proteobacterial subdivision
P. aeruginosa vs E. coli

Genome Alignments - Method
  • Protein sets from completely sequenced genomes
  • BLAST cross-comparison
  • Table of hits
  • Pairwise genome alignment: local alignment algorithm, Lamarck (gap opening penalty, gap extension penalty); statistics with Monte Carlo simulations
  • Template-anchored genome alignment
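
The pairwise step can be pictured as a local alignment of two gene orders, where two genes "match" if the table of hits links them. The sketch below is a generic Smith-Waterman/Gotoh-style alignment with affine gaps, not the Lamarck program, whose internals are not described on the slide. The function name and all score and penalty values are assumptions chosen for illustration.

```python
NEG = float("-inf")


def align_gene_orders(a, b, hits, match=2.0, mismatch=-1.0,
                      gap_open=2.0, gap_extend=0.5):
    """Locally align gene orders `a` and `b` (lists of gene ids).

    `hits` is a set of (gene_in_a, gene_in_b) pairs taken from a hit table.
    Returns (best score, list of matched gene pairs in order).
    """
    n, m = len(a), len(b)
    # M: ends in an aligned pair; X: gap in b; Y: gap in a.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = {}
    best, best_cell = 0.0, None
    score_of = lambda t: t[0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if (a[i - 1], b[j - 1]) in hits else mismatch
            # Start a new local alignment (0) or extend a previous state.
            prev = max((0.0, None), (M[i - 1][j - 1], "M"),
                       (X[i - 1][j - 1], "X"), (Y[i - 1][j - 1], "Y"),
                       key=score_of)
            M[i][j] = s + prev[0]
            back["M", i, j] = prev[1]
            X[i][j], back["X", i, j] = max(
                (M[i - 1][j] - gap_open, "M"),
                (X[i - 1][j] - gap_extend, "X"), key=score_of)
            Y[i][j], back["Y", i, j] = max(
                (M[i][j - 1] - gap_open, "M"),
                (Y[i][j - 1] - gap_extend, "Y"), key=score_of)
            if M[i][j] > best:
                best, best_cell = M[i][j], (i, j)
    if best_cell is None:
        return 0.0, []
    # Trace back from the best cell, collecting homologous pairs.
    pairs = []
    state, (i, j) = "M", best_cell
    while state is not None:
        prev_state = back[state, i, j]
        if state == "M":
            if (a[i - 1], b[j - 1]) in hits:
                pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif state == "X":
            i -= 1
        else:
            j -= 1
        state = prev_state
    pairs.reverse()
    return best, pairs
```

For example, with `a = ["a1", "a2", "a3", "a4"]`, `b = ["b9", "b1", "b2", "bx", "b3", "b4"]` and hits linking a1-b1 through a4-b4, the four pairs are recovered as one conserved string with a single gap for the inserted gene `bx`. Significance could then be estimated by realigning after shuffling one gene order many times, in the spirit of the Monte Carlo statistics mentioned above.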

Genome Alignments - Statistics
Distribution of conserved gene string lengths
[Figure: string lengths from 2 to >20 genes for the genome pairs cpneu-ctra, mjan-mthe, bsub-ecoli and drad-aero; vertical axis from 0.0 to 0.5]

Genome Alignments - Statistics
Pairwise alignments:

                       No. strings   No. genes   % in Gen 1   % in Gen 2
  all homologs
    ecoli-hinf             138          566         13%          33%
    ecoli-bsub              89          322          8%           8%
    ecoli-mjan              10           30          1%           2%
  probable orthologs
    ecoli-hinf             105          482         11%          28%
    ecoli-bsub              34          168          4%           4%
    ecoli-mjan              12           33          1%           2%

Genome Alignments - Statistics
Breakdown of genes in the genome
[Figure: number of genes per genome (vertical axis from 0 to 5000), split into three categories: not in gene strings; in non-conserved gene strings (directons); in conserved gene strings]

Genome Alignments - Statistics
Fraction of the genome in conserved gene strings (from template-anchored alignments)

  Synechocystis sp.         5%   (minimum)
  Aquifex aeolicus         10%
  Archaeoglobus fulgidus   13%
  Escherichia coli         14%
  Treponema pallidum       17%
  Thermotoga maritima      23%   (maximum)
  Mycoplasma genitalium    24%   (maximum)

Context-Based Prediction of Protein Functions
A Novel Translation Factor (COG0536)
[Figure labels: L21; L27; GTPase?; GTP-binding translation factor]

Context-Based Prediction of Protein Functions
A Novel Translation Factor (COG0012)
[Figure labels: TGS domain-containing GTPase?; GTP-binding translation factor; Peptidyl-tRNA hydrolase]