
Phylogenomics and the evolution of gene repertoires in bacteria
Paris, MEP, June 18th 2005
Vincent Daubin
Bioinformatique et Génomique Evolutive

Menu
• Introduction: phylogenomics
  – A neologism and an old quote.
• Phylogenomics in Bacteria/Prokaryotes
  – What phylogenetic framework?
• Approaches for finding the Tree (if there is one)
  – Results obtained from different methods
• Reconstructing the history of complete genomes
• Conclusion

Phylogenomics
• Nothing makes sense in genomics except in a phylogenetic framework
  – Understanding the organization of genomes, the evolution of functions, the histories of duplications, etc.
• Numerous prokaryotic genomes (relatively small, dense in genes…)
• But what phylogenetic framework for prokaryotes?

Woese, 1987: SSU rRNA phylogeny

From one gene… ATTTGAC… ACTTGAC… ATTCGCC…

… to another TTTAGAC… TCTAGAC… TTACGCC… TTACGAC…

Evidence for Lateral Gene Transfer
[Figure: UMP-kinase gene tree (scale bar 0.5) of bacterial and archaeal species plus Arabidopsis thaliana. Legend: Bacteria, Archaea, Eukaryota. Labelled groups: green plant, cyanobacteria.]

Multiple LGT or…?
[Figure: orotate phosphoribosyltransferase gene tree (scale bar 0.5) of bacterial, archaeal and eukaryotic species (including Caenorhabditis elegans and Saccharomyces cerevisiae). Legend: Bacteria, Archaea, Eukaryota.]

Lateral gene transfer in bacteria
• Transduction
• Conjugation
• Transformation

Acquisition of function via LGT (Ochman et al., 2000)

Massive gene “exchanges”! (Ochman et al., 2000)

The alternative to the tree? (Zhaxybayeva and Gogarten, 2002)

Methods used to reconstruct the tree of life using complete genomes
• Oligonucleotide/peptide (word) frequency in genome/proteome
• Global index of similarity (BLAST)
• Gene content
• Gene order
• Gene concatenation
• Supertrees
Side annotations: hypothesis of homology not always clear; mostly gene homology; mostly gene orthology; gene orthology (alignments); finding xenology

Statistics on genomes
Whole genomes (proteome): word frequency table, then tree
Word   sp1   sp2   sp3   …
AAAA   104   63    307   …
AAAC   …     …     …
AAAG
AAAT
…
• Count words (correct for % of letters)
• Compute distances (= differences in word usage)
• Build a tree
• Re-sample words for support
Hypothesis of homology?
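The counting and distance steps above can be sketched as follows. This is only an illustration under stated assumptions: words are tetranucleotides, the "correct for % of letters" step divides observed counts by the counts expected if letters were independent, and the distance is Euclidean. The function names and toy sequences are hypothetical and are not taken from any of the cited studies.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

def word_usage(seq, k=4):
    """Observed/expected ratio for every DNA word of length k.

    Expected counts assume independent letters; this particular
    correction is an assumption made for the sketch.
    """
    seq = seq.upper()
    n_words = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = Counter(seq[i:i + k] for i in range(n_words))
    usage = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        expected = n_words * prod(base_freq[b] for b in word)
        usage[word] = counts[word] / expected if expected > 0 else 0.0
    return usage

def usage_distance(usage_a, usage_b):
    """Euclidean distance between two word-usage profiles."""
    return sqrt(sum((usage_a[w] - usage_b[w]) ** 2 for w in usage_a))

# Toy sequences, for illustration only.
genomes = {"sp1": "ACGTACGTACGTAAGGCCTT", "sp2": "TTTTGGGGCCCCAAAATTGGCCAA"}
profiles = {name: word_usage(seq) for name, seq in genomes.items()}
d12 = usage_distance(profiles["sp1"], profiles["sp2"])
```

The pairwise distances would then feed any distance-based tree builder, and support could be estimated by re-sampling words before recomputing the distances.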

Statistics on genomes
• Pride et al., 2003
• Based on tetranucleotide frequency in 27 genomes
• Distance ~ differences in usage
• Relatively little signal for resolving the tree of bacteria, BUT resolves recent and very deep nodes (i.e., domains).

Statistics on genomes
• Qi et al., 2004
• K-strings in proteins (i.e., words of K letters) – here, K=6
• 109 genomes
• Gets better with longer strings (relationship to gene homology?)

Blast scores
Compare proteomes (BLASTP…): distance matrix, then tree
       sp1   sp2   sp3   …
sp1    0     0.5   …
sp2    0.5   0     …
sp3    …     …     0
• Average % identity, normalized BLAST scores… (restrict to orthologous genes)
• Transform into distance
• Build a tree
• Re-sample pairs of matching genes for support (remove discordant matches)

Blast scores
• Clarke et al., 2002
• Normalized BLASTP scores (= match/self_match)
• 37 genomes (3 domains)
• Finds most of the phyla defined by rRNA
• Remove phylogenetically discordant matches (little effect)
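A minimal sketch of the normalized score (match/self_match) and of one way to turn it into a distance. The transform used here, one minus the mean normalized score over best hits, is an assumption chosen for simplicity and is not claimed to be the transform of Clarke et al. The scores and protein names are made up.

```python
def normalized_score(match_score, self_score):
    """BLASTP score of a match divided by the query's self-match score."""
    return match_score / self_score

def proteome_distance(best_hit_scores, self_scores):
    """1 - mean normalized score over the best hits of one proteome.

    best_hit_scores: {query protein: score of its best hit in the other proteome}
    self_scores:     {query protein: score of the query against itself}
    """
    ratios = [
        min(1.0, normalized_score(score, self_scores[query]))
        for query, score in best_hit_scores.items()
    ]
    return 1.0 - sum(ratios) / len(ratios)

# Hypothetical scores for two proteins of sp1 matched against sp2.
self_scores = {"p1": 200.0, "p2": 100.0}
best_hit_scores = {"p1": 100.0, "p2": 100.0}
distance = proteome_distance(best_hit_scores, self_scores)
```

Restricting `best_hit_scores` to reciprocal best matches would correspond to the "restrict to orthologous genes" option.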

Gene content
Compare proteomes (BLASTP…): presence/absence (parsimony) matrix, then tree
[Matrix: presence (1) or absence (0) of each gene (rows: Gene 1, Gene 2, …) in each species (columns: sp1, sp2, sp3, …)]
• Code presence/absence of:
  - Orthologs (reciprocal best matches)
  - Homologs (families)
  - Domains, folds (superfamilies)
• Compute distance (correct for genome size), Dollo parsimony…
• Build a tree
• Sample subset of genes for statistical support
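The presence/absence coding and a size-corrected distance can be sketched as below. Dividing the number of shared genes by the size of the smaller genome is one possible correction for genome size; it is assumed here for illustration, and the gene and species names are hypothetical.

```python
def presence_absence_matrix(gene_sets):
    """Rows = genes, columns = species (sorted), entries 1/0."""
    species = sorted(gene_sets)
    genes = sorted(set().union(*gene_sets.values()))
    return {g: [int(g in gene_sets[sp]) for sp in species] for g in genes}

def gene_content_distance(genes_a, genes_b):
    """1 - (shared genes / size of the smaller genome)."""
    shared = len(genes_a & genes_b)
    return 1.0 - shared / min(len(genes_a), len(genes_b))

# Toy gene repertoires, for illustration only.
gene_sets = {
    "sp1": {"g1", "g2", "g3", "g4"},
    "sp2": {"g2", "g3"},
    "sp3": {"g1", "g5"},
}
matrix = presence_absence_matrix(gene_sets)
d13 = gene_content_distance(gene_sets["sp1"], gene_sets["sp3"])
```

With this normalization a small genome that is a strict subset of a large one (sp2 and sp1 above) gets distance 0, which is the intended effect of correcting for size.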

Gene content
• Yang et al., 2004
• Folds (= superfamilies) in 119 bacterial genomes
• Distance method
• Finds a few phyla defined by rRNA

Gene content
• Snel et al., 1999
• Orthologs in 23 genomes
• Finds most of the phyla defined by rRNA

Gene order
Compare proteomes (BLASTP…): gene orders, then tree
[Figure: the same genes (a, b, c, d, e) arranged in different orders in sp1, sp2 and sp3]
• Assign orthologs
  - keep those present in ≥ 2
  - keep those present in all
• Compute distances based on:
  - conservation of pairs of neighbors
  - number of breakpoints
  - sequence of inversions…
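A sketch of a distance based on conservation of pairs of neighbors. The assumptions are loud ones: chromosomes are treated as linear, gene orientation is ignored, each ortholog occurs once per genome, and the distance is simply one minus the fraction of conserved adjacencies. None of this is claimed to match a published implementation.

```python
def neighbor_pairs(order):
    """Unordered pairs of adjacent genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

def gene_order_distance(order_a, order_b):
    """1 - fraction of neighbor pairs conserved between two genomes.

    Only orthologs present in both genomes are kept before comparing.
    """
    shared = set(order_a) & set(order_b)
    if len(shared) < 2:
        raise ValueError("need at least two shared orthologs")
    kept_a = [g for g in order_a if g in shared]
    kept_b = [g for g in order_b if g in shared]
    pairs_a, pairs_b = neighbor_pairs(kept_a), neighbor_pairs(kept_b)
    return 1.0 - len(pairs_a & pairs_b) / len(pairs_a)

# Toy gene orders, for illustration only.
sp1 = ["a", "b", "c", "d", "e"]
sp2 = ["a", "b", "e", "d", "c"]
d12 = gene_order_distance(sp1, sp2)
```

In this toy case one inversion of the block c-d-e breaks a single adjacency (b-c), so three of four neighbor pairs are conserved.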

Gene order
• Wolf et al., 2001
• Based on conservation of pairs of neighbors
• Finds most of the phyla defined by rRNA + suggests some non-trivial groups

Gene concatenation
Gene alignments are joined into a super-alignment
• Select genes that can be concatenated:
  - reduce missing data
  - analyze congruence (… or not)
• Bootstrap: re-sample sites (or re-sample genes)
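Building the super-alignment and re-sampling its sites can be sketched as follows, assuming each gene alignment is a dictionary from species to an aligned sequence and that a species missing from a gene is padded with gaps. The data layout and function names are hypothetical.

```python
import random

def concatenate(alignments, species):
    """Join per-gene alignments; species missing from a gene get gaps."""
    parts = {sp: [] for sp in species}
    for alignment in alignments:
        length = len(next(iter(alignment.values())))
        for sp in species:
            parts[sp].append(alignment.get(sp, "-" * length))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}

def bootstrap_sites(supermatrix, rng):
    """Re-sample alignment columns with replacement."""
    n_sites = len(next(iter(supermatrix.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {sp: "".join(seq[c] for c in columns) for sp, seq in supermatrix.items()}

# Toy alignments, for illustration only.
alignments = [
    {"sp1": "ATG", "sp2": "ATA"},
    {"sp1": "CC", "sp3": "CG"},
]
supermatrix = concatenate(alignments, ["sp1", "sp2", "sp3"])
replicate = bootstrap_sites(supermatrix, random.Random(0))
```

Re-sampling genes instead of sites would amount to drawing whole entries of `alignments` with replacement before calling `concatenate`.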

Gene concatenation
• Brochier et al., 2002
• 57 genes in 45 species (8857 positions)
• Unrooted tree of bacteria
• Finds all phyla defined by rRNA + suggests some non-trivial groupings

Comparison of (some) phylogenomic distances
[Figure: scatter plot of each distance (y-axis, "whatever distance", 0 to 1.8) against 16S rRNA divergence (F84; x-axis, 0 to 0.6). Fits reported on the plot:
  - Concatenated proteins (9 genes, JTT): R² = 0.6477
  - Gene order = -ln(s): R² = 0.0913
  - Gene order = (s-1): R² = 0.132
  - Presence/absence = -ln(s): R² = 0.0756
  - Presence/absence = (s-1): R² = 0.1849]

Supertrees
Combination of gene trees
[Figure: several gene trees on overlapping sets of taxa (A to G) combined into one supertree]
• Select trees that can be combined = analyze congruence (… or not)
• Bootstrap: re-sample sites (MRP) or re-sample trees
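The matrix coding used by Matrix Representation with Parsimony (MRP) can be sketched as below: every clade of every source tree becomes one binary character, coded 1 for taxa inside the clade, 0 for the other taxa of that tree, and ? for taxa absent from that tree. Representing rooted trees as nested tuples is an assumption of this sketch; the parsimony search on the resulting matrix is not shown.

```python
def leaves(tree):
    """Set of leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Leaf sets of all internal nodes except the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(frozenset(leaves(node)))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found

def mrp_matrix(trees):
    """One character per clade: '1' inside, '0' outside, '?' if taxon missing."""
    taxa = sorted(set().union(*(leaves(t) for t in trees)))
    matrix = {taxon: "" for taxon in taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in taxa:
                if taxon not in present:
                    matrix[taxon] += "?"
                else:
                    matrix[taxon] += "1" if taxon in clade else "0"
    return matrix

# Two toy gene trees with partly overlapping taxa.
trees = [(("A", "B"), ("C", "D")), (("A", "C"), "E")]
matrix = mrp_matrix(trees)
```

The matrix would then be analyzed with a parsimony program, and re-sampling its columns gives the MRP bootstrap mentioned above.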

Supertree of bacteria
• Daubin et al., 2002
• Bacterial supertree based on the combination of 121 gene trees with 7 ≤ number of species ≤ 32
• Matrix Representation with Parsimony
• Finds all phyla defined by rRNA + suggests some non-trivial groupings
[Figure: supertree with bootstrap values on the nodes. Labelled groups: Low G+C Gram-positives, Mycoplasmas, High G+C Gram-positives, Proteobacteria, Chlamydiales, Spirochaetes.]

A tree of bacteria?
[Figure: side-by-side comparison of two trees with the same labelled groups (Low G+C Gram-positives, Mycoplasmas, High G+C Gram-positives, Proteobacteria, Chlamydiales, Spirochaetes):
  - Supertree (Daubin et al., 2002), 121 genes
  - Concatenation of ribosomal proteins (Brochier et al., 2002), 57 genes]

A consensus for the tree of life
• Black: already known from rRNA
• Red: established from complete genome analysis (congruence among methods)
• Dashed red: suggested by complete genome analysis
Wolf et al., 2002

Phylogenomics in bacteria
• Nature of gene innovation in bacteria?
• In eukaryotes: mainly duplication
• What about bacteria?

The origin of « duplicates » in bacterial genomes
- Intra-genomic duplication (a gives a + a’): PARALOGS
- LGT of a gene that already has a homolog in the genome (b receives bx): XENOLOGS
Calling these genes « duplicates » or « paralogs » is an overstatement: “SYNOLOGS” = PARALOGS || XENOLOGS

Phylogenomics of Gammaproteobacteria (13 species)
• Ancient group (0.5-1 billion years – May et al., 2001)
• Model of bacterial diversification:
  – Escherichia coli K12: commensal
  – Salmonella typhimurium LT2: human pathogen
  – Buchnera aphidicola AP: endosymbiont of aphids
  – Haemophilus influenzae: commensal
  – Pasteurella multocida: human pathogen
  – Yersinia pestis (CO92 and KIM): animal pathogen (agent of plague)
  – Pseudomonas aeruginosa PAO1: human opportunistic pathogen
  – Xanthomonas axonopodis: plant pathogen (Citrus)
  – Xanthomonas campestris: plant pathogen (crucifers)
  – Xylella fastidiosa: plant pathogen (Citrus)
  – Wigglesworthia brevipalpis: endosymbiont of tse-tse fly
  – Vibrio cholerae: human pathogen (agent of cholera)
• High rate of LGT reported (e.g., E. coli)

Gene families in γ-proteobacteria
[Figure: number of families (y-axis) versus number of species in which the family is found (x-axis). Annotations: genes unique to a genome; minimal core of genes in γ-proteobacteria.]

The core of genes
• Among the 275 families that group genes from the 13 species: 205 families with 1 gene per species.
  → True orthologs.
  → Do these genes have the same history?
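Selecting the families with exactly one gene in every species can be sketched as below. The representation of a family as a list of (species, gene) pairs and all names are hypothetical; how the families themselves were built is not covered here.

```python
from collections import Counter

def single_copy_core(families, n_species):
    """Families present in all species with exactly one gene per species."""
    core = {}
    for name, members in families.items():
        per_species = Counter(species for species, _gene in members)
        if len(per_species) == n_species and set(per_species.values()) == {1}:
            core[name] = members
    return core

# Toy families over three species, for illustration only.
families = {
    "famA": [("sp1", "a1"), ("sp2", "a2"), ("sp3", "a3")],
    "famB": [("sp1", "b1"), ("sp1", "b1x"), ("sp2", "b2"), ("sp3", "b3")],
    "famC": [("sp1", "c1"), ("sp2", "c2")],
}
core = single_copy_core(families, n_species=3)
```

Here only famA qualifies: famB has two genes in sp1 and famC is missing from sp3.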

ML tests (ELW, SH, KH…)
• Sequence alignment: ML tree, with likelihood Ln1
• Phylogenetic hypothesis to test (e.g., “species phylogeny”): likelihood LnX
• Are Ln1 and LnX significantly different?
  - NO: accept phylogenetic hypothesis
  - YES: possible LGT
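The decision step can be illustrated with a normal-approximation version of the KH test. This sketch assumes that per-site log-likelihoods under both topologies are already available from a phylogenetics program; the numbers below are invented. It is meant only to show the comparison of Ln1 and LnX. The SH and ELW tests are not implemented, and the KH test is known to be biased when one of the trees is the ML tree chosen from the same data.

```python
from math import erf, sqrt

def kh_test(site_lnl_ml, site_lnl_hyp):
    """One-sided KH test from per-site log-likelihoods.

    Returns (delta, p) where delta = Ln1 - LnX and p is the probability
    of a difference at least this large under a normal approximation.
    Assumes the per-site differences are not all identical.
    """
    diffs = [a - b for a, b in zip(site_lnl_ml, site_lnl_hyp)]
    n = len(diffs)
    delta = sum(diffs)
    mean = delta / n
    var_delta = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    z = delta / sqrt(var_delta)
    p_value = 0.5 * (1.0 - erf(z / sqrt(2.0)))
    return delta, p_value

# Invented per-site log-likelihoods for a four-site alignment.
lnl_ml_tree = [-1.0, -2.0, -1.5, -1.2]
lnl_species_tree = [-1.1, -2.3, -1.4, -1.6]
delta, p_value = kh_test(lnl_ml_tree, lnl_species_tree)
possible_lgt = p_value < 0.05  # YES branch: hypothesis rejected
```

With these toy values the species topology is not rejected at the 5% level, which corresponds to the NO branch above.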

[Figure: bar chart of the number of gene alignments for which each tested topology (numbered 1 to 13, including the SSU rRNA tree, the concatenated-protein tree and other hypotheses) is not different, or different, from the ML tree. The best topology is marked.]

The organismal phylogeny
[Figure: tree (scale bar 0.2; bootstrap values of 100 on the labelled nodes), with a number beside each species:
  - E. coli: 4183
  - S. typhimurium: 4203
  - Y. pestis CO92: 3599
  - Y. pestis KIM: 3879
  - B. aphidicola: 564
  - W. brevipalpis: 653
  - H. influenzae: 1709
  - P. multocida: 2015
  - V. cholerae: 2724+1081
  - P. aeruginosa: 5540
  - X. fastidiosa: 2680
  - X. axonopodis: 4192
  - X. campestris: 4029]
Based on the concatenation of 203 genes
Lerat et al., 2003

Example: Maximum likelihood test with one synolog
[Figure: species A carries a synolog; ML trees are compared with the species topology (test of ΔL)]
- Allows detection of possible LGT and identification of the true ortholog
- !!! Incongruence can result from duplication + loss (results need to be checked manually) !!!

Results for the phylogenetically « informative » fraction of the genomes
[Figure: percentage of LGT (y-axis, 0 to 80) versus number of species (x-axis, 6 to 13), for families with 0, 1, 2 or >2 synologs]
→ Synology is associated with a high frequency of LGT
Many of the so-called duplicates in bacterial genomes in fact arise by LGT
Lerat et al., 2005

But families having synologs are rare
[Figure: number of families (y-axis, 0 to 250) versus number of species (x-axis, 1 to 13), split by number of synologs (# genes – # genomes: 0, 1, 2, …, 10, >10). The values 7655, 2429, 835 and 457 are printed on the plot.]
Families having synologs represent less than 2% of the total
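The synolog count defined on the plot (# genes – # genomes) is straightforward to compute per family. The sketch below assumes a family is a list of (genome, gene) pairs; the family contents are invented.

```python
def synolog_count(members):
    """Number of genes in a family minus the number of genomes it occurs in."""
    genomes = {genome for genome, _gene in members}
    return len(members) - len(genomes)

def fraction_with_synologs(families):
    """Share of families that contain at least one synolog."""
    with_synologs = sum(1 for members in families.values() if synolog_count(members) > 0)
    return with_synologs / len(families)

# Toy families, for illustration only.
families = {
    "famA": [("sp1", "a1"), ("sp2", "a2"), ("sp3", "a3")],
    "famB": [("sp1", "b1"), ("sp1", "b1x"), ("sp2", "b2"), ("sp3", "b3")],
    "famC": [("sp1", "c1")],
    "famD": [("sp2", "d1"), ("sp3", "d2")],
}
counts = {name: synolog_count(members) for name, members in families.items()}
share = fraction_with_synologs(families)
```

A count of 0 means at most one gene per genome; the count alone does not say whether the extra copies are paralogs or xenologs.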

The auxiliary genome of bacteria is an ORFanage (Welch et al., 2002)

Genes unique to a genome
- Some genes are annotated as phages or secretion proteins…
- Most have unknown function
- Most are ORFans (no homolog known in databases)

What are ORFans?
• Rapidly evolving genes, or possibly pseudogenes (cf. Amiri et al., 2003)
• Genes produced de novo from non-coding sequences
• Artifacts resulting from the algorithms used to recognize coding sequences in genomes
• Genes transferred from organisms that have no representatives in the databases

How can we understand ORFans?
• By definition, no comparative study is possible (evolutionary rate, structure determination by homology…)
• But… if the mechanism producing ORFans is continuous over time, we can find ORFans for every node in a tree
• Search for ORFans in the lineage leading to E. coli MG1655 (K12)

Examine genes restricted to each clade at increasing phylogenetic depths (n0, n1, n2, etc.) as well as those ancestral to all taxa (native).
At each node, define two types of genes:
- ORFans: genes restricted to a clade and having no other homologs
- HOPs (Heterogeneous Occurrence in Prokaryotes): genes restricted to a clade but with matches in some distantly related organism (LGT events)
This approach allows:
1. comparisons of the sequence features of ORFans of different ages
2. comparisons of ORFans to acquired & ancestral genes
3. use of comparative methods to obtain information about evolutionary rate and functional status of ORFans (e.g., n2: E. coli vs. Salmonella)
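The classification rule above can be sketched as follows. The assumptions are simplifications made for illustration: a gene is assigned to the shallowest clade that contains all of its ingroup homologs, the deepest clade stands for "native", and any match outside the ingroup makes the gene a HOP. The clade contents and taxon names are a reduced, hypothetical stand-in for the real tree.

```python
def classify_gene(ingroup_hits, distant_hits, nested_clades):
    """Return (depth, kind) for one gene of the focal genome.

    ingroup_hits : set of ingroup taxa in which the gene has a homolog
    distant_hits : set of distantly related organisms with a match
    nested_clades: list of (label, taxa) from shallowest to deepest;
                   the last entry is the whole ingroup ("native")
    """
    for label, taxa in nested_clades:
        if ingroup_hits <= taxa:
            break
    else:
        raise ValueError("hits fall outside the deepest clade")
    if label == nested_clades[-1][0]:
        return (label, "native")
    return (label, "HOP" if distant_hits else "ORFan")

# Hypothetical nested clades around the focal strain K12.
nested_clades = [
    ("n0", {"K12"}),
    ("n1", {"K12", "Shigella"}),
    ("n2", {"K12", "Shigella", "Salmonella"}),
    ("native", {"K12", "Shigella", "Salmonella", "Vibrio"}),
]
example = classify_gene({"K12", "Salmonella"}, set(), nested_clades)
```

A gene found only in K12 and Salmonella with no distant match is thus an n2 ORFan, while the same distribution plus a hit in a distant organism would be an n2 HOP.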

ORFans HOPs Daubin & Ochman, 2004

Length of ORFans and HOPs
[Figure: length (bp; y-axis, 400 to 1200) of HOPs and ORFans at each phylogenetic depth, from younger to older: n0, n1, n2, n3, n4, native. Clade labels on the tree: E. coli, E. coli + Shigella, S. enterica, enterics, Vibrio/Haem, γ-proteobacteria.]

Evolutionary rates of ORFans and HOPs
[Figure: Ka/Ks between Escherichia and Salmonella (y-axis, 0.06 to 0.14) for HOPs and ORFans at depths n2, n3, n4 and native. Clade labels on the tree: E. coli, E. coli + Shigella, S. enterica, enterics, Vibrio/Haem, γ-proteobacteria.]
Ka/Ks is low, indicating that ORFans encode proteins; however, both Ka & Ks are elevated

G+C content of ORFans and HOPs
[Figure: % G+C3 (y-axis, 42 to 58) of HOPs and ORFans at each phylogenetic depth, from younger to older: n0, n1, n2, n3, n4, native. Clade labels on the tree: E. coli, E. coli + Shigella, S. enterica, enterics, Vibrio/Haem, γ-proteobacteria.]

ORFans in A+T rich genomes
[Figure: two panels comparing G+C3 of “natives” and ORFans, one for Helicobacter pylori and one for Streptococcus pneumoniae]
Daubin et al., 2003

Features of ORFans
• ORFans arise quickly in genomes & can be strain-specific
  - Do not originate from native DNA that is shared among strains
• ORFans are short and very A+T-rich
  - Consistently A+T-richer donor?
• Average Ka/Ks of ORFans is much less than 1 (often < 0.2)
  - Most ORFans are functional (although functions are unassigned)
• ORFans evolve faster than other genes in the genome
  - Under less constraint, or possibly due to positive selection

ORFans originate by lateral gene transfer, but by different vehicles, mechanisms or processes than HOPs (which are present in other Bacteria, Archaea or Eukaryotes)
Given their base composition, lack of homologs & functional status, ORFans most likely derive from DNA phages (which are poorly represented in the databases)
Rocha and Danchin, 2002

And if you are not yet convinced,
1. Younger ORFans tend to be clustered (as if co-inherited in a single event), whereas older ORFans are dispersed (by rearrangements & deletions). ORFan cluster sizes average 2.1 genes in n0/n1 and 1.3 genes in n4
2. Genes in DNA phage genomes are short. Average is 615 bp, and only 471 bp for those encoding ‘hypothetical’ proteins
3. ORFans often occur at tRNA genes or near translocatable sequences
4. ORFans in E. coli have di-nucleotide frequencies close to coliphages

ORFans are conserved through time and may assume key functions
- An ORFan from n2 (only in E. coli and Salmonella) is the ribosomal protein S22, expressed in stationary phase
- Some ORFans from n3 (restricted to the enterics) have been retained in the highly reduced genome of Buchnera: e.g., dnaT and dnaC, which are essential to E. coli
Daubin & Ochman, 2004

The genealogy of bacterial genomes
[Figure: tree with a number in parentheses for each genome: S. typhimurium (4206), E. coli (4187), Y. pestis KIM (3883), Y. pestis CO92 (3599), W. brevipalpis (653), B. aphidicola (564), P. multocida (2015), H. influenzae (1709), V. cholerae (3805), P. aeruginosa (5540), X. fastidiosa (2680), X. campestris (4030), X. axonopodis (4193)]
• Ubiquitous genes are rare and show little evidence of LGT
• Genes seem to be acquired continuously
• Most of the acquired genes are completely new for the genome (no homologs)
• A lot of them are even ORFans; such genes appear as a contribution of phage to bacterial evolution
• Because genomes are not increasing in size, non-homologous replacement may play a major role

Acknowledgements
• Emmanuelle Lerat (esp. LGT in Gamma-proteobacteria!)
• Manolo Gouy
• Guy Perrière
• Howard Ochman
• Nancy Moran