
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow

Questions
• What is a phylogeny?
• What data are used?
• What is involved in a phylogenetic analysis?
• What are the most popular methods?
• What is meant by “accuracy”, and how is it measured?

Phylogeny
[Figure: tree relating Orangutan, Gorilla, Chimpanzee and Human. From the Tree of Life website, University of Arizona.]

Data
• Biomolecular sequences: DNA, RNA, amino acid, in a multiple alignment
• Molecular markers (e.g., SNPs, RFLPs, etc.)
• Morphology
• Gene order and content
These are “character data”: each character is a function mapping the set of taxa to distinct states (equivalence classes), with evolution modelled as a process that changes the state of a character.

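DNA Sequence Evolution
[Figure: sequences evolving down a tree along a timeline running from -3 mil yrs through -2 mil yrs and -1 mil yrs to today. Sequences shown: AAGACTT, AAGGCCT, TGGACTT, AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]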

Phylogeny Problem
[Figure: five taxa (U, V, W, X, Y) with the leaf sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and a tree on U, V, W, X, Y.]

Indels and substitutions at the DNA level
A deletion and a mutation (substitution) turn …ACGGTGCAGTTACCA… into …ACCAGTCACCA…
The true pairwise alignment is:
…ACGGTGCAGTTACCA…
…AC----CAGTCACCA…
The true multiple alignment on a set of homologous sequences is obtained by tracing their evolutionary history, and extending the pairwise alignments on the edges to a multiple alignment on the leaf sequences.

Easy Sequence Alignment
[Figure: alignment of the sequences B_WEAU160, A_U455, A_IFA86, A_92UG037, A_Q23, B_SF2, B_LAI, B_F12, B_HXB2R, B_LW123, B_NL43, B_NY5, B_MN, B_JRCSF, B_JRFL, B_NH52, B_OYI, B_CAM1 against the first row ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGG. Only a handful of sites differ and no gaps are shown; every row ends at position 45.]

Harder Sequence Alignment
[Figure: alignment of the sequences B_WEAU160, A_U455, A_SF1703, A_92RW020.5, A_92UG031.7, A_92UG037.8, A_TZ017, A_UG275A, A_UG273A, A_DJ258A, A_KENYA, A_CARGAN, A_CARSAS, A_CAR4054, A_CAR286A, A_CAR4023, A_CAR423A, A_VI191A against the first row ATGAGAGTGAAGGGGATCAGGAAGAATTATCAGCACTTG. Many sites differ and at least one row contains a run of gaps; rows end at positions 30, 35 or 39.]

Multiple sequence alignment
Objective: Estimate the “true alignment” (defined by the sequence of evolutionary events)
Typical approach:
1. Estimate an initial tree
2. Estimate a multiple alignment by performing a “progressive alignment” up the tree, using Needleman-Wunsch (or a variant) to align alignments
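The pairwise step inside a progressive alignment can be sketched as a plain Needleman-Wunsch dynamic program. This is only a minimal illustration for two sequences: the function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example and do not come from the slides, and real aligners use more elaborate gap costs and align whole alignments rather than single sequences.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; the scoring values are illustrative assumptions."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


# The two sequences from the indel slide.
top, bottom, total = needleman_wunsch("ACGGTGCAGTTACCA", "ACCAGTCACCA")
print(top)
print(bottom)
```

The alignment returned is optimal for the assumed scores, which need not coincide with the true alignment defined by the evolutionary history.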

[Figure: a tree on taxa U, V, W, X, Y whose leaf sequences have unequal lengths: AGTGGAT, TATGCCCA, TATGACTT, AGCCCTA, AGCCCGCTT.]

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACA

Phase 1: Multiple Sequence Alignment
Unaligned:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACA
Aligned:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Phase 2: Construct tree
[Figure: the unaligned sequences S1 to S4, the multiple alignment from Phase 1, and a tree with leaves S1, S2, S3, S4 constructed from that alignment.]

So many methods!!!
Alignment method
• Clustal
• POY (and POY*)
• Probcons (and Probtree)
• MAFFT
• Prank
• Muscle
• Di-align
• T-Coffee
• Satchmo
• Etc.
Blue = used by systematists
Purple = recommended by the protein research community (Edgar and Batzoglou, for protein alignments)
Phylogeny method
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• UPGMA
• Quartet puzzling
• Etc.

Phylogenetic reconstruction methods
1. Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc.
2. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
3. Bayesian methods
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]

UPGMA
While |S| > 2:
    find the pair x, y of closest taxa; delete x
    recurse on S-{x}
    insert x as a sibling to y
Return the tree
[Figure: rooted tree on leaves a, b, c, d, e.]
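The pseudocode above can be made concrete with a short iterative sketch. It assumes the standard UPGMA update, in which the distance from a merged cluster to any other cluster is the size-weighted average of the two old distances; the slide does not spell this step out. The function name, the nested-tuple tree representation and the toy distances are all choices made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Return a rooted tree as nested 2-tuples.

    names: list of taxon names.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = {frozenset(p): dist[frozenset(p)] for p in combinations(names, 2)}
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        x, y = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        # Distance to the merged cluster is the size-weighted average.
        for z in size:
            if z != x and z != y:
                d[frozenset((merged, z))] = (
                    size[x] * d[frozenset((x, z))] + size[y] * d[frozenset((y, z))]
                ) / (size[x] + size[y])
        size[merged] = size[x] + size[y]
        del size[x], size[y]
    return next(iter(size))


# Toy clocklike (ultrametric) distances, invented for illustration.
toy = {
    frozenset("ab"): 2, frozenset("ac"): 6, frozenset("bc"): 6,
    frozenset("ad"): 10, frozenset("bd"): 10, frozenset("cd"): 10,
}
print(upgma(["a", "b", "c", "d"], toy))
```

On ultrametric input such as this toy matrix the closest pair is always a sibling pair, which is why the method recovers the tree when evolution is clocklike.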

UPGMA works when evolution is “clocklike”
[Figure: clocklike tree on leaves a, b, c, d, e.]

UPGMA fails to produce the true tree if evolution deviates too much from a clock!
[Figure: non-clocklike tree on leaves a, b, c, d, e.]

Performance criteria
• Running time.
• Space.
• Statistical performance issues (e.g., statistical consistency and sequence length requirements).
• “Topological accuracy” with respect to the underlying true tree. Typically studied in simulation.
• Accuracy with respect to a mathematical score (e.g., tree length or likelihood score) on real data.

Distance-based Methods

Additive Distance Matrices

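Four-point condition
• A matrix D is additive if and only if, for every four indices i, j, k, l, the maximum and median of the three pairwise sums $D_{ij} + D_{kl}$, $D_{ik} + D_{jl}$, $D_{il} + D_{jk}$ are identical. With the indices labelled so that the first sum is the smallest: $D_{ij} + D_{kl} \le D_{ik} + D_{jl} = D_{il} + D_{jk}$
• The Four-Point Method computes trees on quartets using the four-point condition: the smallest of the three sums identifies the two sibling pairs (here i, j and k, l).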
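As a small sketch of the Four-Point Method on a single quartet, the function below picks the pairing with the smallest pairwise sum and a second function checks the four-point condition itself. The function names, the tolerance parameter and the example matrix are assumptions made for illustration; the matrix was built from a tree ((0,1),(2,3)) with leaf edge lengths 1, 2, 3, 4 and internal edge length 5.

```python
def four_point_split(D, i, j, k, l):
    """Return the quartet split whose pairwise sum is smallest."""
    sums = {
        ((i, j), (k, l)): D[i][j] + D[k][l],
        ((i, k), (j, l)): D[i][k] + D[j][l],
        ((i, l), (j, k)): D[i][l] + D[j][k],
    }
    return min(sums, key=sums.get)


def satisfies_four_point(D, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    s = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return abs(s[2] - s[1]) <= tol


# Additive matrix for the tree ((0,1),(2,3)) described above.
D = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(four_point_split(D, 0, 1, 2, 3))      # ((0, 1), (2, 3))
print(satisfies_four_point(D, 0, 1, 2, 3))  # True
```

With estimated rather than exact distances the two largest sums will rarely be exactly equal, but the smallest sum can still be used to choose the split.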

Naïve Quartet Method
• Compute the tree on each quartet using the four-point condition
• Merge them into a tree on the entire set if they are compatible:
– Find a sibling pair A, B
– Recurse on S-{A}
– If S-{A} has a tree T, insert A into T by making A a sibling to B, and return the tree

Better distance-based methods
• Neighbor Joining
• Minimum Evolution
• Weighted Neighbor Joining
• Bio-NJ
• DCM-NJ
• And others

Quantifying Error
• FN: false negative (missing edge)
• FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree with the FN and FP edges marked; 50% error rate.]
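One common way to compute these rates, sketched here under stated assumptions, is to represent each internal edge by the bipartition of the taxa it induces and compare the two sets of bipartitions. The helper names, the normalisation rule (keep the side that does not contain the alphabetically first taxon) and the two five-taxon trees are hypothetical choices for this example, not taken from the slides.

```python
def normalize(side, taxa):
    """Canonical form of a bipartition: the side without the smallest taxon."""
    side = frozenset(side)
    return side if min(taxa) not in side else frozenset(taxa) - side


def error_rates(true_splits, est_splits):
    """Return (FN rate, FP rate) for two sets of normalised bipartitions."""
    fn = len(true_splits - est_splits) / len(true_splits)  # missing edges
    fp = len(est_splits - true_splits) / len(est_splits)   # incorrect edges
    return fn, fp


taxa = "abcde"
# Hypothetical true tree ((a,b),c,(d,e)) and estimated tree ((a,c),b,(d,e)),
# each described by its two internal edges.
true_tree = {normalize("ab", taxa), normalize("de", taxa)}
est_tree = {normalize("ac", taxa), normalize("de", taxa)}
print(error_rates(true_tree, est_tree))  # (0.5, 0.5)
```

In this toy case one of the two true edges is missing and one of the two estimated edges is wrong, so both rates are 50%.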

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
[Figure: NJ error rate (y-axis, 0 to 0.8) against number of taxa (x-axis, 0 to 1600).]
Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees.

“Character-based” methods
• Maximum parsimony
• Maximum Likelihood
• Bayesian MCMC (also likelihood-based)
These are more popular than distance-based methods, and tend to give more accurate trees. However, these are computationally intensive!

Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)
• Input: Set S of n aligned sequences of length k
• Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the internal nodes of T
such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where $H(u,v)$ is the Hamming distance between the sequences labeling the endpoints of edge $(u,v)$.

Maximum parsimony (example)
• Input: Four sequences
– ACT
– ACA
– GTT
– GTA
• Question: which of the three trees has the best MP score?

Maximum Parsimony
[Figure: the three possible unrooted trees on the leaves ACT, ACA, GTT, GTA, drawn with internal node labels and the Hamming distance on each edge. The three labeled trees shown have MP score = 7, MP score = 5 and MP score = 4; the tree with MP score = 4 is the optimal MP tree.]

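Maximum Parsimony: computational complexity
• Optimal labeling of a given tree can be computed in linear time O(nk)
• Finding the optimal MP tree is NP-hard
[Figure: the tree with leaves ACA, ACT, GTA, GTT and internal nodes labeled ACA and GTA; edge costs 1, 2, 1; MP score = 4.]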
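The slides do not name the algorithm behind the linear-time labeling; one standard choice is Fitch's algorithm, sketched below for binary trees. The function name and the nested-tuple tree encoding (each unrooted quartet tree is rooted on its internal edge, which does not change the score) are assumptions of this example. The sketch returns the minimum score for each topology, which can be lower than the score of a particular labeling drawn on a slide.

```python
def fitch_score(tree):
    """Minimum parsimony score of a rooted binary tree.

    tree: nested 2-tuples whose leaves are equal-length sequences (strings).
    """
    def leaves(node):
        return [node] if isinstance(node, str) else leaves(node[0]) + leaves(node[1])

    k = len(leaves(tree)[0])
    total = 0
    for site in range(k):
        def visit(node):
            # Returns the Fitch state set for this node at the current site.
            nonlocal total
            if isinstance(node, str):
                return {node[site]}
            left, right = visit(node[0]), visit(node[1])
            common = left & right
            if common:
                return common
            total += 1  # disjoint child sets force one substitution
            return left | right
        visit(tree)
    return total


# The three quartet topologies on the example sequences.
print(fitch_score((("ACT", "ACA"), ("GTT", "GTA"))))  # 4
print(fitch_score((("ACT", "GTT"), ("ACA", "GTA"))))  # 5
print(fitch_score((("ACT", "GTA"), ("ACA", "GTT"))))  # 6
```

Each site is handled in one pass over the tree, so the total work is proportional to the number of nodes times the sequence length, in line with the O(nk) bound.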

But solving this problem exactly is … unlikely
# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
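The counts in this table follow the standard formula for the number of unrooted binary trees on n labeled leaves, (2n-5)!! = 3 x 5 x ... x (2n-5). A short sketch (the function name is chosen for this example) reproduces the small entries exactly:

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # odd numbers 3, 5, ..., 2n-5
        count *= odd
    return count


for n in (4, 5, 6, 7, 8, 9, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Because Python integers are unbounded, the same function also handles n = 100 or n = 1000, although the results are far too large to enumerate in any search.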

Local search strategies
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]

Local search strategies
• Hill-climbing based upon topological changes to the tree
• Incorporating randomness to exit from local optima

Evaluating heuristics with respect to MP or ML scores
[Figure (fake study): score of best trees against time, with one curve for the performance of Heuristic 1 and one for the performance of Heuristic 2.]

“Boosting” MP heuristics
• We use “Disk-covering methods” (DCMs) to improve heuristic searches for MP and ML
[Figure: a base method M is turned into DCM-M.]

Rec-I-DCM3 significantly improves performance (Roshan et al.)
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset, with curves labelled “Current best techniques” and “DCM boosted version of best techniques”.]

Current methods
• Maximum Parsimony (MP):
– TNT
– PAUP* (with Rec-I-DCM3)
• Maximum Likelihood (ML):
– RAxML (with Rec-I-DCM3)
– GARLI
– PAUP*
• Datasets with up to a few thousand sequences can be analyzed in a few days
• Portal at www.phylo.org

But…
[Figure: a tree on taxa U, V, W, X, Y whose leaf sequences have unequal lengths: AGTGGAT, TATGCCCA, TATGACTT, AGCCCTA, AGCCCGCTT.]

• Phylogenetic reconstruction methods assume the sequences all have the same length.
• Standard models of sequence evolution used in maximum likelihood and Bayesian analyses assume sequences evolve only via substitutions, producing sequences of equal length.
• And yet, almost all nucleotide datasets evolve with insertions and deletions (“indels”), producing datasets that violate these models and methods.
How can we reconstruct phylogenies from sequences of unequal length?

Basic Questions
• Does improving the alignment lead to an improved phylogeny?
• Are we getting good enough alignments from MSA methods? (In particular, is ClustalW, the usual method used by systematists, good enough?)
• Are we getting good enough trees from the phylogeny reconstruction methods?
• Can we improve these estimations, perhaps through simultaneous estimation of trees and alignments?

DNA sequence evolution
Simulation using ROSE: 100 taxon model trees, models 1-4 have “long gaps”, and 5-8 have “short gaps”, site substitution is HKY+Gamma

Results Model difficulty