Computational and mathematical challenges involved in very largescale

How did life evolve on earth? An international effort to understand how life evolved

Steps in a phylogenetic analysis • Gather data • Align sequences • Reconstruct phylogeny

Reconstructing the “Tree” of Life Handling large datasets: millions of species, and most problems

The CIPRES Project (Cyber-Infrastructure for Phylogenetic Research) The US National Science Foundation funds this

This talk • Part 1: Overview of some interesting mathematical issues • Part 2:

Part 1: Mathematical issues under stochastic models

DNA Sequence Evolution -3 mil yrs AAGACTT AAGGCCT AGGGCAT TAGCCCA -2 mil yrs TGGACTT

Markov models of single site evolution Simplest (Jukes-Cantor): • The model tree is a

Modelling variation between characters: Rates-across-sites B D A C B A D C •

Modelling variation between characters: The “no-common-mechanism” model • A separate random variable for every

Identifiability and statistical consistency • A model is identifiable if it is uniquely characterized

Identifiability results • The “standard” Markov models (from Jukes-Cantor to the General Markov model)

What about phylogeny reconstruction methods? U AGGGCAT V W TAGCCCA X TAGACTT Y TGCACAA

Phylogenetic reconstruction methods 1. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum

Performance criteria • Running time. • Space. • Statistical performance issues (e. g. ,

Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50%

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Theoretical results: statistical consistency under typical models? • Neighbor Joining is polynomial time, and

Theoretical results: sequence length requirements under typical models? • Neighbor joining (and some other

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]

DCM 1 Warnow, St. John, and Moret, SODA 2001 Exponentially converging method DCM SQS

DCM 1 -boosting distance-based methods [Nakhleh et al. ISMB 2001] Error Rate 0. 8

However, • The best accuracy in simulation tends to be from computationally intensive methods

Part 2: Improved heuristics for NP-hard optimization problems

Approaches for “solving” MP/ML 1. 2. 3. Hill-climbing heuristics (which can get stuck in

Problems with current techniques for MP Shown here is the performance of a TNT

Observations • The best MP heuristics cannot get acceptably good solutions within 24 hours

Rec-I-DCM 3: a new technique (Roshan et al. ) • Combines a new decomposition

The DCM 3 decomposition Input: Set S of sequences, and guide-tree T 1. Compute

Rec-I-DCM 3 significantly improves performance (Roshan et al. ) Current best techniques DCM boosted

Observations • Rec-I-DCM 3 improves upon the best performing heuristics for MP. • The

Basic observations about standard two-phase methods • Many new MSA methods improve on Clustal.

What about simultaneous estimation? • Several Bayesian methods for simultaneous estimation of trees and

New method: SATe (Simultaneous Alignment and Tree estimation) • Developers: Warnow, Linder, Liu, Nelesen,

Simulation study • 100 taxon model trees (generated by r 8 s and then

Topological accuracy FN (false negative): proportion of correct edges missing from the estimated tree

Alignment accuracy • Normalized number of columns in the estimated alignment relative to the

Alignment accuracy FN: proportion of correctly homologous pairs of nucleotides missing from the estimated

Other observations • SATe is more accurate at estimating the number of gaps and

Part IV: Open questions • Tree shape (including branch lengths) has an impact on

General comments • Current models of sequence evolution are clearly too simple, and more

Acknowledgements • Funding: NSF, The David and Lucile Packard Foundation, The Program in Evolutionary

Slides: 47

Download presentation

Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin

How did life evolve on earth? An international effort to understand how life evolved on earth Biomedical applications: drug design, protein structure and function prediction, biodiversity. • Courtesy of the Tree of Life project

Steps in a phylogenetic analysis • Gather data • Align sequences • Reconstruct phylogeny on the multiple alignment often obtaining a large number of trees • Compute consensus (or otherwise estimate the reliable components of the evolutionary history) • Perform post-tree analyses.

Reconstructing the “Tree” of Life Handling large datasets: millions of species, and most problems are NP-hard Evolution is complex: Reticulate evolution Indels, genome rearrangements, etc. Gene tree/species tree differences

The CIPRES Project (Cyber-Infrastructure for Phylogenetic Research) The US National Science Foundation funds this project, which has the following major components: • ALGORITHMS and SOFTWARE: scaling to millions of sequences (open source, freely distributed) • MATHEMATICS/PROBABILITY/STATISTICS: Obtaining better mathematical theory under complex models of evolution • DATABASES: Producing new database technology for structured data, to enable scientific discoveries • SIMULATIONS: The first million taxon simulation under realistically complex models • OUTREACH: Museum partners, K-12, general scientific public • PORTAL available to all researchers • See www. phylo. org for more about CIPRES.

This talk • Part 1: Overview of some interesting mathematical issues • Part 2: Better heuristics for NP-hard optimization problems • Part 3: New methods for simultaneous estimation of trees and alignments • Part 4: Related research and open problems.

Part 1: Mathematical issues under stochastic models

DNA Sequence Evolution -3 mil yrs AAGACTT AAGGCCT AGGGCAT TAGCCCA -2 mil yrs TGGACTT TAGACTT AGCACAA AGCGCTT -1 mil yrs today

Markov models of single site evolution Simplest (Jukes-Cantor): • The model tree is a pair (T, {e, p(e)}), where T is a rooted binary tree, and p(e) is the probability of a substitution on the edge e. • The state at the root is random. • If a site changes on an edge, it changes with equal probability to each of the remaining states. • The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to theory.

Modelling variation between characters: Rates-across-sites B D A C B A D C • If a site (i. e. , character) is twice as fast as another on one edge, it is twice as fast everywhere. • The distribution of the rates is typically assumed to be gamma.

Modelling variation between characters: The “no-common-mechanism” model • A separate random variable for every combination of site and edge the underlying tree is fixed, but otherwise there are no constraints on variation between sites. C A B D A B C D

Identifiability and statistical consistency • A model is identifiable if it is uniquely characterized by the probability distribution it defines. • A phylogenetic reconstruction method is statistically consistent under a model if the probability that the method reconstructs the true tree goes to 1 as the sequence length increases.

Identifiability results • The “standard” Markov models (from Jukes-Cantor to the General Markov model) are identifiable. • These models are also identifiable when sites draw rates from a gamma distribution (easy to prove if the distribution is known, and harder to prove if the distribution must be estimated - cf. Allman’s talk). • However, mixed models are often not identifiable (cf. Matsen’s talk), nor are some models in which sites draw rates from more complex distributions. Phylogeny estimation typically is done under identifiable models.

What about phylogeny reconstruction methods? U AGGGCAT V W TAGCCCA X TAGACTT Y TGCACAA X U Y V W TGCGCTT

Phylogenetic reconstruction methods 1. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood) Local optimum Cost Global optimum Phylogenetic trees 2. 3. Polynomial time distance-based methods: Neighbor Joining, Fast. ME, Weighbor, etc. Bayesian methods

Performance criteria • Running time. • Space. • Statistical performance issues (e. g. , statistical consistency and sequence length requirements) • “Topological accuracy” with respect to the underlying true tree. Typically studied in simulation. • Accuracy with respect to a particular criterion (e. g. tree length or likelihood score), on real data.

Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Theoretical results: statistical consistency under typical models? • Neighbor Joining is polynomial time, and statistically consistent. • Maximum Parsimony is NP-hard, and even exact solutions are not statistically consistent. • Maximum Likelihood is NP-hard, but exact solutions are statistically consistent

Theoretical results: sequence length requirements under typical models? • Neighbor joining (and some other distance-based methods) will return the true tree with high probability provided sequence lengths are exponential in the diameter of the tree (Erdos et al. , Atteson). • Maximum likelihood will return the true tree with high probability provided sequence lengths are exponential in the number of taxa (Steel and Szekely).

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] Error Rate 0. 8 NJ Simulation study based upon fixed edge lengths, K 2 P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. 0. 6 0. 4 0. 2 0 0 400 800 No. Taxa 1200 1600

DCM 1 Warnow, St. John, and Moret, SODA 2001 Exponentially converging method DCM SQS Absolute fast converging method • A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the “best” tree. • The “base method” is applied to subsets of the original dataset. When the base method is NJ, you get DCM 1 -NJ.

DCM 1 -boosting distance-based methods [Nakhleh et al. ISMB 2001] Error Rate 0. 8 NJ DCM 1 -NJ 0. 6 0. 4 • Theorem: DCM 1 -NJ converges to the true tree from polynomial length sequences 0. 2 0 0 400 800 No. Taxa 1200 1600

However, • The best accuracy in simulation tends to be from computationally intensive methods (and most molecular phylogeneticists prefer these methods). • Unfortunately, these approaches can take weeks or more, just to reach decent local optima. • Conclusion: We need better heuristics!

Part 2: Improved heuristics for NP-hard optimization problems

Approaches for “solving” MP/ML 1. 2. 3. Hill-climbing heuristics (which can get stuck in local optima) Randomized algorithms for getting out of local optima Approximation algorithms for MP (based upon Steiner Tree approximation algorithms). Local optimum Cost Global optimum Phylogenetic trees

Problems with current techniques for MP Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14, 000 sequences. (“Optimal” here means best score to date, using any method for any amount of time. ) Acceptable error is below 0. 01%. Performance of TNT with time

Observations • The best MP heuristics cannot get acceptably good solutions within 24 hours on some large datasets. • Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions. • Apparent convergence can be misleading.

Rec-I-DCM 3: a new technique (Roshan et al. ) • Combines a new decomposition technique (DCM 3) with recursion and iteration, to produce a novel approach for escaping local optima • Tested initially on MP (maximum parsimony), but also implemented for ML and other optimization problems

The DCM 3 decomposition Input: Set S of sequences, and guide-tree T 1. Compute short subtree graph G(S, T), based upon T 2. Find clique separator in the graph G(S, T) and form subproblems DCM 3 decompositions (1) can be obtained in O(n) time (the short subtree graph is triangulated) (2) yield small subproblems (3) can be used iteratively

Iterative-DCM 3 T DCM 3 Base method T’

Rec-I-DCM 3 significantly improves performance (Roshan et al. ) Current best techniques DCM boosted version of best techniques Comparison of TNT to Rec-I-DCM 3(TNT) on one large dataset

Observations • Rec-I-DCM 3 improves upon the best performing heuristics for MP. • The improvement increases with the difficulty of the dataset. Rec-I-DCM 3 also improves maximum likelihood (using RAx. ML) analyses (data not shown), and is included in the CIPRES portal.

Part 3: Multiple sequence alignment

Basic observations about standard two-phase methods • Many new MSA methods improve on Clustal. W, with Prob. Cons and MAFFT the two best MSA methods. • Although alignment accuracy correlates with phylogenetic accuracy, it has less effect than might be expected (Wang et al. , unpublished).

What about simultaneous estimation? • Several Bayesian methods for simultaneous estimation of trees and alignments have been developed, but none can be applied to datasets with more than (approx. ) 20 sequences. • POY attempts to solve the NP-hard “minimum length tree” problem, where gaps contribute to the length of the tree and can be applied to large datasets. However, its performance on simulated data isn’t competitive with the best two-phase methods (unpublished data).

New method: SATe (Simultaneous Alignment and Tree estimation) • Developers: Warnow, Linder, Liu, Nelesen, and Zhao. • Basic technique: propose alignments (using treelength under a selected affine gap penalty), and compute maximum likelihood trees for these alignments under GTR. • Unpublished.

Simulation study • 100 taxon model trees (generated by r 8 s and then modified, so as to deviate from the molecular clock). • DNA sequences evolved under ROSE (indel events of blocks of nucleotides, plus HKY site evolution). The root sequence has 1000 sites. • We vary the gap length distribution, probability of gaps, and probability of substitutions, to produce 8 model conditions: models 1 -4 have “long gaps” and 5 -8 have “short gaps”. • We compared RAx. ML on various alignments (including the true alignment) to SATe.

Topological accuracy FN (false negative): proportion of correct edges missing from the estimated tree FP (false positive): proportion of incorrect edges in the estimated tree

Alignment accuracy • Normalized number of columns in the estimated alignment relative to the true alignment.

Alignment accuracy FN: proportion of correctly homologous pairs of nucleotides missing from the estimated alignment (i. e. , 1 -SP score). FP: proportion of incorrect pairings of nucleotides in the estimated alignment.

Other observations • SATe is more accurate at estimating the number of gaps and the correct alignment length than other methods. • The standard alignment accuracy measure, SP, is not particularly predictive of phylogenetic accuracy. • We still need methods for MSA under models that include rearrangements and duplications.

Part IV: Open questions • Tree shape (including branch lengths) has an impact on phylogeny reconstruction - but what model of tree shape to use? • What is the sequence length requirement for Maximum Likelihood? (Current bound is probably too large. ) • Why is MP not so bad? • How to detect and reconstruct reticulate evolution? • “Teasing apart” trees under complex models? • What complex models are identifiable?

General comments • Current models of sequence evolution are clearly too simple, and more realistic ones are not identifiable. • Relative performance between methods can change as the models become more complex or as the number of taxa increases. • We do not know how methods perform under realistic conditions (nor how long we need to let computationally intensive methods run).

Acknowledgements • Funding: NSF, The David and Lucile Packard Foundation, The Program in Evolutionary Dynamics at Harvard, and The Institute for Cellular and Molecular Biology at UT-Austin. • Collaborators: Peter Erdos, Daniel Huson, Randy Linder, Kevin Liu, Bernard Moret, Serita Nelesen, Usman Roshan, Mike Steel, Katherine St. John, Laszlo Szekely, Tiffani Williams, and David Zhao.