Understanding sets of trees CS 394 C September 10, 2009

Basic challenge • Phylogenetic analyses are sometimes based upon a single marker, but often based upon many markers • Each marker can be analyzed separately, or the entire set can be combined into one “super-matrix” • Each matrix (each dataset) can result in many trees (almost no matter how you analyze the matrix) What to do with huge numbers of trees?

What to do? • How to estimate evolutionary history from many trees • How to efficiently store large sets of trees • How to enable efficient queries of the set of trees

First, a few questions: • Why are gene trees different from the species tree? • Why are estimated gene trees different from the true gene tree? • Under what conditions is the true evolutionary history not a tree? (i.e., what is “reticulation”?)

Reticulation • Evolutionary histories can be reticulate (meaning non-treelike): – Horizontal Gene Transfer (HGT) – Hybrid speciation – Recombination • Most phylogeny estimation methods produce trees. • Good resource about reticulate phylogenies: book chapter by Luay Nakhleh (see 394 C webpage for the link)

• We will assume that all evolutionary histories are treelike for the remainder of today’s presentation. • Later in the course we’ll discuss reticulate evolution…

Estimated Gene Trees can differ from Species Trees • Biological reasons: – Deep coalescent events (alleles) – Gene duplication and loss (gene families) • Computational reasons: – Insufficient time – Poor methods (e.g., UPGMA) – Poor models (e.g., ML using Jukes-Cantor) • Data issues: – Insufficient data (meaning not enough sites) – Poor alignments

Examples of problems When true gene trees can differ from species tree: • Given a collection of gene trees, find a species tree that minimizes the number of “deep coalescent” events When true gene trees should equal the species tree: • Given a collection of gene trees, find a species tree that minimizes the total distance to the gene trees

When gene trees can differ from species tree • Software/Algorithms for deep coalescence (see PhyloNet from Nakhleh’s webpage at Rice): – GLASS (Roch and Mossel) - distance-based – MDC (Than and Nakhleh) - parsimony – STEM (Kubatko) - ML – BEST (Liu et al.) - Bayesian – BUCKy (Ané et al.) - Bayesian • Software/Algorithms for duplication-loss: – NOTUNG (Durand) – DupTree (Bansal et al.) – Hallett and Lagergren - algorithms/complexity

When gene trees should equal the species tree • The problem here is that estimated gene trees can differ from the true gene trees. • Although the problem is “simple”, it is still interesting -- computationally and mathematically. • Plus, we can still make novel contributions.

The very simplest problem Easiest case: • One species tree; true gene trees will agree with the species tree • Estimated trees are on the full set of taxa Approaches: • Consensus methods: return a tree on the entire set S of taxa summarizing the input trees • Agreement methods: return a tree on a subset of the taxa on which the trees agree • Clustering, then consensus/agreement

Consensus methods • These are the most common ways of analyzing datasets of trees • Examples: – Strict consensus – Majority consensus – Greedy consensus (aka “extended majority”) – Others less frequently used include: Gordon’s, Adams, the Strict Consensus Supertree, Local Consensus methods, and more. • See the survey paper by David Bryant for some of these
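
As a rough illustration of strict and majority consensus, the sketch below represents each unrooted input tree by its set of non-trivial bipartitions and keeps the bipartitions that occur in all trees (strict) or in more than half of them (majority). The function names and the convention of storing each bipartition as the side that excludes a reference taxon are assumptions made for this example, not part of any particular software package. Greedy consensus is not shown.

```python
from collections import Counter

def normalize(split, taxa, ref):
    """Store a bipartition A|B as the side that does NOT contain `ref`."""
    side = frozenset(split)
    return frozenset(taxa) - side if ref in side else side

def split_counts(trees, taxa):
    """Count in how many input trees each (normalized) bipartition occurs."""
    ref = min(taxa)  # arbitrary but fixed reference taxon
    counts = Counter()
    for tree in trees:
        counts.update({normalize(s, taxa, ref) for s in tree})
    return counts

def strict_consensus(trees, taxa):
    """Bipartitions present in every input tree."""
    counts = split_counts(trees, taxa)
    return {s for s, c in counts.items() if c == len(trees)}

def majority_consensus(trees, taxa):
    """Bipartitions present in more than half of the input trees."""
    counts = split_counts(trees, taxa)
    return {s for s, c in counts.items() if c > len(trees) / 2}

# Hypothetical input: three trees on taxa 1..5, each given by one side
# of each of its non-trivial bipartitions.
taxa = {1, 2, 3, 4, 5}
trees = [
    [{1, 2}, {1, 2, 3}],  # 12|345 and 123|45
    [{1, 2}, {1, 2, 4}],  # 12|345 and 124|35
    [{1, 3}, {4, 5}],     # 13|245 and 123|45
]
majority = majority_consensus(trees, taxa)
strict = strict_consensus(trees, taxa)
```

Here 12|345 and 123|45 each occur in two of the three trees, so both appear in the majority consensus, while no bipartition is shared by all three trees and the strict consensus is a star.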

Simplest problems, cont. • “Agreement” methods return trees on subsets of S, on which the trees are the same (or compatible) – MAST: maximum agreement subtree (used in practice, sometimes) – MCST: maximum compatible subtree (Ganapathy et al., not used in practice) • The difference between these is how polytomies are handled

Soft vs. hard polytomies • Polytomy: node of high degree (greater than three for an unrooted tree) • Polytomies arise in estimations when consensus methods are used • Polytomies also arise when contracting short branches in estimated trees • Polytomies can be “hard” (representing true radiations) or “soft” (representing lack of information)

Compatible source trees • Estimated trees can be “compatible” when we interpret polytomies as “soft” • “Compatible” means that there is a tree which is a common refinement. • Example: 123|456, 12|3456, 1235|46. • We can compute the compatibility tree (when it exists) in O(nk) time, where n=|S| and there are k source trees
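
The example bipartitions on this slide can be checked with the standard pairwise test: two bipartitions A|B and C|D are compatible exactly when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty, and a set of bipartitions has a common tree exactly when every pair is compatible. The sketch below is a naive all-pairs check, not the O(nk) algorithm mentioned above; the string notation and function names are assumptions for illustration.

```python
from itertools import combinations

def parse(split):
    """Turn a string such as '123|456' into a pair of frozensets of taxa."""
    a, b = split.split("|")
    return frozenset(a), frozenset(b)

def compatible(s, t):
    """A|B and C|D are compatible iff one of the four intersections is empty."""
    (a, b), (c, d) = s, t
    return any(not (x & y) for x in (a, b) for y in (c, d))

def all_compatible(splits):
    """A set of bipartitions is compatible iff it is pairwise compatible."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))

slide_example = [parse(s) for s in ["123|456", "12|3456", "1235|46"]]
conflicting = [parse(s) for s in ["12|3456", "13|2456"]]

print(all_compatible(slide_example))
print(all_compatible(conflicting))
```

The three bipartitions from the slide pass the test, whereas 12|3456 and 13|2456 fail it because all four intersections are non-empty.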

Computational complexity • Most consensus methods (which return a tree on the entire set S of taxa) are polynomial time. • Most “agreement methods” (which return a tree on the largest subset of the taxa on which the source trees “agree”) are based upon NP-hard problems. Some (e.g., MAST) have fixed-parameter polynomial time solutions.

Supertree problems • Realistic complexity: not all the source trees are on the same set of taxa. • Obvious problems: – Find the tree on which all the source trees agree (if it exists). – Find the tree on which a maximum number of the source trees agree. • Both are NP-hard.

Quartet compatibility • Simple case: all the source trees are on four taxa. • We ask: does there exist a tree which agrees with all the source trees? • NP-hard!

Quartet tree amalgamation • Given a collection of quartet trees, find a tree which agrees with a maximum number of these quartet trees • NP-hard, since compatibility is NP-hard • Hard to approximate, but there is a PTAS if you have a tree on every quartet of taxa (Jiang et al.)
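
To make the optimization criterion concrete, the sketch below scores one candidate tree against a collection of quartet trees. It uses the fact that an unrooted tree agrees with the quartet ab|cd exactly when some edge of the tree separates {a, b} from {c, d}. It only evaluates a given tree and makes no attempt to search for the best one; the data layout and names are assumptions for this example.

```python
def displays(splits, quartet):
    """True if some bipartition of the tree separates {a, b} from {c, d}."""
    (a, b), (c, d) = quartet
    left, right = {a, b}, {c, d}
    return any(
        (left <= side and right <= other) or (left <= other and right <= side)
        for side, other in splits
    )

def quartet_score(splits, quartets):
    """Number of input quartet trees that the candidate tree agrees with."""
    return sum(displays(splits, q) for q in quartets)

# Hypothetical candidate tree on taxa 1..5 with edges 12|345 and 123|45.
tree = [({1, 2}, {3, 4, 5}), ({1, 2, 3}, {4, 5})]
# Hypothetical input quartets: 12|34, 13|45, 14|25.
quartets = [((1, 2), (3, 4)), ((1, 3), (4, 5)), ((1, 4), (2, 5))]
score = quartet_score(tree, quartets)
```

In this example the tree agrees with 12|34 and 13|45 but not with 14|25, so its score is 2.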

Quartet amalgamation algorithms • Quartet Puzzling (Strimmer and von Haeseler) • Q* (Berry et al.) • Quartet Cleaning (Berry et al.) • Weight Optimization (Ranwez and Gascuel) • Quartets MaxCut (Snir and Rao) But see also the paper (St. John et al.) evaluating early quartet methods on the CS 394 C webpage

What about rooted trees? Given a set of rooted source trees, we ask: • Is there a tree with which all the rooted source trees agree?

Rooted tree compatibility • Aho, Sagiv, Szymanski, and Ullman: polynomial-time, recursive algorithm on the set of rooted triples induced by the source trees: – If n=1, return the singleton tree. – If n>1, then compute an equivalence relation on the set of taxa as follows. • For each rooted triple ((a, b), c) in the set, put a and b in the same equivalence class. • Compute the transitive closure. – If there is only one equivalence class, reject (the set is incompatible). Otherwise, recurse on each equivalence class (using only the triples whose three taxa all lie in that class), and return the tree obtained by making all recursively computed trees sibling subtrees.
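
The recursive steps above can be sketched as follows. This is a minimal illustration rather than an optimized implementation: each rooted triple ((a, b), c) is written as a tuple (a, b, c), the returned tree is a nested tuple, the equivalence classes are computed with a small union-find, and children are ordered by their smallest leaf only so that the output is deterministic. All of these representation choices are assumptions made for the example.

```python
def build(taxa, triples):
    """Return a nested-tuple tree consistent with all triples, or None."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))  # singleton tree

    # Union-find over the taxa; merging a and b for each triple ((a, b), c)
    # yields the transitive closure of the relation.
    parent = {x: x for x in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _ in triples:
        parent[find(a)] = find(b)

    classes = {}
    for x in taxa:
        classes.setdefault(find(x), set()).add(x)
    if len(classes) == 1:
        return None  # only one class: the triples are incompatible

    subtrees = []
    for block in sorted(classes.values(), key=min):
        # Keep only the triples whose three taxa all lie in this class.
        sub = [t for t in triples if set(t) <= block]
        child = build(block, sub)
        if child is None:
            return None
        subtrees.append(child)
    return tuple(subtrees)

ok = build("abcd", [("a", "b", "c"), ("a", "c", "d")])
bad = build("abc", [("a", "b", "c"), ("a", "c", "b")])
print(ok)
print(bad)
```

With the two compatible triples, the top level splits into the classes {a, b, c} and {d}, and the recursion on {a, b, c} separates {a, b} from {c}. The second call is rejected because ((a, b), c) and ((a, c), b) merge all three taxa into a single class.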

Subtree compatibility • If source trees are rooted, then compatibility can be tested in polynomial time. Optimization problems are NP-hard, however. • If source trees are unrooted, then compatibility is NP-hard. And so optimization problems are also NP-hard.

Supertree problems, in practice • In practice, the most frequently used supertree method is MRP, for “Matrix Representation with Parsimony”. • There are, however, many other supertree methods!

Many Supertree Methods • Matrix Representation with Parsimony (MRP; most commonly used) • weighted MRP • Min-Cut • Modified Min-Cut • Semi-strict Supertree • MRF • MRD • QILI • SDM • Q-imputation • PhySIC • Majority-Rule Supertrees • Maximum Likelihood Supertrees • and many more...

MRP • Idea: take every source tree, and replace it with a matrix over the symbols 0, 1, and ?. • Concatenate the matrices. • Apply Maximum Parsimony. If all the source trees are compatible, then an exact solution to MRP will return the compatibility trees.
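
The matrix construction can be sketched as below, assuming the usual encoding in which each internal edge (here, each clade of a rooted source tree) contributes one column: taxa inside the clade get 1, other taxa of that source tree get 0, and taxa absent from that source tree get ?. The input layout (a leaf set plus a list of clades per source tree) and the function name are assumptions for illustration, and the maximum parsimony step itself is not shown.

```python
def mrp_matrix(source_trees):
    """Build the concatenated MRP matrix as {taxon: row string}.

    Each source tree is a pair (leaves, clades), where `clades` lists the
    leaf sets below its internal edges.
    """
    all_taxa = sorted(set().union(*(leaves for leaves, _ in source_trees)))
    rows = {t: [] for t in all_taxa}
    for leaves, clades in source_trees:
        for clade in clades:  # one matrix column per internal edge
            for t in all_taxa:
                if t in clade:
                    rows[t].append("1")
                elif t in leaves:
                    rows[t].append("0")
                else:
                    rows[t].append("?")  # taxon missing from this source tree
    return {t: "".join(chars) for t, chars in rows.items()}

# Hypothetical source trees ((a,b),(c,d)) and ((b,c),(d,e)).
sources = [
    ({"a", "b", "c", "d"}, [{"a", "b"}, {"c", "d"}]),
    ({"b", "c", "d", "e"}, [{"b", "c"}, {"d", "e"}]),
]
matrix = mrp_matrix(sources)
for taxon, row in matrix.items():
    print(taxon, row)
```

Taxon a is missing from the second source tree, so its row ends in two question marks, and taxon e is missing from the first, so its row begins with two.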

Homework, due 9/15 • Read two papers (linked on the webpage): – St. John et al., about quartet-based methods – Moret et al., about sequence-length requirements • Pick one, write a summary, and include questions

Question! • How do you feel about occasionally having class on some Monday or Friday, so we can have guest lectures?