Distance in Trees
• d_ij(T): the length of the path between leaves i and j in tree T
• Example: d_1,4 = 12 + 13 + 14 + 17 + 13 = 69

Phylogenetic Tree Reconstruction
• Input: distance matrix D
• Output: binary tree T such that d_ij(T) = D_ij

Reconstructing a 3 Leaved Tree
• Tree reconstruction for any 3 x 3 matrix is straightforward
• We have 3 leaves i, j, k and a center vertex c
Observe:
  d_ic + d_jc = D_ij
  d_ic + d_kc = D_ik
  d_jc + d_kc = D_jk

Reconstructing a 3 Leaved Tree (cont’d)
Add the first two equations:
  d_ic + d_jc = D_ij
  d_ic + d_kc = D_ik
  2 d_ic + d_jc + d_kc = D_ij + D_ik
Substitute d_jc + d_kc = D_jk:
  2 d_ic + D_jk = D_ij + D_ik
  d_ic = (D_ij + D_ik - D_jk)/2
Similarly,
  d_jc = (D_ij + D_jk - D_ik)/2
  d_kc = (D_ki + D_kj - D_ij)/2
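The three formulas above can be sketched in Python as follows. The function name three_leaf_tree and the sample distances 9, 8, 7 are illustrative assumptions, not part of the slides.

```python
def three_leaf_tree(D_ij, D_ik, D_jk):
    """Return (d_ic, d_jc, d_kc): the three edge lengths to the center vertex c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


# Hypothetical distances: D_ij = 9, D_ik = 8, D_jk = 7
print(three_leaf_tree(9, 8, 7))
```

Summing any two of the returned lengths gives back the corresponding matrix entry, for example d_ic + d_jc = D_ij.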

Trees with > 3 Leaves
• An unrooted binary tree with n leaves has 2n-3 edges
• This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables
• For n > 3 there are more equations than variables, so the system does not always have a solution

The Four Point Condition
Compute:
  1. D_ij + D_kl
  2. D_ik + D_jl
  3. D_il + D_jk
When i, j lie on one side of the middle edge and k, l on the other:
• Sums 2 and 3 represent the same number: the length of all edges + the middle edge (it is counted twice)
• Sum 1 represents a smaller number: the length of all edges - the middle edge

The Four Point Condition
• Four point condition: for i, j, k, l, two of the sums D_ij + D_kl, D_ik + D_jl, D_il + D_jk are equal and the third sum is smaller (or equal, when the middle edge has length 0)
• Definition: An n x n matrix D is additive provided there exists a tree T with D(T) = D. (Note: T is unique.)
• Theorem: D is additive if and only if the four point condition holds for every quartet 1 ≤ i, j, k, l ≤ n
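A minimal sketch of the four point test, assuming the matrix is stored as a dict of dicts keyed by leaf name. The helper names symmetric and four_point_holds are illustrative; the sample values are the 5 x 5 matrix over leaves v, w, x, y, z used later in the slides. The check accepts a quartet when its two largest sums are equal.

```python
from itertools import combinations


def symmetric(labels, upper):
    """Build a symmetric dict-of-dicts matrix from its upper triangle (row by row)."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in zip(combinations(labels, 2), upper):
        D[a][b] = D[b][a] = d
    return D


def four_point_holds(D):
    """True if, for every quartet, the two largest of the three sums are equal."""
    for i, j, k, l in combinations(D, 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if sums[1] != sums[2]:
            return False
    return True


D = symmetric("vwxyz", [10, 17, 16, 16, 15, 14, 14, 9, 15, 14])
print(four_point_holds(D))          # additive example
D["v"]["x"] = D["x"]["v"] = 18      # perturb one entry
print(four_point_holds(D))          # no longer additive
```

Exact equality is used because the sample entries are integers; measured distances would need a tolerance.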

Additive Distance Matrices
Matrix D is
• ADDITIVE if there exists a tree T with d_ij(T) = D_ij
• NON-ADDITIVE otherwise

Reconstructing Additive Distances Given T
[Figure: tree T with leaves v, w, x, y, z]
Distance matrix D:
       v   w   x   y   z
  v    0  10  17  16  16
  w        0  15  14  14
  x            0   9  15
  y                0  14
  z                    0
If we know T and D, but do not know the length of each edge, we can reconstruct those lengths.

Reconstructing Additive Distances Given T
Leaves v and w are neighbors in T; call their parent a. Then
  d_vx + d_wx = 2 d_ax + d_vw
and therefore
  d_ax = ½ (d_vx + d_wx - d_vw) = 11
  d_ay = ½ (d_vy + d_wy - d_vw) = 10
  d_az = ½ (d_vz + d_wz - d_vw) = 10
Replacing v and w by a gives the reduced matrix D1:
       a   x   y   z
  a    0  11  10  10
  x        0   9  15
  y            0  14
  z                0

Reconstructing Additive Distances Given T
D1:
       a   x   y   z
  a    0  11  10  10
  x        0   9  15
  y            0  14
  z                0
D2 (x and y replaced by their parent b):
       a   b   z
  a    0   6  10
  b        0  10
  z            0
D3 (b and z replaced by their parent c):
       a   c
  a    0   3
  c        0
Edge lengths recovered on the way back:
  d(a, c) = 3
  d(b, c) = d(a, b) - d(a, c) = 3
  d(c, z) = d(a, z) - d(a, c) = 7
  d(b, x) = d(a, x) - d(a, b) = 5
  d(b, y) = d(a, y) - d(a, b) = 4
  d(a, w) = d(z, w) - d(a, z) = 4
  d(a, v) = d(z, v) - d(a, z) = 6
Correct!!!
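As a sanity check, the recovered edge lengths can be plugged back into the tree and compared against D. The sketch below assumes the topology read off the slides (v and w attached to a, x and y attached to b, z attached to c, with internal edges a-c and b-c); the helper names build_tree and distances_from are illustrative.

```python
from itertools import combinations

# (endpoint, endpoint, length) for the seven edges recovered above
EDGES = [("v", "a", 6), ("w", "a", 4), ("a", "c", 3), ("c", "z", 7),
         ("c", "b", 3), ("b", "x", 5), ("b", "y", 4)]


def build_tree(edges):
    """Store the weighted tree as an adjacency dict: node -> {neighbor: length}."""
    tree = {}
    for p, q, w in edges:
        tree.setdefault(p, {})[q] = w
        tree.setdefault(q, {})[p] = w
    return tree


def distances_from(tree, src):
    """Path lengths from src to every vertex (paths in a tree are unique)."""
    dist = {src: 0}
    stack = [src]
    while stack:
        node = stack.pop()
        for nbr, w in tree[node].items():
            if nbr not in dist:
                dist[nbr] = dist[node] + w
                stack.append(nbr)
    return dist


tree = build_tree(EDGES)
upper = [distances_from(tree, a)[b] for a, b in combinations("vwxyz", 2)]
print(upper)  # upper triangle of the tree's leaf-to-leaf distances, row by row
```

The printed list should match the upper triangle of D read row by row (v-w, v-x, ..., y-z).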

Distance Based Phylogeny Problem
• Goal: Reconstruct an evolutionary tree from a distance matrix
• Input: n x n distance matrix D_ij
• Output: weighted tree T with n leaves fitting D
• If D is additive, this problem has a solution and there is a simple algorithm to solve it

Using Neighboring Leaves to Construct the Tree
• Find neighboring leaves i and j with parent k
• Remove the rows and columns of i and j
• Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:
  D_km = (D_im + D_jm - D_ij)/2
• Compress i and j into k, iterate algorithm for rest of tree
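One compression step can be sketched as below, assuming a dict-of-dicts matrix keyed by leaf name. The function name merge_neighbors and the helper symmetric are illustrative; the sample call merges v and w into a, as in the worked example on the earlier slides.

```python
from itertools import combinations


def symmetric(labels, upper):
    """Build a symmetric dict-of-dicts matrix from its upper triangle (row by row)."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in zip(combinations(labels, 2), upper):
        D[a][b] = D[b][a] = d
    return D


def merge_neighbors(D, i, j, k):
    """Return a new matrix in which neighboring leaves i and j are replaced by parent k."""
    rest = [m for m in D if m not in (i, j)]
    merged = {m: {p: D[m][p] for p in rest} for m in rest}
    merged[k] = {k: 0}
    for m in rest:
        d = (D[i][m] + D[j][m] - D[i][j]) / 2   # D_km from the slide
        merged[k][m] = merged[m][k] = d
    return merged


D = symmetric("vwxyz", [10, 17, 16, 16, 15, 14, 14, 9, 15, 14])
D1 = merge_neighbors(D, "v", "w", "a")
print(D1["a"])
```

The function returns a new matrix rather than editing D in place, so the original distances stay available when edge lengths are recovered afterwards.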

Finding Neighboring Leaves
• Tempting idea: to find neighboring leaves, simply select a pair of closest leaves.
• WRONG

Finding Neighboring Leaves
• Closest leaves aren’t necessarily neighbors
• i and j are neighbors, but (d_ij = 13) > (d_jk = 12)
• Finding a pair of neighboring leaves is a nontrivial problem!

Degenerate Triples
• A degenerate triple is a set of three distinct elements 1 ≤ i, j, k ≤ n where D_ij + D_jk = D_ik
• Element j in a degenerate triple i, j, k lies on the evolutionary path from i to k (or is attached to this path by an edge of length 0).
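A brute-force search for a degenerate triple might look like the sketch below (cubic in n). The names find_degenerate_triple and symmetric are illustrative. The second sample matrix is the 5 x 5 example over v, w, x, y, z with every off-diagonal entry reduced by 8, chosen here only because it contains a degenerate triple.

```python
from itertools import combinations, permutations


def symmetric(labels, upper):
    """Build a symmetric dict-of-dicts matrix from its upper triangle (row by row)."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in zip(combinations(labels, 2), upper):
        D[a][b] = D[b][a] = d
    return D


def find_degenerate_triple(D):
    """Return some (i, j, k) with D_ij + D_jk = D_ik, or None if there is none."""
    for i, j, k in permutations(D, 3):
        if D[i][j] + D[j][k] == D[i][k]:
            return i, j, k
    return None


original = symmetric("vwxyz", [10, 17, 16, 16, 15, 14, 14, 9, 15, 14])
reduced = symmetric("vwxyz", [2, 9, 8, 8, 7, 6, 6, 1, 7, 6])
print(find_degenerate_triple(original))
print(find_degenerate_triple(reduced))
```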

Looking for Degenerate Triples
• If distance matrix D has a degenerate triple i, j, k, then j can be “removed” from D, thus reducing the size of the problem.
• If distance matrix D does not have a degenerate triple i, j, k, one can “create” a degenerate triple in D by shortening all hanging edges (in the tree).

Shortening Hanging Edges to Produce Degenerate Triples
• Shorten all “hanging” edges (edges that connect leaves) until a degenerate triple is found

Finding Degenerate Triples
• If there is no degenerate triple, all hanging edges are reduced by the same amount δ, so that all pairwise distances in the matrix are reduced by 2δ.
• Eventually this process collapses one of the leaves (when δ = length of shortest hanging edge), forming a degenerate triple i, j, k and reducing the size of the distance matrix D.
• The attachment point for j can be recovered in the reverse transformations by saving D_ij for each collapsed leaf.

Reconstructing Trees for Additive Distance Matrices
Trim(D, δ)
  for all 1 ≤ i ≠ j ≤ n
    D_ij = D_ij - 2δ
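A sketch of Trim together with one way to obtain δ. The slides do not spell out how δ is computed; the sketch assumes it is the length of the shortest hanging edge, and uses the fact that in an additive matrix the hanging edge of leaf j has length min over i, k of (D_ij + D_jk - D_ik)/2. The names trimming_parameter, trim and symmetric are illustrative.

```python
from itertools import combinations


def symmetric(labels, upper):
    """Build a symmetric dict-of-dicts matrix from its upper triangle (row by row)."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in zip(combinations(labels, 2), upper):
        D[a][b] = D[b][a] = d
    return D


def trimming_parameter(D):
    """Length of the shortest hanging edge (assumes D is additive with >= 3 leaves)."""
    return min((D[i][j] + D[j][k] - D[i][k]) / 2
               for j in D
               for i, k in combinations([m for m in D if m != j], 2))


def trim(D, delta):
    """Return a copy of D with every off-diagonal entry reduced by 2 * delta."""
    return {a: {b: (D[a][b] - 2 * delta if a != b else 0) for b in D} for a in D}


D = symmetric("vwxyz", [10, 17, 16, 16, 15, 14, 14, 9, 15, 14])
delta = trimming_parameter(D)
trimmed = trim(D, delta)
print(delta, trimmed["v"])
```

For the sample matrix the trimmed result contains the degenerate triple v, w, x, since the hanging edge of w is among the shortest.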

AdditivePhylogeny Algorithm
AdditivePhylogeny(D)
  if D is a 2 x 2 matrix
    T = tree of a single edge of length D_1,2
    return T
  if D is non-degenerate
    Compute trimming parameter δ
    Trim(D, δ)
  Find a triple i, j, k in D such that D_ij + D_jk = D_ik
  x = D_ij
  Remove jth row and jth column from D
  T = AdditivePhylogeny(D)
  Traceback

AdditivePhylogeny (cont’d)
Traceback
  Add a new vertex v to T at distance x from i on the path from i to k
  Add j back to T by creating an edge (v, j) of length 0
  for every leaf l in T
    if distance from l to v in the tree ≠ D_l,j
      output “matrix is not additive”
      return
  Extend all “hanging” edges by length δ
  return T
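The pseudocode above can be turned into a compact Python sketch. Everything not named on the slides is an assumption: the dict-of-dicts matrix, the adjacency-dict tree, the ("internal", n) labels for new vertices, δ computed as the shortest hanging edge, and δ = 0 when D is already degenerate. Exact equality tests are used, so the sketch suits integer or half-integer inputs only.

```python
from itertools import combinations, count, permutations


def symmetric(labels, upper):
    """Build a symmetric dict-of-dicts matrix from its upper triangle (row by row)."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in zip(combinations(labels, 2), upper):
        D[a][b] = D[b][a] = d
    return D


def distances_from(tree, src):
    """Path lengths from src to every vertex of the tree."""
    dist, stack = {src: 0}, [src]
    while stack:
        node = stack.pop()
        for nbr, w in tree[node].items():
            if nbr not in dist:
                dist[nbr] = dist[node] + w
                stack.append(nbr)
    return dist


def path(tree, src, dst):
    """Vertices on the unique path from src to dst."""
    stack = [[src]]
    while stack:
        trail = stack.pop()
        if trail[-1] == dst:
            return trail
        for nbr in tree[trail[-1]]:
            if len(trail) < 2 or nbr != trail[-2]:   # never step back to the parent
                stack.append(trail + [nbr])


def degenerate_triple(D):
    return next(((i, j, k) for i, j, k in permutations(D, 3)
                 if D[i][j] + D[j][k] == D[i][k]), None)


def additive_phylogeny(D, fresh=None):
    fresh = count() if fresh is None else fresh
    if len(D) == 2:
        a, b = D
        return {a: {b: D[a][b]}, b: {a: D[a][b]}}
    delta = 0
    if degenerate_triple(D) is None:
        # Assumed trimming parameter: the shortest hanging edge.
        delta = min((D[i][j] + D[j][k] - D[i][k]) / 2 for j in D
                    for i, k in combinations([m for m in D if m != j], 2))
        D = {a: {b: (D[a][b] - 2 * delta if a != b else 0) for b in D} for a in D}
    i, j, k = degenerate_triple(D)
    x = D[i][j]
    smaller = {a: {b: D[a][b] for b in D if b != j} for a in D if a != j}
    tree = additive_phylogeny(smaller, fresh)
    # Traceback: split the edge of the i -> k path that contains the point at distance x.
    trail, walked = path(tree, i, k), 0
    for a, b in zip(trail, trail[1:]):
        w = tree[a][b]
        if x < walked + w or b == k:
            break
        walked += w
    v = ("internal", next(fresh))
    del tree[a][b], tree[b][a]
    tree[v] = {a: x - walked, b: walked + w - x, j: 0}
    tree[a][v], tree[b][v], tree[j] = x - walked, walked + w - x, {v: 0}
    dist = distances_from(tree, v)
    if any(dist[l] != D[l][j] for l in D if l != j):
        raise ValueError("matrix is not additive")
    for leaf in D:                                   # extend all hanging edges by delta
        (parent,) = tree[leaf]
        tree[leaf][parent] += delta
        tree[parent][leaf] += delta
    return tree


D = symmetric("vwxyz", [10, 17, 16, 16, 15, 14, 14, 9, 15, 14])
tree = additive_phylogeny(D)
print({leaf: nbrs for leaf, nbrs in tree.items() if leaf in D})
```

New vertices are always created by splitting an edge, even when the split point coincides with an existing vertex, so leaves keep exactly one neighbor and "extend all hanging edges" stays a simple loop. The price is that the result may contain zero-length internal edges.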

Neighbor Joining Algorithm
• In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction
• Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves
• Advantages: works well for additive and also for non-additive matrices; it does not rely on the flawed molecular clock assumption

Neighbor-Joining
• Guaranteed to produce the correct tree if distance is additive
• May produce a good tree even when distance is not additive
[Figure: four-leaf tree with leaves 1 to 4 and edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]
Let C = the set of current clusters.
Step 1: Finding neighboring clusters
Define: u(C) = 1/(|C| - 2) Σ_{C'} D(C, C'), where the sum runs over all current clusters C' and |C| is the number of current clusters
u(C) measures separation of C from other clusters
Want to minimize D(C_1, C_2) and maximize u(C_1) + u(C_2)
Magic trick: Choose C_1 and C_2 that minimize D(C_1, C_2) - (u(C_1) + u(C_2))
Claim: For an additive matrix, a pair i, j that minimizes D_ij - (u_i + u_j) is a pair of neighbors
Proof: Very technical, please read Durbin et al.!

Algorithm: Neighbor-joining
Initialization:
  Form n clusters, one for each leaf node
  Define T to be the set of leaf nodes, one per sequence
Iteration:
  Pick C_1, C_2 s.t. D(C_1, C_2) - (u(C_1) + u(C_2)) is minimal
  Merge C_1 and C_2 into new cluster with |C_1| + |C_2| elements
  Add a new vertex C to T and connect to vertices C_1 and C_2
  Assign length ½ (D(C_1, C_2) + u(C_1) - u(C_2)) to edge (C_1, C)
  Assign length ½ (D(C_1, C_2) + u(C_2) - u(C_1)) to edge (C_2, C)
  Remove rows and columns from D corresponding to C_1 and C_2
  Add row and column to D for new cluster C
Termination:
  When only one cluster remains
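The iteration above can be sketched in Python as follows. Several details are assumptions rather than statements from the slides: the dict-of-dicts matrix, the adjacency-dict tree, the ("internal", n) labels, the update D(C, m) = ½ (D(C_1, m) + D(C_2, m) - D(C_1, C_2)) for the new row, and stopping when two clusters remain so that they can be joined by a single edge.

```python
from itertools import combinations, count


def symmetric(labels, upper):
    """Build a symmetric dict-of-dicts matrix from its upper triangle (row by row)."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in zip(combinations(labels, 2), upper):
        D[a][b] = D[b][a] = d
    return D


def distances_from(tree, src):
    """Path lengths from src to every vertex of the tree."""
    dist, stack = {src: 0}, [src]
    while stack:
        node = stack.pop()
        for nbr, w in tree[node].items():
            if nbr not in dist:
                dist[nbr] = dist[node] + w
                stack.append(nbr)
    return dist


def neighbor_joining(D):
    D = {a: dict(row) for a, row in D.items()}        # work on a copy
    tree = {a: {} for a in D}
    fresh = count()
    while len(D) > 2:
        n = len(D)
        # u(C): separation of each cluster from all the others
        u = {a: sum(D[a][b] for b in D if b != a) / (n - 2) for a in D}
        c1, c2 = min(combinations(D, 2),
                     key=lambda p: D[p[0]][p[1]] - u[p[0]] - u[p[1]])
        c = ("internal", next(fresh))
        len1 = (D[c1][c2] + u[c1] - u[c2]) / 2
        len2 = (D[c1][c2] + u[c2] - u[c1]) / 2
        tree[c] = {c1: len1, c2: len2}
        tree[c1][c] = len1
        tree[c2][c] = len2
        # Assumed update rule for the distances from the new cluster
        row = {m: (D[c1][m] + D[c2][m] - D[c1][c2]) / 2
               for m in D if m not in (c1, c2)}
        del D[c1], D[c2]
        for m in D:
            del D[m][c1], D[m][c2]
            D[m][c] = row[m]
        D[c] = {**row, c: 0}
    a, b = D                                          # join the last two clusters
    tree[a][b] = tree[b][a] = D[a][b]
    return tree


D = symmetric("vwxyz", [10, 17, 16, 16, 15, 14, 14, 9, 15, 14])
tree = neighbor_joining(D)
print({a: round(distances_from(tree, a)["z"], 6) for a in "vwxy"})
```

Because u(C) involves division by n - 2, edge lengths come out as floats; comparisons against D should therefore use a small tolerance.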