Multiple Sequence Alignment Evolution at the DNA level

  • Slides: 36
Download presentation
Multiple Sequence Alignment

Multiple Sequence Alignment

Evolution at the DNA level Deletion Mutation …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… REARRANGEMENTS Inversion Translocation Duplication SEQUENCE

Evolution at the DNA level Deletion Mutation …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… REARRANGEMENTS Inversion Translocation Duplication SEQUENCE EDITS

Orthology and Paralogy Yeast HA 1 Human HA 2 Human WA Worm HB Human

Orthology and Paralogy Yeast HA 1 Human HA 2 Human WA Worm HB Human WB Worm Orthologs: Derived by speciation Paralogs: Everything else

Orthology, Paralogy, Inparalogs, Outparalogs

Orthology, Paralogy, Inparalogs, Outparalogs

Phylogenetic Trees • Nodes: species • Edges: time of independent evolution • Edge length

Phylogenetic Trees • Nodes: species • Edges: time of independent evolution • Edge length represents evolution time § AKA genetic distance § Not necessarily chronological time

Inferring Phylogenetic Trees can be inferred by several criteria: § Morphology of the organisms

Inferring Phylogenetic Trees can be inferred by several criteria: § Morphology of the organisms • Can lead to mistakes § Sequence comparison Example: Mouse: Rat: Baboon: Chimp: Human: ACAGTGACGCCCCAAACGT ACAGTGACGCTACAAACGT CCTGTGACGTAACAAACGA CCTGTGACGTAGCAAACGA

Distance Between Two Sequences Basic principle: • Distance proportional to degree of independent sequence

Distance Between Two Sequences Basic principle: • Distance proportional to degree of independent sequence evolution Given sequences xi, xj, dij = distance between the two sequences One possible definition: dij = fraction f of sites u where xi[u] xj[u] Better scores are derived by modeling evolution as a continuous change process

A simple clustering method for building tree UPGMA (unweighted pair group method using arithmetic

A simple clustering method for building tree UPGMA (unweighted pair group method using arithmetic averages) Or the Average Linkage Method Given two disjoint clusters Ci, Cj of sequences, 1 dij = ––––– {p Ci, q Cj}dpq |Ci| |Cj| Claim that if Ck = Ci Cj, then distance to another cluster Cl is: dil |Ci| + djl |Cj| dkl = ––––––– |Ci| + |Cj|

Algorithm: Average Linkage Initialization: 1 4 Assign each xi into its own cluster Ci

Algorithm: Average Linkage Initialization: 1 4 Assign each xi into its own cluster Ci Define one leaf per sequence, height 0 3 Iteration: 5 2 Find two clusters Ci, Cj s. t. dij is min Let Ck = Ci Cj Define node connecting Ci, Cj, and place it at height dij/2 Delete Ci, Cj Termination: When two clusters i, j remain, place root at height dij/2 1 4 2 3 5

Neighbor-Joining • Guaranteed to produce the correct tree if distance is additive • May

Neighbor-Joining • Guaranteed to produce the correct tree if distance is additive • May produce a good tree even when distance is not additive 1 Step 1: Finding neighboring leaves 3 0. 1 Define 0. 4 Dij = (N – 2) dij – k i dik – k j djk 2 Claim: The above “magic trick” ensures that i, j are neighbors if Dij is minimal 0. 4 4

Species Trees and Gene Trees Rasmussen M D , Kellis M Genome Res. 2007;

Species Trees and Gene Trees Rasmussen M D , Kellis M Genome Res. 2007; 17: 1932 -1942 © 2007 by Cold Spring Harbor Laboratory Press

Evolutionary rates decoupled into gene-specific and species-specific components © 2007 by Cold Spring Harbor

Evolutionary rates decoupled into gene-specific and species-specific components © 2007 by Cold Spring Harbor Laboratory Press Rasmussen M D , Kellis M Genome Res. 2007; 17: 1932 -1942

Evaluating gene-tree likelihood using learned rate distributions. Rasmussen M D , Kellis M Genome

Evaluating gene-tree likelihood using learned rate distributions. Rasmussen M D , Kellis M Genome Res. 2007; 17: 1932 -1942

Sequence Alignment

Sequence Alignment

Scoring Function • Sequence edits: AGGCCTC § Mutations AGGACTC § Insertions AGGGCCTC § Deletions

Scoring Function • Sequence edits: AGGCCTC § Mutations AGGACTC § Insertions AGGGCCTC § Deletions AGG. CTC Scoring Function: Match: +m Mismatch: -s Gap: -d Alternative definition: minimal edit distance “Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other” Score F = (# matches) m - (# mismatches) s – (#gaps) d

How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGGGTCACAAAACTC Too many possible alignments: >>

How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGGGTCACAAAACTC Too many possible alignments: >> 2 N (exercise)

Dynamic Programming • There are only a polynomial number of subproblems § Align x

Dynamic Programming • There are only a polynomial number of subproblems § Align x 1…xi to y 1…yj • Original problem is one of the subproblems § Align x 1…x. M to y 1…y. N • Each subproblem is easily solved from smaller subproblems § We will show next • Then, we can apply Dynamic Programming!!! Let F(i, j) = optimal score of aligning x 1……xi y 1……yj F is the DP “Matrix” or “Table” “Memoization”

Dynamic Programming (cont’d) Notice three possible cases: 1. 2. 3. xi aligns to yj

Dynamic Programming (cont’d) Notice three possible cases: 1. 2. 3. xi aligns to yj x 1……xi-1 xi y 1……yj-1 yj m, if xi = yj F(i, j) = F(i – 1, j – 1) + -s, if not xi aligns to a gap x 1……xi-1 xi y 1……yj - F(i, j) = F(i – 1, j) – d yj aligns to a gap x 1……xi y 1……yj-1 yj F(i, j) = F(i, j – 1) – d

Dynamic Programming (cont’d) How do we know which case is correct? Inductive assumption: F(i,

Dynamic Programming (cont’d) How do we know which case is correct? Inductive assumption: F(i, j – 1), F(i – 1, j – 1) are optimal Then, F(i, j) = max Where F(i – 1, j – 1) + s(xi, yj) F(i – 1, j) – d F(i, j – 1) – d s(xi, yj) = m, if xi = yj; -s, if not

Example x = AGTA y = ATA m= 1 Procedure to output s =

Example x = AGTA y = ATA m= 1 Procedure to output s = -1 Alignment d = -1 F(i, j) i=0 j=0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 • Follow the backpointers F(1, 1) = • When diagonal, + s(A, A), max{F(0, 0) OUTPUT i, y F(0, x 1) –j d, F(1, 0) – d} = • When up, OUTPUT yj max{0 + 1, -1 – 1, • When left, -1 – 1} = 1 OUTPUT xi AGTA A - TA

The Needleman-Wunsch Algorithm 1. Initialization. a. b. c. 2. F(0, 0) F(0, j) F(i,

The Needleman-Wunsch Algorithm 1. Initialization. a. b. c. 2. F(0, 0) F(0, j) F(i, 0) = 0 =-j d =-i d Main Iteration. Filling-in partial alignments a. For each i = 1……M For each j = 1……N F(i, j) Ptr(i, j) 3. = max = F(i – 1, j – 1) + s(xi, yj) F(i – 1, j) – d F(i, j – 1) – d DIAG, LEFT, UP, Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment if [case 1] if [case 2] if [case 3] [case 1] [case 2] [case 3]

Multiple Sequence Alignments

Multiple Sequence Alignments

Definition Given N sequences x 1, x 2, …, x. N: § Insert gaps

Definition Given N sequences x 1, x 2, …, x. N: § Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum

Applications

Applications

Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by

Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: y: z: AC-GCGG-C AC-GC-GAG GCCGC-GAG Induces: x: ACGCGG-C; y: ACGC-GAC; x: AC-GCGG-C; z: GCCGC-GAG; y: AC-GCGAG z: GCCGCGAG

Sum Of Pairs (cont’d) • Heuristic way to incorporate evolution tree: Human Mouse Duck

Sum Of Pairs (cont’d) • Heuristic way to incorporate evolution tree: Human Mouse Duck Chicken • Weighted SOP: S(m) = k<l wkl s(mk, ml)

A Profile Representation A C G T - T C C C A A

A Profile Representation A C G T - T C C C A A A G G G – – C C C T T T A A A T C C T T 0. 6 0. 2. 2 1 0 0 0 0 1. 2 0 0 0. 8 0 1 0 0. 4 0 0 0. 6 0 0 C C C A A G C C – – T G G G G 0. 8 0 0 0 1 0. 6. 2 0 0. 4 0 0. 2 0 0. 4. 8. 4 0 0 1 0 0 • Given a multiple alignment M = m 1…mn § Replace each column mi with profile entry pi • Frequency of each letter in • # gaps • Optional: # gap openings, extensions, closings § Can think of this as a “likelihood” of each letter in each position

Multiple Sequence Alignments Algorithms

Multiple Sequence Alignments Algorithms

Multidimensional DP Generalization of Needleman-Wunsh: S(m) = i S(mi) (sum of column scores) F(i

Multidimensional DP Generalization of Needleman-Wunsh: S(m) = i S(mi) (sum of column scores) F(i 1, i 2, …, i. N): Optimal alignment up to (i 1, …, i. N) F(i 1, i 2, …, i. N) = max(all neighbors of cube)(F(nbr)+S(nbr))

Multidimensional DP • Example: in 3 D (three sequences): • 7 neighbors/cell F(i, j,

Multidimensional DP • Example: in 3 D (three sequences): • 7 neighbors/cell F(i, j, k) = max{ F(i F(i – – 1, j – 1, k – 1) + S(xi, xj, xk), 1, j – 1, k ) + S(xi, xj, - ), 1, j , k – 1) + S(xi, -, xk), 1, j , k ) + S(xi, -, - ), , j – 1, k – 1) + S( -, xj, xk), , j – 1, k ) + S( -, xj, - ), , j , k – 1) + S( -, -, xk) }

Multidimensional DP Running Time: 1. Size of matrix: LN; Where L = length of

Multidimensional DP Running Time: 1. Size of matrix: LN; Where L = length of each sequence N = number of sequences 2. Neighbors/cell: 2 N – 1 Therefore…………… O(2 N LN)

Multidimensional DP Running Time: 1. Size of matrix: LN; Where L = length of

Multidimensional DP Running Time: 1. Size of matrix: LN; Where L = length of each sequence N = number of sequences 2. Neighbors/cell: 2 N – 1 Therefore…………… O(2 N LN)

Progressive Alignment x pxyzw • pzw y z w When evolutionary tree is known:

Progressive Alignment x pxyzw • pzw y z w When evolutionary tree is known: § Align closest first, in the order of the tree § In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult Weighted version: § Tree edges have weights, proportional to the divergence in that edge § New profile is a weighted average of two old profiles

Progressive Alignment x y Example z Profile: (A, C, G, T, -) px =

Progressive Alignment x y Example z Profile: (A, C, G, T, -) px = (0. 8, 0. 2, 0, 0, 0) w py = (0. 6, 0, 0. 4) • When evolutionary tree is known: s(px, py) = 0. 8*0. 6*s(A, A) + 0. 2*0. 6*s(C, A) + 0. 8*0. 4*s(A, -) + 0. 2*0. 4*s(C, -) § Align closest first, in the order of the tree § In each step, align two sequences. Result: x, y, or profiles px, py 0. 1, , to generate a new pxy = (0. 7, 0, 0, 0. 2) alignment with associated profile presult s(p , -) = 0. 8*1. 0*s(A, -) + 0. 2*1. 0*s(C, -) x Weighted version: § Tree edges have weights, proportional to the divergence in that edge Result: p = (0. 4, 0. 1, 0, 0, 0. 5) § New profile is a weighted average of two old x-profiles

Progressive Alignment x ? y z w • When evolutionary tree is unknown: §

Progressive Alignment x ? y z w • When evolutionary tree is unknown: § Perform all pairwise alignments § Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment § Construct a tree (UPGMA / Neighbor Joining / Other methods) § Align on the tree

MUSCLE at a glance 1. Fast measurement of all pairwise distances between sequences •

MUSCLE at a glance 1. Fast measurement of all pairwise distances between sequences • DDRAFT(x, y) defined in terms of # common k-mers (k~3) – O(N 2 L log. L) time 2. Build tree TDRAFT based on those distances, with UPGMA 3. Progressive alignment over TDRAFT, resulting in multiple alignment MDRAFT 4. Measure new Kimura-based distances D(x, y) based on MDRAFT 5. Build tree T based on D 6. Progressive alignment over T, to build M 7. Iterative refinement; for many rounds, do: • • Tree Partitioning: Split M on one branch and realign the two resulting profiles If new alignment M’ has better sum-of-pairs score than previous one, accept