CSE 5290 Algorithms for Bioinformatics Fall 2011 Suprakash
CSE 5290: Algorithms for Bioinformatics Fall 2011 Suprakash Datta datta@cse. yorku. ca Office: CSEB 3043 Phone: 416 -736 -2100 ext 77875 Course page: http: //www. cse. yorku. ca/course/5290 6/5/2021 CSE 5290, Fall 2011 1
Last time Finding Regulatory Motifs in DNA sequences (exhaustive search variants) Next: Greedy algorithms The following slides are based on slides by the authors of our text. 6/5/2021 CSE 5290, Fall 2011 2
Turnip vs Cabbage: Look and Taste Different • Although cabbages and turnips share a recent common ancestor, they look and taste different 6/5/2021 CSE 5290, Fall 2011 3
Turnip vs Cabbage - 2 6/5/2021 CSE 5290, Fall 2011 4
Turnip vs Cabbage: Almost Identical mt. DNA gene sequences • In 1980 s Jeffrey Palmer studied evolution of plant organelles by comparing mitochondrial genomes of the cabbage and turnip • 99% similarity between genes • These surprisingly identical gene sequences differed in gene order • This study helped pave the way to analyzing genome rearrangements in molecular evolution 6/5/2021 CSE 5290, Fall 2011 5
Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: 6/5/2021 CSE 5290, Fall 2011 6
Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: 6/5/2021 CSE 5290, Fall 2011 7
Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: 6/5/2021 CSE 5290, Fall 2011 8
Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: 6/5/2021 CSE 5290, Fall 2011 9
Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: Before After Evolution is manifested as the divergence in gene order 6/5/2021 CSE 5290, Fall 2011 10
Transforming Cabbage into Turnip 6/5/2021 CSE 5290, Fall 2011 11
Genome rearrangements Mouse (X chrom. ) Unknown ancestor ~ 75 million years ago Human (X chrom. ) • What are the similarity blocks and how to find them? • What is the architecture of the ancestral genome? • What is the evolutionary scenario for transforming one genome into the other? 6/5/2021 CSE 5290, Fall 2011 12
History of Chromosome X Rat Consortium, Nature, 2004 6/5/2021 CSE 5290, Fall 2011 13
Reversals 1 2 3 9 8 4 7 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 • 6/5/2021 10 6 5 Blocks represent conserved genes. CSE 5290, Fall 2011 14
Reversals 1 2 3 9 8 4 7 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 n n 6/5/2021 10 6 5 Blocks represent conserved genes. In the course of evolution or in a clinical context, blocks 1, …, 10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10. CSE 5290, Fall 2011 15
Reversals and Breakpoints 1 2 3 9 8 4 7 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 10 6 5 The reversion introduced two breakpoints (disruptions in order). 6/5/2021 CSE 5290, Fall 2011 16
Reversals: Example 5’ ATGCCTGTACTA 3’ Break and Invert 3’ TACGGACATGAT 5’ 5’ ATGTACAGGCTA 3’ 3’ TACATGTCCGAT 5’ 6/5/2021 CSE 5290, Fall 2011 17
Types of Rearrangements Reversal 1 2 3 4 5 6 1 2 -5 -4 -3 6 Translocation 1 2 3 45 6 1 26 4 53 Fusion 1 2 3 4 5 6 Fission 6/5/2021 CSE 5290, Fall 2011 18
Comparative Genomic Architectures: Mouse vs Human Genome • Humans and mice have similar genomes, but their genes are ordered differently • ~245 rearrangements – Reversals – Fusions – Fissions – Translocation 6/5/2021 CSE 5290, Fall 2011 19
Waardenburg’s Syndrome: Mouse Provides Insight into Human Genetic Disorder Waardenburg’s syndrome is characterized by pigmentary dysphasia • Gene implicated in the disease was linked to human chromosome 2 but it was not clear where exactly it is located on chromosome 2 6/5/2021 CSE 5290, Fall 2011 20
Waardenburg’s syndrome and splotch mice • A breed of mice (with splotch gene) had similar symptoms caused by the same type of gene as in humans • Scientists succeeded in identifying location of gene responsible for disorder in mice • Finding the gene in mice gives clues to where the same gene is located in humans 6/5/2021 CSE 5290, Fall 2011 21
Reversals: Example p=12345678 r(3, 5) 12543678 r(5, 6) 12546378 6/5/2021 CSE 5290, Fall 2011 22
Reversals and Gene Orders • Gene order is represented by a permutation p: p = p 1 ------ p i-1 p i+1 ------ p j-1 p j+1 ----- p n r(i, j) p 1 ------ p i-1 p j-1 ------ p i+1 p i p j+1 ----- pn l Reversal r ( i, j ) reverses (flips) the elements from i to j in p 6/5/2021 CSE 5290, Fall 2011 23
Reversal Distance Problem Goal: Given two permutations p, s, find the shortest series of reversals that transforms p into s Input: Permutations p and s Output: A series of reversals r 1, …rt transforming p into s, such that t is minimum Notation: • t - reversal distance between p and s • d(p, s) - smallest possible value of t, given p and s 6/5/2021 CSE 5290, Fall 2011 24
Sorting By Reversals Problem Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n ) • Input: Permutation p • Output: A series of reversals r 1, … rt transforming p into the identity permutation such that t is minimum 6/5/2021 CSE 5290, Fall 2011 25
Sorting By Reversals: Example • t =d(p ) - reversal distance of p • Example : p = 3 4 2 1 5 6 4 3 2 1 5 6 1 2 3 4 5 6 So d(p ) = 3 6/5/2021 CSE 5290, Fall 2011 7 10 9 8 7 8 9 10 26
Sorting by reversals: 5 steps 6/5/2021 CSE 5290, Fall 2011 27
Sorting by reversals: 4 steps What is the reversal distance for this permutation? Can it be sorted in 3 steps? 6/5/2021 CSE 5290, Fall 2011 28
Pancake Flipping Problem • The chef is sloppy; he prepares an unordered stack of pancakes of different sizes • The waiter wants to rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom) • He does it by flipping over several from the top, repeating this as many times as necessary 6/5/2021 Christos Papadimitrou and Bill Gates flip pancakes CSE 5290, Fall 2011 29
Pancake Flipping Problem: Formulation Goal: Given a stack of n pancakes, what is the minimum number of flips to rearrange them into perfect stack? • Input: Permutation p • Output: A series of prefix reversals r 1, … rt transforming p into the identity permutation such that t is minimum 6/5/2021 CSE 5290, Fall 2011 30
Pancake Flipping Problem: Greedy Algorithm • Greedy approach: 2 prefix reversals at most to place a pancake in its right position, 2 n – 2 steps total at most • William Gates and Christos Papadimitriou showed in the mid-1970 s that this problem can be solved by at most 5/3 (n + 1) prefix reversals 6/5/2021 CSE 5290, Fall 2011 31
Sorting By Reversals: A Greedy Algorithm • If sorting permutation p = 1 2 3 6 4 5, the first three elements are already in order so it does not make any sense to break them. • The length of the already sorted prefix of p is denoted prefix(p) – prefix(p) = 3 • This results in an idea for a greedy algorithm: increase prefix(p) at every step 6/5/2021 CSE 5290, Fall 2011 32
Greedy Algorithm: An Example • Doing so, p can be sorted 123645 123465 123456 • Number of steps to sort permutation of length n is at most (n – 1) 6/5/2021 CSE 5290, Fall 2011 33
Greedy Algorithm: Pseudocode Simple. Reversal. Sort(p) 1 for i 1 to n – 1 2 j position of element i in p (i. e. , pj = i) 3 if j ≠i 4 p p * r(i, j) 5 output p 6 if p is the identity permutation 7 return 6/5/2021 CSE 5290, Fall 2011 34
Analyzing Simple. Reversal. Sort • Simple. Reversal. Sort does not guarantee the smallest number of reversals and takes five steps on p = 6 1 2 3 4 5 : • • • 6/5/2021 Step Step 1: 2: 3: 4: 5: 1 1 1 6 2 2 2 6 3 3 3 6 4 4 4 6 5 5 5 6 CSE 5290, Fall 2011 35
Analyzing Simple. Reversal. Sort • But it can be sorted in two steps: p = 612345 – Step 1: 5 4 3 2 1 6 – Step 2: 1 2 3 4 5 6 • So, Simple. Reversal. Sort(p) is not optimal • Optimal algorithms are unknown for many problems; approximation algorithms are used 6/5/2021 CSE 5290, Fall 2011 36
Approximation Algorithms • These algorithms find approximate solutions rather than optimal solutions • The approximation ratio of an algorithm A on input p is: A(p) / OPT(p) where A(p) -solution produced by algorithm A OPT(p) - optimal solution of the problem 6/5/2021 CSE 5290, Fall 2011 37
Approximation Ratio/Performance Guarantee • Approximation ratio (performance guarantee) of algorithm A: max approximation ratio of all inputs of size n – For algorithm A that minimizes objective function (minimization algorithm): • max|p| = n A(p) / OPT(p) 6/5/2021 CSE 5290, Fall 2011 38
Approximation Ratio/Performance Guarantee • Approximation ratio (performance guarantee) of algorithm A: max approximation ratio of all inputs of size n – For algorithm A that minimizes objective function (minimization algorithm): • max|p| = n A(p) / OPT(p) – For maximization algorithm: • min|p| = n A(p) / OPT(p) 6/5/2021 CSE 5290, Fall 2011 39
Adjacencies and Breakpoints p = p 1 p 2 p 3…pn-1 pn • A pair of elements p i and p i + 1 are adjacent if pi+1 = pi + 1 • For example p=1 9 3 4 7 8 2 6 5 • (3, 4) or (7, 8) and (6, 5) are adjacent pairs 6/5/2021 CSE 5290, Fall 2011 40
Breakpoints: An Example There is a breakpoint between any adjacent element that are non-consecutive: p=1 9 3 4 7 8 2 6 5 • Pairs (1, 9), (9, 3), (4, 7), (8, 2) and (2, 6) form breakpoints of permutation p • b(p) - # breakpoints in permutation p 6/5/2021 CSE 5290, Fall 2011 41
Adjacency & Breakpoints • An adjacency - a pair of adjacent elements that are consecutive • A breakpoint - a pair of adjacent elements that are not consecutive π=5 6 2 1 3 4 Extend π with π0 = 0 and π7 = 7 adjacencies 0 5 6 2 1 3 4 7 breakpoints 6/5/2021 CSE 5290, Fall 2011 42
Extending Permutations • We put two elements p 0 =0 and p n + 1=n+1 at the ends of p Example: p=1 9 3 4 7 8 2 6 5 Extending with 0 and 10 p = 0 1 9 3 4 7 8 2 6 5 10 Note: A new breakpoint was created after extending 6/5/2021 CSE 5290, Fall 2011 43
Reversal Distance and Breakpoints § Each reversal eliminates at most 2 breakpoints. §This implies: reversal distance ≥ #breakpoints / 2 p =2 3 1 4 6 5 0 0 6/5/2021 2 1 1 1 3 3 2 2 1 2 3 3 4 4 6 6 6 5 5 6 7 7 CSE 5290, Fall 2011 b(p) = 5 b(p) = 4 b(p) = 2 b(p) = 0 44
Sorting By Reversals: A Better Greedy Algorithm Break. Point. Reversal. Sort(p) 1 while b(p) > 0 2 Among all possible reversals, choose reversal r minimizing b (p • r ) 3 p p • r(i, j) 4 output p 5 return Q: Does this algorithm terminate? 6/5/2021 CSE 5290, Fall 2011 45
Strips • Strip: an interval between two consecutive breakpoints in a permutation – Decreasing strip: strip of elements in decreasing order (e. g. 6 5 and 3 2 ). – Increasing strip: strip of elements in increasing order (e. g. 7 8) 0 1 9 4 3 7 8 2 5 6 10 – A single-element strip can be declared either increasing or decreasing. We will choose to declare them as decreasing with exception of the strips with 0 and n+1 6/5/2021 CSE 5290, Fall 2011 46
Reducing the Number of Breakpoints Theorem 1: If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i. e. b(p • r) < b(p) ) 6/5/2021 CSE 5290, Fall 2011 47
Things To Consider • For p = 1 4 6 5 7 8 3 2 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – Choose decreasing strip with the smallest element k in p ( k = 2 in this case) – Find k – 1 in the permutation 6/5/2021 CSE 5290, Fall 2011 48
Things To Consider (cont’d) • For p = 1 4 6 5 7 8 3 2 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – Choose decreasing strip with the smallest element k in p ( k = 2 in this case) – Find k – 1 in the permutation – Reverse the segment between k and k-1: – 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – 0 1 2 3 8 7 5 6 4 9 6/5/2021 CSE 5290, Fall 2011 b(p) = 4 49
Reducing the Number of Breakpoints Again – If there is no decreasing strip, there may be no reversal r that reduces the number of breakpoints (i. e. b(p • r) ≥ b(p) for any reversal r). – By reversing an increasing strip ( # of breakpoints stay unchanged ), we will create a decreasing strip at the next step. Then the number of breakpoints will be reduced in the next step (theorem 1). 6/5/2021 CSE 5290, Fall 2011 50
Things To Consider (cont’d) • There are no decreasing strips in p, for: p =0 1 2 5 6 7 3 4 8 b(p) = 3 p • r(6, 7) = 0 1 2 5 6 7 4 3 8 b(p) = 3 ü r(6, 7) does not change the # of breakpoints ü r(6, 7) creates a decreasing strip thus guaranteeing that the next step will decrease the # of breakpoints. 6/5/2021 CSE 5290, Fall 2011 51
Improved. Breakpoint. Reversal. Sort(p) 1 while b(p) > 0 2 if p has a decreasing strip 3 Among all possible reversals, choose reversal r that minimizes b(p • r) 4 5 else Choose a reversal r that flips an increasing strip in p 6 p p • r 7 output p 8 return 6/5/2021 CSE 5290, Fall 2011 52
Improved. Breakpoint. Reversal. Sort: Performance Guarantee • Improved. Break. Point. Reversal. Sort is an approximation algorithm with a performance guarantee of at most 4 – It eliminates at least one breakpoint in every two steps; at most 2 b(p) steps – Approximation ratio: 2 b(p) / d(p) – Optimal algorithm eliminates at most 2 breakpoints in every step: d(p) b(p) / 2 – Performance guarantee: • ( 2 b(p) / d(p) ) [ 2 b(p) / (b(p) / 2) ] = 4 6/5/2021 CSE 5290, Fall 2011 53
Signed Permutations • Up to this point, all permutations to sort were unsigned • But genes have directions… so we should consider signed permutations 5’ p = 6/5/2021 3’ 1 -2 - 3 4 CSE 5290, Fall 2011 -5 54
Signed Permutations • Algorithms are a little more involved. • Possible project topic 6/5/2021 CSE 5290, Fall 2011 55
GRIMM Web Server • Real genome architectures are represented by signed permutations • Efficient algorithms to sort signed permutations have been developed • GRIMM web server computes the reversal distances between signed permutations: 6/5/2021 CSE 5290, Fall 2011 56
GRIMM Web Server http: //www-cse. ucsd. edu/groups/bioinformatics/GRIMM 6/5/2021 CSE 5290, Fall 2011 57
Next Dynamic programming, sequence alignment Some of the following slides are based on slides by the authors of our text. 6/5/2021 CSE 5290, Fall 2011 58
Dynamic programming (DP) • Typically used for optimization problems • Often results in efficient algorithms • Not applicable to all problems Caveats: • Need not yield poly-time algorithms • No unique formulations for most problems • May not rule out greedy algorithms 6/5/2021 CSE 5290, Fall 2011 59
Example • Counting the number of shortest paths in a grid with blocked intersections • Finding paths in a weighted grid • Sequence alignment 6/5/2021 CSE 5290, Fall 2011 60
Setting up DP in practice • The optimal solution should be computable as a (recursive) function of the solution to sub-problems • Solve sub-problems systematically and store solutions (to avoid duplication of work). 6/5/2021 CSE 5290, Fall 2011 61
Number of paths in a grid Problem: Travel from the top-left to the bottom right of a rectangular grid using only right and down moves • Combinatorial approach • DP approach: how can we decompose the problem into sub-problems ? 6/5/2021 CSE 5290, Fall 2011 62
Number of paths in a grid with blocked intersections Problem: Same as before but some grid points are blocked and cannot be used • Combinatorial approach? • DP approach: how can we decompose the problem into sub-problems ? 6/5/2021 CSE 5290, Fall 2011 63
Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid 6/5/2021 Source CSE 5290, Fall 2011 * * *Sink 64
Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid 6/5/2021 Source CSE 5290, Fall 2011 * * *Sink 65
Manhattan Tourist Problem: Formulation Goal: Find the longest path in a weighted grid. Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink” Output: A longest path in G from “source” to “sink” 6/5/2021 CSE 5290, Fall 2011 66
MTP: An Example 0 source 0 0 1 3 1 i coordinate 4 5 6/5/2021 5 13 15 0 3 CSE 5290, Fall 2011 2 1 4 19 1 2 20 3 5 2 j coordinate 3 2 8 6 1 3 3 0 2 5 4 9 4 4 4 7 3 3 4 5 2 0 3 2 6 4 2 2 0 3 1 4 3 2 2 23 sink 67
MTP: Greedy Algorithm Is Not Optimal source 1 2 3 5 2 3 10 5 3 0 1 4 3 5 0 0 5 5 1 2 promising start, but leads to bad choices! 5 0 2 0 18 6/5/2021 CSE 5290, Fall 2011 22 sink 68
MTP: Simple Recursive Program MT(n, m) if n=0 or m=0 return MT(n, m) x MT(n-1, m)+ length of the edge from (n- 1, m) to (n, m) y MT(n, m-1)+ length of the edge from (n, m-1) to (n, m) return max{x, y} 6/5/2021 CSE 5290, Fall 2011 69
MTP: Simple Recursive Program MT(n, m) x MT(n-1, m)+ length of the edge from (n- 1, m) to (n, m) y MT(n, m-1)+ length of the edge from (n, m-1) to (n, m) return max{x, y} What’s wrong with this approach? 6/5/2021 CSE 5290, Fall 2011 70
MTP: Dynamic Programming j 0 source 1 0 i 1 1 S 0, 1 = 1 5 S 1, 0 = 5 • Calculate optimal path score for each vertex in the graph • Each vertex’s score is the maximum of the prior vertices score plus the weight of the respective edge in between 6/5/2021 CSE 5290, Fall 2011 71
MTP: Dynamic Programming j(cont’d) 0 source 1 1 0 i 3 3 S 0, 2 = 3 -5 1 5 4 S 1, 1 = 4 3 6/5/2021 2 1 5 2 2 8 S 2, 0 = 8 CSE 5290, Fall 2011 72
MTP: Dynamic Programming (cont’d) j 0 source 1 1 0 i 10 8 S 3, 0 = 8 1 5 4 13 S 1, 2 = 13 5 3 -5 2 8 0 6/5/2021 5 3 3 -5 1 3 2 1 5 3 2 8 S 3, 0 = 8 9 S 2, 1 = 9 CSE 5290, Fall 2011 73
MTP: Dynamic Programming j 0 source (cont’d) 1 2 1 0 2 5 1 i 5 13 5 -3 8 S 1, 3 = 8 3 -5 8 9 0 -5 -5 4 3 8 10 1 5 2 3 3 -5 1 3 0 12 S 2, 2 = 12 0 3 6/5/2021 8 greedy alg. fails! 9 CSE 5290, Fall 2011 S 3, 1 = 9 74
MTP: Dynamic Programming j 0 source 1 5 13 5 12 0 -5 0 8 3 9 0 2 3 8 8 -3 -5 2 -5 -5 4 3 8 10 1 5 3 3 3 -5 1 6/5/2021 2 1 5 3 2 1 0 i (cont’d) 15 S 2, 3 = 15 0 9 9 S 3, 2 = 9 CSE 5290, Fall 2011 75
MTP: Dynamic Programming j 0 source 1 5 12 0 8 15 -5 0 1 0 0 9 (showing all back-traces) 3 9 0 2 3 8 8 -3 -5 2 -5 13 5 Done! -5 4 3 8 10 1 5 3 3 3 -5 1 6/5/2021 2 1 5 3 2 1 0 i (cont’d) 9 CSE 5290, Fall 2011 16 S 3, 3 = 16 76
MTP: Recurrence Computing the score for a point (i, j) by the recurrence relation: si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j) si, j-1 + weight of the edge between (i, j-1) and (i, j) The running time is n x m for a n by m grid (n = # of rows, m = # of columns) 6/5/2021 CSE 5290, Fall 2011 77
Manhattan Is Not A Perfect Grid A 2 A 1 A 3 B What about diagonals? • The score at point B is given by: s. B = max of s. A 1 + weight of the edge (A 1, B) s. A 2 + weight of the edge (A 2, B) s. A 3 + weight of the edge (A 3, B) 6/5/2021 CSE 5290, Fall 2011 78
Manhattan Is Not A Perfect Grid (contd) Computing the score for point x is given by the recurrence relation: sx = max of sy + weight of vertex (y, x) where y є Predecessors(x) • Predecessors (x) – set of vertices that have edges leading to x • The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E) since each edge is evaluated once 6/5/2021 CSE 5290, Fall 2011 79
Traveling in the Grid • The only hitch is that one must decide on the order in which visit the vertices • By the time the vertex x is analyzed, the values sy for all its predecessors y should be computed – otherwise we are in trouble. • We need to traverse the vertices in some order • Try to find such order for a directed cycle ? ? ? 6/5/2021 CSE 5290, Fall 2011 80
DAG: Directed Acyclic Graph • Since Manhattan is not a perfect regular grid, we represent it as a DAG • DAG for Dressing in the morning problem 6/5/2021 CSE 5290, Fall 2011 81
Topological Ordering • A numbering of vertices of the graph is called topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label • In other words, if vertices are positioned on a line in an increasing order of labels then all edges go from left to right. 6/5/2021 CSE 5290, Fall 2011 82
Topological ordering • 2 different topological orderings of the DAG 6/5/2021 CSE 5290, Fall 2011 83
Longest Path in DAG Problem • Goal: Find a longest path between two vertices in a weighted DAG • Input: A weighted DAG G with source and sink vertices • Output: A longest path in G from source to sink 6/5/2021 CSE 5290, Fall 2011 84
Longest Path in DAG: Dynamic Programming • Suppose vertex v has indegree 3 and predecessors {u 1, u 2, u 3} • Longest path to v from source is: su 1 + weight of edge from u 1 to v sv = max of su 2 + weight of edge from u 2 to v su 3 + weight of edge from u 3 to v In General: sv = maxu (su + weight of edge from u to v) 6/5/2021 CSE 5290, Fall 2011 85
Traversing the Manhattan Grid • 3 different strategies: – a) Column by column – b) Row by row – c) Along diagonals 6/5/2021 a) b) c) CSE 5290, Fall 2011 86
Sequence alignment • Fundamental problem • Many different versions 6/5/2021 CSE 5290, Fall 2011 87
Alignment: 2 row representation Given 2 DNA sequences v and w: v : ATGTTAT w: ATCGTAC m=7 n=7 Alignment : 2 * k matrix ( k > m, n ) letters of v A T -- G T T A T -- letters of w A T C G T -- A -- C 4 matches 6/5/2021 2 insertions CSE 5290, Fall 2011 2 deletions 88
Aligning DNA Sequences V = ATCTGATG W = TGCATAC match n=8 m=7 mismatch 4 1 2 2 matches mismatches insertions deletions C T G A T G V A T T G C A T A C W indels 6/5/2021 deletion insertion CSE 5290, Fall 2011 89
Aligning DNA Sequences - 2 • Brute force is infeasible…. • Number of alignments of X[1. . n], Y[1. . m], n<m is ( ) m+n n • For m=n, this is about 22 n/pn 6/5/2021 CSE 5290, Fall 2011 90
Longest Common Subsequence (LCS) – Alignment without Mismatches • Given two sequences v = v 1 v 2…vm and w = w 1 w 2…wn • The LCS of v and w is a sequence of positions in v: 1 < i 2 < … < it < m and a sequence of positions in w: 1 < j 2 < … < jt < n such that it -th letter of v equals to jt-letter of w and t is maximal 6/5/2021 CSE 5290, Fall 2011 91
LCS: Example 0 1 2 2 3 3 4 5 6 7 8 elements of v A T -- C -- T G A T C elements of w -- T 1 G C 2 3 A 4 T 5 -- -- C 7 i coords: j coords: 0 0 5 A 6 6 (0, 0) (1, 0) (2, 1) (2, 2) (3, 3) (3, 4) (4, 5) (5, 5) (6, 6) (7, 6) (8, 7) Matches shown in red positions in v: 2 < 3 < 4 < 6 < 8 positions in w: 1 < 3 < 5 < 6 < 7 Every common subsequence is a path in 2 -D grid 6/5/2021 CSE 5290, Fall 2011 92
LCS Problem as Manhattan Tourist Problem j A T C T G A T C 0 1 2 3 4 5 6 7 8 i 0 T 1 G 2 C 3 A 4 T 5 A 6 C 7 6/5/2021 CSE 5290, Fall 2011 94
Edit Graph for LCS Problem j A T C T G A T C 0 1 2 3 4 5 6 7 8 i 0 T 1 G 2 C 3 A 4 T 5 A 6 C 7 6/5/2021 CSE 5290, Fall 2011 95
Edit Graph for LCS Problem j A T C T G A T C 0 1 2 3 4 5 6 7 8 i 0 T 1 Every diagonal edge adds an extra element to common subsequence G 2 C 3 A 4 T 5 LCS Problem: Find a path with maximum number of diagonal edges A 6 C 7 6/5/2021 Every path is a common subsequence. CSE 5290, Fall 2011 96
Computing LCS Let vi = prefix of v of length i: v 1 … vi and wj = prefix of w of length j: w 1 … wj The length of LCS(vi, wj) is computed by: si, j = max si-1, j si, j-1 si-1, j-1 + 1 if vi = wj 6/5/2021 CSE 5290, Fall 2011 97
Computing LCS (cont’d) i-1, j -1 si, j = MAX 6/5/2021 si-1, j + 0 si, j -1 + 0 si-1, j -1 + 1, 1 i, j -1 if vi = wj CSE 5290, Fall 2011 0 0 i, j 98
Every Path in the Grid Corresponds to an Alignment W A 0 V T 1 C 2 G 3 T 4 6/5/2021 AT- GT | | 1 T 012 2 34 V= 0 A 4 W= | ATCG– 012 344 CSE 5290, Fall 2011 99
Aligning Sequences without Insertions and Deletions: Hamming Distance Given two DNA sequences v and w : v : AT AT w : T AT A • The Hamming distance: d. H(v, w) = 8 is large but the sequences are very similar 6/5/2021 CSE 5290, Fall 2011 100
Aligning Sequences with Insertions and Deletions By shifting one sequence over one position: v : AT AT -w : -- T AT A • The edit distance: d. H(v, w) = 2. • Hamming distance neglects insertions and deletions in DNA 6/5/2021 CSE 5290, Fall 2011 101
Edit Distance Levenshtein (1966) introduced edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other d(v, w) = MIN number of elementary operations to transform v w 6/5/2021 CSE 5290, Fall 2011 102
Edit Distance vs Hamming Distance Hamming distance always compares i-th letter of v with i-th letter of w V = ATAT W = TATA Hamming distance: d(v, w)=8 Computing Hamming distance is a trivial task. 6/5/2021 CSE 5290, Fall 2011 103
Edit Distance vs Hamming Distance Hamming distance always compares i-th letter of v with i-th letter of w V = ATAT W = TATA Just one shift Make it all line up Hamming distance: d(v, w)=8 V = - ATAT W = TATA Edit distance: d(v, w)=2 Computing Hamming distance is a trivial task 6/5/2021 Edit distance may compare i-th letter of v with j-th letter of w CSE 5290, Fall 2011 Computing edit distance is a non-trivial task 104
Edit Distance vs Hamming Distance Hamming distance always compares i-th letter of v with i-th letter of w Edit distance may compare i-th letter of v with j-th letter of w V = ATAT V = - ATAT W = TATATATA Hamming distance: Edit distance: d(v, w)=8 d(v, w)=2 (one insertion and one deletion) How to find what j goes with what i ? ? ? 6/5/2021 CSE 5290, Fall 2011 105
Edit Distance: Example TGCATAT ATCCGAT in 5 steps TGCATAT TGCATA TGCAT ATCCAT ATCCGAT 6/5/2021 (delete last T) (delete last A) (insert A at front) (substitute C for 3 rd G) (insert G before last A) (Done) CSE 5290, Fall 2011 106
Edit Distance: Example TGCATAT ATCCGAT in 5 steps TGCATAT TGCATA TGCAT ATCCAT ATCCGAT (delete last T) (delete last A) (insert A at front) (substitute C for 3 rd G) (insert G before last A) (Done) What is the edit distance? 5? 6/5/2021 CSE 5290, Fall 2011 107
Edit Distance: Example (cont’d) TGCATAT ATCCGAT in 4 steps TGCATAT (insert A at front) ATGCATAT (delete 6 th T) ATGCATA (substitute G for 5 th A) ATGCGTA (substitute C for 3 rd G) ATCCGAT (Done) 6/5/2021 CSE 5290, Fall 2011 108
Edit Distance: Example (cont’d) TGCATAT ATCCGAT in 4 steps TGCATAT (insert A at front) ATGCATAT (delete 6 th T) ATGCATA (substitute G for 5 th A) ATGCGTA (substitute C for 3 rd G) ATCCGAT (Done) Can it be done in 3 steps? ? ? 6/5/2021 CSE 5290, Fall 2011 109
The Alignment Grid – Every alignment path is from source to sink 6/5/2021 CSE 5290, Fall 2011 110
Alignment as a Path in the Edit Graph 0 1 A A 0 1 2 T T 2 2 _ C 3 3 G G 4 4 T T 5 5 T _ 5 6 A A 6 7 T _ 6 7 _ C 7 - Corresponding path (0, 0) , (1, 1) , (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 7) 6/5/2021 CSE 5290, Fall 2011 111
Alignments in Edit Graph (cont’d) and represent indels in v and w with score 0. represent matches with score 1. • The score of the alignment path is 5. 6/5/2021 CSE 5290, Fall 2011 112
Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment: 6/5/2021 CSE 5290, Fall 2011 113
Alignment as a Path in the Edit Graph Old Alignment 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 New Alignment 0122345677 v= AT_GTTAT_ w= ATCG_TA_C 0123445667 6/5/2021 CSE 5290, Fall 2011 114
Alignment as a Path in the Edit Graph 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 (0, 0) , (1, 1) , (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 7) 6/5/2021 CSE 5290, Fall 2011 115
Alignment: Dynamic Programming si, j = max si-1, j-1+1 if vi = wj si-1, j si, j-1 6/5/2021 CSE 5290, Fall 2011 116
Dynamic Programming Example Initialize 1 st row and 1 st column to be all zeroes. Or, to be more precise, initialize 0 th row and 0 th column to be all zeroes. 6/5/2021 CSE 5290, Fall 2011 117
Dynamic Programming Example Si, j = max Si-1, j-1 value from NW +1, if v = w i j Si-1, j value from North (top) Si, j-1 6/5/2021 CSE 5290, Fall 2011 value from West (left) 118
Alignment: Backtracking Arrows show where the score originated from. if from the top if from the left if vi = wj 6/5/2021 CSE 5290, Fall 2011 119
Backtracking Example Find a match in row and column 2. i=2, j=2, 5 is a match (T). j=2, i=4, 5, 7 is a match (T). Since vi = wj, si, j = si-1, j-1 +1 s 2, 2 s 2, 5 s 4, 2 s 5, 2 s 7, 2 6/5/2021 = = = CSE 5290, Fall 2011 [s 1, 4 [s 3, 1 [s 4, 1 [s 6, 1 = = = 1] 1] 1] + + + 1 1 120
Backtracking Example Continuing with the dynamic programming algorithm gives this result. 6/5/2021 CSE 5290, Fall 2011 121
Alignment: Dynamic Programming si, j = max si-1, j-1+1 if vi = wj si-1, j si, j-1 6/5/2021 CSE 5290, Fall 2011 122
Alignment: Dynamic Programming si, j = max si-1, j-1+1 if vi = wj si-1, j+0 si, j-1+0 This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero. 6/5/2021 CSE 5290, Fall 2011 123
LCS Algorithm LCS(v, w) for i 1 to n si, 0 0 for j 1 to m s 0, j 0 for i 1 to n for j 1 to m si-1, j si, j max si, j-1 si-1, j-1 + 1, if vi = wj “ “ if si, j = si-1, j bi, j “ “ if si, j = si, j-1 “ “ if si, j = si-1, j-1 + 1 return (sn, m, b) 6/5/2021 CSE 5290, Fall 2011 124
Now What? • LCS(v, w) created the alignment grid • Now we need a way to read the best alignment of v and w • Follow the arrows backwards from sink 6/5/2021 CSE 5290, Fall 2011 125
Printing LCS: Backtracking 1. Print. LCS(b, v, i, j) 2. if i = 0 or j = 0 3. return 4. if bi, j = “ “ 5. Print. LCS(b, v, i-1, j-1) 6. print vi else 7. 8. if bi, j = “ “ 9. Print. LCS(b, v, i-1, j) 10. else 11. Print. LCS(b, v, i, j-1) 6/5/2021 CSE 5290, Fall 2011 126
LCS Runtime • It takes O(nm) time to fill in the nxm dynamic programming matrix. • Why O(nm)? The pseudocode consists of a nested “for” loop inside of another “for” loop to set up a nxm matrix. 6/5/2021 CSE 5290, Fall 2011 127
Why does DP work? • Avoids re-computing the same subproblems • Limits the amount of work done in each step 6/5/2021 CSE 5290, Fall 2011 128
When is DP applicable? – Optimal substructure: Optimal solution to problem (instance) contains optimal solutions to sub-problems – Overlapping sub-problems: Limited number of distinct sub-problems, repeated many times 6/5/2021 CSE 5290, Fall 2011 129
Next: • More realistic sequence alignment algorithms Types: • • Global Alignment Scoring Matrices Local Alignment with Affine Gap Penalties 6/5/2021 CSE 5290, Fall 2011 130
From LCS to Alignment: Change up the Scoring • The Longest Common Subsequence (LCS) problem— the simplest form of sequence alignment – allows only insertions and deletions (no mismatches). • In the LCS Problem, we scored 1 for matches and 0 for indels • Consider penalizing indels and mismatches with negative scores • Simplest scoring schema: +1 : match premium -μ : mismatch penalty -σ : indel penalty 6/5/2021 CSE 5290, Fall 2011 131
Simple Scoring • When mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels) 6/5/2021 CSE 5290, Fall 2011 132
The Global Alignment Problem Find the best alignment between two strings under a given scoring schema Input : Strings v and w and a scoring schema Output : Alignment of maximum score ↑→ = -σ = 1 if match = -µ if mismatch si, j = max 6/5/2021 si-1, j-1 +1 if vi = wj s i-1, j-1 -µ if vi ≠ wj s i-1, j - σ s i, j-1 - σ CSE 5290, Fall 2011 m : mismatch penalty σ : indel penalty 133
Scoring Matrices To generalize scoring, consider a (4+1) x(4+1) scoring matrix δ. In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. The addition of 1 is to include the score for comparison of a gap character “-”. This will simplify the algorithm as follows: si-1, j-1 + δ (vi, wj) si, j = max s i-1, j + δ (vi, -) s i, j-1 + δ (-, wj) 6/5/2021 CSE 5290, Fall 2011 134
Measuring Similarity • Measuring the extent of similarity between two sequences – Based on percent sequence identity – Based on conservation 6/5/2021 CSE 5290, Fall 2011 135
Percent Sequence Identity • The extent to which two nucleotide or amino acid sequences are invariant AC C TG A G – AG AC G TG – G C AG mismatch 70% identical 6/5/2021 CSE 5290, Fall 2011 indel 136
Making a Scoring Matrix • Scoring matrices are created based on biological evidence. • Alignments can be thought of as two sequences that differ due to mutations. • Some of these mutations have little effect on the protein’s function, therefore some penalties, δ(vi , wj), will be less harsh than others. 6/5/2021 CSE 5290, Fall 2011 137
Scoring Matrix: Example A R A 5 -2 -1 -1 R - 7 -1 3 N - - 7 0 K - - - 6 6/5/2021 N K • Notice that although R and K are different amino acids, they have a positive score. • Why? They are both positively charged amino acids will not greatly change function of protein. CSE 5290, Fall 2011 138
Conservation • Amino acid changes that tend to preserve the physico-chemical properties of the original residue – Polar to polar • aspartate glutamate – Nonpolar to nonpolar • alanine valine – Similarly behaving residues • leucine to isoleucine 6/5/2021 CSE 5290, Fall 2011 139
Scoring matrices • Amino acid substitution matrices – PAM – BLOSUM • DNA substitution matrices – DNA is less conserved than protein sequences – Less effective to compare coding regions at nucleotide level 6/5/2021 CSE 5290, Fall 2011 140
PAM • Point Accepted Mutation (Dayhoff et al. ) • 1 PAM = PAM 1 = 1% average change of all amino acid positions – After 100 PAMs of evolution, not every residue will have changed • some residues may have mutated several times • some residues may have returned to their original state • some residues may not changed at all 6/5/2021 CSE 5290, Fall 2011 141
PAMX • PAMx = PAM 1 x – PAM 250 = PAM 1250 • PAM 250 is a widely used scoring matrix: Ala Arg Asn Asp Cys Gln. . . Trp Tyr Val 6/5/2021 A R N D C Q Ala A 13 3 4 5 2 3 Arg R 6 17 4 4 1 5 Asn N 9 4 6 8 1 5 Asp D 9 3 7 11 1 6 Cys C 5 2 2 1 52 1 Gln Q 8 5 5 7 1 10 Glu E 9 3 6 10 1 7 Gly G 12 2 4 5 2 3 His H 6 6 2 7 Ile I 8 3 3 3 2 2 W Y V 0 1 7 2 1 4 0 2 4 0 1 4 0 3 4 0 1 4 1 3 5 0 2 4 CSE 5290, Fall 2011 Leu L 6 2 2 2 1 3 Lys. . . K. . . 7. . . 9 5 5 1 2 15 0 1 10 142
BLOSUM • Blocks Substitution Matrix • Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins • Matrix name indicates evolutionary distance – BLOSUM 62 was created using sequences sharing no more than 62% identity 6/5/2021 CSE 5290, Fall 2011 143
The Blosum 50 Scoring Matrix 6/5/2021 CSE 5290, Fall 2011 144
Local vs. Global Alignment • The Global Alignment Problem tries to find the longest path between vertices (0, 0) and (n, m) in the edit graph. • The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i, j) and (i’, j’) in the edit graph. 6/5/2021 CSE 5290, Fall 2011 145
Local vs. Global Alignment • The Global Alignment Problem tries to find the longest path between vertices (0, 0) and (n, m) in the edit graph. • The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i, j) and (i’, j’) in the edit graph. • In the edit graph with negatively-scored edges, Local Alignmet may score higher than Global Alignment 6/5/2021 CSE 5290, Fall 2011 146
Local vs. Global Alignment (cont’d) • Global Alignment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | ||| || | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C • Local Alignment—better alignment to find conserved segment tcc. CAGTTATGTCAGgggacacgagcatgcagagac |||||| aattgccgccgtcgttttcag. CAGTTATGTCAGatc 6/5/2021 CSE 5290, Fall 2011 147
Local Alignment: Example Local alignment Compute a “mini” Global Alignment to get Local Global alignment 6/5/2021 CSE 5290, Fall 2011 148
- Slides: 147