CSE 5290 Algorithms for Bioinformatics Fall 2011 Suprakash

CSE 5290: Algorithms for Bioinformatics Fall 2011 Suprakash Datta datta@cse. yorku. ca Office: CSEB 3043 Phone: 416 -736 -2100 ext 77875 Course page: http: //www. cse. yorku. ca/course/5290 6/5/2021 CSE 5290, Fall 2011 1

Last time Finding Regulatory Motifs in DNA sequences (exhaustive search variants) Next: Greedy algorithms The following slides are based on slides by the authors of our text. 6/5/2021 CSE 5290, Fall 2011 2

Turnip vs Cabbage: Look and Taste Different • Although cabbages and turnips share a recent common ancestor, they look and taste different 6/5/2021 CSE 5290, Fall 2011 3

Turnip vs Cabbage - 2 6/5/2021 CSE 5290, Fall 2011 4

Turnip vs Cabbage: Almost Identical mt. DNA gene sequences • In 1980 s Jeffrey Palmer studied evolution of plant organelles by comparing mitochondrial genomes of the cabbage and turnip • 99% similarity between genes • These surprisingly identical gene sequences differed in gene order • This study helped pave the way to analyzing genome rearrangements in molecular evolution 6/5/2021 CSE 5290, Fall 2011 5

Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: 6/5/2021 CSE 5290, Fall 2011 6

Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: 6/5/2021 CSE 5290, Fall 2011 7

Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: 6/5/2021 CSE 5290, Fall 2011 8

Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: 6/5/2021 CSE 5290, Fall 2011 9

Turnip vs Cabbage: Different mt. DNA Gene Order • Gene order comparison: Before After Evolution is manifested as the divergence in gene order 6/5/2021 CSE 5290, Fall 2011 10

Transforming Cabbage into Turnip 6/5/2021 CSE 5290, Fall 2011 11

Genome rearrangements Mouse (X chrom. ) Unknown ancestor ~ 75 million years ago Human (X chrom. ) • What are the similarity blocks and how to find them? • What is the architecture of the ancestral genome? • What is the evolutionary scenario for transforming one genome into the other? 6/5/2021 CSE 5290, Fall 2011 12

History of Chromosome X Rat Consortium, Nature, 2004 6/5/2021 CSE 5290, Fall 2011 13

Reversals 1 2 3 9 8 4 7 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 • 6/5/2021 10 6 5 Blocks represent conserved genes. CSE 5290, Fall 2011 14

Reversals 1 2 3 9 8 4 7 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 n n 6/5/2021 10 6 5 Blocks represent conserved genes. In the course of evolution or in a clinical context, blocks 1, …, 10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10. CSE 5290, Fall 2011 15

Reversals and Breakpoints 1 2 3 9 8 4 7 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 10 6 5 The reversion introduced two breakpoints (disruptions in order). 6/5/2021 CSE 5290, Fall 2011 16

Reversals: Example 5’ ATGCCTGTACTA 3’ Break and Invert 3’ TACGGACATGAT 5’ 5’ ATGTACAGGCTA 3’ 3’ TACATGTCCGAT 5’ 6/5/2021 CSE 5290, Fall 2011 17

Types of Rearrangements Reversal 1 2 3 4 5 6 1 2 -5 -4 -3 6 Translocation 1 2 3 45 6 1 26 4 53 Fusion 1 2 3 4 5 6 Fission 6/5/2021 CSE 5290, Fall 2011 18

Comparative Genomic Architectures: Mouse vs Human Genome • Humans and mice have similar genomes, but their genes are ordered differently • ~245 rearrangements – Reversals – Fusions – Fissions – Translocation 6/5/2021 CSE 5290, Fall 2011 19

Waardenburg’s Syndrome: Mouse Provides Insight into Human Genetic Disorder Waardenburg’s syndrome is characterized by pigmentary dysphasia • Gene implicated in the disease was linked to human chromosome 2 but it was not clear where exactly it is located on chromosome 2 6/5/2021 CSE 5290, Fall 2011 20

Waardenburg’s syndrome and splotch mice • A breed of mice (with splotch gene) had similar symptoms caused by the same type of gene as in humans • Scientists succeeded in identifying location of gene responsible for disorder in mice • Finding the gene in mice gives clues to where the same gene is located in humans 6/5/2021 CSE 5290, Fall 2011 21

Reversals: Example p=12345678 r(3, 5) 12543678 r(5, 6) 12546378 6/5/2021 CSE 5290, Fall 2011 22

Reversals and Gene Orders • Gene order is represented by a permutation p: p = p 1 ------ p i-1 p i+1 ------ p j-1 p j+1 ----- p n r(i, j) p 1 ------ p i-1 p j-1 ------ p i+1 p i p j+1 ----- pn l Reversal r ( i, j ) reverses (flips) the elements from i to j in p 6/5/2021 CSE 5290, Fall 2011 23

Reversal Distance Problem Goal: Given two permutations p, s, find the shortest series of reversals that transforms p into s Input: Permutations p and s Output: A series of reversals r 1, …rt transforming p into s, such that t is minimum Notation: • t - reversal distance between p and s • d(p, s) - smallest possible value of t, given p and s 6/5/2021 CSE 5290, Fall 2011 24

Sorting By Reversals Problem Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n ) • Input: Permutation p • Output: A series of reversals r 1, … rt transforming p into the identity permutation such that t is minimum 6/5/2021 CSE 5290, Fall 2011 25

Sorting By Reversals: Example • t =d(p ) - reversal distance of p • Example : p = 3 4 2 1 5 6 4 3 2 1 5 6 1 2 3 4 5 6 So d(p ) = 3 6/5/2021 CSE 5290, Fall 2011 7 10 9 8 7 8 9 10 26

Sorting by reversals: 5 steps 6/5/2021 CSE 5290, Fall 2011 27

Sorting by reversals: 4 steps What is the reversal distance for this permutation? Can it be sorted in 3 steps? 6/5/2021 CSE 5290, Fall 2011 28

Pancake Flipping Problem • The chef is sloppy; he prepares an unordered stack of pancakes of different sizes • The waiter wants to rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom) • He does it by flipping over several from the top, repeating this as many times as necessary 6/5/2021 Christos Papadimitrou and Bill Gates flip pancakes CSE 5290, Fall 2011 29

Pancake Flipping Problem: Formulation Goal: Given a stack of n pancakes, what is the minimum number of flips to rearrange them into perfect stack? • Input: Permutation p • Output: A series of prefix reversals r 1, … rt transforming p into the identity permutation such that t is minimum 6/5/2021 CSE 5290, Fall 2011 30

Pancake Flipping Problem: Greedy Algorithm • Greedy approach: 2 prefix reversals at most to place a pancake in its right position, 2 n – 2 steps total at most • William Gates and Christos Papadimitriou showed in the mid-1970 s that this problem can be solved by at most 5/3 (n + 1) prefix reversals 6/5/2021 CSE 5290, Fall 2011 31

Sorting By Reversals: A Greedy Algorithm • If sorting permutation p = 1 2 3 6 4 5, the first three elements are already in order so it does not make any sense to break them. • The length of the already sorted prefix of p is denoted prefix(p) – prefix(p) = 3 • This results in an idea for a greedy algorithm: increase prefix(p) at every step 6/5/2021 CSE 5290, Fall 2011 32

Greedy Algorithm: An Example • Doing so, p can be sorted 123645 123465 123456 • Number of steps to sort permutation of length n is at most (n – 1) 6/5/2021 CSE 5290, Fall 2011 33

Greedy Algorithm: Pseudocode Simple. Reversal. Sort(p) 1 for i 1 to n – 1 2 j position of element i in p (i. e. , pj = i) 3 if j ≠i 4 p p * r(i, j) 5 output p 6 if p is the identity permutation 7 return 6/5/2021 CSE 5290, Fall 2011 34

Analyzing Simple. Reversal. Sort • Simple. Reversal. Sort does not guarantee the smallest number of reversals and takes five steps on p = 6 1 2 3 4 5 : • • • 6/5/2021 Step Step 1: 2: 3: 4: 5: 1 1 1 6 2 2 2 6 3 3 3 6 4 4 4 6 5 5 5 6 CSE 5290, Fall 2011 35

Analyzing Simple. Reversal. Sort • But it can be sorted in two steps: p = 612345 – Step 1: 5 4 3 2 1 6 – Step 2: 1 2 3 4 5 6 • So, Simple. Reversal. Sort(p) is not optimal • Optimal algorithms are unknown for many problems; approximation algorithms are used 6/5/2021 CSE 5290, Fall 2011 36

Approximation Algorithms • These algorithms find approximate solutions rather than optimal solutions • The approximation ratio of an algorithm A on input p is: A(p) / OPT(p) where A(p) -solution produced by algorithm A OPT(p) - optimal solution of the problem 6/5/2021 CSE 5290, Fall 2011 37

Approximation Ratio/Performance Guarantee • Approximation ratio (performance guarantee) of algorithm A: max approximation ratio of all inputs of size n – For algorithm A that minimizes objective function (minimization algorithm): • max|p| = n A(p) / OPT(p) 6/5/2021 CSE 5290, Fall 2011 38

Approximation Ratio/Performance Guarantee • Approximation ratio (performance guarantee) of algorithm A: max approximation ratio of all inputs of size n – For algorithm A that minimizes objective function (minimization algorithm): • max|p| = n A(p) / OPT(p) – For maximization algorithm: • min|p| = n A(p) / OPT(p) 6/5/2021 CSE 5290, Fall 2011 39

Adjacencies and Breakpoints p = p 1 p 2 p 3…pn-1 pn • A pair of elements p i and p i + 1 are adjacent if pi+1 = pi + 1 • For example p=1 9 3 4 7 8 2 6 5 • (3, 4) or (7, 8) and (6, 5) are adjacent pairs 6/5/2021 CSE 5290, Fall 2011 40

Breakpoints: An Example There is a breakpoint between any adjacent element that are non-consecutive: p=1 9 3 4 7 8 2 6 5 • Pairs (1, 9), (9, 3), (4, 7), (8, 2) and (2, 6) form breakpoints of permutation p • b(p) - # breakpoints in permutation p 6/5/2021 CSE 5290, Fall 2011 41

Adjacency & Breakpoints • An adjacency - a pair of adjacent elements that are consecutive • A breakpoint - a pair of adjacent elements that are not consecutive π=5 6 2 1 3 4 Extend π with π0 = 0 and π7 = 7 adjacencies 0 5 6 2 1 3 4 7 breakpoints 6/5/2021 CSE 5290, Fall 2011 42

Extending Permutations • We put two elements p 0 =0 and p n + 1=n+1 at the ends of p Example: p=1 9 3 4 7 8 2 6 5 Extending with 0 and 10 p = 0 1 9 3 4 7 8 2 6 5 10 Note: A new breakpoint was created after extending 6/5/2021 CSE 5290, Fall 2011 43

Reversal Distance and Breakpoints § Each reversal eliminates at most 2 breakpoints. §This implies: reversal distance ≥ #breakpoints / 2 p =2 3 1 4 6 5 0 0 6/5/2021 2 1 1 1 3 3 2 2 1 2 3 3 4 4 6 6 6 5 5 6 7 7 CSE 5290, Fall 2011 b(p) = 5 b(p) = 4 b(p) = 2 b(p) = 0 44

Sorting By Reversals: A Better Greedy Algorithm Break. Point. Reversal. Sort(p) 1 while b(p) > 0 2 Among all possible reversals, choose reversal r minimizing b (p • r ) 3 p p • r(i, j) 4 output p 5 return Q: Does this algorithm terminate? 6/5/2021 CSE 5290, Fall 2011 45

Strips • Strip: an interval between two consecutive breakpoints in a permutation – Decreasing strip: strip of elements in decreasing order (e. g. 6 5 and 3 2 ). – Increasing strip: strip of elements in increasing order (e. g. 7 8) 0 1 9 4 3 7 8 2 5 6 10 – A single-element strip can be declared either increasing or decreasing. We will choose to declare them as decreasing with exception of the strips with 0 and n+1 6/5/2021 CSE 5290, Fall 2011 46

Reducing the Number of Breakpoints Theorem 1: If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i. e. b(p • r) < b(p) ) 6/5/2021 CSE 5290, Fall 2011 47

Things To Consider • For p = 1 4 6 5 7 8 3 2 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – Choose decreasing strip with the smallest element k in p ( k = 2 in this case) – Find k – 1 in the permutation 6/5/2021 CSE 5290, Fall 2011 48

Things To Consider (cont’d) • For p = 1 4 6 5 7 8 3 2 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – Choose decreasing strip with the smallest element k in p ( k = 2 in this case) – Find k – 1 in the permutation – Reverse the segment between k and k-1: – 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – 0 1 2 3 8 7 5 6 4 9 6/5/2021 CSE 5290, Fall 2011 b(p) = 4 49

Reducing the Number of Breakpoints Again – If there is no decreasing strip, there may be no reversal r that reduces the number of breakpoints (i. e. b(p • r) ≥ b(p) for any reversal r). – By reversing an increasing strip ( # of breakpoints stay unchanged ), we will create a decreasing strip at the next step. Then the number of breakpoints will be reduced in the next step (theorem 1). 6/5/2021 CSE 5290, Fall 2011 50

Things To Consider (cont’d) • There are no decreasing strips in p, for: p =0 1 2 5 6 7 3 4 8 b(p) = 3 p • r(6, 7) = 0 1 2 5 6 7 4 3 8 b(p) = 3 ü r(6, 7) does not change the # of breakpoints ü r(6, 7) creates a decreasing strip thus guaranteeing that the next step will decrease the # of breakpoints. 6/5/2021 CSE 5290, Fall 2011 51

Improved. Breakpoint. Reversal. Sort(p) 1 while b(p) > 0 2 if p has a decreasing strip 3 Among all possible reversals, choose reversal r that minimizes b(p • r) 4 5 else Choose a reversal r that flips an increasing strip in p 6 p p • r 7 output p 8 return 6/5/2021 CSE 5290, Fall 2011 52

Improved. Breakpoint. Reversal. Sort: Performance Guarantee • Improved. Break. Point. Reversal. Sort is an approximation algorithm with a performance guarantee of at most 4 – It eliminates at least one breakpoint in every two steps; at most 2 b(p) steps – Approximation ratio: 2 b(p) / d(p) – Optimal algorithm eliminates at most 2 breakpoints in every step: d(p) b(p) / 2 – Performance guarantee: • ( 2 b(p) / d(p) ) [ 2 b(p) / (b(p) / 2) ] = 4 6/5/2021 CSE 5290, Fall 2011 53

Signed Permutations • Up to this point, all permutations to sort were unsigned • But genes have directions… so we should consider signed permutations 5’ p = 6/5/2021 3’ 1 -2 - 3 4 CSE 5290, Fall 2011 -5 54

Signed Permutations • Algorithms are a little more involved. • Possible project topic 6/5/2021 CSE 5290, Fall 2011 55

GRIMM Web Server • Real genome architectures are represented by signed permutations • Efficient algorithms to sort signed permutations have been developed • GRIMM web server computes the reversal distances between signed permutations: 6/5/2021 CSE 5290, Fall 2011 56

GRIMM Web Server http: //www-cse. ucsd. edu/groups/bioinformatics/GRIMM 6/5/2021 CSE 5290, Fall 2011 57

Next Dynamic programming, sequence alignment Some of the following slides are based on slides by the authors of our text. 6/5/2021 CSE 5290, Fall 2011 58

Dynamic programming (DP) • Typically used for optimization problems • Often results in efficient algorithms • Not applicable to all problems Caveats: • Need not yield poly-time algorithms • No unique formulations for most problems • May not rule out greedy algorithms 6/5/2021 CSE 5290, Fall 2011 59

Example • Counting the number of shortest paths in a grid with blocked intersections • Finding paths in a weighted grid • Sequence alignment 6/5/2021 CSE 5290, Fall 2011 60

Setting up DP in practice • The optimal solution should be computable as a (recursive) function of the solution to sub-problems • Solve sub-problems systematically and store solutions (to avoid duplication of work). 6/5/2021 CSE 5290, Fall 2011 61

Number of paths in a grid Problem: Travel from the top-left to the bottom right of a rectangular grid using only right and down moves • Combinatorial approach • DP approach: how can we decompose the problem into sub-problems ? 6/5/2021 CSE 5290, Fall 2011 62

Number of paths in a grid with blocked intersections Problem: Same as before but some grid points are blocked and cannot be used • Combinatorial approach? • DP approach: how can we decompose the problem into sub-problems ? 6/5/2021 CSE 5290, Fall 2011 63

Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid 6/5/2021 Source CSE 5290, Fall 2011 * * *Sink 64

Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid 6/5/2021 Source CSE 5290, Fall 2011 * * *Sink 65

Manhattan Tourist Problem: Formulation Goal: Find the longest path in a weighted grid. Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink” Output: A longest path in G from “source” to “sink” 6/5/2021 CSE 5290, Fall 2011 66

MTP: An Example 0 source 0 0 1 3 1 i coordinate 4 5 6/5/2021 5 13 15 0 3 CSE 5290, Fall 2011 2 1 4 19 1 2 20 3 5 2 j coordinate 3 2 8 6 1 3 3 0 2 5 4 9 4 4 4 7 3 3 4 5 2 0 3 2 6 4 2 2 0 3 1 4 3 2 2 23 sink 67

MTP: Greedy Algorithm Is Not Optimal source 1 2 3 5 2 3 10 5 3 0 1 4 3 5 0 0 5 5 1 2 promising start, but leads to bad choices! 5 0 2 0 18 6/5/2021 CSE 5290, Fall 2011 22 sink 68

MTP: Simple Recursive Program MT(n, m) if n=0 or m=0 return MT(n, m) x MT(n-1, m)+ length of the edge from (n- 1, m) to (n, m) y MT(n, m-1)+ length of the edge from (n, m-1) to (n, m) return max{x, y} 6/5/2021 CSE 5290, Fall 2011 69

MTP: Simple Recursive Program MT(n, m) x MT(n-1, m)+ length of the edge from (n- 1, m) to (n, m) y MT(n, m-1)+ length of the edge from (n, m-1) to (n, m) return max{x, y} What’s wrong with this approach? 6/5/2021 CSE 5290, Fall 2011 70

MTP: Dynamic Programming j 0 source 1 0 i 1 1 S 0, 1 = 1 5 S 1, 0 = 5 • Calculate optimal path score for each vertex in the graph • Each vertex’s score is the maximum of the prior vertices score plus the weight of the respective edge in between 6/5/2021 CSE 5290, Fall 2011 71

MTP: Dynamic Programming j(cont’d) 0 source 1 1 0 i 3 3 S 0, 2 = 3 -5 1 5 4 S 1, 1 = 4 3 6/5/2021 2 1 5 2 2 8 S 2, 0 = 8 CSE 5290, Fall 2011 72

MTP: Dynamic Programming (cont’d) j 0 source 1 1 0 i 10 8 S 3, 0 = 8 1 5 4 13 S 1, 2 = 13 5 3 -5 2 8 0 6/5/2021 5 3 3 -5 1 3 2 1 5 3 2 8 S 3, 0 = 8 9 S 2, 1 = 9 CSE 5290, Fall 2011 73

MTP: Dynamic Programming j 0 source (cont’d) 1 2 1 0 2 5 1 i 5 13 5 -3 8 S 1, 3 = 8 3 -5 8 9 0 -5 -5 4 3 8 10 1 5 2 3 3 -5 1 3 0 12 S 2, 2 = 12 0 3 6/5/2021 8 greedy alg. fails! 9 CSE 5290, Fall 2011 S 3, 1 = 9 74

MTP: Dynamic Programming j 0 source 1 5 13 5 12 0 -5 0 8 3 9 0 2 3 8 8 -3 -5 2 -5 -5 4 3 8 10 1 5 3 3 3 -5 1 6/5/2021 2 1 5 3 2 1 0 i (cont’d) 15 S 2, 3 = 15 0 9 9 S 3, 2 = 9 CSE 5290, Fall 2011 75

MTP: Dynamic Programming j 0 source 1 5 12 0 8 15 -5 0 1 0 0 9 (showing all back-traces) 3 9 0 2 3 8 8 -3 -5 2 -5 13 5 Done! -5 4 3 8 10 1 5 3 3 3 -5 1 6/5/2021 2 1 5 3 2 1 0 i (cont’d) 9 CSE 5290, Fall 2011 16 S 3, 3 = 16 76

MTP: Recurrence Computing the score for a point (i, j) by the recurrence relation: si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j) si, j-1 + weight of the edge between (i, j-1) and (i, j) The running time is n x m for a n by m grid (n = # of rows, m = # of columns) 6/5/2021 CSE 5290, Fall 2011 77

Manhattan Is Not A Perfect Grid A 2 A 1 A 3 B What about diagonals? • The score at point B is given by: s. B = max of s. A 1 + weight of the edge (A 1, B) s. A 2 + weight of the edge (A 2, B) s. A 3 + weight of the edge (A 3, B) 6/5/2021 CSE 5290, Fall 2011 78

Manhattan Is Not A Perfect Grid (contd) Computing the score for point x is given by the recurrence relation: sx = max of sy + weight of vertex (y, x) where y є Predecessors(x) • Predecessors (x) – set of vertices that have edges leading to x • The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E) since each edge is evaluated once 6/5/2021 CSE 5290, Fall 2011 79

Traveling in the Grid • The only hitch is that one must decide on the order in which visit the vertices • By the time the vertex x is analyzed, the values sy for all its predecessors y should be computed – otherwise we are in trouble. • We need to traverse the vertices in some order • Try to find such order for a directed cycle ? ? ? 6/5/2021 CSE 5290, Fall 2011 80

DAG: Directed Acyclic Graph • Since Manhattan is not a perfect regular grid, we represent it as a DAG • DAG for Dressing in the morning problem 6/5/2021 CSE 5290, Fall 2011 81

Topological Ordering • A numbering of vertices of the graph is called topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label • In other words, if vertices are positioned on a line in an increasing order of labels then all edges go from left to right. 6/5/2021 CSE 5290, Fall 2011 82

Topological ordering • 2 different topological orderings of the DAG 6/5/2021 CSE 5290, Fall 2011 83

Longest Path in DAG Problem • Goal: Find a longest path between two vertices in a weighted DAG • Input: A weighted DAG G with source and sink vertices • Output: A longest path in G from source to sink 6/5/2021 CSE 5290, Fall 2011 84

Longest Path in DAG: Dynamic Programming • Suppose vertex v has indegree 3 and predecessors {u 1, u 2, u 3} • Longest path to v from source is: su 1 + weight of edge from u 1 to v sv = max of su 2 + weight of edge from u 2 to v su 3 + weight of edge from u 3 to v In General: sv = maxu (su + weight of edge from u to v) 6/5/2021 CSE 5290, Fall 2011 85

Traversing the Manhattan Grid • 3 different strategies: – a) Column by column – b) Row by row – c) Along diagonals 6/5/2021 a) b) c) CSE 5290, Fall 2011 86

Sequence alignment • Fundamental problem • Many different versions 6/5/2021 CSE 5290, Fall 2011 87

Alignment: 2 row representation Given 2 DNA sequences v and w: v : ATGTTAT w: ATCGTAC m=7 n=7 Alignment : 2 * k matrix ( k > m, n ) letters of v A T -- G T T A T -- letters of w A T C G T -- A -- C 4 matches 6/5/2021 2 insertions CSE 5290, Fall 2011 2 deletions 88

Aligning DNA Sequences V = ATCTGATG W = TGCATAC match n=8 m=7 mismatch 4 1 2 2 matches mismatches insertions deletions C T G A T G V A T T G C A T A C W indels 6/5/2021 deletion insertion CSE 5290, Fall 2011 89

Aligning DNA Sequences - 2 • Brute force is infeasible…. • Number of alignments of X[1. . n], Y[1. . m], n<m is ( ) m+n n • For m=n, this is about 22 n/pn 6/5/2021 CSE 5290, Fall 2011 90

Longest Common Subsequence (LCS) – Alignment without Mismatches • Given two sequences v = v 1 v 2…vm and w = w 1 w 2…wn • The LCS of v and w is a sequence of positions in v: 1 < i 2 < … < it < m and a sequence of positions in w: 1 < j 2 < … < jt < n such that it -th letter of v equals to jt-letter of w and t is maximal 6/5/2021 CSE 5290, Fall 2011 91

LCS: Example 0 1 2 2 3 3 4 5 6 7 8 elements of v A T -- C -- T G A T C elements of w -- T 1 G C 2 3 A 4 T 5 -- -- C 7 i coords: j coords: 0 0 5 A 6 6 (0, 0) (1, 0) (2, 1) (2, 2) (3, 3) (3, 4) (4, 5) (5, 5) (6, 6) (7, 6) (8, 7) Matches shown in red positions in v: 2 < 3 < 4 < 6 < 8 positions in w: 1 < 3 < 5 < 6 < 7 Every common subsequence is a path in 2 -D grid 6/5/2021 CSE 5290, Fall 2011 92

LCS Problem as Manhattan Tourist Problem j A T C T G A T C 0 1 2 3 4 5 6 7 8 i 0 T 1 G 2 C 3 A 4 T 5 A 6 C 7 6/5/2021 CSE 5290, Fall 2011 94

Edit Graph for LCS Problem j A T C T G A T C 0 1 2 3 4 5 6 7 8 i 0 T 1 G 2 C 3 A 4 T 5 A 6 C 7 6/5/2021 CSE 5290, Fall 2011 95

Edit Graph for LCS Problem j A T C T G A T C 0 1 2 3 4 5 6 7 8 i 0 T 1 Every diagonal edge adds an extra element to common subsequence G 2 C 3 A 4 T 5 LCS Problem: Find a path with maximum number of diagonal edges A 6 C 7 6/5/2021 Every path is a common subsequence. CSE 5290, Fall 2011 96

Computing LCS Let vi = prefix of v of length i: v 1 … vi and wj = prefix of w of length j: w 1 … wj The length of LCS(vi, wj) is computed by: si, j = max si-1, j si, j-1 si-1, j-1 + 1 if vi = wj 6/5/2021 CSE 5290, Fall 2011 97

Computing LCS (cont’d) i-1, j -1 si, j = MAX 6/5/2021 si-1, j + 0 si, j -1 + 0 si-1, j -1 + 1, 1 i, j -1 if vi = wj CSE 5290, Fall 2011 0 0 i, j 98

Every Path in the Grid Corresponds to an Alignment W A 0 V T 1 C 2 G 3 T 4 6/5/2021 AT- GT | | 1 T 012 2 34 V= 0 A 4 W= | ATCG– 012 344 CSE 5290, Fall 2011 99

Aligning Sequences without Insertions and Deletions: Hamming Distance Given two DNA sequences v and w : v : AT AT w : T AT A • The Hamming distance: d. H(v, w) = 8 is large but the sequences are very similar 6/5/2021 CSE 5290, Fall 2011 100

Aligning Sequences with Insertions and Deletions By shifting one sequence over one position: v : AT AT -w : -- T AT A • The edit distance: d. H(v, w) = 2. • Hamming distance neglects insertions and deletions in DNA 6/5/2021 CSE 5290, Fall 2011 101

Edit Distance Levenshtein (1966) introduced edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other d(v, w) = MIN number of elementary operations to transform v w 6/5/2021 CSE 5290, Fall 2011 102

Edit Distance vs Hamming Distance Hamming distance always compares i-th letter of v with i-th letter of w V = ATAT W = TATA Hamming distance: d(v, w)=8 Computing Hamming distance is a trivial task. 6/5/2021 CSE 5290, Fall 2011 103

Edit Distance vs Hamming Distance Hamming distance always compares i-th letter of v with i-th letter of w V = ATAT W = TATA Just one shift Make it all line up Hamming distance: d(v, w)=8 V = - ATAT W = TATA Edit distance: d(v, w)=2 Computing Hamming distance is a trivial task 6/5/2021 Edit distance may compare i-th letter of v with j-th letter of w CSE 5290, Fall 2011 Computing edit distance is a non-trivial task 104

Edit Distance vs Hamming Distance Hamming distance always compares i-th letter of v with i-th letter of w Edit distance may compare i-th letter of v with j-th letter of w V = ATAT V = - ATAT W = TATATATA Hamming distance: Edit distance: d(v, w)=8 d(v, w)=2 (one insertion and one deletion) How to find what j goes with what i ? ? ? 6/5/2021 CSE 5290, Fall 2011 105

Edit Distance: Example TGCATAT ATCCGAT in 5 steps TGCATAT TGCATA TGCAT ATCCAT ATCCGAT 6/5/2021 (delete last T) (delete last A) (insert A at front) (substitute C for 3 rd G) (insert G before last A) (Done) CSE 5290, Fall 2011 106

Edit Distance: Example TGCATAT ATCCGAT in 5 steps TGCATAT TGCATA TGCAT ATCCAT ATCCGAT (delete last T) (delete last A) (insert A at front) (substitute C for 3 rd G) (insert G before last A) (Done) What is the edit distance? 5? 6/5/2021 CSE 5290, Fall 2011 107

Edit Distance: Example (cont’d) TGCATAT ATCCGAT in 4 steps TGCATAT (insert A at front) ATGCATAT (delete 6 th T) ATGCATA (substitute G for 5 th A) ATGCGTA (substitute C for 3 rd G) ATCCGAT (Done) 6/5/2021 CSE 5290, Fall 2011 108

Edit Distance: Example (cont’d) TGCATAT ATCCGAT in 4 steps TGCATAT (insert A at front) ATGCATAT (delete 6 th T) ATGCATA (substitute G for 5 th A) ATGCGTA (substitute C for 3 rd G) ATCCGAT (Done) Can it be done in 3 steps? ? ? 6/5/2021 CSE 5290, Fall 2011 109

The Alignment Grid – Every alignment path is from source to sink 6/5/2021 CSE 5290, Fall 2011 110

Alignment as a Path in the Edit Graph 0 1 A A 0 1 2 T T 2 2 _ C 3 3 G G 4 4 T T 5 5 T _ 5 6 A A 6 7 T _ 6 7 _ C 7 - Corresponding path (0, 0) , (1, 1) , (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 7) 6/5/2021 CSE 5290, Fall 2011 111

Alignments in Edit Graph (cont’d) and represent indels in v and w with score 0. represent matches with score 1. • The score of the alignment path is 5. 6/5/2021 CSE 5290, Fall 2011 112

Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment: 6/5/2021 CSE 5290, Fall 2011 113

Alignment as a Path in the Edit Graph Old Alignment 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 New Alignment 0122345677 v= AT_GTTAT_ w= ATCG_TA_C 0123445667 6/5/2021 CSE 5290, Fall 2011 114

Alignment as a Path in the Edit Graph 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 (0, 0) , (1, 1) , (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 7) 6/5/2021 CSE 5290, Fall 2011 115

Alignment: Dynamic Programming si, j = max si-1, j-1+1 if vi = wj si-1, j si, j-1 6/5/2021 CSE 5290, Fall 2011 116

Dynamic Programming Example Initialize 1 st row and 1 st column to be all zeroes. Or, to be more precise, initialize 0 th row and 0 th column to be all zeroes. 6/5/2021 CSE 5290, Fall 2011 117

Dynamic Programming Example Si, j = max Si-1, j-1 value from NW +1, if v = w i j Si-1, j value from North (top) Si, j-1 6/5/2021 CSE 5290, Fall 2011 value from West (left) 118

Alignment: Backtracking Arrows show where the score originated from. if from the top if from the left if vi = wj 6/5/2021 CSE 5290, Fall 2011 119

Backtracking Example Find a match in row and column 2. i=2, j=2, 5 is a match (T). j=2, i=4, 5, 7 is a match (T). Since vi = wj, si, j = si-1, j-1 +1 s 2, 2 s 2, 5 s 4, 2 s 5, 2 s 7, 2 6/5/2021 = = = CSE 5290, Fall 2011 [s 1, 4 [s 3, 1 [s 4, 1 [s 6, 1 = = = 1] 1] 1] + + + 1 1 120

Backtracking Example Continuing with the dynamic programming algorithm gives this result. 6/5/2021 CSE 5290, Fall 2011 121

Alignment: Dynamic Programming si, j = max si-1, j-1+1 if vi = wj si-1, j si, j-1 6/5/2021 CSE 5290, Fall 2011 122

Alignment: Dynamic Programming si, j = max si-1, j-1+1 if vi = wj si-1, j+0 si, j-1+0 This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero. 6/5/2021 CSE 5290, Fall 2011 123

LCS Algorithm LCS(v, w) for i 1 to n si, 0 0 for j 1 to m s 0, j 0 for i 1 to n for j 1 to m si-1, j si, j max si, j-1 si-1, j-1 + 1, if vi = wj “ “ if si, j = si-1, j bi, j “ “ if si, j = si, j-1 “ “ if si, j = si-1, j-1 + 1 return (sn, m, b) 6/5/2021 CSE 5290, Fall 2011 124

Now What? • LCS(v, w) created the alignment grid • Now we need a way to read the best alignment of v and w • Follow the arrows backwards from sink 6/5/2021 CSE 5290, Fall 2011 125

Printing LCS: Backtracking 1. Print. LCS(b, v, i, j) 2. if i = 0 or j = 0 3. return 4. if bi, j = “ “ 5. Print. LCS(b, v, i-1, j-1) 6. print vi else 7. 8. if bi, j = “ “ 9. Print. LCS(b, v, i-1, j) 10. else 11. Print. LCS(b, v, i, j-1) 6/5/2021 CSE 5290, Fall 2011 126

LCS Runtime • It takes O(nm) time to fill in the nxm dynamic programming matrix. • Why O(nm)? The pseudocode consists of a nested “for” loop inside of another “for” loop to set up a nxm matrix. 6/5/2021 CSE 5290, Fall 2011 127

Why does DP work? • Avoids re-computing the same subproblems • Limits the amount of work done in each step 6/5/2021 CSE 5290, Fall 2011 128

When is DP applicable? – Optimal substructure: Optimal solution to problem (instance) contains optimal solutions to sub-problems – Overlapping sub-problems: Limited number of distinct sub-problems, repeated many times 6/5/2021 CSE 5290, Fall 2011 129

Next: • More realistic sequence alignment algorithms Types: • • Global Alignment Scoring Matrices Local Alignment with Affine Gap Penalties 6/5/2021 CSE 5290, Fall 2011 130

From LCS to Alignment: Change up the Scoring • The Longest Common Subsequence (LCS) problem— the simplest form of sequence alignment – allows only insertions and deletions (no mismatches). • In the LCS Problem, we scored 1 for matches and 0 for indels • Consider penalizing indels and mismatches with negative scores • Simplest scoring schema: +1 : match premium -μ : mismatch penalty -σ : indel penalty 6/5/2021 CSE 5290, Fall 2011 131

Simple Scoring • When mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels) 6/5/2021 CSE 5290, Fall 2011 132

The Global Alignment Problem Find the best alignment between two strings under a given scoring schema Input : Strings v and w and a scoring schema Output : Alignment of maximum score ↑→ = -σ = 1 if match = -µ if mismatch si, j = max 6/5/2021 si-1, j-1 +1 if vi = wj s i-1, j-1 -µ if vi ≠ wj s i-1, j - σ s i, j-1 - σ CSE 5290, Fall 2011 m : mismatch penalty σ : indel penalty 133

Scoring Matrices To generalize scoring, consider a (4+1) x(4+1) scoring matrix δ. In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. The addition of 1 is to include the score for comparison of a gap character “-”. This will simplify the algorithm as follows: si-1, j-1 + δ (vi, wj) si, j = max s i-1, j + δ (vi, -) s i, j-1 + δ (-, wj) 6/5/2021 CSE 5290, Fall 2011 134

Measuring Similarity • Measuring the extent of similarity between two sequences – Based on percent sequence identity – Based on conservation 6/5/2021 CSE 5290, Fall 2011 135

Percent Sequence Identity • The extent to which two nucleotide or amino acid sequences are invariant AC C TG A G – AG AC G TG – G C AG mismatch 70% identical 6/5/2021 CSE 5290, Fall 2011 indel 136

Making a Scoring Matrix • Scoring matrices are created based on biological evidence. • Alignments can be thought of as two sequences that differ due to mutations. • Some of these mutations have little effect on the protein’s function, therefore some penalties, δ(vi , wj), will be less harsh than others. 6/5/2021 CSE 5290, Fall 2011 137

Scoring Matrix: Example A R A 5 -2 -1 -1 R - 7 -1 3 N - - 7 0 K - - - 6 6/5/2021 N K • Notice that although R and K are different amino acids, they have a positive score. • Why? They are both positively charged amino acids will not greatly change function of protein. CSE 5290, Fall 2011 138

Conservation • Amino acid changes that tend to preserve the physico-chemical properties of the original residue – Polar to polar • aspartate glutamate – Nonpolar to nonpolar • alanine valine – Similarly behaving residues • leucine to isoleucine 6/5/2021 CSE 5290, Fall 2011 139

Scoring matrices • Amino acid substitution matrices – PAM – BLOSUM • DNA substitution matrices – DNA is less conserved than protein sequences – Less effective to compare coding regions at nucleotide level 6/5/2021 CSE 5290, Fall 2011 140

PAM • Point Accepted Mutation (Dayhoff et al. ) • 1 PAM = PAM 1 = 1% average change of all amino acid positions – After 100 PAMs of evolution, not every residue will have changed • some residues may have mutated several times • some residues may have returned to their original state • some residues may not changed at all 6/5/2021 CSE 5290, Fall 2011 141

PAMX • PAMx = PAM 1 x – PAM 250 = PAM 1250 • PAM 250 is a widely used scoring matrix: Ala Arg Asn Asp Cys Gln. . . Trp Tyr Val 6/5/2021 A R N D C Q Ala A 13 3 4 5 2 3 Arg R 6 17 4 4 1 5 Asn N 9 4 6 8 1 5 Asp D 9 3 7 11 1 6 Cys C 5 2 2 1 52 1 Gln Q 8 5 5 7 1 10 Glu E 9 3 6 10 1 7 Gly G 12 2 4 5 2 3 His H 6 6 2 7 Ile I 8 3 3 3 2 2 W Y V 0 1 7 2 1 4 0 2 4 0 1 4 0 3 4 0 1 4 1 3 5 0 2 4 CSE 5290, Fall 2011 Leu L 6 2 2 2 1 3 Lys. . . K. . . 7. . . 9 5 5 1 2 15 0 1 10 142

BLOSUM • Blocks Substitution Matrix • Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins • Matrix name indicates evolutionary distance – BLOSUM 62 was created using sequences sharing no more than 62% identity 6/5/2021 CSE 5290, Fall 2011 143

The Blosum 50 Scoring Matrix 6/5/2021 CSE 5290, Fall 2011 144

Local vs. Global Alignment • The Global Alignment Problem tries to find the longest path between vertices (0, 0) and (n, m) in the edit graph. • The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i, j) and (i’, j’) in the edit graph. 6/5/2021 CSE 5290, Fall 2011 145

Local vs. Global Alignment • The Global Alignment Problem tries to find the longest path between vertices (0, 0) and (n, m) in the edit graph. • The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i, j) and (i’, j’) in the edit graph. • In the edit graph with negatively-scored edges, Local Alignmet may score higher than Global Alignment 6/5/2021 CSE 5290, Fall 2011 146

Local vs. Global Alignment (cont’d) • Global Alignment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | ||| || | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C • Local Alignment—better alignment to find conserved segment tcc. CAGTTATGTCAGgggacacgagcatgcagagac |||||| aattgccgccgtcgttttcag. CAGTTATGTCAGatc 6/5/2021 CSE 5290, Fall 2011 147

Local Alignment: Example Local alignment Compute a “mini” Global Alignment to get Local Global alignment 6/5/2021 CSE 5290, Fall 2011 148