This Unit
• Longest common subsequence
• Edit distance
Biological Applications
• Compare the DNA of two or more organisms
• How similar are the two strands?
• Is one a substring of the other?
• Find a longest new strand in which the bases (A, C, G, T) appear in the same order as in the two original strands
Longest Common Subsequence (LCS)
• Problem: Given sequences x[1..m] and y[1..n], find a longest common subsequence of both.
• Example: x = ABCBDAB and y = BDCABA
  – BCA is a common subsequence, and
  – BCBA and BDAB are two LCSs
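The definition can be checked on the example strings with a small helper. This is a minimal sketch; the function name `is_subsequence` is an illustrative choice, not something from the slides.

```python
def is_subsequence(z: str, x: str) -> bool:
    """Return True if z can be obtained from x by deleting zero or more characters."""
    it = iter(x)
    # `ch in it` consumes the iterator up to the first match,
    # so the characters of z must appear in x in the same order.
    return all(ch in it for ch in z)


x, y = "ABCBDAB", "BDCABA"
for z in ("BCA", "BCBA", "BDAB"):
    # Each candidate is a subsequence of both x and y, so each line prints True.
    print(z, is_subsequence(z, x) and is_subsequence(z, y))
```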
LCS
• Brute force solution
• Writing a recurrence equation
• The dynamic programming solution
• Application of algorithm
Brute force solution
• Solution: For every subsequence of x, check if it is a subsequence of y.
• Analysis:
  – There are 2^m subsequences of x.
  – Each check takes O(n) time, since we scan y for the first element, then continue scanning for the second element, etc.
  – The worst-case running time is O(n 2^m), i.e. exponential in m.
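The brute-force idea can be sketched as follows. The names `lcs_brute_force` and `is_subsequence` are illustrative assumptions; the sketch tries the longest candidates first so it can stop at the first hit, but it still examines up to 2^m subsequences in the worst case and is only practical for very short strings.

```python
from itertools import combinations


def is_subsequence(z: str, y: str) -> bool:
    """O(len(y)) scan: look for each character of z in order."""
    it = iter(y)
    return all(ch in it for ch in z)


def lcs_brute_force(x: str, y: str) -> str:
    """Return some longest common subsequence of x and y by exhaustive search."""
    # Enumerate subsequences of x by choosing index sets, longest first.
    for k in range(len(x), 0, -1):
        for idx in combinations(range(len(x)), k):
            candidate = "".join(x[i] for i in idx)
            if is_subsequence(candidate, y):
                return candidate  # the first hit at length k is an LCS
    return ""  # no common character at all


print(len(lcs_brute_force("ABCBDAB", "BDCABA")))  # prints 4
```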
Writing the recurrence equation
• Let Xi denote the ith prefix x[1..i] of x[1..m], and let X0 denote the empty prefix.
• We will first compute the length of an LCS of Xm and Yn, LenLCS(m, n), and then use information saved during the computation to find the actual subsequence.
• We need a recursive formula for computing LenLCS(i, j).
Writing the recurrence equation
• If Xi and Yj end with the same character xi = yj, an LCS must include that character. If it did not, we could get a longer LCS by adding the common character.
• If Xi and Yj do not end with the same character, there are two possibilities:
  – either the LCS does not end with xi,
  – or it does not end with yj.
• Let Zk denote an LCS of Xi and Yj.
Xi and Yj end with xi = yj
  Xi: x1 x2 … xi-1 xi
  Yj: y1 y2 … yj-1 yj = xi
  Zk: z1 z2 … zk-1 zk = yj = xi
Zk is Zk-1 followed by zk = yj = xi, where Zk-1 is an LCS of Xi-1 and Yj-1, and
LenLCS(i, j) = LenLCS(i-1, j-1) + 1
Xi and Yj end with xi ≠ yj
  Xi: x1 x2 … xi-1 xi
  Yj: y1 y2 … yj-1 yj
  Zk: z1 z2 … zk-1 zk
• If zk ≠ yj, then Zk is an LCS of Xi and Yj-1.
• If zk ≠ xi, then Zk is an LCS of Xi-1 and Yj.
LenLCS(i, j) = max{LenLCS(i, j-1), LenLCS(i-1, j)}
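The recurrence equation
LenLCS(i, j) = 0                                       if i = 0 or j = 0
LenLCS(i, j) = LenLCS(i-1, j-1) + 1                    if i, j > 0 and xi = yj
LenLCS(i, j) = max{LenLCS(i, j-1), LenLCS(i-1, j)}     if i, j > 0 and xi ≠ yj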
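The two cases derived on the previous slides, together with the empty-prefix base case, translate almost directly into a memoized recursive function. This is a sketch under the assumption of 0-indexed Python strings, so xi corresponds to `x[i - 1]`; the name `len_lcs` is illustrative. Python's recursion limit makes this form suitable only for short inputs.

```python
from functools import lru_cache


def len_lcs(x: str, y: str) -> int:
    """Length of an LCS of x and y, computed top-down from the recurrence."""

    @lru_cache(maxsize=None)
    def rec(i: int, j: int) -> int:
        # Base case: an empty prefix has an empty LCS with anything.
        if i == 0 or j == 0:
            return 0
        # Case xi = yj: extend an LCS of the two shorter prefixes.
        if x[i - 1] == y[j - 1]:
            return rec(i - 1, j - 1) + 1
        # Case xi != yj: drop the last character of one prefix or the other.
        return max(rec(i, j - 1), rec(i - 1, j))

    return rec(len(x), len(y))


print(len_lcs("ABCBDAB", "BDCABA"))  # prints 4
```

Memoization ensures each pair (i, j) is evaluated once, which is the same O(mn) work the table-based solution performs.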
The dynamic programming solution
• Initialize the first row and the first column of the matrix LenLCS to 0.
• Calculate LenLCS(1, j) for j = 1, …, n.
• Then calculate LenLCS(2, j) for j = 1, …, n, etc.
• Also store in a table an arrow pointing to the array element that was used in the computation.
• The computation is O(mn): each of the mn entries is filled in constant time.
LCS-Length(X, Y)
  m ← length[X]
  n ← length[Y]
  for i ← 1 to m do
    c[i, 0] ← 0
  for j ← 0 to n do
    c[0, j] ← 0
  for i ← 1 to m do
    for j ← 1 to n do
      if xi = yj
        c[i, j] ← c[i-1, j-1] + 1
        b[i, j] ← "D"
      else if c[i-1, j] ≥ c[i, j-1]
        c[i, j] ← c[i-1, j]
        b[i, j] ← "U"
      else
        c[i, j] ← c[i, j-1]
        b[i, j] ← "L"
  return c and b
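A Python rendering of LCS-Length might look like the sketch below. It assumes 0-indexed strings (so xi is `x[i - 1]`) and uses nested lists for the tables c and b; the function name `lcs_length` is illustrative.

```python
def lcs_length(x: str, y: str):
    """Fill the length table c and the arrow table b row by row."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: an empty prefix has an empty LCS.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "D"  # diagonal arrow: xi is part of the LCS
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "U"  # up arrow
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "L"  # left arrow
    return c, b


c, b = lcs_length("ABCBDAB", "BDCABA")
print(c[7][6])  # prints 4, the LCS length for the running example
```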
Example
To find an LCS, follow the arrows back from c[m, n]; each diagonal arrow denotes a member of the LCS (the members are found in reverse order).
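Following the arrows can be sketched in Python as below. To keep the block self-contained, this version recomputes the length table and re-derives each arrow from it instead of storing a separate arrow table; that is a design choice of the sketch, with ties broken the same way as in LCS-Length (prefer "up"). The name `lcs` is illustrative.

```python
def lcs(x: str, y: str) -> str:
    """Return one LCS of x and y by tracing back through the length table."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])  # diagonal arrow: this character is in the LCS
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # up arrow
        else:
            j -= 1  # left arrow
    # Characters were collected from the end of the LCS backwards.
    return "".join(reversed(out))


print(lcs("ABCBDAB", "BDCABA"))  # prints BCBA
```

A different tie-breaking rule would follow different arrows and could return another LCS of the same length, such as BDAB.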
Edit distance
• Given two strings s and t
• Edit distance = the minimum number of basic operations to convert one to the other
• Basic operations are typically character-level
  – Insert
  – Delete
  – Replace
• Often also include transposition
• http://www.merriampark.com/ld.htm
Dynamic programming for edit distance
• Let s[1..m] and t[1..n] be the two strings.
• The recurrence equation uses the replacement cost r(i, j) = 0 when s[i] = t[j], otherwise 1.
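One standard recurrence that fits the replacement cost r(i, j) is the Levenshtein form d(i, j) = min{d(i-1, j) + 1, d(i, j-1) + 1, d(i-1, j-1) + r(i, j)}, with d(i, 0) = i and d(0, j) = j. The name d and the unit costs for insert and delete are assumptions of this sketch, not taken from the slide, and transposition is not handled.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insert/delete/replace operations turning s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of s[1..i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of t[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = 0 if s[i - 1] == t[j - 1] else 1  # replacement cost r(i, j)
            d[i][j] = min(
                d[i - 1][j] + 1,      # delete s[i]
                d[i][j - 1] + 1,      # insert t[j]
                d[i - 1][j - 1] + r,  # replace s[i] by t[j], or keep it if equal
            )
    return d[m][n]


print(edit_distance("kitten", "sitting"))  # prints 3
```

As with the LCS table, filling the (m+1) by (n+1) table takes O(mn) time.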