Longest common subsequence
Biological applications often need to compare the DNA of two (or more) different organisms.
S1 = ACCGGTCGAGTGCGCGGAAGCCGAA
S2 = GTCGTTCGGAATGCCGTTGCTCTGTAAA
- Neither strand is a substring of the other.
- We could say that the two strands are similar if the number of changes needed to turn one into the other is small.
- Alternatively, find a third strand S3 whose letters appear in both S1 and S2 in the same order, but not necessarily consecutively. The longer S3 is, the more similar S1 and S2 are.
CSC 317
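More formally: Given a sequence $X = \langle x_1, x_2, \ldots, x_m \rangle$, another sequence $Z = \langle z_1, z_2, \ldots, z_k \rangle$ is a subsequence of X if there exists a strictly increasing sequence $\langle i_1, i_2, \ldots, i_k \rangle$ of indices of X such that for all j = 1, 2, …, k we have $x_{i_j} = z_j$.
Example: $Z = \langle B, C, D, B \rangle$ is a subsequence of $X = \langle A, B, C, B, D, A, B \rangle$ with index sequence $\langle 2, 3, 5, 7 \rangle$.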
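The definition can be checked mechanically. Below is a minimal sketch, assuming sequences are given as Python strings; the helper name `is_subsequence` and the test strings are illustrative choices, not part of the slides.

```python
def is_subsequence(z, x):
    """Return True if z is a subsequence of x (same order, gaps allowed)."""
    it = iter(x)
    # `symbol in it` consumes the iterator up to and including the first match,
    # so each symbol of z must be found strictly after the previous one.
    return all(symbol in it for symbol in z)


print(is_subsequence("BCDB", "ABCBDAB"))  # indices 2, 3, 5, 7 of X
print(is_subsequence("BDCB", "ABCBDAB"))  # order is violated
```

The single pass over x corresponds to choosing the increasing index sequence greedily: each symbol of z is matched at the earliest possible position.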
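More formally: Given two sequences X and Y, we say that a sequence Z is a common subsequence of X and Y if Z is a subsequence of both X and Y.
Example: If $X = \langle A, B, C, B, D, A, B \rangle$ and $Y = \langle B, D, C, A, B, A \rangle$, then $\langle B, C, A \rangle$ is a common subsequence of X and Y. Is it the longest common subsequence? No: the sequence $\langle B, C, B, A \rangle$ is also common to X and Y and has length 4. Both $\langle B, C, B, A \rangle$ and $\langle B, D, A, B \rangle$ are longest common subsequences.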
In the longest-common-subsequence problem, we are given two sequences $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$ and want to find a maximum-length common subsequence of X and Y.
Great. How are we going to do that using dynamic programming?
Steps:
• Characterizing a longest common subsequence
• Recursive solution
• Computing the length of an LCS
• Constructing an LCS
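Step 1: Characterizing a longest common subsequence
Brute-force solution: We simply enumerate all subsequences of X and check each one to see whether it is also a subsequence of Y, keeping track of the longest subsequence we find.
This is impractical for long sequences, because we would need to run through $2^m$ subsequences of X (exponential time).
But does the LCS problem have an optimal-substructure property (dynamic programming, anyone)?
Some definitions: Given a sequence $X = \langle x_1, x_2, \ldots, x_m \rangle$, we define the ith prefix of X as $X_i = \langle x_1, x_2, \ldots, x_i \rangle$, for i = 0, 1, …, m.
Example: if $X = \langle A, B, C, B, D, A, B \rangle$, then $X_4 = \langle A, B, C, B \rangle$ and $X_0$ is the empty sequence.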
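To make the cost of the brute-force idea concrete, here is a small sketch, assuming string inputs; the function names are hypothetical. It enumerates index subsets of X with `itertools.combinations`, longest first, so it may examine up to 2^m candidates.

```python
from itertools import combinations


def is_subsequence(z, x):
    """Return True if z is a subsequence of x."""
    it = iter(x)
    return all(symbol in it for symbol in z)


def lcs_brute_force(x, y):
    """Return one longest common subsequence of x and y by exhaustive search."""
    # Try candidate lengths from longest to shortest, so the first hit is an LCS.
    for k in range(len(x), -1, -1):
        # Each increasing index tuple picks out one subsequence of x.
        for indices in combinations(range(len(x)), k):
            candidate = "".join(x[i] for i in indices)
            if is_subsequence(candidate, y):
                return candidate
    return ""  # not reached: the empty subsequence (k = 0) always matches


result = lcs_brute_force("ABCBDAB", "BDCABA")
print(len(result))
```

This is only usable for very short inputs; each additional symbol in X doubles the number of candidates in the worst case.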
Theorem (no proof): Optimal substructure of an LCS
Let $X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$ be two sequences and $Z = \langle z_1, z_2, \ldots, z_k \rangle$ be any LCS of X and Y. Then:
1.) if $x_m = y_n$, then $z_k = x_m = y_n$ and $Z_{k-1}$ is an LCS of $X_{m-1}$ and $Y_{n-1}$.
2.) if $x_m \neq y_n$, then $z_k \neq x_m$ implies that Z is an LCS of $X_{m-1}$ and Y.
3.) if $x_m \neq y_n$, then $z_k \neq y_n$ implies that Z is an LCS of X and $Y_{n-1}$.
But what does that tell us? An LCS of two sequences contains within it an LCS of prefixes of the two sequences. Thus, the LCS problem has an optimal-substructure property, meaning we could use a recursive solution!
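Step 2: A recursive algorithm
If $x_m = y_n$, we find an LCS of $X_{m-1}$ and $Y_{n-1}$ and append $x_m$ to it. Otherwise, we need to find the LCSs of X and $Y_{n-1}$ and of $X_{m-1}$ and Y, and keep the longer of the two. Furthermore, each of these subproblems has the subsubproblem of finding an LCS of $X_{m-1}$ and $Y_{n-1}$, so the subproblems overlap.
Let c[i, j] be the length of an LCS of the sequences $X_i$ and $Y_j$. The optimal substructure of the LCS problem gives the recursive formula

$$
c[i, j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j, \\
\max(c[i, j-1],\ c[i-1, j]) & \text{if } i, j > 0 \text{ and } x_i \neq y_j.
\end{cases}
$$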
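The recurrence for c[i, j] translates almost directly into code. The sketch below is one possible rendering, assuming string inputs; it adds memoization via `functools.lru_cache` so that overlapping subproblems are solved only once, and the function name is an illustrative choice.

```python
from functools import lru_cache


def lcs_length_recursive(x, y):
    """Length of an LCS of x and y, computed top-down from the recurrence."""

    @lru_cache(maxsize=None)
    def c(i, j):
        # c(i, j) is the LCS length of the prefixes x[:i] and y[:j].
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:  # x_i = y_j (Python strings are 0-indexed)
            return c(i - 1, j - 1) + 1
        return max(c(i, j - 1), c(i - 1, j))

    return c(len(x), len(y))


print(lcs_length_recursive("ABCBDAB", "BDCABA"))  # 4
```

Without the cache, the same (i, j) pairs would be recomputed many times; with it, at most (m + 1)(n + 1) distinct calls are evaluated. For long inputs the recursion depth can exceed Python's default limit, which is one reason to prefer a bottom-up table.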
Step 3: Computing the length of an LCS
[Figure: worked example for X = ABCBDAB and Y = BDCABA, with LCS BCBA]
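A bottom-up version of Step 3 can be sketched as follows. This is a minimal illustration rather than the slide's own pseudocode: it assumes string inputs, the name `lcs_table` is hypothetical, and it fills the table of c[i, j] values row by row in O(mn) time and space.

```python
def lcs_table(x, y):
    """Return the (m+1) x (n+1) table c, where c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: an empty prefix has an empty LCS.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


table = lcs_table("ABCBDAB", "BDCABA")
print(table[-1][-1])  # 4, the length of an LCS of the two strings
```

The bottom-right entry holds the answer, and the rest of the table is what the backtracking step reads.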
Step 4: Constructing an LCS (Backtracking)
[Figure: worked example for X = ABCBDAB and Y = BDCABA, with LCS BCBA]
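Backtracking can be sketched by walking the c table from the bottom-right corner back to the top-left. The code below is an assumption-laden illustration: the helper names are hypothetical, ties are broken by moving up first (a different tie-break can return a different, equally long LCS), and the table is rebuilt here so the block is self-contained.

```python
def lcs_table(x, y):
    """Table c with c[i][j] = LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def lcs_backtrack(x, y):
    """Return one LCS of x and y by tracing back through the c table."""
    c = lcs_table(x, y)
    i, j = len(x), len(y)
    symbols = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            # Matching symbols belong to the LCS; move diagonally.
            symbols.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # the value came from the cell above (ties go up)
        else:
            j -= 1  # the value came from the cell to the left
    # Symbols were collected from last to first.
    return "".join(reversed(symbols))


print(lcs_backtrack("ABCBDAB", "BDCABA"))  # BCBA
```

The trace visits at most m + n cells, so reconstruction costs O(m + n) once the table exists.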