Longest Common Subsequence CSCI 385 Data Structures & Analysis of Algorithms Lecture note Sajedul Talukder
Longest Common Subsequence (LCS)
• The Longest Common Subsequence (LCS) of two strings is the longest sequence of characters that appear in the same order in both strings.
• It differs from the longest common substring problem: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences.
• Finding the length of the LCS is a classic dynamic programming problem.
Longest Common Subsequence (LCS)
• Given two sequences x[1..m] and y[1..n], find a longest subsequence common to them both (“a” longest, not “the” longest, because it need not be unique).
x: A B C B D A B
y: B D C A B A
BCBA = LCS(x, y); BCAB is another common subsequence of the same length.
Motivation
• Approximate string matching [Levenshtein, 1966]
  • Search for “occurance”, get results for “occurrence”
• Computational biology [Needleman-Wunsch, 1970’s]
  • Simple measure of genome similarity:
    acgtacgtcgtatcgtacgt
    aacgtacgtcgtacgt
• The LCS length is closely tied to the “edit distance”: when only insertions and deletions are allowed, the edit distance between x and y is m + n – 2·length(LCS(x, y)).
Brute-force LCS algorithm
• Enumerate all subsequences of X.
• Test which ones are also subsequences of Y.
• Pick the longest one.
Analysis:
• If X is of length n, then it has 2^n subsequences.
• Testing one candidate against Y (of length m) takes O(m) time, so the total is O(m·2^n).
• This is an exponential-time algorithm!
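The brute-force idea above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the names `is_subsequence` and `brute_force_lcs` are hypothetical, and candidates are tried from longest to shortest so the first hit is a longest one.

```python
from itertools import combinations


def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    # `ch in it` advances the iterator, so matches must occur in order.
    return all(ch in it for ch in s)


def brute_force_lcs(x, y):
    """Enumerate subsequences of x, longest first, and return the first
    one that is also a subsequence of y (exponential time)."""
    for length in range(len(x), -1, -1):
        # Each choice of `length` positions in x defines one subsequence.
        for positions in combinations(range(len(x)), length):
            candidate = "".join(x[p] for p in positions)
            if is_subsequence(candidate, y):
                return candidate
    return ""


result = brute_force_lcs("ABCBDAB", "BDCABA")
print(len(result))
```

Even on short inputs the number of candidates doubles with every extra character of x, which is why the remaining slides turn to dynamic programming.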
Dynamic programming algorithm
Simplification:
1. Look at the length of a longest common subsequence.
2. Extend the algorithm to find the LCS itself.
Notation: Denote the length of a sequence s by |s|.
Strategy: Consider prefixes of x and y.
• Define c[i, j] = |LCS(x[1..i], y[1..j])|.
• Then c[m, n] = |LCS(x, y)|.
Recursive formulation
\[
c[i, j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } x[i] = y[j],\\
\max\{c[i-1, j],\ c[i, j-1]\} & \text{otherwise.}
\end{cases}
\]
Case x[i] = y[j]: the matching last characters extend an LCS of the prefixes x[1..i–1] and y[1..j–1], so c[i, j] = c[i–1, j–1] + 1.
Case x[i] ≠ y[j]: the best matching might use x[i] or y[j] (or neither) but not both, so c[i, j] = max{c[i–1, j], c[i, j–1]}.
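The recurrence can be transcribed almost literally into a memoized recursive function. The sketch below is an illustration under a few assumptions: the function name `lcs_length` is hypothetical, Python strings are 0-based so x[i] in the slides corresponds to `x[i - 1]` here, and `functools.lru_cache` stands in for the table c.

```python
from functools import lru_cache


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y,
    computed top-down from the recurrence for c[i, j]."""

    @lru_cache(maxsize=None)
    def c(i, j):
        # Base case: an empty prefix has no common subsequence.
        if i == 0 or j == 0:
            return 0
        # Last characters match: extend the LCS of the shorter prefixes.
        if x[i - 1] == y[j - 1]:
            return c(i - 1, j - 1) + 1
        # No match: drop the last character of x or of y.
        return max(c(i - 1, j), c(i, j - 1))

    return c(len(x), len(y))


print(lcs_length("ABCBDAB", "BDCABA"))
```

Memoization ensures each pair (i, j) is solved once. For long inputs the recursion depth can exceed Python's default limit, which is one reason to prefer a bottom-up table.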
Dynamic-programming algorithm
Seq 1: ABCBDAB (rows), Seq 2: BDCABA (columns)
Fill rule for each cell, where a is the diagonal (upper-left) neighbor and b, c are the upper and left neighbors:
• Match: the cell gets a + 1.
• Not match: the cell gets max(b, c).

        B  D  C  A  B  A
     0  0  0  0  0  0  0
  A  0  0  0  0  1  1  1
  B  0  1  1  1  1  2  2
  C  0  1  1  2  2  2  2
  B  0  1  1  2  2  3  3
  D  0  1  2  2  2  3  3
  A  0  1  2  2  3  3  4
  B  0  1  2  2  3  4  4

• LCS length: 4 (the bottom-right cell).
• Tracing back from the bottom-right cell recovers an LCS: BCBA.
• Multiple solutions are possible.
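As a rough sketch of how the table is filled, the Python below applies the match / not-match rule row by row and prints the result. The function name `lcs_table` is a hypothetical choice; row 0 and column 0 hold the base-case zeros.

```python
def lcs_table(x, y):
    """Build the (len(x) + 1) x (len(y) + 1) table c bottom-up,
    where c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Match: diagonal neighbor plus one.
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                # Not match: larger of the upper and left neighbors.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


table = lcs_table("ABCBDAB", "BDCABA")
for row in table:
    print(" ".join(str(v) for v in row))
```

The bottom-right entry `table[7][6]` holds the LCS length for the two example sequences.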
Dynamic-Programming Approach
• Define L[i, j] to be the length of the longest common subsequence of X[0..i] and Y[0..j].
• Allow for -1 as an index, so L[-1, k] = 0 and L[k, -1] = 0, to indicate that the null part of X or Y has no match with the other.
• Then we can define L[i, j] in the general case as follows:
1. If x_i = y_j, then L[i, j] = L[i-1, j-1] + 1 (we can add this match).
2. If x_i ≠ y_j, then L[i, j] = max{L[i-1, j], L[i, j-1]} (we have no match here).
An LCS Algorithm
Algorithm LCS(X, Y):
  Input: Strings X and Y with n and m elements, respectively
  Output: For i = 0, …, n-1, j = 0, …, m-1, the length L[i, j] of a longest string that is a subsequence of both the string X[0..i] = x_0 x_1 x_2 … x_i and the string Y[0..j] = y_0 y_1 y_2 … y_j
  for i = -1 to n-1 do
    L[i, -1] = 0
  for j = 0 to m-1 do
    L[-1, j] = 0
  for i = 0 to n-1 do
    for j = 0 to m-1 do
      if x_i = y_j then
        L[i, j] = L[i-1, j-1] + 1
      else
        L[i, j] = max{L[i-1, j], L[i, j-1]}
  return array L
Analysis of LCS Algorithm
• We have two nested loops.
  • The outer one iterates n times.
  • The inner one iterates m times.
  • A constant amount of work is done inside each iteration of the inner loop.
  • Thus, the total running time is O(nm).
• Answer is contained in L[n-1, m-1] (and the subsequence can be recovered from the L table).
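One way to recover the subsequence from the L table is to walk back from the last cell. The sketch below is an assumption-laden illustration rather than the lecture's own code: the name `lcs_traceback` is hypothetical, the table is shifted by one so that `L[i + 1][j + 1]` stores the slides' L[i, j] (row 0 and column 0 play the role of index -1), and ties are broken by moving up, which is only one of several valid choices.

```python
def lcs_traceback(X, Y):
    """Return one longest common subsequence of X and Y."""
    n, m = len(X), len(Y)
    # L[i + 1][j + 1] holds the slides' L[i, j]; index 0 stands for -1.
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if X[i] == Y[j]:
                L[i + 1][j + 1] = L[i][j] + 1
            else:
                L[i + 1][j + 1] = max(L[i][j + 1], L[i + 1][j])

    # Walk back from L[n-1, m-1], collecting matched characters.
    out = []
    i, j = n - 1, m - 1
    while i >= 0 and j >= 0:
        if X[i] == Y[j]:
            out.append(X[i])      # this match is part of the LCS
            i -= 1
            j -= 1
        elif L[i][j + 1] >= L[i + 1][j]:
            i -= 1                # slides' L[i-1, j] is at least as large
        else:
            j -= 1                # slides' L[i, j-1] is larger
    return "".join(reversed(out))


print(lcs_traceback("ABCBDAB", "BDCABA"))  # prints BCBA
```

The walk takes at most n + m steps, so recovering the subsequence does not change the O(nm) bound. A different tie-breaking rule can produce another LCS of the same length.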