The Gapped Longest Common Subsequence
Yung-Hsing Peng (彭永興)
Innovative DigiTech-Enabled Applications and Services Institute, Institute for Information Industry
Date: Nov. 18th, 2016
Today's Menu
• The longest common subsequence (LCS)
• The fixed gap LCS (FGLCS)
• The variable gap LCS (VGLCS)
• Small talk
The LCS Problem
[Figure: the sequences RCLPCRR and RPPLCPLRC]
Solvable in O(mn) time (Hirschberg, 1975)
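Solving the LCS Problem
Let $A = a_1 a_2 \cdots a_m$ and $B = b_1 b_2 \cdots b_n$, and let $L_{i,j}$ denote the length of the LCS of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$.
Dynamic programming:
$$L_{i,j} = \begin{cases} L_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ \max\{L_{i-1,j},\ L_{i,j-1}\} & \text{if } a_i \neq b_j \end{cases}$$
with $L_{0,0} = L_{0,j} = L_{i,0} = 0$ for $1 \le i \le m$, $1 \le j \le n$.
Time complexity: O(mn)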
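The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function name `lcs_length` is made up here, and the table `L` follows the slide's indexing with an extra row and column of zeros.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, in O(mn) time."""
    m, n = len(a), len(b)
    # L[i][j] = LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]


if __name__ == "__main__":
    print(lcs_length("RCLPCR", "RPPLCPLRC"))
```

The two strings in the usage line are the row and column labels of the example table on the next slide.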
An Example
      R  P  P  L  C  P  L  R  C
  R   1  1  1  1  1  1  1  1  1
  C   1  1  1  1  2  2  2  2  2
  L   1  1  1  2  2  2  3  3  3
  P   1  2  2  2  2  3  3  3  3
  C   1  2  2  2  3  3  3  3  4
  R   1  2  2  2  3  3  3  4  4
Dynamic programming: O(mn) time
The Fixed Gap LCS Problem
[Figure: the sequences RCLPCRR and RPPLCPLRC, illustrated for k = 1, k = 2, and k = 3]
Solvable in O(mn) time (Iliopoulos et al., 2007)
The Matching Path
[Figure: a matching path between RPPLCPLRC and RCLPCRR]
A matching path represents a common subsequence.
Solving the FGLCS Problem (k = 1)
      R  P  P  L  C  P  L  R  C
  R   1  0  0  0  0  0  0  1  0
  C   0  0  0  0  1  0  0  0  2
  L   0  0  0  1  0  0  2  0  0
  P   0  1  1  0  0  2  0  0  0
  C   0  0  0  0  2  0  0  0  3
  R   1  0  0  0  0  0  0  3  0
  R   1  0  0  0  0  0  0  1  0
Solving the FGLCS Problem (k = 2)
      R  P  P  L  C  P  L  R  C
  R   1  0  0  0  0  0  0  1  0
  C   0  0  0  0  1  0  0  0  2
  L   0  0  0  2  0  0  2  0  0
  P   0  2  2  0  0  3  0  0  0
  C   0  0  0  0  3  0  0  0  4
  R   1  0  0  0  0  0  0  4  0
Naïve dynamic programming: O(k^2 mn) time
Iliopoulos et al. give an O(mn)-time algorithm.
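The naive dynamic programming mentioned above can be sketched as follows. The slides do not spell out the recurrence, so this assumes that "gap k" means at most k unmatched characters between consecutive matched characters in both sequences; the name `fglcs_naive` is hypothetical.

```python
def fglcs_naive(a: str, b: str, k: int) -> int:
    """Naive fixed gap LCS in O(k^2 mn) time (assumed gap semantics)."""
    m, n = len(a), len(b)
    # T[i][j] = longest gap-respecting common subsequence that ends by
    # matching a[i-1] with b[j-1]; 0 when the two characters differ.
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] != b[j - 1]:
                continue
            prev = 0
            # Scan the (k+1) x (k+1) window of strictly earlier cells.
            for p in range(max(1, i - k - 1), i):
                for q in range(max(1, j - k - 1), j):
                    prev = max(prev, T[p][q])
            T[i][j] = prev + 1
            best = max(best, T[i][j])
    return best


if __name__ == "__main__":
    for k in (1, 2):
        print(k, fglcs_naive("RCLPCRR", "RPPLCPLRC", k))
```

The inner double loop is the part that the max-queue replaces to reach O(mn) time.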
The Max-Queue
Maintain a strictly decreasing sequence Q for finding the maximum of the last k + 1 numbers in sequence D. Example with k = 2:
D = 9                    Q = 9
D = 9, 3                 Q = 9, 3
D = 9, 3, 7              Q = 9, 7
D = 9, 3, 7, 2           Q = 7, 2
D = 9, 3, 7, 2, 5        Q = 7, 5
D = 9, 3, 7, 2, 5, 8     Q = 8
Remove expired or dominated elements. The maximum among the last k + 1 numbers in D is the first number in Q.
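A max-queue of this kind is commonly implemented with a double-ended queue of indices. The sketch below is one such implementation, not necessarily the one used in the talk; `sliding_max` and the window width `w` are names chosen here, and a window of three elements reproduces the Q states listed on the slide.

```python
from collections import deque


def sliding_max(D, w):
    """For each position i, return the maximum of the last w values D[i-w+1..i]."""
    q = deque()  # indices into D; their values are strictly decreasing
    out = []
    for i, x in enumerate(D):
        # Drop dominated elements: older values that are not larger than x.
        while q and D[q[-1]] <= x:
            q.pop()
        q.append(i)
        # Drop the front element once it has left the window.
        if q[0] <= i - w:
            q.popleft()
        out.append(D[q[0]])
    return out


if __name__ == "__main__":
    print(sliding_max([9, 3, 7, 2, 5, 8], 3))
```

Each index is pushed and popped at most once, so the whole pass takes time linear in the length of D.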
Input matrix:
  9  3  7  2  5  8
  5  4  6  3  7  3
  1  0  6  8  2  9
  4  2  1  5  5  7
  6  3  5  0  4  1
  2  6  0  3  1  4
k = 2, horizontal maxima:
  9  9  9  7  7  8
  5  5  6  6  7  7
  1  1  6  8  8  9
  4  4  4  5  5  7
  6  6  6  5  5  4
  2  6  6  6  3  4
k = 2, vertical maxima of the horizontal maxima:
  9  9  9  7  7  8
  9  9  9  7  7  8
  9  9  9  8  8  9
  5  5  6  8  8  9
  6  6  6  8  8  9
  6  6  6  6  5  7
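The two passes on this slide (row by row, then column by column) can be sketched as below. The helper names `sliding_max` and `window_max_2d` are made up for this illustration, and the window width of three elements is what the k = 2 numbers on the slide correspond to.

```python
from collections import deque


def sliding_max(D, w):
    """Maximum of the last w values at every position of D (max-queue)."""
    q, out = deque(), []
    for i, x in enumerate(D):
        while q and D[q[-1]] <= x:  # remove dominated elements
            q.pop()
        q.append(i)
        if q[0] <= i - w:           # remove the expired element
            q.popleft()
        out.append(D[q[0]])
    return out


def window_max_2d(M, w):
    """Horizontal pass over each row, then vertical pass over each column."""
    horizontal = [sliding_max(row, w) for row in M]
    columns = [sliding_max(list(col), w) for col in zip(*horizontal)]
    return [list(row) for row in zip(*columns)]  # transpose back to rows


if __name__ == "__main__":
    M = [[9, 3, 7, 2, 5, 8],
         [5, 4, 6, 3, 7, 3],
         [1, 0, 6, 8, 2, 9],
         [4, 2, 1, 5, 5, 7],
         [6, 3, 5, 0, 4, 1],
         [2, 6, 0, 3, 1, 4]]
    for row in window_max_2d(M, 3):
        print(row)
```

Both passes touch every cell a constant number of times on average, which is where the O(mn) bound comes from.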
Results from Iliopoulos et al.
With the max-queue, the matrix of window maxima (shown in pink on the slide) can be determined on the fly in O(mn) time.
FGLCS can therefore be solved in O(mn) time.
What if k differs from region to region?
The Variable Gap LCS Problem
Special case: every character carries the same gap value (here 2).
General case: each character carries its own gap value.
  A = R C L P C R R        gaps: 1 3 0 0 2 0 2
  B = R P P L C P L R C    gaps: 2 0 1 0 3 1 0 1 4
Solvable in O(mn) time (Peng and Yang, 2014)
Solving the VGLCS Problem
  gap        2  0  1  0  3  1  0  1  4
             R  P  P  L  C  P  L  R  C
   1   R     1  0  0  0  0  0  0  1  0
   3   C     0  0  0  0  2  0  0  0  2
   0   L     0  0  0  1  0  0  1  0  0
   0   P     0  1  1  0  0  2  0  0  0
   2   C     0  0  0  0  2  0  0  0  3
   0   R     1  0  0  0  0  0  0  1  0
   2   R     1  0  0  0  0  0  0  3  0
Naïve dynamic programming: O(m^2 n^2) time
We derive an O(mn)-time algorithm by ISMQ.
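The naive dynamic programming for VGLCS can be sketched as follows. The exact recurrence is not on the slide, so this assumes that the gap value attached to a character bounds how many characters may be skipped immediately before it when it is matched; `vglcs_naive`, `ga`, and `gb` are hypothetical names.

```python
def vglcs_naive(a, ga, b, gb):
    """Naive variable gap LCS; ga[i] and gb[j] are per-character gap values."""
    m, n = len(a), len(b)
    # T[i][j] = longest gap-respecting common subsequence that ends by
    # matching a[i-1] with b[j-1]; 0 when the two characters differ.
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] != b[j - 1]:
                continue
            prev = 0
            # The window size now depends on the two characters being matched,
            # so it can be as large as the whole table: O(m^2 n^2) overall.
            for p in range(max(1, i - ga[i - 1] - 1), i):
                for q in range(max(1, j - gb[j - 1] - 1), j):
                    prev = max(prev, T[p][q])
            T[i][j] = prev + 1
            best = max(best, T[i][j])
    return best


if __name__ == "__main__":
    print(vglcs_naive("RCLPCRR", [1, 3, 0, 0, 2, 0, 2],
                      "RPPLCPLRC", [2, 0, 1, 0, 3, 1, 0, 1, 4]))
```

Because the window's left and top edges move irregularly from cell to cell, a fixed-width max-queue no longer applies, which is the motivation for ISMQ.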
The Incremental Suffix Maximal Query
D = 9, 3, 7, 2 (|D| = 4):          SMQ_D(2) = 7, SMQ_D(4) = 2
D = 9, 3, 7, 2, 5 (|D| = 5):       SMQ_D(2) = 7, SMQ_D(4) = 5
D = 9, 3, 7, 2, 5, 8 (|D| = 6):    SMQ_D(2) = 8, SMQ_D(4) = 8
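Read off the examples above, SMQ_D(i) is the maximum of D[i], D[i+1], ..., D[|D|], with D growing by appends. A direct reference implementation, useful only for checking faster structures, might look like this; the class name `NaiveISMQ` and its method names are made up here.

```python
class NaiveISMQ:
    """Reference ISMQ: append in O(1), query in O(|D|)."""

    def __init__(self):
        self.D = []

    def append(self, x):
        self.D.append(x)

    def smq(self, i):
        """Maximum of D[i..|D|], with i counted from 1 as on the slide."""
        return max(self.D[i - 1:])


if __name__ == "__main__":
    s = NaiveISMQ()
    for x in (9, 3, 7, 2):
        s.append(x)
    print(s.smq(2), s.smq(4))
    s.append(5)
    print(s.smq(2), s.smq(4))
    s.append(8)
    print(s.smq(2), s.smq(4))
```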
Solving VGLCS with ISMQ
[Figure: worked example of filling the VGLCS table with ISMQ]
This algorithm takes O(αmn) time, where O(α) is the time needed to answer one ISMQ.
Approaches for Answering ISMQ
• Naïve search: O(α) = O(|D|)
• Balanced binary search tree: O(α) = O(log |D|)
• van Emde Boas tree: O(α) = O(log log |D|)
• Our approach: O(α) = O(1)
Tables for SMQ_D
Maintain SMQ_D by union-find operations.
D = 9, 3                 SMQ_D = 9, 3
D = 9, 3, 7              SMQ_D = 9, 7, 7
D = 9, 3, 7, 2           SMQ_D = 9, 7, 7, 2
D = 9, 3, 7, 2, 5        SMQ_D = 9, 7, 7, 5, 5
D = 9, 3, 7, 2, 5, 8     SMQ_D = 9, 8, 8, 8, 8, 8
The Union-find Problem
Elements:                9  3  7  2  5  8
Initial sets:            A  B  A  C  D  C     find(9) = A, find(5) = D
After unite(3, 5, B):    A  B  A  C  B  C     find(9) = A, find(5) = B
After unite(2, 5, E):    A  E  A  E  E  E     find(9) = A, find(5) = E
Union-find Operations
• make(x, S): Create a new singleton set {x} whose name is S.
• find(x): Retrieve the name of the unique set containing x.
• unite(x, y, S): Unite the two different sets containing x and y into one new set named S.
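The three operations can be sketched with a standard disjoint-set forest that stores each set's name at its root. This is a generic textbook structure (union by size with path compression), offered as an illustration and not as the structure used in the talk; the class name `NamedUnionFind` is made up.

```python
class NamedUnionFind:
    """Disjoint sets with make / find / unite, where every set has a name."""

    def __init__(self):
        self.parent = {}  # element -> parent element (roots point to themselves)
        self.size = {}    # root -> number of elements in its set
        self.name = {}    # root -> name of its set

    def make(self, x, S):
        self.parent[x], self.size[x], self.name[x] = x, 1, S

    def _root(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def find(self, x):
        return self.name[self._root(x)]

    def unite(self, x, y, S):
        rx, ry = self._root(x), self._root(y)
        if rx == ry:
            raise ValueError("x and y are already in the same set")
        if self.size[rx] < self.size[ry]:  # attach the smaller tree
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        self.name[rx] = S


if __name__ == "__main__":
    uf = NamedUnionFind()
    for x, s in [(9, "A"), (3, "B"), (7, "A7"), (2, "C"), (5, "D"), (8, "C8")]:
        uf.make(x, s)
    uf.unite(9, 7, "A")  # assumed starting partition: A = {9, 7}
    uf.unite(2, 8, "C")  # assumed starting partition: C = {2, 8}
    uf.unite(3, 5, "B")
    print(uf.find(9), uf.find(5))
    uf.unite(2, 5, "E")
    print(uf.find(9), uf.find(5))
```

The starting partition in the usage lines is an assumption chosen so that the `find` results agree with those quoted on the preceding slide.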
Our Union-find Approach for ISMQ
[Figure: make and unite operations as D[1..4] = 9, 3, 7, 2 is appended]
find: reports SMQ_D
Our Union-find Approach for ISMQ (continued)
[Figure: make and unite operations as D[1..6] = 9, 3, 7, 2, 5, 8 is appended]
[Figure: the list of nodes after each make and unite]
make: insert a tail node to the list
unite: unite a node v with its previous node pre(v)
Incremental tree set union (Gabow and Tarjan, 1985)
The Incremental Tree Set Union
[Figure: a tree with nodes a to g, initially in singleton sets A to G]
unite(c, f, C): allowed; f joins set C.
unite(b, f, B): forbidden, because b and f are not parent and child.
Results from Tarjan
General set union: each union-find operation takes O(β) amortized time, where β is the inverse Ackermann function of |D|.
Incremental tree set union: each union-find operation takes O(1) amortized time.
The list in our algorithm is an incremental tree.
Analysis for Our ISMQ Algorithm
Suppose that D grows to size p, and there are q ISMQs in total.
(1) There are p makes.
(2) There are q finds.
(3) There are no more than p - 1 unites.
(4) For each unite, it takes O(1) time to locate v and pre(v).
Total time: O(p + q)
This yields an O(mn)-time algorithm for VGLCS.
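The counting argument above can be made concrete with a small sketch. Note the substitution: this code uses an ordinary union-find forest (union by size with path compression) rather than the Gabow-Tarjan incremental tree set union, so it runs in near-constant amortized time per operation rather than the O(1) stated in the talk. The class name `UnionFindISMQ` and its fields are made up for this illustration.

```python
class UnionFindISMQ:
    """ISMQ where each set is a block of positions sharing one suffix maximum."""

    def __init__(self):
        self.parent = []  # union-find forest over positions 0 .. |D|-1
        self.size = []    # size[r]: number of positions in the block rooted at r
        self.value = []   # value[r]: suffix maximum shared by the block rooted at r
        self.first = []   # first[r]: leftmost position of the block rooted at r

    def _root(self, i):
        root = i
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[i] != root:  # path compression
            nxt = self.parent[i]
            self.parent[i] = root
            i = nxt
        return root

    def append(self, x):
        v = len(self.parent)           # "make": new tail node
        self.parent.append(v)
        self.size.append(1)
        self.value.append(x)
        self.first.append(v)
        r = v
        while self.first[r] > 0:       # is there a block to the left?
            p = self._root(self.first[r] - 1)
            if self.value[p] > x:      # left block still dominates x: stop
                break
            leftmost = self.first[p]   # "unite" the left block with the tail block
            if self.size[p] > self.size[r]:
                p, r = r, p
            self.parent[p] = r
            self.size[r] += self.size[p]
            self.value[r] = x
            self.first[r] = leftmost

    def smq(self, i):
        """Maximum of D[i..|D|], with i counted from 1: a single "find"."""
        return self.value[self._root(i - 1)]


if __name__ == "__main__":
    s = UnionFindISMQ()
    for x in (9, 3, 7, 2, 5, 8):
        s.append(x)
    print([s.smq(i) for i in range(1, 7)])
```

Every append performs one make, and every unite permanently removes one block, so at most p - 1 unites occur over p appends, matching items (1) and (3).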
The Variable Elastic Gap LCS (VEGLCS)
[Figure: an upper bound for gaps (VGLCS) versus a range for gaps (VEGLCS)]
The VEGLCS problem can be solved in O(mn) time. (Please refer to our paper.)
Conclusion and Future Work
We propose O(mn)-time algorithms for the VGLCS and the VEGLCS problems, based on ISMQ.
Future work:
(1) To consider other applications of ISMQ
• Gaps on LCS variants
• Edit distance
• Sequence alignment
(2) To devise useful gap functions
• Motif finding with some emphasized amino acids
Small Talk: A Brief Look at Innovative Thinking
The camera is foolproof, so you are the smart one!
• By 1898, an estimated 1.5 million "camera fiends" (amateur shutterbugs)
• "Kodak" (kodaking, kodakers, kodakery)
Someone will always take on a business that could cost them their head!
Criminal Investigation Bimonthly, January-February 200
"I have not failed 10,000 times. I have not failed once. I have succeeded in proving that those 10,000 ways will not work. When I have eliminated the ways that will not work, I will find the way that will work."
Edison (1847-1931) accumulated 2,332 patents worldwide.
The more you do, the more you get wrong; the more you get wrong, the more you get right!
The future is shaped by the questions we ask
• How do we get ourselves to water?
• How do we get water to come to us?
Paradigm shift!
In the words of the greatest of scientists
"If I had only one hour to save the world, I would spend fifty-five minutes defining the problem, and only five minutes finding the solution."
Albert Einstein