Overlap Matching
By Itamar Nabriski
A. Amir, R. Cole, G. Landau, R. Hariharan, M. Lewenstein, E. Porat, Overlap Matching, Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (2001) 279-288.
Lecture Structure
- Discrete Convolutions
- Overlap Matching Problem Definition
- Overlap Matching Algorithm
- Reduction from Swap Matching
Discrete Convolutions
Definition 1
Let T be a function whose domain is {0, …, n-1}.
Let P be a function whose domain is {0, …, m-1}.
(We'll view them as arrays of numbers of length n and m respectively.)
The convolution of T and P at index j, for j = 0, …, n-m, is defined as follows:
(T ⊗ P)[j] = Σ_{i=0}^{m-1} T[j+i] · P[i]
Exempli Gratia
T = Ronaldinho (n = 10), P = Deco (m = 4) (assume, for now, that each letter represents a number).
Aligning P with T at index 2:
R o n a l d i n h o
    D e c o
(T ⊗ P)[2] = n·D + a·e + l·c + d·o
Thus the naïve computation time for a single index is O(m).
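The definition can be sketched directly in code. This is a minimal illustration, not part of the original lecture; the letter-to-number encoding (a=1, …, z=26) and the function names are assumptions made only so the example can be evaluated.

```python
def correlate_naive(text, pattern):
    """Return (T ⊗ P)[j] for every j in 0..n-m, spending O(m) per index."""
    n, m = len(text), len(pattern)
    return [
        sum(text[j + i] * pattern[i] for i in range(m))
        for j in range(n - m + 1)
    ]


def encode(word):
    # Hypothetical encoding, not from the slides: a=1, b=2, ..., z=26.
    return [ord(c) - ord("a") + 1 for c in word.lower()]


T = encode("Ronaldinho")
P = encode("Deco")
values = correlate_naive(T, P)
# values[2] is n*D + a*e + l*c + d*o under the encoding above.
print(values[2])
```

There are n - m + 1 = 7 alignments here, and each one costs m multiplications.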
Computing All The Convolutions
P can be aligned with T at O(n) indexes, so there are O(n) convolution values to compute.
Naïve approach: for each j pay O(m) time, for a total of O(nm).
Devious approach: using the "Fast Fourier Transform" (FFT), pay an amortized O(log m) time per index j, for a total of O(n log m).
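As a rough sketch of the FFT approach, the code below computes all values with one transform of size N ≥ n + m, which costs O(N log N). It uses a plain recursive radix-2 FFT written for this example; the O(n log m) bound is normally obtained by additionally cutting T into overlapping blocks of length about 2m, a step this sketch omits. All names are illustrative.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def correlate_fft(text, pattern):
    """All values sum_i text[j+i]*pattern[i] for integer inputs."""
    n, m = len(text), len(pattern)
    size = 1
    while size < n + m:
        size *= 2
    fa = fft([complex(x) for x in text] + [0j] * (size - n))
    # Reversing the pattern turns the sliding dot product into a
    # polynomial product.
    fb = fft([complex(x) for x in reversed(pattern)] + [0j] * (size - m))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # Coefficient m-1+j of the product is the value at index j; the
    # inverse transform above is unnormalised, hence the division.
    return [round(prod[m - 1 + j].real / size) for j in range(n - m + 1)]


print(correlate_fft([1, 0, 1, 0, 0, 1, 1, 1], [0, 1, 1, 0]))
```

Rounding is safe here only because the inputs are small integers; larger values would need care with floating-point error.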
Using Convolutions
Preprocessing: before we perform convolutions on T and P we preprocess each letter using a constant number of constant-time functions (O(n) in total), retaining the running time of O(n log m).
Using several convolutions: for each index j we can perform a constant number of convolutions, retaining the running time of O(n log m).
Postprocessing: for each index j we can use a constant-time function f to combine the outputs of the constant number of convolutions, retaining the running time of O(n log m).
Using Convolutions - Example: Testing For Exact Matching at index j
∑ = {a, b}
T = ababbaaa
P = abba
Thus, for example, T and P exactly match at index j = 2:
a b a b b a a a
    a b b a
Preprocessing functions (x is a letter):
prep_a(x) = 1 if x = a, and 0 otherwise
prep_b(x) = 1 if x = b, and 0 otherwise
The pattern is processed with the complements prep_¬a(x) = 1 - prep_a(x) and prep_¬b(x) = 1 - prep_b(x).
When we write prep(S), S being a string, we mean that prep is applied to every character of S.

Convolutions we will use:
C_a = prep_a(T) ⊗ prep_¬a(P) (counts the positions where T has an a and P does not)
C_b = prep_b(T) ⊗ prep_¬b(P) (counts the positions where T has a b and P does not)
Postprocessing function f: f(x, y) = 1 if x = y = 0, and 0 otherwise.
For every index j, f(C_a[j], C_b[j]) = 1 iff there is an exact matching of P and T at index j.

T = ababbaaa, P = abba
prep_a(T) = 10100111, prep_¬a(P) = 0110
prep_b(T) = 01011000, prep_¬b(P) = 1001
At j = 2:
C_a[2] = 1*0 + 0*1 + 0*1 + 1*0 = 0
C_b[2] = 0*1 + 1*0 + 1*0 + 0*1 = 0
f(0, 0) = 1, so there is an exact match at index 2.
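The three stages (preprocess, convolve, postprocess) can be sketched as follows. The helper names are illustrative, and a naive O(nm) correlation stands in for the FFT so that the example stays short.

```python
def correlate(text, pattern):
    """Naive sliding dot product; an FFT version would replace this."""
    n, m = len(text), len(pattern)
    return [sum(text[j + i] * pattern[i] for i in range(m))
            for j in range(n - m + 1)]


def exact_matches(T, P):
    """Indexes j where P occurs in T, for strings over {a, b}."""
    # Preprocessing: indicators for T, complemented indicators for P.
    t_a = [1 if c == "a" else 0 for c in T]
    t_b = [1 if c == "b" else 0 for c in T]
    p_not_a = [0 if c == "a" else 1 for c in P]
    p_not_b = [0 if c == "b" else 1 for c in P]
    # Two convolutions, each counting one kind of mismatch.
    c_a = correlate(t_a, p_not_a)
    c_b = correlate(t_b, p_not_b)
    # Postprocessing: f(x, y) = 1 exactly when both counts are zero.
    return [j for j, (x, y) in enumerate(zip(c_a, c_b)) if x == 0 and y == 0]


print(exact_matches("ababbaaa", "abba"))
```

For the slide's example the only index reported is 2.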
Overlap Matching Problem Definition
Structural String
- A linear structure made of segments
- Each segment can be marked or unmarked
- A structural string is a concatenation of segments
Exempli Gratia:
T h e _ W a l l s _ O  f  _  J  e  r  i  c  h  o
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Thus, the actual characters are not important; only the segment boundaries and their marks matter.
Definition of Overlap Matching
Input:
- Structural string T (text) of length n
- Structural string P (pattern) of length m, with m ≤ n
- Both T and P have some marked segments
Output:
- All locations k in T such that, when P is aligned at k, every overlap between a marked segment of T and a marked segment of P has even length
Example
T = Franz_Beckenbauer, P = The_Kaiser, aligned at j = 3:
F r a n z _ B e c k e n b a u e r
      T h e _ K a i s e r
j = 3 is not a valid overlap match, because one of the overlaps between marked segments has odd length.
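A brute-force reference for the definition is useful for checking the faster algorithm. The sketch below assumes a representation that is not in the slides: each structural string is given by its length and a list of marked segments as inclusive (start, end) index pairs.

```python
def overlap_length(a, b):
    """Number of positions shared by inclusive intervals a and b."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def overlap_match_positions(n, t_marked, m, p_marked):
    """All k such that every marked overlap is even when P is aligned at k."""
    positions = []
    for k in range(n - m + 1):
        shifted = [(s + k, e + k) for s, e in p_marked]
        if all(overlap_length(t, p) % 2 == 0
               for t in t_marked for p in shifted):
            positions.append(k)
    return positions


# Text of length 10 with one marked segment [2, 5];
# pattern of length 4 that is a single marked segment [0, 3].
print(overlap_match_positions(10, [(2, 5)], 4, [(0, 3)]))
```

An empty overlap has length 0 and therefore counts as even, so alignments where no marked segments meet are always reported.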
Overlap Matching Algorithm
General Preprocessing
Each segment can start at either an even or an odd index and end at either an even or an odd index.
We will produce from T four new strings: Too, Tee, Toe and Teo.
Too will have 1's in the place of all characters belonging to segments that start and end at an odd index, and 0's otherwise. For example, a text of length 10 whose only such segment spans indexes 3 to 7 gives:
Too = 0001111100
And analogously for the other segment types.
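The split of T into its four indicator strings can be sketched as below. Marked segments are assumed to be given as inclusive (start, end) pairs, which is a representation chosen for this example.

```python
def parity_classes(n, marked):
    """Build the indicator arrays 'oo', 'ee', 'oe', 'eo' for a text of length n."""
    names = {(1, 1): "oo", (0, 0): "ee", (1, 0): "oe", (0, 1): "eo"}
    out = {name: [0] * n for name in names.values()}
    for start, end in marked:
        # Class is decided by the parity of the first and last index.
        name = names[(start % 2, end % 2)]
        for i in range(start, end + 1):
            out[name][i] = 1
    return out


classes = parity_classes(10, [(3, 7)])
print("".join(map(str, classes["oo"])))
```

Each marked segment lands in exactly one of the four arrays, so the preprocessing is a single O(n) pass.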
Since the pattern P tends to move around, we need to treat its segment indexes a bit differently.
We will produce from P eight new strings: POoo, POee, POoe, POeo, PEoo, PEee, PEoe and PEeo.
The big 'O' in POee means: all segments in P that start and end at an even location relative to T's index, when P is aligned to an odd index of T (don't worry, an example follows).
And analogously for the other segment types.
[Figure: a text T over indexes 0 to 8 and its four indicator strings Too, Tee, Teo and Toe.]
Consider a pattern P over indexes 0 to 6. Since P is always aligned to T at some index j, we treat P's indexes relative to T, thus:
j+0 j+1 j+2 j+3 j+4 j+5 j+6
Assume j is odd; then for that location we use the four PO strings:
POoo = 0 0 0 0 0 0 0
POee = 0 0 0 1 1 1 0
POeo = 0 0 0 0 0 0 0
POoe = 1 1 0 0 0 0 0
For instance, the segment at P's indexes 3 to 5 starts at j+3 and ends at j+5, both even when j is odd, so it belongs to POee.
Thus for every location j we have 16 (a constant number of) possible text-pattern pairings:
{Too, Tee, Toe, Teo} × {PXoo, PXee, PXoe, PXeo}, where X = parity(j).
If we can determine, using convolutions, for each pairing whether it contains only even overlaps, we can solve the Overlap Matching problem in O(n log m) time.
Case 1
Case 1 occurs when, for Tab and Pcd, either a = c or b = d. This covers 12 of the 16 cases.
We now show a solution for when a = c, which covers 8 cases. We use the same solution on the reversed* strings of T and P (thus 'a' becomes 'c' and 'b' becomes 'd') to solve the 4 remaining cases.
* Computing the reversed strings does not alter the run time (do it during general preprocessing).
Case 1 (a = c)
For every two marked segments, St in Tab starting at index x and Sp in Pcd starting at index y (both measured in T's indexes), |x-y| is always even (since even - even = even and odd - odd = even).
We now create a convolution that returns 0 for index j iff there is no odd overlap at j.
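For every segment in Tab we replace the 1's by an alternating series of 1's and -1's beginning with 1, for example 0 0 0 1 1 0 becomes 0 0 0 1 -1 0. The segments of Pcd keep their 1's.
Because |x-y| is even, every overlap begins on a +1 of the text segment, so an even overlap contributes 0 and an odd overlap contributes exactly 1.
In the case where we have only even (and/or no) overlaps, e.g. an overlap of length 2: 1 - 1 = 0.
In the case where we have at least one odd overlap, e.g. an overlap of length 3: 1 - 1 + 1 = 1 > 0.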
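A small sketch of the Case 1 convolution follows. It assumes segments are given as inclusive (start, end) pairs and uses a naive correlation in place of the FFT; the example only inspects even alignments j, because only there does the pattern segment (which starts at P-index 0) share the start parity of the text segment (which starts at index 2).

```python
def correlate(text, pattern):
    n, m = len(text), len(pattern)
    return [sum(text[j + i] * pattern[i] for i in range(m))
            for j in range(n - m + 1)]


def alternating_marks(n, segments):
    """Replace each segment's 1's by +1, -1, +1, ... starting with +1."""
    arr = [0] * n
    for start, end in segments:
        for i in range(start, end + 1):
            arr[i] = 1 if (i - start) % 2 == 0 else -1
    return arr


def plain_marks(m, segments):
    arr = [0] * m
    for start, end in segments:
        for i in range(start, end + 1):
            arr[i] = 1
    return arr


n, m = 10, 4
t_alt = alternating_marks(n, [(2, 6)])   # 0 0 1 -1 1 -1 1 0 0 0
p_ones = plain_marks(m, [(0, 3)])        # 1 1 1 1
conv = correlate(t_alt, p_ones)
# Overlap lengths at j = 0, 2, 4, 6 are 2, 4, 3, 1.
print([conv[j] for j in range(0, n - m + 1, 2)])
```

The result is zero exactly at the alignments whose overlap is even, and one where it is odd.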
Case 2
Case 2 occurs when we pair Toe with Peo (Teo with Poe is symmetric).
If a segment in Toe is contained in a segment in Peo, or vice versa, then the overlap is even; otherwise the overlap is odd.
[Figure: a Toe segment and Peo segments illustrating a containment and a partial overlap.]
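The containment claim can be checked exhaustively on small index ranges. The sketch below is only a sanity check written for this text, with segments as inclusive (start, end) pairs in T's indexes.

```python
def overlap_length(a, b):
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def case2_property_holds(limit):
    """Check: for overlapping Toe/Peo segments, even overlap <=> containment."""
    for st in range(1, limit, 2):                 # Toe: odd start
        for et in range(st + 1, limit, 2):        # Toe: even end
            for sp in range(0, limit, 2):         # Peo: even start
                for ep in range(sp + 1, limit, 2):  # Peo: odd end
                    t, p = (st, et), (sp, ep)
                    k = overlap_length(t, p)
                    if k == 0:
                        continue
                    contained = contains(t, p) or contains(p, t)
                    if contained != (k % 2 == 0):
                        return False
    return True


print(case2_property_holds(12))
```

The reason it holds: both kinds of segment have even length, so a containment overlap is even, while a partial overlap runs between two indexes of equal parity and is therefore odd.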
Case 2: Containment Elimination Property
The convolution at index j gives zero if all overlaps are containments; otherwise it gives a positive result.
To achieve this we will actually use 3 convolutions; a combination of their outputs gives us the desired answer.
Case 2: Fleshing Out The Solution
For each segment St in Toe that starts at index st, replace the segment's 1's by st, 1, …, 1, -st.
For each segment Sp in Peo that starts at index sp, replace the segment's 1's by sp, 1, …, 1, -sp.
For example, a segment of length 4 starting at index 3:
0 0 0 1 1 1 1 0 becomes 0 0 0 3 1 1 -3 0
Case 2: values of the convolution for a single overlap
Text segments are written st, 1, …, 1, -st and pattern segments sp, 1, …, 1, -sp.
Containment, Sp inside St: sp + (len(Sp)-2) + (-sp) = len(Sp)-2
Containment, St inside Sp: st + (len(St)-2) + (-st) = len(St)-2
No containment (overlap of length k), Sp starts inside St: sp + (k-2) + (-st)
No containment (overlap of length k), St starts inside Sp: st + (k-2) + (-sp)
Case 2: Two Problems
Problem 1: the indexes of the pattern Peo change for each index j, raising the preprocessing time to O(m) for each convolution!
Problem 2: we need to find a way to remove "the size of the overlap - 2" from the resulting convolution.

Overlap type                      Convolution value      After removing "overlap - 2"
Containment, Sp inside St         len(Sp)-2              0
Containment, St inside Sp         len(St)-2              0
No containment, Sp starts later   sp + (k-2) + (-st)     sp - st > 0
No containment, St starts later   st + (k-2) + (-sp)     st - sp > 0
Case 2: Solving Problem 2
Perform another convolution, the "Overlap Length Convolution", and subtract its value from the main convolution.
Every segment in both Toe and Peo is replaced by 0, 1, 1, …, 1, 0, giving us "size of overlap - 2" for each overlap.
For an overlap of length 4: 0 + 1 + 1 + 0 = 2 = "overlap - 2".
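The arithmetic of the main convolution and the Overlap Length Convolution can be checked at a single alignment j. This sketch uses the pattern's absolute indexes (P-index + j), so it deliberately ignores Problem 1, and it only exercises overlaps of length at least 2, where the k - 2 term applies. Segment tuples and function names are assumptions of the example.

```python
def weighted(segments, length, shift=0):
    """Replace each segment by start, 1, ..., 1, -start (start in T's indexes)."""
    arr = [0] * length
    for s, e in segments:
        for i in range(s + 1, e):
            arr[i] = 1
        arr[s] = s + shift
        arr[e] = -(s + shift)
    return arr


def interior(segments, length):
    """Replace each segment by 0, 1, ..., 1, 0."""
    arr = [0] * length
    for s, e in segments:
        for i in range(s + 1, e):
            arr[i] = 1
    return arr


def case2_value(n, t_segs, m, p_segs, j):
    """Main convolution minus Overlap Length Convolution at alignment j."""
    t_w, p_w = weighted(t_segs, n), weighted(p_segs, m, shift=j)
    t_in, p_in = interior(t_segs, n), interior(p_segs, m)
    main = sum(t_w[j + i] * p_w[i] for i in range(m))
    length_term = sum(t_in[j + i] * p_in[i] for i in range(m))
    return main - length_term


T_SEGS = [(3, 8)]  # a Toe segment: odd start, even end
print(case2_value(14, T_SEGS, 6, [(0, 1)], 4))  # Sp = [4, 5] inside St
print(case2_value(14, T_SEGS, 6, [(0, 5)], 6))  # Sp = [6, 11], partial overlap
print(case2_value(14, T_SEGS, 6, [(0, 3)], 2))  # Sp = [2, 5], partial overlap
```

The containment evaluates to 0, while the two partial overlaps evaluate to sp - st = 3 and st - sp = 1 respectively.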
Case 2: Solving Problem 1
The trouble is with the pattern Peo segments, whose indexes change with each index j.
Instead, number the pattern segments by their indexes within P itself, which do not depend on j (the "Zero Containment Convolution").
[Figure: a text segment 3, 1, 1, -3 convolved with pattern segments numbered by their own start indexes.]
We created a new problem: the contribution of an overlap can now be negative, and thus the overall convolution at index j can turn out to be zero even when there is an odd overlap.
Example: a text segment starting at index 7 (7, 1, 1, -7) and a pattern segment starting at P-index 2 (2, 1, 1, -2) with an overlap of length 3 give 2 + 1 - 7 = -4.
We want to get the benefits of both worlds. Towards that end we add to the result a third convolution, the "Shifting Convolution". This simply corrects the problem caused by using the pattern's own indexes.
Every segment in T is replaced by 1, 0, …, 0, 1 and every segment in P is replaced by 0, 1, …, 1, 0, and the result at index j is multiplied by j.
This replenishes our "losses".
[Figure: the Shifting Convolution evaluated at j = 2.]
Thus, before the multiplication, the Shifting Convolution gives 0 for each containment overlap and 1 for each non-containment overlap.
[Figure: one non-containment overlap giving 1 and one containment overlap giving 0.]
Thus, multiplying by j, we return "one j" to each non-containment overlap.
Case 2: Final Algorithm
Thus we implement the "Containment Elimination Property" by:
Zero Containment Convolution + Shifting Convolution - Overlap Length Convolution = Containment Elimination Property
Amazing! He's a master of the "Shifting Convolution". Very powerful technique!
Case 3
Case 3 occurs when we pair Too with Pee (Tee with Poo is symmetric).
If a segment in Too is contained in a segment in Pee, or vice versa, then the overlap is odd; otherwise the overlap is even.
[Figure: Too and Pee segments illustrating a containment and a partial overlap.]
Case 3 - Using Case 2
With text segments written st, 1, …, 1, -st and pattern segments sp, 1, …, 1, -sp, a single overlap gives the same values as in Case 2:
Containment, Sp inside St: sp + (len(Sp)-2) + (-sp) = len(Sp)-2
Containment, St inside Sp: st + (len(St)-2) + (-st) = len(St)-2
No containment (overlap of length k), Sp starts inside St: sp + (k-2) + (-st)
No containment (overlap of length k), St starts inside Sp: st + (k-2) + (-sp)
We'll use the same convolution as in Case 2 and two additional ones:
Conv 1: every segment in Too of length len is replaced by 0, 1, 2, …, len-1. Pee segments are replaced by 1, 0, …, 0.
Conv 2 (the opposite of Conv 1): every segment in Pee of length len is replaced by 0, 1, 2, …, len-1. Too segments are replaced by 1, 0, …, 0.
The first convolution gives us, for every two overlapping segments in which St is "ahead" of Sp, the distance from the start of St to the start of Sp:
T: 0 0 1 2 3 4 0 0
P: the single 1 marking the start of Sp aligned under the 3, giving 3
If, for some overlap, the first convolution is positive then the second is zero, and vice versa: here the start of St falls outside Sp, so Conv 2 gives 0.
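The reverse case (second convolution): when Sp is "ahead" of St, the single 1 marking the start of St is aligned under Sp's series 0, 1, 2, 3, 4 and, in the slide's example, Conv 2 gives 3.
The first convolution is now zero, because the start of Sp falls outside St.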
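Conv 1 and Conv 2 can be sketched at a single alignment j as follows. Segments are assumed to be inclusive (start, end) pairs, with the pattern's segments given in P's own indexes; unlike Case 2, neither series depends on j.

```python
def ramp(segments, length):
    """Replace each segment of length len by 0, 1, 2, ..., len-1."""
    arr = [0] * length
    for s, e in segments:
        for i in range(s, e + 1):
            arr[i] = i - s
    return arr


def start_marks(segments, length):
    """Replace each segment by 1, 0, ..., 0."""
    arr = [0] * length
    for s, _ in segments:
        arr[s] = 1
    return arr


def conv1_conv2(n, t_segs, m, p_segs, j):
    t_ramp, t_start = ramp(t_segs, n), start_marks(t_segs, n)
    p_ramp, p_start = ramp(p_segs, m), start_marks(p_segs, m)
    conv1 = sum(t_ramp[j + i] * p_start[i] for i in range(m))
    conv2 = sum(t_start[j + i] * p_ramp[i] for i in range(m))
    return conv1, conv2


T_SEGS = [(3, 7)]  # a Too segment: odd start, odd end
print(conv1_conv2(14, T_SEGS, 5, [(0, 4)], 4))  # Sp = [4, 8], St is ahead
print(conv1_conv2(14, T_SEGS, 5, [(0, 4)], 2))  # Sp = [2, 6], Sp is ahead
print(conv1_conv2(14, T_SEGS, 3, [(0, 2)], 4))  # Sp = [4, 6] inside St
```

In every overlapping pair exactly one of the two values is non-zero, and it equals the distance between the two segment starts.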
This is true also for containments: with the text series 0, 1, 2, 3, 4, 5, 6 and the start of a contained Sp aligned under the 3, Conv 1 gives 3.
The convolution from Case 2 gives the same value for non-containments and zero for containments.
Thus:
Conv 1 + Conv 2 - Conv Case 2 > 0: there is a containment, hence an odd overlap.
Conv 1 + Conv 2 - Conv Case 2 = 0: there are no containments, hence only even overlaps.
Overlap Matching Algorithm: Final Outcome
Each case (1, 2, 3) takes O(n log m):
1. A constant number of preprocessing functions: O(n)
2. A constant number of convolutions: O(n log m)
3. A constant-time computable function: O(1) per index
for a total runtime of O(n log m).
Swap Matching
Example: the second string is obtained from the first by swapping disjoint pairs of adjacent characters.
C O N N E R _ M A C L E O D
C N O N R E _ M A L C E D O
Formal Definition
Let S = s_1, …, s_n be a string over alphabet ∑. A swap permutation for S is a permutation π : {1, …, n} → {1, …, n} such that:
1. If π(i) = j then π(j) = i (characters are swapped in pairs)
2. For all i, π(i) ∈ {i-1, i, i+1} (only adjacent characters are swapped)
3. If π(i) ≠ i then s_π(i) ≠ s_i (identical characters are never swapped)
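For checking small cases, a direct test of one alignment can be sketched as below. It takes "P swap matches T at j" to mean that some swap permutation of P equals T[j..j+m-1], and it relies on the observation that a mismatching position can only be repaired by swapping with its right neighbour once everything to its left is settled. This is a reference check written for this text, not the convolution-based algorithm.

```python
def swap_matches_at(text, pattern, j):
    """True if some swap permutation of pattern equals text[j:j+m]."""
    m = len(pattern)
    if j < 0 or j + m > len(text):
        return False
    i = 0
    while i < m:
        if pattern[i] == text[j + i]:
            i += 1  # position i stays in place
        elif (i + 1 < m
              and pattern[i] != pattern[i + 1]
              and pattern[i] == text[j + i + 1]
              and pattern[i + 1] == text[j + i]):
            i += 2  # positions i and i+1 are swapped
        else:
            return False
    return True


print(swap_matches_at("CNONRE_MALCEDO", "CONNER_MACLEOD", 0))
```

Each position is visited once, so one alignment costs O(m) and checking all alignments this way costs O(nm).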
Lemma (will not be proven): a solution to swap matching over the alphabet {a, b} in time O(f(n, m)) implies a solution in time O(log|∑| f(n, m)) over a general alphabet ∑, and there exists an algorithm to do so.
A. Amir, Y. Aumann, G. Landau, M. Lewenstein, N. Lewenstein, Pattern matching with swaps, J. Algorithms 37 (2) (2000) 247-266.
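Maximal Alternating Segment (MAS)
A MAS is a maximal substring in which no two adjacent characters are equal. For example, the string
a b b a a b a b b b
splits into the MAS ab, ba, abab, b, b.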
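Splitting a string into its MAS takes one left-to-right pass. The sketch below returns inclusive (start, end) index pairs, which is a representation chosen for this example.

```python
def maximal_alternating_segments(s):
    """Split s into maximal runs in which adjacent characters differ."""
    if not s:
        return []
    segments = []
    start = 0
    for i in range(1, len(s)):
        if s[i] == s[i - 1]:
            # Two equal neighbours end one MAS and begin the next.
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(s) - 1))
    return segments


print(maximal_alternating_segments("abbaababbb"))
```

For the slide's string this yields the segments ab, ba, abab, b and b.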
Lemma: the pattern P does NOT match in a particular alignment iff there exist a MAS A in T and a MAS B in P such that:
1. The characters of A and B misalign in the overlap
2. The overlap is of odd length
Lemma Intuition
[Figure: an even-length misaligned overlap, which swaps can repair, and an odd-length misaligned overlap, which they cannot.]
Proof → (by contradiction): assume P is aligned to T at index j, we can't swap match, and the two MAS A, B do not exist. Then one of the following holds:
1. All MAS overlaps match exactly. Contradiction: we don't even need to swap.
2. There exists at least one pair A, B that do not match exactly in an even overlap: w.l.o.g. overlap_A = (ab)* and overlap_B = (ba)*. We can swap within the overlap boundaries and get the desired result. Contradiction.
Thus, there must be a pair of MAS A, B that have a misaligned odd overlap.
Proof ← (by contradiction): assume there exist MAS A, B that misalign in an odd overlap, and that P and T swap match at index j: w.l.o.g. overlap_A = (ab)*a and overlap_B = (ba)*b.
Then we must swap with letters outside of the overlap, but by the definition of a MAS this will not help and we can't swap match. Contradiction.
Swap Matching Algorithm
Construct from T:
1. Teven-a, where all MAS with a's on even indexes are marked segments.
2. Todd-a, where all MAS with a's on odd indexes are marked segments.
[Figure: an example text T over indexes 0 to 9 with its Teven-a and Todd-a marked segments.]
We provide a similar construction for P, giving Peven-a and Podd-a, using P's own indexes!
When matching, if the index j of T is odd we will use one for the other (Peven-a becomes Podd-a and vice versa).
For example, a MAS of Peven-a spanning P's indexes 0 to 4 occupies T's indexes 3 to 7 when P is aligned at T's index 3, so its a's fall on odd indexes of T and it acts as a Podd-a segment.
If index j is even, T swap matches P at j iff Teven-a overlap matches Podd-a at j and Todd-a overlap matches Peven-a at j.
If index j is odd, T swap matches P at j iff Teven-a overlap matches Peven-a at j and Todd-a overlap matches Podd-a at j.
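The reduction can be sketched end to end for strings over {a, b}. To keep the example short, the overlap test below is a brute-force check of every segment pair rather than the O(n log m) convolution algorithm; segments are inclusive (start, end) pairs and all names are illustrative.

```python
def mas_by_a_parity(s):
    """Split s into MAS, grouped by the parity of the indexes holding 'a'.

    Key 0 holds the even-a segments, key 1 the odd-a segments. A MAS
    consisting of a single 'b' is classified by where an 'a' would fall.
    """
    classes = {0: [], 1: []}
    start = 0
    for i in range(1, len(s) + 1):
        if i == len(s) or s[i] == s[i - 1]:
            a_parity = start % 2 if s[start] == "a" else (start + 1) % 2
            classes[a_parity].append((start, i - 1))
            start = i
    return classes


def all_overlaps_even(t_segs, p_segs, j):
    """Brute-force overlap matching test for P aligned at j."""
    for ts, te in t_segs:
        for ps, pe in p_segs:
            k = min(te, pe + j) - max(ts, ps + j) + 1
            if k > 0 and k % 2 == 1:
                return False
    return True


def swap_match_positions(T, P):
    t, p = mas_by_a_parity(T), mas_by_a_parity(P)
    positions = []
    for j in range(len(T) - len(P) + 1):
        if j % 2 == 0:
            ok = (all_overlaps_even(t[0], p[1], j)
                  and all_overlaps_even(t[1], p[0], j))
        else:
            ok = (all_overlaps_even(t[0], p[0], j)
                  and all_overlaps_even(t[1], p[1], j))
        if ok:
            positions.append(j)
    return positions


print(swap_match_positions("ababbaaa", "abba"))
```

Only the pairings of opposite a-parity (after accounting for the parity of j) are tested, because segments of equal a-parity agree on every overlapping character.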
Why does it work?
An even-a MAS and an odd-a MAS will never exactly match: wherever they overlap, one has an a where the other has a b.
By the lemma, if their overlap is odd then swap matching is not possible, and this is exactly what we examine using the Overlap Matching method.
Swap Matching Algorithm Runtime: O(n log m)
1. We pay O(n) to segmentize into MAS.
2. We pay O(n log m) to run overlap matching.
Thus, for an alphabet ∑ we can swap match in O(n log m log|∑|) time.
This improves on the previous deterministic upper bound of O(n m^{1/3} log m log|∑|):
A. Amir, Y. Aumann, G. Landau, M. Lewenstein, N. Lewenstein, Pattern matching with swaps, J. Algorithms 37 (2) (2000) 247-266.