k-Abelian pattern matching: Revisited, corrected, and extended
Golnaz Badkobeh, Hideo Bannai, Maxime Crochemore, Tomohiro I, Shunsuke Inenaga, Shiho Sugimoto
© NEC Corporation 2019
Introduction
▌Since the seminal paper [Erdős, 1961], the study of Abelian equivalence on strings has attracted much attention in both combinatorics on words and string algorithmics.
▌k-Abelian equivalence [Huova et al., 2011] is a generalization of Abelian equivalence.
Introduction
▌Our work is based on the paper [Ehlers et al., 2015].
  ● That paper is the first work on k-Abelian pattern matching.
  ● We found that some of the bounds claimed by Ehlers et al. are unfortunately incorrect.
▌In this paper, we present corrected bounds and algorithms.
Abelian Equivalence
▌Two strings u and v of equal length are said to be Abelian equivalent, denoted u =_A v, if the number of occurrences of each letter is the same in u and v.
▌Examples:
  lemon =_A melon
  listen =_A silent
  eleven plus two =_A twelve plus one
  11 + 2 =_A 12 + 1
Abelian Pattern Matching
▌Definition: Given a text T of length n and a pattern P of length m, locate all factors of T that are Abelian equivalent to P.
▌Example: T = ababbcacbacb, P = abccb
▌Several algorithms solve Abelian pattern matching (also called jumbled pattern matching) [Amir et al., 2016], [Butman et al., 2004], [Kociumaka et al., 2017].
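The definition above can be illustrated with a textbook sliding-window scan. This is a minimal sketch, not one of the cited algorithms: the function name `abelian_matches` and the 0-based reporting of positions are assumptions made for this example.

```python
from collections import defaultdict


def abelian_matches(text, pattern):
    """Return the 0-based starts of all factors of text Abelian equivalent to pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # diff[c] = (occurrences of c in the window) - (occurrences of c in the pattern)
    diff = defaultdict(int)
    for c in pattern:
        diff[c] -= 1
    for c in text[:m]:
        diff[c] += 1
    nonzero = sum(1 for v in diff.values() if v != 0)

    def bump(c, delta):
        # Change diff[c] by delta (+1 or -1) and keep the nonzero counter in sync.
        nonlocal nonzero
        before = diff[c]
        diff[c] = before + delta
        if before == 0:
            nonzero += 1
        elif diff[c] == 0:
            nonzero -= 1

    matches = [0] if nonzero == 0 else []
    for i in range(1, n - m + 1):
        bump(text[i - 1], -1)      # letter leaving the window
        bump(text[i + m - 1], +1)  # letter entering the window
        if nonzero == 0:
            matches.append(i)
    return matches


print(abelian_matches("ababbcacbacb", "abccb"))
```

On the example of this slide, the factors bbcac, bcacb, and cbacb are reported (0-based starts 3, 4, and 7).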
k-Abelian Equivalence
▌Definition: For a positive integer k, two strings u and v of equal length are said to be k-Abelian equivalent if the number of occurrences of each string of length at most k is the same in u and v.
▌Alternative definition: For a positive integer k, two strings u and v of equal length are said to be k-Abelian equivalent if the number of occurrences of each string of length k is the same in u and v, and the prefixes of length k-1 of u and v are the same.
k-Abelian Equivalence
▌Example (k = 3): u = abaababbaab, v = abbaabaabab
  ● Both strings start with ab, and both contain the 3-grams aba, baa, and aab twice each and bab, abb, and bba once each.
▌Note: 1-Abelian equivalence is Abelian equivalence.
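Both definitions of k-Abelian equivalence can be checked directly on the example (k = 3). The following is a minimal sketch; the helper names `kgram_counts`, `k_abelian_equivalent`, and `k_abelian_equivalent_by_definition` are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def kgram_counts(w, k):
    """Multiset of the factors of length k of w (empty if w is shorter than k)."""
    return Counter(w[i:i + k] for i in range(len(w) - k + 1))


def k_abelian_equivalent(u, v, k):
    """Alternative definition: equal k-gram counts and equal prefixes of length k-1."""
    if len(u) != len(v):
        return False
    return kgram_counts(u, k) == kgram_counts(v, k) and u[:k - 1] == v[:k - 1]


def k_abelian_equivalent_by_definition(u, v, k):
    """First definition: equal counts for every factor length from 1 to k."""
    if len(u) != len(v):
        return False
    return all(kgram_counts(u, j) == kgram_counts(v, j) for j in range(1, k + 1))


u, v = "abaababbaab", "abbaabaabab"
print(k_abelian_equivalent(u, v, 3), k_abelian_equivalent_by_definition(u, v, 3))
```

With k = 1 the prefix condition is empty, so the check reduces to plain Abelian equivalence, in line with the note on the slide.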
Offline / Online k-Abelian Pattern Matching
▌Definition: Given a text T of length n and a pattern P of length m over an alphabet of size s, and a positive integer k, locate all factors of T that are k-Abelian equivalent to P.
▌Example: T = bcaababcaa, P = abaab
▌Online: P and k are given in advance for preprocessing, and the text T is then given as the query string; locate all factors of T that are k-Abelian equivalent to P.
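As a baseline for the problem just defined, the sketch below slides a window of length m over T while maintaining k-gram counts in a hash table. It is not the algorithm of Ehlers et al. or of this talk: the name `k_abelian_matches` is hypothetical, and extracting each k-gram as a substring costs O(k) time, so the scan takes O(nk) expected time.

```python
from collections import Counter


def k_abelian_matches(text, pattern, k):
    """Return the 0-based starts of all factors of text k-Abelian equivalent to pattern."""
    n, m = len(text), len(pattern)
    if k < 1 or m == 0 or m > n:
        return []
    grams = m - k + 1          # number of k-grams in a window (<= 0 if m < k)
    prefix_len = min(k - 1, m)
    # diff[g] = (count of g in the window) - (count of g in the pattern)
    diff = Counter()
    for i in range(grams):
        diff[pattern[i:i + k]] -= 1
        diff[text[i:i + k]] += 1
    nonzero = sum(1 for v in diff.values() if v != 0)

    def bump(g, delta):
        nonlocal nonzero
        before = diff[g]
        diff[g] = before + delta
        if before == 0:
            nonzero += 1
        elif diff[g] == 0:
            nonzero -= 1

    matches = []
    for i in range(n - m + 1):
        if i > 0 and grams > 0:
            bump(text[i - 1:i - 1 + k], -1)      # k-gram leaving the window
            bump(text[i + m - k:i + m], +1)      # k-gram entering the window
        if nonzero == 0 and text[i:i + prefix_len] == pattern[:prefix_len]:
            matches.append(i)
    return matches


print(k_abelian_matches("bcaababcaa", "abaab", 2))
```

For the example on this slide with k = 2, the only match is the factor aabab at 0-based position 2; with k = 3 there is no match, because the prefixes of length 2 differ.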
Offline k-Abelian Pattern Matching – Previous Work
▌Previous work [Ehlers et al., 2015] claims: This problem can be solved in O(n+m) time and O(m) space.
▌To achieve O(m) working space, each factor T[tm..(t+2)m] of length 2m is processed separately.
[Figure: the text T of length n covered by overlapping factors t_1, t_2, ..., t_{n/m}, each of length 2m]
Offline k-Abelian Pattern Matching – Previous Work
▌For each factor T[tm..(t+2)m] of length 2m, compute an encoding of the string T[tm..(t+2)m]$P.
  ● Each k-gram is replaced by its lexicographical rank.
  ● Example (k = 2): bcaababcaa$abaab is encoded as 561243461$2413.
  ● Abelian pattern matching is then performed on the encoded string.
[Figure: each factor t_1, t_2, ..., t_{n/m} of T is concatenated with $P and encoded]
Offline k-Abelian Pattern Matching – Previous Work
▌To compute the encoded strings and to test Abelian equivalence, they build the suffix array.
  ● Any existing linear-time suffix array construction algorithm needs O(n) space for each factor.
▌Previous work [Ehlers et al., 2015] nevertheless claims that this problem can be solved in O(n+m) time and O(m) space.
Offline k-Abelian Pattern Matching – Our Algorithm
▌We give an O(m)-space solution when the pattern P is over an integer alphabet [1..cm], where c is a constant.
▌We replace every letter of T that does not appear in P with the new letter cm+1, obtaining a text T' over an alphabet of size cm+1.
▌Example (the new letter is written x):
  T  = cabacbcaababcaadeba
  T' = xabaxbxaababxaaxxba
  P  = abaab
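The alphabet reduction on this slide can be sketched in a few lines. The function name `reduce_text_alphabet` and the use of the character `x` as the new letter are assumptions of this illustration; on the slide the new letter is the integer cm+1.

```python
def reduce_text_alphabet(text, pattern, sentinel="x"):
    """Replace every letter of text that does not occur in pattern by one sentinel letter."""
    letters = set(pattern)
    if sentinel in letters:
        raise ValueError("the sentinel must not occur in the pattern")
    return "".join(c if c in letters else sentinel for c in text)


print(reduce_text_alphabet("cabacbcaababcaadeba", "abaab"))
```

No factor containing the new letter can match P, so the set of occurrences is unchanged while the alphabet of the text is bounded in terms of m.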
Offline k-Abelian Pattern Matching – Our Algorithm
▌For each t (0 ≤ t ≤ ⌈n/m⌉-2), construct the suffix tree of w_t = T'[tm+1..(t+2)m]$P.
[Figure: T' of length n covered by factors t_1, t_2, ..., t_{n/m} of length 2m; each factor is concatenated with $P and its suffix tree is built]
Offline k-Abelian Pattern Matching – Our Algorithm
▌For each t (0 ≤ t ≤ ⌈n/m⌉-2), construct the suffix tree of w_t = T'[tm+1..(t+2)m]$P. O(m) time and space
▌For each occurrence of a k-gram in P, construct a bucket. O(m) time
▌For every leaf in the suffix tree, compute its ancestor of depth k. O(m) time
▌Example (k = 2): w_t = bxaababxaa$abaab
[Figure: suffix tree of w_t with the buckets for the k-grams of P]
Offline k-Abelian Pattern Matching – Our Algorithm
▌We check whether each factor of T' of length m fills all the buckets. O(m) time
▌We keep track of a sliding window of length m.
  ● Positions are removed from the buckets as soon as they fall out of the window.
▌If all the buckets are filled, we check whether the prefixes of length k-1 are the same.
▌Example (k = 2): w_t = bxaababxaa$abaab
[Figure: suffix tree of w_t; as the window slides, positions 3, 4, 5, and 6 of w_t are placed into the buckets in turn]
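The block-wise scan with buckets and a sliding window can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a hash table keyed by k-gram strings stands in for the depth-k nodes of the suffix tree, so this sketch uses O(mk) space and expected time bounds rather than the O(m) worst-case bounds of the slides. The function name `offline_k_abelian_matches`, the sentinel character, and the 0-based positions are hypothetical.

```python
from collections import Counter


def offline_k_abelian_matches(text, pattern, k, sentinel="\x00"):
    """Scan text in blocks of length 2m and report 0-based starts of k-Abelian matches."""
    n, m = len(text), len(pattern)
    if k < 1 or m < k or m > n:
        return []
    letters = set(pattern)
    # One bucket per occurrence of a k-gram in P: capacity[g] buckets for k-gram g.
    capacity = Counter(pattern[i:i + k] for i in range(m - k + 1))
    grams_per_window = m - k + 1
    matches = []
    for start in range(0, n - m + 1, m):          # block t starts at position t*m
        # Letters absent from P are replaced by the sentinel (assumed not in P).
        block = "".join(c if c in letters else sentinel
                        for c in text[start:start + 2 * m])
        have = Counter()   # occurrences of each k-gram of P inside the window
        filled = 0         # number of buckets currently filled
        for j in range(len(block) - k + 1):
            g = block[j:j + k]                    # k-gram entering the window
            if g in capacity:
                if have[g] < capacity[g]:
                    filled += 1
                have[g] += 1
            out = j - grams_per_window            # k-gram leaving the window
            if out >= 0:
                h = block[out:out + k]
                if h in capacity:
                    have[h] -= 1
                    if have[h] < capacity[h]:
                        filled -= 1
            w = out + 1                           # start of the window inside the block
            # Windows starting at offset >= m are handled by the next block.
            if (0 <= w < m and filled == grams_per_window
                    and block[w:w + k - 1] == pattern[:k - 1]):
                matches.append(start + w)
    return matches


print(offline_k_abelian_matches("cabacbcaababcaadeba", "abaab", 2))
```

Each window start is examined in exactly one block, so no occurrence is reported twice even though consecutive blocks overlap by m letters.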
Offline k-Abelian Pattern Matching – Our Algorithm
▌Theorem: Let P be a pattern of length m over an integer alphabet [1..cm] for any positive constant c, and let T be a text of length n over an arbitrary integer alphabet. Then, for a given integer k > 0, offline k-Abelian pattern matching can be solved in O(n+m) time using O(m) space.
Online k-Abelian Pattern Matching – Previous Work
▌Previous work [Ehlers et al., 2015] claims: This problem can be solved with O(m) preprocessing time, O(m) working space, and O(log log s) time per text letter.
Online k-Abelian Pattern Matching – Previous Work
▌The algorithm of [Ehlers et al., 2015]:
  ● Compute the suffix tree of P.
  ● Traverse it with T.
  ● Example (k = 2): P = abaab, T = bxaababxaa
[Figure: suffix tree of P]
Online k-Abelian Pattern Matching – Previous Work
▌To traverse the suffix tree quickly, they use van Emde Boas structures.
  ● For an integer universe U = [1..u], a van Emde Boas structure requires Θ(u) space.
  ● For each node of the suffix tree, the universe size u is equal to s.
  ● The suffix tree of a pattern P of length m can have Θ(m) nodes.
  ● Hence this approach uses O(ms) space rather than the claimed O(m).
Extended k-Abelian Equivalence
▌Definition: For a positive integer k, two strings u and v of equal length are said to be extended k-Abelian equivalent if the number of occurrences of each string of length k is the same in u and v. The condition that the prefixes of length k-1 of u and v are the same is dropped.
▌Example (k = 3): u = abaababbaab, v = baabaabbaba
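The example can be checked with a short sketch that compares k-gram counts only. The function names are hypothetical; the second function is included solely to show that the pair on this slide is not k-Abelian equivalent in the original sense.

```python
from collections import Counter


def extended_k_abelian_equivalent(u, v, k):
    """Equal length and equal k-gram counts; no condition on prefixes."""
    if len(u) != len(v):
        return False
    grams_u = Counter(u[i:i + k] for i in range(len(u) - k + 1))
    grams_v = Counter(v[i:i + k] for i in range(len(v) - k + 1))
    return grams_u == grams_v


def k_abelian_equivalent(u, v, k):
    """Original notion: extended equivalence plus equal prefixes of length k-1."""
    return extended_k_abelian_equivalent(u, v, k) and u[:k - 1] == v[:k - 1]


u, v = "abaababbaab", "baabaabbaba"
print(extended_k_abelian_equivalent(u, v, 3), k_abelian_equivalent(u, v, 3))
```

Here u and v have the same 3-grams with the same multiplicities, but u starts with ab while v starts with ba.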
Extended k-Abelian Pattern Matching – Previous Work
▌Previous work [Ehlers et al., 2015] claims: This problem can be solved with O(m log k) preprocessing time, O(ms) working space, and O(1) worst-case time per text letter.
▌The algorithm computes the k-truncated suffix tree of P and traverses it with T.
▌It relies on the result of Gawrychowski et al. (2014) for constant-time weighted ancestor queries on suffix trees.
Extended k-Abelian Pattern Matching – Previous Work
▌Gawrychowski et al. proposed a data structure that, given i and j, returns in constant time the node of the suffix tree of P that corresponds to the factor P[i..j].
  ● Previous work [Ehlers et al., 2015] claims that its preprocessing takes O(m log k) time for the k-truncated suffix tree.
  ● However, the paper of Gawrychowski et al. does not consider construction time.
  ● Constructing it in O(m log k) time seems challenging. One of the authors speculated that O(m log^3 k) or O(m log^4 k) construction time might be plausible.
Extended k-Abelian Pattern Matching – Our Algorithm
▌Instead of the data structure proposed by Gawrychowski et al., we use a weighted ancestor data structure.
  ● If there is a dynamic predecessor data structure for a set of m integers, then weighted ancestor queries on a weighted tree with m nodes can be answered in O(pred(m, m)) time with O(m) space [Kopelowitz and Lewenstein, 2007], where pred(m, m) is the query/update time of the dynamic predecessor data structure.
Extended k-Abelian Pattern Matching – Our Algorithm
▌If we use the dynamic predecessor data structure of Beame and Fich, extended k-Abelian pattern matching can be solved with O(m · pred(m, m)) preprocessing time, O(m) working space, and O(pred(m, m)) worst-case time per text letter, where pred(m, m) is the query/update time of that structure.
Extended k-Abelian Pattern Matching – Our Algorithm
▌If we use a y-fast trie [Willard, 1983] in conjunction with cuckoo hashing [Pagh and Rodler, 2004], extended k-Abelian pattern matching can be solved with O(m log log m) expected preprocessing time, O(m) working space, and O(log log m) worst-case time per text letter.
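To make the online setting concrete, the sketch below preprocesses P and k and then consumes the text one letter at a time. It is only an illustration of the interface: a hash table keyed by k-gram strings replaces the k-truncated suffix tree and the weighted ancestor structure, so each letter costs O(k) expected time and the working space is O(mk), not the bounds stated on the slides. The class name `OnlineExtendedMatcher` and its method `feed` are hypothetical.

```python
from collections import Counter, deque


class OnlineExtendedMatcher:
    """Online extended k-Abelian matching with hashed k-grams (illustrative only)."""

    def __init__(self, pattern, k):
        if not 1 <= k <= len(pattern):
            raise ValueError("k must satisfy 1 <= k <= len(pattern)")
        self.k = k
        self.m = len(pattern)
        self.grams_per_window = self.m - k + 1
        # capacity[g] = number of occurrences of the k-gram g in the pattern
        self.capacity = Counter(pattern[i:i + k]
                                for i in range(self.grams_per_window))
        self.have = Counter()          # occurrences of pattern k-grams in the window
        self.filled = 0                # sum over g of min(have[g], capacity[g])
        self.recent = deque(maxlen=k)  # the last k text letters
        self.window = deque()          # k-grams currently inside the window
        self.read = 0                  # number of text letters consumed so far

    def feed(self, letter):
        """Consume one text letter; return the 0-based start of a match ending here, or None."""
        self.read += 1
        self.recent.append(letter)
        if len(self.recent) < self.k:
            return None
        g = "".join(self.recent)
        self.window.append(g)
        if g in self.capacity:
            if self.have[g] < self.capacity[g]:
                self.filled += 1
            self.have[g] += 1
        if len(self.window) > self.grams_per_window:
            h = self.window.popleft()  # k-gram that just left the window
            if h in self.capacity:
                self.have[h] -= 1
                if self.have[h] < self.capacity[h]:
                    self.filled -= 1
        if (len(self.window) == self.grams_per_window
                and self.filled == self.grams_per_window):
            return self.read - self.m
        return None


matcher = OnlineExtendedMatcher("abaab", 2)
hits = [p for p in (matcher.feed(c) for c in "bxaababxaa") if p is not None]
print(hits)
```

A window matches exactly when every k-gram occurrence of P has a counterpart in the window, which is tracked by the single counter `filled`.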
Extended k-Abelian Pattern Matching – Our Algorithm
▌Theorem: Given a static pattern P of length m over an integer alphabet [1..s] and a positive integer k, extended k-Abelian pattern matching can be solved with any of the following:
  ● O(ms) preprocessing time, O(ms) working space, and O(log k) worst-case time per text letter
  ● O(m log log m) preprocessing time, O(m) working space, and O(log k) worst-case time per text letter
  ● O(m · pred(m, m)) preprocessing time, O(m) working space, and O(pred(m, m)) worst-case time per text letter, where pred(m, m) is the query/update time of the dynamic predecessor data structure of Beame and Fich
  ● O(m log log m) expected preprocessing time, O(m) working space, and O(log log m) worst-case time per text letter
Conclusion
▌For offline k-Abelian pattern matching, we showed the following. Let P be a pattern of length m over an integer alphabet [1..cm] for any positive constant c, and let T be a text of length n over an arbitrary integer alphabet. Then, for a given integer k > 0, offline k-Abelian pattern matching can be solved in O(n+m) time using O(m) space.
Conclusion
▌For online k-Abelian pattern matching, we presented the corrected complexity of the algorithm proposed in previous work [Ehlers et al., 2015].
Conclusion
▌For extended k-Abelian pattern matching, we showed the following. Let P be a pattern of length m and T be a text of length n over an arbitrary integer alphabet. Then, for a given integer k > 0, extended k-Abelian pattern matching can be solved with any of the following:
  ● O(ms) preprocessing time, O(ms) working space, and O(log k) worst-case time per text letter
  ● O(m log log m) preprocessing time, O(m) working space, and O(log k) worst-case time per text letter
  ● O(m · pred(m, m)) preprocessing time, O(m) working space, and O(pred(m, m)) worst-case time per text letter, where pred(m, m) is the query/update time of the dynamic predecessor data structure of Beame and Fich
  ● O(m log log m) expected preprocessing time, O(m) working space, and O(log log m) worst-case time per text letter