Efficient Computation of Substring Equivalence Classes with Suffix Arrays

Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda
Kyushu University, Japan

Contents
• Introduction
• Problem definition
• Suffix tree based algorithm
• Simulation by suffix array
• Computational experiment
• Application
• Summary

Main contribution
Time- and space-efficient computation of substring equivalence classes [Blumer et al. 1987] with suffix arrays
• Runs in linear time and space
• Is faster and requires less memory than suffix tree and CDAWG based methods

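Equivalence relation and classes
Substrings whose occurrences in w are essentially identical belong to the same class.
Example:
Betty–bought–a–bit–of–better–butter–and–made–a–better–batter–after–breakfast.
bet  [–better–b]
Every occurrence of bet lies inside an occurrence of –better–b, so bet belongs to the class represented by –better–b.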

Problem
• Input: string w of length n
• Output: the equivalence classes on w
• Difficulty
 ▫ The total number of elements in the equivalence classes (ECs for short) is O(n^2).
• Solution
 ▫ The number of ECs is O(n).
 ▫ Each EC can be succinctly represented in O(1) space.

Succinct representation of the ECs
• representative: the longest element (maximal extension)
• minimal strings: the elements which belong to another EC when the leftmost or rightmost character is deleted
$[x] = \mathrm{Substring}(r) \cap \bigcup_{y\ \text{minimal}} \mathrm{Superstring}(y)$, where r is the representative of [x]
The elements of [x] can be enumerated with the representative and the minimal strings.
Example:
Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.
[Figure: the representative -better-b with the minimal strings of its EC aligned beneath it]
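The definitions above can be checked by brute force on the example sentence. The sketch below is not the algorithm of this talk: it is a quadratic-time illustration, and the function names (`occurrences`, `maximal_extension`, `equivalence_class`) are hypothetical. It assumes that a string is minimal when deleting either its leftmost or its rightmost character leaves the class.

```python
def occurrences(w, x):
    """Start positions of all (possibly overlapping) occurrences of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Longest string around x that occurs exactly where x occurs (x must occur in w)."""
    occ = occurrences(w, x)
    left = 0
    # Extend to the left while every occurrence is preceded by the same character.
    while all(p - left - 1 >= 0 for p in occ) and \
            len({w[p - left - 1] for p in occ}) == 1:
        left += 1
    right = 0
    # Extend to the right while every occurrence is followed by the same character.
    while all(p + len(x) + right < len(w) for p in occ) and \
            len({w[p + len(x) + right] for p in occ}) == 1:
        right += 1
    p = occ[0]
    return w[p - left:p + len(x) + right]


def equivalence_class(w, x):
    """Return (representative, members, minimal strings) of the class of x."""
    rep = maximal_extension(w, x)
    subs = {rep[i:j] for i in range(len(rep)) for j in range(i + 1, len(rep) + 1)}
    members = {s for s in subs if maximal_extension(w, s) == rep}
    # Assumption: minimal means that both one-character deletions leave the class.
    minimal = {s for s in members
               if s[1:] not in members and s[:-1] not in members}
    return rep, members, minimal


w = "Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast."
rep, members, minimal = equivalence_class(w, "bet")
print(rep, sorted(minimal), len(members))
```

Here `len(members)` is the size of the class and `len(occurrences(w, rep))` is its frequency, the two extra values the talk outputs for each EC.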

Problem
• Input: string w of length n
• Output: succinct representations of the equivalence classes on w
 ▫ additionally, for each EC we output
   size (the number of elements in the EC)
   frequency (the number of occurrences of the elements in the EC)

Possible solutions
• Suffix tree [Weiner 1973]
• Compact Directed Acyclic Word Graph (CDAWG) [Blumer et al. 1985]
• ECs can be computed with either data structure in linear time and space.

Suffix tree (with suffix links)
w = ababbc$
Leaves are ignored here because they form a trivial EC.
[Figure: suffix tree of ababbc$ with suffix links]

Equivalence classes on suffix tree
w = ababbc$
EC def.: substrings with essentially the same occurrences
[Figure: suffix tree of ababbc$ with the strings of each EC grouped; the highlighted strings include abb, babb, bab and ba]

Suffix tree algorithm
foreach node v in suffix tree {
    if (node v is representative of EC [v]≡) {
        follow suffix link;
        while (node is in EC [v]≡) {
            follow suffix link;
            compute size and minimal strings;
        }
    }
    output succinct representation of EC [v]≡;
}

Algorithm with suffix tree
[Figure: animated traversal of the suffix tree. At each node: representative? (number of incoming suffix links = 1 or not?) If representative, follow suffix links while the node is in the same EC, then go back to the representative, output the representative, minimal strings, size and frequency, and continue the suffix tree traversal.]
Suffix tree requires large memory space.

Suffix array [Manber and Myers 1993]
• Can simulate traversal on the suffix tree using the lcp and rank arrays [Kasai et al. 2003]
• Can simulate traversal on suffix links using an additional data structure, the suffix link table [Abouelhoda et al. 2004]

Suffix array
w = ababbc$
The suffix array lists the suffixes of w in lexicographic order.
[Figure: the suffixes of ababbc$ sorted lexicographically, shown beside the suffix tree]
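As a small illustration of "lexicographically sort suffixes", the sketch below builds a suffix array by directly sorting the suffixes. This naive construction is an assumption made for brevity and is not linear time. Positions are 0-based here, so the numbers differ from the figure. `suffix_array` is a hypothetical helper name.

```python
def suffix_array(w):
    """Start positions of the suffixes of w, in lexicographic order of the suffixes."""
    return sorted(range(len(w)), key=lambda i: w[i:])


w = "ababbc$"
for rank, start in enumerate(suffix_array(w)):
    print(rank, start, w[start:])
```

The terminator `$` sorts before the letters in ASCII, so the suffix consisting of `$` alone comes first.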

Lcp array
w = ababbc$
lcp[i]: the length of the longest common prefix of the i-th and (i – 1)-th suffixes in the suffix array
[Figure: lcp array and suffix array shown beside the suffix tree]
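The lcp array as defined above can be computed in linear time once the suffix array is known. The sketch below uses the method commonly known as Kasai's algorithm. It assumes 0-based indices and sets lcp[0] = 0 for the first entry, which has no predecessor; the figure may use a different sentinel. The naive `suffix_array` helper is included only to make the block self-contained.

```python
def suffix_array(w):
    """Naive suffix array (for illustration only, not linear time)."""
    return sorted(range(len(w)), key=lambda i: w[i:])


def lcp_array(w, sa):
    """lcp[i] = length of the common prefix of the suffixes at ranks i-1 and i."""
    n = len(w)
    rank = [0] * n
    for i, s in enumerate(sa):
        rank[s] = i
    lcp = [0] * n
    h = 0  # common prefix length carried over from the previous text position
    for s in range(n):  # visit suffixes in text order
        if rank[s] == 0:
            h = 0
            continue
        prev = sa[rank[s] - 1]  # suffix just before s in sorted order
        while s + h < n and prev + h < n and w[s + h] == w[prev + h]:
            h += 1
        lcp[rank[s]] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp


w = "ababbc$"
print(lcp_array(w, suffix_array(w)))
```

The counter `h` decreases by at most one per step, which is what bounds the total work by O(n).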

Rank array
w = ababbc$
rank[SA[i]] = i
[Figure: rank array and suffix array shown beside the suffix tree]

Suffix array has less information
Information available during traversal for each data structure, when visiting node v

Suffix tree
1. label from root to each node
2. label from parent to each node
3. number of leaves in each subtree
4. parent of each node
5. children of each node
6. suffix link of each node

Suffix array
1. length of label from root to v
2. length of label from root to the parent of v
3. leftmost leaf ID in the subtree rooted at v
4. rightmost leaf ID in the subtree rooted at v

Suffix array has less information
[Figure: suffix tree with one node highlighted; label length from root: 4, length of parent label from root: 1]
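The four values listed for the suffix array side can be obtained with a single stack-based scan of the lcp array, which is one common way to simulate a bottom-up suffix tree traversal. The sketch below is an illustration under assumptions, not the talk's implementation: indices are 0-based, lcp[0] = 0, and `internal_nodes` is a hypothetical name. Each reported tuple is (label length of v, leftmost rank, rightmost rank, label length of the parent of v).

```python
def internal_nodes(lcp):
    """List (depth, left, right, parent_depth) for every internal node, children first."""
    n = len(lcp)
    nodes = []
    stack = [(0, 0)]  # (depth, leftmost rank); the root is always at the bottom
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1  # -1 flushes the whole stack at the end
        left = i - 1
        while stack and stack[-1][0] > cur:
            depth, left = stack.pop()  # node ends at rank i - 1
            # The parent is the deeper of the pending node and the new boundary.
            parent = max(cur, stack[-1][0]) if stack else None
            nodes.append((depth, left, i - 1, parent))
        if stack and stack[-1][0] < cur:
            stack.append((cur, left))  # a deeper node starts at rank `left`
    return nodes


# lcp array of ababbc$ with 0-based ranks (suffix array [6, 0, 2, 1, 3, 4, 5])
for node in internal_nodes([0, 0, 2, 0, 1, 1, 0]):
    print(node)
```

The frequency of a node is `right - left + 1`, and the length of the edge label from its parent is `depth - parent_depth`.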

Suffix array algorithm
foreach v in suffix tree (simulated by suffix array) {
    if (node v is representative of EC [v]≡) {      ← difficulty 1
        follow suffix link;
        while (node is in EC [v]≡) {                ← difficulty 2
            follow suffix link;
            compute size and minimal strings;       ← difficulty 3
        }
    }
    output succinct representation of EC [v]≡;
}
These steps are difficult because the suffix array has less information.

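Solving difficulty 1 (representative judge)
[Figure: node v and its range in the suffix array. Indices l and r have ranks L = rank(l) and R = rank(r); the preceding positions l – 1 and r – 1 have ranks L’ = rank(l – 1) and R’ = rank(r – 1).]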
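One way to read this slide is as a left-maximality test: a node is a representative when its occurrences are not all preceded by the same character. The sketch below is a reconstruction under that assumption and may differ from the exact test in the talk. It uses 0-based indices and hypothetical names (`build_sa_rank`, `is_left_maximal`). The idea is that if every occurrence is preceded by the same character c, the suffixes starting one position earlier form a range of the same width, so R’ – L’ = R – L.

```python
def build_sa_rank(w):
    """Naive suffix array and its inverse (rank[sa[i]] = i)."""
    sa = sorted(range(len(w)), key=lambda i: w[i:])
    rank = [0] * len(w)
    for i, s in enumerate(sa):
        rank[s] = i
    return sa, rank


def is_left_maximal(w, sa, rank, L, R):
    """True if the node with suffix array range [L, R] cannot be extended to the left."""
    l, r = sa[L], sa[R]
    if l == 0 or r == 0:
        return True  # one occurrence is a prefix of w
    if w[l - 1] != w[r - 1]:
        return True  # two occurrences have different preceding characters
    # Same preceding character at both ends: the shifted range has the same
    # width only if every occurrence in between is preceded by it as well.
    return rank[r - 1] - rank[l - 1] != R - L


w = "abcabc$"
sa, rank = build_sa_rank(w)
# Ranges of the internal nodes abc, bc and c
print([is_left_maximal(w, sa, rank, L, R) for L, R in [(1, 2), (3, 4), (5, 6)]])
```

In `abcabc$` only `abc` is a representative; `bc` and `c` are always preceded by the same character and fall into the class of `abc`.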

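Solving difficulty 2 (equivalence relation judge)
ax: label from root to v
[Figure: node v and its range in the suffix array. Indices l and r have ranks L = rank(l) and R = rank(r); the following positions l + 1 and r + 1 have ranks L’ = rank(l + 1) and R’ = rank(r + 1).]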
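A possible reading of this slide: the suffix link of the node labelled ax leads to the node labelled x, and the two are in the same EC exactly when x occurs as often as ax. The sketch below checks this with the rank and lcp arrays only. It is a reconstruction under assumptions, not necessarily the test used in the talk: indices are 0-based, w ends with a unique smallest terminator, lcp carries an extra sentinel entry lcp[n] = 0, and the names `build` and `suffix_link_in_same_class` are hypothetical.

```python
def build(w):
    """Naive suffix array, rank array, and lcp array with sentinels lcp[0] = lcp[n] = 0."""
    n = len(w)
    sa = sorted(range(n), key=lambda i: w[i:])
    rank = [0] * n
    for i, s in enumerate(sa):
        rank[s] = i
    lcp = [0] * (n + 1)
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        while w[a + lcp[i]] == w[b + lcp[i]]:  # the unique terminator stops the scan
            lcp[i] += 1
    return sa, rank, lcp


def suffix_link_in_same_class(sa, rank, lcp, L, R, depth):
    """For node ax with range [L, R] and label length depth: is x in the same EC?"""
    L2, R2 = rank[sa[L] + 1], rank[sa[R] + 1]  # ranks of the suffixes without the first character
    d2 = depth - 1                             # length of x
    # [L2, R2] must be the whole range of x (both boundaries have lcp < |x|)
    # and must be as wide as the range of ax.
    return d2 >= 1 and R2 - L2 == R - L and lcp[L2] < d2 and lcp[R2 + 1] < d2


sa, rank, lcp = build("abcabc$")
print(suffix_link_in_same_class(sa, rank, lcp, 1, 2, 3))  # abc -> bc
```

In `abcabc$` the link from `abc` to `bc` stays inside the class, whereas in `ababbc$` the link from `ab` to `b` leaves it because `b` occurs three times and `ab` only twice.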

Solving difficulty 3 (size computation)
[Figure: three cases (case 1, case 2, case 3) for a suffix array range [L, R]. The label length of the parent is read from lcp(L) and lcp(R + 1), and the size is the sum of the marked label lengths.]

Computational experiment
• Comparison of algorithms
 ▫ suffix tree
 ▫ CDAWG
 ▫ suffix array
• Data
 ▫ two English and two genome corpora (Canterbury corpus, Protein corpus)
• Machine spec.
 ▫ Red Hat Linux
 ▫ CPU 2.8 GHz, 1 GB memory

Experimental result

data name          size (MB)  data structure  construction (sec)  enumeration (sec)  total (sec)  memory (MB)
cantrby/plrabn12   0.47       suffix tree      0.95               0.21                1.16         21.446
                              CDAWG            0.97               0.18                1.15          9.278
                              suffix array     0.43               0.14                0.57          5.392
ProteinCorpus/sc   2.8        suffix tree     12.08               1.43               13.51        121.887
                              CDAWG           12.76               1.12               13.88         69.648
                              suffix array     3.08               0.63                3.71         33.192
large/bible.txt    3.9        suffix tree      7.33               2.23                9.56        191.869
                              CDAWG            6.68               1.62                8.30         56.255
                              suffix array     4.71               1.50                6.21         46.319
large/E.coli       4.5        suffix tree      8.17               2.91               11.08        232.467
                              CDAWG            8.58               2.31               10.89        139.802
                              suffix array     5.95               1.46                7.41         53.086

Application: spam detection
The equivalence classes formed by spam are larger than those formed by non-spam.
[Figure: plot with axes "the number of the equivalence class" and "the size of the equivalence class"]
[Photo: Japanese "sushi" made with Spam. This Spam is not related to this study.]

Application: spam detection
“Unsupervised Spam Detection based on String Alienness Measures” by Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano and Masayuki Takeda
If you are interested in our study and want to come to the conference, search for “Discovery Science 2007”, not “DS 07”.

Summary
• Presented an algorithm for computing the equivalence classes with a suffix array
 ▫ simulating traversal on suffix tree + suffix links
 ▫ using only lcp and rank arrays
 ▫ running in linear time and space
• Compared with other data structures
 ▫ less memory
 ▫ faster computation
• Can be applied to spam detection [DS ’07]

Thank You

Compute size of the EC
The size of an EC is the sum of the lengths of the labels from the parent to each node of the EC.
[Figure: suffix tree with the nodes of one EC highlighted; 1 + 3 = 4]
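The pieces can be put together into a small end-to-end sketch: enumerate the internal nodes from the lcp array, start at each representative, follow simulated suffix links while the node stays in the same EC, and add up the edge label lengths. This is an illustration under assumptions, not the authors' implementation. It uses a naive (non-linear) suffix array construction, 0-based indices, a unique smallest terminator such as `$`, and skips leaves, so the trivial EC is not reported. The name `equivalence_classes` is hypothetical.

```python
def equivalence_classes(w):
    """Map each representative to (size, frequency) for the non-trivial ECs of w."""
    n = len(w)
    sa = sorted(range(n), key=lambda i: w[i:])  # naive suffix array
    rank = [0] * n
    for i, s in enumerate(sa):
        rank[s] = i
    lcp = [0] * (n + 1)  # lcp[0] = lcp[n] = 0 serve as sentinels
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        while w[a + lcp[i]] == w[b + lcp[i]]:  # stops at the unique terminator
            lcp[i] += 1

    # Internal nodes (depth, left, right) except the root, by a stack scan.
    nodes, stack = [], [(0, 0)]
    for i in range(1, n + 1):
        left = i - 1
        while stack[-1][0] > lcp[i]:
            depth, left = stack.pop()
            nodes.append((depth, left, i - 1))
        if stack[-1][0] < lcp[i]:
            stack.append((lcp[i], left))

    def left_maximal(L, R):
        l, r = sa[L], sa[R]
        return (l == 0 or r == 0 or w[l - 1] != w[r - 1]
                or rank[r - 1] - rank[l - 1] != R - L)

    result = {}
    for depth, L, R in nodes:
        if not left_maximal(L, R):
            continue  # reached later from its representative
        rep = w[sa[L]:sa[L] + depth]
        size, d = 0, depth
        while True:
            size += d - max(lcp[L], lcp[R + 1])  # length of the edge from the parent
            L2, R2, d2 = rank[sa[L] + 1], rank[sa[R] + 1], d - 1
            if not (d2 >= 1 and R2 - L2 == R - L
                    and lcp[L2] < d2 and lcp[R2 + 1] < d2):
                break  # the suffix link leaves the EC
            L, R, d = L2, R2, d2
        result[rep] = (size, R - L + 1)
    return result


print(equivalence_classes("abcabc$"))
```

For `abcabc$` the only non-trivial class has representative `abc`; its nodes `abc`, `bc` and `c` contribute edge lengths 3, 2 and 1, matching the six members a, b, c, ab, bc and abc.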

Compute minimal strings of the EC
[Figure: nodes x, x1, x2, ..., xk; y, y1, ..., yk; and z, z1, z2, ..., zm]

Suffix tree
• each node has:
 ▫ parent
 ▫ leftmost child
 ▫ right sibling
 ▫ suffix link
 ▫ label of the incoming edge