A Subquadratic Sequence Alignment Algorithm Global alignment Alignment

A Sub-quadratic Sequence Alignment Algorithm

Global alignment Alignment graph for S = aacgacga, T = ctacgaga 0 c 1 t 2 a 3 c 4 g 5 a 6 g 7 a 8 a 1 a 2 c 3 g 4 a 5 c 6 g 7 a 8 V(i, j) = max { V(i-1, j-1) + (S[i], T[j]), V(i-1, j) + (S[i], -), V(i, j-1) + (-, T[j]) } Complexity: O(n 2)

FOUR RUSSIAN ALGORITHM

UNRESTRICTED SCORING FUNCTION

Main idea: Compress the sequences LZ-78: Divide the sequence into distinct words • S = aacgacga 1 2 3 4 a ac g acg Trie • T = ctacgaga a 1 2 3 4 5 c t a cg ag Trie 0 a 1 g 0 c 3 c 1 g 2 g 4 4 The number of distinct words: a t 2 3 g 5 a

Main idea 1 0 a 2 3 ac g 4 acg a 1 c • Compute the alignment score in each block • Propagate the scores between the adjacent blocks 2 t 3 a 3/2 3/4 c 4 g Trie for S a 5 g 5/2 g 5/4 c a 1 0 Trie for T g a g 3 c 3 2 5 4 g a t 2 1 4 a a g 0 a c a a c g a a g c g

Main idea • Compress the sequence into words • Pre-compute the score for each block • Do alignment between blocks • Note: – Replace normal characters by words – Operate on blocks

LZ-78 COMPRESS THE SEQUENCE

LZ-78: Divide the sequence into distinct words • S = aacgacga 1 2 3 4 a ac g acg Trie • T = ctacgaga a 1 2 3 4 5 c t a cg ag Trie 0 a 1 g 0 c 3 c 1 g 2 g 4 4 The number of distinct words: a t 2 3 g 5 a

LZ-78 • Theorem (Lempel and Ziv): – Constant alphabet sequence S – The maximal number of distinct phrases in S is O(n/log n). • Tighter upper bound: O(hn/log n) – h is the entropy factor – a real number, 0 < h 1 – Entropy is small sequence is repetitive

COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK

Compute the alignment score in each block 3 2 1 0 a ac g 4 acg a • 1 c 2 t 3 a 3/2 3/4 5/2 5/4 c 4 g 5 a g a a a g a c a a c g a a g c g

• Given – Input border: I – Block • Compute – Output border: O 2 a I g a 3 c 4 g 5 5 1 4 0 3 0 O 1 G 2

Matrices • I[i] : is the input border value • DIST[i, j] : weight of the optimal path – From entry i of the input border – To entry j of its output border • OUT[i, j] : merges the information from input row I and DIST 2 – OUT[i, j]=I[i] + DIST[i, j] a • O[j] = max{OUT[i, j] for i=1. . n} I g a 3 c 4 1 g 5 5 4 0 0 O 1 G 2 3

DIST and OUT matrix example Block – sub-sequences “acg”, “ag” 2 a g I a I (input borders) 4 g 0 3 1 0 G 2 OUT matrix 1 2 3 4 5 I 0=1 1 0 -1 -2 - - △ I 1=2 1 1 0 1 -1 -3 I 2=3 1 3 3 4 2 0 0 -2 -2 I 3=2 -12 0 0 -2 0 -1 -1 I 4=1 -13 -1 1 0 0 △ -2 -1 0 I 5=3 -14 -14 1 2 3 4 5 I 0 0 -1 -2 -3 △ △ I 1 -1 -1 -2 -1 -3 I 2 -2 0 0 1 I 3 △ -2 -2 I 4 △ △ I 5 △ △ 5 4 0 0 5 1 O DIST matrix c 3 max col O 0 O 1 O 2 O 3 O 4 O 5 1 3 3 4 2 3

• For each block, given two sub-sequence S 1, S 2 • Compute (from scratch) DIST in (n*m) time • Given I and DIST, compute OUT in (n*m) time • Given OUT[i, j], Compute O in (m*n) time

Revise 1 0 a 2 3 ac g 4 acg 1 c 2 t 3 a c 4 g 5 a g a 4/4 5/3 5/4 a • Compress the sequence • Pre-compute DIST[i, j] for each block • Compute border values of each blocks • Remaining questions – How to compute DIST[i, j] efficiently? – How to compute O[j] from I[i] and DIST[i, j] efficiently?

COMPUTE O[J] EFFICIENTLY

Compute O[j] efficiently • For each block of two sub-sequences S 1, S 2 • Given – I[i] – DIST[i, j] • Compute – O[j]

DIST and OUT matrix example Block – sub-sequences “acg”, “ag” 2 a g I a I (input borders) 4 g 0 3 1 0 G 2 OUT matrix 1 2 3 4 5 I 0=1 1 0 -1 -2 - - △ I 1=2 1 1 0 1 -1 -3 I 2=3 1 3 3 4 2 0 0 -2 -2 I 3=2 -12 0 0 -2 0 -1 -1 I 4=1 -13 -1 1 0 0 △ -2 -1 0 I 5=3 -14 -14 1 2 3 4 5 I 0 0 -1 -2 -3 △ △ I 1 -1 -1 -2 -1 -3 I 2 -2 0 0 1 I 3 △ -2 -2 I 4 △ △ I 5 △ △ 5 4 0 0 5 1 O DIST matrix c 3 max col O 0 O 1 O 2 O 3 O 4 O 5 1 3 3 4 2 3

Compute O without explicit OUT Block – sub-sequences “acg”, “ag” 2 a g I a 4 g 1 2 3 4 5 I 0 0 -1 -2 -3 △ △ I 0=1 I 1 -1 -1 -2 -1 -3 △ I 1=2 I 2 -2 0 0 1 -1 -3 I 2=3 I 3 △ -2 -2 0 -2 -2 I 3=2 I 4 △ △ -2 0 -1 -1 I 4=1 I 5 △ △ △ -2 -1 0 I 5=3 5 4 0 3 1 0 G 2 I (input borders) 0 5 1 O DIST matrix c 3 SMAWK O 0 O 1 O 2 O 3 O 4 O 5 1 3 3 4 2 3

• Given DIST[i, j], I[i] we can compute O[j] in O(n+m) – Without creating OUT[i, j] • How? Why?

Why? • Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays. • Definition: a matrix M[0…m, 0…n] is totally monotone if either condition 1 or 2 below holds for all a, b=0…m; c, d=0…n: 1. Convex condition: M[a, c] M[b, c] M[a, d] M[b, d] for all a<b and c<d. 2. Concave condition: M[a, c] M[b, c] M[a, d] M[b, d] for all a<b and c<d.

How? • Aggarwal et. al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.

• Why DIST[i, j] is totally monotone? 2 a I g a 3 c 4 g 5 5 1 4 0 3 0 O 1 G 2 a b The concave condition If b-c is better than a-c, then b-d is better than a-d. c d

Other problem • Rectangle problem of DIST 0 1 2 3 4 5 I 0 0 -1 -2 -3 △ △ I 1 -1 -1 -2 -1 -3 △ I 2 -2 0 0 1 -1 -3 I 3 △ -2 -2 0 -2 -2 I 4 △ △ -2 0 -1 -1 I 5 △ △ △ -2 -1 0 • Set upper right corner of OUT to - • Set lower left corner of OUT to -(n+i-1)*k • Preserve the totally monotone property of OUT

COMPUTE DIST[I, J] EFFICIENTLY

Compute DIST[i, j] for block(5/4) 1 0 a 2 3 ac g 4 acg 1 c 2 t 3 a g 3/2 5/4 0 1 a g 3 c g 3 2 5 g 0 1 4 a 5/2 c a 4 3/4 c 4 g a 5 g Trie for T Trie for S a c a a g c c a a g c g t 2

DIST matrix I 0 = 1 0 -1 -2 -3 Δ Δ I 1 = 2 -1 -1 -2 Δ I 2 = 3 -2 0 0 1 -1 -3 I 3 = 2 Δ -2 -2 0 -2 -2 I 4 = 1 Δ Δ -2 0 -1 -1 I 5 = 3 Δ Δ Δ -2 -1 0

• Only column m in DIST[i, j] is new • DIST block can be updated in O(m+n)

MANTAINING DIRECT ACCESS TO DIST TABLE

Trie for S 1 0 a 2 ac 3 g 4 acg 0 a a 3 1 c g 2 t 5 3 a 4 c g 5 a g c Trie for T t a 2 c g 2 g 4 a -3 -1 1 0 0 -2 g 1 1 4 0 3

Trie for S 1 0 2 a ac 3 g 4 0 a acg a 3 1 c g 2 t 5 3 a 4 c g 5 a g Trie for T t c a 2 c g 2 g 4 a 0 -1 -2 -3 -1 -1 -2 -2 0 0 1 -1 -3 -2 -2 0 -2 -2 -2 0 -1 -1 -2 -1 0 g 1 1 4 0 3

Trie for S 1 0 2 a ac 3 g 4 a acg a 3 1 c g 2 t 5 3 a 4 c g 5 0 Trie for T t c 0 a 2 g 1 1 c g 2 4 g 4 DIST a g a 0 -1 -2 -3 -1 -1 -2 -2 0 0 1 -1 -3 -2 -2 0 -2 -2 -2 0 -1 -1 -2 -1 0 3

Complexity • • Assume |S| = |T| = n Number of words in S, T = O(hn/log n) Number of blocks in alignment graph O(h 2 n 2/(log n)2) For each block – Update new DIST block O(t = size of the border) – Create direct access table O(t) • Propagating I/O across blocks – SMAWK O(t) • Sum of the sizes of all borders is O(hn 2/log n) • Total complexity: O(hn 2/log n)

Other extensions • Trace • Reducing the space complexity for discrete scoring • Local alignment

References • Crochemore, M. ; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices ACM-SIAM, 2002, 679 -688 • Some pictures from 葉恆青