Prologo
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa



Pre-history of string processing!
• Collection of strings: documents, books, emails, source code, DNA sequences, ...


An XML excerpt
<dblp>
 <book>
  <author> Donald E. Knuth </author>
  <title> The TeXbook </title>
  <publisher> Addison-Wesley </publisher>
  <year> 1986 </year>
 </book>
 <article>
  <author> Donald E. Knuth </author>
  <author> Ronald W. Moore </author>
  <title> An Analysis of Alpha-Beta Pruning </title>
  <pages> 293-326 </pages>
  <year> 1975 </year>
  <journal> Artificial Intelligence </journal>
 </article>
 ...
</dblp>
size ≈ 100 Mb; #leaves ≥ 7 Mil (for 75 Mb); #internal nodes ≥ 4 Mil (for 25 Mb); depth ≤ 7


The Query-Log graph
Queries like "Dept CS pisa" link to URLs like www.di.unipi.it/index.html, with #clicks, time, country, ... on the edges.
QueryLog (Yahoo! dataset, 2005):
• #links: ≈70 Mil
• #nodes: ≈50 Mil
• Dictionary of URLs: 24 Mil, 56.3 avg chars, 1.6 Gb
• Dictionary of terms: 44 Mil, 7.3 avg chars, 307 Mb
• Dictionary of Infos: 2.6 Gb


In all cases...
• Some structure: relation among items — trees, (hyper-)graphs, ...
• Some data: (meta-)information about the items — labels on nodes and/or edges; large space (I/O, cache, compression, ...)
• Various operations to be supported:
• Given a node u: retrieve its label, Fw(u), Bw(u), ...
• Given an edge (i, j): check its existence, retrieve its label, ...
• Given a string p: search for all nodes/edges whose label includes p; search for adjacent nodes whose label equals p
This calls for an Id ↔ String mapping and an Index over the labels.



Virtually enlarge M [Zobel et al, '07]


Do you use (z)grep? [de Moura et al, '00]
On ≈1 Gb of data (gzip vs. HuffWord compression):
• Grep takes 29 secs (scan the uncompressed data)
• Zgrep takes 33 secs (gunzip the data | grep)
• Cgrep takes 8 secs (scan directly the compressed data)


In our lectures we are interested not only in the storage issue, but also in + Random Access + Search:
Data Compression + Data Structures


Seven years ago... [now, J. ACM '05]
"Opportunistic Data Structures with Applications", P. Ferragina, G. Manzini
Nowadays several papers: theory & experiments (see Navarro-Mäkinen's survey)


Our starting point was... Ken Church (AT&T, 1995) said: "If I compress the Suffix Array with Gzip I do not save anything. But the underlying text is compressible... What's going on?"
Practitioners use many "squeezing heuristics" that compress data and still support fast access to them.
Can we "automate" and "guarantee" the process?


In these lectures... A path consisting of five steps (Muthu's challenge!!):
1) The problem
2) What practitioners do and why they did not use "theory"
3) What theoreticians then did
4) Experiments
5) The moral ;-)
At the end, hopefully, you'll bring home:
✓ Algorithmic tools to compress & index data
✓ Data-aware measures to evaluate them
✓ Algorithmic reductions: theorists and practitioners love them!
✓ No ultimate recipes!!


String Storage
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


A basic problem
Given a dictionary D of strings of variable length, compress them in a way that we can efficiently support the mapping id ↔ string.
• Hash Table: need D to avoid false positives and for id → string
• (Minimal) ordered perfect hashing: need D for id → string, or to check
• (Compacted) Trie: need D for edge match
Yet the dictionary D needs to be stored:
• its space is not negligible
• I/O- or cache-misses in retrieval


Front-coding
Practitioners use the following approach:
• Sort the dictionary strings
• Strip off the shared prefixes [e.g. after host reversal?]
• Introduce some bucketing, to ensure fast random access
Example (uk-2002 crawl, ≈250 Mb). The sorted URLs
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
become pairs (shared-prefix length, remaining suffix): (0, http://checkmate.com/All_Natural/Applied.html), (34, roma1.html), (38, tic_Art.html), (34, yate.html), (35, er_Soap.html), (35, urvedic_Soap.html), (33, Bath_Salt_Bulk.html), (42, s.html), ...
FC achieves ≈35% on this dataset; gzip ≈12%. Be back on this later on!
Do we need bucketing? Experimental tuning.
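The sort-and-strip step above can be sketched in a few lines of Python (a toy model, not the lecture's actual code; the names `fc_encode`/`fc_decode` are mine):

```python
def fc_encode(sorted_strings):
    """Front coding: replace each string by (lcp, suffix), where lcp is the
    length of the prefix shared with the previous string."""
    pairs, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        pairs.append((lcp, s[lcp:]))
        prev = s
    return pairs


def fc_decode(pairs):
    """Invert front coding: rebuild each string from its predecessor.
    Note the sequential dependence -- this is why plain FC needs bucketing
    (or locality-preserving FC) to support fast random access."""
    out, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out
```

On the three leading URLs above, `fc_encode` yields (0, full URL), (34, roma1.html), (38, tic_Art.html), matching the pairs listed in the slide.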


Locality-preserving FC [Bender et al, 2006]
Drop bucketing + optimal string decompression:
• Compress D up to (1+ε) FC(D) bits
• Decompress any string S in O(1 + |S|/ε) time
A simple incremental encoding algorithm [where ε = 2/(c-2)]:
I. Assume to have FC(S1, ..., Si-1)
II. Given Si, we proceed backward for X = c·|Si| chars in FC
Two cases: within those X chars the decoding of Si either completes (Si is FC-coded) or does not (Si is copied verbatim).


Locality-preserving FC [Bender et al, 2006]
A simple incremental encoding algorithm [where ε = 2/(c-2)]:
• Assume to have FC(S1, ..., Si-1)
• Given Si, we proceed backward for X = c·|Si| chars in FC
• If Si can be decoded there, then we add FC(Si); else we add Si verbatim (a copied string)
Decoding is unaffected!!
Space occupancy (sketch):
• FC-encoded strings are OK!
• Partition the copied strings into crowded and uncrowded
• Let Si be crowded, and Z its preceding copied string: |Z| ≥ X/2 ≥ (c/2)·|Si|, i.e. |Si| ≤ (2/c)·|Z|
• Hence the lengths of crowded strings decrease geometrically!!
• Consider chains of copied strings: |uncrowd*| ≤ (c/(c-2))·|uncrowd|
• Charge the chain cost to the X/2 = (c/2)·|uncrowd| chars preceding uncrowd (i.e. FC-chars)


Random access to LPFC
We call C the LPFC-string, n = #strings in C, m = total length of C.
How do we random-access the compressed C?
• Get(i): return the position of the i-th string in C (id → string)
• Previous(j), Next(j): return the position of the string preceding or following character C[j]
Classical answers ;-)
• Pointers to the positions of copied strings in C: space O(n log m) bits, access time O(1) + O(|S|/ε)
• Some form of bucketing: space O((n/b) log m) bits, access time O(b) + O(|S|/ε) — a trade-off
Goal: no trade-off!


Re-phrasing our problem
C is the LPFC-string, n = #strings in C, m = total length of C. Support the following operations on C:
• Get(i): return the position of the i-th string in C
• Previous(j), Next(j): return the position of the string preceding/following C[j]
Proper integer encodings:
C = 0 http://checkmate.com/All_Natural/ 33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html ...
B = 1 00000000000000 10 0000 10 000000000 ... [see Moffat '07]
• Rank1(x) = number of 1s in B[1, x], e.g. Rank1(36) = 2
• Select1(y) = position of the y-th 1 in B, e.g. Select1(4) = 51
Looking at B as a pointerless data structure:
• Get(i) = Select1(i)
• Previous(j) = Select1(Rank1(j) - 1)
• Next(j) = Select1(Rank1(j) + 1)
Uniquely-decodable int-codes for the copy-lengths:
• γ-code(6) = 00 110, i.e. 2⌊log x⌋ + 1 bits
• δ-code(33) = γ(6) 00001, i.e. log x + 2 loglog x + O(1) bits
• No recursion in practice; |γ(x)| > |δ(x)| for x > 31
• Or Huffman on the lengths
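The two Elias codes mentioned above can be sketched as follows (a minimal, encoder-only illustration; the function names are mine):

```python
def gamma(x):
    """Elias gamma code of x >= 1: |bin(x)|-1 zeros, then bin(x).
    Length: 2*floor(log2 x) + 1 bits."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b


def delta(x):
    """Elias delta code of x >= 1: gamma(|bin(x)|), then bin(x) without
    its leading 1.  Length: log x + 2 loglog x + O(1) bits."""
    b = bin(x)[2:]
    return gamma(len(b)) + b[1:]
```

Indeed gamma(6) = "00110" as in the slide, and len(gamma(x)) exceeds len(delta(x)) exactly from x = 32 on, matching the "x > 31" remark.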


A basic problem! [Jacobson, '89]
B = 00101010101111111000001101010111000..., m = |B|, n = #1s; e.g. (in the figure) Select1(3) = 8 and Rank1(7) = 4.
• Rankb(i) = number of b's in B[1, i]
• Selectb(i) = position of the i-th b in B
Considering b = 1 is enough:
• Rank0(i) = i - Rank1(i)
• Select0 reduces to Rank1 and Select1 over two binary arrays B0 and B1 derived from B (e.g. from B = 0100001110010011111110), with |B0| = m - n, |B1| = n, so |B0| + |B1| = m


A basic problem! [Jacobson, '89]
B = 00101010101111111000001101010111000..., m = |B|, n = #1s.
• Rank1(i) = number of 1s in B[1, i]
• Select1(i) = position of the i-th 1 in B
Given an integer set, take B as its characteristic vector; then pred(x) = Select1(Rank1(x - 1)).
Lower bounds can be inherited [Patrascu-Thorup, '06].


The Bit-Vector Index (m = |B|, n = #1s)
Goal. B is read-only, and the additional index takes o(m) bits.
B = 00101010101111111000001101010111000...
Store absolute Rank1 values at the boundaries of superblocks of Z bits, and bucket-relative Rank1 values at the boundaries of blocks of z bits; a small precomputed table maps every possible z-bit block and position inside it to a 1s-count.
Setting Z = poly(log m) and z = (1/2) log m:
• Space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits
• Rank time is O(1)
• The o(m) term is crucial in practice
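The two-level directory can be mimicked in Python (toy block sizes, and a plain scan where the real structure uses the precomputed table; `RankDir` is a name of my choosing):

```python
class RankDir:
    """Two-level rank directory: absolute 1-counts per superblock of Z bits,
    superblock-relative counts per block of z bits.  With z = (1/2)log m and
    Z = poly(log m) the counters take o(m) bits; the final sub-block scan is
    what the O(1)-time construction replaces with a table lookup."""

    def __init__(self, bits, z=8, Z=64):
        assert Z % z == 0
        self.bits, self.z, self.Z = bits, z, Z
        self.sup, self.blk = [], []
        tot = rel = 0
        for i, b in enumerate(bits):
            if i % Z == 0:
                self.sup.append(tot)   # 1s before this superblock
                rel = 0
            if i % z == 0:
                self.blk.append(rel)   # 1s since the superblock start
            tot += b
            rel += b

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        bl = min(i // self.z, len(self.blk) - 1)
        sb = (bl * self.z) // self.Z
        return self.sup[sb] + self.blk[bl] + sum(self.bits[bl * self.z : i])
```

The counters take (m/Z)·log m + (m/z)·log Z bits, matching the slide's accounting.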


The Bit-Vector Index (m = |B|, n = #1s)
For Select, cut B into segments of size r containing k consecutive 1s each:
• Sparse case: if r > k², store explicitly the positions of the k 1s
• Dense case: k ≤ r ≤ k², recurse... one level is enough!! ...still needing a table of size o(m)
Setting k ≈ polylog m:
• Space is m + o(m), and B is not touched!
• Select time is O(1)
There exists a Bit-Vector Index taking |B| + o(|B|) bits and constant time for Rank/Select, with B read-only!
Hence LPFC + Rank/Select takes [1 + o(1)] extra bits per FC-char.


Compressed String Storage
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


FC versus Gzip
Gzip emits backward copies ⟨distance, length, next char⟩ — e.g. ⟨6, 3, c⟩ in a a c b a b c b a c — and its dictionary is all substrings starting in the window. Two features:
• Repetitiveness is deployed at any position
• The window is used for (practical) computational reasons
On the previous dataset of URLs (i.e. uk-2002):
• FC achieves >30%
• Gzip achieves 12%
• PPM achieves 7%
But there is no random access to substrings. May we combine the best of the two worlds?


The empirical entropy H0
H0(S) = ∑i (mi/m) log2 (m/mi), where mi is the frequency in S of the i-th symbol and m = |S|.
• m·H0(S) is the best you can hope for from a memoryless compressor
• We know that Huffman or Arithmetic coding come close to this bound
H0 cannot distinguish between AxBy and a random string with x A's and y B's.
We get better compression using a codeword that depends on the k symbols preceding the one to be compressed (its context).
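The definition turns into a one-liner (a sketch; `H0` is my name for it):

```python
from collections import Counter
from math import log2


def H0(S):
    """Empirical zeroth-order entropy: sum_i (m_i/m) * log2(m/m_i),
    with m_i the frequency of the i-th symbol and m = |S|."""
    m = len(S)
    return sum(mi / m * log2(m / mi) for mi in Counter(S).values())
```

H0("aaaa") = 0, a balanced binary string costs 1 bit/symbol, and H0("mississippi") ≈ 1.823; multiplying by m gives the memoryless lower bound.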


The empirical entropy Hk
Hk(S) = (1/|S|) ∑|w|=k |S[w]| · H0(S[w]), where S[w] = the string of symbols that follow the occurrences of the substring w in S.
Example: given S = "mississippi", we have S["is"] = ss. ("Follow" ≈ "precede", up to symmetry.)
✓ To compress S up to Hk(S): use Huffman or Arithmetic coding to compress each S[w] up to its H0.
How much is this "operational"?
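Hk is computed by grouping symbols per context — a direct transcription of the formula (self-contained, so H0 is repeated):

```python
from collections import Counter, defaultdict
from math import log2


def H0(S):
    m = len(S)
    return sum(mi / m * log2(m / mi) for mi in Counter(S).values())


def Hk(S, k):
    """k-th order empirical entropy: (1/|S|) * sum_w |S[w]| * H0(S[w]),
    where S[w] collects the symbols following each occurrence of w in S."""
    ctx = defaultdict(list)
    for i in range(len(S) - k):
        ctx[S[i : i + k]].append(S[i + k])
    return sum(len(sw) * H0(sw) for sw in ctx.values()) / len(S)
```

For S = "mississippi", Hk(S, 0) equals H0(S) ≈ 1.823 while Hk(S, 1) ≈ 0.796: longer contexts expose more structure.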


Entropy-bounded string storage [Ferragina-Venturini, '07]
Goal. Given a string S[1, m] drawn from an alphabet of size σ:
• encode S within m·Hk(S) + o(m log σ) bits, for all k ≤ α·logσ m
• extract any substring of L symbols in optimal Θ(L / logσ m) time
This encoding fully replaces S in the RAM model!
Two corollaries:
• Compressed Rank/Select data structures: B was read-only in the simplest R/S scheme; we get |B|·Hk(B) + o(|B|) bits and R/S in O(1) time
• Compressed front-coding + random access
Promising: FC + gzip saves 16% over gzip on uk-2002.


The storage scheme
• Blocks of b = ½ logσ m symbols: #blocks = m/b = O(m / logσ m)
• #distinct blocks = O(σ^b) = O(m^½)
T = the distinct blocks, sorted by decreasing frequency; each gets the next codeword (cw) in {ε, 0, 1, 00, 01, 10, 11, 000, ...}; V = the concatenation of the cw of S's blocks; B marks the cw beginnings in V.
Decoding is easy:
• R/S on B to determine the cw position in V
• Retrieve the cw from V
• The decoded block is T[2^len(cw) + cw]
|B| ≤ |S| log σ, and #1s in B = #blocks = o(|S|): T + V + B take |V| + o(|S| log σ) bits.


Bounding |V| in terms of Hk(S)
Introduce the statistical encoder Ek(S):
• Compute F(i) = the frequency of S[i] within its k-th order context S[i-k, i-1]
• Encode every block B[1, b] of S as follows: 1) write B[1, k] explicitly; 2) encode B[k+1, b] by Arithmetic coding using the k-th order frequencies
Some algebra: |Ek(S)| ≤ (m/b)·(k log σ) + m·Hk(S) + 2(m/b) bits.
Ek(S) is worse than our encoding V:
• Ek assigns unique cw to blocks
• Those cw are a subset of {0,1}*, whereas our cw are the shortest of {0,1}*
Golden rule of data compression: |V| ≤ |Ek(S)| ≤ |S|·Hk(S) + o(|S| log σ) bits.


Part #2: Take-home msg
Given a binary string B, we can (with a pointerless data structure):
• store B in |B|·Hk(B) + o(|B|) bits
• support Rank & Select in constant time
• access any substring of B in optimal time
Given a string S on alphabet Σ, we can (always better than storing S plainly in RAM):
• store S in |S|·Hk(S) + o(|S| log |Σ|) bits, where k ≤ α·log|Σ| |S|
• access any substring of S in optimal time
Experimentally: ≈10^7 Select/sec, ≈10^6 Rank/sec.


(Compressed) String Indexing
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


What do we mean by "indexing"?
• Word-based indexes, where a notion of "word" must be devised: inverted files, signature files, bitmaps
• Full-text indexes, with no constraint on text and queries: suffix array, suffix tree, String B-tree, ...


The Problem
Given a text T, we wish to devise a (compressed) representation for T that efficiently supports the following operations:
✓ Count(P): how many times does string P occur in T as a substring?
✓ Locate(P): list the positions of the occurrences of P in T
✓ Visualize(i, j): print T[i, j]
Time-efficient solutions, but not compressed: suffix arrays, suffix trees, ...many others...
Space-efficient solutions, but not time-efficient: ZGrep (uncompress and then grep), CGrep and NGrep (pattern matching over compressed text).


The Suffix Array
Prop 1. All suffixes of T having prefix P are contiguous in the suffix array.
Prop 2. Their starting row is the lexicographic position of P.
T = mississippi#, P = si:
SA = 12 11 8 5 2 1 10 9 7 4 6 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Storing the suffixes explicitly would take Θ(N²) space; the suffix array stores only N suffix pointers:
• SA: Θ(N log2 N) bits
• text T: N chars
In practice, a total of 5N bytes.


Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
T = mississippi#, P = si: compare P against the suffix pointed to by the middle SA entry; here P is larger, so recurse on the right half.


Searching a pattern
Indirect binary search on SA, continued: now P is smaller, so recurse on the left half.
Suffix array search:
• O(log2 N) binary-search steps
• each step takes O(p) char comparisons
Overall, O(p log2 N) time; improvable to O(p + log2 N) using LCP information [Manber-Myers, '90; Cole et al, '06].


Listing of the occurrences
T = mississippi#, P = si, occ = 2.
Search for P# = si# and P$ = si$, where # < every symbol < $: the two searches delimit the SA range of the suffixes prefixed by P (here sippi# and sissippi#, i.e. positions 7 and 4).
Suffix array search: listing then takes O(occ) time.
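The whole sequence — build SA, binary-search the range, list the occurrences — fits in a short sketch (naive quadratic construction, fine for a demo; linear-time suffix sorters exist):

```python
def suffix_array(T):
    """Sort the suffix starting positions lexicographically
    (naive O(N^2 log N) demo construction)."""
    return sorted(range(len(T)), key=lambda i: T[i:])


def sa_search(T, SA, P):
    """Return (first_row, occ): by Prop 1 the suffixes prefixed by P form a
    contiguous SA range, found with two binary searches in O(p log N) time."""
    p = len(P)

    def boundary(after):
        # leftmost row whose suffix-prefix is >= P (or > P when after=True)
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid] : SA[mid] + p]
            if pref < P or (after and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo

    first = boundary(False)
    return first, boundary(True) - first
```

On T = mississippi# the SA matches the slide's (1-based) one; P = si occupies rows 8-9 (0-based), i.e. text positions 6 and 3, and listing them is a plain O(occ) scan of that range.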


Text mining
Lcp[1, N-1] stores the LCP length between suffixes adjacent in SA.
T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3, Lcp = 0 1 1 4 0 0 1 0 2 1 3.
• Does there exist a repeated substring of length ≥ L? Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times? Search for C-1 consecutive Lcp entries, all ≥ L.
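Both mining queries reduce to scans of the Lcp array; a sketch (naive Lcp computation — Kasai's algorithm does it in O(N)):

```python
def lcp_array(T, SA):
    """Lcp[i] = longest-common-prefix length of the suffixes at SA[i], SA[i+1]."""
    lcp = []
    for a, b in zip(SA, SA[1:]):
        l = 0
        while a + l < len(T) and b + l < len(T) and T[a + l] == T[b + l]:
            l += 1
        lcp.append(l)
    return lcp


def has_repeat(T, SA, L, C=2):
    """Is there a substring of length >= L occurring >= C (>= 2) times?
    Equivalent to C-1 consecutive Lcp entries all being >= L."""
    lcp = lcp_array(T, SA)
    w = C - 1
    return any(min(lcp[i : i + w]) >= L for i in range(len(lcp) - w + 1))
```

On mississippi# the longest repeat is issi (length 4), so has_repeat(..., 4) holds and has_repeat(..., 5) does not.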


What about space occupancy?
SA + T take Θ(N log2 N) bits. Do we need such an amount?
1) #permutations of {1, 2, ..., N} = N!
2) SA cannot be an arbitrary permutation of {1, ..., N}:
3) #SA ≤ #texts = |Σ|^N
LB from #texts: Ω(N log |Σ|) bits; LB from compression: Ω(N·Hk(T)) bits.
Both are very far from N log2 N.


An elegant mathematical tool
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


The Burrows-Wheeler Transform (1994)
Take the text T = mississippi# and form all its cyclic rotations:
mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, ..., ppi#mississi, pi#mississip, i#mississipp
Sort the rows; F is the first column, L the last:
#mississippi  →  i
i#mississipp  →  p
ippi#mississ  →  s
issippi#miss  →  s
ississippi#m  →  m
mississippi#  →  #
pi#mississip  →  p
ppi#mississi  →  i
sippi#missis  →  s
sissippi#mis  →  s
ssippi#missi  →  i
ssissippi#mi  →  i
L = ipssm#pissii is a permutation of T.
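In code, the transform is exactly "sort the rotations, keep the last column" (a quadratic demo; real implementations go through the suffix array):

```python
def bwt(T):
    """Burrows-Wheeler Transform of T, which must end with a unique,
    smallest sentinel (here '#'): sort all cyclic rotations and
    concatenate their last characters."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)
```

bwt("mississippi#") returns "ipssm#pissii", the column L of the slide.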


A famous example.


A useful tool: the L → F mapping
How do we map L's chars onto F's chars? We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one: the rotated rows start with that char and are still sorted, hence equal chars keep the same relative order in L and in F!


The BWT is invertible
Two key properties:
1. The LF-mapping sends L's chars to F's chars (equal chars preserve their relative order)
2. L[i] precedes F[i] in T
Reconstruct T backward (T = ...ippi#):
InvertBWT(L):
  Compute LF[0, n-1];
  r = 0; i = n;
  while (i > 0) { T[i] = L[r]; r = LF[r]; i--; }
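The InvertBWT pseudocode above translates directly (0-based; LF is built by counting, exploiting the two properties):

```python
def ibwt(L, sentinel="#"):
    """Invert the BWT: LF[i] = row of F holding the char L[i] (equal chars
    keep their relative order), and L[i] precedes F[i] in T, so T can be
    rebuilt backward starting from row 0, the rotation beginning with the
    sentinel."""
    seen = {}
    rank = []                       # occurrences of L[i] among equal chars so far
    for c in L:
        rank.append(seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    first, tot = {}, 0              # first F-row of every char
    for c in sorted(seen):
        first[c] = tot
        tot += seen[c]
    LF = [first[c] + rank[i] for i, c in enumerate(L)]
    out, r = [], 0
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    s = "".join(reversed(out))      # a rotation starting with the sentinel
    return s[1:] + s[0]             # rotate the sentinel back to the end
```

Indeed ibwt("ipssm#pissii") recovers mississippi#.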


How to compute the BWT?
We said that L[i] precedes F[i] in T; e.g. L[3] = T[7] = T[SA[3] - 1].
Given SA, we have L[i] = T[SA[i] - 1]. (This is the role of #: it makes the rotations sort like the suffixes.)
Building the BWT by sorting the rotations directly is elegant but inefficient. Obvious inefficiencies:
• O(n³) time in the worst case (naive comparisons)
• O(n²) cache misses or I/O faults


Compressing L seems promising...
Key observation: L is locally homogeneous, hence highly compressible.
Algorithm Bzip: Move-to-Front coding of L → Run-Length coding → statistical coder.
Bzip vs. Gzip: 20% vs. 33% compression, but Bzip is slower in (de)compression!


An encoding example
T = mississippimississippi...#
L = ipppssssssmmmii#pppiiissssssiiiiii
Mtf = 020030000030030200300300000100000
Shift every value up by one to reserve 0 for run-lengths (the alphabet becomes |Σ|+1):
Mtf' = 030040000040040300400400000200000
RLE0 (Wheeler's code, run-lengths of 0s written in binary, e.g. Bin(6) = 110):
RLE0 = 02131031302131310110
...then Arithmetic/Huffman over the |Σ|+1 symbols.
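Move-to-Front, the first stage of the pipeline, can be sketched as follows (RLE and the statistical coder omitted):

```python
def mtf_encode(S, alphabet):
    """Move-to-Front: emit the current list position of each symbol, then
    move it to the front; runs of equal symbols become runs of 0s, which
    the subsequent RLE stage squeezes."""
    lst, out = list(alphabet), []
    for c in S:
        i = lst.index(c)
        out.append(i)
        lst.insert(0, lst.pop(i))
    return out


def mtf_decode(codes, alphabet):
    """Invert MTF by replaying the same list updates."""
    lst, out = list(alphabet), []
    for i in codes:
        lst.insert(0, lst.pop(i))
        out.append(lst[0])
    return "".join(out)
```

On L = "ipssm#pissii" (the BWT of mississippi#) the output contains a 0 exactly where L repeats its previous character — the local homogeneity the statistical coder exploits.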


Why it works...
Key observation: L is locally homogeneous, hence highly compressible.
Each piece of L corresponds to a context; compressing every piece up to its H0 achieves Hk(T).
MTF + RLE avoid the need to explicitly partition the BWT into contexts.


Back to indexing: BWT ≈ SA
L implicitly contains both SA and T (recall L[i] = T[SA[i] - 1]).
Can we search directly within L?


Implement the LF-mapping [Ferragina-Manzini]
Represent F by the array C[1, |Σ|], where C[c] = first row of the block of c's in F:
C[#] = 1, C[i] = 2, C[m] = 6, C[p] = 7, C[s] = 9
The oracle: Rank(c, i) = number of occurrences of c in L[1, i].
E.g. Rank(s, 9) = 3 maps L[9] to F[11] = F[C[s] + 3 - 1].
So LF[i] = C[L[i]] + Rank(L[i], i) - 1: we need generalized Rank & Select.


Rank and Select on strings
• If Σ is small (i.e. constant): build a binary Rank data structure per symbol of Σ; Rank takes O(1) time in entropy-bounded space
• If Σ is large (words?): we need a smarter solution, the Wavelet Tree [Grossi-Gupta-Vitter, '03]
Another step of reduction: Rank & Select over arbitrary strings reduce to Rank & Select over binary strings.
Binary R/S are the key tools (tons of papers).


Substring search in T (count the pattern occurrences)
P = si. Process P backward, keeping the range [fr, lr] of rows prefixed by the current suffix of P; occ = lr - fr + 1.
First step: the rows prefixed by the last char "i" are read off the available table C (# → 1, i → 2, m → 6, p → 7, s → 9).
Inductive step: given fr, lr for P[j+1, p]:
• take c = P[j]
• find the first and the last occurrence of c in L[fr, lr]
• LF-map these two chars into F: that is the range for P[j, p]
Rank is enough.
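The inductive step above is precisely FM-index backward search; a compact sketch with a naive Rank (constant-time in the real index):

```python
from collections import Counter


def backward_search(L, P):
    """Count(P) on the BWT string L: scan P right to left, maintaining the
    half-open row range [fr, lr) of rotations prefixed by the current
    suffix of P, via the C array and Rank queries on L."""
    counts = Counter(L)
    C, tot = {}, 0                       # C[c] = #chars in L smaller than c
    for c in sorted(counts):
        C[c] = tot
        tot += counts[c]

    def rank(c, i):                      # occurrences of c in L[0:i] (naive)
        return L[:i].count(c)

    fr, lr = 0, len(L)
    for c in reversed(P):
        if c not in C:
            return 0
        fr = C[c] + rank(c, fr)
        lr = C[c] + rank(c, lr)
        if fr >= lr:
            return 0
    return lr - fr
```

With L = "ipssm#pissii" (the BWT of mississippi#): si and ssi both occur twice, the whole text once — O(p) Rank rounds, no access to T.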


The FM-index [Ferragina-Manzini, FOCS '00; JACM '05]
The result (on small alphabets):
✓ Count(P): O(p) time
✓ Locate(P): O(occ · log^(1+ε) N) time
✓ Visualize(i, i+L): O(L + log^(1+ε) N) time
✓ Space occupancy: O(N·Hk(T)) + o(N) bits — o(N) if T is compressible
The index does not depend on k; the bound holds for all k simultaneously.
New concept: the FM-index is an opportunistic data structure.
The survey by Navarro-Mäkinen describes many compressed-index variants.


Is this a technological breakthrough? [December 2003] [January 2005]


The question then was...
How to turn these challenging and mature theoretical achievements into a technological breakthrough?
• Engineered implementations
• A flexible API to allow reuse and development
• A framework for extensive testing


Joint effort of Navarro's group
• We engineered the best known indexes: FMI, CSA, SSA, AF-FMI, RL-FM, LZ, ...
• All implemented indexes follow a carefully designed API which offers: build, count, locate, extract, ...
• A group of variegated tools has been designed to plan, execute and check the index performance over the text collections automatically
• Some texts are available; their sizes range from 50 Mb to 2 Gb
>400 downloads, >50 registered users


Some figures, over hundreds of MBs of data:
• Count(P) takes 5 msecs/char, ≈42% space
• Extract takes 20 msecs/char (10 times slower!)
• Locate(P) takes 50 msecs/occ, +10% space (50 times slower!)
A trade-off is possible!!!


We need your applications...


Part #5: Take-home msg...
This is a powerful paradigm to design compressed indexes:
1. Transform the input into a few arrays
2. Index (+ compress) the arrays to support rank/select ops
Data type: Indexing → Compressed Indexing. Next issues: compression and I/Os; compression and query distribution/flow.
Other data types: labeled trees, 2D.


(Compressed) Tree Indexing
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


Where we are...
A data structure is "opportunistic" if it indexes a text T within compressed space and supports three kinds of queries:
✓ Count(P): count the occurrences of P in T
✓ Locate(P): list the occurrences of P in T
✓ Display(i, j): print T[i, j]
Key tools: Burrows-Wheeler Transform + Suffix Array.
Key idea: reduce P's queries to a few rank/select queries on BWT(T).
Space complexity: a function of the k-th order empirical entropy of T.


Another data format: XML [W3C '98]
<dblp>
 <book>
  <author> Donald E. Knuth </author>
  <title> The TeXbook </title>
  <publisher> Addison-Wesley </publisher>
  <year> 1986 </year>
 </book>
 <article>
  <author> Donald E. Knuth </author>
  <author> Ronald W. Moore </author>
  <title> An Analysis of Alpha-Beta Pruning </title>
  <pages> 293-326 </pages>
  <year> 1975 </year>
  <volume> 6 </volume>
  <journal> Artificial Intelligence </journal>
 </article>
 ...
</dblp>


A tree interpretation...
• XML document exploration → tree navigation
• XML document search → labeled subpath searches (a subset of XPath [W3C])


A key concern: verbosity... (IEEE Computer, April 2005)


The problem, in practice...
We wish to devise a (compressed) representation for T that efficiently supports the following operations:
✓ Navigational operations: parent(u), child(u, i, c)
✓ Subpath searches over a sequence of k labels
✓ Content searches: subpath search + substring
• XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...) need the whole decompression for navigation and search
• XML-queriable compressors (like XPress, XGrind, XQzip, ...) achieve poor compression and need a scan of the whole (compressed) file
• XML-native search engines need this tool as a core block for query optimization and (compressed) storage of information
Theory?


A transform for labeled trees [Ferragina et al, 2005]
XBW-transform on trees ≈ BW-transform on strings.
The XBW-transform linearizes T into 2 arrays such that:
• the compression of T reduces to the compression of these two arrays (e.g. gzip, bzip2, ppm, ...)
• the indexing of T reduces to implementing generalized rank/select over these two arrays
Rank & Select are again crucial.


The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label (into Sα) and the labels on its upward path (into Sπ; ε for the root — e.g. a node labeled c below a node B below the root C gets upward path BC).
The result is a permutation of the tree nodes.


The XBW-Transform
Step 2. Stably sort the pairs ⟨Sα, Sπ⟩ according to the upward labeled paths Sπ (ε, AC, AC, AC, BC, BC, C, ... in the running example).


The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children.
Key fact: nodes correspond to items in ⟨Slast, Sα⟩.
• XBW can be built and inverted in optimal O(t) time
• XBW takes the optimal t log |Σ| + t bits
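The three steps can be prototyped in a few lines (nodes as (label, children) pairs are my own toy representation; the real construction uses a counting sort of the paths, and string comparison of paths matches the intended order only for single-char labels):

```python
def xbw(root):
    """XBW-transform sketch.  Pre-order visit emitting, per node, its label,
    its upward labeled path and a last-child flag; then stably sort by the
    upward paths (Python's sort is stable).  Returns (S_last, S_alpha, S_pi)."""
    rows = []                                  # (upward path, label, is_last)

    def visit(node, path, is_last):
        label, children = node
        rows.append((path, label, is_last))
        for i, ch in enumerate(children):
            visit(ch, label + path, i == len(children) - 1)

    visit(root, "", True)
    rows.sort(key=lambda r: r[0])              # stable sort by upward path
    S_last = [1 if last else 0 for _, _, last in rows]
    S_alpha = [label for _, label, _ in rows]
    S_pi = [path for path, _, _ in rows]
    return S_last, S_alpha, S_pi
```

On a toy tree C(B(a, c), D(b)) the children of every node end up contiguous, delimited by the 1s of S_last — the structural property the next slides exploit.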


XBW is highly compressible [Figure: the XBW of a DBLP excerpt; e.g. rows with Sp = /author/article/dblp carry Sa values such as Donald Knuth and Kurt Mehlhorn, rows with Sp = /year/book/dblp carry values such as 1977.] XBW is compressible:
- Sa is locally homogeneous
- Slast has some structure and is small

Theoretically, we could extend the definition of Hk to labeled trees by taking as k-context of a node its leading path of length k (related to Markov random fields over trees).


XBzip – a simple XML compressor Tags, attributes and Pcdata are turned into the XBW arrays. XBW is compressible:
- compress Sa with PPM
- Slast is small. . .


XBzip = XBW + PPM [Ferragina et al, 2006] String compressors are not so bad: within 5%. This deploys the huge literature on string compression.


Some structural properties [Figure: the example tree and its XBW arrays <Slast, Sa, Sp>.] Two useful properties:
- Children are contiguous and delimited by 1s
- Children reflect the order of their parents


XBW is navigational [Figure: the XBW arrays plus the array C = (A:2, B:5, C:9, D:12). Get_children: compute Rank(B, Sa) = 2, then Select in Slast the 2nd sibling group starting from C[B].] XBW is navigational:
- Rank-Select data structures on Slast and Sa
- The array C of |Σ| integers
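The get_children step can be sketched with naive rank/select (real implementations use succinct rank/select structures instead of linear scans). Assumes the arrays of the earlier example tree and a map C from each label to the first row whose upward path starts with that label; names are illustrative.

```python
def rank(seq, sym, i):
    """Number of occurrences of sym in seq[0..i] (naive linear scan)."""
    return sum(1 for x in seq[:i + 1] if x == sym)

def select1(bits, j):
    """Position of the j-th 1 in bits (j >= 1), scanning naively."""
    seen = 0
    for p, b in enumerate(bits):
        seen += b
        if seen == j:
            return p
    raise ValueError("fewer than j ones")

def get_children(i, S_last, S_alpha, C):
    """Rows [first, last] holding the children of the node at row i."""
    c = S_alpha[i]
    k = rank(S_alpha, c, i)                       # node i is the k-th c-labeled node
    y = C[c]                                      # first row whose S_pi starts with c
    z = rank(S_last, 1, y - 1) if y > 0 else 0    # sibling groups before row y
    first = y if k == 1 else select1(S_last, z + k - 1) + 1
    last = select1(S_last, z + k)                 # 1s delimit sibling groups
    return first, last

# Example: XBW of the tree A(B, C(D)), written out by hand
S_last, S_alpha = [1, 0, 1, 1], ["A", "B", "C", "D"]
C = {"A": 1, "C": 3}
# children of the root A (row 0) occupy rows 1..2; the child of C (row 2) is row 3
```

The two structural properties of the previous slide are exactly what this relies on: children form contiguous 1-delimited groups, ordered as their parents.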


Subpath search in XBW [Figure: searching P = BD; the rows whose Sp starts with 'B' delimit the range [fr, lr].] Inductive step:
- Pick the next char in P, i.e. P[i+1] = 'D'
- Search for the first and last 'D' in Sa[fr, lr]
- Jump to their children


Subpath search in XBW [Figure: P = BD; the 2nd and 3rd 'D' in Sa fall inside [fr, lr]; looking at Slast, jump to the 2nd and 3rd sibling groups after C[D] = 12, i.e. the rows whose Sp starts with 'DB'. Two occurrences because of two 1s.] XBW reduces tree indexing to string indexing: Rank and Select data structures are enough to navigate and search T.
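The whole subpath search can be sketched on top of the same rank/select primitives. A hedged sketch, assuming every character of P labels internal nodes and using the same naive rank/select and illustrative C map as before:

```python
def rank(seq, sym, i):
    """Number of occurrences of sym in seq[0..i] (naive linear scan)."""
    return sum(1 for x in seq[:i + 1] if x == sym)

def select1(bits, j):
    """Position of the j-th 1 in bits (j >= 1), scanning naively."""
    seen = 0
    for p, b in enumerate(bits):
        seen += b
        if seen == j:
            return p
    raise ValueError("fewer than j ones")

def subpath_search(P, S_last, S_alpha, C):
    """Range of rows whose upward path starts with reverse(P), plus the
    number of occurrences of the downward path P in the tree."""
    c = P[0]
    y = C[c]                                   # first row whose S_pi starts with c
    z = rank(S_last, 1, y - 1)                 # sibling groups before that row
    occ = rank(S_alpha, c, len(S_alpha) - 1)   # every c-labeled node matches P[0]
    fr, lr = y, select1(S_last, z + occ)       # ...their children delimit the range
    for c in P[1:]:
        k1 = rank(S_alpha, c, fr - 1)          # c-occurrences strictly before fr
        k2 = rank(S_alpha, c, lr)              # c-occurrences up to lr
        if k1 == k2:
            return None, 0                     # P does not occur in the tree
        y = C[c]                               # jump to the children of the
        z = rank(S_last, 1, y - 1)             # (k1+1)-th .. k2-th c-nodes
        fr = y if k1 == 0 else select1(S_last, z + k1) + 1
        lr = select1(S_last, z + k2)
        occ = k2 - k1                          # one sibling group per match
    return (fr, lr), occ
```

On the hand-built XBW of the tree A(B, C(D)) (S_last = [1,0,1,1], Sa = A B C D, C = {A:1, C:3}), searching "AC" lands on the row of D, with one occurrence.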


XBzipIndex: XBW + FM-index [Ferragina et al, 2006] Under patenting by Pisa + Rutgers.
- DBLP: 1.75 bytes/node; Pathways: 0.31 bytes/node; News: 3.91 bytes/node
- Up to 36% improvement in compression ratio
- Query (counting) time 8 ms, navigation time 3 ms


Part #6: Take-home msg. . . This is a powerful paradigm to design compressed indexes:
1. Transform the input into a few arrays [Kosaraju, FOCS '89]
2. Index (+ compress) the arrays to support rank/select ops

There is a strong connection between the data type, its indexing, and its compressed indexing. Open directions: more ops, more experiments and applications, other data types (2D, labeled graphs).


I/O issues Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


What about I/O-issues? B-trees are ubiquitous in large-scale applications:
- Atomic keys: integers, reals, . . .
- Prefix B-tree: bounded-length keys (≤ 255 chars)

String B-tree = B-tree + Patricia Trie [Ferragina-Grossi, 95]:
- Unbounded-length keys
- I/O-optimal prefix searches
- Efficient string updates
- Guaranteed optimal page-fill ratio
- Variants for various models

They are not opportunistic [Bender et al FC].


The B-tree [Figure: a B-tree over integer keys, with O(log_B n) levels; searching the pattern within one node costs O(p/B log_2 B) I/Os.] Search(P[1, p]):
- O((p/B) log_2 n) I/Os
- O(occ/B) I/Os to report the occurrences


On small sets. . . [Ferguson, 92] Scan FC(D):
- If P[L[x]] = 1 then { x++ } else { jump }
- Compare P and S[x] → Max_lcp
- If P[Max_lcp + 1] = 0 go left, else go right, until L[·] ≤ Max_lcp

[Figure: a front-coded dictionary of binary strings; starting at x = 1, the scan identifies position 4 as the candidate, with Mlcp = 3.] Time is #D + |P| ≤ |FC(D)|. Just S[x] needs to be decoded!
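The LCP-driven scan over a front-coded dictionary can be sketched as follows. This is a hedged sketch over a generic alphabet (the slide specializes it to binary strings); the representation and names are illustrative. The key invariant is the matched length m = lcp(P, current string): entries whose stored lcp exceeds m are skipped without decoding anything, so the total work is O(#D + |P|).

```python
def fc_locate(P, first, rest):
    """Locate P in a front-coded sorted string set.

    first: the first string, stored in full.
    rest:  list of (lcp_with_previous_string, stored_suffix).
    Returns the index of the first string >= P.
    """
    # compare P against the first string
    m, s = 0, first
    while m < len(P) and m < len(s) and P[m] == s[m]:
        m += 1
    if m == len(P) or (m < len(s) and P[m] < s[m]):
        return 0                      # P <= first string
    for i, (l, suf) in enumerate(rest, start=1):
        if l < m:                     # this string abandons P's matched prefix,
            return i                  # so it is already larger than P
        if l > m:                     # it still agrees with the previous string
            continue                  # beyond P's mismatch: P is still larger
        # l == m: extend the comparison using only the stored suffix
        j = 0
        while m < len(P) and j < len(suf) and P[m] == suf[j]:
            m += 1
            j += 1
        if m == len(P) or (j < len(suf) and P[m] < suf[j]):
            return i                  # P <= current string
        # otherwise P > current string: keep scanning with the updated m
    return len(rest) + 1

# Front coding of ["abc", "abd", "acd", "bcd"]
fc_first, fc_rest = "abc", [(2, "d"), (1, "cd"), (0, "bcd")]
```

Only stored suffixes are ever compared against P, matching the slide's claim that just the candidate string needs to be decoded.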


On larger sets. . . the Patricia Trie Space = Θ(#D) words (Space PT ≈ #D). [Figure: a Patricia trie over strings on the alphabet {A, C, G}; searching P = GCAC ends at the string with max LCP with P, and hence at P's position.] Two-phase search for P = GCAC:
- Phase 1: tree navigation (blind descent)
- Phase 2: compute the LCP with the string reached
- Phase 3: tree navigation to P's position

Only 1 string is checked!
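Phases 1 and 2 can be sketched as follows, on a toy Patricia trie given as nested tuples: an internal node is (prefix_length, children) where prefix_length is the length of the common prefix of the strings below it, and a leaf is the full string. All names are illustrative; the point is that the descent compares only one character per branching node, and the single full comparison at the end yields the maximum LCP of P with the whole set.

```python
def blind_search(P, node):
    """Phases 1-2 of the two-phase search: blind descent comparing only
    branching characters, then one full comparison with the leaf reached.
    Returns (leaf, lcp); lcp is the max LCP of P with any string in the set."""
    while isinstance(node, tuple):                 # internal: (depth, children)
        depth, children = node
        if depth < len(P) and P[depth] in children:
            node = children[P[depth]]              # follow P's character
        else:
            node = next(iter(children.values()))   # blind step: any child works
    m = 0                                          # node is now a leaf string
    while m < len(P) and m < len(node) and P[m] == node[m]:
        m += 1
    return node, m

# Toy trie over {"GCACG", "GCAGA", "GTA"}: all share "G" (but branch at
# position 1), and the first two share "GCA" (branch at position 3)
pt = (1, {"C": (3, {"C": "GCACG", "G": "GCAGA"}), "T": "GTA"})
```

Phase 3 (re-descending to P's lexicographic position using the computed LCP) is omitted here; it reuses the same trie navigation.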

The String B-tree [Ferragina-Grossi, 95] Succinct PTs give a smaller height in practice, but are not opportunistic: Θ(#D log |D|) bits. String B-tree = B-tree + Patricia Tries at the nodes. [Figure: a String B-tree with O(log_B n) levels; each node stores a Patricia trie (PT) over its keys and is traversed with O(p/B) I/Os, returning the lexicographic position of P.] Search(P[1, p]):
- O((p/B) log_B n) I/Os
- O(occ/B) I/Os

It is dynamic. . .

Succinct PT smaller height in practice. . . not opportunistic: (#D log |D|) bits The String B-tree + P[1, p] Search(P) • O((p/B) log. B n) I/Os • O(occ/B) I/Os It is dynamic. . . 13 20 18 PT 29 2 1 9 26 13 20 25 PT 5 2 26 3 O(log. B n) levels 23 PT PT 29 O(p/B) I/Os PT 29 pattern to search 10 4 PT 6 PT 7 13 Lexicographic position of P Paolo Ferragina, Università di Pisa 20 16 28 18 3 14 PT 8 25 6 12 15 22 18 21 23 PT 3 27 24 11 PT 14 21 17 23