Prologo
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa



Pre-history of string processing!
• Collection of strings: documents, books, emails, source code, DNA sequences, ...


An XML excerpt
<dblp>
 <book>
  <author> Donald E. Knuth </author>
  <title> The TeXbook </title>
  <publisher> Addison-Wesley </publisher>
  <year> 1986 </year>
 </book>
 <article>
  <author> Donald E. Knuth </author>
  <author> Ronald W. Moore </author>
  <title> An Analysis of Alpha-Beta Pruning </title>
  <pages> 293-326 </pages>
  <year> 1975 </year>
  <journal> Artificial Intelligence </journal>
 </article>
 ...
</dblp>
size ≈ 100 Mb; #leaves ≥ 7 Mil (for 75 Mb); #internal nodes ≥ 4 Mil (for 25 Mb); depth ≤ 7


The Query-Log graph
Queries like "Dept CS pisa" link to URLs like www.di.unipi.it/index.html, with #clicks, time, country, ... on the edges.
QueryLog (Yahoo! dataset, 2005):
• #links: ≈70 Mil
• #nodes: ≈50 Mil
• Dictionary of URLs: 24 Mil, 56.3 avg chars, 1.6 Gb
• Dictionary of terms: 44 Mil, 7.3 avg chars, 307 Mb
• Dictionary of Infos: 2.6 Gb


In all cases...
• Some structure: relation among items — trees, (hyper-)graphs, ...
• Some data: (meta-)information about the items — labels on nodes and/or edges; large space (I/O, cache, compression, ...)
• Various operations to be supported:
• Given a node u: retrieve its label, Fw(u), Bw(u), ...
• Given an edge (i, j): check its existence, retrieve its label, ...
• Given a string p: search for all nodes/edges whose label includes p; search for adjacent nodes whose label equals p
This calls for an Id ↔ String mapping and an Index over the labels.



Virtually enlarge M [Zobel et al, '07]


Do you use (z)grep? [de Moura et al, '00]
On ≈1 Gb of data (gzip vs. HuffWord compression):
• Grep takes 29 secs (scan the uncompressed data)
• Zgrep takes 33 secs (gunzip the data | grep)
• Cgrep takes 8 secs (scan directly the compressed data)


In our lectures we are interested not only in the storage issue, but also in + Random Access + Search:
Data Compression + Data Structures


Seven years ago... [now, J. ACM '05]
"Opportunistic Data Structures with Applications", P. Ferragina, G. Manzini
Nowadays several papers: theory & experiments (see Navarro-Mäkinen's survey)


Our starting point was... Ken Church (AT&T, 1995) said: "If I compress the Suffix Array with Gzip I do not save anything. But the underlying text is compressible... What's going on?"
Practitioners use many "squeezing heuristics" that compress data and still support fast access to them.
Can we "automate" and "guarantee" the process?


In these lectures... A path consisting of five steps (Muthu's challenge!!):
1) The problem
2) What practitioners do and why they did not use "theory"
3) What theoreticians then did
4) Experiments
5) The moral ;-)
At the end, hopefully, you'll bring home:
✓ Algorithmic tools to compress & index data
✓ Data-aware measures to evaluate them
✓ Algorithmic reductions: theorists and practitioners love them!
✓ No ultimate recipes!!


String Storage
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


A basic problem
Given a dictionary D of strings of variable length, compress them in a way that we can efficiently support the mapping id ↔ string.
• Hash Table: need D to avoid false positives and for id → string
• (Minimal) ordered perfect hashing: need D for id → string, or to check
• (Compacted) Trie: need D for edge match
Yet the dictionary D needs to be stored:
• its space is not negligible
• I/O- or cache-misses in retrieval


Front-coding
Practitioners use the following approach:
• Sort the dictionary strings
• Strip off the shared prefixes [e.g. after host reversal?]
• Introduce some bucketing, to ensure fast random access
Example (uk-2002 crawl, ≈250 Mb). The sorted URLs
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
become pairs (shared-prefix length, remaining suffix): (0, http://checkmate.com/All_Natural/Applied.html), (34, roma1.html), (38, tic_Art.html), (34, yate.html), (35, er_Soap.html), (35, urvedic_Soap.html), (33, Bath_Salt_Bulk.html), (42, s.html), ...
FC achieves ≈35% on this dataset; gzip ≈12%. Be back on this later on!
Do we need bucketing? Experimental tuning.
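The sort-and-strip step above can be sketched in a few lines of Python (a toy model, not the lecture's actual code; the names `fc_encode`/`fc_decode` are mine):

```python
def fc_encode(sorted_strings):
    """Front coding: replace each string by (lcp, suffix), where lcp is the
    length of the prefix shared with the previous string."""
    pairs, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        pairs.append((lcp, s[lcp:]))
        prev = s
    return pairs


def fc_decode(pairs):
    """Invert front coding: rebuild each string from its predecessor.
    Note the sequential dependence -- this is why plain FC needs bucketing
    (or locality-preserving FC) to support fast random access."""
    out, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out
```

On the three leading URLs above, `fc_encode` yields (0, full URL), (34, roma1.html), (38, tic_Art.html), matching the pairs listed in the slide.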


Locality-preserving FC [Bender et al, 2006]
Drop bucketing + optimal string decompression:
• Compress D up to (1+ε) FC(D) bits
• Decompress any string S in O(1 + |S|/ε) time
A simple incremental encoding algorithm [where ε = 2/(c-2)]:
I. Assume to have FC(S1, ..., Si-1)
II. Given Si, we proceed backward for X = c·|Si| chars in FC
Two cases: within those X chars the decoding of Si either completes (Si is FC-coded) or does not (Si is copied verbatim).


Locality-preserving FC [Bender et al, 2006]
A simple incremental encoding algorithm [where ε = 2/(c-2)]:
• Assume to have FC(S1, ..., Si-1)
• Given Si, we proceed backward for X = c·|Si| chars in FC
• If Si can be decoded there, then we add FC(Si); else we add Si verbatim (a copied string)
Decoding is unaffected!!
Space occupancy (sketch):
• FC-encoded strings are OK!
• Partition the copied strings into crowded and uncrowded
• Let Si be crowded, and Z its preceding copied string: |Z| ≥ X/2 ≥ (c/2)·|Si|, i.e. |Si| ≤ (2/c)·|Z|
• Hence the lengths of crowded strings decrease geometrically!!
• Consider chains of copied strings: |uncrowd*| ≤ (c/(c-2))·|uncrowd|
• Charge the chain cost to the X/2 = (c/2)·|uncrowd| chars preceding uncrowd (i.e. FC-chars)


Random access to LPFC
We call C the LPFC-string, n = #strings in C, m = total length of C.
How do we random-access the compressed C?
• Get(i): return the position of the i-th string in C (id → string)
• Previous(j), Next(j): return the position of the string preceding or following character C[j]
Classical answers ;-)
• Pointers to the positions of copied strings in C: space O(n log m) bits, access time O(1) + O(|S|/ε)
• Some form of bucketing: space O((n/b) log m) bits, access time O(b) + O(|S|/ε) — a trade-off
Goal: no trade-off!


Re-phrasing our problem
C is the LPFC-string, n = #strings in C, m = total length of C. Support the following operations on C:
• Get(i): return the position of the i-th string in C
• Previous(j), Next(j): return the position of the string preceding/following C[j]
Proper integer encodings:
C = 0 http://checkmate.com/All_Natural/ 33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html ...
B = 1 00000000000000 10 0000 10 000000000 ... [see Moffat '07]
• Rank1(x) = number of 1s in B[1, x], e.g. Rank1(36) = 2
• Select1(y) = position of the y-th 1 in B, e.g. Select1(4) = 51
Looking at B as a pointerless data structure:
• Get(i) = Select1(i)
• Previous(j) = Select1(Rank1(j) - 1)
• Next(j) = Select1(Rank1(j) + 1)
Uniquely-decodable int-codes for the copy-lengths:
• γ-code(6) = 00 110, i.e. 2⌊log x⌋ + 1 bits
• δ-code(33) = γ(6) 00001, i.e. log x + 2 loglog x + O(1) bits
• No recursion in practice; |γ(x)| > |δ(x)| for x > 31
• Or Huffman on the lengths
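The two Elias codes mentioned above can be sketched as follows (a minimal, encoder-only illustration; the function names are mine):

```python
def gamma(x):
    """Elias gamma code of x >= 1: |bin(x)|-1 zeros, then bin(x).
    Length: 2*floor(log2 x) + 1 bits."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b


def delta(x):
    """Elias delta code of x >= 1: gamma(|bin(x)|), then bin(x) without
    its leading 1.  Length: log x + 2 loglog x + O(1) bits."""
    b = bin(x)[2:]
    return gamma(len(b)) + b[1:]
```

Indeed gamma(6) = "00110" as in the slide, and len(gamma(x)) exceeds len(delta(x)) exactly from x = 32 on, matching the "x > 31" remark.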


A basic problem! [Jacobson, '89]
B = 00101010101111111000001101010111000..., m = |B|, n = #1s; e.g. (in the figure) Select1(3) = 8 and Rank1(7) = 4.
• Rankb(i) = number of b's in B[1, i]
• Selectb(i) = position of the i-th b in B
Considering b = 1 is enough:
• Rank0(i) = i - Rank1(i)
• Select0 reduces to Rank1 and Select1 over two binary arrays B0 and B1 derived from B (e.g. from B = 0100001110010011111110), with |B0| = m - n, |B1| = n, so |B0| + |B1| = m


A basic problem! [Jacobson, '89]
B = 00101010101111111000001101010111000..., m = |B|, n = #1s.
• Rank1(i) = number of 1s in B[1, i]
• Select1(i) = position of the i-th 1 in B
Given an integer set, take B as its characteristic vector; then pred(x) = Select1(Rank1(x - 1)).
Lower bounds can be inherited [Patrascu-Thorup, '06].


The Bit-Vector Index (m = |B|, n = #1s)
Goal. B is read-only, and the additional index takes o(m) bits.
B = 00101010101111111000001101010111000...
Store absolute Rank1 values at the boundaries of superblocks of Z bits, and bucket-relative Rank1 values at the boundaries of blocks of z bits; a small precomputed table maps every possible z-bit block and position inside it to a 1s-count.
Setting Z = poly(log m) and z = (1/2) log m:
• Space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits
• Rank time is O(1)
• The o(m) term is crucial in practice
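The two-level directory can be mimicked in Python (toy block sizes, and a plain scan where the real structure uses the precomputed table; `RankDir` is a name of my choosing):

```python
class RankDir:
    """Two-level rank directory: absolute 1-counts per superblock of Z bits,
    superblock-relative counts per block of z bits.  With z = (1/2)log m and
    Z = poly(log m) the counters take o(m) bits; the final sub-block scan is
    what the O(1)-time construction replaces with a table lookup."""

    def __init__(self, bits, z=8, Z=64):
        assert Z % z == 0
        self.bits, self.z, self.Z = bits, z, Z
        self.sup, self.blk = [], []
        tot = rel = 0
        for i, b in enumerate(bits):
            if i % Z == 0:
                self.sup.append(tot)   # 1s before this superblock
                rel = 0
            if i % z == 0:
                self.blk.append(rel)   # 1s since the superblock start
            tot += b
            rel += b

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        bl = min(i // self.z, len(self.blk) - 1)
        sb = (bl * self.z) // self.Z
        return self.sup[sb] + self.blk[bl] + sum(self.bits[bl * self.z : i])
```

The counters take (m/Z)·log m + (m/z)·log Z bits, matching the slide's accounting.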


The Bit-Vector Index (m = |B|, n = #1s)
For Select, cut B into segments of size r containing k consecutive 1s each:
• Sparse case: if r > k², store explicitly the positions of the k 1s
• Dense case: k ≤ r ≤ k², recurse... one level is enough!! ...still needing a table of size o(m)
Setting k ≈ polylog m:
• Space is m + o(m), and B is not touched!
• Select time is O(1)
There exists a Bit-Vector Index taking |B| + o(|B|) bits and constant time for Rank/Select, with B read-only!
Hence LPFC + Rank/Select takes [1 + o(1)] extra bits per FC-char.


Compressed String Storage
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


FC versus Gzip
Gzip emits backward copies ⟨distance, length, next char⟩ — e.g. ⟨6, 3, c⟩ in a a c b a b c b a c — and its dictionary is all substrings starting in the window. Two features:
• Repetitiveness is deployed at any position
• The window is used for (practical) computational reasons
On the previous dataset of URLs (i.e. uk-2002):
• FC achieves >30%
• Gzip achieves 12%
• PPM achieves 7%
But there is no random access to substrings. May we combine the best of the two worlds?


The empirical entropy H0
H0(S) = ∑i (mi/m) log2 (m/mi), where mi is the frequency in S of the i-th symbol and m = |S|.
• m·H0(S) is the best you can hope for from a memoryless compressor
• We know that Huffman or Arithmetic coding come close to this bound
H0 cannot distinguish between AxBy and a random string with x A's and y B's.
We get better compression using a codeword that depends on the k symbols preceding the one to be compressed (its context).
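The definition turns into a one-liner (a sketch; `H0` is my name for it):

```python
from collections import Counter
from math import log2


def H0(S):
    """Empirical zeroth-order entropy: sum_i (m_i/m) * log2(m/m_i),
    with m_i the frequency of the i-th symbol and m = |S|."""
    m = len(S)
    return sum(mi / m * log2(m / mi) for mi in Counter(S).values())
```

H0("aaaa") = 0, a balanced binary string costs 1 bit/symbol, and H0("mississippi") ≈ 1.823; multiplying by m gives the memoryless lower bound.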


The empirical entropy Hk
Hk(S) = (1/|S|) ∑|w|=k |S[w]| · H0(S[w]), where S[w] = the string of symbols that follow the occurrences of the substring w in S.
Example: given S = "mississippi", we have S["is"] = ss. ("Follow" ≈ "precede", up to symmetry.)
✓ To compress S up to Hk(S): use Huffman or Arithmetic coding to compress each S[w] up to its H0.
How much is this "operational"?
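Hk is computed by grouping symbols per context — a direct transcription of the formula (self-contained, so H0 is repeated):

```python
from collections import Counter, defaultdict
from math import log2


def H0(S):
    m = len(S)
    return sum(mi / m * log2(m / mi) for mi in Counter(S).values())


def Hk(S, k):
    """k-th order empirical entropy: (1/|S|) * sum_w |S[w]| * H0(S[w]),
    where S[w] collects the symbols following each occurrence of w in S."""
    ctx = defaultdict(list)
    for i in range(len(S) - k):
        ctx[S[i : i + k]].append(S[i + k])
    return sum(len(sw) * H0(sw) for sw in ctx.values()) / len(S)
```

For S = "mississippi", Hk(S, 0) equals H0(S) ≈ 1.823 while Hk(S, 1) ≈ 0.796: longer contexts expose more structure.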


Entropy-bounded string storage [Ferragina-Venturini, '07]
Goal. Given a string S[1, m] drawn from an alphabet of size σ:
• encode S within m·Hk(S) + o(m log σ) bits, for all k ≤ α·logσ m
• extract any substring of L symbols in optimal Θ(L / logσ m) time
This encoding fully replaces S in the RAM model!
Two corollaries:
• Compressed Rank/Select data structures: B was read-only in the simplest R/S scheme; we get |B|·Hk(B) + o(|B|) bits and R/S in O(1) time
• Compressed front-coding + random access
Promising: FC + gzip saves 16% over gzip on uk-2002.


The storage scheme
• Blocks of b = ½ logσ m symbols: #blocks = m/b = O(m / logσ m)
• #distinct blocks = O(σ^b) = O(m^½)
T = the distinct blocks, sorted by decreasing frequency; each gets the next codeword (cw) in {ε, 0, 1, 00, 01, 10, 11, 000, ...}; V = the concatenation of the cw of S's blocks; B marks the cw beginnings in V.
Decoding is easy:
• R/S on B to determine the cw position in V
• Retrieve the cw from V
• The decoded block is T[2^len(cw) + cw]
|B| ≤ |S| log σ, and #1s in B = #blocks = o(|S|): T + V + B take |V| + o(|S| log σ) bits.


Bounding |V| in terms of Hk(S)
Introduce the statistical encoder Ek(S):
• Compute F(i) = the frequency of S[i] within its k-th order context S[i-k, i-1]
• Encode every block B[1, b] of S as follows: 1) write B[1, k] explicitly; 2) encode B[k+1, b] by Arithmetic coding using the k-th order frequencies
Some algebra: |Ek(S)| ≤ (m/b)·(k log σ) + m·Hk(S) + 2(m/b) bits.
Ek(S) is worse than our encoding V:
• Ek assigns unique cw to blocks
• Those cw are a subset of {0,1}*, whereas our cw are the shortest of {0,1}*
Golden rule of data compression: |V| ≤ |Ek(S)| ≤ |S|·Hk(S) + o(|S| log σ) bits.


Part #2: Take-home msg
Given a binary string B, we can (with a pointerless data structure):
• store B in |B|·Hk(B) + o(|B|) bits
• support Rank & Select in constant time
• access any substring of B in optimal time
Given a string S on alphabet Σ, we can (always better than storing S plainly in RAM):
• store S in |S|·Hk(S) + o(|S| log |Σ|) bits, where k ≤ α·log|Σ| |S|
• access any substring of S in optimal time
Experimentally: ≈10^7 Select/sec, ≈10^6 Rank/sec.


(Compressed) String Indexing
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


What do we mean by "indexing"?
• Word-based indexes, where a notion of "word" must be devised: inverted files, signature files, bitmaps
• Full-text indexes, with no constraint on text and queries: suffix array, suffix tree, String B-tree, ...


The Problem
Given a text T, we wish to devise a (compressed) representation for T that efficiently supports the following operations:
✓ Count(P): how many times does string P occur in T as a substring?
✓ Locate(P): list the positions of the occurrences of P in T
✓ Visualize(i, j): print T[i, j]
Time-efficient solutions, but not compressed: suffix arrays, suffix trees, ...many others...
Space-efficient solutions, but not time-efficient: ZGrep (uncompress and then grep), CGrep and NGrep (pattern matching over compressed text).


The Suffix Array
Prop 1. All suffixes of T having prefix P are contiguous in the suffix array.
Prop 2. Their starting row is the lexicographic position of P.
T = mississippi#, P = si:
SA = 12 11 8 5 2 1 10 9 7 4 6 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Storing the suffixes explicitly would take Θ(N²) space; the suffix array stores only N suffix pointers:
• SA: Θ(N log2 N) bits
• text T: N chars
In practice, a total of 5N bytes.


Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
T = mississippi#, P = si: compare P against the suffix pointed to by the middle SA entry; here P is larger, so recurse on the right half.


Searching a pattern
Indirect binary search on SA, continued: now P is smaller, so recurse on the left half.
Suffix array search:
• O(log2 N) binary-search steps
• each step takes O(p) char comparisons
Overall, O(p log2 N) time; improvable to O(p + log2 N) using LCP information [Manber-Myers, '90; Cole et al, '06].


Listing of the occurrences
T = mississippi#, P = si, occ = 2.
Search for P# = si# and P$ = si$, where # < every symbol < $: the two searches delimit the SA range of the suffixes prefixed by P (here sippi# and sissippi#, i.e. positions 7 and 4).
Suffix array search: listing then takes O(occ) time.
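The whole sequence — build SA, binary-search the range, list the occurrences — fits in a short sketch (naive quadratic construction, fine for a demo; linear-time suffix sorters exist):

```python
def suffix_array(T):
    """Sort the suffix starting positions lexicographically
    (naive O(N^2 log N) demo construction)."""
    return sorted(range(len(T)), key=lambda i: T[i:])


def sa_search(T, SA, P):
    """Return (first_row, occ): by Prop 1 the suffixes prefixed by P form a
    contiguous SA range, found with two binary searches in O(p log N) time."""
    p = len(P)

    def boundary(after):
        # leftmost row whose suffix-prefix is >= P (or > P when after=True)
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid] : SA[mid] + p]
            if pref < P or (after and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo

    first = boundary(False)
    return first, boundary(True) - first
```

On T = mississippi# the SA matches the slide's (1-based) one; P = si occupies rows 8-9 (0-based), i.e. text positions 6 and 3, and listing them is a plain O(occ) scan of that range.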


Text mining
Lcp[1, N-1] stores the LCP length between suffixes adjacent in SA.
T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3, Lcp = 0 1 1 4 0 0 1 0 2 1 3.
• Does there exist a repeated substring of length ≥ L? Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times? Search for C-1 consecutive Lcp entries, all ≥ L.
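Both mining queries reduce to scans of the Lcp array; a sketch (naive Lcp computation — Kasai's algorithm does it in O(N)):

```python
def lcp_array(T, SA):
    """Lcp[i] = longest-common-prefix length of the suffixes at SA[i], SA[i+1]."""
    lcp = []
    for a, b in zip(SA, SA[1:]):
        l = 0
        while a + l < len(T) and b + l < len(T) and T[a + l] == T[b + l]:
            l += 1
        lcp.append(l)
    return lcp


def has_repeat(T, SA, L, C=2):
    """Is there a substring of length >= L occurring >= C (>= 2) times?
    Equivalent to C-1 consecutive Lcp entries all being >= L."""
    lcp = lcp_array(T, SA)
    w = C - 1
    return any(min(lcp[i : i + w]) >= L for i in range(len(lcp) - w + 1))
```

On mississippi# the longest repeat is issi (length 4), so has_repeat(..., 4) holds and has_repeat(..., 5) does not.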


What about space occupancy?
SA + T take Θ(N log2 N) bits. Do we need such an amount?
1) #permutations of {1, 2, ..., N} = N!
2) SA cannot be an arbitrary permutation of {1, ..., N}:
3) #SA ≤ #texts = |Σ|^N
LB from #texts: Ω(N log |Σ|) bits; LB from compression: Ω(N·Hk(T)) bits.
Both are very far from N log2 N.


An elegant mathematical tool
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


The Burrows-Wheeler Transform (1994)
Take the text T = mississippi# and form all its cyclic rotations:
mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, ..., ppi#mississi, pi#mississip, i#mississipp
Sort the rows; F is the first column, L the last:
#mississippi  →  i
i#mississipp  →  p
ippi#mississ  →  s
issippi#miss  →  s
ississippi#m  →  m
mississippi#  →  #
pi#mississip  →  p
ppi#mississi  →  i
sippi#missis  →  s
sissippi#mis  →  s
ssippi#missi  →  i
ssissippi#mi  →  i
L = ipssm#pissii is a permutation of T.
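In code, the transform is exactly "sort the rotations, keep the last column" (a quadratic demo; real implementations go through the suffix array):

```python
def bwt(T):
    """Burrows-Wheeler Transform of T, which must end with a unique,
    smallest sentinel (here '#'): sort all cyclic rotations and
    concatenate their last characters."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)
```

bwt("mississippi#") returns "ipssm#pissii", the column L of the slide.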


A famous example.


A useful tool: the L → F mapping
How do we map L's chars onto F's chars? We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one: the rotated rows start with that char and are still sorted, hence equal chars keep the same relative order in L and in F!


The BWT is invertible
Two key properties:
1. The LF-mapping sends L's chars to F's chars (equal chars preserve their relative order)
2. L[i] precedes F[i] in T
Reconstruct T backward (T = ...ippi#):
InvertBWT(L):
  Compute LF[0, n-1];
  r = 0; i = n;
  while (i > 0) { T[i] = L[r]; r = LF[r]; i--; }
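The InvertBWT pseudocode above translates directly (0-based; LF is built by counting, exploiting the two properties):

```python
def ibwt(L, sentinel="#"):
    """Invert the BWT: LF[i] = row of F holding the char L[i] (equal chars
    keep their relative order), and L[i] precedes F[i] in T, so T can be
    rebuilt backward starting from row 0, the rotation beginning with the
    sentinel."""
    seen = {}
    rank = []                       # occurrences of L[i] among equal chars so far
    for c in L:
        rank.append(seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    first, tot = {}, 0              # first F-row of every char
    for c in sorted(seen):
        first[c] = tot
        tot += seen[c]
    LF = [first[c] + rank[i] for i, c in enumerate(L)]
    out, r = [], 0
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    s = "".join(reversed(out))      # a rotation starting with the sentinel
    return s[1:] + s[0]             # rotate the sentinel back to the end
```

Indeed ibwt("ipssm#pissii") recovers mississippi#.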


How to compute the BWT?
We said that L[i] precedes F[i] in T; e.g. L[3] = T[7] = T[SA[3] - 1].
Given SA, we have L[i] = T[SA[i] - 1]. (This is the role of #: it makes the rotations sort like the suffixes.)
Building the BWT by sorting the rotations directly is elegant but inefficient. Obvious inefficiencies:
• O(n³) time in the worst case (naive comparisons)
• O(n²) cache misses or I/O faults


Compressing L seems promising...
Key observation: L is locally homogeneous, hence highly compressible.
Algorithm Bzip: Move-to-Front coding of L → Run-Length coding → statistical coder.
Bzip vs. Gzip: 20% vs. 33% compression, but Bzip is slower in (de)compression!


An encoding example
T = mississippimississippi...#
L = ipppssssssmmmii#pppiiissssssiiiiii
Mtf = 020030000030030200300300000100000
Shift every value up by one to reserve 0 for run-lengths (the alphabet becomes |Σ|+1):
Mtf' = 030040000040040300400400000200000
RLE0 (Wheeler's code, run-lengths of 0s written in binary, e.g. Bin(6) = 110):
RLE0 = 02131031302131310110
...then Arithmetic/Huffman over the |Σ|+1 symbols.
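Move-to-Front, the first stage of the pipeline, can be sketched as follows (RLE and the statistical coder omitted):

```python
def mtf_encode(S, alphabet):
    """Move-to-Front: emit the current list position of each symbol, then
    move it to the front; runs of equal symbols become runs of 0s, which
    the subsequent RLE stage squeezes."""
    lst, out = list(alphabet), []
    for c in S:
        i = lst.index(c)
        out.append(i)
        lst.insert(0, lst.pop(i))
    return out


def mtf_decode(codes, alphabet):
    """Invert MTF by replaying the same list updates."""
    lst, out = list(alphabet), []
    for i in codes:
        lst.insert(0, lst.pop(i))
        out.append(lst[0])
    return "".join(out)
```

On L = "ipssm#pissii" (the BWT of mississippi#) the output contains a 0 exactly where L repeats its previous character — the local homogeneity the statistical coder exploits.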


Why it works...
Key observation: L is locally homogeneous, hence highly compressible.
Each piece of L corresponds to a context; compressing every piece up to its H0 achieves Hk(T).
MTF + RLE avoid the need to explicitly partition the BWT into contexts.


Back to indexing: BWT ≈ SA
L implicitly contains both SA and T (recall L[i] = T[SA[i] - 1]).
Can we search directly within L?


Implement the LF-mapping [Ferragina-Manzini]
Represent F by the array C[1, |Σ|], where C[c] = first row of the block of c's in F:
C[#] = 1, C[i] = 2, C[m] = 6, C[p] = 7, C[s] = 9
The oracle: Rank(c, i) = number of occurrences of c in L[1, i].
E.g. Rank(s, 9) = 3 maps L[9] to F[11] = F[C[s] + 3 - 1].
So LF[i] = C[L[i]] + Rank(L[i], i) - 1: we need generalized Rank & Select.


Rank and Select on strings
• If Σ is small (i.e. constant): build a binary Rank data structure per symbol of Σ; Rank takes O(1) time in entropy-bounded space
• If Σ is large (words?): we need a smarter solution, the Wavelet Tree [Grossi-Gupta-Vitter, '03]
Another step of reduction: Rank & Select over arbitrary strings reduce to Rank & Select over binary strings.
Binary R/S are the key tools (tons of papers).


Substring search in T (count the pattern occurrences)
P = si. Process P backward, keeping the range [fr, lr] of rows prefixed by the current suffix of P; occ = lr - fr + 1.
First step: the rows prefixed by the last char "i" are read off the available table C (# → 1, i → 2, m → 6, p → 7, s → 9).
Inductive step: given fr, lr for P[j+1, p]:
• take c = P[j]
• find the first and the last occurrence of c in L[fr, lr]
• LF-map these two chars into F: that is the range for P[j, p]
Rank is enough.
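The inductive step above is precisely FM-index backward search; a compact sketch with a naive Rank (constant-time in the real index):

```python
from collections import Counter


def backward_search(L, P):
    """Count(P) on the BWT string L: scan P right to left, maintaining the
    half-open row range [fr, lr) of rotations prefixed by the current
    suffix of P, via the C array and Rank queries on L."""
    counts = Counter(L)
    C, tot = {}, 0                       # C[c] = #chars in L smaller than c
    for c in sorted(counts):
        C[c] = tot
        tot += counts[c]

    def rank(c, i):                      # occurrences of c in L[0:i] (naive)
        return L[:i].count(c)

    fr, lr = 0, len(L)
    for c in reversed(P):
        if c not in C:
            return 0
        fr = C[c] + rank(c, fr)
        lr = C[c] + rank(c, lr)
        if fr >= lr:
            return 0
    return lr - fr
```

With L = "ipssm#pissii" (the BWT of mississippi#): si and ssi both occur twice, the whole text once — O(p) Rank rounds, no access to T.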


The FM-index [Ferragina-Manzini, FOCS '00; JACM '05]
The result (on small alphabets):
✓ Count(P): O(p) time
✓ Locate(P): O(occ · log^(1+ε) N) time
✓ Visualize(i, i+L): O(L + log^(1+ε) N) time
✓ Space occupancy: O(N·Hk(T)) + o(N) bits — o(N) if T is compressible
The index does not depend on k; the bound holds for all k simultaneously.
New concept: the FM-index is an opportunistic data structure.
The survey by Navarro-Mäkinen describes many compressed-index variants.


Is this a technological breakthrough? [December 2003] [January 2005]


The question then was...
How to turn these challenging and mature theoretical achievements into a technological breakthrough?
• Engineered implementations
• A flexible API to allow reuse and development
• A framework for extensive testing


Joint effort of Navarro's group
• We engineered the best known indexes: FMI, CSA, SSA, AF-FMI, RL-FM, LZ, ...
• All implemented indexes follow a carefully designed API which offers: build, count, locate, extract, ...
• A group of variegated tools has been designed to plan, execute and check the index performance over the text collections automatically
• Some texts are available; their sizes range from 50 Mb to 2 Gb
>400 downloads, >50 registered users


Some figures, over hundreds of MBs of data:
• Count(P) takes 5 msecs/char, ≈42% space
• Extract takes 20 msecs/char (10 times slower!)
• Locate(P) takes 50 msecs/occ, +10% space (50 times slower!)
A trade-off is possible!!!


We need your applications...


Part #5: Take-home msg...
This is a powerful paradigm to design compressed indexes:
1. Transform the input into a few arrays
2. Index (+ compress) the arrays to support rank/select ops
Data type: Indexing → Compressed Indexing. Next issues: compression and I/Os; compression and query distribution/flow.
Other data types: labeled trees, 2D.


(Compressed) Tree Indexing
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


Where we are...
A data structure is "opportunistic" if it indexes a text T within compressed space and supports three kinds of queries:
✓ Count(P): count the occurrences of P in T
✓ Locate(P): list the occurrences of P in T
✓ Display(i, j): print T[i, j]
Key tools: Burrows-Wheeler Transform + Suffix Array.
Key idea: reduce P's queries to a few rank/select queries on BWT(T).
Space complexity: a function of the k-th order empirical entropy of T.


Another data format: XML [W3C '98]
<dblp>
 <book>
  <author> Donald E. Knuth </author>
  <title> The TeXbook </title>
  <publisher> Addison-Wesley </publisher>
  <year> 1986 </year>
 </book>
 <article>
  <author> Donald E. Knuth </author>
  <author> Ronald W. Moore </author>
  <title> An Analysis of Alpha-Beta Pruning </title>
  <pages> 293-326 </pages>
  <year> 1975 </year>
  <volume> 6 </volume>
  <journal> Artificial Intelligence </journal>
 </article>
 ...
</dblp>


A tree interpretation...
• XML document exploration → tree navigation
• XML document search → labeled subpath searches (a subset of XPath [W3C])


A key concern: verbosity... (IEEE Computer, April 2005)


The problem, in practice...
We wish to devise a (compressed) representation for T that efficiently supports the following operations:
✓ Navigational operations: parent(u), child(u, i, c)
✓ Subpath searches over a sequence of k labels
✓ Content searches: subpath search + substring
• XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...) need the whole decompression for navigation and search
• XML-queriable compressors (like XPress, XGrind, XQzip, ...) achieve poor compression and need a scan of the whole (compressed) file
• XML-native search engines need this tool as a core block for query optimization and (compressed) storage of information
Theory?


A transform for labeled trees [Ferragina et al, 2005]
XBW-transform on trees ≈ BW-transform on strings.
The XBW-transform linearizes T into 2 arrays such that:
• the compression of T reduces to the compression of these two arrays (e.g. gzip, bzip2, ppm, ...)
• the indexing of T reduces to implementing generalized rank/select over these two arrays
Rank & Select are again crucial.


The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label (into Sα) and the labels on its upward path (into Sπ; ε for the root — e.g. a node labeled c below a node B below the root C gets upward path BC).
The result is a permutation of the tree nodes.


The XBW-Transform
Step 2. Stably sort the pairs ⟨Sα, Sπ⟩ according to the upward labeled paths Sπ (ε, AC, AC, AC, BC, BC, C, ... in the running example).


The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children.
Key fact: nodes correspond to items in ⟨Slast, Sα⟩.
• XBW can be built and inverted in optimal O(t) time
• XBW takes the optimal t log |Σ| + t bits
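The three steps can be prototyped in a few lines (nodes as (label, children) pairs are my own toy representation; the real construction uses a counting sort of the paths, and string comparison of paths matches the intended order only for single-char labels):

```python
def xbw(root):
    """XBW-transform sketch.  Pre-order visit emitting, per node, its label,
    its upward labeled path and a last-child flag; then stably sort by the
    upward paths (Python's sort is stable).  Returns (S_last, S_alpha, S_pi)."""
    rows = []                                  # (upward path, label, is_last)

    def visit(node, path, is_last):
        label, children = node
        rows.append((path, label, is_last))
        for i, ch in enumerate(children):
            visit(ch, label + path, i == len(children) - 1)

    visit(root, "", True)
    rows.sort(key=lambda r: r[0])              # stable sort by upward path
    S_last = [1 if last else 0 for _, _, last in rows]
    S_alpha = [label for _, label, _ in rows]
    S_pi = [path for path, _, _ in rows]
    return S_last, S_alpha, S_pi
```

On a toy tree C(B(a, c), D(b)) the children of every node end up contiguous, delimited by the 1s of S_last — the structural property the next slides exploit.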


XBW is highly compressible [Figure: the XBW of a DBLP excerpt; e.g. rows with Sp = /author/article/dblp carry Sa values such as Donald Knuth and Kurt Mehlhorn, rows with Sp = /year/book/dblp carry values such as 1977.] XBW is compressible:
- Sa is locally homogeneous
- Slast has some structure and is small

Theoretically, we could extend the definition of Hk to labeled trees by taking as k-context of a node its leading path of length k (related to Markov random fields over trees).


XBzip – a simple XML compressor Tags, attributes and Pcdata are turned into the XBW arrays. XBW is compressible:
- compress Sa with PPM
- Slast is small. . .


XBzip = XBW + PPM [Ferragina et al, 2006] String compressors are not so bad: within 5%. This deploys the huge literature on string compression.


Some structural properties [Figure: the example tree and its XBW arrays <Slast, Sa, Sp>.] Two useful properties:
- Children are contiguous and delimited by 1s
- Children reflect the order of their parents


XBW is navigational [Figure: the XBW arrays plus the array C = (A:2, B:5, C:9, D:12). Get_children: compute Rank(B, Sa) = 2, then Select in Slast the 2nd sibling group starting from C[B].] XBW is navigational:
- Rank-Select data structures on Slast and Sa
- The array C of |Σ| integers
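The get_children step can be sketched with naive rank/select (real implementations use succinct rank/select structures instead of linear scans). Assumes the arrays of the earlier example tree and a map C from each label to the first row whose upward path starts with that label; names are illustrative.

```python
def rank(seq, sym, i):
    """Number of occurrences of sym in seq[0..i] (naive linear scan)."""
    return sum(1 for x in seq[:i + 1] if x == sym)

def select1(bits, j):
    """Position of the j-th 1 in bits (j >= 1), scanning naively."""
    seen = 0
    for p, b in enumerate(bits):
        seen += b
        if seen == j:
            return p
    raise ValueError("fewer than j ones")

def get_children(i, S_last, S_alpha, C):
    """Rows [first, last] holding the children of the node at row i."""
    c = S_alpha[i]
    k = rank(S_alpha, c, i)                       # node i is the k-th c-labeled node
    y = C[c]                                      # first row whose S_pi starts with c
    z = rank(S_last, 1, y - 1) if y > 0 else 0    # sibling groups before row y
    first = y if k == 1 else select1(S_last, z + k - 1) + 1
    last = select1(S_last, z + k)                 # 1s delimit sibling groups
    return first, last

# Example: XBW of the tree A(B, C(D)), written out by hand
S_last, S_alpha = [1, 0, 1, 1], ["A", "B", "C", "D"]
C = {"A": 1, "C": 3}
# children of the root A (row 0) occupy rows 1..2; the child of C (row 2) is row 3
```

The two structural properties of the previous slide are exactly what this relies on: children form contiguous 1-delimited groups, ordered as their parents.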


Subpath search in XBW [Figure: searching P = BD; the rows whose Sp starts with 'B' delimit the range [fr, lr].] Inductive step:
- Pick the next char in P, i.e. P[i+1] = 'D'
- Search for the first and last 'D' in Sa[fr, lr]
- Jump to their children


Subpath search in XBW [Figure: P = BD; the 2nd and 3rd 'D' in Sa fall inside [fr, lr]; looking at Slast, jump to the 2nd and 3rd sibling groups after C[D] = 12, i.e. the rows whose Sp starts with 'DB'. Two occurrences because of two 1s.] XBW reduces tree indexing to string indexing: Rank and Select data structures are enough to navigate and search T.
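The whole subpath search can be sketched on top of the same rank/select primitives. A hedged sketch, assuming every character of P labels internal nodes and using the same naive rank/select and illustrative C map as before:

```python
def rank(seq, sym, i):
    """Number of occurrences of sym in seq[0..i] (naive linear scan)."""
    return sum(1 for x in seq[:i + 1] if x == sym)

def select1(bits, j):
    """Position of the j-th 1 in bits (j >= 1), scanning naively."""
    seen = 0
    for p, b in enumerate(bits):
        seen += b
        if seen == j:
            return p
    raise ValueError("fewer than j ones")

def subpath_search(P, S_last, S_alpha, C):
    """Range of rows whose upward path starts with reverse(P), plus the
    number of occurrences of the downward path P in the tree."""
    c = P[0]
    y = C[c]                                   # first row whose S_pi starts with c
    z = rank(S_last, 1, y - 1)                 # sibling groups before that row
    occ = rank(S_alpha, c, len(S_alpha) - 1)   # every c-labeled node matches P[0]
    fr, lr = y, select1(S_last, z + occ)       # ...their children delimit the range
    for c in P[1:]:
        k1 = rank(S_alpha, c, fr - 1)          # c-occurrences strictly before fr
        k2 = rank(S_alpha, c, lr)              # c-occurrences up to lr
        if k1 == k2:
            return None, 0                     # P does not occur in the tree
        y = C[c]                               # jump to the children of the
        z = rank(S_last, 1, y - 1)             # (k1+1)-th .. k2-th c-nodes
        fr = y if k1 == 0 else select1(S_last, z + k1) + 1
        lr = select1(S_last, z + k2)
        occ = k2 - k1                          # one sibling group per match
    return (fr, lr), occ
```

On the hand-built XBW of the tree A(B, C(D)) (S_last = [1,0,1,1], Sa = A B C D, C = {A:1, C:3}), searching "AC" lands on the row of D, with one occurrence.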


XBzipIndex: XBW + FM-index [Ferragina et al, 2006] Under patenting by Pisa + Rutgers.
- DBLP: 1.75 bytes/node; Pathways: 0.31 bytes/node; News: 3.91 bytes/node
- Up to 36% improvement in compression ratio
- Query (counting) time 8 ms, navigation time 3 ms


Part #6: Take-home msg. . . This is a powerful paradigm to design compressed indexes:
1. Transform the input into a few arrays [Kosaraju, FOCS '89]
2. Index (+ compress) the arrays to support rank/select ops

There is a strong connection between the data type, its indexing, and its compressed indexing. Open directions: more ops, more experiments and applications, other data types (2D, labeled graphs).


I/O issues Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


What about I/O-issues? B-trees are ubiquitous in large-scale applications:
- Atomic keys: integers, reals, . . .
- Prefix B-tree: bounded-length keys (≤ 255 chars)

String B-tree = B-tree + Patricia Trie [Ferragina-Grossi, 95]:
- Unbounded-length keys
- I/O-optimal prefix searches
- Efficient string updates
- Guaranteed optimal page-fill ratio
- Variants for various models

They are not opportunistic [Bender et al FC].


The B-tree [Figure: a B-tree over integer keys, with O(log_B n) levels; searching the pattern within one node costs O(p/B log_2 B) I/Os.] Search(P[1, p]):
- O((p/B) log_2 n) I/Os
- O(occ/B) I/Os to report the occurrences


On small sets. . . [Ferguson, 92] Scan FC(D):
- If P[L[x]] = 1 then { x++ } else { jump }
- Compare P and S[x] → Max_lcp
- If P[Max_lcp + 1] = 0 go left, else go right, until L[·] ≤ Max_lcp

[Figure: a front-coded dictionary of binary strings; starting at x = 1, the scan identifies position 4 as the candidate, with Mlcp = 3.] Time is #D + |P| ≤ |FC(D)|. Just S[x] needs to be decoded!
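The LCP-driven scan over a front-coded dictionary can be sketched as follows. This is a hedged sketch over a generic alphabet (the slide specializes it to binary strings); the representation and names are illustrative. The key invariant is the matched length m = lcp(P, current string): entries whose stored lcp exceeds m are skipped without decoding anything, so the total work is O(#D + |P|).

```python
def fc_locate(P, first, rest):
    """Locate P in a front-coded sorted string set.

    first: the first string, stored in full.
    rest:  list of (lcp_with_previous_string, stored_suffix).
    Returns the index of the first string >= P.
    """
    # compare P against the first string
    m, s = 0, first
    while m < len(P) and m < len(s) and P[m] == s[m]:
        m += 1
    if m == len(P) or (m < len(s) and P[m] < s[m]):
        return 0                      # P <= first string
    for i, (l, suf) in enumerate(rest, start=1):
        if l < m:                     # this string abandons P's matched prefix,
            return i                  # so it is already larger than P
        if l > m:                     # it still agrees with the previous string
            continue                  # beyond P's mismatch: P is still larger
        # l == m: extend the comparison using only the stored suffix
        j = 0
        while m < len(P) and j < len(suf) and P[m] == suf[j]:
            m += 1
            j += 1
        if m == len(P) or (j < len(suf) and P[m] < suf[j]):
            return i                  # P <= current string
        # otherwise P > current string: keep scanning with the updated m
    return len(rest) + 1

# Front coding of ["abc", "abd", "acd", "bcd"]
fc_first, fc_rest = "abc", [(2, "d"), (1, "cd"), (0, "bcd")]
```

Only stored suffixes are ever compared against P, matching the slide's claim that just the candidate string needs to be decoded.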


On larger sets. . . the Patricia Trie Space = Θ(#D) words (Space PT ≈ #D). [Figure: a Patricia trie over strings on the alphabet {A, C, G}; searching P = GCAC ends at the string with max LCP with P, and hence at P's position.] Two-phase search for P = GCAC:
- Phase 1: tree navigation (blind descent)
- Phase 2: compute the LCP with the string reached
- Phase 3: tree navigation to P's position

Only 1 string is checked!
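Phases 1 and 2 can be sketched as follows, on a toy Patricia trie given as nested tuples: an internal node is (prefix_length, children) where prefix_length is the length of the common prefix of the strings below it, and a leaf is the full string. All names are illustrative; the point is that the descent compares only one character per branching node, and the single full comparison at the end yields the maximum LCP of P with the whole set.

```python
def blind_search(P, node):
    """Phases 1-2 of the two-phase search: blind descent comparing only
    branching characters, then one full comparison with the leaf reached.
    Returns (leaf, lcp); lcp is the max LCP of P with any string in the set."""
    while isinstance(node, tuple):                 # internal: (depth, children)
        depth, children = node
        if depth < len(P) and P[depth] in children:
            node = children[P[depth]]              # follow P's character
        else:
            node = next(iter(children.values()))   # blind step: any child works
    m = 0                                          # node is now a leaf string
    while m < len(P) and m < len(node) and P[m] == node[m]:
        m += 1
    return node, m

# Toy trie over {"GCACG", "GCAGA", "GTA"}: all share "G" (but branch at
# position 1), and the first two share "GCA" (branch at position 3)
pt = (1, {"C": (3, {"C": "GCACG", "G": "GCAGA"}), "T": "GTA"})
```

Phase 3 (re-descending to P's lexicographic position using the computed LCP) is omitted here; it reuses the same trie navigation.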

The String B-tree [Ferragina-Grossi, 95] Succinct PTs give a smaller height in practice, but are not opportunistic: Θ(#D log |D|) bits. String B-tree = B-tree + Patricia Tries at the nodes. [Figure: a String B-tree with O(log_B n) levels; each node stores a Patricia trie (PT) over its keys and is traversed with O(p/B) I/Os, returning the lexicographic position of P.] Search(P[1, p]):
- O((p/B) log_B n) I/Os
- O(occ/B) I/Os

It is dynamic. . .

Succinct PT smaller height in practice. . . not opportunistic: (#D log |D|) bits The String B-tree + P[1, p] Search(P) • O((p/B) log. B n) I/Os • O(occ/B) I/Os It is dynamic. . . 13 20 18 PT 29 2 1 9 26 13 20 25 PT 5 2 26 3 O(log. B n) levels 23 PT PT 29 O(p/B) I/Os PT 29 pattern to search 10 4 PT 6 PT 7 13 Lexicographic position of P Paolo Ferragina, Università di Pisa 20 16 28 18 3 14 PT 8 25 6 12 15 22 18 21 23 PT 3 27 24 11 PT 14 21 17 23