Web Algorithmics Dictionary-based compressors

LZ77
Example (two successive steps):
• a a c a b c a a a a c → <6, 3, a>
• a a c a b c a a a a c c → <3, 4, c>
Dictionary: all substrings starting inside the window that precedes the cursor.
Algorithm’s step:
• Output <dist, len, next-char>
• Advance by len + 1
A buffer “window” has fixed length and moves.
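The step above can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration, not the slide's implementation: the function name `lz77_encode`, the `window` parameter, the tie-breaking rule (farthest match wins) and the sample string are all assumptions made for the example.

```python
def lz77_encode(text, window=6):
    """Greedy LZ77: emit (dist, len, next_char) triples (illustrative sketch)."""
    out = []
    cur = 0
    while cur < len(text):
        best_len, best_dist = 0, 0
        # Candidate match starts lie inside the window that precedes the cursor.
        for cand in range(max(0, cur - window), cur):
            length = 0
            # Stop one char early so that a next-char always exists.
            while (cur + length < len(text) - 1
                   and text[cand + length] == text[cur + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, cur - cand
        out.append((best_dist, best_len, text[cur + best_len]))
        cur += best_len + 1  # advance by len + 1
    return out


# Hypothetical sample input, chosen only for illustration.
print(lz77_encode("aacaacabcabaaac"))
```

A match is allowed to run past the cursor (len > dist), which is why the inner loop indexes `text` directly instead of a separate buffer.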

LZ77 Decoding
Decoder keeps same dictionary window as encoder.
• Finds substring and inserts a copy of it
What if len > d? (The copy overlaps the text that is still to be decoded.)
• E.g. seen = abcd, next codeword is (2, 9, e)
• Simply copy starting at the cursor, one char at a time:
  for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i];
• Output is correct: abcdcdcdcdcdce
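A minimal Python sketch of the decoder, assuming codewords are (dist, len, next_char) triples and that a literal is written as (0, 0, char); the name `lz77_decode` is hypothetical. Copying one character at a time is what makes the overlapping case work.

```python
def lz77_decode(triples):
    """Decode (dist, len, next_char) triples; handles len > dist overlaps."""
    out = []
    for dist, length, next_char in triples:
        start = len(out) - dist
        # Char-by-char copy: chars appended here may be read again
        # later in the same loop when the copy overlaps the cursor.
        for i in range(length):
            out.append(out[start + i])
        out.append(next_char)
    return "".join(out)


seen = [(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d")]
print(lz77_decode(seen + [(2, 9, "e")]))  # abcdcdcdcdcdce
```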

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings. The differences are:
• How the dictionary is stored
• How it is extended
• How it is indexed
• How elements are removed
No explicit frequency estimation.
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞.

You find this at: www.gzip.org/zlib/

LZ77 Optimizations used by gzip
• LZSS: output one of the two formats (0, position, length) or (1, char). Typically uses the second format if length < 3.
• Special greedy: possibly use a shorter match so that the next match is better.
• Hash table to speed up searches on triplets.
• Triples are coded with Huffman’s code.
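The LZSS output format can be illustrated with a small Python sketch. It covers only the two-format rule; the hash table, the "special greedy" heuristic and the Huffman stage are left out. The name `lzss_encode`, the `window` and `min_len` parameters, and the reading of "position" as a backward distance are assumptions of this example, not gzip's actual code.

```python
def lzss_encode(text, window=4096, min_len=3):
    """Emit (0, distance, length) for long matches, (1, char) otherwise."""
    out = []
    cur = 0
    while cur < len(text):
        best_len, best_dist = 0, 0
        # Brute-force search; gzip speeds this up with a hash table on triplets.
        for cand in range(max(0, cur - window), cur):
            length = 0
            while (cur + length < len(text)
                   and text[cand + length] == text[cur + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, cur - cand
        if best_len >= min_len:
            out.append((0, best_dist, best_len))  # copy format
            cur += best_len
        else:
            out.append((1, text[cur]))            # literal format
            cur += 1
    return out


print(lzss_encode("abcabcabcx"))
```

Matches shorter than `min_len` are emitted as literals because a copy token would cost more than the characters it replaces.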

Some special compressors: Spatial vs Temporal Locality

γ-code for integer encoding
• γ(x) = (Length - 1) zeros followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1; e.g., 9 is represented as <000, 1001>.
• The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
• Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.
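As a quick check of the definition, here is a minimal Python sketch of the encoder; the function name `gamma_encode` and the use of a bit string (rather than packed bits) are choices made for readability.

```python
def gamma_encode(x):
    """Return the gamma-code of x > 0 as a string of '0'/'1' characters."""
    if x <= 0:
        raise ValueError("gamma-code is defined only for x > 0")
    binary = bin(x)[2:]                       # x in binary, Length bits
    return "0" * (len(binary) - 1) + binary   # Length - 1 zeros, then x


print(gamma_encode(9))  # 0001001
```

The code for 9 has 2 * 3 + 1 = 7 bits, matching the bound stated above.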

It is a prefix-free encoding…
• Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
  Answer: 8, 6, 3, 59, 7
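Decoding needs no separators precisely because the code is prefix-free: count the leading zeros, then read that many bits plus one. A minimal Python sketch, assuming a well-formed input bit string (the name `gamma_decode` is hypothetical):

```python
def gamma_decode(bits):
    """Decode a concatenation of gamma-codes given as a '0'/'1' string."""
    out = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":      # unary part: Length - 1 zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))  # binary part: Length bits
        i += zeros + 1
    return out


print(gamma_decode("0001000001100110000011101100111"))  # [8, 6, 3, 59, 7]
```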

Streaming compression
Still you need to determine and sort all terms… Can we do everything in one pass?
• Move-to-Front (MTF):
  - as a freq-sorting approximator
  - as a caching strategy
  - as a compressor
• Run-Length Encoding (RLE):
  - FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
• Start with the list of symbols L = [a, b, c, d, …]
• For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L
Properties:
• There is a memory
• It exploits temporal locality, and it is dynamic
• X = 1^n 2^n 3^n … n^n ⇒ Huff = O(n^2 log n), MTF = O(n log n) + n^2
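The two steps translate almost directly into Python. This sketch uses 0-based positions and a plain list, which costs O(|L|) per symbol; the names `mtf_encode` and `mtf_decode` and the sample input are assumptions made for the example.

```python
def mtf_encode(text, symbols):
    """Move-to-Front: output 0-based positions in a self-organizing list."""
    lst = list(symbols)
    out = []
    for s in text:
        pos = lst.index(s)            # 1) output the position of s in L
        out.append(pos)
        lst.insert(0, lst.pop(pos))   # 2) move s to the front of L
    return out


def mtf_decode(codes, symbols):
    """Inverse transform: replay the same list updates."""
    lst = list(symbols)
    out = []
    for pos in codes:
        s = lst.pop(pos)
        out.append(s)
        lst.insert(0, s)
    return "".join(out)


print(mtf_encode("aabbbcd", "abcd"))  # [0, 0, 1, 0, 0, 2, 3]
```

Runs of equal symbols become runs of zeros, which is the temporal locality that a later variable-length coder exploits.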

MTF: how good is it?
Not much worse than Huffman... but it may be far better.
• Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
• Put Σ at the front of the sequence and consider the cost of encoding; the bound then follows by Jensen’s inequality.
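The formulas of this slide are not present in the text, so what follows is only a sketch of the standard argument, under assumed notation: N = |X|, n_x is the number of occurrences of symbol x in X, g^x_i is the distance between the i-th occurrence of x and the previous occurrence of x in ΣX, and H_0 is the empirical zero-order entropy. It relies on two facts: the MTF position of an occurrence is at most its gap g^x_i, and the gaps of one symbol sum to at most N + |Σ|.

```latex
\begin{align*}
\mathrm{cost}(\Sigma X)
 &\le O(|\Sigma|\log|\Sigma|) + \sum_{x\in\Sigma}\sum_{i=1}^{n_x} \bigl|\gamma(g^x_i)\bigr| \\
 &\le O(|\Sigma|\log|\Sigma|) + \sum_{x\in\Sigma}\sum_{i=1}^{n_x} \bigl[\,2\log_2 g^x_i + 1\,\bigr] \\
 &\le O(|\Sigma|\log|\Sigma|) + \sum_{x\in\Sigma} n_x\Bigl[\,2\log_2\frac{N+|\Sigma|}{n_x} + 1\Bigr]
   && \text{(Jensen, concavity of } \log\text{)} \\
 &= O(|\Sigma|\log|\Sigma|) + N\bigl[\,2H_0(X) + 1\,\bigr] + 2N\log_2\Bigl(1+\frac{|\Sigma|}{N}\Bigr) \\
 &= O(|\Sigma|\log|\Sigma|) + N\bigl[\,2H_0(X) + 1\,\bigr]
\end{align*}
```

The last step uses N log2(1 + |Σ|/N) = O(|Σ|), and H_0(X) = Σ_x (n_x/N) log2(N/n_x). Since Huffman needs at least N·H_0(X) bits, MTF followed by γ-coding is within a constant factor of it, up to the additive alphabet term.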

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a, 1), (b, 3), (a, 2), (c, 4), (a, 1)
In case of binary strings: just numbers and one bit.
Properties:
• There is a memory
• It exploits spatial locality, and it is a dynamic code
• X = 1^n 2^n 3^n … n^n ⇒ Huff(X) = O(n^2 log n) > Rle(X) = O(n (1 + log n))
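The example above can be reproduced with a short Python sketch built on `itertools.groupby`; the function names `rle_encode` and `rle_decode` are hypothetical, and the run lengths are left as plain integers rather than being γ-coded.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each maximal run of equal chars into a (char, length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Expand (char, length) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)


print(rle_encode("abbbaacccca"))
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```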

Burrows-Wheeler Transform

The big (unconscious) step...

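The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#
Write down all its cyclic rotations:
  mississippi#  ississippi#m  ssissippi#mi  sissippi#mis
  issippi#miss  ssippi#missi  sippi#missis  ippi#mississ
  ppi#mississi  pi#mississip  i#mississipp  #mississippi
Sort the rows:
  F  (middle)    L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i
The last column L is the transform of T.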
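The construction above can be replayed literally in Python: build every rotation, sort, and read off the first and last columns. This is a sketch for small inputs only (it materializes n strings of length n); the name `bwt` is hypothetical, and it assumes the terminator `#` sorts before every other character, which holds for lowercase ASCII text.

```python
def bwt(text):
    """Return (F, L): first and last columns of the sorted rotation matrix."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


F, L = bwt("mississippi#")
print(F)  # #iiiimppssss
print(L)  # ipssm#pissii
```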

A famous example: much longer...

Compressing L seems promising...
Key observation:
• L is locally homogeneous ⇒ L is highly compressible
Algorithm Bzip:
• Move-to-Front coding of L
• Run-Length coding
• Statistical coder
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression!

How to compute the BWT?
  SA  BWT matrix   L
  12  #mississipp  i
  11  i#mississip  p
   8  ippi#missis  s
   5  issippi#mis  s
   2  ississippi#  m
   1  mississippi  #
  10  pi#mississi  p
   9  ppi#mississ  i
   7  sippi#missi  s
   4  sissippi#mi  s
   6  ssippi#miss  i
   3  ssissippi#m  i
We said that: L[i] precedes F[i] in T, e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] - 1] (and L[i] = T[n] when SA[i] = 1).
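A minimal Python sketch of this formula, using 0-based indices instead of the slide's 1-based ones; the name `bwt_from_sa` is hypothetical. In Python the wrap-around case comes for free, because index -1 denotes the last character.

```python
def bwt_from_sa(text, sa):
    """L[i] = T[SA[i] - 1] with 0-based indices; text[-1] handles SA[i] = 0."""
    return "".join(text[start - 1] for start in sa)


T = "mississippi#"
SA = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]   # the slide's SA, shifted to 0-based
print(bwt_from_sa(T, SA))  # ipssm#pissii
```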

How to construct SA from T?
Input: T = mississippi#
  SA  suffix
  12  #
  11  i#
   8  ippi#
   5  issippi#
   2  ississippi#
   1  mississippi#
  10  pi#
   9  ppi#
   7  sippi#
   4  sissippi#
   6  ssippi#
   3  ssissippi#
Sorting the suffixes directly is elegant but inefficient. Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
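The "elegant but inefficient" construction is a one-liner in Python: sort the suffix start positions, comparing the suffixes themselves. This sketch (hypothetical name `naive_suffix_array`, 0-based positions) has exactly the drawback listed above, since each comparison may scan up to n characters.

```python
def naive_suffix_array(text):
    """Sort suffix start positions by the suffix they denote (0-based)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = naive_suffix_array("mississippi#")
print([i + 1 for i in sa])  # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```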

A useful tool: the L → F mapping
F = # i i i i m p p s s s s (first column), L = i p s s m # p i s s i i (last column); the rest of the matrix is unknown.
How do we map L’s chars onto F’s chars? ... We need to distinguish equal chars in F...
• Take two equal chars of L
• Rotate their rows rightward
• Same relative order!!
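Because equal characters keep the same relative order in L and in F, the mapping can be computed by counting alone. A minimal Python sketch, with 0-based rows and a hypothetical name `lf_mapping`: the k-th occurrence of a char c in L maps to the k-th row of F that starts with c.

```python
from collections import Counter


def lf_mapping(last):
    """LF[i] = row of F holding the same text character as L[i]."""
    counts = Counter(last)
    first_row = {}            # first row of F that starts with each char
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    seen = Counter()          # occurrences of each char met so far in L
    lf = []
    for ch in last:
        lf.append(first_row[ch] + seen[ch])
        seen[ch] += 1
    return lf


print(lf_mapping("ipssm#pissii"))
# [1, 6, 8, 9, 5, 0, 7, 2, 10, 11, 3, 4]
```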

The BWT is invertible
F = # i i i i m p p s s s s, L = i p s s m # p i s s i i; the rest of the matrix is unknown.
Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[i] precedes F[i] in T
Reconstruct T backward: T = ... i p p i #
InvertBWT(L)
  Compute LF[0, n];
  r = 0; i = n;
  while (i > 0) { T[i] = L[r]; r = LF[r]; i--; }
(Here T[1, n] is the text without its final #, and row 0 is the one starting with #, so L[0] is the last char of T.)
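A runnable Python version of the backward reconstruction, as a sketch: it assumes a unique terminator that sorts before all other characters (so row 0 starts with it), builds LF with a stable sort of L's positions, and uses the hypothetical name `inverse_bwt`.

```python
def inverse_bwt(last, end="#"):
    """Rebuild T from L by walking the LF-mapping backward from row 0."""
    n = len(last)
    # Stable sort of L's positions by char lists them in F's row order,
    # so equal chars keep their relative order (the LF property).
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    out = [end]               # T ends with the terminator
    r = 0                     # row 0 starts with the terminator
    for _ in range(n - 1):
        out.append(last[r])   # L[r] precedes F[r] in T
        r = lf[r]
    return "".join(reversed(out))


print(inverse_bwt("ipssm#pissii"))  # mississippi#
```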

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii   (# at position 16)
Mtf-list = [i, m, p, s]
Mtf  = 020030000030030200300300000100000
Mtf’ = 030040000040040300400400000200000   (non-zero values shifted by 1)
RLE0 = 03141041403141410210   (Wheeler’s code: a run of k zeros becomes Bin(k+1) without its leading 1, e.g. a run of 5: Bin(6) = 110 → 10)
Alphabet of size |Σ|+1.
Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols... plus γ(16), plus the original Mtf-list (i, m, p, s)

You find this in your Linux distribution