Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011.
What if you get the sequence 1011? It can be parsed as a b a, as a d, or as c a, so this code is ambiguous.
A code is uniquely decodable if every sequence of codewords can be decomposed into codewords in exactly one way.
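The ambiguity of 1011 can be checked mechanically. The sketch below is only an illustration: the helper name `parses` is hypothetical, and it simply enumerates every split of a bit string into codewords by brute force.

```python
def parses(bits, code):
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [[]]          # the empty string has exactly one (empty) parse
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Try this codeword first, then parse whatever remains.
            for rest in parses(bits[len(word):], code):
                results.append([symbol] + rest)
    return results


code = {"a": "1", "b": "01", "c": "101", "d": "011"}
for parse in parses("1011", code):
    print(" ".join(parse))
```

More than one printed parse means the sequence, and hence the code, is not uniquely decodable.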
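Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11.
It can be viewed as a binary trie: from the root, edge 0 leads to leaf a; edge 1 leads to an internal node whose edge 1 leads to leaf d and whose edge 0 leads to a node with leaves b (edge 0) and c (edge 1).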
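Because no codeword is a prefix of another, a decoder can emit a symbol as soon as the bits read so far match a codeword. A minimal sketch of this left-to-right decoding follows; the function name `decode_prefix` is an assumption, and a dictionary lookup stands in for walking the trie.

```python
def decode_prefix(bits, code):
    """Decode `bits` with a prefix code given as {symbol: codeword}."""
    inverse = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:      # reached a leaf of the trie
            out.append(inverse[current])
            current = ""            # restart from the root
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


code = {"a": "0", "b": "100", "c": "101", "d": "11"}
print(decode_prefix("010010111", code))  # abcd
```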
Average Length
For a code C with codeword lengths L[s], the average length is defined as La(C) = sum over all symbols s of p(s) * L[s].
e.g. p(A) = .7 [codeword 0], p(B) = p(C) = p(D) = .1 [codewords 1--, three bits each]
La = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits)
We say that a prefix code C is optimal if, for all prefix codes C', La(C) <= La(C').
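The two averages quoted above can be reproduced with a few lines of code. This is a sketch under assumptions: the helper names are hypothetical, and the Huffman construction is the standard greedy merge of the two least probable subtrees, tracking only codeword lengths.

```python
import heapq


def huffman_lengths(probs):
    """Return {symbol: codeword length} of a Huffman code for `probs`."""
    # Heap items: (probability, tie breaker, symbols in the subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:           # merged symbols move one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


def average_length(probs, lengths):
    """La = sum over symbols of p(s) * L[s]."""
    return sum(probs[s] * lengths[s] for s in probs)


probs = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
fixed = {"A": 1, "B": 3, "C": 3, "D": 3}
print(round(average_length(probs, fixed), 2))                   # 1.6
print(round(average_length(probs, huffman_lengths(probs)), 2))  # 1.5
```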
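Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits.
Lower probability means higher information.
Entropy is the weighted average of i(s): H(S) = sum over all symbols s of p(s) * log2(1/p(s)).
0 <= H <= log2 |S|, where |S| is the number of distinct symbols: H -> 0 for a skewed distribution, H is maximum for the uniform distribution.
0-th order empirical entropy of a string T: H0(T) = sum over all symbols s of (occ(s)/|T|) * log2(|T|/occ(s)), where occ(s) is the number of occurrences of s in T.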
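As a small numeric check of these definitions, the sketch below computes the entropy of a probability distribution and the 0-th order empirical entropy of a string, where symbol frequencies play the role of probabilities. The function names are assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def entropy(probs):
    """H = sum of p * log2(1/p) over the non-zero probabilities."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


def empirical_entropy(text):
    """0-th order empirical entropy: frequencies in `text` replace p(s)."""
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * log2(n / c) for c in counts.values())


print(round(entropy([0.7, 0.1, 0.1, 0.1]), 2))       # 1.36 (skewed)
print(entropy([0.25] * 4))                           # 2.0 (uniform, log2 of 4)
print(round(empirical_entropy("AAAAAAABCD"), 2))     # 1.36
```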
Performance: Compression ratio = #bits in output / #bits in input
Compression performance: we relate entropy to compression ratio.
- Shannon: entropy H vs. average codeword length
- In practice: empirical H vs. compression ratio
p(A) = .7, p(B) = p(C) = p(D) = .1, H ≈ 1.36 bits
An optimal code is surely one that…
Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper
γ-code for integer encoding
Encode x > 0 as (Length - 1) zeros followed by x in binary, where Length = floor(log2 x) + 1,
e.g., 9 is represented as <000, 1001>.
- The γ-code for x takes 2 floor(log2 x) + 1 bits (i.e. a factor of 2 from optimal)
- Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8 6 3 59 7
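The exercise can be replayed in code. The sketch below is a minimal illustration that keeps bits as a Python string for readability; a real implementation would pack them into machine words. The function names are assumptions, and the decoder expects a well-formed input.

```python
def gamma_encode(x):
    """(Length - 1) zeros followed by the binary representation of x."""
    if x <= 0:
        raise ValueError("the gamma code is defined for x > 0 only")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codewords into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":        # the unary part gives Length - 1
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


stream = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(gamma_encode(9))       # 0001001
print(gamma_decode(stream))  # [8, 6, 3, 59, 7]
```

Since the code is prefix-free, no separators are needed between consecutive codewords.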
δ-code for integer encoding
- Use γ-coding to reduce the length of the first field
- Useful for medium-sized integers
e.g., 19 is represented as <00, 101, 10011>.
- δ-coding x takes about log2 x + 2 log2(log2 x) + 2 bits
- Optimal for Pr(x) = 1/(2x (log x)^2), and i.i.d. integers
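A sketch of the scheme as shown in the example for 19: the length of the binary representation is γ-coded, then the full binary representation follows. Note that the textbook Elias δ-code usually drops the leading 1 of the binary part; the version below keeps it, to match the example. Function names are assumptions.

```python
def gamma_encode(x):
    """(Length - 1) zeros followed by the binary representation of x."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x):
    """Gamma-code the length of binary(x), then append binary(x) itself."""
    if x <= 0:
        raise ValueError("the delta code is defined for x > 0 only")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary


def delta_decode(bits):
    """Decode a concatenation of delta codewords (well-formed input assumed)."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # unary part of the gamma code
            zeros += 1
            i += 1
        length = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        values.append(int(bits[i:i + length], 2))
        i += length
    return values


print(delta_encode(19))  # 0010110011, i.e. <00, 101, 10011>
```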
Variable-byte codes [10.2 bits per TREC 12]
Wish to get very fast (de)compression: byte-align.
Given the binary representation of an integer:
- Append 0s to the front, to get a multiple-of-7 number of bits
- Form groups of 7 bits each
- Append to the last group the bit 0, and to the other groups the bit 1 (tagging)
e.g., v = 2^14 + 1, binary(v) = 100000000000001, which splits into the 7-bit groups 0000001 0000000 0000001.
Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
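The three steps can be sketched as follows. One assumption is made loudly here: the tag is placed in the most significant bit of each byte, which the text does not specify. The function names are also hypothetical.

```python
def vbyte_encode(v):
    """Split v into 7-bit groups; tag 1 on every byte except the last."""
    if v < 0:
        raise ValueError("only non-negative integers are supported")
    groups = []
    while True:
        groups.append(v & 0x7F)      # lowest 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()                 # most significant group first
    # Assumption: the tag bit is the high bit (0x80) of each byte.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vbyte_decode(data):
    """Decode a concatenation of variable-byte codewords."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:         # tag 0 marks the last byte of a value
            values.append(current)
            current = 0
    return values


encoded = vbyte_encode(2**14 + 1)
print(" ".join(format(b, "08b") for b in encoded))  # 10000001 10000000 00000001
```

Decoding touches whole bytes only, which is where the speed advantage over bit-aligned codes comes from.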
PForDelta coding
Use b (e.g. 2) bits to encode each of 128 numbers, or create exceptions.
Example with b = 2: the values 3 42 2 3 3 1 1 … 3 3 23 1 2 are stored as the 2-bit codes 11 10 11 11 01 01 … 11 11 01 10, while 42 and 23 are stored separately as exceptions.
A block of 128 numbers = 256 bits = 32 bytes.
- Translate data: [base, base + 2^b - 1] -> [0, 2^b - 1]
- Encode exceptions: ESC or pointers
- Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions
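A simplified sketch of the idea, not the actual PForDelta layout: values inside [base, base + 2^b - 1] become b-bit slots, and everything else is recorded as an exception. Keeping exceptions as (position, value) pairs with a placeholder slot is an assumption made for readability, as are all function names and the `coverage` parameter.

```python
def pfor_encode(block, b, base=0):
    """Return (slots, exceptions) for one block, using b bits per slot."""
    limit = 1 << b
    slots, exceptions = [], []
    for position, value in enumerate(block):
        offset = value - base
        if 0 <= offset < limit:
            slots.append(offset)                  # fits in b bits
        else:
            slots.append(0)                       # placeholder slot
            exceptions.append((position, value))  # stored separately
    return slots, exceptions


def pfor_decode(slots, exceptions, base=0):
    """Rebuild the block, then patch the exceptions back in."""
    block = [base + s for s in slots]
    for position, value in exceptions:
        block[position] = value
    return block


def choose_b(block, base=0, coverage=0.9, max_bits=32):
    """Smallest b whose range covers at least `coverage` of the block."""
    for b in range(1, max_bits + 1):
        covered = sum(0 <= v - base < (1 << b) for v in block)
        if covered >= coverage * len(block):
            return b
    return max_bits


block = [3, 42, 2, 3, 3, 1, 1, 3, 3, 23, 1, 2]
slots, exceptions = pfor_encode(block, b=2)
print(exceptions)                              # [(1, 42), (9, 23)]
print(pfor_decode(slots, exceptions) == block)  # True
```

Lowering `coverage` in `choose_b` yields a smaller b and more exceptions, which is the trade-off described above.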
A basic problem!
T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#...
- Array of pointers
- (log m) bits per string, (n log m) bits overall
- We could drop the separating NULL
Space = 32 * n bits
- Independent of the string-length distribution
- Good: it is effective for few strings
- Bad: it is poor for medium/large sets of strings
A basic problem!
Strings: T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#..
  X = AbacoBattleCarColdCodDefenseGoogleYahoo..
  B = 100000100100000010000..
Integers: T = 10#2#5#6#20#31#3#3#..
  X = 10101110101011111.. (we could drop the msb)
  B = 100010100100100001010..
X concatenates the items without separators, and the bit vector B marks where each item starts.
We aim at achieving ≈ n log(m/n) bits.
Rank/Select
B = 001010101011111110000011010101..
- Rank_b(i) = number of b in B[1, i], e.g. Rank_1(6) = 2
- Select_b(i) = position of the i-th b in B, e.g. Select_1(3) = 7
m = |B|, n = #1
There exist data structures that solve this problem in O(1) query time and n log(m/n) + o(m) bits of space.
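To pin down the two definitions, here is a naive linear-time sketch with 1-based positions. It is not the O(1) succinct structure mentioned above, and both the function names and the short bit vector are hypothetical, chosen only for illustration.

```python
def rank(bits, b, i):
    """Number of occurrences of bit b in bits[1, i] (1-based, inclusive)."""
    return bits[:i].count(b)


def select(bits, b, i):
    """1-based position of the i-th occurrence of bit b in bits."""
    seen = 0
    for position, bit in enumerate(bits, start=1):
        if bit == b:
            seen += 1
            if seen == i:
                return position
    raise ValueError("fewer than i occurrences of b")


B = "0010100101"           # hypothetical example vector
print(rank(B, "1", 6))     # 2
print(select(B, "1", 3))   # 8
```

The two operations are inverse to each other in the sense that rank(B, b, select(B, b, i)) returns i.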
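Elias-Fano useful for Rank/Select
Each position is split into its z most significant bits and its w least significant bits (z = 3, w = 2 in the example). The low parts are stored verbatim in L; the high parts are encoded in unary in H, one bucket per high value 0, 1, ..., 7.
If w = log(m/n) and z = log n, where m = |B| and n = #1, then
- L takes n log(m/n) bits
- H takes 2n bits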
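A minimal sketch of this layout, under stated assumptions: the input is a non-empty, increasing list of positions in [0, m); H writes, for every bucket, one 1 per element followed by a closing 0 (other unary conventions exist); and lookups scan H linearly instead of using a real Select structure. All names and the sample positions are hypothetical.

```python
from math import floor, log2


def ef_encode(values, m):
    """Return (w, lows, high) for sorted values drawn from [0, m)."""
    n = len(values)
    w = max(0, floor(log2(m / n)))            # number of low bits per element
    lows = [x & ((1 << w) - 1) for x in values]
    buckets = ((m - 1) >> w) + 1              # number of distinct high parts
    counts = [0] * buckets
    for x in values:
        counts[x >> w] += 1
    # Unary per bucket: one 1 per element, then a closing 0.
    high = "".join("1" * c + "0" for c in counts)
    return w, lows, high


def ef_access(w, lows, high, i):
    """Return the i-th smallest value (0-based)."""
    ones = -1
    for position, bit in enumerate(high):
        if bit == "1":
            ones += 1
            if ones == i:
                bucket = position - i         # zeros seen before this 1
                return (bucket << w) | lows[i]
    raise IndexError("i is out of range")


values = [1, 4, 7, 18, 24, 26, 30, 31]        # n = 8 positions, m = 32
w, lows, high = ef_encode(values, 32)
print(w, high)                                # 2 1011000100110110
print([ef_access(w, lows, high, i) for i in range(len(values))] == values)  # True
```

With n = 8 and m = 32 this gives w = 2 and z = 3, L holds 8 * 2 = 16 bits and H holds 2n = 16 bits, in line with the bounds above.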
If you wish to play with Rank and Select (a next lecture…)
- Space m/10 + n log(m/n) bits: Rank in 0.4 msec, Select in < 1 msec
- For binary search, cfr. 2n + n log(m/n) bits using only Select, vs. 32n bits of explicit pointers