Index construction: Compression of postings
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading: 5.3 and a paper

Sec. 3.1: Delta encoding
Resulting gaps: 1 1 2 7 20 14 …
Then you compress the resulting integers with variable-length prefix-free codes, as follows…
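
A minimal sketch of the gap transformation, to make the step concrete. It assumes the docIDs are strictly increasing and that the first docID is stored as its gap from 0; the function names `to_gaps` and `from_gaps` are illustrative and not from the slides.

```python
from itertools import accumulate


def to_gaps(postings):
    """Replace each docID by its distance from the previous one (first one from 0)."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Invert to_gaps: prefix sums of the gaps give back the docIDs."""
    return list(accumulate(gaps))


gaps = [1, 1, 2, 7, 20, 14]      # the gap sequence shown on the slide
postings = from_gaps(gaps)       # [1, 2, 4, 11, 31, 45]
```

The gaps are small where the postings are dense, which is what the variable-length codes exploit.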

γ-code
Layout: (Length − 1) zeros, followed by x in binary.
  • x > 0 and Length = $\lfloor \log_2 x \rfloor + 1$
  • e.g., 9 is represented as 0001001.
  • The γ-code for x takes $2\lfloor \log_2 x \rfloor + 1$ bits (i.e., a factor of 2 from optimal)
  • Optimal for $\Pr(x) = 1/(2x^2)$ and i.i.d. integers
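
The layout above can be sketched in a few lines. This is an illustrative encoder that returns the codeword as a string of '0'/'1' characters for readability; a real index would pack bits. The name `gamma_encode` is an assumption, not something from the slides.

```python
def gamma_encode(x):
    """Gamma code of x > 0: (Length - 1) zeros followed by x in binary."""
    if x <= 0:
        raise ValueError("the gamma code is defined only for x > 0")
    binary = bin(x)[2:]                       # Length = floor(log2 x) + 1 bits
    return "0" * (len(binary) - 1) + binary   # 2 * floor(log2 x) + 1 bits in total


print(gamma_encode(9))   # 0001001, the example from the slide
```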

It is a prefix-free encoding…
  • Given the following sequence of γ-coded integers, reconstruct the original sequence:
    0001000001100110000011101100111
    Answer: 8 6 3 59 7
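
Because no codeword is a prefix of another, a concatenation of γ-codes can be split without separators: count the leading zeros, then read that many bits plus one. The sketch below assumes a well-formed input given as a string of '0'/'1' characters; `gamma_decode_stream` is an illustrative name.

```python
def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codewords into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":        # unary part: Length - 1 zeros
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))   # Length bits of x
        i += zeros + 1
    return values


# Gamma codewords of 8, 6, 3, 59 and 7, concatenated with no separators.
stream = "".join(["0001000", "00110", "011", "00000111011", "00111"])
print(gamma_decode_stream(stream))   # [8, 6, 3, 59, 7]
```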

δ-code
Layout: Length (γ-coded), followed by x in binary.
  • Use γ-coding to reduce the length of the first field
  • Useful for medium-sized integers
  • e.g., 19 is represented as 0010110011.
  • δ-coding x takes about $\log_2 x + 2\log_2(\log_2 x) + 2$ bits.
  • Optimal for $\Pr(x) = 1/(2x(\log x)^2)$ and i.i.d. integers
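
A sketch of the δ-code as the slide's example suggests it: the length of x is γ-coded and then followed by all bits of x (19 becomes 00101 then 10011). Note that the textbook Elias δ-code drops the leading 1 of x and so saves one bit; the variant here keeps it to match the example. Function names are illustrative.

```python
def gamma_encode(x):
    """Gamma code of x > 0: (Length - 1) zeros followed by x in binary."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x):
    """Delta code as in the slide example: gamma(Length) followed by x in binary."""
    if x <= 0:
        raise ValueError("the delta code is defined only for x > 0")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary


def delta_decode(bits):
    """Decode a single delta codeword produced by delta_encode."""
    zeros = bits.index("1")                       # unary part of gamma(Length)
    length = int(bits[zeros:2 * zeros + 1], 2)    # Length itself
    start = 2 * zeros + 1
    return int(bits[start:start + length], 2)


print(delta_encode(19))             # 0010110011
print(delta_decode("0010110011"))   # 19
```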

Variable-byte codes
  • Wish to get very fast (de)compression: byte-align
  • Given a binary representation of an integer:
      • Append 0s to the front, to get a multiple-of-7 number of bits
      • Form groups of 7 bits each
      • Add to the last group the bit 0, and to the other groups the bit 1 (tagging)
  • e.g., v = $2^{14} + 1$, binary(v) = 100000000000001, encoded as 10000001 10000000 00000001 (tag bit written first in each byte)
  • Note: We waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0.
  • T-nibble: We could design this code over t bits, not just t = 8
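
The three steps can be sketched as follows. Two details are assumptions of this sketch rather than statements from the slide: the tag is stored in the high bit of each byte, and the 7-bit groups are written most significant first. The names `vbyte_encode` and `vbyte_decode` are illustrative.

```python
def vbyte_encode(v):
    """Encode a non-negative integer as tagged 7-bit groups, one per byte."""
    if v < 0:
        raise ValueError("only non-negative integers are supported")
    groups = []
    while True:
        groups.append(v & 0x7F)      # peel off the low 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()                 # most significant group first
    out = bytearray(0x80 | g for g in groups[:-1])   # tag 1: more bytes follow
    out.append(groups[-1])                           # tag 0: last byte
    return bytes(out)


def vbyte_decode(data):
    """Decode a concatenation of variable-byte codewords."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:         # tag 0 closes the current integer
            values.append(current)
            current = 0
    return values


code = vbyte_encode(2**14 + 1)
print(" ".join(f"{b:08b}" for b in code))   # 10000001 10000000 00000001
```

Decoding needs only shifts and masks on whole bytes, which is where the speed comes from.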

PForDelta coding
  • Use b (e.g. 2) bits to encode each of 128 numbers, or create exceptions
  • A block of 128 numbers = 256 bits = 32 bytes
  • Translate data: $[\text{base}, \text{base} + 2^b - 2] \to [0, 2^b - 2]$
  • Encode exceptions with the value $2^b - 1$
  • Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions
[Figure: a block of small integers (e.g. 2, 1) packed with b = 2 bits each; the outliers 42 and 23 are stored separately as exceptions]
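
A simplified sketch of the idea, under stated assumptions: slots are kept as a list of small integers rather than packed into b-bit fields, exceptions are stored verbatim in a side list in order of appearance, and the short sample block is hypothetical (a real block would hold 128 numbers). Production PForDelta implementations differ in how they lay out and locate exceptions.

```python
def pfor_encode(block, b, base=0):
    """Map values in [base, base + 2^b - 2] to b-bit slots; escape the rest."""
    escape = (1 << b) - 1            # slot value 2^b - 1 marks an exception
    slots, exceptions = [], []
    for x in block:
        d = x - base
        if 0 <= d < escape:          # fits in [0, 2^b - 2]
            slots.append(d)
        else:
            slots.append(escape)
            exceptions.append(x)     # stored uncompressed, in order
    return slots, exceptions


def pfor_decode(slots, exceptions, b, base=0):
    """Rebuild the block, pulling the next exception at each escape slot."""
    escape = (1 << b) - 1
    pending = iter(exceptions)
    return [next(pending) if s == escape else base + s for s in slots]


block = [2, 42, 2, 1, 1, 2, 2, 23, 1, 2]        # hypothetical sample values
slots, exceptions = pfor_encode(block, b=2)
print([f"{s:02b}" for s in slots])   # '11' marks the two exceptions
print(exceptions)                    # [42, 23]
```

With b = 2 the packed part costs 2 bits per number; raising b would absorb 23 and 42 into the slots at the price of wider slots for every other value.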