Index construction: Compression of postings
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

Sec. 3.1 Delta encoding
• Store each posting list as the sequence of gaps between consecutive postings, e.g. 1 1 2 7 20 14 …
• Then compress the resulting integers with variable-length prefix-free codes, as follows…
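
The gap transformation can be sketched in a few lines. The posting list below is hypothetical: it was chosen only so that its gaps match the sequence 1 1 2 7 20 14 on the slide, and the function names are illustrative.

```python
def to_gaps(postings):
    """Turn a strictly increasing posting list into its gap sequence."""
    gaps = []
    prev = 0
    for doc_id in postings:
        gaps.append(doc_id - prev)   # first gap is the first posting itself
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    postings = []
    total = 0
    for g in gaps:
        total += g
        postings.append(total)
    return postings


# Hypothetical postings (not from the slides), picked to reproduce the gaps.
postings = [1, 2, 4, 11, 31, 45]
print(to_gaps(postings))             # [1, 1, 2, 7, 20, 14]
print(from_gaps(to_gaps(postings)))  # [1, 2, 4, 11, 31, 45]
```

Small gaps are what make the variable-length codes of the following slides pay off.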

Variable-byte codes
• Goal: very fast (de)compression, hence byte-aligned codes
• Given the binary representation of an integer:
  • Prepend 0s to get a multiple-of-7 number of bits
  • Form groups of 7 bits each
  • Tag each group with one extra bit: 0 for the last group, 1 for the other groups
• e.g., v = 2^14 + 1, binary(v) = 100000000000001, encoded as 10000001 10000000 00000001
• Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
• T-nibble: we could design this code over t bits, not just t = 8
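
A minimal sketch of the tagging rule above, assuming the tag is the leading bit of each byte (function names are illustrative, not from the slides):

```python
def vbyte_encode(v):
    """Encode a non-negative integer: 7 data bits per byte,
    tag bit 1 on every byte except the last, which gets tag bit 0."""
    groups = [v & 0x7F]              # least-significant 7 bits
    v >>= 7
    while v > 0:
        groups.append(v & 0x7F)
        v >>= 7
    groups.reverse()                 # most-significant group first
    out = [0x80 | g for g in groups[:-1]]   # tag 1: more bytes follow
    out.append(groups[-1])                   # tag 0: last byte
    return bytes(out)


def vbyte_decode(data):
    """Decode a concatenation of variable-byte codes."""
    values = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:       # tag 0 closes the current integer
            values.append(current)
            current = 0
    return values


code = vbyte_encode(2**14 + 1)
print(" ".join(format(b, "08b") for b in code))  # 10000001 10000000 00000001
print(vbyte_decode(code + vbyte_encode(0)))      # [16385, 0]
```

Decoding needs only shifts and masks on whole bytes, which is where the speed comes from.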

PForDelta coding
• Use b (e.g. 2) bits to encode each of 128 numbers, or create exceptions
• A block of 128 numbers = 256 bits = 32 bytes
• Translate data: [base, base + 2^b - 2] → [0, 2^b - 2]
• Encode exceptions with the value 2^b - 1
• Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions
[Figure: a sample block packed into 2-bit codes, with the exceptions 42 and 23 stored after the codes.]
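
The translate-and-escape idea can be sketched as follows. This is a simplified illustration, not a full PForDelta implementation: codes are kept as Python integers instead of being packed into b-bit fields, exceptions are kept in a plain side list, and the sample block is hypothetical.

```python
def pfor_encode(block, b):
    """Return (base, codes, exceptions) for one block of integers."""
    base = min(block)
    escape = (1 << b) - 1            # the code 2^b - 1 marks an exception
    codes, exceptions = [], []
    for x in block:
        shifted = x - base
        if shifted < escape:         # fits in [0, 2^b - 2]
            codes.append(shifted)
        else:
            codes.append(escape)
            exceptions.append(x)     # stored verbatim, in order of appearance
    return base, codes, exceptions


def pfor_decode(base, codes, exceptions, b):
    escape = (1 << b) - 1
    pending = iter(exceptions)
    return [next(pending) if c == escape else base + c for c in codes]


block = [2, 42, 2, 1, 1, 23, 1]      # hypothetical sample values
base, codes, exceptions = pfor_encode(block, b=2)
print(base, codes, exceptions)       # 1 [1, 3, 1, 0, 0, 3, 0] [42, 23]
print(pfor_decode(base, codes, exceptions, b=2) == block)  # True
```

Raising b shrinks the exception list but widens every code, which is the trade-off named in the last bullet.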

γ-code
• Layout: (Length - 1) 0s, followed by x in binary
• x > 0 and Length = ⌊log2 x⌋ + 1
• e.g., 9 is represented as 0001001
• The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
• Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
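
As a small illustration, the encoder below writes the code as a string of '0'/'1' characters for readability; a real implementation would pack bits. The function name is illustrative.

```python
def gamma_encode(x):
    """Gamma code of x > 0: (Length - 1) zeros, then x in binary."""
    if x <= 0:
        raise ValueError("the gamma code is defined for x > 0 only")
    binary = bin(x)[2:]              # x in binary, Length = len(binary)
    return "0" * (len(binary) - 1) + binary


print(gamma_encode(9))       # 0001001
print(len(gamma_encode(9)))  # 7, i.e. 2 * floor(log2 9) + 1
```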

It is a prefix-free encoding…
• Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
• Answer: 8 6 3 59 7
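
Because the code is prefix-free, a concatenation of γ-codes can be decoded left to right with no separators. A minimal sketch, assuming a well-formed input given as a string of '0'/'1' characters (helper names are illustrative):

```python
def gamma_encode(x):
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":        # count the leading zeros: Length - 1
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))  # next Length bits are x
        i += zeros + 1
    return values


stream = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(stream)                    # 0001000001100110000011101100111
print(gamma_decode_all(stream))  # [8, 6, 3, 59, 7]
```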

δ-code
• Layout: γ(Length), followed by x in binary
• Use γ-coding to reduce the length of the first field
• Useful for medium-sized integers
• e.g., 19 is represented as 0010110011
• δ-coding x takes about log2 x + 2 log2(log2 x) + 2 bits
• Optimal for Pr(x) = 1/(2x (log x)^2), and i.i.d. integers
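
A minimal sketch of the δ-code in the variant shown by the example above, where x is written in full (leading 1 included) after γ(Length). Bits are again a '0'/'1' string and the names are illustrative.

```python
def gamma_encode(x):
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x):
    """Delta code of x > 0: gamma(Length), then x in binary."""
    if x <= 0:
        raise ValueError("the delta code is defined for x > 0 only")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary


def delta_decode(bits, pos=0):
    """Decode one delta code starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos] == "0":
        zeros += 1
        pos += 1
    length = int(bits[pos:pos + zeros + 1], 2)   # gamma-decoded Length
    pos += zeros + 1
    return int(bits[pos:pos + length], 2), pos + length


print(delta_encode(19))            # 0010110011
print(delta_decode("0010110011"))  # (19, 10)
```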

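Elias-Fano
• Split each number into its w low bits, stored verbatim in L, and its z high bits, stored in unary in H
• If w = log(m/n) and z = log n, where m = |B| and n = #1s, then
  • L takes n w = n log(m/n) bits
  • H takes n 1s + n 0s = 2n bits
• How to get the i-th number? Take the i-th group of w bits in L, then represent the value (Select_1(H, i) - i) in z bits
[Figure: example with z = 3 and w = 2; the high parts are written in unary into H, over buckets 0 to 7, and queried with Select_1 on H.]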
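
A minimal sketch of the encoding and of the access rule, under two assumptions that the lost figure would have pinned down: each bucket of H is written as one 1 per element followed by a closing 0, and positions in H are 1-based. The sample values are hypothetical, chosen to fit z = 3 and w = 2; Select_1 is a linear scan here, not a constant-time structure.

```python
def ef_encode(values, w, z):
    """Encode a sorted list of (w + z)-bit integers as (L, H)."""
    low_mask = (1 << w) - 1
    L = [v & low_mask for v in values]       # w low bits of each value
    counts = [0] * (1 << z)                  # elements per high-part bucket
    for v in values:
        counts[v >> w] += 1
    H = "".join("1" * c + "0" for c in counts)  # unary: 1s, then a closing 0
    return L, H


def select1(bits, i):
    """1-based position of the i-th 1 (linear scan, for illustration only)."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError("fewer than i ones")


def ef_access(L, H, w, i):
    """Return the i-th value (1-based): high part from H, low part from L."""
    high = select1(H, i) - i                 # number of 0s before the i-th 1
    return (high << w) | L[i - 1]


values = [1, 4, 7, 18, 24, 26, 30, 31]       # hypothetical sorted sample
L, H = ef_encode(values, w=2, z=3)
print(H)                                     # 1011000100110110
print([ef_access(L, H, 2, i) for i in range(1, len(values) + 1)])
```

With n = 8 values, H has 8 ones and 8 zeros, matching the 2n-bit bound.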

Rank and Select data structures

A basic problem!
• D = Abaco, Battle, Car, Cold, Cod, ..
• Option 1: an array of n pointers to strings of total length m
  • takes n log m bits = 32n bits
  • it depends on the number of strings
  • it is independent of string length
• Option 2: a bit vector B that marks where each string starts in D
  D = Abaco Battle Car Cold Cod ..
  B = 10000 100000 100 1000 100 ..
  (Spaces are introduced for simplicity)
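
A small sketch of the bit-vector alternative, assuming that B carries a 1 at the first character of each string and 0 elsewhere, and that no string is empty. The helper names are illustrative; the lookup scans B, which is exactly the work that a Select structure is meant to avoid.

```python
def build(strings):
    """Concatenate the strings into D and mark each string start in B."""
    D = "".join(strings)
    B = "".join("1" + "0" * (len(s) - 1) for s in strings)
    return D, B


def ith_string(D, B, i):
    """Return the i-th string (1-based) by locating the i-th and (i+1)-th 1."""
    starts = [p for p, bit in enumerate(B) if bit == "1"]
    begin = starts[i - 1]
    end = starts[i] if i < len(starts) else len(D)
    return D[begin:end]


D, B = build(["Abaco", "Battle", "Car", "Cold", "Cod"])
print(B)                    # 100001000001001000100
print(ith_string(D, B, 4))  # Cold
```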

Rank/Select
• Wish to index the bit vector B (possibly compressed)
  B = 001010101011111110000011010101..
• Rank_b(i) = number of b in B[1, i], e.g. Rank_1(6) = 2
• Select_b(i) = position of the i-th b in B, e.g. Select_1(3) = 7
• Two approaches, with m = |B| and n = #1s:
  (1) takes |B| + o(|B|) bits of space
  (2) aims at achieving n log(m/n) bits
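
The two definitions can be pinned down with naive linear-time reference versions, using 1-based positions as in B[1, i]. These are only a specification to test faster structures against; B is the prefix printed on the slide.

```python
def rank(B, b, i):
    """Number of occurrences of bit b in B[1, i] (1-based, inclusive)."""
    return B[:i].count(b)


def select(B, b, i):
    """1-based position of the i-th occurrence of bit b in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("B contains fewer than i occurrences of b")


B = "001010101011111110000011010101"
print(rank(B, "1", 6))     # 2
print(select(B, "1", 3))   # 7
```

Note that Select_b(Rank_b(p)) = p whenever B[p] = b, which is a handy sanity check.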

The Bit-Vector Index: |B| + o(|B|)
• m = |B|, n = #1s
• Goal: B is read-only, and the additional index takes o(m) bits
• Rank: split B into blocks of Z bits, each storing its (absolute) Rank_1, and into sub-blocks of z bits, each storing its (bucket-relative) Rank_1
  B = 001010101011 111110001011010111000..
• A lookup table maps every (block, pos) pair to the #1s in the first pos bits of that z-bit block, e.g. (0000, 1) → 0 and (1011, 2) → 1
• Setting Z = poly(log m) and z = (1/2) log m:
  • Extra space is (m/Z) log m + (m/z) log Z + o(m) = O(m loglog m / log m) = o(m) bits
  • Rank time is O(1)
  • The o(m) term is crucial in practice; B is untouched (not compressed)
• There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!
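
The two-level directory plus lookup table can be sketched as below. The block sizes Z = 8 and z = 4 are toy values for readability, not the asymptotic settings above, and the class and attribute names are illustrative. B is kept as a '0'/'1' string and is never modified.

```python
class RankIndex:
    """Two-level rank directory over a read-only bit string (a sketch)."""

    def __init__(self, bits, Z=8, z=4):
        if Z % z != 0:
            raise ValueError("Z must be a multiple of z")
        self.bits, self.Z, self.z = bits, Z, z
        self.abs_rank = []   # 1s before each Z-bit block (absolute)
        self.rel_rank = []   # 1s from the enclosing Z-block start to each z-bit sub-block
        total = in_block = 0
        for start in range(0, len(bits) + 1, z):
            if start % Z == 0:
                self.abs_rank.append(total)
                in_block = 0
            self.rel_rank.append(in_block)
            ones = bits[start:start + z].count("1")
            total += ones
            in_block += ones
        # Lookup table: (z-bit pattern, pos) -> number of 1s in its first pos bits.
        self.table = {}
        for p in range(1 << z):
            pattern = format(p, "0{}b".format(z))
            for pos in range(z):
                self.table[(pattern, pos)] = pattern[:pos].count("1")

    def rank1(self, i):
        """Number of 1s in B[1, i], for 0 <= i <= len(B)."""
        q = i // self.z                       # sub-block containing position i + 1
        start = q * self.z
        block = self.bits[start:start + self.z].ljust(self.z, "0")
        return (self.abs_rank[start // self.Z]
                + self.rel_rank[q]
                + self.table[(block, i - start)])


B = "001010101011111110001011010111000"
index = RankIndex(B)
print(index.rank1(6))    # 2
print(index.rank1(12))   # 6
```

Each query reads one entry per level and one table cell, which is what makes Rank constant time.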

The Select operation
• m = |B|, n = #1s
  B = 00101010101111111000001101010111001..
• Split B into subarrays of variable size r, each extended until it includes k 1s
• Sparse case: if r > k^2, we store explicitly the positions of the k 1s
• Dense case: if k ≤ r ≤ k^2, recurse... One level is enough!! ... we still need a table of size o(m)
• Setting k ≈ polylog m:
  • Extra space is o(m), and B is not touched!
  • Select time is O(1)
• There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!
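
A simplified sketch of the sparse/dense split follows. It departs from the slide in two labeled ways: r is measured from the first to the last 1 of each group of k ones, and dense groups are answered by scanning at most k^2 bits of B instead of recursing and using a lookup table, so it illustrates the bucketing but is not constant time. The value k = 4 and all names are illustrative.

```python
class SelectIndex:
    """One level of bucketing for Select_1 over a read-only bit string."""

    def __init__(self, bits, k=4):
        self.bits, self.k = bits, k
        ones = [p for p, bit in enumerate(bits, start=1) if bit == "1"]
        self.n = len(ones)
        self.buckets = []
        for j in range(0, self.n, k):
            group = ones[j:j + k]             # the next k ones (fewer at the end)
            r = group[-1] - group[0] + 1      # span covered by this group
            if r > k * k:
                self.buckets.append(("sparse", group))     # explicit positions
            else:
                self.buckets.append(("dense", group[0]))   # only the start

    def select1(self, i):
        """1-based position of the i-th 1."""
        if not 1 <= i <= self.n:
            raise IndexError("i must be in [1, number of 1s]")
        kind, data = self.buckets[(i - 1) // self.k]
        skip = (i - 1) % self.k
        if kind == "sparse":
            return data[skip]
        pos = data                            # scan the dense group inside B
        while True:
            if self.bits[pos - 1] == "1":
                if skip == 0:
                    return pos
                skip -= 1
            pos += 1


B = "00101010101111111000001101010111001"
index = SelectIndex(B)
print(index.select1(1))   # 3
print(index.select1(3))   # 7
```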

Via Elias-Fano (B is not needed)
• Recall that by setting w = log(m/n) and z = log n, where m = |B| and n = #1s, Space = n log(m/n) bits + 2n bits
• Build Select_1 on H, so we need |H| + o(|H|) bits = 2n + o(n) bits
• Select_1(i) on B uses L and (Select_1(H, i) - i), in +o(n) space
• Rank_1(i) on B needs a binary search over B
[Figure: the example with z = 3 and w = 2, buckets 0 to 7.]
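
Since B is not stored, one way to read the binary search is as a search over the answers of Select_1. The sketch below assumes that reading; `select1` is any callable returning the position of the j-th 1, and a plain sorted list of hypothetical positions stands in for the Elias-Fano structure.

```python
def rank_by_binary_search(select1, n, i):
    """Rank_1(i): the number of j in [1, n] with select1(j) <= i."""
    lo, hi = 0, n                     # invariant: answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if select1(mid) <= i:
            lo = mid                  # the mid-th 1 is at or before position i
        else:
            hi = mid - 1
    return lo


# Hypothetical positions of the 1s; a stand-in for Elias-Fano Select_1.
positions = [1, 4, 7, 18, 24, 26, 30, 31]


def select1(j):
    return positions[j - 1]


print(rank_by_binary_search(select1, len(positions), 18))  # 4
print(rank_by_binary_search(select1, len(positions), 17))  # 3
```

The search makes O(log n) Select calls, so Rank is no longer constant time in this approach.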

If you wish to play with Rank and Select
• Space: m/10 + n log(m/n) bits, vs 32n bits of explicit pointers
• Rank in 0.4 msec, Select in < 1 msec