Inverted File Compression in Managing Gigabytes
Course: Information Retrieval. Lecturer: Hyuk-Chul Kwon, Pusan National University
Inverted File Compression
• Inverted file entry
– $\langle t;\ f_t;\ [d_1, d_2, \ldots, d_{f_t}] \rangle$
– $t$: term; $f_t$: number of documents containing $t$
– $d_k$: document number, where $d_k < d_{k+1}$
– <elephant; 8; [3, 5, 20, 21, 23, 76, 77, 78]> ⇒ <elephant; 8; [3, 2, 15, 1, 2, 53, 1, 1]>
– gap $= d_{k+1} - d_k$
• Two classes of compression methods
– Global methods vs. local methods
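The gap transformation in the elephant example can be sketched in a few lines of Python. The function names `to_gaps` and `from_gaps` are illustrative, and the sketch assumes document numbers start at 1, so the first gap equals the first document number.

```python
def to_gaps(doc_ids):
    """Convert a strictly increasing list of document numbers into gaps."""
    gaps = []
    previous = 0  # assumption: document numbers start at 1
    for d in doc_ids:
        gaps.append(d - previous)
        previous = d
    return gaps


def from_gaps(gaps):
    """Rebuild the document numbers as running sums of the gaps."""
    doc_ids = []
    total = 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


elephant = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(elephant)
print(gaps)             # [3, 2, 15, 1, 2, 53, 1, 1]
print(from_gaps(gaps))  # [3, 5, 20, 21, 23, 76, 77, 78]
```

The gaps are on average much smaller than the document numbers, which is what the codes on the following slides exploit.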
Summary of coding methods
Unary code
• Simple method
– Fixed-length binary representation of a positive integer
– $\lceil \log N \rceil$ bits
• Unary code
– A gap $x$ is represented as $x - 1$ one bits followed by a single zero bit
– $l_x = (x - 1) + 1 = x$, $\Pr[x] = 2^{-x}$
– e.g. $x = 9$ ⇒ 11111111 0
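A minimal sketch of the unary code as defined on this slide, using strings of '0' and '1' for readability rather than packed bits. The names `unary_encode` and `unary_decode` are illustrative.

```python
def unary_encode(x):
    """Unary code: x - 1 one bits followed by a single zero bit."""
    if x < 1:
        raise ValueError("unary code is defined for positive integers")
    return "1" * (x - 1) + "0"


def unary_decode(bits, pos=0):
    """Decode one unary codeword starting at pos; return (x, next position)."""
    x = 1
    while bits[pos] == "1":
        x += 1
        pos += 1
    return x, pos + 1  # skip the terminating zero


print(unary_encode(9))            # 111111110
print(unary_decode("111111110"))  # (9, 9)
```

The codeword for $x$ is $x$ bits long, so the code is only efficient when small gaps dominate.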
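γ code
• Unary code for $1 + \lfloor \log x \rfloor$, followed by the binary code for $x - 2^{\lfloor \log x \rfloor}$ in $\lfloor \log x \rfloor$ bits
• $l_x = 1 + 2\lfloor \log x \rfloor$, $\Pr[x] \approx 1 / (2x^2)$
• e.g. $x = 9$: $\lfloor \log x \rfloor = 3$, $x - 2^{\lfloor \log x \rfloor} = 1$ ⇒ 1110 001
• Vector of bucket sizes: $V = \langle 1, 2, 4, 8, 16, \ldots \rangle$ or $V = \langle 1, 2, 2, 4, 4, 4, 8, \ldots \rangle$ or …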
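The γ code can be sketched as follows, again with bit strings for readability. The function names are illustrative, and `int.bit_length()` is used to obtain $\lfloor \log_2 x \rfloor$ without floating-point arithmetic.

```python
def gamma_encode(x):
    """Gamma code: unary code for 1 + floor(log2 x), then the low-order bits of x."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers")
    n = x.bit_length() - 1                 # floor(log2 x)
    unary = "1" * n + "0"                  # unary code for 1 + n
    binary = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary + binary


def gamma_decode(bits, pos=0):
    """Decode one gamma codeword starting at pos; return (x, next position)."""
    n = 0
    while bits[pos] == "1":
        n += 1
        pos += 1
    pos += 1                               # skip the terminating zero
    remainder = int(bits[pos:pos + n], 2) if n else 0
    return (1 << n) + remainder, pos + n


print(gamma_encode(9))          # 1110001
print(gamma_decode("1110001"))  # (9, 7)
```

For $x = 9$ the codeword is 7 bits long, against 9 bits for the unary code.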
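δ code
• Similar in construction to the γ code
• $1 + \lfloor \log x \rfloor$ is coded with the γ code instead of the unary code, followed by the binary code for $x - 2^{\lfloor \log x \rfloor}$ in $\lfloor \log x \rfloor$ bits
• $l_x = 1 + 2\lfloor \log(1 + \lfloor \log x \rfloor) \rfloor + \lfloor \log x \rfloor$, $\Pr[x] \approx 1 / (2x(\log x)^2)$
• e.g. $x = 9$ ⇒ 11000 001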
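A minimal sketch of the δ code, which reuses a γ encoder for the length prefix. The function names are illustrative; the γ helper is repeated so the block runs on its own.

```python
def gamma_encode(x):
    """Gamma code: unary code for 1 + floor(log2 x), then the low-order bits of x."""
    n = x.bit_length() - 1
    binary = format(x - (1 << n), "b").zfill(n) if n else ""
    return "1" * n + "0" + binary


def delta_encode(x):
    """Delta code: gamma code for 1 + floor(log2 x), then the low-order bits of x."""
    if x < 1:
        raise ValueError("delta code is defined for positive integers")
    n = x.bit_length() - 1                 # floor(log2 x)
    prefix = gamma_encode(n + 1)           # gamma code replaces the unary prefix
    binary = format(x - (1 << n), "b").zfill(n) if n else ""
    return prefix + binary


print(delta_encode(9))  # 11000001
```

For $x = 9$ the δ codeword (8 bits) is one bit longer than the γ codeword; the δ code only becomes shorter for larger gaps, where the prefix dominates.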
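Global Bernoulli model
• $\Pr[x] = (1 - p)^{x-1} p$, where $p$ is the probability that a given document contains a given term
• Golomb code
– Unary code for $q + 1$, followed by a binary code for $r$ in $\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits
– $q = \lfloor (x - 1) / b \rfloor$, $r = x - qb - 1$
– $b^A = \lceil \log(2 - p) / -\log(1 - p) \rceil \approx 0.69\,(N \cdot n / f)$
– e.g. $b = 3$: $r = 0$ (0), 1 (10), 2 (11)
– e.g. $b = 6$: $r = 0$ (00), 1 (01), 2 (100), 3 (101), 4 (110), 5 (111)
– With $b = 3$ and $x = 9$: $q = 2$, $r = 2$, so the code is 110 11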
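The Golomb code on this slide can be sketched as below. The helper `truncated_binary` is an illustrative name for the remainder code that uses $\lfloor \log b \rfloor$ bits for the smallest remainders and $\lceil \log b \rceil$ bits for the rest; it reproduces the codeword tables given for $b = 3$ and $b = 6$.

```python
def truncated_binary(r, b):
    """Code a remainder 0 <= r < b in floor(log2 b) or ceil(log2 b) bits."""
    k = (b - 1).bit_length()        # ceil(log2 b)
    if k == 0:
        return ""                   # b == 1: the remainder carries no information
    short = (1 << k) - b            # number of remainders that get k - 1 bits
    if r < short:
        return format(r, "b").zfill(k - 1)
    return format(r + short, "b").zfill(k)


def golomb_encode(x, b):
    """Golomb code: unary code for q + 1, then the remainder r."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive integers")
    q, r = divmod(x - 1, b)         # q = floor((x - 1) / b), r = x - q*b - 1
    return "1" * q + "0" + truncated_binary(r, b)


print([truncated_binary(r, 3) for r in range(3)])  # ['0', '10', '11']
print(golomb_encode(9, 3))                         # 11011
```

With $b = 1$ the remainder part is empty and the Golomb code reduces to the unary code.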
Global “observed frequency” model
• Based on the observed frequency of each gap size
• Uses an arithmetic or Huffman code
• In theory
– A better compression method
• In practice
– Only slightly better than the γ and δ codes
Local Bernoulli model
• The frequency of term $t$, $f_t$, is known
– A Bernoulli model can be applied to each individual inverted file entry
• Very common words are encoded with $b = 1$
– This is tantamount to a bitvector
– Thus an inverted file can never be worse than a bitvector
• The parameter $f_t$ must be stored
– so that $b$ can be derived during decoding
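A minimal sketch of how $b$ might be derived per term, assuming the same formula as the global model with $p = f_t / N$, where $N$ is the number of documents. The function names and the toy collection of 10 documents are illustrative, not taken from the slides.

```python
import math


def local_golomb_parameter(f_t, n_docs):
    """Golomb parameter for one term, assuming p = f_t / N in the Bernoulli formula."""
    if not 0 < f_t <= n_docs:
        raise ValueError("f_t must lie between 1 and the number of documents")
    if f_t == n_docs:
        return 1                          # the term occurs in every document
    p = f_t / n_docs
    return math.ceil(math.log(2 - p) / -math.log(1 - p))


def bits_with_b1(doc_ids):
    """With b = 1 each gap x costs x bits (pure unary), so the total is the last document number."""
    previous, total = 0, 0
    for d in doc_ids:
        total += d - previous
        previous = d
    return total


common = [1, 2, 4, 5, 6, 8, 9, 10]            # hypothetical entry, N = 10
print(local_golomb_parameter(len(common), 10))  # 1
print(bits_with_b1(common))                     # 10
```

The entry for the common term costs 10 bits, exactly the length of a bitvector over 10 documents, which illustrates why the inverted file is never worse than the bitvector.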
Skewed Bernoulli model
• Vector for the Bernoulli model: $V_G = \langle b, b, b, \ldots \rangle$
• [Figure: word positions in the Bible for (a) bridegroom; (b) Jezebel; (c) twelfth]
• $V_T = \langle b, 2b, 4b, \ldots, 2^i b, \ldots \rangle$
• Slightly worse than the Golomb code
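The role of the vectors can be sketched with a generic coder. This assumes a code is built from a vector of bucket sizes by giving the bucket index in unary and the offset inside the bucket in binary. All names here are illustrative.

```python
from itertools import count, repeat


def truncated_binary(r, size):
    """Code an offset 0 <= r < size in floor(log2 size) or ceil(log2 size) bits."""
    k = (size - 1).bit_length()
    if k == 0:
        return ""
    short = (1 << k) - size
    if r < short:
        return format(r, "b").zfill(k - 1)
    return format(r + short, "b").zfill(k)


def vector_encode(x, buckets):
    """Find the bucket holding x; emit its index in unary and the offset in binary."""
    base = 0
    for k, size in enumerate(buckets):
        if x <= base + size:
            return "1" * k + "0" + truncated_binary(x - base - 1, size)
        base += size
    raise ValueError("x does not fit in the supplied buckets")


def golomb_vector(b):
    return repeat(b)                       # V_G = <b, b, b, ...>


def skewed_vector(b):
    return (b << i for i in count())       # V_T = <b, 2b, 4b, ..., 2^i b, ...>


print(vector_encode(9, golomb_vector(3)))  # 11011   (Golomb code, b = 3)
print(vector_encode(9, skewed_vector(1)))  # 1110001 (same as the gamma code)
print(vector_encode(9, skewed_vector(3)))  # 10111
```

Under this assumption the doubling buckets of $V_T$ keep short codewords for gaps near $b$ while the codeword length grows only logarithmically for very large gaps.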
Local hyperbolic model
• $\Pr[x] = \alpha / x$, for $x = 1, 2, \ldots, m$
– $\alpha = 1 / (\log_e(m + 1) + 0.5772)$
– $m$ is the largest gap
• Better performance
• More complex to implement
• Requires the use of arithmetic coding
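A short numerical check of the hyperbolic distribution, assuming the normalizing constant $1 / (\log_e(m + 1) + 0.5772)$ from this slide. The variable name `alpha` and the value $m = 1000$ are chosen here for illustration only.

```python
import math

EULER_CONSTANT = 0.5772  # constant used on the slide


def hyperbolic_probabilities(m):
    """Pr[x] = alpha / x for x = 1..m, with alpha = 1 / (ln(m + 1) + 0.5772)."""
    alpha = 1.0 / (math.log(m + 1) + EULER_CONSTANT)  # 'alpha' is a name chosen here
    return [alpha / x for x in range(1, m + 1)]


probs = hyperbolic_probabilities(1000)
print(sum(probs))               # close to 1, so the constant normalizes the distribution
print(-math.log2(probs[0]))     # ideal code length in bits for a gap of 1
print(-math.log2(probs[999]))   # ideal code length in bits for the largest gap
```

The ideal code lengths are generally fractional, which is consistent with the model needing arithmetic coding rather than a simple prefix code.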
Local “observed frequency” model
• The ultimate in local modeling
• Batched frequency
• Requires more memory space
• Best compression method
Performance of Index Compression Methods
Bits per pointer

Method            Bible    GNUbib   Comact   TREC
Global methods
  Unary           264      920      490      1719
  Binary          15.00    16.00    18.00    20.00
  Bernoulli       9.67     11.65    10.58    12.61
  γ               6.55     5.69     4.48     6.43
  δ               6.26     5.08     4.36     6.19
Compression of bitmaps
• Bitmaps are compressed with the hierarchical bitvector compression technique
• [Figure: (a) original bitvector; (b) hierarchical structure; (c) flattened tree as a string of bits]
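One way the three stages (original bitvector, hierarchical structure, flattened tree) could be realized is sketched below. The block size `k = 4`, the top-down ordering of the flattened tree, and the function name are assumptions made for illustration; the slides do not specify them.

```python
def hierarchical_compress(bits, k=4):
    """Flatten a hierarchy of summary bitvectors into one bit string.

    Each level has one bit per k-bit block of the level below (1 if the
    block contains a set bit). The output is the top level followed, level
    by level, by the non-zero blocks only; all-zero blocks are omitted.
    """
    if k < 2:
        raise ValueError("block size must be at least 2")
    levels = [bits]
    while len(levels[-1]) > k:
        current = levels[-1]
        current += "0" * (-len(current) % k)      # pad to a multiple of k
        levels[-1] = current
        summary = "".join(
            "1" if "1" in current[i:i + k] else "0"
            for i in range(0, len(current), k)
        )
        levels.append(summary)
    out = [levels[-1]]                            # keep the whole top level
    for level in reversed(levels[:-1]):
        out.extend(
            level[i:i + k]
            for i in range(0, len(level), k)
            if "1" in level[i:i + k]
        )
    return "".join(out)


sparse = ["0"] * 64
sparse[5] = "1"
sparse[40] = "1"
print(hierarchical_compress("".join(sparse)))  # 10100100001001001000
```

In this example a 64-bit vector with two set bits shrinks to 20 bits; a dense bitvector would gain little, since few blocks are all zero.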