Managing Gigabytes 3 Indexing Inverted File Indexing Indexing

Managing Gigabytes 3. Indexing

Inverted File Indexing • • Indexing의 방법 – Inverted files (postings files or concordance) – signature files – bitmaps Inverted File의 구성 – inverted file entry – < t, doc# > • t : term, doc# : term이 나타난 document number

Inverted File Indexing • Inverted File (or Postings File) – require a lexicon or vocabulary – store a list of pointers – 표기 : < t ; [(x; y 1, y 2, …)] > • t : term, x : 문서번호, yi : word number – hierarchical structure • Document# ; volume# ; chapter# ; paragraph# ; sentence# ; word# • vector(d, v, c, p, s, w)

Inverted File Indexing • Index 는 점점 더 증가해왔다. 그 이유로는 – 각각의 pointer 에 대해 더 많은 information 요구r – several words appear more than once • Uncompressed Inverted file – 50 or 100% of original text • space requirement – N : 문서의 수, f : pointers의 수 – space =

unary code • Simple method • – fixed representation of the positive integer – log N (bits) – gap이 x일 때, x-1 bit의 1과 1 bit의 0으로 표현 – lx = (x - 1) + 1, Pr[x] = 2 -x – 예) x = 9 일 때, => 1111 0 – N • n bits 소비, 일반적으로 아주 크다. <Table 3. 5> integer에 대한 예제 code Gap x Unary 1 2 3 4 5 6 7 8 9 10 0 10 1110 111110 11111110 1111111110 Coding Method 0 10 1 110 00 110 01 110 10 11 1110 000 1110 001 1110 0 100 1 101 00 101 01 10 101 11 11000 001 11000 010 Golomb b=3 b=6 00 0 10 0 01 0 100 10 0 0 101 10 10 0 110 10 111 110 00 110 10 10 01 110 11 10 100 1110 0 10 101

code • code – 1 + log x bit의 unary code와 log x bit의 binary code(x - 2 log x )로 표현 – lx = 1 + log x , Pr[x] = 1/2 x 2 – 예) x = 9 일 때, log x = 3, x - 2 log x =1 => 1110 001 – V = <1, 2, 4, 8, 16, …> or V = <1, 2, 2, 4, 4, 4, 8, …> or ….

Global Bernoulli model • Pr[x] = (1 -p)x-1 p, p : gap x 가 나타날 확률 • Golomb code • q + 1 bit의 unary code와 + log b or log b bit의 binary code – q = (x - 1) / b , r = x - q b - 1 – b. A = log(2 - p) / - log(1 - p) 0. 69(N n / f) – 예) b=3, r=0(0), 1(10), 2(11) b=6, r=0(00), 1(01), 2(100), 3(101), 4(110), 5(111) x=9이면, q = 2, r = 2 따라서, 110 11

Global “observed frequency” model • • • Based on observed frequency of appear gap size arithmetic code 나 Huffman code 를 사용한다. 이론적으로는 나은 압축 method나 실제적으로는 and code 보다 약간 낮다.

Local Bernoulli model • • • The frequency of term t, ft , is known – 개개의 inverted file entry 에 대해서 Bernoulli model 이 사용되어 진다. 가장 일반적인 Words 는 b=1 로 encode 된다. – Bitvector 와 같다. – 따라서 , inverted file 은 결코 bitvector보다 떨어지지 않는 다. parameter ft저장할 필요가 있다. – b 는 decoding 동안에 계속 사용되기 때문

Skewed Bernoulli model and Local “observed frequency” model • Skewed Bernoulli model – – Bernoulli model의 vector VG = <b, b, b, …> Figure 3. 5 VT = <b, 2 b, 4 b, 2 ib, …> Golomb code를 사용하는 것보다 약간 떨어짐 • Local “observed frequency” model – – The ultimate local modeling batched frequency 많은 more memory space 요구 최상의 압축 방법

Local hyperbolic model • Pr[x] = / x, x = 1, 2, …, m – = 1 / (loge(m+1)+0. 5772) – m is largest gap • Bernoulli 보다 나은 수행 • 하나 좀더 복잡한 적용 • arithmetic coding의 사용이 요구된다 • 아직 이 model에서는 알려진 Huffman code 가 없음

Signature files • • Signature files 은 <Table 3. 9>의 형태를 가짐 문서내의 각 단어는 hashed value를 가진다. Term Hash String cold 1000 0010 0100 old 1000 0100 0000 각 문서는 signature file를 가진다. 문서 안의 term의 hash 값을 모두 ‘OR’연산 <Table 3. 9> 예) ’cold’ <Table 3. 8>의 hash bit set와 문서의 Descriptor 와 ‘AND’ 연산, hash bit set과 일치하면 성공 단점 : False match 발생 ‘old’ 검색 시 2, 3, 5, 6 문서를 검색. . 2, 5 문서에 존재하지않음

Signature files • 예) ’cold’ <Table 3. 8>의 hash bit set와 문서의 Descriptor 와 ‘AND’연산, hash bit set과 일치하면 성공 • 단점 : False match 발생 • ‘old’ 검색 시 2, 3, 5, 6 문서를 검색. . 2, 5 문서에 존재하지않음 Document Text descriptor 1 Pease porridge hot, pease porridge cold. 1100 1111 0010 0101 2 Pease porridge in the pot. 1110 1111 0110 0001 3 Nine days old. 1010 1100 0100 1100 4 Some like it hot, some like it cold. 1100 1110 1010 0111 5 Some like in the pot. 1110 1111 1110 0011 6 Nine days old. 1010 1100 0100 1100

Three valued logic • Three valued logic – – – Boolean operation을 이용 => Yes, Maybe, No Yes : 문서내에 term 존재 Maybe : term이 있을 수도 있다. (false match 해결을 위해) No : 존재하지 않는다. 아래 표에서 “(some OR NOT hot) AND pease” 평가하면 적절한 해결책 제시못함. Doc s H p NOT h s OR NOT h (s OR NOT h) AND p 1 M M M 2 M M M 3 N N N Y Y N 4 M M N 5 M M M 6 N N N Y Y N

Bitslice • • • set 된 bit의 위치인 Pos 와 slice 로 구성. 예) ‘cold’ => 1, 14 111111 AND 110110 AND 101101 => 100100 1, 4번 문서 내에 존재. 장점 : signature files에 대한 접근속도를 향상 – 전체 signature files을 탐색하지 않아도 된다. Pos Slice 1 111111 5 111111 9 000110 13 001001 2 110110 6 111111 10 011011 14 101101 3 011011 7 110110 11 110110 15 000110 4 000000 8 110010 12 000000 16 110110

signature files의 크기 • signature files 크기에 영향을 주는 parameter – b : the number of bitslices(주로 6~12 내) – b의 크기가 커지면 signature files size가 충분히 작더라도 query processing time 느려짐 -> fine balance 필요 – q : query 때마다 있다고 예측되는 term의 최소수 ->q의 크기가 커지면 signature files size는 작아진다 – z : 각각의 query 마다 예측되는 false match의 수 • z 가 크면 query processing time이 느려진다. • z 가 작으면 query processing time이 빨라진다. b 6 8 10 12 16 20 Bit per pointer Bible GNUbib Compact TREC 30. 6 25. 0 22. 8 21. 9 21. 6 22. 1 34. 9 27. 7 24. 9 23. 7 23. 1 23. 4 44. 9 33. 9 29. 6 27. 5 26. 1 54. 0 39. 2 33. 4 30. 6 28. 5 28. 1 Signature file sizes assuming z=1, q=1

Bitmap • Bitmap이란? • Term 과 bitvector로 구성되어있다. Number 1 2 3 4 5 Term cold days hot in it Bitvector 1001001 10010010 000110 6 7 8 9 10 11 12 13 like nine old pease porridge pot some the 000110 001001 110000 010010 000110 010010 예) ‘pot’ and ‘some’ => 010010 and 00011 => 000010 => 5번 문서에 ‘pot’ 와 ‘ some’ term 이 있다.

Compression of signature files and bitmap • Signature files – Bitslicing : Golombo 기법을 사용 – 손실 압축 기법이다. • Bitmap : hierarchical bitvector compression • <Figure 3. 7> 참조 • 64개의 bits 를 22 bits 로 압축 => 1100, 0101, 1010, 10, 11, 00, 01

Case folding, Stemming, Stop words • Case folding : 대문자나 소문자로 통일 • 예) Car -> CAR • Stemming : suffix, tense, plurality 제거 – 예) Students -> student • Case folding 과 stemming 은 index size 에 영향 – Inverted file의 size를 줄여준다. – 이유 : inverted file entry 가 fewer, dense • Stop words : 자주 발생하는 빈도를 이용한 방법 – 많은 space 절약 – 예) the, is, are, of…