Signature Files Information Retrieval Data Structures and Algorithms

Signature Files
Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates (Eds.). Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapter 4)

Signature Files
• Characteristics
  » Word-oriented index structures based on hashing
  » Low overhead (10% to 20% over the text size) at the cost of forcing a sequential search over the index
  » Suitable for texts that are not very large
  » Inverted files outperform signature files for most applications

Structure
• Use superimposed coding to create the signature.
• Each text is divided into logical blocks.
• A block contains n distinct non-common words.
• Each word yields a “word signature”.
• A word signature is a B-bit pattern with m bits set to 1.
  » Each word is divided into successive, overlapping triplets, e.g. free --> fr, fre, ree, ee
  » Each such triplet is hashed to a bit position.
• The word signatures are OR’ed to form the block signature.
• Block signatures are concatenated to form the document signature.
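The construction above can be sketched in Python. This is a minimal illustration, not the method from the book: the blank padding of the word, the use of `zlib.crc32` as the hash function, and the re-hashing loop that keeps going until exactly m distinct bits are set are all assumptions made here so that the example is deterministic. It assumes non-empty words and m <= B.

```python
import zlib
from itertools import count


def triplets(word):
    """Successive, overlapping triplets of the blank-padded word (padding is an assumption)."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def word_signature(word, B, m):
    """B-bit pattern (as an int) with exactly m bits set, derived from hashed triplets."""
    trips = triplets(word)
    sig = 0
    for i in count():
        if bin(sig).count("1") == m:
            return sig
        # Salting with i lets short words or colliding triplets still reach m bits.
        t = trips[i % len(trips)]
        sig |= 1 << (zlib.crc32(f"{i}:{t}".encode()) % B)


def block_signature(words, B, m):
    """Superimposed coding: OR the word signatures of one logical block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, B, m)
    return sig
```

With B=12 and m=4, `block_signature(["free", "text"], 12, 4)` yields a 12-bit value that contains every 1 bit of both word signatures.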

Example
• Example (n=2, B=12, m=4)

  Word              Signature
  free              001 000 110 010
  text              000 010 101 001
  block signature   001 010 111 011

• Search
  » Use the hash function to determine the m bit positions that the search word sets to 1.
  » Examine each block signature: it qualifies if it has a 1 in every bit position where the signature of the search word has a 1.
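The search test can be written as a single bitwise check. The sketch below uses the signatures of “free” and “text” from the example; the two probe signatures at the end are made up for illustration and do not come from the slides.

```python
def parse(bits):
    """Turn a spaced bit string such as '001 000 110 010' into an int."""
    return int(bits.replace(" ", ""), 2)


FREE = parse("001 000 110 010")
TEXT = parse("000 010 101 001")
BLOCK = FREE | TEXT  # superimposed block signature


def qualifies(block_sig, word_sig):
    """True if the block has a 1 wherever the search word's signature has a 1."""
    return (block_sig & word_sig) == word_sig


# Hypothetical signatures of words that are NOT in the block:
ABSENT_BUT_COVERED = parse("001 010 100 001")  # every 1 is also in BLOCK
ABSENT_NOT_COVERED = parse("100 000 000 111")  # first bit is 0 in BLOCK
```

`qualifies(BLOCK, ABSENT_BUT_COVERED)` returns True even though no such word is in the block, which is exactly the kind of false hit that signature files have to tolerate.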

False Drop
• False alarm (false hit, or false drop) probability F_d: the probability that a block signature seems to qualify, given that the block does not actually qualify.
  $F_d = \mathrm{Prob}\{\text{signature qualifies} \mid \text{block does not}\}$
• For a given value of B, the value of m that minimizes the false drop probability is such that each row of the matrix contains “1”s with probability 0.5. In that case:
  $F_d = 2^{-m}$, where $m = B \ln 2 / n$
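As a quick numeric check of the two formulas, the following sketch evaluates them directly. The function names are chosen here for illustration only.

```python
import math


def optimal_m(B, n):
    """m = B ln 2 / n, the number of bits per word that minimizes the false drop probability."""
    return B * math.log(2) / n


def false_drop(m):
    """F_d = 2^(-m), valid at the optimum where each bit is 1 with probability 0.5."""
    return 2.0 ** (-m)
```

For the earlier example with B=12 and n=2, `optimal_m(12, 2)` is about 4.16, which rounds to the m=4 used there, and `false_drop(4)` is 0.0625.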

Sequential Signature File (SSF)
[Figure: signature matrix, axis labeled “documents”]
• Assume that each document spans exactly one logical block.
• The size of the document signature, F, then equals the size of the block signature, B.

Classification of Signature-Based Methods
• Compression: if the signature matrix is deliberately sparse, it can be compressed.
• Vertical partitioning: storing the signature matrix column-wise improves the response time at the expense of insertion time.
• Horizontal partitioning: grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.

Classification of Signature-Based Methods
• Sequential storage of the signature matrix
  » without compression: sequential signature files (SSF)
  » with compression: bit-block compression (BC), variable bit-block compression (VBC)
• Vertical partitioning
  » without compression: bit-sliced signature files (BSSF, B’SSF), frame-sliced (FSSF), generalized frame-sliced (GFSSF)

Classification of Signature-Based Methods (Continued)
• Vertical partitioning (continued)
  » with compression: compressed bit slices (CBS), doubly compressed bit slices (DCBS), no-false-drop method (NFD)
• Horizontal partitioning
  » data-independent partitioning: Gustafson’s method, partitioned signature files
  » data-dependent partitioning: 2-level signature files, S-trees

Criteria
• the storage overhead
• the response time on single-word queries
• the performance on insertion, as well as whether the insertion maintains the “append-only” property

Compression
• Idea
  » Create sparse document signatures on purpose.
  » Compress them before storing them sequentially.
• Method
  » Use a B-bit vector, where B is large.
  » Hash each word into one (or k) bit position(s).
  » Use run-length encoding (McIlroy 1982).

Compression using run-length encoding
Block containing the words data, base, management, and system (one bit set per word):
  block signature   0000 1001 0000 0010 1000
  intervals         L1 L2 L3 L4 L5 (the runs of 0s delimited by the 1s)
  stored form       [L1] [L2] [L3] [L4] [L5], where [x] is the encoded value of x
Search: decode the encoded lengths of all the preceding intervals.
Example: search “data”
(1) data ==> 0000 0000 0000 0010 0000
(2) decode [L1] = 0000, decode [L2] = 00, decode [L3] = 000000; the next bit (position 15) is a 1, so the block qualifies.
Disadvantage: search becomes slow.
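The interval idea can be sketched as follows. This is a simplified illustration: the run lengths are kept as plain Python integers, whereas the slides store each length in an encoded form [x] whose exact code is not given here.

```python
def run_lengths(bits):
    """Lengths of the runs of 0s delimited by the 1s, including the trailing run."""
    return [len(run) for run in bits.split("1")]


def bit_is_set(lengths, position):
    """Check the 1-based bit `position` by decoding the intervals from the left."""
    pos = 0
    for run in lengths[:-1]:      # the trailing run is not followed by a 1
        pos += run + 1            # position of the 1 that ends this run
        if pos == position:
            return True
        if pos > position:
            return False
    return False


BLOCK = "0000 1001 0000 0010 1000".replace(" ", "")
LENGTHS = run_lengths(BLOCK)      # L1..L5 of the example block signature
```

Checking bit 15 (the bit set by “data”) walks through L1, L2, and L3 before reaching the answer, which illustrates why search over run-length encoded signatures is slow.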

Bit-block Compression (BC)
Data structure:
(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.
Algorithm (for the block signature 0000 1001 0000 0010 1000):
Part I. It is one bit long, and it indicates whether there are any “1”s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.
  Part I: 0 1 0 1 1
Part II. It indicates the number s of “1”s in the bit-block. It consists of s-1 “1”s and a terminating zero.
  Part II: 10 0 0
Part III. It contains the offsets of the “1”s from the beginning of the bit-block.
  Part III: 00 11 10 00
  Note: with 4-bit blocks, the offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11.
Block signature: 01011 | 10 0 0 | 00 11 10 00
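The three parts can be produced with a short encoder. This is a sketch under the assumption that the bit-block size b is at least 2 and that offsets use a fixed width of ceil(log2 b) bits, as in the 4-bit example; the function name is illustrative.

```python
def bc_encode(bits, b=4):
    """Encode a sparse bit string into (Part I, Part II, Part III) strings."""
    width = (b - 1).bit_length()          # bits needed for an offset in 0..b-1
    part1, part2, part3 = [], [], []
    for start in range(0, len(bits), b):
        block = bits[start:start + b]
        ones = [i for i, bit in enumerate(block) if bit == "1"]
        part1.append("1" if ones else "0")            # Part I: any 1s in the block?
        if ones:
            part2.append("1" * (len(ones) - 1) + "0")  # Part II: s-1 ones, then a 0
            part3.extend(format(off, f"0{width}b") for off in ones)  # Part III: offsets
    return "".join(part1), "".join(part2), "".join(part3)


SIGNATURE = "0000 1001 0000 0010 1000".replace(" ", "")
PARTS = bc_encode(SIGNATURE)
```

For the example signature, `PARTS` is `("01011", "1000", "00111000")`, matching 01011 | 10 0 0 | 00 11 10 00.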

Bit-block Compression (BC) (Continued)
Search “data”
(1) data ==> 0000 0000 0000 0010 0000
(2) Check the 4th bit of Part I of the block signature 01011 | 10 0 0 | 00 11 10 00.
(3) It is 1, so at least one bit is set in the 4th bit-block.
(4) Check further. In Part II, the “0” for the 4th bit-block tells us that only one bit is set in that bit-block. Is it the 3rd bit?
(5) Yes, the offset “10” in Part III confirms it.
Discussion:
(1) Bit-block compression requires less space than a Sequential Signature File for the same false drop probability.
(2) The response time of bit-block compression is slightly less than that of a Sequential Signature File.
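The search steps can be sketched directly on the three encoded parts, without decompressing the whole signature. The function below is an illustration only; it assumes the parts are given as bit strings, a bit-block size b of at least 2, and fixed-width offsets, as in the example.

```python
def bc_bit_is_set(part1, part2, part3, position, b=4):
    """Check the 1-based bit `position` using only Parts I, II, and III."""
    block, offset = divmod(position - 1, b)
    if part1[block] == "0":               # Part I: the bit-block is empty
        return False
    width = (b - 1).bit_length()
    i2 = i3 = 0
    # Skip the Part II / Part III entries of the earlier non-empty bit-blocks.
    for _ in range(part1[:block].count("1")):
        end = part2.index("0", i2)
        i3 += (end - i2 + 1) * width
        i2 = end + 1
    s = part2.index("0", i2) - i2 + 1     # Part II: number of 1s in this bit-block
    offsets = [int(part3[i3 + j * width:i3 + (j + 1) * width], 2) for j in range(s)]
    return offset in offsets              # Part III: is our offset among them?


PART1, PART2, PART3 = "01011", "1000", "00111000"
```

`bc_bit_is_set(PART1, PART2, PART3, 15)` follows the same path as the slide: the 4th bit of Part I is 1, Part II says one bit is set, and Part III gives offset 10, the 3rd bit.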

Vertical Partitioning
• Idea: avoid bringing useless portions of the document signature into main memory
• Methods
  » store the signature file in bit-sliced form or in frame-sliced form
  » store the signature matrix column-wise to improve the response time at the expense of insertion time

Bit-Sliced Signature Files (BSSF)
[Figure: the bit matrix of document signatures (one per document) and its transpose, the transposed bit matrix]

[Figure: the matrix stored as F bit-files, each covering all documents]
Search:
(1) Retrieve m bit-files. E.g., the word signature of “free” is 001 000 110 010; in a document that contains “free”, the 3rd, 7th, 8th, and 11th bits are set, so only the 3rd, 7th, 8th, and 11th bit-files are examined.
(2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
(3) Retrieve the text file through the pointer file.
Insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting.
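The bit-sliced search can be sketched in memory. This is a minimal illustration in which each bit-file is a string with one character per document; the first two document signatures reuse the “free”/“text” example, and the third is made up.

```python
def to_bit_slices(signatures, F):
    """Transpose N document signatures (F-bit strings) into F bit-files of N bits each."""
    return ["".join(sig[j] for sig in signatures) for j in range(F)]


def search(slices, word_sig):
    """AND the bit-files selected by the 1s of the word signature; return qualifying doc ids."""
    n_docs = len(slices[0])
    result = (1 << n_docs) - 1
    for j, bit in enumerate(word_sig):
        if bit == "1":                    # only m bit-files are touched
            result &= int(slices[j], 2)
    return [d for d in range(n_docs) if (result >> (n_docs - 1 - d)) & 1]


DOCS = [
    "001010111011",   # block containing "free" and "text"
    "000010101001",   # block containing only "text"
    "110000000011",   # hypothetical unrelated block
]
SLICES = to_bit_slices(DOCS, 12)
FREE = "001000110010"
TEXT = "000010101001"
```

`search(SLICES, FREE)` reads only the 3rd, 7th, 8th, and 11th bit-files and returns document 0; `search(SLICES, TEXT)` returns documents 0 and 1.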

Frame-Sliced Signature File (FSSF)
• Ideas
  » Random disk accesses are more expensive than sequential ones.
  » Force each word to hash into bit positions that are close to each other in the document signature.
  » These bit files are stored together and can be retrieved with a few random accesses.
• Procedure
  » The document signature (F bits long) is divided into k frames of s consecutive bits each.
  » For each word in the document, one of the k frames is chosen by a hash function.
  » Using another hash function, the word sets m bits in that frame.

Frame-Sliced Signature File (Continued)
[Figure: signature matrix of the documents divided into frames]
Each frame is kept in consecutive disk blocks.

FSSF (Continued)
• Example (n=2, B=12, s=6, f=2, m=3), where f is the number of frames (denoted k in the procedure)

  Word             Signature
  free             000000 110010
  text             010110 000000
  doc. signature   010110 110010

• Search
  » Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required. E.g., to search for documents that contain the word “free”: because the word signature of “free” is placed in the 2nd frame, only the 2nd frame has to be examined.
  » For a multi-word query, at most one frame per query word has to be scanned.
• Insertion
  » Only f frames have to be accessed instead of F bit-slices.
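The two-hash procedure can be sketched as follows. The hash functions (salted `zlib.crc32`) and the loop that re-hashes until m distinct bits are set are assumptions of this sketch, so the bit patterns it produces will differ from the ones in the example table. It assumes m <= s.

```python
import zlib


def fssf_word_signature(word, k, s, m):
    """Return (frame index, F-bit signature string) for one word, with F = k * s."""
    frame = zlib.crc32(b"frame:" + word.encode()) % k      # first hash: pick the frame
    bits = set()
    i = 0
    while len(bits) < m:                                    # second hash: m bits in the frame
        bits.add(zlib.crc32(f"bit{i}:{word}".encode()) % s)
        i += 1
    sig = ["0"] * (k * s)
    for b in bits:
        sig[frame * s + b] = "1"
    return frame, "".join(sig)


def fssf_document_signature(words, k, s, m):
    """OR the word signatures of a document into one F-bit string."""
    doc = 0
    for w in words:
        doc |= int(fssf_word_signature(w, k, s, m)[1], 2)
    return format(doc, f"0{k * s}b")
```

Because all m bits of a word fall inside one frame, a single-word query only needs the frame returned by the first hash.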

Vertical Partitioning with Compression
• Idea
  » Create a very sparse signature matrix.
  » Store it in bit-sliced form.
  » Compress each bit slice by storing the positions of the 1s in the slice.

Compressed Bit Slices (CBS)
• Room for improvement in the bit-sliced method
  » Searching
    – Each search word requires the retrieval of m bit files.
    – The search time could be improved if m were forced to be 1.
  » Insertion
    – Requires too many disk accesses (equal to F, which is typically 600 to 1000).

Compressed Bit Slices (CBS) (Continued)
[Figure: sparse bit matrix, labeled “documents” and “size of a signature”]
• Let m=1. To maintain the same false drop probability, F has to be increased.
• To compress each bit file, we store only the positions of the “1”s.
• Because the number of “1”s in a bit file is unpredictable, we store the positions in buckets of size Bp.

• Differences from inversion
  » The directory (hash table) is sparse.
  » The actual word is stored nowhere.
  » The structure is simple.
• Search
  » Hash the word to obtain its bucket address, e.g. h(“base”) = 30.
  » Obtain the pointers to the relevant documents from the buckets.
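A compressed-bit-slice index can be sketched as a hash table whose slots hold document pointers. This is a simplified in-memory illustration: a Python list stands in for the chain of fixed-size buckets, `zlib.crc32` stands in for the hash function h, and the class and parameter names are chosen here.

```python
import zlib
from collections import defaultdict


class CompressedBitSlices:
    def __init__(self, S):
        self.S = S                              # size of the (sparse) directory
        self.buckets = defaultdict(list)        # slot -> positions of the 1s (doc ids)

    def _h(self, word):
        return zlib.crc32(word.encode()) % self.S

    def insert(self, doc_id, words):
        # m = 1: each distinct word sets a single bit, i.e. adds one pointer to one slot.
        for slot in {self._h(w) for w in words}:
            self.buckets[slot].append(doc_id)

    def search(self, word):
        # The word itself is stored nowhere, so colliding words cause false drops.
        return list(self.buckets.get(self._h(word), []))
```

With a large S the directory is mostly empty and collisions are rare; shrinking S saves directory space but makes unrelated words share a slot.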

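Doubly Compressed Bit Slices
• Idea: compress the sparse directory.
• As the directory size S shrinks, the chance that words collide grows, so intermediate buckets are used.
• To tell true collisions from false ones, a second hash function is added. It distinguishes synonyms only partially.
• Example: h1(“base”) = 30, h2(“base”) = 011.
• Follow the pointers of the posting buckets to retrieve the qualifying documents.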
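The two-level lookup can be sketched as below. This is an illustration under several assumptions: both hash functions are salted `zlib.crc32` calls, the intermediate bucket is a Python dict keyed by the h2 code, and posting buckets are plain lists.

```python
import zlib
from collections import defaultdict


class DoublyCompressedBitSlices:
    def __init__(self, S, code_bits):
        self.S = S                               # small directory
        self.code_bits = code_bits               # length of the h2 code
        self.directory = defaultdict(dict)       # h1 slot -> {h2 code -> postings}

    def _h1(self, word):
        return zlib.crc32(b"h1:" + word.encode()) % self.S

    def _h2(self, word):
        return zlib.crc32(b"h2:" + word.encode()) % (1 << self.code_bits)

    def insert(self, doc_id, words):
        for slot, code in {(self._h1(w), self._h2(w)) for w in words}:
            self.directory[slot].setdefault(code, []).append(doc_id)

    def search(self, word):
        # Words that agree on both h1 and h2 are still indistinguishable.
        bucket = self.directory.get(self._h1(word), {})
        return list(bucket.get(self._h2(word), []))
```

Longer h2 codes separate more synonyms at the cost of larger intermediate buckets; with zero code bits the structure degenerates to a single-hash directory.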

No False Drops Method
• Goal: distinguish between synonyms completely.
• Method: use a pointer to the word in the text file.

Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion
[Figure: signature matrix of the documents split into horizontal groups]

Partitioned Signature Files
• A portion of a document signature is used as a signature key to partition the signature file.
• All signatures with the same key are grouped into a so-called “module”.
• When a query signature arrives,
  » examine its signature key and look for the corresponding modules
  » scan all the signatures within the modules that have been selected
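The partitioning scheme can be sketched as follows. Taking the first `key_bits` bits of the signature as the key is an assumption of this sketch (the slides only say “a portion”), and the third document signature is made up; the first two reuse the earlier “free”/“text” example.

```python
from collections import defaultdict


def covers(sig, query):
    """True if `sig` has a 1 wherever `query` has a 1 (both are bit strings)."""
    return all(s == "1" for s, q in zip(sig, query) if q == "1")


def build_modules(signatures, key_bits):
    """Group (doc id, signature) pairs into modules by their signature key."""
    modules = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        modules[sig[:key_bits]].append((doc_id, sig))
    return modules


def search(modules, query, key_bits):
    """Scan only the modules whose key covers the key of the query signature."""
    query_key = query[:key_bits]
    hits = []
    for key, members in modules.items():
        if covers(key, query_key):
            hits.extend(doc for doc, sig in members if covers(sig, query))
    return sorted(hits)


DOCS = ["001010111011", "000010101001", "110000000011"]
MODULES = build_modules(DOCS, key_bits=3)
```

A query for “free” (001000110010) has key 001, so only the module with key 001 is scanned and document 0 is returned; a query whose key is all zeros still has to scan every module.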