EngiNet™ WARNING: All rights reserved. No part of this video lecture series may be reproduced in any form or by any electronic or mechanical means, including the use of information storage and retrieval systems, without written approval from the copyright owner. © 1999 The Research Foundation of the State University of New York

Binghamton University EngiNet™, State University of New York

CS 533 Information Retrieval Dr. Michal Cutler Lecture #15 March 22, 1999

Signature files
  • Signature files: main idea
  • How they are created
  • How they are searched

Main Idea
  • A signature file is an inexact filter: it discards many of the nonqualifying items.
  • Documents are stored sequentially in a “text file”.
  • Their signatures are stored in a signature file.
  • A document is retrieved when the query's signature is “included” in its signature.
  • “Included”: the document signature contains a 1 for each 1 in the query's signature.
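
The inclusion test can be sketched with integers used as bit vectors. This is a minimal illustration, not the lecture's code; the function name `includes` and the sample bit patterns are assumptions chosen for the example.

```python
def includes(doc_sig: int, query_sig: int) -> bool:
    """True if doc_sig has a 1 in every position where query_sig has a 1."""
    return (doc_sig & query_sig) == query_sig


doc_sig = 0b001_010_111_011                   # sample 12-bit document signature

print(includes(doc_sig, 0b001_000_110_010))   # every query 1 is covered
print(includes(doc_sig, 0b100_000_000_000))   # leftmost bit is missing in doc_sig
```

A single AND plus a comparison per signature is what makes the filter cheap; the price is that a match is only a candidate, not a guaranteed hit.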

Used in:
  • PC-based medium-size DBs
  • WORMs (write-once, read-many optical disks)
  • Parallel machines
  • Distributed text databases

Signature files versus inverted files
  • Signature files take about 10-15% of the size of the “text file” (Frakes) (30-70%?).
  • Inverted files are commonly between 50% and 300%.
  • Comparable to compressed inverted files.
  • Insertion is easier: new signatures are appended to the end of the signature file.
  • This promotes concurrency (insert while reading).
  • Retrieval is slow because the whole file needs to be read.
  • However, there are techniques to make it more efficient.

Superimposing codes
  • The document is divided into “logical blocks” of D words.
  • Each word yields a “word signature”: a bit pattern of F bits,
      ◦ m bits set to 1,
      ◦ the rest set to 0.
  • Each 1 is determined by a different hash function.
  • Good hash functions are needed, ones that distribute different words into different signatures.
  • F, m, and D are design parameters.
  • They depend on the number of false hits allowed.
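
A word signature along these lines can be sketched as follows. The lecture does not name its hash functions, so this sketch makes two assumptions: SHA-256 applied to a seed-prefixed word stands in for the family of "different hash functions", and a further seed is tried whenever two hashes land on the same position, so that exactly m bits end up set. The name `word_signature` is hypothetical.

```python
import hashlib


def word_signature(word: str, F: int = 12, m: int = 4) -> int:
    """Return an F-bit integer with exactly m bits set, derived from `word`."""
    if not 0 < m <= F:
        raise ValueError("need 0 < m <= F")
    sig = 0
    seed = 0
    while bin(sig).count("1") < m:
        # Each seed value plays the role of a different hash function.
        digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % F
        sig |= 1 << position      # a repeated position leaves sig unchanged
        seed += 1
    return sig


print(format(word_signature("free"), "012b"))
print(format(word_signature("text"), "012b"))
```

Because the signature depends only on the word, the same function serves both for building the file and for translating query words.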

Why is m fixed?
  • Adding signatures with i < m 1s increases the probability of false hits.
  • Let w be a word with i < m 1s in its signature.
  • Each query-signature where the i bits are one will retrieve w.
  • Example: the signature for “computer” is (1000…).

Signature file generation
  • The document is divided into blocks.
  • Words are extracted.
  • Stop words may be deleted.
  • Words may be stemmed.
  • A dictionary is created.
  • A word-signature is generated for each new word in the dictionary.
  • The word-signatures of a block are ORed to form the block-signature.
  • Block signatures are concatenated to form the document-signature.
  • Document signatures are concatenated to form the signature file.

Example
  D=2, F=12, and m=4.

  word               signature
  free               001 000 110 010
  text               000 010 101 001
  block signature    001 010 111 011
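
The example can be checked with a few lines of Python. The two word signatures are read off the slide; the last group of the signature for “free” is taken as 010, the value that gives m=4 ones and agrees with the block signature shown.

```python
F = 12
free = 0b001_000_110_010   # word signature of "free"
text = 0b000_010_101_001   # word signature of "text"

# Superimposed coding: the block signature is the bitwise OR of its word signatures.
block = free | text

print(format(block, f"0{F}b"))   # prints 001010111011
```

Note that the block signature has 7 ones rather than 8: the two words share one bit position, which is exactly the kind of overlap that later causes false hits.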

Files
  • [Figure: the signature file (N blocks, F bits each), the pointer file, and the text file.]

Retrieval for a one-term query
  • The query word is translated into a query-word-signature.
  • The signature of each block is examined.
  • A document is retrieved if a block-signature has a 1 for each 1 in the query-word-signature.
  • A retrieved document may not contain the query term.
  • This is called a “false hit”.
  • To eliminate false hits, the actual text file can be searched.
  • Often, to save time, this is not done.
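
The scan, and the optional check against the text, might look like the sketch below. The signatures for “free” and “text” follow the lecture's F=12, m=4 example; the signature for “data” is hypothetical and was chosen so that it collides with the block “free text”. The names `SIGNATURES`, `block_signature`, and `search` are likewise assumptions made for illustration.

```python
# Word signatures with F=12, m=4. "data" is a made-up signature that collides.
SIGNATURES = {
    "free": 0b001_000_110_010,
    "text": 0b000_010_101_001,
    "data": 0b001_010_000_011,
}


def block_signature(words):
    """OR together the word signatures of one block."""
    sig = 0
    for word in words:
        sig |= SIGNATURES[word]
    return sig


def search(term, text_blocks, signature_file, verify=False):
    """Return indices of blocks whose signature includes the term's signature."""
    query = SIGNATURES[term]
    hits = []
    for i, block_sig in enumerate(signature_file):
        if (block_sig & query) == query:                # signature test
            if not verify or term in text_blocks[i]:    # optional text check
                hits.append(i)
    return hits


text_blocks = [["free", "text"], ["data", "text"]]
signature_file = [block_signature(words) for words in text_blocks]

candidates = search("data", text_blocks, signature_file)
confirmed = search("data", text_blocks, signature_file, verify=True)
print(candidates, confirmed)   # prints [0, 1] [1]
```

Block 0 is a false hit for “data”: its signature covers all four query bits even though the word is absent, and only the text check removes it.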

Block size
  • For search efficiency, each document should be a single block.
  • To minimize false hits, blocks should be small.

Queries: document is one block
  • For an AND query:
      ◦ A query-signature is created.
      ◦ It is compared with each document-signature.
      ◦ If the document signature “includes” the query-signature, the document is retrieved (maybe a false hit).
  • For an OR query:
      ◦ One of the query-word-signatures must be “included” in the document signature (maybe a false hit).
      ◦ This must be done a word at a time.

Queries: document has many blocks
  • Each word-signature of an AND query must be included in at least one block signature.
  • For an OR query, the document is retrieved whenever a query-word-signature is included in one of the block signatures.
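
The AND and OR rules for multi-block documents can be sketched as two small predicates. A document is represented here as a list of block signatures, so a single-block document is just a list of length one. The function names and the sample signatures other than “free” are assumptions for illustration.

```python
def included(block_sig: int, word_sig: int) -> bool:
    return (block_sig & word_sig) == word_sig


def matches_and(doc_blocks, query_word_sigs) -> bool:
    """Every query word signature is included in at least one block signature."""
    return all(any(included(b, w) for b in doc_blocks) for w in query_word_sigs)


def matches_or(doc_blocks, query_word_sigs) -> bool:
    """Some query word signature is included in some block signature."""
    return any(included(b, w) for w in query_word_sigs for b in doc_blocks)


doc = [0b001_010_111_011, 0b110_000_000_101]   # two block signatures (F=12)
free = 0b001_000_110_010                        # included in the first block
other = 0b110_000_000_100                       # hypothetical, in the second block
absent = 0b100_100_000_000                      # hypothetical, in neither block

print(matches_and(doc, [free, other]))    # both words found, in different blocks
print(matches_and(doc, [free, absent]))   # one word missing
print(matches_or(doc, [free, absent]))    # one word is enough
```

For a single-block document, testing each word of an AND query separately gives the same answer as ORing the query word signatures into one query-signature and testing that once.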

NOT
  • If the query-word-signature is included in the document signature, the document is excluded (may be a false drop).
  • The document is retrieved if the query-word-signature is not in the document (OK).

Rank
  • Term weights are not easily available.
  • Coordination is not easily available.

Prefix and part-of-word search
  • n-grams are used.
  • Each word is divided into successive overlapping triples.
  • Each triple is hashed to a bit position.
  • When a word has l triples with l > m, the l (non-distinct) bits are used.
  • If l < m, a random number generator is used to add bits to the encoding.

Example
  • “free” is divided into “␣fr”, “fre”, “ree”, “ee␣” (␣ marks the blank before or after the word).
  • Each such triple is hashed to a bit position.
  • A search for the prefix “compr” will find documents with “compression” and “decompress”.

False hits
  • Consider a search for the term “retail”.
  • A document containing “retain detail”, which has the triples “␣re”, “ret”, “eta”, “tai”, “ail”, will be retrieved.
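
Both examples can be reproduced with a short sketch. Several choices here are assumptions, not taken from the lecture: words are padded with one blank on each side before being cut into triples, SHA-256 stands in for the hash function, F=64 is arbitrary, part-of-word queries are cut into triples without padding, and the rules for l > m and l < m are left out.

```python
import hashlib


def triples(word: str, pad: bool = True):
    """Successive overlapping 3-character substrings of the (padded) word."""
    s = f" {word} " if pad else word
    return [s[i:i + 3] for i in range(len(s) - 2)]


def bit_position(triple: str, F: int) -> int:
    digest = hashlib.sha256(triple.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % F


def signature(word: str, F: int = 64, pad: bool = True) -> int:
    sig = 0
    for t in triples(word, pad):
        sig |= 1 << bit_position(t, F)
    return sig


print(triples("free"))   # [' fr', 'fre', 'ree', 'ee ']

# Part-of-word search: the triples of "compr" occur in both words.
query = signature("compr", pad=False)
found = [w for w in ("compression", "decompress")
         if (signature(w) & query) == query]

# False hit: "retain detail" contains every triple of "retail".
block = signature("retain") | signature("detail")
retail = signature("retail")
false_hit = (block & retail) == retail
print(found, false_hit)
```

The false hit here does not depend on hash collisions: every triple of “retail” really does occur in one of the two words, so any hash function gives the same result.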

Size of the signature file
  • Typically m is between 6 and 12.
  • When m=8 and F=7200, the signature file for 742,358 TREC documents is 600-700 Mbytes.

Improving the performance
  1. Compressing the signature file.
  2. Vertical partitioning: only v of the F columns of the signature file are retrieved.
  3. Horizontal partitioning: similar signatures are grouped, so only h of the N rows of the signature file are retrieved.

Compressing the signature file
  • Main idea: the ratio of 0s to 1s is high.
  • Run-length compression
  • Bit-block compression

Run-length compression
  • Each sequence of 0s followed by a 1 is represented by a number (the length of the run of 0s).
  • Compression is done on the numbers.
  • Example (F=28, m=4): 00001 001 0000001 01 00000000000 is represented as [4][2][6][1].
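
A minimal sketch of the run-length step is given below. It assumes the F=28, m=4 vector whose runs are [4][2][6][1], with the trailing zeros recovered from the known length F; the further coding of the numbers themselves is not shown. The function names are assumptions.

```python
def run_lengths(bits: str):
    """Length of each run of 0s that is terminated by a 1."""
    runs, zeros = [], 0
    for bit in bits:
        if bit == "0":
            zeros += 1
        else:
            runs.append(zeros)
            zeros = 0
    return runs            # trailing 0s are implied by the known length F


def expand(runs, F: int) -> str:
    """Rebuild the F-bit vector from its run lengths."""
    bits = "".join("0" * r + "1" for r in runs)
    return bits.ljust(F, "0")


vector = "0000 1001 0000 0010 1000 0000 0000".replace(" ", "")
runs = run_lengths(vector)
print(runs)   # prints [4, 2, 6, 1]
```

With only m ones per signature, the list has m small numbers instead of F bits, which is where the saving comes from when the signature is sparse.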

Bit-Block Compression
  • The sparse vector is divided into blocks of b bits.
  • The method is effective when the ratio of 0s to 1s is high enough.
  • The code is composed of 3 parts.
  • Part 1: F/b bits, one for each block.
      ◦ 0 if all b bits of the block are 0.
      ◦ Otherwise 1.
  • Part 2: for each block with s ones,
      ◦ s-1 1s and a terminating 0.
      ◦ For a block with s=4, part 2 will have 1110.
  • Part 3: for each 1 in each block, a binary offset in $\lceil \log_2 b \rceil$ bits.
      ◦ Uses $m \lceil \log_2 b \rceil$ bits.

Example
  F=28, m=4, blocks of b=4 bits.

  block     0000  1001   0000  0010  1000  0000  0000
  Part 1    0     1      0     1     1     0     0       (7 bits)
  Part 2          10           0     0                   (4 bits)
  Part 3          00 11        10    00                  (8 bits)

  • The code: 0101100|1000|00111000 (19 bits).
  • Decoding: the first bit of part 1 is 0, so the block is 0000. The second is 1, so check part 2: 10 means two 1s. Part 3 has offsets 00 and 11, so the block is 1001, and so on.
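
The three-part code can be sketched as an encoder and a matching decoder. The sample vector 0000 1001 0000 0010 1000 0000 0000 is the one implied by the three parts of the example; the function names are assumptions, and the sketch expects b >= 2 and a vector whose length is a multiple of b.

```python
from math import ceil, log2


def bit_block_encode(bits: str, b: int):
    width = ceil(log2(b))                  # bits per offset in part 3
    part1, part2, part3 = [], [], []
    for start in range(0, len(bits), b):
        block = bits[start:start + b]
        offsets = [i for i, bit in enumerate(block) if bit == "1"]
        if not offsets:
            part1.append("0")              # all-zero block
            continue
        part1.append("1")
        part2.append("1" * (len(offsets) - 1) + "0")   # s-1 ones, then a 0
        part3.extend(format(i, f"0{width}b") for i in offsets)
    return "".join(part1), "".join(part2), "".join(part3)


def bit_block_decode(part1: str, part2: str, part3: str, b: int) -> str:
    width = ceil(log2(b))
    out, i2, i3 = [], 0, 0
    for flag in part1:
        block = ["0"] * b
        if flag == "1":
            s = 1
            while part2[i2] == "1":        # count the 1s before the terminating 0
                s += 1
                i2 += 1
            i2 += 1                        # skip the terminating 0
            for _ in range(s):
                block[int(part3[i3:i3 + width], 2)] = "1"
                i3 += width
        out.append("".join(block))
    return "".join(out)


vector = "0000 1001 0000 0010 1000 0000 0000".replace(" ", "")
parts = bit_block_encode(vector, b=4)
print("|".join(parts))   # prints 0101100|1000|00111000
```

Decoding the three parts with `bit_block_decode(*parts, b=4)` gives back the original 28-bit vector, following the same steps as the decoding walk-through.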

Vertical partitioning: bit-sliced signature files
  • [Figure: a pointer file and F bit-slice files, each holding one bit for each of the N blocks.]
  • There is a file for each bit of the signature.
  • A word search involves reading m out of F bit slices.
  • The m bit vectors are ANDed to get the retrieval result.
  • Insertions need F accesses.
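
A bit-sliced layout can be sketched in memory by storing each slice as one integer with a bit per block. This is an illustration only: real bit slices would be separate files on disk, and the names `build_slices` and `search_slices` and the sample signatures are assumptions.

```python
def build_slices(block_sigs, F: int):
    """slices[j] has bit i set when block i has signature bit j set."""
    slices = [0] * F
    for i, sig in enumerate(block_sigs):
        for j in range(F):                 # an insertion touches all F slices
            if (sig >> j) & 1:
                slices[j] |= 1 << i
    return slices


def search_slices(slices, query_sig: int, n_blocks: int):
    """AND together only the slices selected by the 1s of the query signature."""
    result = (1 << n_blocks) - 1           # start with every block as a candidate
    for j, bit_slice in enumerate(slices):
        if (query_sig >> j) & 1:           # only m of the F slices are read
            result &= bit_slice
    return [i for i in range(n_blocks) if (result >> i) & 1]


block_sigs = [0b001_010_111_011, 0b001_010_101_011]   # N=2 blocks, F=12
slices = build_slices(block_sigs, F=12)

print(search_slices(slices, 0b001_000_110_010, n_blocks=2))   # prints [0]
print(search_slices(slices, 0b000_010_101_001, n_blocks=2))   # prints [0, 1]
```

The result equals what a row-by-row inclusion test would return; the slicing only changes which bits have to be read.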

Frame-sliced signature file
  • Retrieval of m bit-slices is slow because of disk accesses.
  • Generate F/s “frame signature files”, each containing s consecutive bits.
  • Each word signature falls in a small number, n, of frames (for example, n=1).
  • To generate a word-signature:
      ◦ choose n frames at random,
      ◦ set m 1s in the selected frames.
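
One way to generate such a word signature is sketched below. The slide does not say how the random choice is made repeatable, so this sketch assumes a pseudo-random generator seeded with the word itself, and it assumes the m ones are spread over the n selected frames taken together. The parameter defaults and the function name are hypothetical.

```python
import random


def frame_sliced_signature(word: str, F: int = 12, s: int = 4,
                           n: int = 1, m: int = 2):
    """Return (signature, frames): m bits set, all inside n chosen frames."""
    if m > n * s:
        raise ValueError("m 1s do not fit in n frames of s bits")
    rng = random.Random(word)                     # same word, same choices
    frames = rng.sample(range(F // s), n)         # choose n of the F/s frames
    candidate_bits = [f * s + k for f in frames for k in range(s)]
    sig = 0
    for position in rng.sample(candidate_bits, m):
        sig |= 1 << position
    return sig, sorted(frames)


sig, frames = frame_sliced_signature("free")
print(format(sig, "012b"), frames)
```

With n=1 a one-word query needs only the single frame file that the word hashes to, instead of m separate bit slices.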

Horizontal partitioning
  • (Lee and Leng)
  • A portion of the F bits serves as a key (for example, the first 20 bits of F).
  • Block-signatures with the same key are grouped into “modules”.
  • The key of the query-signature is used to find the module.
  • The module is used for retrieval.

Goal
  • Compute the size, F, of a signature tolerating z false hits.
  • (We will use z=1.)

Notation for computing F
  • q: number of words in a typical AND query
  • b: number of 1s in the query-signature of a typical AND query
  • N: number of documents
  • f: number of unique term/document pairs in the collection
  • B: maximum number of 1s in a document-signature
  • z: expected number of retrieved documents
  • p: probability that a randomly chosen bit in a document's signature is 1

Simplifying assumptions
  • q is known (for example, 1 or 2).
  • b is known (for example, 8).
  • Each word sets b/q bits (out of b) to 1.
  • A document is a single block.
  • All documents have roughly the same number of unique terms.
  • Each term in each document selects its bits at random.

Computing F: main idea
  • First, we compute p as a function of z, N and b.
  • Second, we compute B as a function of f, N, b and q.
  • Third, we compute p as a function of F and B.
  • Finally, we compute F.

p(N, b, z)
  • Calculate the value of p, assuming z documents are retrieved.
  • The probability that all b 1s of the query-signature are 1 in a document-signature is $p^b$.
  • So $z = p^b N$ documents are expected to be retrieved out of N documents.
  • (For example, z = 2 when $p^b = 1/4$ and N = 8.)
  • Solving for p: $p = (z/N)^{1/b}$.

B(f, N, b, q)
  • $B = (f/N)(b/q)$, where
      ◦ f/N is the number of unique terms in a document, and
      ◦ b/q is the number of bits set to 1 by each term.

p(B, F)
  • 1 - p is the probability that a randomly selected bit in a document's signature is 0.
  • A document-signature with at most B 1s is generated by making B random selections and setting the selected bits to 1.
  • So a bit in a document's signature is 0 if it was not selected in any of the B selections.
  • The probability that a bit is not selected once is $(F-1)/F$.
  • The probability that it is not selected B times is $((F-1)/F)^B$.
  • So $1 - p = ((F-1)/F)^B$.
  • Therefore $p = 1 - ((F-1)/F)^B$.

Computing F
  • $F = 1/(1 - (1-p)^{1/B})$, where $p = (z/N)^{1/b}$.

Example
  • Let q = 1, b = 8, z = 1, N = 742,358 and f = 136,010,026.
  • $p = (1/742358)^{1/8} = 0.185$
  • $B = (136010026/742358) \cdot 8 = 1466$
  • $F = 1/(1 - (1-p)^{1/B}) \approx 7184$ (computed with the unrounded p and B)
  • Signature file size: $F \times N = 7184 \times 742358 / 8$ bytes $= 636$ Mbytes
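
The arithmetic of this example can be reproduced with a short script. It applies the three formulas of the derivation in order; the function name `signature_size` is an assumption, and one Mbyte is taken as 2^20 bytes, which is the convention that reproduces the 636 Mbytes figure.

```python
def signature_size(N: int, f: int, b: int = 8, q: int = 1, z: float = 1):
    """Return (p, B, F) for a collection of N documents and f term/doc pairs."""
    p = (z / N) ** (1 / b)              # from z = p^b * N
    B = (f / N) * (b / q)               # unique terms per document * bits per term
    F = 1 / (1 - (1 - p) ** (1 / B))    # from 1 - p = ((F - 1) / F)^B
    return p, B, F


N, f = 742_358, 136_010_026
p, B, F = signature_size(N, f)
size_mbytes = round(F) * N / 8 / 2**20

print(f"p = {p:.3f}, B = {B:.0f}, F = {F:.0f}, size = {size_mbytes:.0f} Mbytes")
```

Plugging the rounded values 0.185 and 1466 into the last formula gives roughly 7167 rather than 7184, so the figure is sensitive to rounding in p.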