EngiNet™ WARNING: All rights reserved. No part of this video lecture series may be reproduced in any form or by any electronic or mechanical means, including the use of information storage and retrieval systems, without written approval from the copyright owner. © 1999 The Research Foundation of the State University of New York
Binghamton University, EngiNet™, State University of New York
CS 533 Information Retrieval Dr. Michal Cutler Lecture #15 March 22, 1999
Signature files
- Signature files: main idea
- How they are created
- How they are searched
Main Idea
- Signature files act as an inexact filter
- They discard many of the nonqualifying items
- Documents are stored sequentially in a “text file”
- Signatures are stored in a signature file
- A document is retrieved when the query signature is “included” in its signature
- “Included”: the document signature contains a 1 for each 1 in the query signature
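The inclusion test above reduces to a single bitwise operation. A minimal sketch, assuming signatures are held as Python integers; the function name `includes` and the sample bit patterns are illustrative, not from the lecture:

```python
def includes(doc_sig: int, query_sig: int) -> bool:
    """True when doc_sig has a 1 in every position where query_sig has a 1."""
    return (doc_sig & query_sig) == query_sig


doc_sig = 0b001010111011                  # illustrative 12-bit document signature
print(includes(doc_sig, 0b000010101001))  # every query bit is covered
print(includes(doc_sig, 0b100000000001))  # the leftmost query bit is missing
```

A signature that fails this test can be discarded without reading the text; one that passes is only a candidate, because the filter is inexact.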
Used in:
- PC-based medium-size DBs
- WORMs (write once, read many optical disks)
- Parallel machines
- Distributed text databases
Signature files versus inverted files
- Signature files: about 10-15% of the “text file” (Frakes) (30-70%?)
- Inverted files: commonly between 50-300%
- Comparable to compressed inverted files
- Insertion is easier: append to the end of the signature file
- Promotes concurrency (insert while reading)
- Retrieval is slow because the whole file needs to be read
- However, there are techniques to make it more efficient
Superimposing codes
- Document divided into “logical blocks” of D words
- Each word yields a “word signature”
- Bit pattern of F bits:
  - m 1s
  - rest set to 0
- Each 1 determined by a different hash function
- Need good hash functions that distribute different words into different signatures
- F, m, and D are design parameters
- They depend on the number of false hits allowed
Why is m fixed?
- Adding signatures with i < m 1s increases the probability of false hits
- Let w be a word with i < m 1s in its signature
- Each query signature in which those i bits are 1 will retrieve w
- Example: signature for “computer” is (1000…)
Signature file generation
- Document divided into blocks
- Words extracted
- Stop words may be deleted
- Words may be stemmed
- Dictionary created
- Word signature generated for each new word in the dictionary
- Word signatures of a block ORed into a block signature
- Block signatures concatenated into a document signature
- Document signatures concatenated into the signature file
EXAMPLE
D=2, F=12, and m=4.
Word              Signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011
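The generation steps can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's method: the m hash functions are simulated by salting SHA-256 with a counter, and the names `word_signature` and `block_signature` are hypothetical. The bit patterns it produces will therefore differ from the ones in the example.

```python
import hashlib


def word_signature(word: str, F: int = 12, m: int = 4) -> int:
    """Return an F-bit signature with exactly m bits set (requires m <= F)."""
    positions = set()
    i = 0
    while len(positions) < m:
        # Assumption: a counter-salted SHA-256 stands in for the i-th hash function.
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % F)
        i += 1
    sig = 0
    for pos in positions:
        sig |= 1 << pos
    return sig


def block_signature(words, F: int = 12, m: int = 4) -> int:
    """Superimpose (OR) the word signatures of one logical block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, F, m)
    return sig


block = block_signature(["free", "text"])   # D = 2 words per block
print(format(word_signature("free"), "012b"))
print(format(word_signature("text"), "012b"))
print(format(block, "012b"))
```

Retrying with a new salt until m distinct positions are found keeps m fixed for every word, which is the property the "Why is m fixed?" slide asks for.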
Files
[Figure: the signature file (N block signatures of F bits each), the pointer file, and the text file.]
Retrieval for one-term query
- Query word translated into a query-word signature
- Signature of each block examined
- Document retrieved if:
  - a block signature has a 1 for each 1 in the query-word signature
- A retrieved document may not contain the query term
- This is called a “false hit”
- To eliminate false hits, the actual text file can be searched
- Often, to save time, this is not done
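A one-term search with optional false-hit elimination might look like the sketch below. The parameters `F = 64` and `M = 4`, the salted SHA-256 hashing, and the function names are all assumptions made for illustration; the "text file" is modelled as a list of word lists.

```python
import hashlib

F, M = 64, 4  # illustrative design parameters, not values from the lecture


def word_signature(word: str) -> int:
    """F-bit signature with exactly M bits set, chosen by salted hashing."""
    sig, i = 0, 0
    while bin(sig).count("1") < M:
        h = hashlib.sha256(f"{i}:{word}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % F)
        i += 1
    return sig


def build(blocks):
    """blocks: list of word lists. Returns one superimposed signature per block."""
    sigs = []
    for words in blocks:
        s = 0
        for w in words:
            s |= word_signature(w)
        sigs.append(s)
    return sigs


def search(term, blocks, sigs, verify=True):
    """Return indices of blocks whose signature includes the term's signature."""
    q = word_signature(term)
    candidates = [i for i, s in enumerate(sigs) if (s & q) == q]
    if verify:
        # Scanning the text removes false hits, at the cost of extra reads.
        candidates = [i for i in candidates if term in blocks[i]]
    return candidates


blocks = [["free", "text"], ["signature", "file"]]
sigs = build(blocks)
print(search("text", blocks, sigs))                # verified hits only
print(search("text", blocks, sigs, verify=False))  # may contain false hits
```

With `verify=False` the result is always a superset of the verified result: the filter can report false hits but never misses a block that contains the term.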
Block size
- For search efficiency, each document should be a single block
- To minimize false hits, blocks should be small
Queries: document is one block
- For an AND query:
  - A query signature is created
  - It is compared with each document signature
  - If the document signature “includes” the query signature, the document is retrieved (may be a false hit)
- For an OR query:
  - One of the query-word signatures must be “included” in the document signature (may be a false hit)
  - Must be done a word at a time
Queries: document is many blocks
- Each word signature of an AND query must be included in at least one block signature
- For an OR query, the document is retrieved whenever a query-word signature is included in one of the block signatures
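The multi-block rules can be written directly as nested `all`/`any` tests. A minimal sketch with hand-picked 12-bit signatures; the function names and the second block's bit pattern are illustrative assumptions:

```python
def includes(block_sig: int, word_sig: int) -> bool:
    return (block_sig & word_sig) == word_sig


def matches_and(doc_blocks, query_word_sigs) -> bool:
    """AND: every query-word signature is included in at least one block."""
    return all(any(includes(b, w) for b in doc_blocks) for w in query_word_sigs)


def matches_or(doc_blocks, query_word_sigs) -> bool:
    """OR: some query-word signature is included in some block."""
    return any(any(includes(b, w) for b in doc_blocks) for w in query_word_sigs)


doc = [0b001010111011, 0b110000000110]   # two block signatures of one document
free = 0b001000110010
text = 0b000010101001
other = 0b100100000101                   # not included in either block

print(matches_and(doc, [free, text]))    # True
print(matches_and(doc, [free, other]))   # False
print(matches_or(doc, [other, text]))    # True
```

When a document is a single block, the AND case simplifies: the query-word signatures can be ORed into one query signature and tested once. The OR case still has to be tested a word at a time.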
NOT
- If the query-word signature is included in the document signature, the document is excluded (may be a false drop)
- Document retrieved if the query-word signature is not included in the document signature (OK)
Rank
- Term weights not easily available
- Coordination not easily available
Prefix and part-of-word search
- n-grams used
- Each word divided into successive overlapping triples
- Each triple hashed to a bit position
- When a word has l triplets with l > m, the l (non-distinct) bits are used
- If l < m, a random number generator is used to add bits to the encoding
Example
- “free” divided into “ fr”, “fre”, “ree”, “ee ”
- Each such triplet hashed to a bit position
- Search for prefix “compr” will find documents with “compression” and “decompress”
False hits
- A search for the term “retail”
- A document containing “retain detail”, which has the triples “ re”, “ret”, “eta”, “tai”, “ail”, will be retrieved
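The triplet scheme and its false hits can be reproduced with a short sketch. Assumptions: words are padded with one blank on each side, each triplet is hashed to a bit position with SHA-256, the padding to m bits with a random number generator is omitted, and a fragment query uses only the triplets fully inside the fragment. All names are hypothetical.

```python
import hashlib


def trigrams(word: str):
    """Successive overlapping triples of the blank-padded word."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def _bit(tri: str, F: int) -> int:
    h = hashlib.sha256(tri.encode()).digest()
    return int.from_bytes(h[:4], "big") % F


def ngram_signature(word: str, F: int = 64) -> int:
    """OR together one bit per triplet (no padding up to m bits in this sketch)."""
    sig = 0
    for tri in trigrams(word):
        sig |= 1 << _bit(tri, F)
    return sig


def fragment_signature(fragment: str, F: int = 64) -> int:
    """Signature of a word fragment: only triplets fully inside the fragment."""
    sig = 0
    for i in range(len(fragment) - 2):
        sig |= 1 << _bit(fragment[i:i + 3], F)
    return sig


print(trigrams("free"))   # [' fr', 'fre', 'ree', 'ee ']

# False hit: every triplet of "retail" occurs in "retain" or in "detail".
block = ngram_signature("retain") | ngram_signature("detail")
q = ngram_signature("retail")
print((block & q) == q)   # True

frag = fragment_signature("compr")
print((ngram_signature("decompress") & frag) == frag)   # True
```

Because the fragment signature ignores the leading blank, it matches "decompress" as well as "compression", which agrees with the example on the slide.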
Size of the signature file
- Typically m is between 6 and 12
- When m=8 and F=7200, the signature file for 742,358 TREC documents is 600-700 Mbytes
Improving the performance
1. Compressing the signature file
2. Vertical partitioning: only v of the F columns of the signature file are retrieved
3. Horizontal partitioning: grouping similar signatures so that only h of the N rows of the signature file are retrieved
Compressing the signature file
- Main idea: the 0 to 1 ratio is high
- Run-length compression
- Bit-block compression
Run-length compression
- Sequences of 0s followed by a 1 are represented by a number
- Compression is done on the numbers
- Example (F=28, m=4): 00001 001 0000001 01 00000000000 is represented as [4][2][6][1]
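A small sketch of the run-length step, assuming the sparse vector is given as a string of "0" and "1" characters. It only extracts and restores the run lengths; the further compression of the numbers themselves (for example with a variable-length code) is not shown. The function names are illustrative.

```python
def run_lengths(bits: str):
    """Lengths of the runs of 0s that precede each 1."""
    runs, count = [], 0
    for bit in bits:
        if bit == "1":
            runs.append(count)
            count = 0
        else:
            count += 1
    return runs   # trailing 0s are implied by the known length F


def restore(runs, F: int) -> str:
    """Rebuild the F-bit vector from its run lengths."""
    bits = "".join("0" * r + "1" for r in runs)
    return bits.ljust(F, "0")


vector = "0000100100000010100000000000"   # F = 28, m = 4
runs = run_lengths(vector)
print(runs)                               # [4, 2, 6, 1]
print(restore(runs, 28) == vector)        # True
```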
Bit-block compression
- Sparse vector divided into blocks of b bits
- Method effective when the 0 to 1 ratio is high enough
- Code composed of 3 parts:
  - Part 1: F/b bits, one for each block
    - 0 if all b bits of the block are 0
    - Otherwise 1
  - Part 2: for each nonzero block with s ones
    - s-1 1s and a terminating 0
    - For a block with s=4, part 2 will have 1110
  - Part 3: for each 1 in each block, a binary offset in $\lceil \log_2 b \rceil$ bits
    - Uses $m \lceil \log_2 b \rceil$ bits
Example: F=28, m=4 (blocks of b=4 bits)
Block    0000   1001    0000   0010   1000   0000   0000
Part 1   0      1       0      1      1      0      0      (7 bits)
Part 2          10             0      0                    (4 bits)
Part 3          00 11          10     00                   (8 bits)
The code: 0101100|1000|00111000 (19 bits)
Decoding: the first bit of part 1 is 0, so the first block is 0000. The second bit is 1, so check part 2: 10 means the block has two 1s. Part 3 gives the offsets 00 and 11, so the block is 1001. ...
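The three-part code can be checked with a short encoder and decoder. This is a sketch under assumptions: the vector is a string of "0"/"1" characters whose length is a multiple of b, b is at least 2, and the function names are hypothetical.

```python
from math import ceil, log2


def bit_block_encode(bits: str, b: int):
    """Return (part1, part2, part3) of the bit-block code as bit strings."""
    blocks = [bits[i:i + b] for i in range(0, len(bits), b)]
    width = ceil(log2(b))                       # bits per offset
    part1 = "".join("1" if "1" in blk else "0" for blk in blocks)
    part2 = "".join("1" * (blk.count("1") - 1) + "0"
                    for blk in blocks if "1" in blk)
    part3 = "".join(format(i, f"0{width}b")
                    for blk in blocks
                    for i, bit in enumerate(blk) if bit == "1")
    return part1, part2, part3


def bit_block_decode(part1: str, part2: str, part3: str, b: int) -> str:
    width = ceil(log2(b))
    out, p2, p3 = [], 0, 0
    for flag in part1:
        blk = ["0"] * b
        if flag == "1":
            s = 1
            while part2[p2] == "1":             # s-1 ones ...
                s += 1
                p2 += 1
            p2 += 1                             # ... then the terminating 0
            for _ in range(s):
                blk[int(part3[p3:p3 + width], 2)] = "1"
                p3 += width
        out.append("".join(blk))
    return "".join(out)


vector = "0000100100000010100000000000"         # F = 28, m = 4
p1, p2, p3 = bit_block_encode(vector, 4)
print(p1, p2, p3)                               # 0101100 1000 00111000
print(bit_block_decode(p1, p2, p3, 4) == vector)   # True
```

The three parts total 7 + 4 + 8 = 19 bits for the 28-bit vector, matching the example.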
Vertical partitioning: bit-sliced signature files
[Figure: the signatures of the N blocks stored as F bit-slice files, alongside the pointer file.]
- There is a file for each bit of the signature
- Word search involves reading m out of the F bit slices
- The m bit vectors are ANDed to get the retrieval result
- Insertions need F accesses
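A bit-sliced search can be sketched by storing each slice as one Python integer in which bit i belongs to block i. This in-memory layout and the function names are assumptions for illustration; in the lecture each slice is a separate file.

```python
def build_slices(block_sigs, F: int):
    """slices[j] has bit i set when block i has a 1 in signature position j."""
    slices = [0] * F
    for i, sig in enumerate(block_sigs):
        for j in range(F):
            if (sig >> j) & 1:
                slices[j] |= 1 << i
    return slices


def search(slices, query_sig: int, n_blocks: int):
    """AND together only the slices that correspond to 1s in the query."""
    result = (1 << n_blocks) - 1          # start with every block as a candidate
    for j, column in enumerate(slices):
        if (query_sig >> j) & 1:
            result &= column              # one slice read per query bit
    return [i for i in range(n_blocks) if (result >> i) & 1]


block_sigs = [0b001010111011, 0b110000000110, 0b000010101001]
slices = build_slices(block_sigs, F=12)
print(search(slices, 0b000010101001, n_blocks=3))   # [0, 2]
```

A word query with m ones touches only m of the F slices, while inserting one new block signature touches all F of them, which is the trade-off listed on the slide.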
Frame-sliced signature file
- Retrieval of m bit slices is slow because of disk accesses
- Generate F/s “frame signature files”, each containing s consecutive bits
- Each word signature falls in a small number, n, of frames
- For example, n=1
- To generate a word signature:
  - Choose n frames at random
  - Set m 1s in the selected frames
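For n=1, word-signature generation might be sketched as follows. One assumption goes beyond the slide: the "random" frame choice is made by hashing the word, so that the same word always selects the same frame at indexing time and at query time. The values F=64, s=8, m=4 and all names are illustrative.

```python
import hashlib


def _h(text: str) -> int:
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")


def frame_sliced_signature(word: str, F: int = 64, s: int = 8, m: int = 4):
    """n = 1: pick one of the F/s frames, then set m bits inside it (m <= s)."""
    frame = _h("frame:" + word) % (F // s)
    offsets, i = set(), 0
    while len(offsets) < m:
        offsets.add(_h(f"{i}:{word}") % s)
        i += 1
    sig = 0
    for off in offsets:
        sig |= 1 << (frame * s + off)
    return frame, sig


frame, sig = frame_sliced_signature("text")
print(frame, format(sig, "064b"))
```

A one-word query then needs to read only the frame file returned in `frame`, instead of m separate bit slices.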
Horizontal partitioning
- (Lee and Leng)
- A portion of the F bits serves as a key
- Example: the first 20 bits of the signature
- Block signatures with the same key are grouped into “modules”
- Key of the query signature used to find the module
- Module used for retrieval
Goal
- Compute the size, F, of a signature tolerating z false hits
- (We will use z=1)
Notation for computing F
- q: number of words in a typical AND query
- b: number of 1s in the query signature of a typical AND query
- N: number of documents
- f: number of unique term/document pairs in the collection
- B: maximum number of 1s in a document signature
- z: expected number of retrieved documents
- p: probability that a randomly chosen bit in a document signature is 1
Simplifying assumptions
- q is known (for example 1 or 2)
- b is known (for example 8)
- Each word sets b/q bits (out of b) to 1
- A document is a single block
- All documents have roughly the same number of unique terms
- Each term in each document selects its bits at random
Computing F: main idea
- First, we compute p as a function of z, N and b
- Second, we compute B as a function of f, N, b and q
- Third, we compute p as a function of F and B
- Finally, we compute F
P(N, b, z)
- Calculate the value of p assuming z documents are retrieved
- The probability that all b 1s of the query signature are 1 in a document signature is $p^b$
- $z = p^b N$ documents are expected to be retrieved out of N documents
- ($z = 2$ when $p^b = 1/4$ and $N = 8$)
- $p = (z/N)^{1/b}$
B(f, N, b, q)
- $B = (f/N)(b/q)$, where
  - f/N is the number of unique terms in a document and
  - b/q is the number of bits set to 1 by each term
P(B, F)
- $1 - p$ is the probability that a randomly selected bit in a document signature is 0
- A document signature with at most B 1s is generated by making B random selections and setting the selected bits to 1
- So a bit in a document signature is 0 if it was not selected in any of the B selections
- The probability that a bit is not selected in one selection is $(F-1)/F$
- The probability that it is not selected in any of the B selections is $((F-1)/F)^B$
- So $1 - p = ((F-1)/F)^B$
- Therefore $p = 1 - ((F-1)/F)^B$
Computing F
- $F = 1/(1 - (1-p)^{1/B})$, where $p = (z/N)^{1/b}$
Example
- Let q=1, b=8, z=1, N=742,358 and f=136,010,026
- $p = (1/742358)^{1/8} = 0.185$
- $B = (136010026/742358) \cdot 8 = 1466$
- $F = 1/(1 - (1 - p)^{1/B}) \approx 7184$ (using the unrounded values of p and B)
- $F \times N = 7184 \times 742358 / 8$ bytes $\approx 636$ Mbytes
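The arithmetic of this example can be reproduced with a few lines. A sketch only: the function name `signature_size` is hypothetical, the formulas are the ones derived on the preceding slides, and Mbytes are taken as $2^{20}$ bytes.

```python
def signature_size(N: int, f: int, b: int = 8, q: int = 1, z: float = 1):
    """Return (p, B, F) from the formulas p=(z/N)^(1/b), B=(f/N)(b/q),
    F = 1/(1-(1-p)^(1/B))."""
    p = (z / N) ** (1 / b)
    B = (f / N) * (b / q)
    F = 1 / (1 - (1 - p) ** (1 / B))
    return p, B, F


N, f = 742_358, 136_010_026
p, B, F = signature_size(N, f)
size_mbytes = F * N / 8 / 2**20   # F bits per document, 8 bits per byte

print(round(p, 3), round(B), round(F), round(size_mbytes))
```

Plugging in the rounded values p = 0.185 and B = 1466 instead gives a slightly smaller F (about 7167), so the figure of 7184 depends on carrying the unrounded intermediate results.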