Ranking with Index
Rong Jin

  • Slides: 69

Inverted Index
• Which plays of Shakespeare are related to Brutus and Calpurnia?

Inverted Index
• A simple approach: linear scan, computing a score for each document
  - Assume idf(Brutus) = idf(Calpurnia) = 1
  - Slow for large collections
(Figure: per-document values 1, 2, 0, 1, 0, 0)
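The linear scan above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the toy documents, the whitespace tokenizer, and the function name `linear_scan` are assumptions, and the score is simply term frequency times idf with idf defaulting to 1, as on the slide.

```python
from collections import Counter

def linear_scan(docs, query_terms, idf=None):
    """Score every document: sum over query terms of tf(term, doc) * idf(term)."""
    idf = idf or {}
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(text.lower().split())
        # Every document is visited, even those containing no query term.
        scores[doc_id] = sum(tf[t] * idf.get(t, 1.0) for t in query_terms)
    return scores

# Hypothetical toy collection, not the plays from the slides.
docs = {
    "d1": "brutus killed caesar",
    "d2": "calpurnia warned caesar and brutus ignored calpurnia",
    "d3": "the tempest has no romans",
}
scores = linear_scan(docs, ["brutus", "calpurnia"])
```

The cost grows with the total collection size regardless of how few documents match, which is the weakness the inverted index addresses.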

Inverted Index
• Only three plays of Shakespeare contain “Brutus” or “Calpurnia”
• Inverted index: quickly find the list of documents that contain any of the query words

Inverted Index
• For each term t, we store a list of all documents that contain t.
(Figure: dictionary and postings)
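A minimal sketch of this structure, assuming a plain Python dict as the dictionary and sorted lists of document numbers as the postings. The toy documents and the name `build_index` are illustrative assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of the document numbers containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorting gives document-ordered postings lists.
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection.
docs = {
    1: "brutus killed caesar",
    2: "calpurnia warned caesar",
    4: "brutus and calpurnia",
}
index = build_index(docs)
```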

Inverted Index
• Query: Brutus and Calpurnia
• Substantially reduces the number of candidate documents (is this true in general?)
• Merge the postings lists: only compute scores for documents 1, 2, 4, 11, 31, 45, 54, 173, 174, 101
(Figure: dictionary and postings)
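The merge step can be sketched as a two-pointer union over sorted postings lists. The two lists below are assumptions (the slide's figure did not survive), chosen so that their union is the set of document numbers listed above.

```python
def union(a, b):
    """Merge two sorted postings lists into one sorted list without duplicates."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out

brutus = [1, 2, 4, 11, 31, 45, 173, 174]  # assumed postings for "Brutus"
calpurnia = [2, 31, 54, 101]              # assumed postings for "Calpurnia"
candidates = union(brutus, calpurnia)
```

Only the documents in `candidates` need a score; the rest of the collection is never touched.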

Indexes
• Indexes are data structures designed to make search faster and to support efficient updates
• Text search has unique requirements, which lead to unique data structures
• The most common data structure is the inverted index
  - a general name for a class of structures
  - “inverted” because documents are associated with words, rather than words with documents

Inverted Index
• Each index term is associated with an inverted list
  - Contains lists of documents, or lists of word occurrences in documents, and other information
  - Each entry is called a posting
  - The part of the posting that refers to a specific document or location is called a pointer
  - Each document in the collection is given a unique number
  - Lists are usually document-ordered (sorted by document number)

Inverted Index
(Figure: postings)

Example “Collection”
(Figure: the example document collection)

Simple Inverted Index
• Query: tropical fish

Inverted Index with counts
• Query: tropical fish

Inverted Index with positions
• Query: tropical fish

Proximity Matches
• Matching phrases or words within a window
  - e.g., "tropical fish", or “find tropical within 5 words of fish”
• Word positions in inverted lists make these types of query features efficient
  - e.g., (figure)
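A small sketch of how positions support phrase and window queries. It assumes a positional index stored as term -> {document: [positions]} with positions counted from 1; the documents and function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def within_window(index, first, second, window):
    """Documents where `second` occurs 1..window positions after `first`."""
    hits = []
    for doc_id in sorted(index[first].keys() & index[second].keys()):
        if any(0 < q - p <= window
               for p in index[first][doc_id]
               for q in index[second][doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical toy collection.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in tropical water",
    3: "tropical aquarium fish",
}
index = build_positional_index(docs)
phrase_docs = within_window(index, "tropical", "fish", window=1)  # phrase match
near_docs = within_window(index, "tropical", "fish", window=5)    # within 5 words
```

The number of positions stored for a term in a document is also its count there, so the same structure covers the index with counts.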

Fields and Extents
• Document structure is useful in search
  - field restrictions, e.g., date, from:, etc.
  - some fields are more important, e.g., title
• Options:
  - separate inverted lists for each field type
  - add information about fields to postings
  - use extent lists

Extent Lists
• An extent is a contiguous region of a document
  - represent extents using word positions
  - inverted list records all extents for a given field type
  - e.g., find “fish” in title (figure: extent list)
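A minimal sketch of the "fish in title" check, assuming extents are stored as (start, end) word-position pairs with an exclusive end. The postings, the extent list, and the helper name `in_extent` are all hypothetical.

```python
def in_extent(positions, extents):
    """Return the positions that fall inside any (start, end) extent; end is exclusive."""
    return [p for p in positions if any(s <= p < e for s, e in extents)]

# Hypothetical postings: doc_id -> word positions of "fish".
fish = {1: [2, 4], 4: [3, 8]}
# Hypothetical extent list for the title field: doc_id -> [(start, end)].
title = {1: [(1, 3)], 4: [(9, 15)]}

matches = {d: in_extent(fish[d], title.get(d, [])) for d in fish}
title_hits = [d for d, ps in matches.items() if ps]
```

Here document 1 matches because position 2 lies inside its title extent, while document 4 contains "fish" only outside the title.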

Other Issues
• Precomputed scores in inverted list
  - e.g., list for “fish”: [(1:3.6), (3:2.2)], where 3.6 is the total feature value for document 1
  - improves speed but reduces flexibility
• Score-ordered lists
  - very efficient for single-word queries
  - only retrieve the top part of each inverted list, reducing disk access

Compression
• Inverted lists are very large
  - e.g., 25-50% of collection for TREC collections using Indri search engine
  - Much higher if n-grams are indexed
• Compression of indexes saves disk and/or memory space
  - Typically have to decompress lists to use them
  - Trade-off between compression ratios and computational cost
• Lossless compression: no information lost

Compression
• Basic idea: common data elements use short codes while uncommon data elements use longer codes
  - Example: coding numbers
    - number sequence: (not shown)
    - possible encoding: (14 bits)
  - But ‘0’ is more common than ‘1’, ‘2’, and ‘3’
  - A better coding scheme: 0: 0, 1: 01, 2: 11, 3: 10
    - encode 0 using a single 0
    - only 10 bits, but the same bits can also be decoded differently, e.g. starting 0 0 3 3 0

Compression Example
• Ambiguous encoding: not clear how to decode
• Use an unambiguous code: (not shown)
• which gives: (13 bits)
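The ambiguity can be checked mechanically by listing every way a bit string splits into codewords. In this sketch the short code is the one from the slides; the number sequence and the prefix-free code are assumptions, chosen only because they are consistent with the 14, 10, and 13 bit counts quoted above.

```python
def encode(seq, code):
    """Concatenate the codeword of each symbol."""
    return "".join(code[s] for s in seq)

def all_decodings(bits, code):
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [[]]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            for rest in all_decodings(bits[len(word):], code):
                results.append([symbol] + rest)
    return results

fixed_code = {0: "00", 1: "01", 2: "11", 3: "10"}     # two bits per number
short_code = {0: "0", 1: "01", 2: "11", 3: "10"}      # scheme from the slides
prefix_code = {0: "0", 1: "101", 2: "110", 3: "111"}  # assumed unambiguous code

seq = [0, 1, 0, 3, 0, 2, 0]  # assumed number sequence

short_bits = encode(seq, short_code)
short_parses = all_decodings(short_bits, short_code)
prefix_bits = encode(seq, prefix_code)
prefix_parses = all_decodings(prefix_bits, prefix_code)
```

With the short code, `short_parses` contains more than one sequence, including one that begins 0, 0, 3, 3, 0. With the prefix-free code no codeword is a prefix of another, so exactly one parse exists.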

Delta Encoding
• Inverted list of (document ID, count) pairs: (1, 1) (102, 3) (1100, 2); the slide also shows (100, 010, 10)
• Word count data is a good candidate for compression
  - many small numbers and few larger numbers
  - encode small numbers with small codes
• Document numbers are less predictable
  - but differences between numbers in an ordered list are smaller and more predictable
• Delta encoding:
  - encoding differences between document numbers (d-gaps)

Delta Encoding
• Inverted list (without counts)
• Differences between adjacent numbers
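A short sketch of d-gap encoding and decoding. The document numbers are a hypothetical sorted list, since the slide's own numbers did not survive extraction.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Replace each document number by its difference from the previous one."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]

def from_gaps(gaps):
    """A running sum of the gaps restores the original document numbers."""
    return list(accumulate(gaps))

doc_ids = [1, 5, 9, 18, 23, 24, 30, 44, 45, 48]  # hypothetical inverted list
gaps = to_gaps(doc_ids)
```

The gaps are smaller than the document numbers they replace, which is what lets codes that favor small values pay off.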

Bit-Aligned Codes
• Breaks between encoded numbers can occur after any bit position
• Unary code
  - Encode k by k 1s followed by a 0
  - The 0 at the end makes the code unambiguous

Bit-Aligned Codes
• Example (figure; surviving text: 1 4 4 9 5 10 11110 111110 10 1101110, then 1 2 3 1 3)
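A minimal sketch of the unary code as defined above (k ones followed by a zero), using Python strings of "0" and "1" for readability rather than packed bits. The function names are illustrative.

```python
def unary_encode(k):
    """k ones followed by a terminating zero."""
    return "1" * k + "0"

def unary_decode(bits):
    """Split a concatenation of unary codes back into numbers."""
    numbers, run = [], 0
    for bit in bits:
        if bit == "1":
            run += 1
        else:
            # The zero marks the end of one code, so breaks need no alignment.
            numbers.append(run)
            run = 0
    if run:
        raise ValueError("bit string ends in the middle of a code")
    return numbers

encoded = "".join(unary_encode(k) for k in [1, 4, 5])
decoded = unary_decode(encoded)
```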

Unary and Binary Codes
• Unary is very efficient for small numbers such as 0 and 1, but quickly becomes very expensive
  - 1023 can be represented in 10 binary bits, but requires 1024 bits in unary
• Binary is more efficient for large numbers, but it may be ambiguous

Elias-γ Code
• To encode a number k, compute
  - $k_d = \lfloor \log_2 k \rfloor$
  - $k_r = k - 2^{\lfloor \log_2 k \rfloor}$
• Unary code for $k_d$ and binary code for $k_r$
  - $k_d$ is the number of bits used to encode $k_r$
• Important property:
  - The number of bits for coding $k_r$ is no more than $k_d$

Elias-γ Code
• Elias-γ code uses no more bits than unary, many fewer for k > 2
  - 1023 takes 19 bits instead of 1024 bits using unary
• In general, takes $2\lfloor \log_2 k \rfloor + 1$ bits
  - No more than twice the number of bits of the binary code
• Example: 100111010011010111100111 decodes to 2, 12, 6, 23
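A sketch of Elias-γ encoding and decoding, assuming the standard definitions: the unary part holds floor(log2 k) and the binary part holds k minus the largest power of two not exceeding k, written in exactly that many bits. Bit strings are Python strings for readability, and the function names are illustrative.

```python
def gamma_encode(k):
    """Unary code for kd = floor(log2 k), then kr = k - 2**kd in kd binary bits."""
    if k < 1:
        raise ValueError("Elias-gamma is defined for k >= 1")
    kd = k.bit_length() - 1  # floor(log2 k)
    kr = k - (1 << kd)
    unary = "1" * kd + "0"
    binary = format(kr, "b").zfill(kd) if kd else ""
    return unary + binary

def gamma_decode(bits):
    """Decode a concatenation of well-formed Elias-gamma codes."""
    numbers, i = [], 0
    while i < len(bits):
        kd = 0
        while bits[i] == "1":  # read the unary part
            kd += 1
            i += 1
        i += 1                 # skip the terminating 0
        kr = int(bits[i:i + kd], 2) if kd else 0
        i += kd
        numbers.append((1 << kd) + kr)
    return numbers

values = [1, 6, 15, 16, 255]
bits = "".join(gamma_encode(v) for v in values)
```

The length of `gamma_encode(1023)` is 19, matching the figure quoted above: 10 bits of unary for kd = 9 plus 9 bits for kr.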

Byte-Aligned Codes
• Variable-length bit encodings can be a problem on processors that process bytes
• v-byte is a popular byte-aligned code
  - Similar to Unicode UTF-8
• Shortest v-byte code is 1 byte
• Numbers are 1 to 4 bytes, with high bit 1 in the last byte, 0 otherwise

V-Byte Encoding
• 6 is 110 in binary; padded to one byte: 00000110; with the high bit set on the last byte: 10000110
• 128 is 10000000 in binary; v-byte: 00000001 10000000
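A sketch of v-byte under the convention stated above (seven data bits per byte, high bit set only on the last byte). The most-significant-group-first byte order is an assumption, chosen because it reproduces 00000001 10000000 for 128; the sketch does not enforce the 4-byte limit.

```python
def vbyte_encode(k):
    """Split k into 7-bit groups, most significant first; flag the last byte."""
    groups = [k & 0x7F]
    k >>= 7
    while k:
        groups.append(k & 0x7F)
        k >>= 7
    groups.reverse()
    groups[-1] |= 0x80  # high bit 1 marks the final byte of a number
    return bytes(groups)

def vbyte_decode(data):
    """Decode a byte string holding any number of v-byte codes."""
    numbers, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:
            numbers.append(current)
            current = 0
    return numbers

values = [1, 6, 127, 128, 180, 20000]
packed = b"".join(vbyte_encode(v) for v in values)
```

Because every code ends on a byte boundary, decoding needs only shifts and masks on whole bytes.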

Compression Example
• Consider an inverted list with positions
• Delta encode document numbers and positions: 1 2 1 6 1 3 6 11 180 1 1 1, i.e. (document gap, count, [position gaps]) = (1, 2, [1, 6]) (1, 3, [6, 11, 180]) (1, 1, [1])
• Compress using v-byte

Results of Compression
• Reuters dataset: 800K documents

Skipping
• Search involves comparison of inverted lists of different lengths, which can be expensive
  - Find documents with both “galago” and “animal”
  - Number of documents with “animal”: 1,000,000
  - Number of documents with “galago”: 922,000
  - Number of documents with both: 89,700

Skipping
• Search involves comparison of inverted lists of different lengths
  - Can be very inefficient
  - “Skipping” ahead to check document numbers is much better
  - Compression makes this difficult
    - Variable size, only d-gaps stored
• Skip pointers are an additional data structure to support skipping

Skip Pointers
• A skip pointer (d, p) contains a document number d and a byte (or bit) position p
  - Means there is an inverted list posting that starts at position p, and the posting before it was for document d
(Figure: inverted list with skip pointers)
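A sketch of skip pointers over a d-gap list. To keep it short, the list is a plain Python list of gaps and the position p is a list index rather than a byte or bit offset; the document numbers, the spacing of three postings per pointer, and the function names are assumptions.

```python
def to_gaps(doc_ids):
    """Store only differences between consecutive document numbers."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]

def build_skips(gaps, every):
    """Emit (d, p): posting p starts here and the posting before it was document d."""
    skips, doc = [], 0
    for i, gap in enumerate(gaps):
        if i > 0 and i % every == 0:
            skips.append((doc, i))
        doc += gap
    return skips

def contains(gaps, skips, target):
    """Look up `target`, decoding gaps only from the last useful skip pointer."""
    doc, pos = 0, 0
    for d, p in skips:
        if d < target:
            doc, pos = d, p  # target can only be at position p or later
        else:
            break
    for gap in gaps[pos:]:
        doc += gap
        if doc >= target:
            return doc == target
    return False

doc_ids = [5, 11, 17, 21, 26, 34, 36, 37, 45]  # hypothetical inverted list
gaps = to_gaps(doc_ids)
skips = build_skips(gaps, every=3)
```

Storing d alongside p is what makes the jump possible: with only d-gaps in the list, decoding cannot restart at p without knowing the document number that preceded it.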

Auxiliary Structures
• Inverted lists are usually stored together in a single file for efficiency
  - Inverted file
• Term statistics stored at start of inverted lists
• Collection statistics stored in separate file

Auxiliary Structures
• Vocabulary or lexicon
  - Contains a lookup table from index terms to the byte offset of the inverted list in the inverted file
  - Either hash table in memory or B-tree for larger vocabularies
(Figure: looking up “Brutus” with a hash table and with a B-tree)

Index Construction
• Simple in-memory indexer (figure)

Merging
• Merging addresses the limited memory problem
  - Build the inverted list structure until memory runs out
  - Then write the partial index to disk, start making a new one
  - At the end of this process, the disk is filled with many partial indexes, which are merged
• Partial lists must be designed so they can be merged in small pieces
  - e.g., storing in alphabetical order

Merging
(Figure)
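A sketch of merging partial indexes that are stored in alphabetical order. It assumes each partial index is a term-sorted sequence of (term, postings) pairs; the terms, postings, and function name are hypothetical, and in-memory lists stand in for files on disk.

```python
import heapq
from itertools import groupby

def merge_partial_indexes(*partials):
    """Merge term-sorted partial indexes, combining postings of equal terms."""
    merged = []
    # heapq.merge reads the inputs lazily, one entry per partial index at a time.
    stream = heapq.merge(*partials, key=lambda entry: entry[0])
    for term, group in groupby(stream, key=lambda entry: entry[0]):
        postings = sorted(set(doc for _, plist in group for doc in plist))
        merged.append((term, postings))
    return merged

index_a = [("aardvark", [2, 3, 4, 5]), ("apple", [2, 4])]
index_b = [("aardvark", [6, 9]), ("actor", [15, 42, 68])]
merged = merge_partial_indexes(index_a, index_b)
```

Because both inputs are alphabetical, equal terms arrive next to each other in the merged stream, so only a small piece of each partial index has to be held at once.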

Distributed Indexing
• Distributed processing is driven by the need to index and analyze huge amounts of data (e.g., the Web)
• Large numbers of inexpensive servers are used rather than larger, more expensive machines
• MapReduce is a distributed programming tool designed for indexing and analysis tasks

MapReduce
• Basic process
  - Map stage, which transforms data records into pairs, each with a key and a value (e.g., a (word, doc-id) pair)
  - Shuffle uses a hash function so that all pairs with the same key end up next to each other and on the same machine (e.g., all pairs for the same word are sent to the same machine)
  - Reduce stage processes records in batches, where all pairs with the same key are processed at the same time (e.g., create the inverted list for each word)
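The three stages can be imitated in a single process. This is only a sketch of the data flow, not of any real MapReduce API: "machines" are Python dicts, and the documents and function names are assumptions.

```python
from collections import defaultdict

def map_stage(doc_id, text):
    """Turn one document into (word, doc-id) pairs."""
    return [(word, doc_id) for word in text.lower().split()]

def shuffle(pairs, num_machines):
    """Route each pair to a machine by hashing its key, grouping values per key."""
    machines = [defaultdict(list) for _ in range(num_machines)]
    for key, value in pairs:
        machines[hash(key) % num_machines][key].append(value)
    return machines

def reduce_stage(word, doc_ids):
    """Build the inverted list for one word from all of its pairs."""
    return word, sorted(set(doc_ids))

# Hypothetical toy collection.
docs = {1: "tropical fish", 2: "fish tank", 3: "tropical tank fish"}
pairs = [p for doc_id, text in docs.items() for p in map_stage(doc_id, text)]
index = dict(
    reduce_stage(word, ids)
    for machine in shuffle(pairs, num_machines=2)
    for word, ids in machine.items()
)
```

Hashing the key guarantees that every pair for a given word lands on the same machine, so each reducer sees the complete list for its words.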

MapReduce
(Figure)

Indexing Example
(Figure)

Update Index
• Index merging is a good strategy for handling updates when they come in large batches
• For small updates this is very inefficient
  - instead, create a separate index for new documents and merge results from both searches
  - could be in-memory, fast to update and search
• How to deal with deleted documents?
  - Deletions are handled using a delete list
  - Modifications are done by putting the old version on the delete list and adding the new version to the new-documents index
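A sketch of searching a main index together with a small new-documents index and a delete list. The indexes are plain dicts, and giving a modified document a fresh number is an assumption made for this illustration.

```python
def search(term, main_index, new_index, deleted):
    """Union of postings from both indexes, minus documents on the delete list."""
    hits = set(main_index.get(term, [])) | set(new_index.get(term, []))
    return sorted(hits - deleted)

main_index = {"fish": [1, 2, 3], "tank": [2]}  # large index, rebuilt rarely
new_index = {}                                 # small in-memory index
deleted = set()                                # delete list

# Add a new document, numbered 4.
new_index.setdefault("fish", []).append(4)

# Modify document 2: put the old version on the delete list and index the
# new text under a fresh number, 5 (an assumption of this sketch).
deleted.add(2)
new_index.setdefault("tank", []).append(5)

fish_hits = search("fish", main_index, new_index, deleted)
tank_hits = search("tank", main_index, new_index, deleted)
```

The main index is never modified; the next batch merge can fold in the new documents and drop the deleted ones.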

Query Processing
• Document-at-a-time
  - Calculates complete scores for documents by processing all term lists, one document at a time
• Query: Salt Water Tropic

Query Processing
• Term-at-a-time
  - Accumulates scores for documents by processing term lists one at a time
• Query: Salt Water Tropic
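The two strategies can be compared on the same postings. This sketch assumes document-ordered lists of (document, score) pairs and a score that is a plain sum over query terms; the postings and the values in them are hypothetical.

```python
from collections import defaultdict

# Hypothetical postings with per-document term scores: term -> [(doc_id, score)].
index = {
    "salt":   [(1, 1.0), (4, 1.0)],
    "water":  [(1, 1.0), (2, 1.0), (4, 1.0)],
    "tropic": [(1, 2.0), (2, 2.0), (3, 1.0)],
}

def term_at_a_time(index, query):
    """Walk one inverted list at a time, adding into per-document accumulators."""
    accumulators = defaultdict(float)
    for term in query:
        for doc_id, score in index.get(term, []):
            accumulators[doc_id] += score
    return dict(accumulators)

def document_at_a_time(index, query):
    """Advance all lists together, finishing each document's score before the next."""
    lists = [index.get(term, []) for term in query]
    cursors = [0] * len(lists)
    scores = {}
    while True:
        heads = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not heads:
            return scores
        doc_id = min(heads)  # next document in document order
        total = 0.0
        for i, lst in enumerate(lists):
            if cursors[i] < len(lst) and lst[cursors[i]][0] == doc_id:
                total += lst[cursors[i]][1]
                cursors[i] += 1
        scores[doc_id] = total

query = ["salt", "water", "tropic"]
taat = term_at_a_time(index, query)
daat = document_at_a_time(index, query)
```

Both return the same scores. The difference is in what is held at once: term-at-a-time keeps an accumulator for every candidate document, while document-at-a-time keeps one cursor per list and a single running total.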

Optimization Techniques
• Term-at-a-time uses more memory, for the accumulators that hold partial scores, but fewer disk accesses
• Two classes of optimization
  - Read less data from inverted lists
    - e.g., using skip lists to read part of an inverted list
  - Calculate scores for fewer documents
    - e.g., conjunctive processing to reduce the number of candidate documents

Distributed Evaluation
• Basic process
  - All queries sent to a director machine
  - Director then sends messages to many index servers
  - Each index server does some portion of the query processing
  - Director organizes the results and returns them to the user
• Two main approaches
  - Document distribution
    - by far the most popular
  - Term distribution

Distributed Evaluation
• Document distribution
  - each index server acts as a search engine for a small fraction of the total collection
  - director sends a copy of the query to each of the index servers, each of which returns its top-k results
  - results are merged into a single ranked list by the director
• Collection statistics should be shared for effective ranking
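The director's merge step can be sketched with a heap. The per-server result lists, the document identifiers, and the function name `director_merge` are hypothetical, and the sketch assumes scores from different servers are directly comparable.

```python
import heapq

def director_merge(shard_results, k):
    """Merge per-server top-k lists of (score, doc_id) into one global top-k."""
    return heapq.nlargest(k, (hit for hits in shard_results for hit in hits))

# Hypothetical results returned by three index servers for one query.
shard_results = [
    [(0.9, "a7"), (0.4, "a2")],
    [(0.8, "b1"), (0.7, "b5")],
    [(0.3, "c3")],
]
top = director_merge(shard_results, k=3)
```

The comparability assumption is where shared collection statistics matter: if each server computed idf from its own fraction of the collection, the same document could receive different scores depending on where it was stored.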

Document Distribution
(Figure: the director sends the query to Index Servers 1 to 100, each holding its own document collection; each server returns a document list, and the director merges the lists into the results.)

Distributed Evaluation
• Term distribution
  - Single index is built for the whole cluster of machines
  - Each inverted list in that index is then assigned to one index server
    - in most cases the data to process a query is not stored on a single machine
  - One of the index servers is chosen to process the query
    - usually the one holding the longest inverted list
  - Other index servers send information to that server
  - Final results sent to director

Term Distribution
(Figure: the director passes the query to the index servers; Index Server 1 holds words a-b, Index Server 2 holds words c-d, and so on up to Index Server 100 with words y-z; document lists are exchanged and the final results return to the director.)

Caching
• Query distributions are similar to Zipf
  - About 50% are unique, but some are very popular
• Caching can significantly improve efficiency
  - Cache popular query results
  - Cache common inverted lists
• Cache must be refreshed to prevent stale data