Indexing and Searching
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Chapter 8

Outline
- Inverted Files
- Other Indices for Text
- Sequential Searching
- Pattern Matching
- Compression

Inverted Files
- An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
- Structure: vocabulary and occurrences
- Block addressing
  - The text is divided into blocks, and the occurrences point to the blocks
  - Full inverted indices, by contrast, store exact occurrences

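The difference between full addressing and block addressing can be sketched in a few lines of Python. This is only an illustration: the function name `build_inverted_index`, the `block_size` parameter, and the use of character offsets as occurrences are assumptions made for the example.

```python
import re
from collections import defaultdict

def build_inverted_index(text, block_size=None):
    """Map each word of the vocabulary to its sorted list of occurrences.

    With block_size=None the occurrences are exact character offsets
    (a full inverted index). With a block size, the occurrences are
    block numbers (block addressing): smaller lists, less precision.
    """
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        position = match.start()
        if block_size is not None:
            position //= block_size  # keep only the block number
        postings = index[match.group()]
        # Under block addressing a word is recorded once per block.
        if not postings or postings[-1] != position:
            postings.append(position)
    return dict(index)

text = "to be or not to be"
full = build_inverted_index(text)                  # full["to"] == [0, 13]
blocks = build_inverted_index(text, block_size=8)  # blocks["to"] == [0, 1]
```

With block addressing, a query still has to scan the candidate blocks sequentially to find the exact positions, which is the price paid for the smaller index.
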
Inverted Files
- The search algorithm on an inverted index
  - Vocabulary search
  - Retrieval of occurrences
  - Manipulation of occurrences
- Construction (split the index into two files)
  - Posting file: the lists of occurrences are stored contiguously
  - Vocabulary file: the vocabulary is stored in lexicographical order, and each word points to its list in the posting file

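The three search steps and the two-file layout can be sketched as follows. The toy data, the `(word, offset, length)` vocabulary entries, and the function names are hypothetical choices for this example, and a binary search stands in for whatever vocabulary structure (hashing, tries, B-trees) a real system would use.

```python
from bisect import bisect_left

# Hypothetical toy index for the text "to be or not to be".
# Posting file: all occurrence lists stored contiguously.
POSTINGS = [3, 16, 9, 6, 0, 13]
# Vocabulary in lexicographical order: (word, offset into POSTINGS, list length).
VOCABULARY = [("be", 0, 2), ("not", 2, 1), ("or", 3, 1), ("to", 4, 2)]
WORDS = [entry[0] for entry in VOCABULARY]

def occurrences(word):
    """Vocabulary search (binary search), then retrieval of occurrences."""
    i = bisect_left(WORDS, word)
    if i == len(WORDS) or WORDS[i] != word:
        return []
    _, offset, length = VOCABULARY[i]
    return POSTINGS[offset:offset + length]

def followed_by(first, second, max_gap):
    """Manipulation of occurrences: pairs where `second` starts at most
    max_gap characters after `first` starts."""
    return [(a, b)
            for a in occurrences(first)
            for b in occurrences(second)
            if 0 < b - a <= max_gap]

print(occurrences("to"))            # [0, 13]
print(followed_by("to", "be", 3))   # [(0, 3), (13, 16)]
```
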
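Inverted Files
- For large texts
  - Partial index: when the index does not fit in main memory, partial indices are built for portions of the text and then merged
  - Merging two indices consists of merging the sorted vocabularies and, whenever the same word appears in both, merging its two lists of occurrences.
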
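A minimal sketch of merging two partial indices, assuming each one is a list of `(word, occurrences)` pairs sorted by word and that the second index covers text that comes after the first (so concatenated lists stay sorted). The function name `merge_indices` is made up for the example.

```python
def merge_indices(index_a, index_b):
    """Merge two partial indices, each a list of (word, occurrences)
    pairs sorted by word. Assumes index_b was built from text that
    follows the text of index_a."""
    merged = []
    i = j = 0
    while i < len(index_a) and j < len(index_b):
        word_a, list_a = index_a[i]
        word_b, list_b = index_b[j]
        if word_a == word_b:
            # Same word in both vocabularies: join the occurrence lists.
            merged.append((word_a, list_a + list_b))
            i += 1
            j += 1
        elif word_a < word_b:
            merged.append(index_a[i])
            i += 1
        else:
            merged.append(index_b[j])
            j += 1
    # At most one of the two tails is non-empty.
    merged.extend(index_a[i:])
    merged.extend(index_b[j:])
    return merged

first_half = [("be", [3]), ("or", [6]), ("to", [0])]
second_half = [("be", [16]), ("not", [9]), ("to", [13])]
print(merge_indices(first_half, second_half))
# [('be', [3, 16]), ('not', [9]), ('or', [6]), ('to', [0, 13])]
```
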
Other Indices for Text
- Suffix Trees
- Suffix Arrays
- Signature Files

Suffix Trees and Suffix Arrays
- Each position in the text is considered as a text suffix
- Index points are selected from the text; they mark the beginnings of the text positions that will be retrievable

Suffix Arrays
- The main drawback of suffix arrays is their costly construction process.
- They allow binary searches, done by comparing the text contents at each pointer.
- Supra-indices (for large suffix arrays)

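The binary search over pointers can be sketched as below. Two simplifications are assumed: every character position is taken as an index point, and the array is built by naively sorting suffixes, which is fine for a demonstration but far too slow for large texts. The function names are illustrative only.

```python
def build_suffix_array(text):
    """Pointers to all suffixes, in lexicographical order of the suffixes.
    Naive construction: sorts the suffixes directly."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary search for the range of suffixes that start with `pattern`,
    comparing the text found at each pointer."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "abra"))  # [0, 7]
```

Each search costs O(log n) comparisons of up to m characters, independent of how many times the pattern occurs.
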
Construction of Suffix Arrays for Large Texts

Signature Files
- Word-oriented index structures based on hashing
- Maps words to bit masks of B bits
- Divides the text into blocks of b words each
- The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block.
- Hash the query word to a bit mask W
- If W & Bi = W, where Bi is the mask of block i, the text block may contain the word (it must still be checked, since other words can set the same bits)

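A small sketch of the scheme, under assumed parameters: 16-bit masks, three bits set per word, and MD5 from the standard library as a stand-in hash function. None of these choices come from the slides.

```python
import hashlib

B = 16             # mask width in bits (assumed value for the demo)
BITS_PER_WORD = 3  # bits set by each word signature (assumed value)

def word_signature(word):
    """Hash a word to a B-bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % B)
    return mask

def block_masks(words, b):
    """Split the word sequence into blocks of b words and OR the
    signatures of the words in each block."""
    masks = []
    for start in range(0, len(words), b):
        mask = 0
        for word in words[start:start + b]:
            mask |= word_signature(word)
        masks.append(mask)
    return masks

def candidate_blocks(masks, query):
    """Blocks whose mask contains every bit of the query signature W.
    A candidate may be a false drop and must be verified in the text."""
    W = word_signature(query)
    return [i for i, Bi in enumerate(masks) if W & Bi == W]

words = "this is a text a text has many words words are made from letters".split()
masks = block_masks(words, b=4)
print(candidate_blocks(masks, "text"))  # includes blocks 0 and 1
```

The test never misses a block that contains the word, but it can report blocks that do not, which is why the slide says "may contain".
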
Sequential Searching
- Brute Force
- Knuth-Morris-Pratt
- Boyer-Moore Family
- Shift-Or
- Suffix Automaton
  - Backward DAWG matching (BDM)
  - BNDM

Knuth-Morris-Pratt

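A textbook-style sketch of Knuth-Morris-Pratt, given here for illustration rather than taken from the slides. The names `failure_function` and `kmp_search` are chosen for the example.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Start positions of all occurrences of pattern in text.
    The text pointer never moves backwards."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]  # fall back to a shorter matched prefix
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # allow overlapping occurrences
    return matches

print(failure_function("abracadabra"))    # [0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
print(kmp_search("abracadabra", "abra"))  # [0, 7]
```
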
Boyer-Moore Family

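As one simple member of the family, the Horspool variant can be sketched as follows. It is chosen here only because it is short; the slides do not say which variants they cover, and the function name is an assumption.

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: compare the window right to left, then
    shift according to the text character under the window's last cell."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the last occurrence of each character (excluding the
    # final pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        # Characters absent from the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches

print(horspool_search("abracadabra", "abra"))  # [0, 7]
```

On natural-language text with patterns of moderate length, many windows are skipped after a single comparison, which is what makes this family fast in practice.
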
Shift-Or

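A compact sketch of Shift-Or using Python integers as bit vectors. In the state, bit i is 0 exactly when the first i + 1 pattern characters match the text ending at the current position. The function name and the dictionary-based table are choices made for this example.

```python
def shift_or_search(text, pattern):
    """Bit-parallel exact matching: one shift and one OR per text character."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # table[c] has a 0 in bit i exactly when pattern[i] == c.
    table = {}
    for i, c in enumerate(pattern):
        table[c] = table.get(c, all_ones) & ~(1 << i)
    state = all_ones
    matches = []
    for pos, c in enumerate(text):
        # The shift brings in a 0 at bit 0 (an empty prefix always matches);
        # the OR kills every prefix that c does not extend.
        state = ((state << 1) | table.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(pos - m + 1)
    return matches

print(shift_or_search("abracadabra", "abra"))  # [0, 7]
```
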
Suffix Automaton

Pattern Matching
- Searching allowing errors
  - Dynamic Programming
  - Automaton
- Regular Expressions and Extended Patterns
- Pattern Matching Using Indices
  - Inverted files
  - Suffix Trees and Suffix Arrays

Dynamic Programming

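The classical dynamic programming approach to searching with at most k errors can be sketched as below, keeping a single column of the matrix. The function name, the choice to report end positions, and the unit cost for insertions, deletions, and substitutions are assumptions of this example.

```python
def approximate_search(text, pattern, k):
    """End positions (0-based) in text where pattern matches with at
    most k errors (insertions, deletions, substitutions)."""
    m = len(pattern)
    # column[i] = fewest errors to match pattern[:i] against some
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for pos, c in enumerate(text):
        diagonal = column[0]  # column[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = column[i]
            if pattern[i - 1] == c:
                column[i] = diagonal
            else:
                column[i] = 1 + min(diagonal, column[i - 1], old)
            diagonal = old
        if column[m] <= k:
            ends.append(pos)
    return ends

print(approximate_search("surgery", "survey", 2))  # [4, 5, 6]
```

The cost is O(mn) time and O(m) space for a pattern of length m and a text of length n.
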
Automaton

Regular Expressions

Pattern Matching Using Indices
- Inverted Files
  - Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary
  - The restriction is that approximate matches or regular expressions that span many words are difficult to find.

Pattern Matching Using Indices
- Suffix Trees
  - Suffix trees are able to perform complex searches
    - Word, prefix, suffix, substring, and range queries
    - Regular expressions
    - Unrestricted approximate string matching
  - Useful in specific areas
    - Find the longest substring
    - Find the most common substring of a fixed size

Pattern Matching Using Indices
- Suffix Arrays
  - Some patterns can be searched directly in the suffix array without simulating the suffix tree
  - Word, prefix, suffix, subword search and range search

Compression
- Compressed text: Huffman coding
  - Taking words as symbols
  - Use an alphabet of bytes instead of bits
- Compressed indices
  - Inverted Files
  - Suffix Trees and Suffix Arrays
  - Signature Files

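Taking words as symbols can be illustrated with a small Huffman coder. For brevity this sketch emits bit strings; the byte-oriented variant mentioned above would build a tree of higher degree instead. The function name and the sample sentence are choices made for the example.

```python
import heapq
from collections import Counter
from itertools import count

def word_huffman_codes(words):
    """Binary Huffman code whose symbols are whole words."""
    freq = Counter(words)
    if len(freq) == 1:
        return {word: "0" for word in freq}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {word: ""}) for word, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing their codes.
        merged = {word: "0" + code for word, code in codes1.items()}
        merged.update({word: "1" + code for word, code in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

words = "for each rose a rose is a rose".split()
codes = word_huffman_codes(words)
encoded = "".join(codes[word] for word in words)
print(len(encoded), "bits for", len(words), "words")
```

Frequent words such as "rose" receive the shortest codes, which is why word-based models compress natural-language text well.
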