Inverted Indices

Inverted Files
Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
Structure of an inverted file:
– Vocabulary: the set of all distinct words in the text
– Occurrences: lists containing all the information necessary for each word of the vocabulary (text positions, frequency, documents where the word appears, etc.)

Example
Text (the numbers are the starting positions of the words):
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
Inverted file:
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6
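A minimal sketch of how such a positional inverted file could be built. The stopword list and the function name `build_inverted_file` are assumptions made for illustration, and positions are counted as 1-based character offsets, so individual values may differ by one from the numbers on the slide depending on how punctuation and spaces are counted.

```python
import re
from collections import defaultdict

# Assumed stopword list, chosen so that only the four slide words remain.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_inverted_file(text):
    """Map each non-stopword to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in STOPWORDS:
            index[word].append(match.start() + 1)
    # Return the vocabulary in lexicographical order.
    return dict(sorted(index.items()))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
inverted = build_inverted_file(text)
for word, positions in inverted.items():
    print(word, positions)
```

Each vocabulary word keeps its own list, and every list is sorted by position because the text is scanned from left to right.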

Inverted Files with TF-IDF
The prior example allows for Boolean queries. TF-IDF ranking also needs the document frequency and the term frequency.
Vocabulary entry: k, d_k
Posting file entry: (doc_1, f_1k), (doc_2, f_2k), …
– d_k: document frequency of term k
– doc_i: i-th document that contains term k
– f_ik: term frequency of term k in document i
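The layout above could be sketched as follows. The three sample documents, the whitespace tokenizer, and the weighting f_ik * log(N / d_k) are assumptions; the slide does not say which TF-IDF variant is intended.

```python
import math
from collections import Counter, defaultdict

def build_postings(docs):
    """Return {term k: [(doc_i, f_ik), ...]}; d_k is the length of each list."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for term, freq in sorted(Counter(text.lower().split()).items()):
            postings[term].append((doc_id, freq))
    return dict(postings)

def tf_idf(postings, term, doc_id, n_docs):
    """One common TF-IDF variant (assumed here): f_ik * log(N / d_k)."""
    entries = postings.get(term, [])
    d_k = len(entries)                    # document frequency of the term
    f_ik = dict(entries).get(doc_id, 0)   # term frequency in this document
    return f_ik * math.log(n_docs / d_k) if d_k else 0.0

# Hypothetical three-document collection.
docs = [
    "the garden has flowers",
    "the flowers are beautiful flowers",
    "that house has a garden",
]
postings = build_postings(docs)
print(postings["flowers"])                # [(1, 1), (2, 2)]
print(tf_idf(postings, "flowers", 2, len(docs)))
```

Storing d_k in the vocabulary entry means the document frequency is available without reading the posting list itself.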

Space Requirements
The space required for the vocabulary is rather small. According to Heaps' law, the vocabulary grows as O(n^β), where β is a constant between 0.4 and 0.6 in practice.
– TREC-2: 1 GB text, 5 MB lexicon
The occurrences, on the other hand, demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n).
To reduce space requirements, a technique called block addressing is used.
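To make the sublinear growth concrete, here is a small sketch of Heaps' law in the form V = K * n^β. The constants K = 10 and β = 0.5 are illustrative assumptions, not values measured on any collection.

```python
def heaps_vocabulary(n_words, k=10.0, beta=0.5):
    """Estimated vocabulary size under Heaps' law, V = k * n^beta.

    k and beta are assumed example constants.
    """
    return k * n_words ** beta

for n in (1_000_000, 4_000_000, 16_000_000):
    print(n, round(heaps_vocabulary(n)))
```

With β = 0.5, making the text four times larger only doubles the estimated vocabulary, while the occurrence lists grow linearly with the text.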

Block Addressing
The text is divided into blocks, and the occurrences point to the blocks where the word appears.
Advantages:
– there are fewer blocks than text positions, so the pointers are smaller
– all the occurrences of a word inside a single block are collapsed to one reference
Disadvantages:
– if exact positions are required, an online search over the qualifying blocks is needed

Example
Text, divided into four blocks (Block 1 to Block 4):
That house has a garden. The garden has many flowers. The flowers are beautiful
Inverted file:
Vocabulary    Occurrences
beautiful     4
flowers       3
garden        2
house         1
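A rough sketch of block addressing. It assumes equal-sized blocks measured in characters and a hypothetical stopword list; the slide does not show where its block boundaries fall, so the block numbers produced here need not match the slide exactly.

```python
import re

def build_block_index(text, n_blocks, stopwords=frozenset()):
    """Map each word to the sorted list of 1-based blocks in which it starts."""
    block_size = -(-len(text) // n_blocks)  # ceiling division
    index = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in stopwords:
            continue
        block = match.start() // block_size + 1
        blocks = index.setdefault(word, [])
        # Collapse repeated occurrences inside one block to a single reference.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return dict(sorted(index.items()))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
stop = frozenset({"that", "has", "a", "the", "many", "are"})
block_index = build_block_index(text, 4, stop)
print(block_index)
```

Both occurrences of "flowers" land in the same block here and are stored as one reference, which is where the space saving comes from.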

Block Effect on Inverted File Size
How big are inverted files, in relation to the original collection size?

Index                    Small collection   Medium collection   Large collection
                         (1 Mb)             (200 Mb)            (2 Gb)
Addressing words         45%   73%          36%   64%           35%   63%
Addressing 256 blocks    27%   41%          18%   32%           5%    9%
Addressing 64 K blocks   18%   25%          1.7%  2.4%          0.5%  0.7%

• In each pair, the left column removes stopwords and the right column indexes them.
Blocks require the text to be available in order to locate terms within blocks.

Searching
The search algorithm on an inverted index follows three steps:
1. Vocabulary search: the words present in the query are located in the vocabulary
2. Retrieval of occurrences: the lists of occurrences of all the query words found are retrieved
3. Manipulation of occurrences: the occurrences are processed to solve the query
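The three steps can be sketched for a conjunctive (AND) query. The index below is a hypothetical document-level index (word to sorted document ids), and `search_and` is an illustrative name rather than anything defined in the slides.

```python
def search_and(index, query):
    """Return the sorted ids of documents that contain every query word."""
    words = query.lower().split()
    # 1. Vocabulary search: locate each query word in the vocabulary.
    if not words or any(word not in index for word in words):
        return []
    # 2. Retrieval of occurrences: fetch the list of each word found.
    lists = [index[word] for word in words]
    # 3. Manipulation of occurrences: intersect, starting with the shortest list.
    lists.sort(key=len)
    result = set(lists[0])
    for occurrences in lists[1:]:
        result &= set(occurrences)
    return sorted(result)

# Hypothetical index: word -> ids of the documents containing it.
index = {"garden": [1, 3], "flowers": [1, 2], "house": [3], "beautiful": [2]}
print(search_and(index, "garden flowers"))  # [1]
print(search_and(index, "house garden"))    # [3]
```

Other query types (OR, phrase, proximity) change only the third step; the first two stay the same.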

Searching
Searching inverted files starts with the vocabulary
– store the vocabulary in a separate file
Structures used to store the vocabulary include:
– Hashing: O(1) lookup, does not support range queries
– Tries: O(c) lookup, c = length(word)
– B-trees: O(log v) lookup, v = vocabulary size
An alternative is simply storing the words in lexicographical order
– cheaper in space and very competitive, with O(log v) cost
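The sorted-array alternative can be sketched with binary search from Python's standard `bisect` module. The function names are illustrative, and the prefix query assumes that no word contains the character U+FFFF.

```python
from bisect import bisect_left

def lookup(vocab, word):
    """Return the index of word in the sorted vocabulary, or -1 if absent."""
    i = bisect_left(vocab, word)  # O(log v) comparisons
    return i if i < len(vocab) and vocab[i] == word else -1

def prefix_range(vocab, prefix):
    """Return all words starting with prefix (a simple range query)."""
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\uffff")  # assumed upper sentinel
    return vocab[lo:hi]

vocabulary = ["beautiful", "flower", "flowers", "garden", "house"]
print(lookup(vocabulary, "garden"))        # 3
print(lookup(vocabulary, "tree"))          # -1
print(prefix_range(vocabulary, "flow"))    # ['flower', 'flowers']
```

Unlike a hash table, the sorted array answers prefix and range queries directly, since matching words are stored next to each other.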

Vocabulary Construction
The whole vocabulary is kept in a suitable data structure that stores each word together with a list of its occurrences.
Each word of each text in the corpus is read and searched for in the vocabulary.
If it is not found, it is added to the vocabulary with an empty list of occurrences.
The new position is then added to the end of the word's list of occurrences.

Index File Construction
Once the text is exhausted, the vocabulary is written to disk with the lists of occurrences. Two files are created:
– in the first file, each list of word occurrences is stored contiguously
– in the second file, the vocabulary is stored in lexicographical order and, for each word, a pointer to its list in the first file is also included
This allows the vocabulary to be kept in memory at search time.
The overall process is O(n) worst-case time.
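A sketch of the construction and the two-file layout, under several assumptions: a plain dictionary stands in for the "suitable data structure", a byte string and a list stand in for the two files on disk, and occurrences are packed as 4-byte little-endian integers. All names are illustrative.

```python
import struct

def build_in_memory(words_with_positions):
    """Append each position to the end of its word's occurrence list."""
    index = {}
    for word, position in words_with_positions:
        index.setdefault(word, []).append(position)
    return index

def write_index(index):
    """Return (occurrences, vocabulary), stand-ins for the two files."""
    occurrences = bytearray()
    vocabulary = []  # (word, byte offset of its list, list length), sorted
    for word in sorted(index):
        positions = index[word]
        vocabulary.append((word, len(occurrences), len(positions)))
        occurrences += struct.pack(f"<{len(positions)}I", *positions)
    return bytes(occurrences), vocabulary

def read_occurrences(occurrences, vocabulary, word):
    """Follow the vocabulary pointer to the word's contiguous list."""
    entry = {w: (off, cnt) for w, off, cnt in vocabulary}.get(word)
    if entry is None:
        return []
    offset, count = entry
    return list(struct.unpack_from(f"<{count}I", occurrences, offset))

pairs = [("house", 6), ("garden", 18), ("garden", 29),
         ("flowers", 45), ("flowers", 58), ("beautiful", 70)]
occ_file, vocab_file = write_index(build_in_memory(pairs))
print(vocab_file)
print(read_occurrences(occ_file, vocab_file, "garden"))  # [18, 29]
```

Only the small vocabulary needs to stay in memory at search time; each occurrence list is fetched by offset when a query asks for it.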

Faster Large Index Construction
An option is to use the previous algorithm until the main memory is exhausted.
When no more memory is available, the partial index I_i obtained so far is written to disk and erased from main memory before continuing with the rest of the text.
Once the text is exhausted, a number of partial indices I_i exist on disk.
The partial indices are merged to obtain the final index.

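Example
Merging eight partial indices pairwise; the number in parentheses gives a possible order in which the merges are performed:
final index:              I_1..8 (7)
level 3:                  I_1..4 (3)   I_5..8 (6)
level 2:                  I_1..2 (1)   I_3..4 (2)   I_5..6 (4)   I_7..8 (5)
level 1 (initial dumps):  I_1  I_2  I_3  I_4  I_5  I_6  I_7  I_8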
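The dump-and-merge scheme can be sketched in memory. Here `max_entries` is a hypothetical stand-in for the memory limit, Python lists stand in for partial indices on disk, and the merges are done level by level, which visits the same binary tree in a different order than the numbering in the example.

```python
def build_partials(tokens, max_entries):
    """Dump a partial index every time max_entries occurrences are stored."""
    partials, current, used = [], {}, 0
    for position, word in enumerate(tokens, start=1):
        current.setdefault(word, []).append(position)
        used += 1
        if used == max_entries:  # "memory exhausted": dump and start afresh
            partials.append(current)
            current, used = {}, 0
    if current:
        partials.append(current)
    return partials

def merge_two(a, b):
    """Merge two partial indices; b covers later text, so its lists go last."""
    merged = {word: list(occ) for word, occ in a.items()}
    for word, occ in b.items():
        merged.setdefault(word, []).extend(occ)
    return dict(sorted(merged.items()))

def merge_levels(partials):
    """Merge pairwise, level by level; return (final index, merging levels)."""
    levels = 0
    while len(partials) > 1:
        partials = [
            merge_two(partials[i], partials[i + 1])
            if i + 1 < len(partials) else partials[i]
            for i in range(0, len(partials), 2)
        ]
        levels += 1
    return (partials[0] if partials else {}), levels

tokens = "a b a c b a d a".split()
partials = build_partials(tokens, max_entries=2)
final_index, levels = merge_levels(partials)
print(len(partials), levels, final_index)
```

Because each partial index covers a later stretch of text than the one before it, concatenating the lists keeps every occurrence list sorted without any re-sorting.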

Large Index Construction Time
The total time to generate the partial indices is O(n).
The number of partial indices is O(n/M), where M is the size of main memory.
Merging the O(n/M) partial indices requires log_2(n/M) merging levels.
The total cost of this algorithm is O(n log(n/M)).
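The total can be recovered from the pieces above with one extra assumption that the slide leaves implicit: merging two partial indices takes time linear in their combined size, so each merging level processes O(n) index data in total. The labels T_dump, T_merge, and T_total are introduced here only for the derivation.

```latex
\begin{align*}
T_{\text{dump}}  &= O(n) \\
\text{levels}    &= \lceil \log_2 (n/M) \rceil \\
T_{\text{merge}} &= O(n) \cdot \lceil \log_2 (n/M) \rceil = O\bigl(n \log (n/M)\bigr) \\
T_{\text{total}} &= T_{\text{dump}} + T_{\text{merge}} = O\bigl(n \log (n/M)\bigr)
\end{align*}
```

The last step holds when n/M is at least 2, that is, when the text does not fit in memory and at least one merge is needed.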

Conclusion
Inverted files are used to index text.
The indices are appropriate when the text collection is large and semi-static.
If the text collection is volatile, online searching is the only option.
Some techniques combine online and indexed searching.

IR System: What Do You Need?
Vocabulary List
– Text preprocessing modules
  • lexical analysis, stemming, stopwords
Occurrences of Vocabulary Terms
– Inverted index creation
  • term frequency in documents, document frequency
Retrieval and Ranking Algorithm
Query and Ranking Interfaces
Browsing/Visualization Interface