File Structures Information Retrieval Data Structures and Algorithms

File Structures. Information Retrieval: Data Structures and Algorithms by W. B. Frakes and R. Baeza-Yates (Eds.). Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapters 3-5)

File Structures for IR
  • lexicographical indices
      » indices that are sorted
      » e.g. inverted files
      » e.g. Patricia (PAT) trees
  • cluster file structures
  • indices based on hashing
      » signature files

Inverted Files. Information Retrieval: Data Structures and Algorithms by W. B. Frakes and R. Baeza-Yates (Eds.). (Chapter 3)

Inverted Files
  • Each document is assigned a list of keywords or attributes.
  • Each keyword (attribute) is associated with operational relevance weights.
  • An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
  • Penalty
      » the size of an inverted file ranges from 10% to 100% or more of the size of the text itself
      » the index must be updated as the data set changes

Indexing Restrictions
  • A controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed.
  • A list of stopwords that, for reasons of volume, will not be included in the index.
  • A set of rules that decide the beginning of a word or a piece of text that is indexable.
  • A list of character sequences to be indexed (or not indexed).

Sorted array implementation of an inverted file

Structures used in Inverted Files
  • Sorted Arrays
      » store the list of keywords in a sorted array
      » search using a standard binary search
      » advantage: easy to implement
      » disadvantage: updating the index is expensive
  • Hashing Structures
  • Tries (digital search trees)
  • Combinations of these structures
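The sorted-array option can be illustrated with a minimal sketch. The keyword list, the document numbers, and the function names `lookup` and `add_posting` are made up for illustration; they are not taken from the slides or the book.

```python
from bisect import bisect_left

# Hypothetical toy index: keywords kept in sorted order, each paired
# with the list of document numbers that contain it.
keywords = ["algorithm", "file", "index", "retrieval"]
postings = [[2, 5], [1, 2, 3], [1, 4], [3]]

def lookup(term):
    """Binary-search the sorted keyword array and return the term's documents."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        return postings[i]
    return []

def add_posting(term, doc):
    """Add a document for a term, keeping the keyword array sorted.

    Inserting a new keyword shifts every later entry, which is why
    updates are the expensive part of this structure.
    """
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        postings[i].append(doc)
    else:
        keywords.insert(i, term)
        postings.insert(i, [doc])
```

Lookup costs O(log n) comparisons, while inserting a new keyword costs O(n) because of the shifting.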

Sorted Arrays
  1. The input text is parsed into a list of words along with their location in the text. (This is a time- and storage-consuming operation.)
  2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
  3. Add term weights, or reorganize or compress the files.
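The three steps can be sketched on a toy sentence as follows. This is only an illustration: word positions stand in for "location", the tokenizer is a simple regular expression, and step 3 is shown as one possible reorganization (merging duplicate terms) rather than weighting or compression. The function names are assumptions.

```python
import re

def parse(text):
    """Step 1: list of (word, location) pairs in location order.

    Location here is the 1-based word position (an assumption).
    """
    words = re.findall(r"[A-Za-z]+", text)
    return [(w.lower(), pos) for pos, w in enumerate(words, start=1)]

def invert(word_list):
    """Step 2: reorder from location order to alphabetical order."""
    return sorted(word_list)

def merge(sorted_list):
    """Step 3 (one possible reorganization): one entry per term."""
    merged = []
    for word, pos in sorted_list:
        if merged and merged[-1][0] == word:
            merged[-1][1].append(pos)
        else:
            merged.append((word, [pos]))
    return merged
```

For example, `merge(invert(parse("the cat saw the dog")))` groups both occurrences of "the" under a single entry.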

Inversion of Word List

Dictionary and postings file
  • Idea: the file to be searched should be as short as possible
      » split a single file into two pieces
      » e.g. data set: 38,304 records, 250,000 unique terms
      » (document #, frequency) pairs
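A minimal sketch of the split, assuming the dictionary row layout (term, number of postings, offset into the postings file); that layout and the function names are illustrative assumptions, and the postings are the (document #, frequency) pairs mentioned above.

```python
from bisect import bisect_left

def split_index(inverted):
    """Split a sorted list of (term, [(doc, freq), ...]) into two pieces.

    Returns (dictionary, postings). Each dictionary row is the assumed
    layout (term, number_of_postings, offset_into_postings).
    """
    dictionary, postings = [], []
    for term, plist in inverted:
        dictionary.append((term, len(plist), len(postings)))
        postings.extend(plist)
    return dictionary, postings

def search(dictionary, postings, term):
    """Binary-search only the short dictionary, then read one postings slice."""
    # (term,) sorts just before any row that starts with the same term.
    i = bisect_left(dictionary, (term,))
    if i == len(dictionary) or dictionary[i][0] != term:
        return []
    _, count, offset = dictionary[i]
    return postings[offset:offset + count]
```

Only the dictionary takes part in the search, so the searched file stays short no matter how long the postings lists grow.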

Producing an Inverted File for Large Data Sets without Sorting
  • Idea: avoid the use of an explicit sort by using a right-threaded binary tree
      » each node holds the current number of term postings and the storage location of the postings list
      » traverse the binary tree and the linked postings list
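The idea can be sketched with a plain (unthreaded, unbalanced) binary search tree; this is a simplification, not the right-threaded tree itself. The sketch keeps each postings list directly in the node instead of a count plus a storage location, and uses an explicit stack for the traversal that the threads would otherwise provide. All names are hypothetical.

```python
class Node:
    def __init__(self, term, doc):
        self.term = term
        self.postings = [doc]  # postings list kept in the node (simplification)
        self.left = None
        self.right = None

def insert(root, term, doc):
    """Add one (term, doc) occurrence to the tree and return the root."""
    if root is None:
        return Node(term, doc)
    node = root
    while True:
        if term == node.term:
            node.postings.append(doc)
            return root
        side = "left" if term < node.term else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(term, doc))
            return root
        node = child

def in_order(root):
    """Yield (term, postings) in alphabetical order without any sort call."""
    stack, node = [], root
    while stack or node:
        while node:
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.term, node.postings
        node = node.right
```

The alphabetical order of the output comes from the tree shape alone, so no explicit sort is run over the word list.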

A Fast Inversion Algorithm
  • Principle 1: large primary memories are available
      » If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
  • Principle 2: the input data has an inherent order
      » It is very expensive to use polynomial or even n log n sorting algorithms for large files.

FAST-INV algorithm (figure showing concept postings/pointers). See p. 13.

Sample document vector (figure columns: document number, concept number; one concept number for each unique word)
  • Similar to the document-word list shown on p. 7.
  • The concept numbers are sorted within document numbers, and document numbers are sorted within the collection.

Preparation
  • Terminology
      » HCN = highest concept number in the dictionary, or the number of words to be indexed
      » L = number of document/concept pairs in the collection
      » M = available primary memory size
  • Assumption
      » M >> HCN
      » M < L

Load table: the range of concepts for each primary load
  • Read in each (Doc, Con) pair and look up Con in the load table to determine which load the pair falls into.
  • Invert each load file in turn. The offset in the CONPTR table shows the position where each record should be written.
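The two passes described here can be sketched as below. This is an illustrative reading, not the published FAST-INV code: the load table is assumed to be a list of inclusive (low_con, high_con) ranges that together cover every concept number, everything is held in Python lists rather than files, and the per-load `offsets` list plays the role of the CONPTR offsets.

```python
def invert_by_loads(pairs, load_table):
    """pairs: (doc, con) tuples in document order.
    load_table: assumed list of inclusive (low_con, high_con) ranges.
    Returns the inverted file as a list of (con, doc) tuples.
    """
    # Pass 1: route each pair to the load whose concept range covers it.
    load_files = [[] for _ in load_table]
    for doc, con in pairs:
        for k, (low, high) in enumerate(load_table):
            if low <= con <= high:
                load_files[k].append((doc, con))
                break

    inverted = []
    # Pass 2: invert each load in turn, without sorting.
    for (low, high), load in zip(load_table, load_files):
        # Count postings per concept, then turn counts into start offsets.
        counts = [0] * (high - low + 1)
        for _doc, con in load:
            counts[con - low] += 1
        offsets, total = [], 0
        for c in counts:
            offsets.append(total)
            total += c
        # Drop each pair into the next free slot for its concept.
        out = [None] * total
        for doc, con in load:
            out[offsets[con - low]] = (con, doc)
            offsets[con - low] += 1
        inverted.extend(out)
    return inverted
```

Because the input is already ordered by document number, the documents within each concept come out in ascending order, which is the "inherent order" the algorithm relies on.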

Preparation
  1. Allocate an array, con_entries_cnt, of size HCN.
  2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#].

  Postings in (con#, doc#) order      Running total of postings
  (none yet)                          0
  (1, 1), (1, 4)                      2
  (2, 3)                              3
  (3, 1), (3, 2), (3, 5)              6
  (4, 2), (4, 3)                      8
  …
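Steps 1 and 2 can be sketched as follows. The document vector in the usage note is reconstructed from the (con#, doc#) postings listed above, and the helper `running_totals` is a hypothetical addition that reproduces the totals 0, 2, 3, 6, 8.

```python
def count_postings(pairs, hcn):
    """Steps 1-2: count how many postings each concept number has.

    Index 0 is left unused so that concept numbers can start at 1
    (an assumption of this sketch).
    """
    con_entries_cnt = [0] * (hcn + 1)
    for _doc, con in pairs:
        con_entries_cnt[con] += 1
    return con_entries_cnt

def running_totals(con_entries_cnt):
    """Cumulative postings count after each concept, starting from 0."""
    totals, total = [0], 0
    for count in con_entries_cnt[1:]:
        total += count
        totals.append(total)
    return totals
```

With the pairs `[(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]` and HCN = 4, the counts for concepts 1 to 4 are 2, 1, 3, 2.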

Preparation (continued)
  5. For each <con#, count> pair obtained from con_entries_cnt: if there is no room for the documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update the information for the current load table entry.

Building Load Table
  • Terminology
      » LL = length of current load
      » S = spread of concept numbers in the current load
      » 8 bytes = space needed for each concept/weight pair
      » 4 bytes = space needed for each concept to store count of postings for it
  • Constraints
      » 8*LL + 4*S < M
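One way to apply the constraint while walking the per-concept counts is sketched below. The greedy packing, the (low_con, high_con, LL) row layout, and the error raised for a concept that cannot fit on its own are assumptions of this sketch, not details given in the slides.

```python
def build_load_table(con_entries_cnt, memory_bytes):
    """con_entries_cnt[c] = number of postings for concept c (index 0 unused).

    Greedily packs consecutive concepts into loads so that every load
    satisfies 8*LL + 4*S < memory_bytes.
    Returns a list of (low_con, high_con, LL) rows.
    """
    hcn = len(con_entries_cnt) - 1
    loads = []
    low, ll = 1, 0
    for con in range(1, hcn + 1):
        count = con_entries_cnt[con]
        if 8 * count + 4 >= memory_bytes:
            raise ValueError(f"concept {con} does not fit in one load")
        spread = con - low + 1
        if 8 * (ll + count) + 4 * spread >= memory_bytes:
            # No room: close the current load and start a new one here.
            loads.append((low, con - 1, ll))
            low, ll = con, 0
        ll += count
    if low <= hcn:
        loads.append((low, hcn, ll))
    return loads
```

With counts 2, 1, 3, 2 for concepts 1 to 4 and M = 50 bytes, concepts 1-2 form one load (8*3 + 4*2 = 32) and concepts 3-4 form the next (8*5 + 4*2 = 48).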
