INVERTED FILES
CS 4323 / 0910-1
YFA
Available online at http://www.ittelkom.ac.id/staf/yanuar
10 YFA CS 4323 S1/IT/IR/E3/1109
Institut Teknologi Telkom

Results of a Search
[Figure: the documents found by the search query, each marked x, shown as the hits from the search.]

Relevance Feedback (Concept)
[Figures: concept diagrams; surviving label text: Generated New Query Expansion.]
[Figure: hits from the original search, where x marks documents identified as nonrelevant and o marks documents identified as relevant; the original query is shown next to the reformulated query.]

Document Clustering (Concept)
[Figure: documents, each marked x, grouped into clusters.]
Document clusters are a form of automatic classification.
A document may be in several clusters.

Organization of Inverted Files
  • Index file: each term with a pointer to its postings (ant, bee, cat, dog, elk, fox, gnu, hog)
  • Postings file: the inverted lists
  • Documents file
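
The three-file organization above can be sketched in memory as follows. This is only an illustration under stated assumptions: the sample documents, the names build_inverted_file and lookup, and the use of (offset, length) pairs as "pointers" are hypothetical and do not come from the slides.

```python
from typing import Dict, List, Set, Tuple

IndexFile = Dict[str, Tuple[int, int]]  # term -> (offset, length) in the postings file


def build_inverted_file(documents: List[str]) -> Tuple[IndexFile, List[int]]:
    """Build an index file and a flat postings file from a documents file."""
    # Collect the set of document ids in which each term occurs.
    term_to_docs: Dict[str, Set[int]] = {}
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            term_to_docs.setdefault(term, set()).add(doc_id)

    index_file: IndexFile = {}
    postings_file: List[int] = []
    # Terms are written in lexicographic order (ant, bee, cat, ...).
    for term in sorted(term_to_docs):
        doc_ids = sorted(term_to_docs[term])
        # The "pointer" is where this term's inverted list starts and how long it is.
        index_file[term] = (len(postings_file), len(doc_ids))
        postings_file.extend(doc_ids)
    return index_file, postings_file


def lookup(term: str, index_file: IndexFile, postings_file: List[int]) -> List[int]:
    """Follow the pointer from the index file to the term's inverted list."""
    if term not in index_file:
        return []
    offset, length = index_file[term]
    return postings_file[offset:offset + length]


# Hypothetical documents file: the position in the list is the document id.
documents_file = ["ant bee cat", "bee dog", "cat dog elk", "ant elk"]
index_file, postings_file = build_inverted_file(documents_file)
print(lookup("bee", index_file, postings_file))  # [0, 1]
```

Keeping all inverted lists in one flat postings file mirrors the slide's separation of index and postings; a real system would store byte offsets on disk rather than list positions.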

Decisions in Building an Inverted File: Efficiency and Query Languages
Some query options may require huge computation, e.g.,
  • Regular expressions
    If inverted files are stored in lexicographic order,
      comp* can be processed efficiently
      *comp cannot be processed efficiently
  • Boolean terms
    If A and B are search terms,
      A or B can be processed by comparing two moderate-sized lists
      (not A) or (not B) requires two very large lists
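
A rough sketch of why these cases differ, assuming the terms are held as a sorted Python list and each postings list is a sorted list of document ids. The function names and sample terms are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left
from typing import List


def prefix_terms(sorted_terms: List[str], prefix: str) -> List[str]:
    """comp*: binary search to the first candidate, then scan a contiguous run."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # lexicographic order guarantees no later matches
        matches.append(term)
    return matches


def suffix_terms(sorted_terms: List[str], suffix: str) -> List[str]:
    """*comp: the ordering gives no help, so every term must be inspected."""
    return [term for term in sorted_terms if term.endswith(suffix)]


def union(a: List[int], b: List[int]) -> List[int]:
    """A or B: merge two sorted postings lists in one pass."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            merged.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def complement(postings: List[int], num_docs: int) -> List[int]:
    """not A: every document id that is absent from the postings list."""
    present = set(postings)
    return [doc_id for doc_id in range(num_docs) if doc_id not in present]


terms = ["compact", "compile", "compute", "decomp", "dog", "precomp"]
print(prefix_terms(terms, "comp"))   # ['compact', 'compile', 'compute']
print(suffix_terms(terms, "comp"))   # ['decomp', 'precomp']
print(union([2, 5, 11], [3, 5, 19]))  # [2, 3, 5, 11, 19]
```

For (not A) or (not B), union(complement(a, n), complement(b, n)) has to materialize two lists whose combined length is close to twice the collection size when A and B are rare terms, which is the cost the slide warns about.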

Postings File
The postings file stores the elements of a sparse matrix, the term assignment matrix.
It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file.
Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
Each list consists of one or many individual postings.
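
To make the column-wise storage concrete, here is a small sketch that turns a dense term assignment matrix into inverted lists. It assumes rows are documents and columns are terms; the matrix values and the name matrix_to_inverted_lists are made up for illustration.

```python
from typing import Dict, List


def matrix_to_inverted_lists(matrix: List[List[int]], terms: List[str]) -> Dict[str, List[int]]:
    """Keep only the nonzero entries of each column as that term's inverted list."""
    inverted: Dict[str, List[int]] = {term: [] for term in terms}
    for doc_id, row in enumerate(matrix):
        for column, value in enumerate(row):
            if value:
                # One posting: this term occurs in this document.
                inverted[terms[column]].append(doc_id)
    return inverted


terms = ["ant", "bee", "cat"]
# Hypothetical term assignment matrix: matrix[d][t] is 1 if term t occurs in document d.
matrix = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
]
inverted_lists = matrix_to_inverted_lists(matrix, terms)
print(inverted_lists)  # {'ant': [0, 2], 'bee': [1, 2], 'cat': [0]}
```

Only the five nonzero entries are stored as postings, rather than all nine matrix cells; for a realistic collection the matrix is far sparser and the saving correspondingly larger.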

Postings File: A Linked List for Each Term
[Figure: each of the terms 1 abacus, 2 actor, 3 aspen, 4 atoll points to its own linked list of postings.]
A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.
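
The trade-off can be seen in a minimal linked-list sketch, assuming each posting holds just a document id and that lists are kept in ascending order. The class and function names, and the sample ids, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Posting:
    doc_id: int
    next_posting: Optional["Posting"] = None


def insert_sorted(head: Optional[Posting], doc_id: int) -> Posting:
    """Insert doc_id in ascending order and return the (possibly new) head.

    The walk from the head is what makes updates slow on long lists.
    """
    if head is None or doc_id < head.doc_id:
        return Posting(doc_id, head)
    node = head
    while node.next_posting is not None and node.next_posting.doc_id <= doc_id:
        node = node.next_posting
    if node.doc_id != doc_id:  # ignore a document id that is already present
        node.next_posting = Posting(doc_id, node.next_posting)
    return head


def traverse(head: Optional[Posting]) -> Iterator[int]:
    """Sequential processing: follow the links from the first posting to the last."""
    node = head
    while node is not None:
        yield node.doc_id
        node = node.next_posting


head: Optional[Posting] = None
for doc_id in [19, 3, 22, 19]:
    head = insert_sorted(head, doc_id)
print(list(traverse(head)))  # [3, 19, 22]
```

Traversal touches each posting once, while every insertion may walk the whole list before finding its place.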

Index File Structures: Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog

              elk
            /     \
         bee       hog
        /   \     /
     ant    cat  fox
              \    \
              dog   gnu
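
A short sketch of how this tree arises, assuming plain (unbalanced) binary search tree insertion in the order the terms are given. The Node class and helper names are illustrative, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    term: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], term: str) -> Node:
    """Insert a term: smaller terms go left, larger terms go right."""
    if root is None:
        return Node(term)
    if term < root.term:
        root.left = insert(root.left, term)
    elif term > root.term:
        root.right = insert(root.right, term)
    return root  # an equal term is already in the tree


def in_order(root: Optional[Node]) -> Iterator[str]:
    """Visit left subtree, node, right subtree: yields terms in sorted order."""
    if root is not None:
        yield from in_order(root.left)
        yield root.term
        yield from in_order(root.right)


root: Optional[Node] = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)

print(root.term, root.left.term, root.right.term)  # elk bee hog
print(list(in_order(root)))
```

The first term inserted becomes the root, so the shape depends entirely on the input order, while the in-order traversal always returns the terms in lexicographic order.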

Binary Tree
Advantages
  • Can be searched quickly
  • Convenient for batch updating
  • Easy to add an extra term
  • Economical use of storage
Disadvantages
  • Less well suited to lexicographic processing, e.g., comp*
  • Tree tends to become unbalanced
If the index is held on disk, it is important to optimize the number of disk accesses.
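
The balance problem can be illustrated by comparing tree heights for two insertion orders. This sketch assumes simple unbalanced insertion and, as a simplification, that the height approximates the worst-case number of node visits (and hence disk accesses if each node were read separately). The names build and height are hypothetical.

```python
from typing import List, Optional


class Node:
    def __init__(self, term: str) -> None:
        self.term = term
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def build(terms: List[str]) -> Optional[Node]:
    """Insert terms one by one into an unbalanced binary search tree."""
    root: Optional[Node] = None
    for term in terms:
        if root is None:
            root = Node(term)
            continue
        node = root
        while True:
            if term < node.term:
                if node.left is None:
                    node.left = Node(term)
                    break
                node = node.left
            elif term > node.term:
                if node.right is None:
                    node.right = Node(term)
                    break
                node = node.right
            else:
                break  # duplicate term, nothing to add
    return root


def height(node: Optional[Node]) -> int:
    """Number of nodes on the longest path from this node down to a leaf."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


slide_order = ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]
print(height(build(slide_order)))          # 4
print(height(build(sorted(slide_order))))  # 8
```

With already sorted input every term goes to the right of the previous one, so the tree degenerates into a chain and a lookup may visit all eight nodes.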

YFA November 2009 (3rd Edition), February 2008
Adapted from cs.cornell.edu