TEXT BASED INFORMATION RETRIEVAL CS 4323 0910 2

CS 4323 / 0910-2
YFA
Available online at http://www.ittelkom.ac.id/sisfo/yanuar
04 YFA CS 4323 S 1/IT/IR/E 4/0310
Institut Teknologi Telkom
http://www.ittelkom.ac.id/staf/yanuar

Course Description
This course studies techniques and human factors in discovering information in online information systems. The methods covered include techniques for searching, browsing, and filtering information; descriptive metadata; and the use of classification systems and thesauri, with examples from Web search systems and digital libraries.


Information Retrieval from Collections of Textual Documents
Major Categories of Methods
1. Exact matching (Boolean)
2. Ranking by similarity to query (vector space model)
3. Ranking of matches by importance of documents (PageRank)
4. Combination methods
The course begins with Boolean methods, then similarity methods, then importance methods.

Text Based Information Retrieval
Most matching methods are based on Boolean operators.
Most ranking methods are based on the vector space model.
Web search methods combine the vector space model with ranking based on the importance of documents.
Many practical systems combine features of several approaches.
In their basic form, all approaches treat words as separate tokens, with minimal attempt to interpret them linguistically.

Documents
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation.
The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
[Methods of markup, e.g., XML]

Word Frequency
Observation: Some words are more common than others.
Statistics: Most large collections of text documents have similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of the data structures used to index documents
• are relied on by many retrieval models

Word Frequency Example
The following example is taken from: Jamie Callan, Characteristics of Text, 1997.
Sample of 19 million words.
The next slide shows the 50 commonest words in rank order (r), with their frequency (f).

Word Frequency
 r  word           f     r  word           f     r  word           f
 1  the      1130021    18  from       96900    35  or         54958
 2  of        547311    19  he         94585    36  about      53713
 3  to        516635    20  million    93515    37  market     52110
 4  a         464736    21  year       90104    38  they       51359
 5  in        390819    22  its        86774    39  this       50933
 6  and       387703    23  be         85588    40  would      50828
 7  that      204351    24  was        83398    41  you        49281
 8  for       199340    25  company    83070    42  which      48273
 9  is        152483    26  an         76974    43  bank       47940
10  said      148302    27  has        74405    44  stock      47401
11  it        134323    28  are        74097    45  trade      47310
12  on        121173    29  have       73132    46  his        47116
13  by        118863    30  but        71887    47  more       46244
14  as        109135    31  will       71494    48  who        42142
15  at        101779    32  say        66807    49  one        41635
16  mr        101679    33  new        64456    50  their      40910
17  with      101210    34  share      63925

Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
f is the number of times that w appears
r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: frequency f plotted against rank r; each word w has rank r and frequency f.]

Rank Frequency Example
The next slide shows the words in Callan's data normalized. In this example:
r is the rank of word w in the sample.
f is the frequency of word w in the sample.
n is the total number of word occurrences in the sample.
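
The following is a minimal sketch of one way such a normalization could be computed. It assumes the normalized value is r*f/n, which is not stated in the text above, and it takes n to be roughly 19 million, the approximate sample size quoted for Callan's data. The word frequencies are copied from the frequency table; the names N, SAMPLE, and normalized are hypothetical. If rank times frequency is roughly constant across words, as Zipf's law suggests, the normalized values should all be of a similar magnitude.

```python
# Approximate total number of word occurrences (assumption: "19 million words").
N = 19_000_000

# A few (rank, word, frequency) entries copied from the frequency table.
SAMPLE = [
    (1, "the", 1130021),
    (2, "of", 547311),
    (3, "to", 516635),
    (10, "said", 148302),
    (50, "their", 40910),
]

def normalized(r, f, n=N):
    """Assumed normalization: rank times frequency, divided by sample size."""
    return r * f / n

for r, word, f in SAMPLE:
    print(f"{r:>2}  {word:<6} {f:>8}  {normalized(r, f):.3f}")
```

For these five words the normalized value stays within a factor of about two even though the raw frequencies differ by a factor of more than 25.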

Methods that Build on Zipf's Law
Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used.
Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.

EXACT MATCHING (BOOLEAN MODEL)


1. Exact Matching (Boolean Model)
[Figure: diagram with the elements Documents, Index database, Query, Mechanism for determining whether a document matches a query, and Set of hits.]

Evaluation of Matching: Recall and Precision
If information retrieval were perfect, every hit would be relevant to the original query, and every relevant item in the body of information would be found.
Precision: percentage (or fraction) of the hits that are relevant, i.e., the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.
Recall: percentage (or fraction) of the relevant items that are found by the query, i.e., the extent to which the query found all the items that satisfy the requirement.


Precision and Recall
Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the system?


Recall and Precision with Exact Matching: Example
• Collection of 10,000 documents, 50 on a specific topic
• Ideal search finds these 50 documents and rejects all others
• Actual search identifies 25 documents; 20 are relevant but 5 are on other topics
• Precision: 20/25 = 0.8 (80% of hits were relevant)
• Recall: 20/50 = 0.4 (40% of relevant documents were found)
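
The arithmetic in this example can be sketched in a few lines. The function name precision_recall and the document IDs are hypothetical; the IDs are chosen only so that 20 of the 25 hits fall among the 50 relevant documents.

```python
def precision_recall(hits, relevant):
    """Return (precision, recall) for a set of hits and a set of relevant documents."""
    hits, relevant = set(hits), set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs: documents 0..49 are the 50 documents on the topic.
relevant = set(range(50))
# The search returns 20 relevant documents and 5 documents on other topics.
hits = set(range(20)) | set(range(100, 105))

precision, recall = precision_recall(hits, relevant)
print(precision, recall)
```

Note that computing recall requires the full set of relevant documents, which in practice is rarely known.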

Measuring Precision and Recall
Precision is easy to measure:
• A knowledgeable person looks at each document that is identified and decides whether it is relevant.
• In the example, only the 25 documents that are found need to be examined.
Recall is difficult to measure:
• To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria.
• In the example, all 10,000 documents must be examined.


Query
A query is a string to match against entries in an index. The string may contain:
search terms      computation
operators         computation and parallel
fields            author = Newton
metacharacters    b[aeiou]n*g
(Metacharacters can be used to build regular expressions, which will be covered later in the course.)
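
As a rough illustration of the metacharacter case, the sketch below reads the pattern b[aeiou]n*g with the rules of Python's re module. That reading is an assumption: a particular search system may give the same characters a different meaning. The names PATTERN, matches, and candidates are hypothetical.

```python
import re

# Assumption: Python regex semantics, i.e. 'b', exactly one vowel,
# zero or more 'n', then 'g'.
PATTERN = re.compile(r"b[aeiou]n*g")

def matches(term):
    """True if the whole index term matches the pattern."""
    return PATTERN.fullmatch(term) is not None

# Hypothetical index terms to test against the pattern.
candidates = ["bag", "bang", "being", "bg", "bunng"]
matched = [t for t in candidates if matches(t)]
print(matched)
```

Here fullmatch is used so that the pattern must cover the entire index term rather than a substring of it.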

Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g., and, or, not
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor
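
One way to picture these queries is to treat each term as the set of documents that contain it, so that and, or, and not become intersection, union, and complement. The sketch below assumes that view; the document IDs and the names postings, all_docs, and docs are illustrative, and the collection is assumed to consist only of the documents listed.

```python
# Illustrative data: term -> set of IDs of the documents containing it.
postings = {
    "abacus": {3, 19, 22},
    "actor": {2, 19, 29},
    "atoll": {11, 34},
}

# Assumed collection: every document ID that appears above.
all_docs = set().union(*postings.values())

def docs(term):
    """Documents containing the term (empty set for an unknown term)."""
    return postings.get(term, set())

q1 = docs("abacus") & docs("actor")                # abacus and actor
q2 = docs("abacus") | docs("actor")                # abacus or actor
q3 = (docs("abacus") & docs("actor")) | (docs("abacus") & docs("atoll"))
q4 = all_docs - docs("actor")                      # not actor

print(q1, q2, q3, q4)
```

The not operator needs to know the whole collection, which is why all_docs has to be defined explicitly.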

Boolean Diagram
[Figure: diagrams of the regions not (A or B), A and B, and A or B.]


Adjacent and Near Operators
abacus adj actor
Terms abacus and actor are adjacent to each other, as in the string "abacus actor".
abacus near 4 actor
Terms abacus and actor are near each other (here, within 4 words), as in the string "the actor has an abacus".
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).
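
A minimal sketch of these two operators over a tokenized string follows. It assumes that adj requires the first term to be immediately followed by the second, and that near k accepts either order with at most k token positions between the two terms; real systems may define both differently. The function names are hypothetical.

```python
def positions(tokens, term):
    """Token positions at which the term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def adj(tokens, a, b):
    """Assumed meaning: term a is immediately followed by term b."""
    b_positions = set(positions(tokens, b))
    return any(p + 1 in b_positions for p in positions(tokens, a))

def near(tokens, a, b, k):
    """Assumed meaning: a and b occur at most k positions apart, in either order."""
    return any(
        abs(p - q) <= k
        for p in positions(tokens, a)
        for q in positions(tokens, b)
    )

print(adj("abacus actor".split(), "abacus", "actor"))
print(near("the actor has an abacus".split(), "abacus", "actor", 4))
```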

Evaluation of Boolean Operators
Precedence of operators must be defined:
adj, near    high
and, not
or           low
Example:
A and B or C and B
is evaluated as
(A and B) or (C and B)
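
The effect of precedence can be seen with Python's set operators, where & (intersection) happens to bind more tightly than | (union), mirroring and over or. This is only an analogy with illustrative sets A, B, and C, not a description of how any particular retrieval system parses queries.

```python
# Illustrative document sets for the terms A, B and C.
A = {1, 2}
B = {2, 3}
C = {3, 4}

# Without parentheses, & is applied before |.
implicit = A & B | C & B
explicit = (A & B) | (C & B)

# A different grouping of the same terms gives a different result.
other_grouping = A & (B | C) & B

print(implicit, explicit, other_grouping)
```

Because the two groupings return different sets, a system has to fix its precedence rules or require explicit parentheses.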

Inverted File
Inverted file: A list of search terms that are used to index a set of documents.
The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?"
In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.


Inverted File -- Basic Concept
Word      Document
abacus    3  19  22
actor     2  19  29
aspen     5
atoll     11  34
Stop words are removed before building the index.
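
Building such a structure can be sketched as follows. The document texts, the stop list, and the names STOP_WORDS and build_inverted_file are hypothetical, and tokenization is simplified to lower-casing and splitting on whitespace.

```python
from collections import defaultdict

# Illustrative stop list; a real system would use a longer one.
STOP_WORDS = {"the", "of", "to", "a", "in", "and"}

def build_inverted_file(docs):
    """Map each non-stop word to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = set(docs[doc_id].lower().split()) - STOP_WORDS
        for term in terms:
            index[term].append(doc_id)
    return dict(index)

# Hypothetical documents, keyed by document ID.
docs = {
    2: "the actor",
    3: "the abacus",
    5: "the aspen",
    19: "the abacus of the actor",
}

index = build_inverted_file(docs)
print(index["abacus"])
```

Visiting documents in ascending ID order keeps every list sorted, which later makes merging two lists straightforward.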

Inverted List -- Concept
Inverted List: All the entries in an inverted file that apply to a specific word, e.g.,
abacus    3  19  22
Posting: Entry in an inverted list, e.g., there are three postings for "abacus".

Evaluating a Boolean Query
Example: abacus and actor
Postings for abacus:    3  19  22
Postings for actor:     2  19  29
To evaluate the and operator, merge the two inverted lists with a logical AND operation.
Document 19 is the only document that contains both terms, "abacus" and "actor".
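
The merge can be sketched as a single pass over two sorted lists. This assumes that each inverted list is kept in ascending document order; the function name intersect is hypothetical.

```python
def intersect(p1, p2):
    """Merge two sorted posting lists, keeping documents that appear in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller document ID
        else:
            j += 1
    return result

abacus = [3, 19, 22]
actor = [2, 19, 29]
both = intersect(abacus, actor)
print(both)
```

Each list is scanned at most once, so the cost grows with the combined length of the two lists rather than with their product.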

Enhancements to Inverted Files -- Concept
Location: The inverted file can hold information about the location of each term within the document.
Uses:
• adjacency and near operators
• user interface design -- highlight location of search term
Frequency: The inverted file includes the number of postings for each term.
Uses:
• term weighting
• query processing optimization

Inverted File -- Concept (Enhanced)
Word      Postings    Document    Location
abacus    4           3           94
                      19          7
                      19          212
                      22          56
actor     3           2           66
                      19          213
                      29          45
aspen     1           5           43
atoll     3           11          3
                      11          70
                      34          40

Evaluating an Adjacency Operation
Example: abacus adj actor
Postings for abacus (document, location):    3 94    19 7    19 212    22 56
Postings for actor (document, location):     2 66    19 213    29 45
Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
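
A small sketch of this evaluation over positional postings is given below. It assumes that locations are word offsets within a document and that adj means the second term sits at the location immediately after the first; the function name adjacent is hypothetical.

```python
def adjacent(postings_a, postings_b):
    """Return (document, location_a, location_b) where b directly follows a."""
    b_entries = set(postings_b)
    return [
        (doc, loc, loc + 1)
        for doc, loc in postings_a
        if (doc, loc + 1) in b_entries
    ]

# (document, location) pairs taken from the postings above.
abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]

matches = adjacent(abacus, actor)
print(matches)
```

The occurrence of "abacus" at location 7 in document 19 is rejected because "actor" does not appear at location 8 of that document.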

Exact String Matching Algorithms

YFA August 2008 (2nd Edition), February 2008
Adapted from cs.cornell.edu