CS 430 / INFO 430 Information Retrieval
Lecture 6: Boolean Methods

Course Administration

Assignments: You are encouraged to use existing Java or C++ classes, e.g., to manage data structures. If you do, you must acknowledge them in your report. In addition, for each data structure, you should explain why it is appropriate for the use you make of it.

Porter Stemmer

A multi-step, longest-match stemmer. M. F. Porter, An algorithm for suffix stripping. (Originally published in Program, 14, no. 3, pp. 130-137, July 1980.)
http://www.tartarus.org/~martin/PorterStemmer/def.txt

Notation:
  v        vowel(s)
  c        consonant(s)
  (vc)^m   vowel(s) followed by consonant(s), repeated m times

Any word can be written: [c](vc)^m[v]
m is called the measure of the word.
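The measure can be computed mechanically from the consonant/vowel form of a word. A minimal Python sketch (a simplification: it treats a, e, i, o, u as the only vowels and ignores Porter's rule that y counts as a vowel after a consonant):

```python
def measure(word):
    """Compute Porter's measure m: the number of (vc) runs in [c](vc)^m[v].
    Simplified: treats a, e, i, o, u as vowels; ignores the y rule."""
    vowels = set("aeiou")
    # Build the c/v form of the word, e.g. "trouble" -> "ccvvccv".
    form = "".join("v" if ch in vowels else "c" for ch in word.lower())
    # m is the number of vowel-to-consonant transitions.
    m = 0
    prev = None
    for ch in form:
        if prev == "v" and ch == "c":
            m += 1
        prev = ch
    return m
```

For example, "tree" has measure 0, "trouble" measure 1, and "oaten" measure 2, matching Porter's own examples.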

Porter's Stemmer: Multi-Step Stemming Algorithm

Complex suffixes are removed bit by bit in the different steps. Thus:
  GENERALIZATIONS
  becomes GENERALIZATION (Step 1)
  becomes GENERALIZE (Step 2)
  becomes GENERAL (Step 3)
  becomes GENER (Step 4)
[In this example, note that Steps 3 and 4 appear to be unhelpful for information retrieval.]

Porter Stemmer: Step 1a

Suffix   Replacement   Examples
sses     ss            caresses -> caress
ies      i             ponies -> poni, ties -> ti
ss       ss            caress -> caress
s        (null)        cats -> cat

At each step, carry out the longest match only.
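Step 1a can be sketched as a rule table scanned longest-suffix-first, so the "longest match only" policy falls out of the rule order (a sketch of this one step, not the full algorithm):

```python
def step_1a(word):
    """Porter step 1a: strip plural suffixes using the longest match.
    Rules are listed so that a longer suffix is tried before any of
    its own suffixes (sses before ss before s, ies before s)."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

This reproduces the slide's examples: caresses -> caress, ponies -> poni, caress -> caress, cats -> cat.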

Porter Stemmer: Step 1b

Condition   Suffix   Replacement   Examples
(m > 0)     eed      ee            feed -> feed, agreed -> agree
(*v*)       ed       (null)        plastered -> plaster, bled -> bled
(*v*)       ing      (null)        motoring -> motor, sing -> sing

*v* - the stem contains a vowel

Porter Stemmer: Step 5a

Condition             Suffix   Replacement   Examples
(m > 1)               e        (null)        probate -> probat, rate -> rate
(m = 1 and not *o)    e        (null)        cease -> ceas

*o - the stem ends cvc, where the second c is not w, x or y (e.g. -wil, -hop).

Porter Stemmer: Results

Suffix stripping of a vocabulary of 10,000 words:

Number of words reduced in step 1:  3597
Number of words reduced in step 2:   766
Number of words reduced in step 3:   327
Number of words reduced in step 4:  2424
Number of words reduced in step 5:  1373
Number of words not reduced:        3650

The resulting vocabulary of stems contained 6370 distinct entries. Thus the suffix stripping process reduced the size of the vocabulary by about one third.

Exact Matching (Boolean Model)

[Diagram: a query is evaluated against the index database of documents; a mechanism determines whether each document matches the query, producing the set of hits.]

Boolean Queries

A Boolean query: two or more search terms, related by the logical operators and, or, not.

Examples:
  abacus and actor
  abacus or actor
  (abacus and actor) or (abacus and atoll)
  not actor

Boolean Diagram

[Venn diagram of sets A and B, showing the regions A and B, A or B, and not (A or B).]

Adjacent and Near Operators

abacus adj actor
  Terms abacus and actor are adjacent to each other, as in the string "abacus actor".

abacus near 4 actor
  Terms abacus and actor are near each other, as in the string "the actor has an abacus".

Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

Evaluation of Boolean Operators

Precedence of operators must be defined:
  adj, near   (high)
  and, not
  or          (low)

Example: A and B or C and B is evaluated as (A and B) or (C and B).

Evaluating a Boolean Query

Example: abacus and actor

Postings for abacus:  3, 19, 22
Postings for actor:   2, 19, 29

To evaluate the and operator, merge the two inverted lists with a logical AND operation. Document 19 is the only document that contains both terms, "abacus" and "actor".
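The AND merge over two sorted postings lists runs in a single linear pass, advancing whichever pointer holds the smaller document ID. A Python sketch:

```python
def and_merge(postings_a, postings_b):
    """Intersect two sorted postings lists (document IDs) in one pass:
    the standard merge used to evaluate a Boolean AND."""
    i = j = 0
    hits = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            hits.append(postings_a[i])   # document contains both terms
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return hits
```

With the slide's lists, and_merge([3, 19, 22], [2, 19, 29]) returns [19].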

Evaluating an Adjacency Operation

Example: abacus adj actor

Postings for abacus (document, location):  (3, 94), (19, 7), (19, 212), (22, 56)
Postings for actor (document, location):   (2, 66), (19, 213), (29, 45)

Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent to each other.
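Adjacency can be evaluated over positional postings by looking for the same document with consecutive locations. A simple quadratic Python sketch (a production system would use a linear merge, as for AND):

```python
def adj_merge(postings_a, postings_b):
    """Evaluate 'a adj b' over positional postings: lists of
    (document_id, location) pairs. Returns (doc, loc_a, loc_b)
    wherever term b immediately follows term a."""
    return [(doc_a, loc_a, loc_b)
            for doc_a, loc_a in postings_a
            for doc_b, loc_b in postings_b
            if doc_a == doc_b and loc_b == loc_a + 1]
```

On the slide's data it finds exactly one match: document 19 at locations 212 and 213.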

Query Matching: Boolean Methods

Query: (abacus or asp*) and actor

1. From the index file (word list), find the postings list for:
   "abacus"
   every word that begins "asp"
   "actor"
2. Merge these postings lists. For each document that occurs in any of the postings lists, evaluate the Boolean expression to see whether it is true or false.
3. Step 2 should be carried out in a single pass.

Use of Postings File for Query Matching

[Diagram: the index file (word list) entries 1 abacus, 2 actor, 3 aspen, 4 atoll, each pointing to its postings list of (document, location) pairs; the abacus and actor lists are those shown in the previous examples.]

Query Matching: Vector Ranking Methods

Query: abacus asp*

1. From the index file (word list), find the postings list for:
   "abacus"
   every word that begins "asp"
2. Merge these postings lists. Calculate the similarity to the query for each document that occurs in any of the postings lists.
3. Sort the similarities to obtain the results in ranked order.
4. Steps 2 and 3 should be carried out in a single pass.
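The merge-then-sort procedure above can be sketched in Python. This is an illustrative skeleton under simplifying assumptions: postings hold document IDs only, each (term, document) pair has a precomputed weight (e.g. tf*idf), and the similarity is just the sum of matching weights; the data structure names are this sketch's, not the lecture's.

```python
import heapq

def ranked_retrieval(query_terms, postings, weights, k=10):
    """Score every document that appears in any query term's postings
    list, then return the top-k (document, score) pairs.
    postings: term -> sorted list of document IDs
    weights:  (term, doc) -> term weight (assumed precomputed)"""
    scores = {}
    for term in query_terms:
        for doc in postings.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + weights.get((term, doc), 0.0)
    # Sorting (here via a heap) yields the results in ranked order.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

Note that a document matching only some query terms still receives a score, in contrast to exact Boolean matching.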

Contrast of Ranking with Matching

With matching, a document either matches a query exactly or not at all:
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)

With retrieval using similarity measures, similarities range from 0 to 1 for all documents:
• Encourages long queries, to have as many dimensions as possible
• Benefits from large numbers of index terms
• Benefits from queries with many terms, not all of which need match the document

Problems with the Boolean Model

Counter-intuitive results:

Query q = a and b and c and d and e
Document d has terms a, b, c and d, but not e.
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.

Query q = a or b or c or d or e
Document d1 has terms a, b, c, d, and e.
Document d2 has term a, but not b, c, d or e.
Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.

Problems with the Boolean Model (continued)

Boolean is all or nothing:
• The Boolean model has no way to rank documents.
• The Boolean model allows for no uncertainty in assigning index terms to documents.
• The Boolean model has no provision for adjusting the importance of query terms.

Extending the Boolean Model

Term weighting
• Give weights to terms in documents and/or queries.
• Combine standard Boolean retrieval with vector ranking of results.

Fuzzy sets
• Relax the boundaries of the sets used in Boolean retrieval.

Ranking Methods in Boolean Systems

SIRE (Syracuse Information Retrieval Experiment)

Term weights
• Add term weights to documents. Weights are calculated by the standard method of term frequency * inverse document frequency.

Ranking
• Calculate the results set by standard Boolean methods.
• Rank the results by vector distances.

Relevance Feedback in SIRE

Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded:
• The results set is created by standard Boolean retrieval.
• The user selects one document from the results set.
• Other documents in the collection are ranked by vector distance from this document.

[Relevance feedback will be covered in a later lecture.]

Boolean Model as Sets

[Diagram: an element d and a set A.] d is either in the set A or not in A.

Boolean Model as Fuzzy Sets

[Diagram: an element d and a fuzzy set A.] d is more or less in A.

Fuzzy Sets: Basic Concept

• A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
• Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)
• For a given query, calculate the similarity between the query and each document in the collection.
• This calculation is needed for every document that has a nonzero weight for any of the terms in the query.

Fuzzy Sets

Fuzzy set theory: d_A is the degree of membership of an element in set A.

intersection (and):  d_(A and B) = min(d_A, d_B)
union (or):          d_(A or B) = max(d_A, d_B)
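The two operations translate directly into code. A trivial Python sketch, where the arguments are the degrees of membership (term weights) in sets A and B:

```python
def fuzzy_and(d_a, d_b):
    """Fuzzy intersection: degree of membership in (A and B)."""
    return min(d_a, d_b)

def fuzzy_or(d_a, d_b):
    """Fuzzy union: degree of membership in (A or B)."""
    return max(d_a, d_b)
```

With 0/1 weights these reduce to ordinary Boolean AND and OR, which is why the fuzzy model is a strict generalization of the standard Boolean model.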

Fuzzy Sets: Example

                    standard set theory     fuzzy set theory
d_A                 1    1    0    0        0.5   0.5   0     0
d_B                 1    0    1    0        0.7   0     0.7   0
and  d_(A and B)    1    0    0    0        0.5   0     0     0
or   d_(A or B)     1    1    1    0        0.7   0.5   0.7   0

MMM: Mixed Min and Max Model

Terms: a1, a2, ..., an
Document d, with index-term weights: d1, d2, ..., dn

q_or = (a1 or a2 or ... or an)

Query-document similarity:
S(q_or, d) = C_or * max(d1, d2, ..., dn) + (1 - C_or) * min(d1, d2, ..., dn)
where C_or is a softness coefficient for the or operator.

With regular Boolean logic, all di = 1 or 0, and C_or = 1.

MMM: Mixed Min and Max Model

Terms: a1, a2, ..., an
Document d, with index-term weights: d1, d2, ..., dn

q_and = (a1 and a2 and ... and an)

Query-document similarity:
S(q_and, d) = C_and * min(d1, ..., dn) + (1 - C_and) * max(d1, ..., dn)
where C_and is a softness coefficient for the and operator.

With regular Boolean logic, all di = 1 or 0, and C_and = 1.
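The two MMM similarity formulas can be sketched in a few lines of Python; the coefficient names c_or and c_and are this sketch's names for the softness coefficients:

```python
def mmm_or(weights, c_or):
    """MMM similarity for an OR query: a softened max over the
    document's term weights. c_or = 1 recovers the fuzzy-set OR."""
    return c_or * max(weights) + (1 - c_or) * min(weights)

def mmm_and(weights, c_and):
    """MMM similarity for an AND query: a softened min.
    c_and = 1 recovers the fuzzy-set AND."""
    return c_and * min(weights) + (1 - c_and) * max(weights)
```

With c_and < 1, a document matching most but not all conjuncts gets a partial score rather than being rejected outright, which addresses the counter-intuitive AND example earlier in the lecture.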

MMM: Mixed Min and Max Model

Experimental values:
  all di = 1 or 0
  the and coefficient in the range [0.5, 0.8]
  the or coefficient > 0.2

Computational cost is low. Retrieval performance is much improved.

Other Models

Paice model
The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights. Computational cost is higher than MMM.

P-norm model
Document d, with term weights: d_A1, d_A2, ..., d_An.
Query terms are given weights a1, a2, ..., an.
Operators have coefficients that indicate degree of strictness.
Query-document similarity is calculated by considering each document and query as a point in n-space.

Test Data

          CISI   CACM   INSPEC
P-norm     79    106     210
Paice      77    104     206
MMM        68    109     195

Percentage improvement over standard Boolean model (average best precision). Lee and Fox, 1988.

Reading

E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models, in Frakes, Chapter 15. Methods based on fuzzy set concepts.