Information Retrieval and Web Search IR models Boolean

Information Retrieval and Web Search
IR models: Boolean model
Instructor: Rada Mihalcea
Class web page: http://www.cs.unt.edu/~rada/CSCE5300

Today’s topics • Boolean retrieval • Improvements / Variations of the boolean model – Extended boolean model – Fuzzy information retrieval Slide 1

IR Models (taxonomy figure)
• User task: Retrieval (ad hoc, filtering)
  – Classic models: Boolean, vector, probabilistic
  – Set theoretic: Fuzzy, Extended Boolean
  – Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  – Probabilistic: Inference Network, Belief Network
  – Structured models: Non-Overlapping Lists, Proximal Nodes
• User task: Browsing
  – Flat, Structure Guided, Hypertext
Slide 2

The Boolean Model
• Simple model based on set theory
• Queries are specified as Boolean expressions
  – precise semantics
  – neat formalism
  – example: $q = k_a \wedge (k_b \vee \neg k_c)$
• Terms are either present or absent, so $w_{ij} \in \{0, 1\}$
• Consider
  – $q = k_a \wedge (k_b \vee \neg k_c)$
  – $\vec{q}_{dnf} = (1, 1, 1) \vee (1, 1, 0) \vee (1, 0, 0)$
  – $\vec{q}_{cc} = (1, 1, 0)$ is a conjunctive component
• Each query can be transformed into disjunctive normal form (DNF)
Slide 3
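
The DNF above can be checked mechanically. The following is a minimal sketch, assuming the query is read as ka AND (kb OR NOT kc) over the term order (ka, kb, kc); the function names are illustrative and not part of the slides. It enumerates all binary term vectors and keeps those that satisfy the query, which are exactly the conjunctive components.

```python
from itertools import product

def query(ka, kb, kc):
    # Assumed reading of the slide's query: q = ka AND (kb OR NOT kc)
    return bool(ka and (kb or not kc))

def dnf_components(q, n_terms):
    """Return every binary term vector that satisfies q.

    Each returned vector is one conjunctive component of the DNF of q.
    """
    return [bits for bits in product((1, 0), repeat=n_terms) if q(*bits)]

components = dnf_components(query, 3)
print(components)
```

Brute-force enumeration is exponential in the number of query terms, so it is only meant to illustrate the definition on small queries.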

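The Boolean Model
• $q = k_a \wedge (k_b \vee \neg k_c)$
• Figure: Venn diagram over $K_a$, $K_b$, $K_c$; the query covers the regions (1, 0, 0), (1, 1, 0), and (1, 1, 1)
• $sim(q, d_j) = 1$ if the document satisfies the Boolean query, and 0 otherwise
  – no in-between, only 0 or 1
Slide 4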
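
The binary similarity can be sketched as a membership test against the query's DNF. This assumes each document is represented by its binary vector over the query terms (ka, kb, kc); the names `sim` and `Q_DNF` are illustrative.

```python
def sim(dnf, doc_vector):
    """Boolean similarity: 1 if the document's binary term vector equals
    one of the conjunctive components of the query's DNF, else 0."""
    return 1 if tuple(doc_vector) in dnf else 0

# DNF of q = ka AND (kb OR NOT kc) over the term order (ka, kb, kc)
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

print(sim(Q_DNF, (1, 1, 0)))  # document contains ka and kb, but not kc
print(sim(Q_DNF, (0, 1, 1)))  # document lacks ka
```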

Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
Slide 5
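
One way to work this exercise is with set operations on an inverted index. The sketch below assumes both queries are conjunctions (Q1 as information AND retrieval, Q2 as information AND NOT computer); that reading is an assumption, since the slide does not show the operator.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

# Inverted index: term -> set of ids of the documents that contain it
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

# Q1, assumed to mean: information AND retrieval
q1 = index["information"] & index["retrieval"]
# Q2, assumed to mean: information AND NOT computer
q2 = index["information"] & (all_docs - index["computer"])

print(sorted(q1), sorted(q2))
```

AND maps to set intersection and NOT to the complement with respect to the whole collection, which is why the full set of document ids is needed.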

Exercise
Documents 0 to 15 and the authors each one mentions:
  0  (none)
  1  Swift
  2  Shakespeare
  3  Shakespeare Swift
  4  Milton
  5  Milton Swift
  6  Milton Shakespeare
  7  Milton Shakespeare Swift
  8  Chaucer
  9  Chaucer Swift
 10  Chaucer Shakespeare
 11  Chaucer Shakespeare Swift
 12  Chaucer Milton
 13  Chaucer Milton Swift
 14  Chaucer Milton Shakespeare
 15  Chaucer Milton Shakespeare Swift
Query: ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Slide 6
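
The query can be evaluated by direct enumeration. This sketch assumes that document id n encodes, in binary, which of the four authors the document mentions (Chaucer = 8, Milton = 4, Shakespeare = 2, Swift = 1); that encoding is an assumption inferred from the 0 to 15 numbering.

```python
TERMS = ("chaucer", "milton", "shakespeare", "swift")  # bit weights 8, 4, 2, 1

def terms_of(doc_id):
    """Assumed encoding: each bit of the 4-bit document id marks one term."""
    return {t for k, t in enumerate(TERMS) if doc_id & (8 >> k)}

def matches(doc_terms):
    c, m = "chaucer" in doc_terms, "milton" in doc_terms
    sh, sw = "shakespeare" in doc_terms, "swift" in doc_terms
    # ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
    return ((c or m) and not sw) or ((not c) and (sw or sh))

result = [d for d in range(16) if matches(terms_of(d))]
print(result)
```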

Drawbacks of the Boolean Model • Retrieval based on binary decision criteria with no notion of partial matching • No ranking of the documents is provided (absence of a grading scale) • Information need has to be translated into a Boolean expression which most users find awkward • The Boolean queries formulated by the users are most often too simplistic • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query Slide 7

• The Boolean model imposes a binary criterion for deciding relevance • The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past • Two extensions of the Boolean model: – Fuzzy Set Model – Extended Boolean Model Slide 8

Fuzzy Set Model • Queries and docs are represented by sets of index terms: matching is approximate from the start • This vagueness can be modeled using a fuzzy framework, as follows: – a fuzzy set is associated with each term – each doc has a degree of membership in this fuzzy set • This interpretation provides the foundation for many models for IR based on fuzzy theory • Here we consider the model proposed by Ogawa, Morita, and Kobayashi (1991) Slide 9

Fuzzy Set Theory • Framework for representing classes whose boundaries are not well defined • Key idea is to introduce the notion of a degree of membership associated with the elements of a set • This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership • Thus, membership is now a gradual notion, contrary to the notion enforced by classic Boolean logic Slide 10

Fuzzy Set Theory
• Definition
  – A fuzzy subset $A$ of $U$ is characterized by a membership function $\mu_A : U \rightarrow [0, 1]$, which associates with each element $u$ of $U$ a number $\mu_A(u)$ in the interval $[0, 1]$
• Definition
  – Let $A$ and $B$ be two fuzzy subsets of $U$, and let $\neg A$ be the complement of $A$. Then:
  – $\mu_{\neg A}(u) = 1 - \mu_A(u)$
  – $\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
  – $\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
Slide 11
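
The three operations can be sketched with dictionaries that map each element of U to its degree of membership. This is an illustrative representation, assuming both sets list every element of U; the element names and membership values are made up.

```python
def complement(a):
    """Membership in NOT A is 1 minus membership in A."""
    return {u: 1.0 - m for u, m in a.items()}

def union(a, b):
    """Membership in A OR B is the maximum of the two memberships."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

def intersection(a, b):
    """Membership in A AND B is the minimum of the two memberships."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

# Made-up membership degrees for two elements of U
A = {"d1": 0.8, "d2": 0.3}
B = {"d1": 0.5, "d2": 0.6}

print(union(A, B), intersection(A, B), complement(A))
```

With memberships restricted to 0 and 1, these definitions reduce to the ordinary set operations.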

Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• This thesaurus is built as follows:
  – Let $\vec{c}$ be a term-term correlation matrix
  – Let $c_{i,l}$ be a normalized correlation factor for $(k_i, k_l)$:
    $c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$
  – $n_i$: number of docs which contain $k_i$
  – $n_l$: number of docs which contain $k_l$
  – $n_{i,l}$: number of docs which contain both $k_i$ and $k_l$
• We now have the notion of proximity among index terms.
Slide 12
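
A minimal sketch of the correlation factor, assuming each document is given as a set of terms. The four-document collection is illustrative, and returning 0.0 when neither term occurs is a choice made here, not something the slide specifies.

```python
def correlation(docs, ki, kl):
    """c(i, l) = n(i, l) / (n_i + n_l - n(i, l))."""
    n_i = sum(1 for terms in docs if ki in terms)
    n_l = sum(1 for terms in docs if kl in terms)
    n_il = sum(1 for terms in docs if ki in terms and kl in terms)
    denom = n_i + n_l - n_il
    # Assumption: define the correlation as 0.0 if neither term occurs anywhere
    return n_il / denom if denom else 0.0

# Illustrative collection; each document is a set of index terms
docs = [
    {"computer", "information", "retrieval"},
    {"computer", "retrieval"},
    {"information"},
    {"computer", "information"},
]

print(correlation(docs, "information", "retrieval"))
print(correlation(docs, "computer", "retrieval"))
```

The denominator counts the documents that contain at least one of the two terms, so the factor is 1 only when the terms always occur together.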

Fuzzy Information Retrieval
• The correlation factor $c_{i,l}$ can be used to define fuzzy set membership for a document $d_j$ as follows:
  $\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$
  – $\mu_{i,j}$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
• The above expression computes an algebraic sum over all terms in the doc $d_j$, written as the complement of a product of complements
• A doc $d_j$ belongs to the fuzzy set for $k_i$ if its own terms are associated with $k_i$
Slide 13

Fuzzy Information Retrieval
• $\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$
  – $\mu_{i,j}$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
• If doc $d_j$ contains a term $k_l$ which is closely related to $k_i$, we have
  – $c_{i,l} \sim 1$
  – $\mu_{i,j} \sim 1$
  – index $k_i$ is a good fuzzy index for doc $d_j$
Slide 14
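
The membership formula can be sketched as one minus a product of complements over the document's terms. The correlation table below holds illustrative values that are assumed rather than taken from the slides.

```python
from math import prod

def membership(ki, doc_terms, c):
    """Membership of a doc in the fuzzy set of ki:
    1 - product over kl in the doc of (1 - c[ki][kl])."""
    return 1.0 - prod(1.0 - c[ki][kl] for kl in doc_terms)

# Illustrative correlation factors for the term "retrieval" (assumed values)
c = {"retrieval": {"retrieval": 1.0, "information": 0.25, "computer": 2 / 3}}

print(membership("retrieval", {"information"}, c))
print(membership("retrieval", {"information", "computer"}, c))
print(membership("retrieval", {"retrieval", "information"}, c))
```

A single term with correlation 1 drives the product to zero, so the membership becomes 1 regardless of the other terms; this matches the observation that one closely related term is enough.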

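Fuzzy IR: An Example
• Figure: Venn diagram over $K_a$, $K_b$, $K_c$ with the conjunctive components $cc_1$, $cc_2$, $cc_3$
• $q = k_a \wedge (k_b \vee \neg k_c)$
• $\vec{q}_{dnf} = (1, 1, 1) + (1, 1, 0) + (1, 0, 0) = \vec{cc}_1 + \vec{cc}_2 + \vec{cc}_3$
• $\mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\, j}$
  $= 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}) \times (1 - \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})) \times (1 - \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j}))$
Slide 15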
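
A minimal sketch of this example, assuming the query is ka AND (kb OR NOT kc) and that each conjunctive component is scored by an algebraic product, with negated terms contributing one minus their membership. It also checks that crisp memberships (0 or 1) reproduce the Boolean answer.

```python
from itertools import product

def fuzzy_score(mu_a, mu_b, mu_c):
    """Membership of a doc in the fuzzy set of q = ka AND (kb OR NOT kc)."""
    cc1 = mu_a * mu_b * mu_c                # component (1, 1, 1)
    cc2 = mu_a * mu_b * (1 - mu_c)          # component (1, 1, 0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)    # component (1, 0, 0)
    # Algebraic sum of the three components
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

# With memberships restricted to 0 or 1, the score equals the Boolean result
crisp_ok = all(
    fuzzy_score(a, b, c) == (1 if (a, b, c) in Q_DNF else 0)
    for a, b, c in product((0, 1), repeat=3)
)
print(crisp_ok)
print(fuzzy_score(0.9, 0.4, 0.2))
```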

Fuzzy Information Retrieval • Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory • Experiments with standard test collections are not available • Difficult to compare at this time Slide 16

Extended Boolean Model • Boolean model is simple and elegant. • But, no provision for a ranking • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership • Extend the Boolean model with the notions of partial matching and term weighting • Combine characteristics of the Vector model with properties of Boolean algebra Slide 17

The Idea
• The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
• Let
  – $q = k_x \wedge k_y$
  – Use weights associated with $k_x$ and $k_y$
  – In the Boolean model, only documents with $w_x = w_y = 1$ match; all other documents are irrelevant
Slide 18

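The Idea:
• $q_{and} = k_x \wedge k_y$; $w_{xj} = x$ and $w_{yj} = y$
• Figure (AND): unit square with axes $k_x$ ($x = w_{xj}$) and $k_y$ ($y = w_{yj}$), corners (0, 0) and (1, 1), and documents $d_j$ and $d_{j+1}$ plotted inside it
• We want a document to be as close as possible to (1, 1)
• $sim(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - x)^2 + (1 - y)^2}{2}}$
Slide 19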

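The Idea:
• $q_{or} = k_x \vee k_y$; $w_{xj} = x$ and $w_{yj} = y$
• Figure (OR): unit square with axes $k_x$ ($x = w_{xj}$) and $k_y$ ($y = w_{yj}$), corners (0, 0) and (1, 1), and documents $d_j$ and $d_{j+1}$ plotted inside it
• We want a document to be as far as possible from (0, 0)
• $sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}}$
Slide 20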
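
The two-term similarities can be sketched as normalized Euclidean distances in the unit square. This assumes both formulas divide by 2 under the square root, so that scores stay in [0, 1]; the function names are illustrative.

```python
from math import sqrt

def sim_and(x, y):
    """1 minus the normalized distance from (x, y) to the ideal point (1, 1)."""
    return 1 - sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

def sim_or(x, y):
    """Normalized distance from (x, y) to the worst point (0, 0)."""
    return sqrt((x ** 2 + y ** 2) / 2)

# A document that contains kx with full weight and does not contain ky
print(sim_and(1.0, 0.0), sim_or(1.0, 0.0))
```

Unlike the plain Boolean model, a document matching only one of the two terms gets a nonzero AND score and an OR score below 1, which is what makes ranking possible.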

Generalizing the Idea
• We can extend the previous model to consider Euclidean distances in a t-dimensional space
• This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter
• A generalized disjunctive query is given by
  – $q_{or} = k_1 \vee^p k_2 \vee^p \ldots \vee^p k_m$
• A generalized conjunctive query is given by
  – $q_{and} = k_1 \wedge^p k_2 \wedge^p \ldots \wedge^p k_m$
Slide 21

Generalizing the Idea
– $sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \ldots + x_m^p}{m} \right)^{1/p}$
– $sim(q_{and}, d_j) = 1 - \left( \frac{(1 - x_1)^p + (1 - x_2)^p + \ldots + (1 - x_m)^p}{m} \right)^{1/p}$
– If $p = 1$ then (vector-like)
  • $sim(q_{or}, d_j) = sim(q_{and}, d_j) = \frac{x_1 + \ldots + x_m}{m}$
Slide 22
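
A minimal sketch of the p-norm similarities for a finite p, assuming `xs` holds the weights of the m query terms in document dj. The weights used below are made up for illustration.

```python
def sim_or(xs, p):
    """p-norm OR: ((x1^p + ... + xm^p) / m)^(1/p), for finite p >= 1."""
    m = len(xs)
    return (sum(x ** p for x in xs) / m) ** (1 / p)

def sim_and(xs, p):
    """p-norm AND: 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)."""
    m = len(xs)
    return 1 - (sum((1 - x) ** p for x in xs) / m) ** (1 / p)

weights = [0.2, 0.5, 0.8]  # made-up term weights

# With p = 1, both operators reduce to the average of the weights
print(sim_or(weights, 1), sim_and(weights, 1))
# With p = 2, the two operators rank the same document differently
print(sim_or(weights, 2), sim_and(weights, 2))
```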

Conclusions
• Model is quite powerful
• Properties are interesting and might be useful
• Computation is somewhat complex
• However, the distributive law does not hold for ranking computation:
  – $q_1 = (k_1 \vee k_2) \wedge k_3$
  – $q_2 = (k_1 \wedge k_3) \vee (k_2 \wedge k_3)$
  – $sim(q_1, d_j) \ne sim(q_2, d_j)$
Slide 23
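
The failure of distributivity can be seen numerically. The sketch below assumes the two queries are read as (k1 OR k2) AND k3 and (k1 AND k3) OR (k2 AND k3), that p = 2, and that nested queries are scored by applying the p-norm operators recursively; the term weights are made up.

```python
def p_or(xs, p=2):
    """p-norm OR over a list of weights or sub-query scores."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def p_and(xs, p=2):
    """p-norm AND over a list of weights or sub-query scores."""
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

x1, x2, x3 = 1.0, 0.0, 1.0  # made-up weights of k1, k2, k3 in one document

# q1 = (k1 OR k2) AND k3
s1 = p_and([p_or([x1, x2]), x3])
# q2 = (k1 AND k3) OR (k2 AND k3)
s2 = p_or([p_and([x1, x3]), p_and([x2, x3])])

print(round(s1, 4), round(s2, 4))
```

The two queries are equivalent in Boolean algebra, yet the document receives different scores, so rewriting a query can change the ranking.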