CS 430 / INFO 430 Information Retrieval
Lecture 3: Vector Methods 1

Course Administration

• Wednesday evenings are 7:30 to 8:30, except that the Midterm Examination may run 1.5 hours.

2. Similarity Ranking Methods

[Diagram: a query and an index database built from the documents feed a mechanism for determining the similarity of the query to the document. The output is a set of documents ranked by how similar they are to the query.]

Similarity Ranking

Methods that look for matches (e.g., Boolean) assume that a document is either relevant to a query or not relevant.

Similarity ranking methods measure the degree of similarity between a query and a document.

[Diagram: a query compared with documents. Similar: how similar is a document to a request?]

Evaluation: Precision and Recall

Precision and recall measure the results of a single query using a specific search system applied to a specific set of documents.

• Matching methods: precision and recall are single numbers.
• Ranking methods: precision and recall are functions of the rank order.

Evaluating Ranking: Recall and Precision

If information retrieval were perfect, every document relevant to the original information need would be ranked above every other document.

With ranking, precision and recall are functions of the rank order.

• Precision(n): fraction (or percentage) of the n most highly ranked documents that are relevant.
• Recall(n): fraction (or percentage) of the relevant items that are in the n most highly ranked documents.
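The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names, the ranked list, and the assumed total number of relevant documents are all hypothetical.

```python
def precision_at(relevant_flags, n):
    """Fraction of the n most highly ranked documents that are relevant."""
    return sum(relevant_flags[:n]) / n


def recall_at(relevant_flags, n, total_relevant):
    """Fraction of all relevant documents that appear in the top n."""
    return sum(relevant_flags[:n]) / total_relevant


# Hypothetical ranked list: True means the document at that rank is relevant.
ranking = [True, False, True, True, False]
total_relevant = 4  # assumed size of the relevant set for this query

for n in range(1, len(ranking) + 1):
    print(n, precision_at(ranking, n), recall_at(ranking, n, total_relevant))
```

Note that recall needs the total number of relevant documents in the collection, which is usually unknown in practice and is simply assumed here.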

Precision and Recall with Ranking

Example

"Your query found 349,871 possibly relevant documents. Here are the first eight."

Examination of the first 8 finds that 5 of them are relevant.

Graph of Precision with Ranking: P(r)

Rank r       1     2     3     4     5     6     7     8
Relevant?    Y     N     Y     Y     N     Y     N     Y
P(r)         1/1   1/2   2/3   3/4   3/5   4/6   4/7   5/8

[Graph: precision P(r), from 0 to 1, plotted against rank r, from 1 to 8.]

Term Similarity: Example

Problem: Given two text documents, how similar are they?

[Methods that measure similarity do not assume exact matches.]

Example: Here are three documents. How similar are they?

d1    ant ant bee
d2    dog bee dog hog dog ant dog
d3    cat gnu dog eel fox

Documents can be any length from one word to thousands. A query is a special type of document.

Term Similarity: Basic Concept

Two documents are similar if they contain some of the same terms.

Possible measures of similarity might take into consideration:

(a) The lengths of the documents
(b) The number of terms in common
(c) Whether the terms are common or unusual
(d) How many times each term appears

Term Vector Space

Term vector space: n-dimensional space, where n is the number of different terms used to index a set of documents.

Vector: Document i is represented by a vector. Its magnitude in dimension j is $t_{ij}$, where:

$t_{ij} > 0$ if term j occurs in document i
$t_{ij} = 0$ otherwise

$t_{ij}$ is the weight of term j in document i.

A Document Represented in a 3-Dimensional Term Vector Space

[Figure: document vector d1 drawn in a space with axes t1, t2, t3. Its components along the axes are t11, t12, t13.]

Basic Method: Incidence Matrix (No Weighting)

document    text                           terms
d1          ant ant bee                    ant bee
d2          dog bee dog hog dog ant dog    ant bee dog hog
d3          cat gnu dog eel fox            cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
d1     1    1    0    0    0    0    0    0
d2     1    1    0    1    0    0    0    1
d3     0    0    1    1    1    1    1    0

3 vectors in 8-dimensional term vector space.

Weights: $t_{ij} = 1$ if document i contains term j and zero otherwise.
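A minimal sketch of how such an incidence matrix could be built, assuming simple whitespace tokenization. The document texts are an assumed reading of the slide example, and the variable names are illustrative.

```python
# Assumed readings of the three example documents.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# The vocabulary is the sorted set of distinct terms across all documents;
# each term becomes one dimension of the term vector space.
terms = sorted({term for text in docs.values() for term in text.split()})

# t_ij = 1 if document i contains term j, and 0 otherwise.
incidence = {
    name: [1 if term in text.split() else 0 for term in terms]
    for name, text in docs.items()
}

print(terms)
for name, row in incidence.items():
    print(name, row)
```

Repeated words do not change an incidence vector, which is why d1 has only two nonzero entries.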

Basic Vector Space Methods: Similarity

The similarity between two documents is a function of the angle between their vectors in the term vector space.

Two Documents Represented in 3-Dimensional Term Vector Space

[Figure: document vectors d1 and d2 drawn in a 3-dimensional term vector space, with the angle between them marked.]

Vector Space Revision

$\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$ is a vector in an n-dimensional vector space.

The length of $\mathbf{x}$ is given by (extension of Pythagoras's theorem):

$$|\mathbf{x}|^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$$

If $\mathbf{x}_1$ and $\mathbf{x}_2$ are vectors, the inner product (or dot product) is given by:

$$\mathbf{x}_1 \cdot \mathbf{x}_2 = x_{11} x_{21} + x_{12} x_{22} + x_{13} x_{23} + \cdots + x_{1n} x_{2n}$$

Cosine of the angle $\theta$ between the vectors $\mathbf{x}_1$ and $\mathbf{x}_2$:

$$\cos(\theta) = \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{|\mathbf{x}_1| \, |\mathbf{x}_2|}$$
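The three definitions above translate directly into code. The following is a minimal sketch using plain Python sequences; the function names and the sample vectors are illustrative and not taken from the lecture.

```python
import math


def dot(x, y):
    """Inner product of two vectors of equal dimension."""
    return sum(a * b for a, b in zip(x, y))


def length(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(dot(x, x))


def cosine(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    return dot(x, y) / (length(x) * length(y))


# Illustrative vectors, chosen so the arithmetic is easy to follow.
x1 = (3, 4)
x2 = (4, 3)
print(length(x1), dot(x1, x2), cosine(x1, x2))
```

A zero vector has length 0, so `cosine` would raise `ZeroDivisionError` for it. A real system would need to decide how to treat empty documents.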

Example 1: No Weighting

      ant  bee  cat  dog  eel  fox  gnu  hog    length
d1     1    1    0    0    0    0    0    0      √2
d2     1    1    0    1    0    0    0    1      √4
d3     0    0    1    1    1    1    1    0      √5

Example 1 (continued)

Similarity of documents in example:

        d1      d2      d3
d1      1       0.71    0
d2      0.71    1       0.22
d3      0       0.22    1
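As a check on these values, the off-diagonal entries can be worked out by hand from the unweighted vectors. This assumes the term sets d1 = {ant, bee}, d2 = {ant, bee, dog, hog} and d3 = {cat, dog, eel, fox, gnu}, so the vector lengths are √2, √4 and √5:

$$\cos(d_1, d_2) = \frac{1 \cdot 1 + 1 \cdot 1}{\sqrt{2}\,\sqrt{4}} = \frac{2}{2\sqrt{2}} \approx 0.71$$

$$\cos(d_2, d_3) = \frac{1 \cdot 1}{\sqrt{4}\,\sqrt{5}} = \frac{1}{2\sqrt{5}} \approx 0.22$$

$$\cos(d_1, d_3) = \frac{0}{\sqrt{2}\,\sqrt{5}} = 0$$

d1 and d2 share two terms (ant, bee), d2 and d3 share one (dog), and d1 and d3 share none.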

Weighting Methods: tf and idf

Term frequency (tf): A term that appears several times in a document is weighted more heavily than a term that appears only once.

Inverse document frequency (idf): A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

Example 2: Weighting by Term Frequency (tf)

document    text                           terms
d1          ant ant bee                    ant bee
d2          dog bee dog hog dog ant dog    ant bee dog hog
d3          cat gnu dog eel fox            cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog    length
d1     2    1    0    0    0    0    0    0      √5
d2     1    1    0    4    0    0    0    1      √19
d3     0    0    1    1    1    1    1    0      √5

Weights: $t_{ij}$ = number of times that term j occurs in document i.

Example 2 (continued)

Similarity of documents in example:

        d1      d2      d3
d1      1       0.31    0
d2      0.31    1       0.41
d3      0       0.41    1

Similarity depends upon the weights given to the terms. [Note differences in results from Example 1.]
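The change from Example 1 can be traced by hand. This assumes the term-frequency vectors d1 = (ant 2, bee 1), d2 = (ant 1, bee 1, dog 4, hog 1) and d3 = (cat, dog, eel, fox, gnu each 1), with lengths √5, √19 and √5:

$$\cos(d_1, d_2) = \frac{2 \cdot 1 + 1 \cdot 1}{\sqrt{5}\,\sqrt{19}} = \frac{3}{\sqrt{95}} \approx 0.31$$

$$\cos(d_2, d_3) = \frac{4 \cdot 1}{\sqrt{19}\,\sqrt{5}} = \frac{4}{\sqrt{95}} \approx 0.41$$

The weight of 4 on dog in d2 increases its overlap with d3. It also lengthens the d2 vector, which lowers its similarity to d1. As a result, d2 is now closer to d3 than to d1, the reverse of Example 1.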

Summary: Vector Similarity Computation with Weights

Documents in a collection are assigned terms from a set of n terms.

The term vector space W is defined as:

• if term k does not occur in document $d_i$, $w_{ik} = 0$
• if term k occurs in document $d_i$, $w_{ik}$ is greater than zero ($w_{ik}$ is called the weight of term k in document $d_i$)

Similarity between $d_i$ and $d_j$ is defined as:

$$\cos(\mathbf{d}_i, \mathbf{d}_j) = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{|\mathbf{d}_i| \, |\mathbf{d}_j|}$$

where $\mathbf{d}_i$ and $\mathbf{d}_j$ are the corresponding weighted term vectors.

Approaches to Weighting

Boolean information retrieval. Weight of term k in document $d_i$:

$w(i, k) = 1$ if term k occurs in document $d_i$
$w(i, k) = 0$ otherwise

General weighting methods. Weight of term k in document $d_i$:

$0 < w(i, k) \le 1$ if term k occurs in document $d_i$
$w(i, k) = 0$ otherwise

(The choice of weights for ranking is the topic of Lecture 4.)

Simple Uses of Vector Similarity in Information Retrieval

Threshold: For query q, retrieve all documents with similarity above a threshold, e.g., similarity > 0.50.

Ranking: For query q, return the n most similar documents ranked in order of similarity. [This is the standard practice.]
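Both uses are simple operations once each document has a similarity score for the query. The sketch below assumes the scores have already been computed. The function names are hypothetical and the score values are illustrative.

```python
def retrieve_above(scores, threshold):
    """Threshold: every document whose similarity exceeds the threshold."""
    return {doc for doc, score in scores.items() if score > threshold}


def top_n(scores, n):
    """Ranking: the n most similar documents, most similar first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Illustrative similarities of three documents to some query q.
scores = {"d1": 0.63, "d2": 0.81, "d3": 0.32}

print(retrieve_above(scores, 0.50))
print(top_n(scores, 2))
```

The threshold version returns an unordered set whose size depends on the query, while the ranking version always returns at most n documents in order.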

Simple Example of Ranking (Weighting by Term Frequency)

            text                           terms
query q     ant dog                        ant dog
d1          ant ant bee                    ant bee
d2          dog bee dog hog dog ant dog    ant bee dog hog
d3          cat gnu dog eel fox            cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog    length
q      1    0    0    1    0    0    0    0      √2
d1     2    1    0    0    0    0    0    0      √5
d2     1    1    0    4    0    0    0    1      √19
d3     0    0    1    1    1    1    1    0      √5

Calculate Ranking

Similarity of query to documents in example:

        d1       d2       d3
q       2/√10    5/√38    1/√10
        0.63     0.81     0.32

If the query q is searched against this document set, the ranked results are: d2, d1, d3
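The whole calculation, from raw text to ranked results, can be sketched as follows. It treats the query as a short document and weights by term frequency. The document texts are an assumed reading of the slide example, and the helper names (`tf_vector`, `cosine`, `rank`) are hypothetical.

```python
import math
from collections import Counter

# Assumed readings of the three example documents and the query.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}
query = "ant dog"


def tf_vector(text):
    """Sparse term-frequency vector: term -> number of occurrences."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    shared = a.keys() & b.keys()  # only shared terms contribute to the dot product
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def rank(query_text, documents):
    """Return (name, similarity) pairs, most similar first."""
    q = tf_vector(query_text)
    scores = {name: cosine(q, tf_vector(text)) for name, text in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for name, score in rank(query, docs):
    print(name, round(score, 2))
```

Storing vectors as sparse term-to-count mappings avoids building the full n-dimensional vector for every document, which matters when the vocabulary is large.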

Contrast of Ranking with Matching

With matching, a document either matches a query exactly or not at all:

• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)

With retrieval using similarity measures, similarities range from 0 to 1 for all documents:

• Encourages long queries, to have as many dimensions as possible
• Benefits from large numbers of index terms
• Benefits from queries with many terms, not all of which need match the document

Document Vectors as Points on a Surface

• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface
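One practical consequence of normalizing is that, for unit-length vectors, the cosine of the angle reduces to a plain dot product. The sketch below illustrates this with term-frequency vectors in the term order ant, bee, cat, dog, eel, fox, gnu, hog. The vectors are an assumed reading of the earlier ranking example, and `normalize` is a hypothetical helper.

```python
import math


def normalize(v):
    """Scale a non-zero vector so that its length is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Term order: ant, bee, cat, dog, eel, fox, gnu, hog
q = normalize([1, 0, 0, 1, 0, 0, 0, 0])   # query "ant dog"
d2 = normalize([1, 1, 0, 4, 0, 0, 0, 1])  # document d2, weighted by tf

# For unit vectors the denominator |q| |d2| is 1, so the dot product
# alone gives the cosine similarity.
similarity = sum(a * b for a, b in zip(q, d2))
print(round(similarity, 2))
```

Normalizing once, when documents are indexed, means the vector lengths do not have to be recomputed for every query.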

Results of a Search

[Figure: hits from the search shown as points on the surface. x = documents found by search query.]

Relevance Feedback (Concept)

[Figure: hits from the original search shown as points on the surface, with the original query and the reformulated query marked. x = documents identified as non-relevant; o = documents identified as relevant.]

Document Clustering (Concept)

[Figure: documents, marked x, grouped into clusters on the surface.]

Document clusters are a form of automatic classification. A document may be in several clusters.