CS 430 INFO 430 Information Retrieval Lecture 9

CS 430 / INFO 430 Information Retrieval, Lecture 9: Latent Semantic Indexing

Course Administration

Latent Semantic Indexing

Objective: Replace indexes that use sets of index terms with indexes that use concepts.

Approach: Map the term vector space into a lower-dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.

Deficiencies with Conventional Automatic Indexing

  • Synonymy: Various words and phrases refer to the same concept (lowers recall).
  • Polysemy: Individual words have more than one meaning (lowers precision).
  • Independence: No significance is given to two terms that frequently appear together.

Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).

Example

Query: "IDF in computer-based information look-up"

Index terms for a document: access, document, retrieval, indexing

How can we recognize that information look-up is related to retrieval and indexing? Conversely, if information has many different contexts in the set of documents, how can we discover that it is an unhelpful term for retrieval?

Technical Memo Example: Titles

  c1  Human machine interface for Lab ABC computer applications
  c2  A survey of user opinion of computer system response time
  c3  The EPS user interface management system
  c4  System and human system engineering testing of EPS
  c5  Relation of user-perceived response time to error measurement

  m1  The generation of random, binary, unordered trees
  m2  The intersection graph of paths in trees
  m3  Graph minors IV: Widths of trees and well-quasi-ordering
  m4  Graph minors: A survey

Technical Memo Example: Terms and Documents

Each cell is the number of times the term occurs in the memo title.

  Terms       Documents
              c1  c2  c3  c4  c5  m1  m2  m3  m4
  human        1   0   0   1   0   0   0   0   0
  interface    1   0   1   0   0   0   0   0   0
  computer     1   1   0   0   0   0   0   0   0
  user         0   1   1   0   1   0   0   0   0
  system       0   1   1   2   0   0   0   0   0
  response     0   1   0   0   1   0   0   0   0
  time         0   1   0   0   1   0   0   0   0
  EPS          0   0   1   1   0   0   0   0   0
  survey       0   1   0   0   0   0   0   0   1
  trees        0   0   0   0   0   1   1   1   0
  graph        0   0   0   0   0   0   1   1   1
  minors       0   0   0   0   0   0   0   1   1
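As a rough illustration of how such a term-document matrix could be built, the sketch below counts each index term in each memo title. The tokenizer (lowercase, alphabetic runs only) and the function names `tokenize` and `term_document_matrix` are assumptions made for this example, not part of the lecture.

```python
import re

# Memo titles from the technical memo example (c1..c5, m1..m4).
titles = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

# Index terms, lowercased so they match the tokenizer output.
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]


def tokenize(text):
    """Assumed tokenizer: lowercase, then keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(titles, terms):
    """Return a list of rows (one per term) of per-document counts."""
    docs = list(titles)
    tokens = {d: tokenize(titles[d]) for d in docs}
    return [[tokens[d].count(t) for d in docs] for t in terms]


X = term_document_matrix(titles, terms)
for term, row in zip(terms, X):
    print(f"{term:10s}", row)
```

Splitting on non-letters means "user-perceived" contributes a count for "user", which is the behaviour the c5 column needs.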

Technical Memo Example: Query

Query: Find documents relevant to "human computer interaction"

Simple term matching:

  • Matches c1, c2, and c4
  • Misses c3 and c5
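Simple term matching can be sketched as a set intersection between the query terms and each document's index terms. The `doc_terms` sets below are read off the memo titles; treating all five c memos as the relevant set is an assumption, chosen because it is what the "misses" line implies.

```python
# Index terms present in each memo title (lowercased).
doc_terms = {
    "c1": {"human", "interface", "computer"},
    "c2": {"computer", "user", "system", "response", "time", "survey"},
    "c3": {"interface", "user", "system", "eps"},
    "c4": {"human", "system", "eps"},
    "c5": {"user", "response", "time"},
    "m1": {"trees"},
    "m2": {"trees", "graph"},
    "m3": {"trees", "graph", "minors"},
    "m4": {"graph", "minors", "survey"},
}

query = {"human", "computer", "interaction"}

# A document matches if it shares at least one term with the query.
matches = [d for d, ts in doc_terms.items() if ts & query]

# Assumption: the c memos form the relevant set for this query.
relevant = ["c1", "c2", "c3", "c4", "c5"]
misses = [d for d in relevant if d not in matches]

print("matches:", matches)
print("misses:", misses)
```

Note that "interaction" is not an index term at all, so it contributes nothing to the match.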

Models of Semantic Similarity

Proximity models: Put similar items together in some space or structure.

  • Clustering (hierarchical, partition, overlapping). Documents are considered close to the extent that they contain the same terms. Most methods then arrange the documents into a hierarchy based on distances between documents. [Covered later in course.]
  • Factor analysis based on a matrix of similarities between documents (single mode).
  • Two-mode proximity methods. Start with a rectangular matrix and construct explicit representations of both row and column objects.

Selection of Two-mode Factor Analysis

Additional criterion: Computationally efficient, $O(N^2 k^3)$, where $N$ is the number of terms plus documents and $k$ is the number of dimensions.

The term vector space

The space has as many dimensions as there are terms in the word list.

[Figure: document vectors d1 and d2 plotted against term axes (t1, t3).]

Figure 1: Latent concept vector space. [Legend: term, document, query; dashed lines mark cosine > 0.9.]

Mathematical concepts

Define $X$ as the term-document matrix, with $t$ rows (number of index terms) and $d$ columns (number of documents).

Singular Value Decomposition: For any matrix $X$, with $t$ rows and $d$ columns, there exist matrices $T_0$, $S_0$ and $D_0'$, such that:

$$X = T_0 S_0 D_0'$$

  • $T_0$ and $D_0$ are the matrices of left and right singular vectors.
  • $T_0$ and $D_0$ have orthonormal columns.
  • $S_0$ is the diagonal matrix of singular values.
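A minimal sketch of this decomposition, assuming NumPy and using the term-document counts from the technical memo example. `numpy.linalg.svd` returns the singular values as a vector and the right singular vectors already transposed, so the names `T0`, `S0` and `D0` below are just this example's mapping onto the lecture's notation.

```python
import numpy as np

# Term-document counts for the memo example (rows: terms, columns: c1..c5, m1..m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Thin SVD: T0 is t x m', s0 has m' values, D0t is m' x d, with m' = min(t, d).
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)   # diagonal matrix of singular values
D0 = D0t.T         # columns of D0 are the right singular vectors

print("X == T0 S0 D0':", np.allclose(T0 @ S0 @ D0.T, X))
print("T0 has orthonormal columns:", np.allclose(T0.T @ T0, np.eye(T0.shape[1])))
print("D0 has orthonormal columns:", np.allclose(D0.T @ D0, np.eye(D0.shape[1])))
print("singular values:", np.round(s0, 2))
```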

Dimensions of matrices

$$\underset{t \times d}{X} = \underset{t \times m}{T_0}\; \underset{m \times m}{S_0}\; \underset{m \times d}{D_0'}$$

$m$ is the rank of $X$, with $m \le \min(t, d)$.

Reduced Rank

$S_0$ can be chosen so that the diagonal elements are positive and decreasing in magnitude. Keep the first $k$ and set the others to zero. Delete the zero rows and columns of $S_0$ and the corresponding columns of $T_0$ and $D_0$. This gives:

$$X \approx \hat{X} = T S D'$$

Interpretation: If the value of $k$ is selected well, the expectation is that $\hat{X}$ retains the semantic information from $X$, but eliminates noise from synonymy and recognizes dependence.

Selection of singular values

$$\underset{t \times d}{\hat{X}} = \underset{t \times k}{T}\; \underset{k \times k}{S}\; \underset{k \times d}{D'}$$

$k$ is the number of singular values chosen to represent the concepts in the set of documents. Usually, $k \ll m$.
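The rank-$k$ truncation can be sketched as follows, again assuming NumPy and the memo example's counts. The helper name `truncated_svd` and the choice $k = 2$ are assumptions made for illustration; the lecture does not prescribe a value of $k$ here.

```python
import numpy as np

# Term-document counts for the memo example (rows: terms, columns: c1..c5, m1..m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)


def truncated_svd(X, k):
    """Keep the k largest singular values and the matching singular vectors."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    T = T0[:, :k]           # t x k
    S = np.diag(s0[:k])     # k x k
    D = D0t[:k, :].T        # d x k, so D' is k x d
    return T, S, D


k = 2  # illustrative choice
T, S, D = truncated_svd(X, k)
X_hat = T @ S @ D.T

print("shapes:", T.shape, S.shape, D.T.shape)
print("rank of X_hat:", np.linalg.matrix_rank(X_hat))
print("approximation error (Frobenius):", np.linalg.norm(X - X_hat))
```

The Frobenius error equals the square root of the sum of the squared discarded singular values, which gives a quick way to judge how much of $X$ a given $k$ leaves out.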

Comparing a Term and a Document

An individual cell of $\hat{X}$ approximates the number of occurrences of term $i$ in document $j$.

$$\hat{X} = T S D' = T S^{1/2} (D S^{1/2})'$$

where $S^{1/2}$ is a diagonal matrix whose values are the square roots of the corresponding elements of $S$.

Calculating Similarities in the Concept Space

Objective: Calculate similarities between terms, documents, and queries, using the matrices $T$, $S$, and $D$.

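Mathematical Revision

$A$ is a $p \times q$ matrix and $B$ is an $r \times q$ matrix. $a_i$ is the vector represented by row $i$ of $A$; $b_j$ is the vector represented by row $j$ of $B$.

The inner product $a_i \cdot b_j$ is element $(i, j)$ of $AB'$.

[Figure: the $i$th row of $A$ ($p \times q$) multiplied against $B'$ ($q \times r$), whose $j$th column is the $j$th row of $B$.]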
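A small numerical check of this identity, assuming NumPy. The matrices `A` and `B` hold arbitrary illustrative values with $p = 2$, $r = 3$, $q = 4$.

```python
import numpy as np

# Arbitrary illustrative matrices: A is p x q (2 x 4), B is r x q (3 x 4).
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 2.0]])
B = np.array([[2.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 3.0]])

P = A @ B.T  # p x r matrix of all pairwise row inner products

# Compare every cell of AB' with the explicit inner product of the two rows.
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        assert np.isclose(P[i, j], np.dot(A[i], B[j]))

print(P)
```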

Comparing Two Terms

The dot product of two rows of $\hat{X}$ reflects the extent to which two terms have a similar pattern of occurrences.

$$\hat{X}\hat{X}' = T S D' (T S D')' = T S D' D S' T' = T S S' T' = T S (T S)'$$

The third step holds since $D$ has orthonormal columns ($D'D = I$).

To calculate the $i, j$ cell, take the dot product between the $i$ and $j$ rows of $TS$. Since $S$ is diagonal, $TS$ differs from $T$ only by stretching the coordinate system.
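The term-term identity can be checked numerically. This sketch assumes NumPy and uses a small toy matrix with made-up counts (4 terms, 3 documents) and $k = 2$, purely to keep the example short.

```python
import numpy as np

# Toy term-document counts (4 terms x 3 documents); illustrative values only.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [2, 0, 1]], dtype=float)
k = 2

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

TS = T @ S              # one row per term, k coordinates each
term_sims = TS @ TS.T   # cell (i, j) is the dot product of rows i and j of TS

print("TS(TS)' equals X_hat X_hat':", np.allclose(term_sims, X_hat @ X_hat.T))
print(np.round(term_sims, 3))
```

Working with the $t \times k$ matrix `TS` avoids forming the full $t \times d$ matrix $\hat{X}$ when only term-term comparisons are needed.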

Comparing Two Documents

The dot product of two columns of $\hat{X}$ reflects the extent to which two documents have a similar pattern of term occurrences.

$$\hat{X}'\hat{X} = (T S D')' T S D' = D S' T' T S D' = D S (D S)'$$

The last step holds since $T$ has orthonormal columns ($T'T = I$).

To calculate the $i, j$ cell, take the dot product between the $i$ and $j$ rows of $DS$. Since $S$ is diagonal, $DS$ differs from $D$ only by stretching the coordinate system.
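The document-document case can be checked the same way. As before, this is a sketch assuming NumPy, with a toy matrix of made-up counts and $k = 2$.

```python
import numpy as np

# Toy term-document counts (4 terms x 3 documents); illustrative values only.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [2, 0, 1]], dtype=float)
k = 2

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

DS = D @ S             # one row per document, k coordinates each
doc_sims = DS @ DS.T   # cell (i, j) is the dot product of rows i and j of DS

print("DS(DS)' equals X_hat' X_hat:", np.allclose(doc_sims, X_hat.T @ X_hat))
print(np.round(doc_sims, 3))
```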

Comparing a Query and a Document

A query can be expressed as a vector $x_q$ in the term-document vector space: $x_{qi} = 1$ if term $i$ is in the query and 0 otherwise. (Ignore query terms that are not in the term vector space.)

Let $p_{qj}$ be the inner product of the query $x_q$ with document $d_j$ in the term-document vector space. $p_{qj}$ is the $j$th element in the product $x_q'\hat{X}$.

Comparing a Query and a Document

$$[p_{q1} \ldots p_{qj} \ldots p_{qd}] = [x_{q1}\; x_{q2} \ldots x_{qt}]\, \hat{X}$$

$p_{qj}$ is the inner product of query $q$ with document $d_j$, where document $d_j$ is column $j$ of $\hat{X}$.

$$p_q' = x_q'\hat{X} = x_q' T S D' = x_q' T (D S)'$$

$$\text{similarity}(q, d_j) = \frac{p_{qj}}{|x_q|\,|d_j|}$$

The cosine of the angle is the inner product divided by the lengths of the vectors.
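To make the similarity formula concrete, the sketch below scores the earlier query "human computer interaction" against every memo, once in the original term space and once against the columns of $\hat{X}$. It assumes NumPy and $k = 2$; the helper names `reduced_matrix`, `query_vector` and `cosine_scores` are hypothetical.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-document counts for the memo example (rows follow `terms`, columns follow `docs`).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)


def reduced_matrix(X, k):
    """Rank-k approximation X_hat = T S D' from the truncated SVD."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    return T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]


def query_vector(query_terms, terms):
    """x_qi = 1 if term i is in the query, else 0; unknown terms are ignored."""
    return np.array([1.0 if t in query_terms else 0.0 for t in terms])


def cosine_scores(x_q, M):
    """p_qj / (|x_q| |d_j|) for every column d_j of M."""
    p = x_q @ M
    return p / (np.linalg.norm(x_q) * np.linalg.norm(M, axis=0))


x_q = query_vector({"human", "computer", "interaction"}, terms)
raw = cosine_scores(x_q, X)
lsi = cosine_scores(x_q, reduced_matrix(X, k=2))

for name, r, s in zip(docs, raw, lsi):
    print(f"{name}: term space {r:.2f}, concept space {s:.2f}")
```

In the term space, c3 and c5 score exactly zero because they share no term with the query; against $\hat{X}$ they receive positive scores, which is the behaviour the reduced representation is meant to provide.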

Comparing a Query and a Document

In the reading, the authors treat the query as a pseudo-document $d_q$ in the concept space:

$$d_q = x_q' T S^{-1}$$

[Note that $S^{-1}$ stretches the vector.]

To compare a query against document $j$, they extend the method used to compare document $i$ with document $j$. Take the $j$th element of the product of $d_q S$ and $(DS)'$. This is the $j$th element of the product $x_q' T (DS)'$, which is the same expression as before.

Note that with their notation $d_q$ is a row vector.
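The equivalence of the two formulations can be verified numerically. This sketch assumes NumPy, a toy matrix with made-up counts, $k = 2$, and an arbitrary query vector `x_q`.

```python
import numpy as np

# Toy term-document counts (4 terms x 3 documents); illustrative values only.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [2, 0, 1]], dtype=float)
k = 2

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

# Arbitrary query containing the first and last terms.
x_q = np.array([1.0, 0.0, 0.0, 1.0])

# Pseudo-document: d_q = x_q' T S^{-1}, a row vector with k elements.
d_q = x_q @ T @ np.linalg.inv(S)

# Product of d_q S and (DS)': one inner product per document.
via_pseudo_doc = (d_q @ S) @ (D @ S).T

# Direct form: x_q' X_hat.
direct = x_q @ X_hat

print("d_q:", np.round(d_q, 3))
print("same scores:", np.allclose(via_pseudo_doc, direct))
```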

Technical Memo Example: Query Terms

Query: "human system interactions on trees"

  Term        x_q
  human        1
  interface    0
  computer     0
  user         0
  system       1
  response     0
  time         0
  EPS          0
  survey       0
  trees        1
  graph        0
  minors       0

In term-document space, a query is represented by $x_q$, a column vector with $t$ elements.

In concept space, a query is represented by $d_q$, a row vector with $k$ elements.
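For this query, the two representations can be computed as in the sketch below. It assumes NumPy, the memo example's term-document counts, and $k = 2$ (an illustrative choice); query words that are not index terms ("interactions", "on") are simply ignored.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Term-document counts for the memo example (rows follow `terms`, columns c1..c5, m1..m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

query_words = "human system interactions on trees".split()

# Term-document space: x_q has t elements, 1 where the term is in the query.
x_q = np.array([1.0 if t in query_words else 0.0 for t in terms])

# Concept space: d_q = x_q' T S^{-1} has k elements.
k = 2  # illustrative choice
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S = T0[:, :k], np.diag(s0[:k])
d_q = x_q @ T @ np.linalg.inv(S)

print("x_q:", x_q.astype(int).tolist())
print("d_q:", np.round(d_q, 3))
```

The signs of the entries of `d_q` depend on the sign convention the SVD routine picks for each singular vector, so only comparisons made within the same decomposition are meaningful.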

Experimental Results

Deerwester et al. tried latent semantic indexing on two test collections, MED and CISI, where queries and relevance judgments were available.

  • Documents were full text of title and abstract.
  • Stop list of 439 words (SMART); no stemming, etc.
  • Comparison with: (a) simple term matching, (b) SMART, (c) Voorhees method.

Experimental Results: 100 Factors

Experimental Results: Number of Factors