CS 430 / INFO 430 Information Retrieval
Lecture 8: Query Refinement and Relevance Feedback

Course Administration

Assignment Reports: A sample report will be posted before the next assignment is due.

Preparation for Discussion Classes: Most of the readings were used last year. You can see the questions that were used on last year's Web site: http://www.cs.cornell.edu/Courses/cs430/2004fa/

CS 430 / INFO 430 Information Retrieval: Completion of Lecture 7

Search for Substring

In some information retrieval applications, any substring can be a search term. Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.

Tries: Search for Substring

Basic concept: The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text and continues to the right until it is unique.

• The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.
• Each sistring can be associated with a location within a document where the sistring occurs.
• Subtrees below a certain node represent all occurrences of the substring represented by that node.
• Suffix trees have a size of the same order of magnitude as the input documents.

Tries: Suffix Tree

Example: suffix tree for the words beginning, between, bread, break.

[Figure: a tree sharing the common prefix b, which branches to e — splitting into gin (then null / ning for beginning) and tween (between) — and to rea, splitting into d (bread) and k (break).]
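The sharing of common prefixes can be seen in a small sketch. This is a plain dict-of-dicts trie, not the path-compressed suffix tree of the figure; the "$" end-of-word marker is an assumption of this sketch:

```python
def build_trie(words):
    """Minimal dict-of-dicts trie; '$' marks the end of a word (a sketch)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

trie = build_trie(["beginning", "between", "bread", "break"])
# All four words share the single root edge 'b', which then splits
# into 'e' (beginning, between) and 'r' (bread, break).
print(list(trie.keys()))          # ['b']
print(sorted(trie["b"].keys()))   # ['e', 'r']
```

A path-compressed version (as in the figure) would merge each chain of single-child nodes into one edge labeled with the whole substring.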

Tries: Sistrings

A binary example.

String: 01100100010111

Sistrings:
  1  01100100010111
  2  1100100010111
  3  100100010111
  4  00100010111
  5  0100010111
  6  100010111
  7  00010111
  8  0010111

Tries: Lexical Ordering

The sistrings in lexicographic order:
  7  00010111
  4  00100010111
  8  0010111
  5  0100010111
  1  01100100010111
  6  100010111
  3  100100010111
  2  1100100010111

(In the original slide, the unique prefix of each sistring is indicated in blue.)
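This ordering can be reproduced with a short sketch, assuming the example string is 01100100010111 and sistrings are 1-indexed suffixes:

```python
# Generate the first eight sistrings of the slide's binary example
# and sort them lexicographically.
text = "01100100010111"

# Sistring i starts at position i (1-indexed) and runs to the end of the text.
sistrings = {i: text[i - 1:] for i in range(1, 9)}

order = sorted(sistrings, key=sistrings.get)
print(order)  # [7, 4, 8, 5, 1, 6, 3, 2]
```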

Trie: Basic Concept

[Figure: binary trie built from the eight sistrings. Each internal node tests the next bit (0 = left branch, 1 = right branch); the leaves hold the sistring numbers 1 to 8.]

Patricia Tree

[Figure: the same trie with single-descendant nodes eliminated; the leaves hold the sistring numbers 1 to 8.]

• Single-descendant nodes are eliminated.
• Each internal node is labeled with the bit number at which the branch occurs.

CS 430 / INFO 430 Information Retrieval
Lecture 8: Query Refinement and Relevance Feedback

Query Refinement

[Flow diagram: the user formulates a new query → Search → Display retrieved information → either EXIT or Reformulate query; the reformulated query feeds back into Search, and the loop repeats.]

Reformulation of Query

Manual:
• Add or remove search terms
• Change Boolean operators
• Change wild cards

Automatic:
• Remove search terms
• Change weighting of search terms
• Add new search terms

Manual Reformulation: Vocabulary Tools

Feedback:
• Information about stop lists, stemming, etc.
• Numbers of hits on each term or phrase

Suggestions:
• Thesaurus
• Browse lists of terms in the inverted index
• Controlled vocabulary

Manual Reformulation: Document Tools

Feedback to the user consists of document excerpts or surrogates:
• Shows the user how the system has interpreted the query.

Effective at suggesting how to restrict a search:
• Shows examples of false hits.

Less good at suggesting how to expand a search:
• No examples of missed items.

Relevance Feedback: Document Vectors as Points on a Surface

• Normalize all document vectors to be of length 1.
• Then the ends of the vectors all lie on a surface with unit radius.
• For similar documents, we can represent parts of this surface as a flat region.
• Similar documents are represented as points that are close together on this surface.
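As a quick sketch of why unit-length vectors are convenient: after normalization, cosine similarity reduces to a plain dot product. The two example vectors are illustrative:

```python
import math

def normalize(v):
    """Scale a vector to unit length so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

d1 = normalize([3.0, 4.0])
d2 = normalize([6.0, 8.0])

# For unit vectors, cosine similarity is just the dot product;
# vectors pointing in the same direction give similarity 1.0.
sim = sum(a * b for a, b in zip(d1, d2))
print(round(sim, 6))  # 1.0
```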

Results of a Search

[Figure: document vectors shown as points (x); the points inside a circle around the query are the hits from the search — the documents found by the search query.]

Relevance Feedback (Concept)

[Figure: among the hits from the original search, documents identified by the user as non-relevant (x) and as relevant (o). The reformulated query moves away from the original query, toward the relevant documents.]

Theoretically Best Query

[Figure: the optimal query sits so that its neighborhood contains the relevant documents (o) and excludes the non-relevant documents (x).]

Theoretically Best Query

For a specific query q, let:
  DR            be the set of all relevant documents
  DN-R          be the set of all non-relevant documents
  sim(q, DR)    be the mean similarity between query q and documents in DR
  sim(q, DN-R)  be the mean similarity between query q and documents in DN-R

The theoretically best query would maximize:

  F = sim(q, DR) - sim(q, DN-R)

Estimating the Best Query

In practice, DR and DN-R are not known. (The objective is to find them.) However, the results of an initial query can be used to estimate sim(q, DR) and sim(q, DN-R).

Rocchio's Modified Query

Modified query vector =
    Original query vector
  + Mean of relevant documents found by original query
  - Mean of non-relevant documents found by original query

Query Modification

  q1 = q0 + (1/n1) Σ(i=1..n1) ri - (1/n2) Σ(i=1..n2) si

where:
  q0 = vector for the initial query
  q1 = vector for the modified query
  ri = vector for relevant document i
  si = vector for non-relevant document i
  n1 = number of relevant documents
  n2 = number of non-relevant documents

(Rocchio 1971)
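Rocchio's rule can be sketched directly from this formula. The query and judged document vectors below are illustrative, not from the slides:

```python
# A sketch of Rocchio's modified query in its 1971 form (unit weights).
# Vectors are plain lists; relevant/nonrelevant are lists of vectors.
def rocchio(q0, relevant, nonrelevant):
    n1, n2 = len(relevant), len(nonrelevant)
    q1 = []
    for t in range(len(q0)):
        mean_r = sum(d[t] for d in relevant) / n1 if n1 else 0.0
        mean_s = sum(d[t] for d in nonrelevant) / n2 if n2 else 0.0
        q1.append(q0[t] + mean_r - mean_s)  # q0 + mean(r) - mean(s)
    return q1

q0 = [1.0, 0.0, 0.0]                           # initial query
relevant = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical judged documents
nonrelevant = [[0.0, 0.0, 1.0]]
print(rocchio(q0, relevant, nonrelevant))  # [1.0, 1.0, -0.5]
```

Note how the weight of the third term goes negative: it appears in the non-relevant document more strongly than in the mean of the relevant ones.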

Difficulties with Relevance Feedback

[Figure: the hits from the initial query are contained in a gray shaded area covering only part of the document space. The reformulated query moves within that area, but the optimal query and many relevant documents (o) lie outside it. x = non-relevant documents, o = relevant documents.]

Difficulties with Relevance Feedback

[Figure: relevant (o) and non-relevant (x) documents intermixed around the original and reformulated queries; no boundary cleanly separates them.]

What region provides the optimal results set?

Effectiveness of Relevance Feedback

Best when:
• Relevant documents are tightly clustered (similarities are large)
• Similarities between relevant and non-relevant documents are small

When to Use Relevance Feedback

Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents. Under these circumstances, users can be expected to put effort into searching:
• Formulate queries thoughtfully, with many terms
• Review results carefully to provide feedback
• Iterate several times
• Combine automatic query enhancement with studies of thesauruses and other manual enhancements

Adjusting Parameters 1: Relevance Feedback

  q1 = α q0 + β (1/n1) Σ(i=1..n1) ri - γ (1/n2) Σ(i=1..n2) si

α, β and γ are weights that adjust the importance of the three vectors.

If γ = 0, the weights provide positive feedback, by emphasizing the relevant documents in the initial set.

If β = 0, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.

Adjusting Parameters 2: Filtering Incoming Messages

D1, D2, D3, ... is a stream of incoming documents that are to be divided into two sets:
  R - documents judged relevant to an information need
  S - documents judged not relevant to the information need

A query is defined as a vector in the term vector space:
  q = (w1, w2, ..., wn)
where wi is the weight given to term i.

Dj will be assigned to R if similarity(q, Dj) exceeds a threshold.

What is the optimal query, i.e., what are the optimal values of the wi?
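A minimal sketch of this filtering step, assuming cosine similarity and an illustrative threshold of 0.5 (the slide leaves the threshold value unspecified):

```python
import math

def cosine(q, d):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0

def route(q, stream, threshold=0.5):
    """Split a document stream into relevant (R) and not relevant (S)."""
    R, S = [], []
    for d in stream:
        (R if cosine(q, d) > threshold else S).append(d)
    return R, S

q = [1.0, 1.0, 0.0]
stream = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical incoming documents
R, S = route(q, stream)
print(len(R), len(S))  # 1 1
```

Finding the optimal wi (and the threshold) is exactly the parameter-learning problem the next slides turn to.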

Seeking Optimal Parameters

Theoretical approach:
• Develop a theoretical model
• Derive parameters
• Test with users

Heuristic approach:
• Develop a heuristic
• Vary parameters
• Test with users

Machine learning approach

Seeking Optimal Parameters using Machine Learning

GENERAL:
Input:
• training examples
• design space
Training:
• automatically find the solution in the design space that works well on the training data
Prediction:
• predict well on new examples

EXAMPLE: Text Retrieval
Input:
• queries with relevance judgments
• parameters of the retrieval function
Training:
• find parameters so that many relevant documents are ranked highly
Prediction:
• rank relevant documents high for new queries as well

Joachims

Machine Learning: Tasks and Applications

Task                   Application
Text Routing           Help-Desk Support: Who is an appropriate expert for a particular problem?
Information Filtering  Information Agents: Which news articles are interesting to a particular person?
Relevance Feedback     Information Retrieval: Which other documents are relevant for a particular query?
Text Categorization    Knowledge Management: Organizing a document database by semantic categories.

Learning to Rank

Assume:
• a distribution of queries P(q)
• a distribution of target rankings for each query, P(r | q)

Given:
• a collection D of documents
• an independent, identically distributed training sample (qi, ri)

Design:
• a set of ranking functions F
• a loss function l(ra, rb)
• a learning algorithm

Goal:
• find f ∈ F that minimizes the loss l(f(q), r) integrated across all queries

A Loss Function for Rankings

For two orderings ra and rb, a pair is:
• concordant, if ra and rb agree in their ordering; P = number of concordant pairs
• discordant, if ra and rb disagree in their ordering; Q = number of discordant pairs

Loss function: l(ra, rb) = Q

Example:
  ra = (a, c, d, b, e, f, g, h)
  rb = (a, b, c, d, e, f, g, h)

The discordant pairs are: (c, b), (d, b), so l(ra, rb) = 2.

Joachims
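Counting discordant pairs can be sketched as follows; the function name is ours, and the example reproduces the slide's two orderings:

```python
from itertools import combinations

def discordant_pairs(ra, rb):
    """Pairs of items that ra and rb rank in opposite relative order."""
    pos_a = {item: i for i, item in enumerate(ra)}
    pos_b = {item: i for i, item in enumerate(rb)}
    disagree = []
    for x, y in combinations(ra, 2):  # x precedes y in ra
        # Discordant if rb reverses the relative order of x and y.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
            disagree.append((x, y))
    return disagree

ra = list("acdbefgh")
rb = list("abcdefgh")
pairs = discordant_pairs(ra, rb)
print(pairs)       # [('c', 'b'), ('d', 'b')]
print(len(pairs))  # loss l(ra, rb) = 2
```

This quantity is the Q of Kendall's tau; minimizing it pushes the learned ranking toward the target ordering.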

Machine Learning: Algorithms

The choice of algorithm is a subject of active research, which is covered in several courses, notably CS 478 and CS/INFO 630. Some effective methods include:
• Naive Bayes
• Rocchio Algorithm
• C4.5 Decision Tree
• k-Nearest Neighbors
• Support Vector Machine

Relevance Feedback: Clickthrough Data

Relevance feedback methods have suffered from the unwillingness of users to provide feedback. Joachims and others have developed methods that use clickthrough data from online searches.

Concept: Suppose that a query delivers a set of hits to a user. If the user skips a link a and clicks on a link b that is ranked lower, then the user's preference reflects rank(b) < rank(a).

Clickthrough Example

Ranking presented to user:
1. Kernel Machines — http://svm.first.gmd.de/
2. Support Vector Machine — http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine — http://ais.gmd.de/~thorsten/svm_light/
4. An Introduction to Support Vector Machines — http://www.support-vector.net/
5. Support Vector Machine and Kernel... References — http://svm.research.bell-labs.com/SVMrefs.html

The user clicks on 1, 3 and 4.

Inferred preferences: (3 < 2) and (4 < 2)

Joachims
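The inference in this example can be sketched as a "skip above" rule: a clicked result is preferred over every higher-ranked result the user skipped. The function name is ours, and results are represented by their rank positions:

```python
def preference_pairs(ranking, clicked):
    """For each clicked result b, emit (b, a) for every higher-ranked
    result a that the user skipped: the user prefers b over a."""
    prefs = []
    for b in clicked:
        for a in ranking[:ranking.index(b)]:  # results ranked above b
            if a not in clicked:
                prefs.append((b, a))
    return prefs

# Slide example: five results, user clicks on 1, 3 and 4.
ranking = [1, 2, 3, 4, 5]
clicked = [1, 3, 4]
print(preference_pairs(ranking, clicked))  # [(3, 2), (4, 2)]
```

These preference pairs are exactly the training data a learning-to-rank method (such as the ranking approach sketched on the earlier slides) can be trained on.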