Discussion Class 4: Ranking 1

Discussion Classes Format

  • Question
  • Ask a member of the class to answer
  • Provide opportunity for others to comment

When answering:

  • Give your name. Make sure that the TA hears it.
  • Stand up
  • Speak clearly so that all the class can hear

Question 1: Inverse Document Frequency (IDF)

In class, we first introduced Salton's original term weighting, known as Inverse Document Frequency:

    $w_{ik} = f_{ik} \cdot N / d_k$

The reading gives Sparck Jones's term weighting, Inverse Document Frequency (IDF):

    $\mathrm{IDF}_i = \log_2(N / n_i) + 1$   or   $\mathrm{IDF}_i = \log_2(\mathrm{maxn} / n_i) + 1$

What is the relationship between these alternatives?

Question 1 (continued): Definitions of Terms

  • $w_{ik}$: weight given to term k in document i
  • $f_{ik}$: frequency with which term k appears in document i
  • $d_k$: number of documents that contain term k
  • $N$: number of documents in the collection
  • $n_i$: total number of occurrences of term i in the collection
  • $\mathrm{maxn}$: maximum frequency of any term in the collection
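To compare the three weightings in Question 1 numerically, a minimal Python sketch may help. The function names and the sample counts are illustrative assumptions, not part of the slides or the reading; only the formulas come from the question.

```python
import math

def salton_weight(f_ik, n_docs, d_k):
    """Salton's original weighting: w_ik = f_ik * N / d_k."""
    return f_ik * n_docs / d_k

def idf_collection(n_docs, n_i):
    """Sparck Jones IDF, first form: log2(N / n_i) + 1."""
    return math.log2(n_docs / n_i) + 1

def idf_maxn(maxn, n_i):
    """Sparck Jones IDF, second form: log2(maxn / n_i) + 1."""
    return math.log2(maxn / n_i) + 1

# Hypothetical counts, chosen as powers of two so the logs are exact.
N = 1024      # documents in the collection
d_k = 16      # documents that contain the term
n_i = 16      # occurrences of the term in the collection
maxn = 256    # maximum frequency of any term in the collection
f_ik = 3      # occurrences of the term in one document

print(salton_weight(f_ik, N, d_k))   # linear in N / d_k
print(idf_collection(N, n_i))        # logarithmic in N / n_i
print(idf_maxn(maxn, n_i))           # logarithmic in maxn / n_i
```

With these numbers the linear form grows much faster than either logarithmic form as a term becomes rarer, which is one angle on the relationship the question asks about.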

Question 2: Inverted Files

"The use of a ranking system instead of a Boolean retrieval system has several important implications for supporting inverted file systems."

Discuss the implications of:

(a) Adjacency operators

(b) Stemming and stoplists

Question 3: Operations on Inverted Files

Consider a search of a large set of documents with the query:

    vector space methods in information retrieval

(a) What are the steps that the search process must go through?

(b) Where would you expect the computation impact to be greatest?

(c) How can the inverted file system be organized to minimize the computation?
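As a starting point for discussing Question 3, here is one possible sketch of a ranked search over a toy inverted file. Everything in it is an assumption made for illustration: the stoplist, the three sample documents, the in-memory dictionary standing in for the inverted file, and the use of the number of documents containing a term in the IDF. It is not the organization described in the reading.

```python
import math
from collections import defaultdict

STOPLIST = {"in", "the", "of", "a"}  # hypothetical stoplist

def build_index(docs):
    """Map each term to a postings dict {doc_id: within-document frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in STOPLIST:
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(query, index, n_docs):
    """Rank documents by the sum of freq * IDF over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        # Step 1: drop stopwords and terms with no postings.
        if term in STOPLIST or term not in index:
            continue
        # Step 2: fetch the postings list for the term.
        postings = index[term]
        # Step 3: weight the term (assumed IDF over document counts).
        idf = math.log2(n_docs / len(postings)) + 1
        # Step 4: accumulate a partial score for every posting.
        for doc_id, freq in postings.items():
            scores[doc_id] += freq * idf
    # Step 5: sort the accumulated scores, best first.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    1: "vector space methods",
    2: "information retrieval in the library",
    3: "cooking methods",
}
index = build_index(docs)
ranked = search("vector space methods in information retrieval", index, len(docs))
print(ranked)
```

In this sketch the inner loop over postings (step 4) and the final sort (step 5) touch the most data, which may be a useful place to begin parts (b) and (c).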

Question 4: Within-Document Frequency

(a) Why does term weighting using within-document frequency improve ranking?

(b) Why is it useful to normalize within-document frequency?

(c) Explain Croft's normalization:

    $\mathrm{cfreq}_{ij} = K + (1 - K)\,\mathrm{freq}_{ij} / \mathrm{maxfreq}_j$

(d) How does Salton and Buckley's recommended term weighting fit with Croft's normalization?
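To see how the constant K in part (c) behaves, a short Python sketch of Croft's normalization follows. The function name and the sample values of K and the frequencies are assumptions chosen for illustration.

```python
def croft_normalize(freq_ij, maxfreq_j, k):
    """cfreq_ij = K + (1 - K) * freq_ij / maxfreq_j."""
    if maxfreq_j <= 0:
        raise ValueError("maxfreq_j must be positive")
    return k + (1 - k) * freq_ij / maxfreq_j

# Hypothetical document whose most frequent term occurs 10 times.
for k in (0.0, 0.5, 1.0):
    rare = croft_normalize(1, 10, k)     # term that occurs once
    common = croft_normalize(10, 10, k)  # the most frequent term
    print(k, rare, common)
```

With K = 0 the result is the plain ratio freq/maxfreq, with K = 1 every occurring term gets the same value of 1, and intermediate values of K compress the range between those extremes.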

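Question 4 (continued): Salton/Buckley Recommendation

$$\mathrm{similarity}(Q, D_j) = \frac{\sum_{i=1}^{t} (w_{iq} \times w_{ij})}{\sqrt{\sum_{i=1}^{t} w_{iq}^2 \times \sum_{i=1}^{t} w_{ij}^2}}$$

where

$$w_{iq} = \left(0.5 + \frac{0.5\,\mathrm{freq}_{iq}}{\mathrm{maxfreq}_q}\right) \times \mathrm{IDF}_i \qquad \text{and} \qquad w_{ij} = \mathrm{freq}_{ij} \times \mathrm{IDF}_i$$

  • $\mathrm{freq}_{iq}$: frequency of term i in query q
  • $\mathrm{maxfreq}_q$: maximum frequency of any term in query q
  • $\mathrm{IDF}_i$: IDF of term i in entire collection
  • $\mathrm{freq}_{ij}$: frequency of term i in document j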
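The Salton/Buckley similarity can be traced with a small Python sketch. It assumes the denominator is the square root of the product of the two sums of squares (a cosine-style normalization), that terms absent from the query get weight 0, and that the query is non-empty. The function names, the term frequencies, and the IDF values are hypothetical.

```python
import math

def query_weight(freq_iq, maxfreq_q, idf_i):
    """w_iq = (0.5 + 0.5 * freq_iq / maxfreq_q) * IDF_i."""
    return (0.5 + 0.5 * freq_iq / maxfreq_q) * idf_i

def doc_weight(freq_ij, idf_i):
    """w_ij = freq_ij * IDF_i."""
    return freq_ij * idf_i

def similarity(query_freqs, doc_freqs, idf):
    """Normalized inner product of query and document weight vectors."""
    maxfreq_q = max(query_freqs.values())  # assumes a non-empty query
    w_q = {t: query_weight(f, maxfreq_q, idf[t]) for t, f in query_freqs.items()}
    w_d = {t: doc_weight(f, idf[t]) for t, f in doc_freqs.items()}
    # Only terms present in both vectors contribute to the numerator.
    dot = sum(w * w_d.get(t, 0.0) for t, w in w_q.items())
    norm = math.sqrt(sum(w * w for w in w_q.values())
                     * sum(w * w for w in w_d.values()))
    return dot / norm if norm else 0.0

# Hypothetical IDF values and term frequencies.
idf = {"vector": 1.0, "space": 1.0, "cooking": 1.0}
query = {"vector": 1, "space": 1}
print(similarity(query, {"vector": 2, "space": 2}, idf))  # same direction as the query
print(similarity(query, {"cooking": 1}, idf))             # no shared terms
```

Because of the normalization, a document whose weights are a scaled copy of the query's weights scores the same as an exact copy, which connects back to part (b) of Question 4.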

Question 5: tf.idf compared with Google PageRank

(a) tf.idf and PageRank are based on fundamentally different considerations. What are the fundamental differences?

(b) Under which circumstances would you expect each to excel?