Document ranking
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides: 38

The big fight: find the best ranking...

Ranking: Google vs Google.cn

Document ranking: Text-based Ranking (1st generation). Reading 6.2 and 6.3

Similarity between binary vectors
- Documents are binary vectors X, Y in {0,1}^D.
- Score: the overlap measure (size of the intersection). What's wrong with it?

Normalization
- Dice coefficient (wrt the average number of terms): NO, its distance does not satisfy the triangle inequality.
- Jaccard coefficient (wrt the possible terms): OK, its distance satisfies the triangle inequality.
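
The three measures above are easy to state on term sets; a minimal sketch (the toy documents are invented for the example):

```python
def overlap(x, y):
    """Overlap measure: size of the intersection of two term sets."""
    return len(x & y)

def dice(x, y):
    """Dice coefficient: overlap normalized by the average set size."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Jaccard coefficient: overlap normalized by the union size."""
    return len(x & y) / len(x | y)

d1 = {"the", "quick", "brown", "fox"}
d2 = {"the", "lazy", "brown", "dog"}
print(overlap(d1, d2))   # 2
print(dice(d1, d2))      # 0.5
print(jaccard(d1, d2))   # ≈ 0.333
```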

What's wrong with doc-similarity?
Overlap matching doesn't consider:
- Term frequency in a document: a doc talks more of t? Then t should be weighted more.
- Term scarcity in the collection: "of" is commoner than "baby bed".
- Length of documents: the score should be normalized.

A famous "weight": tf-idf
- tf_{t,d} = number of occurrences of term t in doc d
- idf_t = log(n / n_t), where n_t = #docs containing term t, and n = #docs in the indexed collection
Vector Space model
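
The tf-idf weight follows directly from its definition; a small sketch on an invented toy corpus (the helper name `tfidf_weights` is ours):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute tf-idf weights w[t][d] = tf(t,d) * log(n / n_t) for a toy corpus.
    docs: list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    # n_t = number of documents containing term t (document frequency)
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["cat", "cat", "dog"], ["dog", "bird"], ["cat", "bird", "bird"]]
w = tfidf_weights(docs)
# "dog" appears in 2 of the 3 docs, so idf = log(3/2); its tf in doc 0 is 1
print(w[0]["dog"])  # ≈ 0.405
```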

Sec. 6.3: Why distance is a bad idea

Easy to spam: a graphical example
[Figure: documents d1..d5 and a query in a space with term axes t1, t2, t3]
- cos(a) = (v · w) / (||v|| * ||w||)
- Postulate: documents that are "close together" in the vector space talk about the same things.
- The user query is a very short doc; sophisticated algorithms find the top-k docs for a query Q.
- Euclidean distance is sensitive to vector length!
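
The length-sensitivity claim is easy to demonstrate: a document concatenated with itself lands far from the original in Euclidean distance, yet points in exactly the same direction. A small sketch (the vectors are invented for the example):

```python
import math

def euclid(v, w):
    """Euclidean distance between two term-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def cosine(v, w):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nv * nw)

d  = [3.0, 1.0, 0.0]   # term-frequency vector of a document
dd = [6.0, 2.0, 0.0]   # same document concatenated with itself
print(euclid(d, dd))   # ≈ 3.16: large, although the content is identical
print(cosine(d, dd))   # ≈ 1.0: same direction, so maximal similarity
```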

Sec. 6.3: cosine(query, document)
- A dot product: q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.
- cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

Cosine for length-normalized vectors
- For length-normalized vectors, cosine similarity is simply the dot (scalar) product: cos(q, d) = q · d, for q, d length-normalized.

Cosine similarity among 3 docs
How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Note: to simplify this example, we don't do idf weighting.

3-documents example, contd.

Log frequency weighting:

term        SaS   PaP   WH
affection  3.06  2.76  2.30
jealous    2.00  1.85  2.04
gossip     1.30  0     1.78
wuthering  0     0     2.58

After length normalization:

term         SaS    PaP    WH
affection  0.789  0.832  0.524
jealous    0.515  0.555  0.465
gossip     0.335  0      0.405
wuthering  0      0      0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0 + 0 × 0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69
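
The tables above can be reproduced programmatically; this sketch uses the 1 + log10(tf) weighting and length normalization from the slides, on the counts given for the three novels:

```python
import math

counts = {
    "affection": {"SaS": 115, "PaP": 58, "WH": 20},
    "jealous":   {"SaS": 10,  "PaP": 7,  "WH": 11},
    "gossip":    {"SaS": 2,   "PaP": 0,  "WH": 6},
    "wuthering": {"SaS": 0,   "PaP": 0,  "WH": 38},
}

def logtf(tf):
    """Log frequency weighting: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

docs = {}
for name in ("SaS", "PaP", "WH"):
    v = [logtf(counts[t][name]) for t in counts]
    norm = math.sqrt(sum(x * x for x in v))
    docs[name] = [x / norm for x in v]   # length-normalize

def cos(a, b):
    """Cosine of two length-normalized vectors: just the dot product."""
    return sum(x * y for x, y in zip(docs[a], docs[b]))

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```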

Sec. 7.1.2: Storage
- For every term, we store the IDF in memory in terms of n_t, which is actually the length of its posting list (and thus needed anyway).
- For every docID d in the posting list of term t, we store its frequency tf_{t,d}, which is typically small and thus stored with unary/gamma codes.
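
The gamma codes mentioned here can be sketched as follows; this is a minimal Elias gamma encoder (unary length prefix followed by the offset bits):

```python
def gamma_encode(n):
    """Elias gamma code for a positive integer n: the offset (binary of n
    without its leading 1-bit) preceded by a unary prefix giving its length."""
    assert n >= 1
    binary = bin(n)[2:]          # e.g. 9 -> "1001"
    offset = binary[1:]          # drop the leading 1 -> "001"
    return "1" * len(offset) + "0" + offset   # unary length, then offset

print(gamma_encode(1))   # "0"
print(gamma_encode(2))   # "100"
print(gamma_encode(9))   # "1110001"
```

Small frequencies get short codes, which is exactly why gamma coding suits the typically small tf values.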

Vector spaces and other operators
- The vector space is OK for bag-of-words queries, and a clean metaphor for similar-document queries.
- It does not combine well with other operators: Boolean, wild-card, positional, proximity.
- It powered the first generation of search engines, invented before "spamming" hit web search.

Document ranking: Top-k retrieval. Reading 7

Sec. 7.1.1: Speeding up top-k retrieval
- The costly step is the computation of the cosines.
- Idea: find a set A of contenders with K < |A| << N. A does not necessarily contain the top K, but it has many docs from among the top K. Return the top K docs in A, according to the score.
- The same approach is also used for other (non-cosine) scoring functions.
- We will look at several schemes following this approach.

Sec. 7.1.2: How to select A's docs
Consider docs containing at least one query term. Take this further:
1. Only consider high-idf query terms
2. Champion lists: top scores
3. Only consider docs containing many query terms
4. Fancy hits: for complex ranking functions
5. Clustering

Approach #1: High-idf query terms only (Sec. 7.1.2)
- For a query such as "catcher in the rye", only accumulate scores from catcher and rye.
- Intuition: "in" and "the" contribute little to the scores, and so don't alter the rank-ordering much.
- Benefit: the postings of low-idf terms contain many docs, and these (many) docs get eliminated from the set A of contenders.

Approach #2: Champion Lists
- Preprocess: assign to each term its m best documents.
- Search: if the query has q terms, merge their champion lists (≤ mq candidates), compute the cosine between Q and these docs, and choose the top k.
- Need to pick m > k to work well empirically.
- Modern search engines use tf-idf PLUS PageRank (PLUS other weights).
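
The champion-list search can be sketched as follows; `champions`, `score`, and the toy data are assumptions of this example, not part of the slides:

```python
import heapq

def champion_search(query_terms, champions, score, k):
    """Score only the docs drawn from the champion (top-m) lists of the
    query terms, then return the k best. `champions` maps term -> list of
    docIDs; `score` is any scoring function (e.g. cosine)."""
    candidates = set()
    for t in query_terms:
        candidates.update(champions.get(t, []))
    return heapq.nlargest(k, candidates,
                          key=lambda d: score(query_terms, d))

# Toy example: scores are just precomputed numbers per doc.
champions = {"brutus": [1, 4, 7], "caesar": [2, 4, 9]}
doc_score = {1: 0.1, 2: 0.5, 4: 0.9, 7: 0.3, 9: 0.2}
top = champion_search(["brutus", "caesar"], champions,
                      lambda q, d: doc_score[d], k=2)
print(top)  # [4, 2]
```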

Approach #3: Docs containing many query terms
- For multi-term queries, compute scores only for docs containing several of the query terms, say at least 3 out of 4.
- This imposes a "soft conjunction" on queries, as seen on web search engines (early Google).
- Easy to implement in a postings traversal.

Sec. 7.1.2: 3 of 4 query terms

Antony:    3  4  8  16  32  64  128
Brutus:    2  4  8  16  32  64  128
Caesar:    1  2  3  5  8  13  21  34
Calpurnia: 13 16 32

Scores only computed for docs 8, 16 and 32.
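
The "soft conjunction" on these posting lists amounts to a simple document-frequency count (the helper name `soft_conjunction` is ours):

```python
from collections import Counter

postings = {
    "Antony":    [3, 4, 8, 16, 32, 64, 128],
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16, 32],
}

def soft_conjunction(postings, threshold):
    """Keep only the docs appearing in at least `threshold` posting lists."""
    hits = Counter(d for plist in postings.values() for d in plist)
    return sorted(d for d, c in hits.items() if c >= threshold)

print(soft_conjunction(postings, 3))  # [8, 16, 32]
```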

Sec. 7.1.4: Complex scores
- Consider a simple total score combining cosine relevance and authority: net-score(q, d) = PR(d) + cosine(q, d).
- Some other linear combination than equal weighting can be used.
- Now we seek the top K docs by net score.

Approach #4: Fancy-hits heuristic
- Preprocess:
  - Assign docIDs by decreasing PR weight.
  - Define FH(t) = the m docs for t with the highest tf-idf weight.
  - Define IL(t) = the rest (i.e., increasing docID = decreasing PR weight).
- Idea: a document that scores high should be in FH or at the front of IL.
- Search, for a t-term query:
  - First FH: take the docs common to all their FH lists, compute the score of these docs, and keep the top-k docs.
  - Then IL: scan the ILs and check the common docs; compute their score and possibly insert them into the top-k.
  - Stop when M docs have been checked, or when the PR score becomes smaller than some threshold.
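
A rough sketch of the two-phase search just described; the data layout (FH/IL as per-term lists) and the scoring function are assumptions of this example:

```python
def fancy_hits_search(terms, FH, IL, score, k, M):
    """Sketch of the fancy-hits heuristic. FH[t] / IL[t] are the fancy-hit
    and remaining posting lists of term t; docIDs are assumed assigned by
    decreasing PR, so increasing docID means decreasing PR. `score`, `k`,
    and the scan budget `M` are parameters of this sketch."""
    # Phase 1: docs common to all fancy-hit lists, scored and kept as top-k.
    common = set.intersection(*(set(FH[t]) for t in terms))
    top = sorted(common, key=score, reverse=True)[:k]
    # Phase 2: scan the remaining lists for common docs, up to M checks.
    rest = set.intersection(*(set(IL[t]) for t in terms))
    for checked, d in enumerate(sorted(rest)):  # incr. docID = decr. PR
        if checked >= M:
            break
        top = sorted(set(top) | {d}, key=score, reverse=True)[:k]
    return top

FH = {"a": [1, 3, 5], "b": [3, 5, 8]}
IL = {"a": [9, 12, 20], "b": [12, 17, 20]}
scores = {1: .2, 3: .9, 5: .4, 8: .1, 9: .3, 12: .7, 17: .2, 20: .6}
print(fancy_hits_search(["a", "b"], FH, IL, scores.get, k=2, M=10))  # [3, 12]
```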

Sec. 7.1.6: Approach #5: Clustering
[Figure: a query point among leaders, each with its followers]

Sec. 7.1.6: Cluster pruning, preprocessing
- Pick √N docs at random: call these leaders.
- For every other doc, precompute its nearest leader. Docs attached to a leader are its followers; likely, each leader has ~ √N followers.

Sec. 7.1.6: Cluster pruning, query processing
- Given query Q, find its nearest leader L.
- Seek the K nearest docs from among L's followers.
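
Preprocessing and query processing for cluster pruning can be sketched together; the cosine helper and the toy corpus are assumptions of this example:

```python
import math
import random

def cosine(v, w):
    """Cosine similarity, guarding against zero-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nv * nw) if nv and nw else 0.0

def preprocess(docs, seed=0):
    """Pick ~sqrt(N) random leaders; attach every doc to its nearest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, int(math.sqrt(len(docs))))
    leaders = rng.sample(range(len(docs)), n_leaders)
    followers = {l: [] for l in leaders}
    for i, d in enumerate(docs):
        best = max(leaders, key=lambda l: cosine(d, docs[l]))
        followers[best].append(i)
    return leaders, followers

def query(q, docs, leaders, followers, k):
    """Find the nearest leader, then the k best docs among its followers."""
    best = max(leaders, key=lambda l: cosine(q, docs[l]))
    return sorted(followers[best], key=lambda i: cosine(q, docs[i]),
                  reverse=True)[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
leaders, followers = preprocess(docs)
print(query([1.0, 0.05], docs, leaders, followers, k=2))
```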

Sec. 7.1.6: Why use random sampling?
- Fast.
- Leaders reflect the data distribution.

Sec. 7.1.6: General variants
- Have each follower attached to b1 = 3 (say) nearest leaders.
- From the query, find b2 = 4 (say) nearest leaders and their followers.
- Can recur on the leader/follower construction.

Document ranking: Relevance feedback. Reading 9

Sec. 9.1: Relevance Feedback
Relevance feedback: user feedback on the relevance of docs in an initial set of results.
- The user issues a (short, simple) query.
- The user marks some results as relevant or non-relevant.
- The system computes a better representation of the information need based on the feedback.
- Relevance feedback can go through one or more iterations.

Sec. 9.1.1: Rocchio (SMART)
Used in practice:

  q_m = α q_0 + (β / |D_r|) Σ_{d ∈ D_r} d − (γ / |D_nr|) Σ_{d ∈ D_nr} d

- D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors.
- q_m = modified query vector; q_0 = original query vector.
- α, β, γ: weights (hand-chosen or set empirically).
- The new query moves toward the relevant documents and away from the irrelevant ones.
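
The Rocchio update can be sketched directly from the symbol definitions above; the default weight values used here are common choices, not prescribed by the slides:

```python
def rocchio(q0, Dr, Dnr, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr).
    Vectors are plain lists; negative components are clipped to 0, as is
    customary, since term weights cannot go negative."""
    dim = len(q0)
    def centroid(docs):
        if not docs:
            return [0.0] * dim
        return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]
    cr, cnr = centroid(Dr), centroid(Dnr)
    qm = [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in range(dim)]
    return [max(0.0, x) for x in qm]

q0 = [1.0, 0.0, 0.0]
relevant = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]
nonrelevant = [[0.0, 0.0, 1.0]]
print(rocchio(q0, relevant, nonrelevant))  # ≈ [1.0, 0.675, 0.0]
```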

Relevance Feedback: Problems
- Users are often reluctant to provide explicit feedback.
- It's often harder to understand why a particular document was retrieved after applying relevance feedback.
- There is no clear evidence that relevance feedback is the "best use" of the user's time.

Sec. 9.1.6: Pseudo relevance feedback
Pseudo-relevance feedback automates the "manual" part of true relevance feedback:
- Retrieve a list of hits for the user's query.
- Assume that the top k are relevant.
- Do relevance feedback (e.g., Rocchio).
It works very well on average, but can go horribly wrong for some queries; several iterations can cause query drift.
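
The pseudo-relevance-feedback loop is pure control flow; a sketch, with `search` and `update` left as caller-supplied functions (both names are ours):

```python
def pseudo_relevance_feedback(q0, search, update, k=10, iterations=1):
    """Pseudo-RF loop: retrieve, blindly assume the top k hits are relevant,
    and update the query (e.g. with Rocchio). `search` returns a ranked list
    of doc vectors; `update` produces the modified query. More iterations
    risk query drift."""
    q = q0
    for _ in range(iterations):
        hits = search(q)             # ranked list of doc vectors
        assumed_relevant = hits[:k]  # blind assumption: top k are relevant
        q = update(q, assumed_relevant)
    return q

# Toy demo: the "search engine" returns a fixed ranking; the "update"
# adds the sum of the assumed-relevant vectors to the query.
q = pseudo_relevance_feedback(
    [0.0],
    search=lambda q: [[1.0], [3.0], [5.0]],
    update=lambda q, rel: [q[0] + sum(d[0] for d in rel)],
    k=2, iterations=2)
print(q)  # [8.0]
```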

Sec. 9.2.2: Query Expansion
- In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight the terms of the query.
- In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec. 9.2.2: How to augment the user query?
- Manual thesaurus (costly to generate), e.g. MedLine: physician, syn: doc, doctor, MD.
- Global analysis (static; all docs in the collection):
  - Automatically derived thesaurus (co-occurrence statistics).
  - Refinements based on query-log mining (common on the web).
- Local analysis (dynamic): analysis of the documents in the result set.

Query assist
Would you expect such a feature to increase the query volume at a search engine?