Paired Experiments and Interleaving for Retrieval Evaluation
Thorsten Joachims, Madhu Kurup, Filip Radlinski
Department of Computer Science, Department of Information Science
Cornell University

Decide between two Ranking Functions
• Distribution P(u, q) of users u, queries q
• Example: (u, q) = (tj, "SVM")
• Retrieval Function 1: f1(u, q) → r1
  1. Kernel Machines http://svm.first.gmd.de/
  2. SVM-Light Support Vector Machine http://svmlight.joachims.org/
  3. School of Veterinary Medicine at UPenn http://www.vet.upenn.edu/
  4. An Introduction to Support Vector Machines http://www.support-vector.net/
  5. Service Master Company http://www.servicemaster.com/
  ⁞
  – Utility: U(tj, "SVM", r1)
• Retrieval Function 2: f2(u, q) → r2
  1. School of Veterinary Medicine at UPenn http://www.vet.upenn.edu/
  2. Service Master Company http://www.servicemaster.com/
  3. Support Vector Machine http://jbolivar.freeservers.com/
  4. Archives of SUPPORT-VECTOR-MACHINES http://www.jiscmail.ac.uk/lists/SUPPORT...
  5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
  ⁞
  – Utility: U(tj, "SVM", r2)
• Which one is better?

Measuring Utility

Name                  | Description                                       | Aggregation | Hypothesized Change with Decreased Quality
Abandonment Rate      | % of queries with no click                        | N/A         | Increase
Reformulation Rate    | % of queries that are followed by reformulation   | N/A         | Increase
Queries per Session   | Session = no interruption of more than 30 minutes | Mean        | Increase
Clicks per Query      | Number of clicks                                  | Mean        | Decrease
Click@1               | % of queries with clicks at position 1            | N/A         | Decrease
Max Reciprocal Rank*  | 1/rank for highest click                          | Mean        | Decrease
Mean Reciprocal Rank* | Mean of 1/rank for all clicks                     | Mean        | Decrease
Time to First Click*  | Seconds before first click                        | Median      | Increase
Time to Last Click*   | Seconds before final click                        | Median      | Decrease

(*) only queries with at least one click count
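As a rough illustration of how the click-rank-based metrics in this table could be computed, here is a minimal Python sketch. The input format (one list of clicked 1-based ranks per query) and the function name `absolute_metrics` are assumptions made for this example, not something the slides specify. The session, reformulation and timing metrics are left out because they need data that this toy format does not carry.

```python
from statistics import mean

def absolute_metrics(queries):
    """Compute click-based absolute metrics.

    queries: list with one entry per query; each entry is the list of
    1-based result ranks that were clicked (empty list = no click).
    This log format is a hypothetical simplification.
    """
    if not queries:
        raise ValueError("need at least one query")
    n = len(queries)
    # Starred metrics in the table only count queries with at least one click.
    clicked = [ranks for ranks in queries if ranks]
    return {
        # % of queries with no click
        "abandonment_rate": sum(1 for ranks in queries if not ranks) / n,
        # mean number of clicks per query
        "clicks_per_query": mean(len(ranks) for ranks in queries),
        # % of queries with a click at position 1
        "click_at_1": sum(1 for ranks in queries if 1 in ranks) / n,
        # 1/rank of the highest (top-most) click, averaged over queries
        "max_reciprocal_rank": (
            mean(1 / min(ranks) for ranks in clicked) if clicked else None
        ),
        # mean of 1/rank over all clicks of a query, averaged over queries
        "mean_reciprocal_rank": (
            mean(mean(1 / r for r in ranks) for ranks in clicked)
            if clicked else None
        ),
    }

if __name__ == "__main__":
    # Four toy queries: one abandoned, three with clicks.
    print(absolute_metrics([[], [1], [2, 4], [1, 3]]))
```

Under the table's hypothesis, a worse ranking should push the first metric up and the other four down.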

ArXiv.org: User Study
• User Study in ArXiv.org
  – Natural user and query population
  – User in natural context, not lab
  – Live and operational search engine
  – Ground truth by construction
• ORIG > SWAP2 > SWAP4
  – ORIG: Hand-tuned fielded
  – SWAP2: ORIG with 2 pairs swapped
  – SWAP4: ORIG with 4 pairs swapped
• ORIG > FLAT > RAND
  – ORIG: Hand-tuned fielded
  – FLAT: No field weights
  – RAND: Top 10 of FLAT shuffled
[Radlinski et al., 2008]

ArXiv.org: Experiment Setup
• Experiment Setup
  – Phase I: 36 days
    • Users randomly receive ranking from Orig, Flat, Rand
  – Phase II: 30 days
    • Users randomly receive ranking from Orig, Swap2, Swap4
  – Users are permanently assigned to one experimental condition based on IP address and browser.
• Basic Statistics
  – ~700 queries per day / ~300 distinct users per day
• Quality Control and Data Cleaning
  – Test run for 32 days
  – Heuristics to identify bots and spammers
  – All evaluation code was written twice and cross-validated
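One way to get a permanent assignment from IP address and browser is to hash the pair and use the hash to pick a condition. The slides do not say how the assignment was implemented, so the sketch below is only an assumption: the function name `assign_condition`, the `salt` parameter and the use of SHA-256 are all choices made for illustration.

```python
import hashlib

def assign_condition(ip, browser, conditions, salt="phase1"):
    """Deterministically map (ip, browser) to one experimental condition.

    Hypothetical helper: the same inputs always give the same condition,
    so a returning user keeps seeing the same ranking function.
    """
    key = f"{salt}|{ip}|{browser}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and reduce modulo the
    # number of conditions to choose a bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(conditions)
    return conditions[bucket]

if __name__ == "__main__":
    phase1 = ["Orig", "Flat", "Rand"]
    print(assign_condition("192.0.2.1", "Mozilla/5.0", phase1))
```

Changing the salt between phases would reshuffle users, so Phase II assignments would not be tied to Phase I assignments.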

ArXiv.org: Results
[Figure: bar chart with one group of bars per absolute metric, comparing ORIG, FLAT, RAND and ORIG, SWAP2, SWAP4; y-axis from 0 to 2.5]
Conclusions
• None of the absolute metrics reflects the expected order.
• Most differences not significant after one month of data.
• Analogous results for Yahoo! Search with much more data.
[Radlinski et al., 2008]

Decide between two Ranking Functions
• Distribution P(u, q) of users u, queries q
• Example: (u, q) = (tj, "SVM")
• Retrieval Function 1: f1(u, q) → r1
  1. Kernel Machines http://svm.first.gmd.de/
  2. SVM-Light Support Vector Machine http://svmlight.joachims.org/
  3. School of Veterinary Medicine at UPenn http://www.vet.upenn.edu/
  4. An Introduction to Support Vector Machines http://www.support-vector.net/
  5. Service Master Company http://www.servicemaster.com/
  ⁞
  – Utility: U(tj, "SVM", r1)
• Retrieval Function 2: f2(u, q) → r2
  1. School of Veterinary Medicine at UPenn http://www.vet.upenn.edu/
  2. Service Master Company http://www.servicemaster.com/
  3. Support Vector Machine http://jbolivar.freeservers.com/
  4. Archives of SUPPORT-VECTOR-MACHINES http://www.jiscmail.ac.uk/lists/SUPPORT...
  5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
  ⁞
  – Utility: U(tj, "SVM", r2)
• Which one is better?
What would Paul do?
• Take results from two retrieval functions and mix them: blind paired comparison.
• Fedex them to the users.
• Users assess relevance of papers.
• Retrieval system with more relevant papers wins.
KANTOR, P. 1988. National, language-specific evaluation sites for retrieval systems and interfaces. Proceedings of the International Conference on Computer-Assisted Information Retrieval (RIAO). 139–147.

Balanced Interleaving
• Example: (u=tj, q="svm")
• f1(u, q) → r1
  1. Kernel Machines http://svm.first.gmd.de/
  2. Support Vector Machine http://jbolivar.freeservers.com/
  3. An Introduction to Support Vector Machines http://www.support-vector.net/
  4. Archives of SUPPORT-VECTOR-MACHINES... http://www.jiscmail.ac.uk/lists/SUPPORT...
  5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
• f2(u, q) → r2
  1. Kernel Machines http://svm.first.gmd.de/
  2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
  3. Support Vector Machine and Kernel... References http://svm.research.bell-labs.com/SVMrefs.html
  4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html
  5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk
• Interleaving(r1, r2), with the rank in the source ranking in parentheses
  1. Kernel Machines http://svm.first.gmd.de/ (1)
  2. Support Vector Machine http://jbolivar.freeservers.com/ (2)
  3. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/ (2)
  4. An Introduction to Support Vector Machines http://www.support-vector.net/ (3)
  5. Support Vector Machine and Kernel... References http://svm.research.bell-labs.com/SVMrefs.html (3)
  6. Archives of SUPPORT-VECTOR-MACHINES... http://www.jiscmail.ac.uk/lists/SUPPORT... (4)
  7. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html (4)
• Model of User: The better retrieval function is more likely to get more clicks.
• Invariant: For all k, top k of balanced interleaving is union of top k1 of r1 and top k2 of r2 with k1 = k2 ± 1.
• Interpretation: (r1 ≻ r2) ↔ clicks(topk(r1)) > clicks(topk(r2))
see also [Radlinski, Craswell, 2012] [Hofmann, 2012]
[Joachims, 2001] [Radlinski et al., 2008]
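The interleaving and its click interpretation on this slide can be sketched in Python as follows. This is a minimal sketch of the commonly described balanced interleaving procedure, not the authors' code. The function names, the short document ids, and the choice of k (the smallest prefix length of either ranking that reaches the lowest clicked result) are assumptions made for illustration.

```python
import random

def balanced_interleave(r1, r2, r1_first=None):
    """Merge two rankings so every prefix draws almost equally from both."""
    if r1_first is None:
        # A coin flip decides which ranking contributes first.
        r1_first = random.random() < 0.5
    merged, seen = [], set()
    k1 = k2 = 0  # how many results of r1 / r2 have been consumed
    while k1 < len(r1) and k2 < len(r2):
        if k1 < k2 or (k1 == k2 and r1_first):
            doc, k1 = r1[k1], k1 + 1
        else:
            doc, k2 = r2[k2], k2 + 1
        if doc not in seen:  # skip results already shown
            seen.add(doc)
            merged.append(doc)
    return merged

def compare_by_clicks(r1, r2, merged, clicked):
    """Return 1 if r1 wins, -1 if r2 wins, 0 for a tie or no clicks."""
    clicked = set(clicked) & set(merged)
    if not clicked:
        return 0
    # Lowest clicked result in the interleaved list.
    lowest = max(clicked, key=merged.index)
    ranks = [r.index(lowest) + 1 for r in (r1, r2) if lowest in r]
    k = min(ranks)  # smallest prefix length that contains it
    h1 = sum(doc in clicked for doc in r1[:k])
    h2 = sum(doc in clicked for doc in r2[:k])
    return (h1 > h2) - (h1 < h2)

if __name__ == "__main__":
    # Hypothetical short ids standing in for the results on the slide.
    r1 = ["kernel", "jbolivar", "intro", "archives", "svmlight"]
    r2 = ["kernel", "svmlight", "refs", "lucent", "holloway"]
    merged = balanced_interleave(r1, r2, r1_first=True)
    print(merged)
    print(compare_by_clicks(r1, r2, merged, ["svmlight"]))
```

With `r1_first=True` the merged list reproduces the seven-result interleaving shown on the slide. A single click on the SVM-Light result counts for r2, because r2 ranks it at position 2 while r1 ranks it at position 5.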

ArXiv.org: Interleaving Experiment
• Experiment Setup
  – Phase I: 36 days
    • Balanced Interleaving of (Orig, Flat) (Flat, Rand) (Orig, Rand)
  – Phase II: 30 days
    • Balanced Interleaving of (Orig, Swap2) (Swap2, Swap4) (Orig, Swap4)
• Quality Control and Data Cleaning
  – Same as for absolute metrics

ArXiv.org: Interleaving Results
[Figure: bar chart of Percent Wins (y-axis from 0 to 45) for the pairs ORIG>FLAT, FLAT>RAND, ORIG>RAND, ORIG>SWAP2, SWAP2>SWAP4, ORIG>SWAP4; legend entries include "% wins ORIG" and "% wins RAND"]
Conclusions
• All interleaving experiments reflect the expected order.
• All differences are significant after one month of data.
• Same results also for alternative data-preprocessing.
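To make "percent wins" and "significant" concrete, the sketch below computes the win percentages for one interleaved pair and a two-sided sign test on the win counts. The slides do not state which significance test was used, so the sign test here is an assumption chosen for illustration, and the function names are hypothetical. Ties are ignored by the test.

```python
from math import comb

def percent_wins(wins_a, wins_b, ties=0):
    """Percentage of interleaved queries won by each ranking function."""
    total = wins_a + wins_b + ties
    if total == 0:
        raise ValueError("no queries")
    return 100 * wins_a / total, 100 * wins_b / total

def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test: are the wins consistent with a fair coin?"""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under p = 0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

if __name__ == "__main__":
    # Made-up counts, only to show the calls.
    print(percent_wins(60, 40, ties=100))
    print(sign_test_p_value(60, 40))
```

A small p-value would indicate that one ranking function wins more often than chance alone would explain.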

Yahoo and Bing: Interleaving Results
• Yahoo Web Search [Chapelle et al., 2012]
  – Four retrieval functions (i.e., 6 paired comparisons)
  – Balanced Interleaving
  – All paired comparisons consistent with ordering by NDCG.
• Bing Web Search [Radlinski & Craswell, 2010]
  – Five retrieval function pairs
  – Team-Game Interleaving
  – Consistent with ordering by NDCG when NDCG significant.

Conclusion
• Pick Paul’s brain frequently
• Pick Paul’s brain early
• Library dust is not harmful