Evaluation INST 734 Module 5 Doug Oard

Agenda
• Evaluation fundamentals
Ø Test collections: evaluating sets
• Test collections: evaluating rankings
• Interleaving
• User studies

Batch Evaluation Model
Diagram: Documents and a Query feed an IR Black Box, which produces a Search Result; an Evaluation Module combines that Search Result with Relevance Judgments to yield a Measure of Effectiveness.
These are the four things we need: documents, queries, relevance judgments, and a measure of effectiveness.
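
As a rough sketch of how these four pieces fit together (the function and variable names below are illustrative, not something defined in the slides), a batch evaluation simply replays each topic's query through the IR black box and scores the result against the relevance judgments:

```python
# Minimal batch-evaluation loop; "search" stands in for the IR black box and
# "measure" for whatever effectiveness measure is being reported.

def batch_evaluate(topics, qrels, search, measure):
    """topics:  {topic_id: query string}
    qrels:   {topic_id: set of relevant doc ids} (human judgments)
    search:  callable(query) -> ranked list of doc ids (the IR black box)
    measure: callable(retrieved, relevant) -> effectiveness score"""
    per_topic = {}
    for topic_id, query in topics.items():
        retrieved = search(query)                 # search result
        relevant = qrels.get(topic_id, set())     # relevance judgments
        per_topic[topic_id] = measure(retrieved, relevant)
    mean = sum(per_topic.values()) / len(per_topic)   # usual summary over topics
    return mean, per_topic
```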

IR Test Collection Design
• Representative document collection
  – Size, sources, genre, topics, …
• “Random” sample of topics
  – Associated somehow with queries
• Known (often binary) levels of relevance
  – For each topic-document pair (topic, not query!)
  – Assessed by humans, used only for evaluation (see the loader sketch below)
• Measure(s) of effectiveness
  – Used to compare alternate systems
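
Relevance judgments for TREC-style collections are commonly distributed as a "qrels" file, one topic-document judgment per line (topic id, an unused iteration field, document id, relevance grade). A minimal loader under that assumption; the example file name is just a placeholder:

```python
from collections import defaultdict

def load_qrels(path):
    """Read TREC-style qrels: each line is "<topic> <iteration> <docno> <relevance>".
    Returns {topic_id: {doc_id: relevance}}; the iteration field is ignored."""
    judgments = defaultdict(dict)
    with open(path) as qrels_file:
        for line in qrels_file:
            topic, _iteration, doc_id, relevance = line.split()
            judgments[topic][doc_id] = int(relevance)
    return judgments

# Example (hypothetical file name):
# qrels = load_qrels("qrels.example.txt")
```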

A TREC Ad Hoc Topic
Title: Health and Computer Terminals
Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpel tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.

Saracevic on Relevance
Relevance is the [measure / degree / dimension / estimate / appraisal / relation]
of a [correspondence / utility / connection / satisfaction / fit / bearing / matching]
existing between a [document / article / textual form / reference / information provided / fact]
and a [query / request / information used / point of view / information need statement]
as determined by a [person / judge / user / requester / information specialist].

Tefko Saracevic (1975). Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.

Teasing Apart “Relevance”
• Relevance relates a topic and a document
  – Duplicates are equally relevant by definition
  – Constant over time and across users
• Pertinence relates a task and a document
  – Accounts for quality, complexity, language, …
• Utility relates a user and a document
  – Accounts for prior knowledge

Dagobert Soergel (1994). Indexing and Retrieval Performance: The Logical Evidence. JASIS, 45(8), 589-599.

Set-Based Effectiveness Measures
• Precision – How much of what was found is relevant?
  – Often of interest, particularly for interactive searching
• Recall – How much of what is relevant was found?
  – Particularly important for law, patents, and medicine
• Fallout – How much of what was irrelevant was rejected?
  – Useful when different size collections are compared
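
A minimal sketch of the three set-based measures, computed from document-id sets for a single topic. Fallout is written here in its conventional form, the fraction of the non-relevant documents that were retrieved, so the "irrelevant rejected" rate described above is 1 - fallout. The function name and arguments are illustrative:

```python
def set_measures(retrieved, relevant, collection_size):
    """Precision, recall, and fallout for one topic, from sets of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant and retrieved
    nonrelevant = collection_size - len(relevant)    # size of the irrelevant set
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Conventional fallout: fraction of non-relevant documents that were retrieved
    fallout = (len(retrieved) - hits) / nonrelevant if nonrelevant else 0.0
    return precision, recall, fallout
```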

Diagram: the document collection partitioned by relevance and retrieval; the labeled regions include Relevant + Retrieved and Not Relevant + Not Retrieved.

Effectiveness Measures

                        System: Retrieved      System: Not Retrieved
  Truth: Relevant       Relevant Retrieved     Miss
  Truth: Not Relevant   False Alarm            Irrelevant Rejected

The slide also labels the table User-Oriented and System-Oriented: measures computed over what the user actually sees (the Retrieved column, e.g., precision) versus measures that require knowing the whole collection (the Truth rows, e.g., recall and fallout).

Single-Figure Set-Based Measures
• Balanced F-measure
  – Harmonic mean of recall and precision
  – Weakness: What if no relevant documents exist?
• Cost function
  – Reward relevant retrieved, penalize non-relevant
  – For example, 3 R+ - 2 N+
  – Weakness: Hard to normalize, so hard to average
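
A short sketch of both single-figure measures. The harmonic-mean formula is the standard one; the 3/2 weights in the cost function are the example from the slide, and the zero return below is just one convention for the degenerate case noted above:

```python
def balanced_f(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall.
    When both are zero (e.g., no relevant documents exist) the measure is
    undefined; returning 0.0 is a common convention."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def linear_cost(relevant_retrieved, nonrelevant_retrieved, reward=3, penalty=2):
    """Cost function from the slide: reward each relevant document retrieved (R+),
    penalize each non-relevant document retrieved (N+)."""
    return reward * relevant_retrieved - penalty * nonrelevant_retrieved
```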

(Paired) Statistical Significance Tests

  Query     System A   System B   Sign Test   Wilcoxon   t-test
  1         0.02       0.76       +           +0.74      +0.74
  2         0.39       0.07       -           -0.32      -0.32
  3         0.16       0.37       +           +0.21      +0.21
  4         0.58       0.21       -           -0.37      -0.37
  5         0.04       0.02       -           -0.02      -0.02
  6         0.09       0.91       +           +0.82      +0.82
  7         0.12       0.46       +           +0.38      +0.38
  Average   0.20       0.40       p=1.0       p=0.94     p=0.34

(The figure also marks the interval containing 95% of outcomes around 0.)

Try some at: http://www.socscistatistics.com/tests/signedranks/
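
The same three paired tests can be run on the per-query scores in the table with SciPy. This is a sketch: the two-sided sign test is implemented here as a binomial test on the number of queries where System B wins, after dropping exact ties:

```python
from scipy import stats

system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

# Sign test: is the number of B-wins consistent with a fair coin?
wins = sum(b > a for a, b in zip(system_a, system_b))
n = sum(b != a for a, b in zip(system_a, system_b))   # ignore exact ties
sign_p = stats.binomtest(wins, n=n, p=0.5).pvalue

# Wilcoxon signed-rank test on the paired differences
wilcoxon_p = stats.wilcoxon(system_b, system_a).pvalue

# Two-tailed paired t-test
ttest_p = stats.ttest_rel(system_b, system_a).pvalue

print(f"sign test p={sign_p:.2f}  Wilcoxon p={wilcoxon_p:.2f}  t-test p={ttest_p:.2f}")
```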

Reporting Results
• Do you have a measurable improvement?
  – Inter-assessor agreement limits max precision
  – Using one judge to assess another yields ~0.8
• Do you have a meaningful improvement?
  – 0.05 (absolute) in precision might be noticed
  – 0.10 (absolute) in precision makes a difference
• Do you have a reliable improvement?
  – Two-tailed paired statistical significance test

Agenda
• Evaluation fundamentals
• Test collections: evaluating sets
Ø Test collections: evaluating rankings
• Interleaving
• User studies