Special Topics in Computer Science: Advanced Topics in Information Retrieval
Chapter 3: Retrieval Evaluation
Alexander Gelbukh, www.Gelbukh.com

Previous chapter
• Models are needed to support formal operations
• The Boolean model is the simplest
• The Vector model is the best combination of quality and simplicity
  ◦ TF-IDF term weighting (see the sketch below)
  ◦ This (or similar) weighting is used in all further models
• Many interesting and not-well-investigated variations
  ◦ possible future work
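To make the TF-IDF weighting mentioned above concrete, here is a minimal Python sketch. It uses raw term frequency and log(N / n_t) as idf, which is just one common variant (log-scaled tf, smoothed idf, and length normalization are equally common); the example documents are made up.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute simple TF-IDF weights for a list of tokenized documents.

    tf  = raw term frequency in the document
    idf = log(N / n_t), where n_t is the number of docs containing the term
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))        # count each term once per document

    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: freq * math.log(n_docs / doc_freq[term])
            for term, freq in tf.items()
        })
    return weights

# Hypothetical toy collection
docs = [["information", "retrieval", "evaluation"],
        ["boolean", "retrieval", "model"],
        ["vector", "model", "evaluation"]]
print(tf_idf_weights(docs)[0])
```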

Previous chapter: Research issues
• How do people judge relevance?
  ◦ ranking strategies
• How to combine different sources of evidence?
• What interfaces can help users understand and formulate their information need?
  ◦ user interfaces: an open issue
• Meta-search engines: how to combine results from different Web search engines?
  ◦ Their result sets barely intersect
  ◦ How to combine rankings?

To write a paper: Evaluation!
• How do you measure whether a system is good or bad?
• To go in the right direction, you need to know where you want to get.
• "We can do it this way" vs. "This way it performs better"
  ◦ Not: "I think it is better...", "We do it this way...", "Our method takes into account syntax and semantics...", "I like the results..."
• A criterion of truth. Crucial for any science.
• Enables competition, informs financial policy, attracts people
  ◦ TREC international competitions

Methodology to write a paper
• Define formally your task and constraints
• Define formally your evaluation criterion (argue for it if needed)
  ◦ One numerical value is better than several
• Show that your method gives a better value than
  ◦ the baseline (the simple obvious way), such as:
    ▪ Retrieve all. Retrieve none. Retrieve at random. Use Google.
  ◦ the state of the art (the best reported method)
    ▪ in the same setting and with the same evaluation method!
• ... and that your parameter settings are optimal
  ◦ Consider extreme settings: 0, ...

... Methodology: the only valid way of reasoning
• "But we want the clusters to be non-trivial"
  ◦ Add this as a penalty to your criteria, or as constraints
• Divide your "acceptability considerations" into:
  ◦ Constraints: yes/no.
  ◦ Evaluation: better/worse.
• Check that your evaluation criteria are well justified
  ◦ Not: "My formula gives it this way"
  ◦ Not: "My result is correct since this is what my algorithm gives"
  ◦ Reason in terms of the user task, not your algorithm / formulas
• Are your good/bad judgments in accord with intuition?

Evaluation? (Possible? How?)
• IR: "user satisfaction"
  ◦ Difficult to model formally
  ◦ Expensive to measure directly (experiments with subjects)
• At least two contradicting parameters
  ◦ Completeness vs. quality
  ◦ No good way to combine them into one single numerical value
  ◦ Some "user-defined" "weights of importance" of the two
    ▪ Not formal, depend on the situation
• Art

Parameters to evaluate
• Performance (in the general sense)
  ◦ Speed
  ◦ Space
    ▪ Tradeoff
  ◦ Common to all systems. Not discussed here.
• Retrieval performance (quality?)
  ◦ = goodness of a retrieval strategy
  ◦ A test reference collection: docs and queries
  ◦ The "correct" set (or ordering) provided by "experts"
  ◦ A similarity measure to compare the system output with the "correct" one

Evaluation: Model User Satisfaction
• User task
  ◦ Batch query processing? Interaction? Mixed?
• Way of use
  ◦ Real-life situation: what factors matter?
  ◦ Interface type
• In this chapter: laboratory settings
  ◦ Repeatability
  ◦ Scalability

Sets (Boolean): Precision & Recall
• A tradeoff (as with time and space)
• Assumes the retrieval results are sets
  ◦ as in the Boolean model; in the Vector model, use a threshold
• Measures the closeness between two sets
• Recall: of the relevant docs, how many (%) were retrieved? The others are lost.
• Precision: of the retrieved docs, how many (%) are relevant? The others are noise.
• Nowadays, with huge collections, precision is more important!

Precision & Recall
• Recall = |Ra| / |R|: the fraction of the relevant documents that were retrieved
• Precision = |Ra| / |A|: the fraction of the retrieved documents that are relevant
  ◦ R = set of relevant documents, A = answer set (retrieved documents), Ra = R ∩ A
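A minimal Python sketch of these set-based definitions; the document IDs are hypothetical, and R, A, Ra follow the notation above.

```python
def precision_recall(relevant, retrieved):
    """Set-based precision and recall.

    relevant  (R): documents judged relevant by the experts
    retrieved (A): documents returned by the system
    """
    ra = relevant & retrieved                                   # Ra = R ∩ A
    precision = len(ra) / len(retrieved) if retrieved else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example
R = {1, 2, 3, 4, 5}              # relevant documents
A = {3, 4, 5, 6, 7, 8}           # retrieved documents
print(precision_recall(R, A))    # (0.5, 0.6)
```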

Ranked Output (Vector): ?
• "Truth": an ordering built by experts
• System output: a guessed ordering
• Ways to compare two rankings: ?
• Building the "truth" set is not possible or is too expensive
• So it is not used (rarely used?) in practice
• One could build the "truth" set automatically
  ◦ A research topic for us?

Ranked Output (Vector) vs. Set
• "Truth": an unordered "relevant" set
• Output: an ordered guess
• Compare an ordered set with an unordered one

... Ranked Output vs. Set (one query)
• Plot the precision vs. recall curve
• In the initial part of the list containing n% of all relevant docs, what is the precision?
  ◦ 11 standard recall levels: 0%, 10%, ..., 90%, 100%
  ◦ 0%: interpolated

... Many queries
• Average precision and recall
• Ranked output:
  ◦ Average the precision at each recall level
  ◦ To get equal (standard) recall levels, interpolate
    ▪ with 3 relevant docs, there is no exact 10% level!
    ▪ Interpolated value at level n = maximum known value between levels n and n + 1
    ▪ If none is known, use the nearest known value.
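A sketch of this interpolation in Python, assuming each query comes with a ranked list and a set of relevant documents. It uses the common rule "interpolated precision at level r = maximum precision at any recall ≥ r", a slight variant of the between-levels rule on the slide; the data is made up. Averaging over many queries then means averaging these 11 values level by level.

```python
def precision_recall_points(ranking, relevant):
    """(recall, precision) after each retrieved document in the ranking."""
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    return points

def interpolated_11pt(ranking, relevant):
    """Interpolated precision at the 11 standard recall levels 0.0 .. 1.0."""
    points = precision_recall_points(ranking, relevant)
    levels = [i / 10 for i in range(11)]
    # Precision at level r = max precision observed at any recall >= r
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# Hypothetical ranking and relevance judgments
relevant = {"d3", "d5", "d9"}
ranking = ["d1", "d3", "d4", "d5", "d2", "d9"]
print(interpolated_11pt(ranking, relevant))
```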

Precision vs. Recall Figures
• Alternative method: document cutoff values
  ◦ Precision at the first 5, 10, 15, 20, 30, 50, 100 docs
• Used to compare algorithms.
  ◦ Simple
  ◦ Intuitive
• NOT a one-value comparison!
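A small sketch of precision at document cutoff values; the ranking and judgments here are hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Precision considering only the first k retrieved documents."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Hypothetical data
relevant = {"d3", "d5", "d9"}
ranking = ["d1", "d3", "d4", "d5", "d2", "d9", "d7", "d8", "d6", "d0"]
for k in (5, 10):
    print(k, precision_at_k(ranking, relevant, k))   # 5: 0.4, 10: 0.3
```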

Which one is better?
(figure: precision vs. recall curves being compared)

Single-value summaries
• Curves are not convenient for averaging over multiple queries
• We need a single performance value for each query
  ◦ Can be averaged over several queries
  ◦ A histogram over several queries can be made
  ◦ Tables can be made
• Precision at the first relevant doc?
• Average precision at (each) seen relevant doc
  ◦ Favors systems that return several relevant docs first
• R-precision
  ◦ precision at the R-th retrieved doc (R = total number of relevant docs)
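A sketch of the last two single-value summaries. Note that "average precision at seen relevant docs" has variants: here the sum is divided by the total number of relevant documents (as in mean average precision), while another variant divides by the number of relevant documents actually retrieved. The data is made up.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each retrieved relevant
    document; relevant documents never retrieved contribute 0."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the total number of relevant documents."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r if r else 0.0

# Hypothetical data
relevant = {"d3", "d5", "d9"}
ranking = ["d3", "d1", "d5", "d9", "d2"]
print(average_precision(ranking, relevant))   # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806
print(r_precision(ranking, relevant))         # 2/3 ≈ 0.667
```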

Precision histogram
• Two algorithms: A and B. For each query, plot R(A) − R(B), the difference of their R-precision values.
• Which one is better?

Alternative measures for Boolean
• Problems with the Precision & Recall measures:
  ◦ Recall cannot be estimated on large collections
  ◦ Two values, but we need one value to compare
  ◦ Designed for batch mode, not interactive use. Informativeness!
  ◦ Designed for a linear ordering of docs (not a weak ordering)
• Alternative measures combine both in one value:
  ◦ F-measure (harmonic mean): F = 2 P R / (P + R)
  ◦ E-measure: E = 1 − (1 + b²) P R / (b² P + R), where b expresses the user's preference of recall vs. precision (b = 1 gives E = 1 − F)
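A sketch of the two combined measures, assuming precision and recall have already been computed as above; b = 1 is the balanced case.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) P R / (b^2 P + R); b encodes the user's relative
    preference between recall and precision (b = 1 gives E = 1 - F)."""
    denom = b * b * precision + recall
    return 1.0 - (1 + b * b) * precision * recall / denom if denom else 1.0

print(f_measure(0.5, 0.6))         # ≈ 0.545
print(e_measure(0.5, 0.6, b=1))    # ≈ 0.455
```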

User-oriented measures
Definitions (figure: sets of relevant, retrieved, and user-known documents)

User-oriented measures
• Coverage ratio
  ◦ high value = many of the docs the user expected were retrieved
• Novelty ratio
  ◦ high value = many new (previously unknown) relevant docs
• Relative recall: # found / # expected
• Recall effort: # expected / # examined until those are found
• Others:
  ◦ expected search length (good for weak orderings)
  ◦ satisfaction (considers only relevant docs)
  ◦ frustration (considers only non-relevant docs)
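A sketch of the coverage and novelty ratios, assuming we know which relevant documents the user already knew about; all document IDs are made up.

```python
def coverage_ratio(retrieved, relevant_known):
    """Fraction of the relevant documents known to the user that were retrieved."""
    return len(retrieved & relevant_known) / len(relevant_known) if relevant_known else 0.0

def novelty_ratio(retrieved, relevant, relevant_known):
    """Fraction of the relevant retrieved documents previously unknown to the user."""
    relevant_retrieved = retrieved & relevant
    if not relevant_retrieved:
        return 0.0
    return len(relevant_retrieved - relevant_known) / len(relevant_retrieved)

# Hypothetical data
relevant = {1, 2, 3, 4, 5, 6}
relevant_known = {1, 2, 3}        # docs the user expected to find
retrieved = {2, 3, 5, 7}
print(coverage_ratio(retrieved, relevant_known))            # 2/3
print(novelty_ratio(retrieved, relevant, relevant_known))   # 1/3
```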

Reference collections
Texts with queries and known relevant docs

TREC
• Text REtrieval Conference. Different in different years.
• Wide variety of topics. Document structure marked up.
• 6 GB. See the NIST website: available at a small cost.
• Not all relevant docs are marked!
  ◦ Pooling method: the top 100 docs in the rankings of many search engines are manually verified (sketched below)
  ◦ It has been verified that this is a good approximation to the "real" relevant set
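A sketch of the pooling idea, assuming each participating system contributes a ranked list and only documents in the pooled union are judged manually; the rankings and the pool depth here are illustrative.

```python
def build_pool(rankings, depth=100):
    """Union of the top-`depth` documents from each system's ranking.
    Only documents in this pool are judged by assessors; everything
    outside the pool is treated as non-relevant."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:depth])
    return pool

# Hypothetical system rankings
system_a = ["d1", "d7", "d3", "d9"]
system_b = ["d7", "d2", "d1", "d5"]
print(sorted(build_pool([system_a, system_b], depth=3)))   # ['d1', 'd2', 'd3', 'd7']
```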

... TREC tasks
• Ad hoc (conventional: query → answer)
• Routing (ranked filtering of a changing collection)
• Chinese ad hoc
• Filtering (changing collection; no ranking)
• Interactive (no ranking)
• NLP: does it help?
• Query transforming
  ◦ Automatic
  ◦ Manual
• Cross-language (ad hoc)
• High precision (only 10 docs in the answer)
• Spoken document retrieval (written transcripts)
• Very large corpus (ad hoc, 20 GB = 7.5 M docs)
• Query task (several query versions; does the strategy depend on them?)

... TREC evaluation
• Summary table statistics
  ◦ # of requests used in the task
  ◦ # of retrieved docs; # of relevant docs retrieved and not retrieved
• Recall-precision averages
  ◦ 11 standard points. Interpolated (and non-interpolated).
• Document-level averages
  ◦ Can also include the average R-precision value
• Average precision histogram
  ◦ By topic.
  ◦ E.g., the difference between the R-precision of this system and the average of all systems

Smaller collections
• Simpler to use
• Can include info that TREC does not
• Can be of a specialized type (e.g., include co-citations)
• Less sparse, greater overlap between queries
• Examples:
  ◦ CACM
  ◦ ISI
  ◦ there are others

CACM collection
• Communications of the ACM, 1958–1979
• 3204 articles
• Computer science
• Structure info (author, date, citations, ...)
• Stems (only title and abstract)
• Good for algorithms relying on cross-citations
  ◦ If a paper cites another one, they are related
  ◦ If two papers cite the same ones, they are related
• 52 queries with Boolean form and answer sets

ISI collection
• On information sciences
• 1460 docs
• For similarity in terms and cross-citations
• Includes:
  ◦ Stems (titles and abstracts)
  ◦ Number of cross-citations
• 35 natural-language queries with Boolean form and answer sets

Cystic Fibrosis (CF) collection
• Medical
• 1239 docs
• MEDLINE data
  ◦ keywords assigned manually!
• 100 requests
• 4 relevance judgments for each doc
  ◦ Good for studying assessor agreement
• Degrees of relevance, from 0 to 2
• Good answer-set overlap between queries
  ◦ can be used for learning from previous queries

Research issues
• Different types of interfaces; interactive systems:
  ◦ What measures to use?
  ◦ Such as informativeness

Conclusions
• Main measures: Precision & Recall
  ◦ For sets
  ◦ Rankings are evaluated through their initial subsets
• There are measures that combine them into one value
  ◦ They involve user-defined preferences
• Many (other) characteristics
  ◦ An algorithm can be good at some and bad at others
  ◦ Averages are used, but they are not always meaningful
• Reference collections with known answers exist for evaluating new algorithms

Thank you!
Till ...??