-

Philosophy of IR Evaluation

Ellen Voorhees

0 0 The Case Against the Cranfield Tradition

2 5

• Allows sufficient control of variables to increase power of comparative experiments - laboratory tests less expensive - laboratory tests more diagnostic - laboratory tests necessarily an abstraction Cranfield Tradition Assumptions

• It works! – numerous examples of techniques developed in the laboratory that improve performance in operational settings • Laboratory testing of retrieval systems first done in Cranfield II experiment (1963) – fixed document and query sets – evaluation based on relevance judgments – relevance abstracted to topical similarity • Test collections – set of documents – set of questions – relevance judgments

NIST NIST

• Relevance can be approximated by topical similarity – relevance of one doc is independent of others – all relevant documents equally desirable – user information need doesn’t change • Single set of judgments is representative of user population • Complete judgments (i.e., recall is knowable) • [Binary judgments] • Relevance judgments – vary too much to be the basis of evaluation – topical similarity is not utility – static set of judgments cannot reflect user’s changing information need • Recall is unknowable • Results on test collections are not representative of operational retrieval systems

NIST NIST

Response to Criticism

Documents

• Must be representative of real task of interest – genre – amount – diversity (subjects, style, vocabulary) – full text vs. abstract RUN A RUN B 401 401

Top 100

Pools 401 402 403 Creating Relevance Judgments Alphabetized

Docnos

• Recap

NIST NIST NIST

Using Pooling to Create Large

Test Collections

Assessors create topics. Systems are evaluated using relevance judgments. A variety of different systems retrieve the top 1000 documents for each topic. Form pools of unique

documents from all submissions which the assessors judge for relevance.

Topics

• Distinguish between statement of user need (topic) & system data structure (query) – topic gives criteria for relevance – allows for different query construction techniques

Test Collection Reliability of different approaches • Two dimensions to explore • test collections are abstractions of operational retrieval settings used to explore the relative merits of different retrieval strategies • test collections are reliable if they predict the relative worth • inconsistency: differences in relevance judgments caused by using different assessors • incompleteness: violation of assumption that all documents are judged for all test queries

NIST NIST

NIST 0.3 n o ii s c e rP0.2 e g a re v A0.1 0

Average Precision by Qrel

System

Incompleteness • Relatively new concern regarding test collection quality – early test collections were small enough to have complete judgments – current collections can have only a small portion examined for relevance for each query; portion judged is usually selected by pooling

NIST NIST NIST

Inconsistency • Most frequently cited “problem” of test collections – undeniably true that relevance is highly subjective; judgments vary by assessor and for same assessor over time ... – … but no evidence that these differences affect comparative evaluation of systems Experiment: • Given three independent sets of judgments for each of 48 TREC-4 topics • Rank the TREC-4 runs by mean average precision as evaluated using different combinations of judgments • Compute correlation among run rankings

NIST NIST NIST

Line 1 Line 2 Mean Original Union Intersection

Effect of Different Judgments • Similar highly-correlated results found using • different query sets • different evaluation measures • different groups of assessors • single opinion vs. group opinion judgments • Conclusion: comparative results are stable despite the idiosyncratic nature of relevance judgments

Incompleteness • Study by Zobel [SIGIR-98]: – Quality of relevance judgments does depend on pool depth and diversity – TREC judgments not complete • additional relevant documents distributed roughly uniformly across systems but highly skewed across topics – TREC ad hoc collections not biased against systems that do not contribute to the pools Uniques Effect on Evaluation 90 80 p uo70 r G60 y eb50 u iqn40 U r30 e bm20 u N10 0 0.0025 • For test collections, bias is much worse than incompleteness – smaller, fair judgment sets always preferable to larger, potentially-biased sets – need to carefully evaluate effects of new pool building paradigms with respect to bias introduced • Test collections are abstractions, but laboratory tests are useful nonetheless – evaluation technology is predictive (i.e., results transfer to operational settings) – relevance judgments by different assessors almost always produce the same comparative results – adequate pools allow unbiased evaluation of unjudged runs

NIST NIST NIST

Cross-language Collections • More difficult to build a cross-language collection than a monolingual collection – consistency harder to obtain • multiple assessors per topic (one per language) • must take care when comparing different language evaluations (e.g., cross run to mono baseline) – pooling harder to coordinate • need to have large, diverse pools for all languages • retrieval results are not balanced across languages • haven’t tended to get recall-oriented manual runs in crosslanguage tasks • Note the emphasis on comparative !! – absolute score of some effectiveness measure not meaningful • absolute score changes when assessor changes • query variability not accounted for • impact of collection size, generality not accounted for • theoretical maximum of 1.0 for both recall & precision not obtainable by humans – evaluation results are only comparable when they are from the same collection • a subset of a collection is a different collection • direct comparison of scores from two different TREC collections (e.g., scores from TRECs 7&8) is invalid

NIST NIST NIST