Philosophy of IR Evaluation
Ellen Voorhees
NIST

Evaluation: How well does system meet information need?
• System evaluation: how good are the document rankings?
• User-based evaluation: how satisfied is the user?

Why do system evaluation?
• Allows sufficient control of variables to increase the power of comparative experiments
  – laboratory tests less expensive
  – laboratory tests more diagnostic
  – laboratory tests necessarily an abstraction
• It works!
  – numerous examples of techniques developed in the laboratory that improve performance in operational settings

Cranfield Tradition
• Laboratory testing of retrieval systems first done in the Cranfield II experiment (1963)
  – fixed document and query sets
  – evaluation based on relevance judgments
  – relevance abstracted to topical similarity
• Test collections
  – set of documents
  – set of questions
  – relevance judgments

Cranfield Tradition Assumptions
• Relevance can be approximated by topical similarity
  – relevance of one doc is independent of others
  – all relevant documents equally desirable
  – user information need doesn't change
• Single set of judgments is representative of the user population
• Complete judgments (i.e., recall is knowable)
• [Binary judgments]

The Case Against the Cranfield Tradition
• Relevance judgments
  – vary too much to be the basis of evaluation
  – topical similarity is not utility
  – a static set of judgments cannot reflect the user's changing information need
• Recall is unknowable
• Results on test collections are not representative of operational retrieval systems

Response to Criticism
• Goal in the Cranfield tradition is to compare systems
  – gives relative scores of evaluation measures, not absolute
  – differences in relevance judgments matter only if relative measures based on those judgments change
• Realism is a concern
  – historically the concern has been collection size
  – for TREC and similar collections, the bigger concern is realism

Using Pooling to Create Large Test Collections
[Diagram: assessors create topics; a variety of different systems retrieve the top 1000 documents for each topic statement; pools of the unique documents from all submissions are formed, which the assessors judge for relevance; systems are evaluated using the relevance judgments.]

Documents
• Must be representative of the real task of interest
  – genre
  – diversity (subjects, style, vocabulary)
  – amount
  – full text vs. abstract

Topics
• Distinguish between statement of user need (topic) & system data structure (query)
  – topic gives criteria for relevance
  – allows for different query construction techniques

Creating Relevance Judgments
[Diagram: the top 100 documents from each run (RUN A, RUN B, ...) are merged into per-topic pools (topics 401, 402, 403, ...) of alphabetized docnos for the assessors to judge.]

Test Collection Reliability
• Recap
  – test collections are abstractions of operational retrieval settings used to explore the relative merits of different retrieval strategies
  – test collections are reliable if they predict the relative worth of different approaches
• Two dimensions to explore
  – inconsistency: differences in relevance judgments caused by using different assessors
  – incompleteness: violation of the assumption that all documents are judged for all test queries

Inconsistency
• Most frequently cited "problem" of test collections
  – undeniably true that relevance is highly subjective; judgments vary by assessor and for the same assessor over time ...
  – ... but no evidence that these differences affect comparative evaluation of systems

Experiment:
• Given three independent sets of judgments for each of 48 TREC-4 topics
• Rank the TREC-4 runs by mean average precision as evaluated using different combinations of judgments
• Compute correlation among the run rankings (sketched below)
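The sketch below is a minimal illustration of this kind of ranking-stability check, not the actual TREC-4 tooling: runs and qrels are assumed to be plain Python dictionaries, the function names are invented for the example, and scipy is used only for Kendall's tau.

```python
# Minimal sketch: score every run with MAP under each qrel set, then correlate
# the resulting system orderings with Kendall's tau (assumed data layouts).
from scipy.stats import kendalltau

def average_precision(ranked_docs, relevant):
    """AP for one topic: mean precision at the ranks where relevant docs appear."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """run: {topic: [doc ids in ranked order]}; qrels: {topic: set of relevant doc ids}."""
    return sum(average_precision(docs, qrels.get(t, set())) for t, docs in run.items()) / len(run)

def rank_by_map(runs, qrels):
    """Order run names by MAP, best first."""
    return sorted(runs, key=lambda name: mean_average_precision(runs[name], qrels), reverse=True)

def ranking_correlation(runs, qrels_a, qrels_b):
    """Kendall's tau between the system orderings induced by two qrel sets."""
    order_a = rank_by_map(runs, qrels_a)
    position_b = {name: i for i, name in enumerate(rank_by_map(runs, qrels_b))}
    tau, _ = kendalltau(list(range(len(order_a))), [position_b[name] for name in order_a])
    return tau

# e.g. ranking_correlation(trec4_runs, qrels_original, qrels_union)
```

A tau near 1.0 for every pair of qrel combinations is what "the rankings are highly correlated" means in the results that follow.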
Average Precision by Qrel
[Figure: average precision per system (y-axis 0 to 0.4, x-axis: System) for qrel variants labelled Original, Union, Intersection, Mean, Line 1, and Line 2.]

Effect of Different Judgments
• Similar highly-correlated results found using
  – different query sets
  – different evaluation measures
  – different groups of assessors
  – single opinion vs. group opinion judgments
• Conclusion: comparative results are stable despite the idiosyncratic nature of relevance judgments

Incompleteness
• Relatively new concern regarding test collection quality
  – early test collections were small enough to have complete judgments
  – current collections can have only a small portion examined for relevance for each query; the portion judged is usually selected by pooling

Incompleteness
• Study by Zobel [SIGIR-98]:
  – quality of relevance judgments does depend on pool depth and diversity
  – TREC judgments not complete
    • additional relevant documents distributed roughly uniformly across systems but highly skewed across topics
  – TREC ad hoc collections not biased against systems that do not contribute to the pools

Uniques Effect on Evaluation / Uniques Effect on Evaluation: Automatic Only
[Figures: per run, the difference in MAP (up to 0.05 over all runs, up to 0.0025 for automatic runs only) and the number of unique documents by group (up to 90), plotted by Run; a code sketch of this leave-out-uniques comparison follows the final slide.]

Incompleteness
• Adequate pool depth (and diversity) is important to building reliable test collections
• With such controls, large test collections are viable laboratory tools
• For test collections, bias is much worse than incompleteness
  – smaller, fair judgment sets are always preferable to larger, potentially-biased sets
  – need to carefully evaluate the effects of new pool-building paradigms with respect to the bias introduced

Cross-language Collections
• More difficult to build a cross-language collection than a monolingual collection
  – consistency harder to obtain
    • multiple assessors per topic (one per language)
    • must take care when comparing different language evaluations (e.g., cross run to mono baseline)
  – pooling harder to coordinate
    • need to have large, diverse pools for all languages
    • retrieval results are not balanced across languages
    • haven't tended to get recall-oriented manual runs in cross-language tasks

Cranfield Tradition
• Note the emphasis on comparative!!
  – absolute score of some effectiveness measure not meaningful
    • absolute score changes when the assessor changes
    • query variability not accounted for
    • impact of collection size, generality not accounted for
    • theoretical maximum of 1.0 for both recall & precision not obtainable by humans
  – evaluation results are only comparable when they are from the same collection
    • a subset of a collection is a different collection
    • direct comparison of scores from two different TREC collections (e.g., scores from TRECs 7 & 8) is invalid

Cranfield Tradition
• Test collections are abstractions, but laboratory tests are useful nonetheless
  – evaluation technology is predictive (i.e., results transfer to operational settings)
  – relevance judgments by different assessors almost always produce the same comparative results
  – adequate pools allow unbiased evaluation of unjudged runs
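The following sketch illustrates the mechanics behind the pooling diagram and the uniques-effect figures above: build per-topic pools from the top of every run, find the relevant documents only one group contributed, remove them from the qrels, and re-score that group's runs. It is a simplified sketch under assumed data layouts (run, qrel, and group dictionaries; a pool depth of 100 as in the judgment diagram), not the TREC or Zobel [SIGIR-98] code.

```python
# Sketch of pool construction and a leave-one-group-out "uniques" check
# (assumed data layouts; all names here are illustrative).
from collections import defaultdict

POOL_DEPTH = 100  # judgment pool depth, as in the pooling diagram above

def build_pools(runs):
    """runs: {run_name: {topic: [doc ids in ranked order]}} ->
    {topic: set of unique doc ids drawn from the top POOL_DEPTH of every run}."""
    pools = defaultdict(set)
    for ranking in runs.values():
        for topic, docs in ranking.items():
            pools[topic].update(docs[:POOL_DEPTH])
    return dict(pools)

def group_uniques(runs, run_group, group):
    """Docs that only `group`'s runs contributed to each topic's pool.
    run_group maps run_name -> participating group."""
    mine, others = defaultdict(set), defaultdict(set)
    for name, ranking in runs.items():
        bucket = mine if run_group[name] == group else others
        for topic, docs in ranking.items():
            bucket[topic].update(docs[:POOL_DEPTH])
    return {topic: mine[topic] - others[topic] for topic in mine}

def mean_average_precision(run, qrels):
    """MAP over the topics in `run`; qrels: {topic: set of relevant doc ids}."""
    def ap(docs, relevant):
        hits, precisions = 0, []
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0
    return sum(ap(docs, qrels.get(t, set())) for t, docs in run.items()) / len(run)

def uniques_effect(runs, qrels, run_group, group):
    """Difference in MAP for each of `group`'s runs when the relevant documents
    that only that group put into the pools are removed from the qrels."""
    removed = group_uniques(runs, run_group, group)
    reduced = {t: rel - removed.get(t, set()) for t, rel in qrels.items()}
    return {name: mean_average_precision(run, qrels) - mean_average_precision(run, reduced)
            for name, run in runs.items() if run_group[name] == group}
```

Small differences in MAP under this check are the kind of evidence cited above that adequately pooled collections are not biased against runs that did not contribute to the pools.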