Philosophy of IR Evaluation
Ellen Voorhees
NIST

Evaluation: How well does system meet information need?
• System evaluation: how good are the document rankings?
• User-based evaluation: how satisfied is the user?

Why do system evaluation?
• Allows sufficient control of variables to increase the power of comparative experiments
  – laboratory tests less expensive
  – laboratory tests more diagnostic
  – laboratory tests necessarily an abstraction
• It works!
  – numerous examples of techniques developed in the laboratory that improve performance in operational settings

Cranfield Tradition
• Laboratory testing of retrieval systems first done in the Cranfield II experiment (1963)
  – fixed document and query sets
  – evaluation based on relevance judgments
  – relevance abstracted to topical similarity
• Test collections
  – set of documents
  – set of questions
  – relevance judgments

Cranfield Tradition Assumptions
• Relevance can be approximated by topical similarity
  – relevance of one doc is independent of others
  – all relevant documents equally desirable
  – user information need doesn't change
• Single set of judgments is representative of the user population
• Complete judgments (i.e., recall is knowable)
• [Binary judgments]

The Case Against the Cranfield Tradition
• Relevance judgments
  – vary too much to be the basis of evaluation
  – topical similarity is not utility
  – a static set of judgments cannot reflect the user's changing information need
• Recall is unknowable
• Results on test collections are not representative of operational retrieval systems

Response to Criticism
• Goal in the Cranfield tradition is to compare systems
  – gives relative scores of evaluation measures, not absolute
  – differences in relevance judgments matter only if relative measures based on those judgments change
• Realism is a concern
  – historically the concern has been collection size
  – for TREC and similar collections, the bigger concern is realism

Using Pooling to Create Large Test Collections
[Diagram: assessors create topics; a variety of different systems retrieve the top 1000 documents for each topic statement; pools of the unique documents from all submissions are formed, which the assessors judge for relevance; systems are evaluated using the relevance judgments.]

Documents
• Must be representative of the real task of interest
  – genre
  – diversity (subjects, style, vocabulary)
  – amount
  – full text vs. abstract

Topics
• Distinguish between statement of user need (topic) & system data structure (query)
  – topic gives criteria for relevance
  – allows for different query construction techniques

Creating Relevance Judgments
[Diagram: the top 100 documents from each run (RUN A, RUN B, ...) are merged into per-topic pools (topics 401, 402, 403, ...) of alphabetized docnos for the assessors to judge.]

Test Collection Reliability
• Recap
  – test collections are abstractions of operational retrieval settings used to explore the relative merits of different retrieval strategies
  – test collections are reliable if they predict the relative worth of different approaches
• Two dimensions to explore
  – inconsistency: differences in relevance judgments caused by using different assessors
  – incompleteness: violation of the assumption that all documents are judged for all test queries

Inconsistency
• Most frequently cited "problem" of test collections
  – undeniably true that relevance is highly subjective; judgments vary by assessor and for the same assessor over time ...
  – ... but no evidence that these differences affect comparative evaluation of systems

Experiment:
• Given three independent sets of judgments for each of 48 TREC-4 topics
• Rank the TREC-4 runs by mean average precision as evaluated using different combinations of judgments
• Compute correlation among the run rankings (sketched below)
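The sketch below is a minimal illustration of this kind of ranking-stability check, not the actual TREC-4 tooling: runs and qrels are assumed to be plain Python dictionaries, the function names are invented for the example, and scipy is used only for Kendall's tau.

```python
# Minimal sketch: score every run with MAP under each qrel set, then correlate
# the resulting system orderings with Kendall's tau (assumed data layouts).
from scipy.stats import kendalltau

def average_precision(ranked_docs, relevant):
    """AP for one topic: mean precision at the ranks where relevant docs appear."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """run: {topic: [doc ids in ranked order]}; qrels: {topic: set of relevant doc ids}."""
    return sum(average_precision(docs, qrels.get(t, set())) for t, docs in run.items()) / len(run)

def rank_by_map(runs, qrels):
    """Order run names by MAP, best first."""
    return sorted(runs, key=lambda name: mean_average_precision(runs[name], qrels), reverse=True)

def ranking_correlation(runs, qrels_a, qrels_b):
    """Kendall's tau between the system orderings induced by two qrel sets."""
    order_a = rank_by_map(runs, qrels_a)
    position_b = {name: i for i, name in enumerate(rank_by_map(runs, qrels_b))}
    tau, _ = kendalltau(list(range(len(order_a))), [position_b[name] for name in order_a])
    return tau

# e.g. ranking_correlation(trec4_runs, qrels_original, qrels_union)
```

A tau near 1.0 for every pair of qrel combinations is what "the rankings are highly correlated" means in the results that follow.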
Average Precision by Qrel
[Figure: average precision per system (y-axis 0 to 0.4, x-axis: System) for qrel variants labelled Original, Union, Intersection, Mean, Line 1, and Line 2.]

Effect of Different Judgments
• Similar highly-correlated results found using
  – different query sets
  – different evaluation measures
  – different groups of assessors
  – single opinion vs. group opinion judgments
• Conclusion: comparative results are stable despite the idiosyncratic nature of relevance judgments

Incompleteness
• Relatively new concern regarding test collection quality
  – early test collections were small enough to have complete judgments
  – current collections can have only a small portion examined for relevance for each query; the portion judged is usually selected by pooling

Incompleteness
• Study by Zobel [SIGIR-98]:
  – quality of relevance judgments does depend on pool depth and diversity
  – TREC judgments not complete
    • additional relevant documents distributed roughly uniformly across systems but highly skewed across topics
  – TREC ad hoc collections not biased against systems that do not contribute to the pools

Uniques Effect on Evaluation / Uniques Effect on Evaluation: Automatic Only
[Figures: per run, the difference in MAP (up to 0.05 over all runs, up to 0.0025 for automatic runs only) and the number of unique documents by group (up to 90), plotted by Run; a code sketch of this leave-out-uniques comparison follows the final slide.]

Incompleteness
• Adequate pool depth (and diversity) is important to building reliable test collections
• With such controls, large test collections are viable laboratory tools
• For test collections, bias is much worse than incompleteness
  – smaller, fair judgment sets are always preferable to larger, potentially-biased sets
  – need to carefully evaluate the effects of new pool-building paradigms with respect to the bias introduced

Cross-language Collections
• More difficult to build a cross-language collection than a monolingual collection
  – consistency harder to obtain
    • multiple assessors per topic (one per language)
    • must take care when comparing different language evaluations (e.g., cross run to mono baseline)
  – pooling harder to coordinate
    • need to have large, diverse pools for all languages
    • retrieval results are not balanced across languages
    • haven't tended to get recall-oriented manual runs in cross-language tasks

Cranfield Tradition
• Note the emphasis on comparative!!
  – absolute score of some effectiveness measure not meaningful
    • absolute score changes when the assessor changes
    • query variability not accounted for
    • impact of collection size, generality not accounted for
    • theoretical maximum of 1.0 for both recall & precision not obtainable by humans
  – evaluation results are only comparable when they are from the same collection
    • a subset of a collection is a different collection
    • direct comparison of scores from two different TREC collections (e.g., scores from TRECs 7 & 8) is invalid

Cranfield Tradition
• Test collections are abstractions, but laboratory tests are useful nonetheless
  – evaluation technology is predictive (i.e., results transfer to operational settings)
  – relevance judgments by different assessors almost always produce the same comparative results
  – adequate pools allow unbiased evaluation of unjudged runs
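The following sketch illustrates the mechanics behind the pooling diagram and the uniques-effect figures above: build per-topic pools from the top of every run, find the relevant documents only one group contributed, remove them from the qrels, and re-score that group's runs. It is a simplified sketch under assumed data layouts (run, qrel, and group dictionaries; a pool depth of 100 as in the judgment diagram), not the TREC or Zobel [SIGIR-98] code.

```python
# Sketch of pool construction and a leave-one-group-out "uniques" check
# (assumed data layouts; all names here are illustrative).
from collections import defaultdict

POOL_DEPTH = 100  # judgment pool depth, as in the pooling diagram above

def build_pools(runs):
    """runs: {run_name: {topic: [doc ids in ranked order]}} ->
    {topic: set of unique doc ids drawn from the top POOL_DEPTH of every run}."""
    pools = defaultdict(set)
    for ranking in runs.values():
        for topic, docs in ranking.items():
            pools[topic].update(docs[:POOL_DEPTH])
    return dict(pools)

def group_uniques(runs, run_group, group):
    """Docs that only `group`'s runs contributed to each topic's pool.
    run_group maps run_name -> participating group."""
    mine, others = defaultdict(set), defaultdict(set)
    for name, ranking in runs.items():
        bucket = mine if run_group[name] == group else others
        for topic, docs in ranking.items():
            bucket[topic].update(docs[:POOL_DEPTH])
    return {topic: mine[topic] - others[topic] for topic in mine}

def mean_average_precision(run, qrels):
    """MAP over the topics in `run`; qrels: {topic: set of relevant doc ids}."""
    def ap(docs, relevant):
        hits, precisions = 0, []
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0
    return sum(ap(docs, qrels.get(t, set())) for t, docs in run.items()) / len(run)

def uniques_effect(runs, qrels, run_group, group):
    """Difference in MAP for each of `group`'s runs when the relevant documents
    that only that group put into the pools are removed from the qrels."""
    removed = group_uniques(runs, run_group, group)
    reduced = {t: rel - removed.get(t, set()) for t, rel in qrels.items()}
    return {name: mean_average_precision(run, qrels) - mean_average_precision(run, reduced)
            for name, run in runs.items() if run_group[name] == group}
```

Small differences in MAP under this check are the kind of evidence cited above that adequately pooled collections are not biased against runs that did not contribute to the pools.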