<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Philosophy of IR Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ellen Voorhees</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Case Against the Cranfield Tradition</institution>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>• Allows sufficient control of variables to increase power of comparative experiments
– laboratory tests less expensive
– laboratory tests more diagnostic
– laboratory tests necessarily an abstraction</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Cranfield Tradition</title>
      <p>• It works!
– numerous examples of techniques developed in the laboratory that improve performance in operational settings
• Laboratory testing of retrieval systems was first done in the Cranfield II experiment (1963)
– fixed document and query sets
– evaluation based on relevance judgments
– relevance abstracted to topical similarity
• Test collections
– set of documents
– set of questions
– relevance judgments</p>
    </sec>
    <sec id="sec-2">
      <title>NIST</title>
    </sec>
    <sec id="sec-3">
      <title>Cranfield Tradition Assumptions</title>
      <p>• Relevance can be approximated by topical similarity
– relevance of one doc is independent of others
– all relevant documents equally desirable
– user information need doesn’t change
• Single set of judgments is representative of user population
• Complete judgments (i.e., recall is knowable)
• [Binary judgments]</p>
      <p>The Case Against the Cranfield Tradition
• Relevance judgments
– vary too much to be the basis of evaluation
– topical similarity is not utility
– static set of judgments cannot reflect user’s changing information need
• Recall is unknowable
• Results on test collections are not representative of operational retrieval systems</p>
    </sec>
    <sec id="sec-5">
      <title>Response to Criticism</title>
      <sec id="sec-5-1">
        <title>Documents</title>
        <p>• Must be representative of real task of interest
– genre
– amount
– diversity (subjects, style, vocabulary)
– full text vs. abstract</p>
        <p>[Figure: Creating Relevance Judgments. The top 100 documents for each topic (e.g., 401) from each run (RUN A, RUN B) are merged into per-topic pools (401, 402, 403) of alphabetized docnos.]</p>
        <p>• Recap</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Using Pooling to Create Large Test Collections</title>
      <sec id="sec-7-1">
        <title>Assessors create topics.</title>
      </sec>
      <sec id="sec-7-2">
        <title>Systems are evaluated using relevance judgments.</title>
      </sec>
      <sec id="sec-7-3">
        <title>A variety of different systems retrieve the top 1000 documents for each topic.</title>
      </sec>
      <sec id="sec-7-4">
        <title>Form pools of unique documents from all submissions, which the assessors judge for relevance.</title>
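        <p>The pooling step can be sketched in a few lines of Python. This is a minimal illustration, not the actual TREC tooling: the function name build_pools, the run names, the toy docnos, and the depth parameter are all hypothetical. Each run contributes its top-ranked documents for a topic, duplicates are dropped, and the pool is alphabetized by docno so that its order carries no rank information.</p>
<preformat>
```python
from collections import defaultdict


def build_pools(runs, depth=100):
    """Merge the top `depth` documents of every run into one pool per topic.

    `runs` maps a run name to {topic_id: ranked list of docnos}.
    Returns {topic_id: alphabetized list of unique docnos}.
    """
    pools = defaultdict(set)
    for ranking_by_topic in runs.values():
        for topic, ranked_docnos in ranking_by_topic.items():
            # A set keeps each document once, however many runs retrieved it.
            pools[topic].update(ranked_docnos[:depth])
    # Sorting by docno removes any trace of the original ranks.
    return {topic: sorted(docnos) for topic, docnos in pools.items()}


# Toy data (assumed): two runs, two topics, pool depth 2.
runs = {
    "RUN_A": {401: ["d3", "d1", "d7"], 402: ["d2", "d9"]},
    "RUN_B": {401: ["d1", "d5", "d3"], 402: ["d9", "d4"]},
}
pools = build_pools(runs, depth=2)
print(pools)
```
</preformat>
        <p>With depth=2, document d7 never reaches the pool for topic 401, so it stays unjudged. That is the source of the incompleteness discussed later.</p>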
        <sec id="sec-7-4-1">
          <title>Topics</title>
          <p>• Distinguish between statement of user need (topic) &amp; system data structure (query)
– topic gives criteria for relevance
– allows for different query construction techniques</p>
          <p>Test Collection Reliability
• test collections are abstractions of operational retrieval settings used to explore the relative merits of different retrieval strategies
• test collections are reliable if they predict the relative worth of different approaches
• Two dimensions to explore
– inconsistency: differences in relevance judgments caused by using different assessors
– incompleteness: violation of assumption that all documents are judged for all test queries</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>NIST</title>
      <p>[Figure: Average Precision by Qrel. y-axis: Average Precision (0 to 0.3); x-axis: System.]</p>
      <p>Incompleteness
• Relatively new concern regarding test collection quality
– early test collections were small enough to have complete judgments
– current collections can have only a small portion examined for relevance for each query; portion judged is usually selected by pooling</p>
    </sec>
    <sec id="sec-11">
      <title>Inconsistency</title>
      <p>• Most frequently cited “problem” of test collections
– undeniably true that relevance is highly subjective; judgments vary by assessor and for same assessor over time …
– … but no evidence that these differences affect comparative evaluation of systems</p>
      <p>Experiment:
• Given three independent sets of judgments for each of 48 TREC-4 topics
• Rank the TREC-4 runs by mean average precision as evaluated using different combinations of judgments
• Compute correlation among run rankings</p>
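      <p>The experiment above can be illustrated with a small Python sketch. The slides do not say which correlation coefficient was computed, so Kendall's tau is an assumption here, chosen because it is a common way to compare two rankings. All function names, run names, and toy judgments are hypothetical.</p>
<preformat>
```python
from itertools import combinations


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant docnos."""
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def rank_runs(runs, qrels):
    """Order run names by mean average precision over the topics, best first."""
    def mean_ap(run):
        return sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)
    return sorted(runs, key=lambda name: mean_ap(runs[name]), reverse=True)


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    pos_b = {name: i for i, name in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):  # x precedes y in order_a
        if pos_b[x] > pos_b[y]:
            discordant += 1
        else:
            concordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Toy data (assumed): three runs, two topics, two assessors who disagree.
runs = {
    "run1": {1: ["a", "b", "c", "d"], 2: ["e", "f", "g", "h"]},
    "run2": {1: ["b", "a", "d", "c"], 2: ["f", "e", "h", "g"]},
    "run3": {1: ["d", "c", "b", "a"], 2: ["h", "g", "f", "e"]},
}
qrels_x = {1: {"a", "b"}, 2: {"e"}}
qrels_y = {1: {"a"}, 2: {"e", "f"}}

order_x = rank_runs(runs, qrels_x)
order_y = rank_runs(runs, qrels_y)
tau = kendall_tau(order_x, order_y)
print(order_x, order_y, tau)
```
</preformat>
      <p>In this toy case the two assessors disagree on individual documents, yet both judgment sets order the runs the same way, so tau is 1.0. A tau of -1.0 would mean the two rankings are exactly reversed.</p>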
    </sec>
    <sec id="sec-14">
      <title>NIST</title>
      <p>[Figure legend: Line 1, Line 2, Mean, Original, Union, Intersection]</p>
      <p>Effect of Different Judgments
• Similar highly-correlated results found using
– different query sets
– different evaluation measures
– different groups of assessors
– single opinion vs. group opinion judgments
• Conclusion: comparative results are stable despite the idiosyncratic nature of relevance judgments</p>
      <p>Incompleteness
• Study by Zobel [SIGIR-98]:
– Quality of relevance judgments does depend on pool depth and diversity
– TREC judgments not complete
  • additional relevant documents distributed roughly uniformly across systems but highly skewed across topics
– TREC ad hoc collections not biased against systems that do not contribute to the pools</p>
      <p>[Figure: Uniques Effect on Evaluation. Axis label: Number Unique by Group (0 to 90).]</p>
      <p>• For test collections, bias is much worse than incompleteness
– smaller, fair judgment sets always preferable to larger, potentially-biased sets
– need to carefully evaluate effects of new pool building paradigms with respect to bias introduced</p>
      <p>• Test collections are abstractions, but laboratory tests are useful nonetheless
– evaluation technology is predictive (i.e., results transfer to operational settings)
– relevance judgments by different assessors almost always produce the same comparative results
– adequate pools allow unbiased evaluation of unjudged runs</p>
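      <p>One way to probe whether a collection is biased against runs that did not contribute to the pools is to withhold, for one group at a time, the relevant documents that only that group added to the pool, and then re-score that group's runs. The slides do not spell out this procedure, so the Python sketch below is an assumption about how such a check could be set up. The names pool_contrib, unique_relevant, and qrels_without_group, and the toy data, are hypothetical.</p>
<preformat>
```python
def unique_relevant(pool_contrib, qrels, group):
    """Relevant docnos, per topic, that only `group` contributed to the pool.

    `pool_contrib` maps topic -> {docno: set of groups that pooled it}.
    `qrels` maps topic -> set of relevant docnos.
    """
    return {
        topic: {d for d in relevant if pool_contrib[topic].get(d) == {group}}
        for topic, relevant in qrels.items()
    }


def qrels_without_group(pool_contrib, qrels, group):
    """Judgments as they would look had `group` not contributed to the pool."""
    uniques = unique_relevant(pool_contrib, qrels, group)
    return {topic: qrels[topic] - uniques[topic] for topic in qrels}


# Toy data (assumed): one topic, two groups.
pool_contrib = {401: {"d1": {"g1", "g2"}, "d3": {"g1"}, "d5": {"g2"}}}
qrels = {401: {"d1", "d3"}}

reduced_g1 = qrels_without_group(pool_contrib, qrels, "g1")
reduced_g2 = qrels_without_group(pool_contrib, qrels, "g2")
print(reduced_g1, reduced_g2)
```
</preformat>
      <p>Scoring a group's runs against both the full and the reduced judgments, and comparing the resulting run rankings, would then show how much that group benefited from contributing to the pool.</p>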
    </sec>
    <sec id="sec-17">
      <title>Cross-language Collections</title>
      <p>• More difficult to build a cross-language collection than a monolingual collection
– consistency harder to obtain
  • multiple assessors per topic (one per language)
  • must take care when comparing different language evaluations (e.g., cross run to mono baseline)
– pooling harder to coordinate
  • need to have large, diverse pools for all languages
  • retrieval results are not balanced across languages
  • haven’t tended to get recall-oriented manual runs in cross-language tasks</p>
      <p>• Note the emphasis on comparative!
– absolute score of some effectiveness measure not meaningful
  • absolute score changes when assessor changes
  • query variability not accounted for
  • impact of collection size, generality not accounted for
  • theoretical maximum of 1.0 for both recall &amp; precision not obtainable by humans
– evaluation results are only comparable when they are from the same collection
  • a subset of a collection is a different collection
  • direct comparison of scores from two different TREC collections (e.g., scores from TRECs 7 &amp; 8) is invalid</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>