The fault, dear researchers, is not in Cranfield,
But in our metrics, that they are unrealistic

Mark D. Smucker
Department of Management Sciences
University of Waterloo, Canada
mark.smucker@uwaterloo.ca

Charles L. A. Clarke
School of Computer Science
University of Waterloo, Canada
claclark@plg.uwaterloo.ca

Presented at EuroHCIR2012. Copyright (C) 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

1. INTRODUCTION

As designers of information retrieval (IR) systems, we need some way to measure the performance of our systems. An excellent approach is to directly measure actual user performance, either in situ or in the laboratory [12]. The downside of live user involvement is the prohibitive cost if many evaluations are required. For example, it is common practice to sweep parameter settings for ranking algorithms in order to optimize retrieval metrics on a test collection. The Cranfield approach to IR evaluation provides low-cost, reusable measures of system performance.

Cranfield-style evaluation has frequently been criticized as being too divorced from the reality of how users search, but there really is nothing wrong with the approach [18]. The Cranfield approach is effectively a simulation of IR system usage that attempts to make a prediction about the performance of one system vs. another [15].

As such, we should really be thinking of the Cranfield approach as the application of models to make predictions, which is common practice in science and engineering. For example, physics has equations of motion, civil engineering has models of concrete strength, and epidemiology has models of disease spread. In all of these fields, it is well understood that the models are simplifications of reality, but that the models provide the ability to make useful predictions. Information retrieval's predictive models are our evaluation metrics.

The criticism of system-oriented IR evaluation should be redirected. The problem is not with Cranfield, which is just another name for making predictions given a model; the problem is with the metrics.

We believe that rather than criticizing Cranfield, the correct response is to develop better metrics. We should make metrics that are more predictive of human performance. We should make metrics that incorporate the user interface and realistically represent the variation in user behavior. We should make metrics that encapsulate our best understanding of search behavior.

In popular parlance, we should bring solutions, not problems, to the system-oriented IR researcher. To this end, we have developed a new evaluation metric, time-biased gain (TBG), that predicts IR system performance in human terms: the expected number of relevant documents to be found by a user [16].

2. TIME-BIASED GAIN

HCI has a long history of automated usability evaluation [10], and indeed, so does IR. Cleverdon designed the Cranfield 2 study carefully in terms of a specific type of user and how this type of user would define relevance [8, p. 9]. Taken together, a test collection (documents, topics, relevance judgments) and an evaluation metric allow for the simulation of a user with different IR systems.

Järvelin and Kekäläinen produced a significant shift in evaluation metrics with their introduction of cumulated gain-based measures [11]. The cumulated gain measures are explicitly focused on a model of a user using an IR system. As long as the user continues to search, the user can continue to increase their gain. The common notion of gain in IR evaluation is the relevant document, but gain can be anything we would like to define it to be.

Cumulated gain can be plotted vs. time to produce a gain curve and compare systems. The curve that rises higher and faster than another curve is the preferred curve. While we can plot gain curves of one system vs. another, it is well known that users do not endlessly search; different users stop their searches at different points in time for a host of reasons. Given a probability density function f(t) that gives the distribution of time spent searching, we can compute the expected gain as follows:

    E[G(t)] = \int_0^{\infty} G(t) f(t)\, dt,    (1)

where G(t) is the cumulated gain at time t. Equation 1 represents time-biased gain in its general form, i.e., time-biased gain is the expected gain for a population of users.
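To make Equation 1 concrete, the short sketch below numerically approximates the expected gain for a single hypothetical system. The step-function gain curve, the exponential distribution of time spent searching, and every constant in it are illustrative assumptions of ours, not values drawn from the TBG work; the point is only that Equation 1 is directly computable once G(t) and f(t) are specified.

import numpy as np

# Hypothetical gain curve: gain increases by 1 at each time (in seconds) that
# the simulated user finds another relevant document. The times are made up.
relevant_found_at = np.array([30.0, 95.0, 210.0, 400.0])

def cumulated_gain(t):
    # G(t): cumulated gain at time t (a step function).
    return np.searchsorted(relevant_found_at, t, side="right").astype(float)

MEAN_SESSION = 200.0  # seconds; an illustrative assumption, not a calibrated value

def session_pdf(t):
    # f(t): assumed exponential pdf of time spent searching.
    return np.exp(-t / MEAN_SESSION) / MEAN_SESSION

# Riemann-sum approximation of E[G(t)] = integral of G(t) f(t) dt (Equation 1).
t = np.linspace(0.0, 5000.0, 200001)
dt = t[1] - t[0]
expected_gain = float(np.sum(cumulated_gain(t) * session_pdf(t)) * dt)
print(f"Time-biased gain (general form): {expected_gain:.3f}")

With these made-up numbers the expected gain comes out to roughly two relevant documents, i.e., the metric is already expressed in the human terms discussed above.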
While it is natural for us to talk about cumulated gain over time, the traditional cumulated gain measures have substituted document rank for time and implicitly model a user that takes the same amount of time to evaluate each and every document. By making time a central part of our metric, we gain the ability to more accurately model behavior. For example, in a document retrieval system, longer documents will in general take users longer to evaluate, and if the retrieval system presents results with document summaries (snippets), we know that users can use summaries to speed the rate at which they find relevant information [14].

Another significant advantage of using time directly in our retrieval metric is that we now make testable predictions of human performance. Our predictions are in the same units as would be obtained as part of a user study. To our knowledge, this alignment between the units of Cranfield-style metrics and user study metrics has not previously existed.

Time-biased gain in the form of Equation 1 makes no mention of ranked lists of documents, for it is a general purpose description of users using an IR system over time. To produce a metric suitable for use in evaluating ranked lists, we followed a process common to development of new simulations [3]:

1. Creation of model.
2. Calibration of model.
3. Validation of model.

Our first step in model creation was to adopt the standard model of a user that works down a result list and move Equation 1 to a form common to cumulated gain measures:

    \sum_{k=1}^{\infty} g_k D(T(k)),    (2)

where g_k is the gain at rank k, T(k) is the expected time it takes a user to reach rank k, and D(t) is the fraction of the population that survives to time t and is called the decay function.
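To see how Equation 2 arises from Equation 1 (a gloss of our own; the step is not spelled out above), assume the user works down the ranking, reaching rank k at time T(k) and accruing gain g_k at that moment, so that G(t) is a step function:

\begin{align*}
G(t) &= \sum_{k=1}^{\infty} g_k \, \mathbf{1}[\, t \ge T(k) \,], \\
E[G(t)] &= \int_0^{\infty} G(t) f(t)\, dt
         = \sum_{k=1}^{\infty} g_k \int_{T(k)}^{\infty} f(t)\, dt
         = \sum_{k=1}^{\infty} g_k \, D(T(k)),
\end{align*}

with D(t) = \int_t^{\infty} f(t')\, dt' the fraction of the population still searching at time t, which is exactly the decay function of Equation 2.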
Our model for the time it takes a user to reach rank k, T(k), takes into consideration a hypothetical user interface that presents results to the user in the form of document summaries. A click on a document summary takes the user to the full document. We model both the probability of clicking on a summary given its NIST relevance and the probability of then judging a viewed full document as relevant. We separately model the time to view summaries and full documents. For the time spent on a full document, we modeled longer documents as taking longer, with an additional constant amount of time spent per document. We treated duplicate documents as zero-length documents. We then calibrated T(k) using data from a user study, and finally we validated that our T(k) provided a reasonable fit to the user study data. Likewise, we modeled D(t) as exponential decay fit to a search engine's log data.
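To make the shape of such a model concrete, here is a minimal sketch of a ranked-list time-biased gain computation under assumptions of our own: a fixed per-summary reading time, per-word document reading time plus a constant, click and judging probabilities conditioned on relevance, and exponential decay parameterized by a half-life. None of the constants are the calibrated values from the TBG paper [16]; they are placeholders showing where calibrated values would plug in.

import math

# Placeholder parameters; illustrative assumptions only, not calibrated values.
T_SUMMARY = 4.0                        # seconds to read one result summary
T_PER_WORD = 0.02                      # seconds of reading time per document word
T_DOC_CONSTANT = 8.0                   # constant seconds per viewed full document
P_CLICK = {True: 0.65, False: 0.40}    # P(click summary | document relevant?)
P_JUDGE_REL = 0.75                     # P(judge a viewed relevant document relevant)
HALF_LIFE = 200.0                      # seconds; half-life of the decay function

def decay(t):
    # D(t): fraction of users still searching at time t (exponential decay).
    return math.exp(-t * math.log(2) / HALF_LIFE)

def expected_time_to_rank(k, relevant, doc_words):
    # T(k): expected time to reach rank k, i.e., time spent on ranks 1..k-1.
    total = 0.0
    for i in range(k - 1):
        doc_time = T_PER_WORD * doc_words[i] + T_DOC_CONSTANT
        total += T_SUMMARY + P_CLICK[relevant[i]] * doc_time
    return total

def time_biased_gain(relevant, doc_words):
    # Equation 2: sum over ranks of expected gain discounted by the decay function.
    tbg = 0.0
    for k in range(1, len(relevant) + 1):
        # Gain accrues only if the document is relevant, its summary is clicked,
        # and the viewed document is then judged relevant.
        gain_k = (1.0 if relevant[k - 1] else 0.0) * P_CLICK[True] * P_JUDGE_REL
        tbg += gain_k * decay(expected_time_to_rank(k, relevant, doc_words))
    return tbg

# Example ranking with made-up relevance judgments and document lengths (words).
# Duplicate documents could be given a length of zero, as described above.
relevant = [True, False, True, False, False, True, False, False, True, False]
doc_words = [800, 300, 1200, 500, 400, 900, 350, 600, 1000, 450]
print(f"Time-biased gain: {time_biased_gain(relevant, doc_words):.3f}")

In the actual TBG work the analogous parameters were calibrated to user study data and the decay function was fit to search engine log data, as described above; the sketch only shows where such calibrated values and behavioral models would enter.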
In contrast, older evaluation metrics such as mean average precision [19, p. 59] cannot be calibrated and have only been validated after their creation. For example, the work of Hersh and Turpin [9] is likely the first attempt to validate a metric (average precision). Many recent metrics can be calibrated to actual user behavior [4, 5, 7, 17, 20, 21], but their calibration and validation often come after their release and adoption.

3. CONCLUSION

The Cranfield approach to IR evaluation is merely another name for the development and use of predictive models, which is a fundamental part of all science and engineering fields. In particular, IR evaluation fits nicely into the framework of simulation, where models are created, calibrated, and validated before being used to make predictions. We have presented time-biased gain as an example of what we believe is the correct direction for IR system evaluation. We are not the only ones working on better metrics or taking a simulation-based approach [2, 13], and others also consider time an important part of evaluation [1, 6].

Our position is that system-oriented IR research is user-oriented IR research, given its use of evaluation metrics that model users. If HCIR researchers can produce better models than exist today (by better, we mean more predictive of human performance), then we can help system development to focus on changes that help users better search.

4. ACKNOWLEDGMENTS

This work was supported in part by NSERC, in part by the GRAND NCE, in part by Google, in part by Amazon, in part by the facilities of SHARCNET, and in part by the University of Waterloo. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.

5. REFERENCES

[1] L. Azzopardi. Usage based effectiveness measures: Monitoring application performance in information retrieval. In CIKM, pages 631–640, 2009.
[2] L. Azzopardi, K. Järvelin, J. Kamps, and M. D. Smucker. Report on the SIGIR 2010 workshop on the simulation of interaction. SIGIR Forum, 44:35–47, January 2011.
[3] J. Banks, J. S. Carson II, B. L. Nelson, and D. M. Nicol. Discrete-Event System Simulation. Prentice Hall, 5th edition, 2010.
[4] B. Carterette, E. Kanoulas, and E. Yilmaz. Simulating simple user behavior for system effectiveness evaluation. In CIKM, pages 611–620, 2011.
[5] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM, pages 621–630, Hong Kong, 2009.
[6] M. D. Dunlop. Time, relevance and interaction modelling for information retrieval. In SIGIR, pages 206–213, 1997.
[7] G. Dupret. Discounted cumulative gain and user decision models. In SPIRE, pages 2–13, Berlin, Heidelberg, 2011. Springer-Verlag.
[8] D. Harman. Information Retrieval Evaluation. Morgan & Claypool, 2011.
[9] W. Hersh, A. Turpin, S. Price, B. Chan, D. Kramer, L. Sacherek, and D. Olson. Do batch and user evaluations give the same results? In SIGIR, pages 17–24. ACM, 2000.
[10] M. Y. Ivory and M. A. Hearst. The state of the art in automating usability evaluation of user interfaces. ACM Computing Surveys, 33(4):470–516, 2001.
[11] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. TOIS, 20(4):422–446, 2002.
[12] D. Kelly. Methods for Evaluating Interactive Information Retrieval Systems with Users, volume 3 of Foundations and Trends in Information Retrieval. 2009.
[13] H. Keskustalo, K. Järvelin, T. Sharma, and M. L. Nielsen. Test collection-based IR evaluation needs extension toward sessions: A case of extremely short queries. In AIRS, pages 63–74, 2009.
[14] R. Khan, D. Mease, and R. Patel. The impact of result abstracts on task completion time. In Workshop on Web Search Result Summarization and Presentation, WWW'09, 2009.
[15] J. Lin and M. D. Smucker. How do users find things with PubMed? Towards automatic utility evaluation with user simulations. In SIGIR, pages 19–26. ACM, 2008.
[16] M. D. Smucker and C. L. A. Clarke. Time-based calibration of effectiveness measures. In SIGIR, 2012.
[17] A. Turpin, F. Scholer, K. Järvelin, M. Wu, and J. S. Culpepper. Including summaries in system evaluation. In SIGIR, pages 508–515. ACM, 2009.
[18] E. M. Voorhees. I come not to bury Cranfield, but to praise it. In HCIR, pages 13–16, 2009.
[19] E. M. Voorhees and D. K. Harman, editors. TREC. MIT Press, 2005.
[20] E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. Expected browsing utility for web search evaluation. In CIKM, pages 1561–1564, Toronto, 2010.
[21] Y. Zhang, L. A. Park, and A. Moffat. Click-based evidence for decaying weight distributions in search effectiveness metrics. Information Retrieval, 13:46–69, February 2010.