The fault, dear researchers, is not in Cranfield,
But in our metrics, that they are unrealistic

Mark D. Smucker
Department of Management Sciences
University of Waterloo, Canada
mark.smucker@uwaterloo.ca

Charles L. A. Clarke
School of Computer Science
University of Waterloo, Canada
claclark@plg.uwaterloo.ca

Presented at EuroHCIR2012. Copyright (C) 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

1. INTRODUCTION

As designers of information retrieval (IR) systems, we need some way to measure the performance of our systems. An excellent approach is to directly measure actual user performance, either in situ or in the laboratory [12]. The downside of live user involvement is the prohibitive cost if many evaluations are required. For example, it is common practice to sweep parameter settings for ranking algorithms in order to optimize retrieval metrics on a test collection. The Cranfield approach to IR evaluation provides low-cost, reusable measures of system performance.

Cranfield-style evaluation has frequently been criticized as being too divorced from the reality of how users search, but there really is nothing wrong with the approach [18]. The Cranfield approach is effectively a simulation of IR system usage that attempts to make a prediction about the performance of one system vs. another [15].

As such, we should really be thinking of the Cranfield approach as the application of models to make predictions, which is common practice in science and engineering. For example, physics has equations of motion, civil engineering has models of concrete strength, and epidemiology has models of disease spread. In all of these fields, it is well understood that the models are simplifications of reality, but that the models provide the ability to make useful predictions. Information retrieval's predictive models are our evaluation metrics.

The criticism of system-oriented IR evaluation should be redirected. The problem is not with Cranfield, which is just another name for making predictions given a model; the problem is with the metrics.

We believe that rather than criticizing Cranfield, the correct response is to develop better metrics. We should make metrics that are more predictive of human performance. We should make metrics that incorporate the user interface and realistically represent the variation in user behavior. We should make metrics that encapsulate our best understanding of search behavior.

In popular parlance, we should bring solutions, not problems, to the system-oriented IR researcher. To this end, we have developed a new evaluation metric, time-biased gain (TBG), that predicts IR system performance in human terms: the expected number of relevant documents to be found by a user [16].

2. TIME-BIASED GAIN

HCI has a long history of automated usability evaluation [10], and indeed, so does IR. Cleverdon designed the Cranfield 2 study carefully in terms of a specific type of user and how this type of user would define relevance [8, p. 9]. Taken together, a test collection (documents, topics, relevance judgments) and an evaluation metric allow for the simulation of a user with different IR systems.

Järvelin and Kekäläinen produced a significant shift in evaluation metrics with their introduction of cumulated gain-based measures [11]. The cumulated gain measures are explicitly focused on a model of a user using an IR system. As long as the user continues to search, the user can continue to increase their gain. The common notion of gain in IR evaluation is the relevant document, but gain can be anything we would like to define it to be.

Cumulated gain can be plotted vs. time to produce a gain curve and compare systems. The curve that rises higher and faster than another curve is the preferred curve. While we can plot gain curves of one system vs. another, it is well known that users do not endlessly search; different users stop their searches at different points in time for a host of reasons. Given a probability density function f(t) that gives the distribution of time spent searching, we can compute the expected gain as follows:

    E[G(t)] = \int_0^{\infty} G(t) f(t)\, dt,    (1)

where G(t) is the cumulated gain at time t. Equation 1 represents time-biased gain in its general form, i.e., time-biased gain is the expected gain for a population of users.
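To make Equation 1 concrete, the short sketch below numerically approximates the expected gain for a single hypothetical system. The step-function gain curve, the exponential distribution of time spent searching, and every constant in it are illustrative assumptions of ours, not values drawn from the TBG work; the point is only that Equation 1 is directly computable once G(t) and f(t) are specified.

import numpy as np

# Hypothetical gain curve: gain increases by 1 at each time (in seconds) that
# the simulated user finds another relevant document. The times are made up.
relevant_found_at = np.array([30.0, 95.0, 210.0, 400.0])

def cumulated_gain(t):
    # G(t): cumulated gain at time t (a step function).
    return np.searchsorted(relevant_found_at, t, side="right").astype(float)

MEAN_SESSION = 200.0  # seconds; an illustrative assumption, not a calibrated value

def session_pdf(t):
    # f(t): assumed exponential pdf of time spent searching.
    return np.exp(-t / MEAN_SESSION) / MEAN_SESSION

# Riemann-sum approximation of E[G(t)] = integral of G(t) f(t) dt (Equation 1).
t = np.linspace(0.0, 5000.0, 200001)
dt = t[1] - t[0]
expected_gain = float(np.sum(cumulated_gain(t) * session_pdf(t)) * dt)
print(f"Time-biased gain (general form): {expected_gain:.3f}")

With these made-up numbers the expected gain comes out to roughly two relevant documents, i.e., the metric is already expressed in the human terms discussed above.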
While it is natural for us to talk about cumulated gain over time, the traditional cumulated gain measures have substituted document rank for time and implicitly model a user that takes the same amount of time to evaluate each and every document. By making time a central part of our metric, we gain the ability to more accurately model behavior. For example, in a document retrieval system, longer documents will in general take users longer to evaluate, and if the retrieval system presents results with document summaries (snippets), we know that users can use summaries to speed the rate at which they find relevant information [14].

Another significant advantage of using time directly in our retrieval metric is that we now make testable predictions of human performance. Our predictions are in the same units as would be obtained as part of a user study. To our knowledge, this alignment between the units of Cranfield-style metrics and user study metrics has not previously existed.

Time-biased gain in the form of Equation 1 makes no mention of ranked lists of documents, for it is a general purpose description of users using an IR system over time. To produce a metric suitable for use in evaluating ranked lists, we followed a process common to development of new simulations [3]:

1. Creation of model.
2. Calibration of model.
3. Validation of model.

Our first step in model creation was to adopt the standard model of a user that works down a result list and move Equation 1 to a form common to cumulated gain measures:

    \sum_{k=1}^{\infty} g_k D(T(k)),    (2)

where g_k is the gain at rank k, T(k) is the expected time it takes a user to reach rank k, and D(t) is the fraction of the population that survives to time t and is called the decay function.
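To see how Equation 2 arises from Equation 1 (a gloss of our own; the step is not spelled out above), assume the user works down the ranking, reaching rank k at time T(k) and accruing gain g_k at that moment, so that G(t) is a step function:

\begin{align*}
G(t) &= \sum_{k=1}^{\infty} g_k \, \mathbf{1}[\, t \ge T(k) \,], \\
E[G(t)] &= \int_0^{\infty} G(t) f(t)\, dt
         = \sum_{k=1}^{\infty} g_k \int_{T(k)}^{\infty} f(t)\, dt
         = \sum_{k=1}^{\infty} g_k \, D(T(k)),
\end{align*}

with D(t) = \int_t^{\infty} f(t')\, dt' the fraction of the population still searching at time t, which is exactly the decay function of Equation 2.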
Our model for the time it takes a user to reach rank k, T(k), takes into consideration a hypothetical user interface that presents results to the user in the form of document summaries. A click on a document summary takes the user to the full document. We model both the probability of clicking on a summary given its NIST relevance and the probability of then judging a viewed full document as relevant. We separately model the time to view summaries and full documents. For the time spent on a full document, we modeled longer documents as taking longer, with an additional constant amount of time spent per document. We treated duplicate documents as zero-length documents. We then calibrated T(k) using data from a user study, and finally we validated that our T(k) provided a reasonable fit to the user study data. Likewise, we modeled D(t) as exponential decay fit to a search engine's log data.
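To make the shape of such a model concrete, here is a minimal sketch of a ranked-list time-biased gain computation under assumptions of our own: a fixed per-summary reading time, per-word document reading time plus a constant, click and judging probabilities conditioned on relevance, and exponential decay parameterized by a half-life. None of the constants are the calibrated values from the TBG paper [16]; they are placeholders showing where calibrated values would plug in.

import math

# Placeholder parameters; illustrative assumptions only, not calibrated values.
T_SUMMARY = 4.0                        # seconds to read one result summary
T_PER_WORD = 0.02                      # seconds of reading time per document word
T_DOC_CONSTANT = 8.0                   # constant seconds per viewed full document
P_CLICK = {True: 0.65, False: 0.40}    # P(click summary | document relevant?)
P_JUDGE_REL = 0.75                     # P(judge a viewed relevant document relevant)
HALF_LIFE = 200.0                      # seconds; half-life of the decay function

def decay(t):
    # D(t): fraction of users still searching at time t (exponential decay).
    return math.exp(-t * math.log(2) / HALF_LIFE)

def expected_time_to_rank(k, relevant, doc_words):
    # T(k): expected time to reach rank k, i.e., time spent on ranks 1..k-1.
    total = 0.0
    for i in range(k - 1):
        doc_time = T_PER_WORD * doc_words[i] + T_DOC_CONSTANT
        total += T_SUMMARY + P_CLICK[relevant[i]] * doc_time
    return total

def time_biased_gain(relevant, doc_words):
    # Equation 2: sum over ranks of expected gain discounted by the decay function.
    tbg = 0.0
    for k in range(1, len(relevant) + 1):
        # Gain accrues only if the document is relevant, its summary is clicked,
        # and the viewed document is then judged relevant.
        gain_k = (1.0 if relevant[k - 1] else 0.0) * P_CLICK[True] * P_JUDGE_REL
        tbg += gain_k * decay(expected_time_to_rank(k, relevant, doc_words))
    return tbg

# Example ranking with made-up relevance judgments and document lengths (words).
# Duplicate documents could be given a length of zero, as described above.
relevant = [True, False, True, False, False, True, False, False, True, False]
doc_words = [800, 300, 1200, 500, 400, 900, 350, 600, 1000, 450]
print(f"Time-biased gain: {time_biased_gain(relevant, doc_words):.3f}")

In the actual TBG work the analogous parameters were calibrated to user study data and the decay function was fit to search engine log data, as described above; the sketch only shows where such calibrated values and behavioral models would enter.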
In contrast, older evaluation metrics such as mean average precision [19, p. 59] cannot be calibrated and have only been validated after their creation. For example, the work of Hersh and Turpin [9] is likely the first attempt to validate a metric (average precision). Many recent metrics can be calibrated to actual user behavior [4, 5, 7, 17, 20, 21], but their calibration and validation often come after their release and adoption.

3. CONCLUSION

The Cranfield approach to IR evaluation is merely another name for the development and use of predictive models, which is a fundamental part of all science and engineering fields. In particular, IR evaluation fits nicely into the framework of simulation, where models are created, calibrated, and validated before being used to make predictions. We have presented time-biased gain as an example of what we believe is the correct direction for IR system evaluation. We are not the only ones working on better metrics or taking a simulation-based approach [2, 13], and others also consider time an important part of evaluation [1, 6].

Our position is that system-oriented IR research is user-oriented IR research, given its use of evaluation metrics that model users. If HCIR researchers can produce better models than exist today (by better, we mean more predictive of human performance), then we can help system development to focus on changes that help users better search.

4. ACKNOWLEDGMENTS

This work was supported in part by NSERC, in part by the GRAND NCE, in part by Google, in part by Amazon, in part by the facilities of SHARCNET, and in part by the University of Waterloo. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.

5. REFERENCES

[1] L. Azzopardi. Usage based effectiveness measures: Monitoring application performance in information retrieval. In CIKM, pages 631–640, 2009.
[2] L. Azzopardi, K. Järvelin, J. Kamps, and M. D. Smucker. Report on the SIGIR 2010 workshop on the simulation of interaction. SIGIR Forum, 44:35–47, January 2011.
[3] J. Banks, J. S. Carson II, B. L. Nelson, and D. M. Nicol. Discrete-Event System Simulation. Prentice Hall, 5th edition, 2010.
[4] B. Carterette, E. Kanoulas, and E. Yilmaz. Simulating simple user behavior for system effectiveness evaluation. In CIKM, pages 611–620, 2011.
[5] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM, pages 621–630, Hong Kong, 2009.
[6] M. D. Dunlop. Time, relevance and interaction modelling for information retrieval. In SIGIR, pages 206–213, 1997.
[7] G. Dupret. Discounted cumulative gain and user decision models. In SPIRE, pages 2–13, Berlin, Heidelberg, 2011. Springer-Verlag.
[8] D. Harman. Information Retrieval Evaluation. Morgan & Claypool, 2011.
[9] W. Hersh, A. Turpin, S. Price, B. Chan, D. Kramer, L. Sacherek, and D. Olson. Do batch and user evaluations give the same results? In SIGIR, pages 17–24. ACM, 2000.
[10] M. Y. Ivory and M. A. Hearst. The state of the art in automating usability evaluation of user interfaces. ACM Computing Surveys, 33(4):470–516, 2001.
[11] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. TOIS, 20(4):422–446, 2002.
[12] D. Kelly. Methods for Evaluating Interactive Information Retrieval Systems with Users, volume 3 of Foundations and Trends in Information Retrieval. 2009.
[13] H. Keskustalo, K. Järvelin, T. Sharma, and M. L. Nielsen. Test collection-based IR evaluation needs extension toward sessions: A case of extremely short queries. In AIRS, pages 63–74, 2009.
[14] R. Khan, D. Mease, and R. Patel. The impact of result abstracts on task completion time. In Workshop on Web Search Result Summarization and Presentation, WWW'09, 2009.
[15] J. Lin and M. D. Smucker. How do users find things with PubMed? Towards automatic utility evaluation with user simulations. In SIGIR, pages 19–26. ACM, 2008.
[16] M. D. Smucker and C. L. A. Clarke. Time-based calibration of effectiveness measures. In SIGIR, 2012.
[17] A. Turpin, F. Scholer, K. Järvelin, M. Wu, and J. S. Culpepper. Including summaries in system evaluation. In SIGIR, pages 508–515. ACM, 2009.
[18] E. M. Voorhees. I come not to bury Cranfield, but to praise it. In HCIR, pages 13–16, 2009.
[19] E. M. Voorhees and D. K. Harman, editors. TREC. MIT Press, 2005.
[20] E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. Expected browsing utility for web search evaluation. In CIKM, pages 1561–1564, Toronto, 2010.
[21] Y. Zhang, L. A. Park, and A. Moffat. Click-based evidence for decaying weight distributions in search effectiveness metrics. Information Retrieval, 13:46–69, February 2010.