Reusing Historical Interaction Data for Faster Online Learning to Rank for IR (Abstract)

Katja Hofmann, Anne Schuth, Shimon Whiteson, Maarten de Rijke
k.hofmann@uva.nl, a.g.schuth@uva.nl, s.a.whiteson@uva.nl, derijke@uva.nl
ISLA, University of Amsterdam

DIR 2013, April 26, 2013, Delft, The Netherlands. Copyright remains with the authors and/or original copyright holders.

ABSTRACT

We summarize the findings from Hofmann et al. [6]. Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning" search engines that can automatically adjust to their users. With the large amount of, e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge. In this paper we investigate whether and how previously collected (historical) interaction data can be used to speed up learning in online learning to rank for IR. We devise the first two methods that can utilize historical data (1) to make feedback available during learning more reliable and (2) to preselect candidate ranking functions to be evaluated in interactions with users of the retrieval system. We evaluate both approaches on 9 learning to rank data sets and find that historical data can speed up learning, leading to substantially and significantly higher online performance. In particular, our pre-selection method proves highly effective at compensating for noise in user feedback. Our results show that historical data can be used to make online learning to rank for IR much more effective than previously possible, especially when feedback is noisy.

1. INTRODUCTION

In recent years, learning to rank methods have become popular in information retrieval (IR) as a means of tuning retrieval systems. However, most current approaches work offline, meaning that manually annotated data needs to be collected beforehand, and that, once deployed, the system cannot continue to adjust to user needs unless it is retrained with additional data. An alternative setting is online learning to rank, where the system learns directly from interactions with its users. These approaches are typically based on reinforcement learning techniques, meaning that the system tries out new ranking functions (also called rankers), and learns from feedback inferred from users' interactions with the presented rankings. In contrast to offline learning to rank approaches, online approaches do not require any initial training material, but rather automatically improve rankers while they are being used.

A main challenge that online learning to rank for IR approaches have to address is to learn as quickly as possible from the limited quality and quantity of feedback that can be inferred from user interactions. In this paper we address this challenge by proposing the first two online learning to rank algorithms that can reuse previously collected (historical) interaction data to make online learning more reliable and faster.

2. METHOD

We model online learning to rank for IR as a cycle of interactions between users and a retrieval system. Users submit queries, to which the system responds with ranked result lists. The user interacts with the result lists, and these interactions allow the search engine to update its ranking model to improve performance over time. We address the problem of learning a ranking function that generalizes over queries and documents, and assume that queries are independent of each other and of previously presented results.

Learning in this setting is implemented as stochastic gradient descent to learn a weight vector w for a linear combination of ranking features. Ranking features X encode the relationship between a query and the documents in a document collection (e.g., tf-idf, PageRank, etc.). Given a weight vector w and ranking features X, candidate documents are scored using s = wX. Sorting the documents by these scores produces a result list for the given w.

Our baseline method learns weight vectors using the dueling bandit gradient descent (DBGD, [8]) algorithm. This algorithm maintains a current best weight vector, and learns by generating candidate weight vectors that are compared to the current best. When a candidate is found to improve over the current best weight vector, the weights are updated.
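To make this learning loop concrete, the sketch below combines the linear scoring model with a DBGD-style update. It is a minimal sketch under stated assumptions, not the authors' implementation: the `interleaved_comparison` callback, the step sizes `delta` and `gamma`, and all function names are illustrative.

```python
import numpy as np

def rank(X, w):
    """Score documents with the linear model s = wX and sort by descending score."""
    scores = X.dot(w)
    return np.argsort(-scores)

def dbgd(num_features, queries, interleaved_comparison, delta=1.0, gamma=0.1):
    """Minimal sketch of a dueling bandit gradient descent (DBGD) loop.

    `interleaved_comparison(w_best, w_cand, query)` is assumed to run one live
    interleaved comparison and return True if the candidate ranker wins.
    """
    w_best = np.zeros(num_features)  # current best weight vector
    for query in queries:
        # Propose a candidate by perturbing the current best weights
        # in a uniformly random direction with exploration step delta.
        u = np.random.randn(num_features)
        u /= np.linalg.norm(u)
        w_cand = w_best + delta * u

        # Compare candidate and current best on a live query; the outcome
        # is inferred from clicks on the interleaved result list.
        if interleaved_comparison(w_best, w_cand, query):
            # The candidate wins: move the current best by step size gamma.
            w_best = w_best + gamma * u
    return w_best
```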
User feedback is interpreted using interleaved comparison methods [7]. These methods can infer unbiased relative feedback about ranker quality from implicit feedback, such as user clicks. In particular, they combine the result lists produced by the two rankers into one result ranking, which is then shown to the user. Clicks on the documents contributed by each ranker can then be interpreted as votes for that ranker. Our baseline interleaved comparison methods are Balanced Interleave (BI) and Team Draft (TD) [7]. Our extensions for reusing historical data are enabled by Probabilistic Interleave (PI) [4].

Based on DBGD and PI, we can now define our two approaches for reusing historical data to speed up online learning to rank.

Reliable Historical Comparison (RHC). RHC is based on the intuition that repeating comparisons on historical data should provide additional information to complement live comparisons, which can make estimates of relative performance more reliable. This is expected to reduce noise and lead to faster learning. Reusing historical interaction data for additional comparisons is possible using PI, but estimates may be biased. To remove bias, we use importance sampling as proposed in [5]. We combine the resulting historical estimates with the original live estimate using the Graybill-Deal estimator [2]. This combined estimator weights the two estimates by the ratio of their variances.
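To illustrate how a live estimate and an importance-sampled historical estimate could be combined, the sketch below implements the Graybill-Deal combination: each estimate is weighted by the inverse of its estimated variance, so the noisier source contributes less. The function name, variable names, and toy data are assumptions made for illustration; computing unbiased historical estimates via importance sampling follows [5] and is not shown here.

```python
import numpy as np

def graybill_deal(mean_live, var_live, mean_hist, var_hist):
    """Combine two unbiased estimates of the same comparison outcome.

    Each estimate is weighted by the inverse of its estimated variance
    (equivalently, the estimates are weighted by the ratio of their variances).
    """
    w_live = 1.0 / var_live
    w_hist = 1.0 / var_hist
    return (w_live * mean_live + w_hist * mean_hist) / (w_live + w_hist)

# Toy example: a few noisy live comparison outcomes and a larger set of
# importance-weighted outcomes replayed from historical interaction data.
live_outcomes = np.array([1.0, -1.0, 1.0])
hist_outcomes = np.array([0.4, 0.6, 0.5, 0.3, 0.7])

combined = graybill_deal(
    live_outcomes.mean(), live_outcomes.var(ddof=1) / len(live_outcomes),
    hist_outcomes.mean(), hist_outcomes.var(ddof=1) / len(hist_outcomes),
)
```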
Candidate Pre-Selection (CPS). Our second approach for reusing historical data to speed up online learning to rank for IR uses historical data to improve candidate generation. Instead of randomly generating a candidate ranker to test in each comparison, it generates a pool of candidate rankers, and selects the most promising one using historical data. We hypothesize that historical data can be used to identify promising rankers, and that the increased quality of candidate rankers can speed up learning.
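A minimal sketch of this pre-selection step is shown below, under the following assumptions: a hypothetical `estimate_on_history` function replays logged interaction data (e.g., using probabilistic interleaving with importance sampling) and returns an estimate of how a candidate compares to the current best ranker, and the pool size and step size are arbitrary illustrative values rather than the settings used in [6].

```python
import numpy as np

def preselect_candidate(w_best, history, estimate_on_history,
                        pool_size=6, delta=1.0):
    """Sketch of candidate pre-selection (CPS).

    Instead of testing a single random candidate, generate a pool of perturbed
    weight vectors and use historical interaction data to pick the most
    promising one for the next live comparison.
    """
    pool = []
    for _ in range(pool_size):
        u = np.random.randn(len(w_best))
        u /= np.linalg.norm(u)
        pool.append((w_best + delta * u, u))

    # Score each candidate against the current best on logged interactions;
    # higher estimates are assumed to indicate more promising candidates.
    scored = [(estimate_on_history(w_best, w_cand, history), w_cand, u)
              for w_cand, u in pool]
    _, w_cand, u = max(scored, key=lambda item: item[0])
    return w_cand, u
```

In this sketch, the selected candidate is then compared to the current best in a live interleaved comparison as before, so CPS only changes how candidates are generated, not how they are evaluated.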
3. EXPERIMENTS AND RESULTS

Our experiments are designed to investigate whether online learning to rank for IR can be sped up by using historical data. They are based on an existing simulation framework, which combines fully annotated learning to rank data sets with probabilistic user models to simulate user interactions with a search engine that learns online.

We conduct our experiments on the 9 data sets provided as LETOR 3.0 and 4.0. These data sets implement retrieval tasks that range from navigational (e.g., home page finding) to informational (e.g., literature search). They range in size from 50 to 1700 queries, 45 to 64 features, and up to 1000 judged documents per topic. For each data set, we simulate user queries by uniform sampling from the provided queries. After the retrieval system returns a ranked result list, user feedback is generated using the Dependent Click Model (DCM) [3], an extension of the Cascade Model [1] that has been shown to be effective in explaining users' click behavior in web search. We instantiate the user model with three levels of noise. The perfect click model provides reliable feedback. The navigational and informational models reflect two types of search tasks. Our experiments compare and contrast three baseline runs (BI: balanced interleave, TD: team draft, and PI: probabilistic interleave) and our proposed methods for reusing historical interaction data, RHC and CPS.

Over all data sets, we find that the performance of the baseline methods substantially degrades with noise, as expected. Comparing the performance of these baseline methods to that of RHC and CPS answers our research question of whether reusing historical interaction data can compensate for this noise.

Performance for our method RHC confirms our hypothesis. Under perfect user feedback, the method's performance is equivalent to that of the baseline methods that use live data only. However, its relative performance improves with increased noise. Under the informational click model, the method performs significantly better than the best baseline method on five of the nine data sets. Performance is still equivalent on two data sets, and decreases on the remaining two.

Best performance under all click models is achieved by our CPS method. While we expected performance improvements under noisy click feedback, this method achieves significant improvements over the baseline methods even when click feedback is perfect. We attribute this improvement to the more exhaustive local exploration enabled by this approach. Performance improvements are highest under noisy feedback. An example is shown in Figure 1. This graph shows the offline performance in terms of NDCG on the held-out test folds for the data set NP2003 over the number of iterations (queries). We see that the baseline methods BI, TD, and PI learn slowly as the amount of available feedback increases. With RHC, learning is significantly and substantially faster, because complementing comparisons with historical data makes feedback for learning more reliable. Finally, CPS is able to compensate for most of the noise in user feedback, leading to significantly faster learning.

[Figure 1: Offline performance in NDCG (vertical axis, computed on held-out test queries after each learning step) on the NP2003 data set, for the informational click model over 1K queries. Curves are shown for BI, TD, PI, RHC, and CPS.]

4. CONCLUSION

In this paper, we investigated whether and how historical data can be reused to speed up online learning to rank for IR. We proposed the first two online learning to rank approaches that can reuse historical interaction data. RHC uses historical interaction data to make feedback inferred from user interactions more reliable. CPS uses this data to preselect candidate rankers so that the quality of the rankers compared in live interactions is improved.

We found that both proposed methods can improve the reliability of online learning to rank for IR under noisy user feedback. Best performance was observed using the CPS method, which can outperform all other methods significantly and substantially under all levels of noise. Performance gains of CPS were particularly high when click feedback was noisy. This result demonstrates that CPS is effective in compensating for noise in click feedback.

This work is the first to show that historical data can be used to significantly and substantially improve online performance in online learning to rank for IR. These methods are expected to make online learning with noisy feedback more reliable and therefore more widely applicable.

Acknowledgements. This research was partially supported by the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe project), the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 727.011.005, 612.001.116, HOR-11-10, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the Service Innovation & ICT program, the WAHSP and BILAND projects funded by the CLARIN-nl program, the Dutch national program COMMIT, by the ESF Research Network Program ELIAS, and the Elite Network Shifts project funded by the Royal Dutch Academy of Sciences.

References

[1] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In WSDM '08, pages 87–94, 2008.
[2] F. Graybill and R. Deal. Combining unbiased estimators. Biometrics, 15(4):543–550, 1959.
[3] F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search. In WSDM '09, pages 124–131, 2009.
[4] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. In CIKM '11, pages 249–258, 2011.
[5] K. Hofmann, S. Whiteson, and M. de Rijke. Estimating interleaved comparison outcomes from historical click data. In CIKM '12, 2012.
[6] K. Hofmann, A. Schuth, S. Whiteson, and M. de Rijke. Reusing historical interaction data for faster online learning to rank for IR. In WSDM '13, pages 183–192, 2013.
[7] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In CIKM '08, 2008.
[8] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML '09, pages 1201–1208, 2009.