Reusing Historical Interaction Data for Faster Online Learning to Rank for IR (Abstract)

Katja Hofmann, Anne Schuth, Shimon Whiteson, Maarten de Rijke
k.hofmann@uva.nl, a.g.schuth@uva.nl, s.a.whiteson@uva.nl, derijke@uva.nl
ISLA, University of Amsterdam

DIR 2013, April 26, 2013, Delft, The Netherlands. Copyright remains with the authors and/or original copyright holders.

ABSTRACT

We summarize the findings from Hofmann et al. [6]. Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning" search engines that can automatically adjust to their users. With the large amount of, e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge. In this paper we investigate whether and how previously collected (historical) interaction data can be used to speed up learning in online learning to rank for IR. We devise the first two methods that can utilize historical data (1) to make feedback available during learning more reliable and (2) to preselect candidate ranking functions to be evaluated in interactions with users of the retrieval system. We evaluate both approaches on 9 learning to rank data sets and find that historical data can speed up learning, leading to substantially and significantly higher online performance. In particular, our pre-selection method proves highly effective at compensating for noise in user feedback. Our results show that historical data can be used to make online learning to rank for IR much more effective than previously possible, especially when feedback is noisy.

1. INTRODUCTION

In recent years, learning to rank methods have become popular in information retrieval (IR) as a means of tuning retrieval systems. However, most current approaches work offline, meaning that manually annotated data needs to be collected beforehand, and that, once deployed, the system cannot continue to adjust to user needs unless it is retrained with additional data. An alternative setting is online learning to rank, where the system learns directly from interactions with its users. These approaches are typically based on reinforcement learning techniques, meaning that the system tries out new ranking functions (also called rankers), and learns from feedback inferred from users' interactions with the presented rankings. In contrast to offline learning to rank approaches, online approaches do not require any initial training material, but rather automatically improve rankers while they are being used.

A main challenge that online learning to rank for IR approaches have to address is to learn as quickly as possible from the limited quality and quantity of feedback that can be inferred from user interactions. In this paper we address this challenge by proposing the first two online learning to rank algorithms that can reuse previously collected (historical) interaction data to make online learning more reliable and faster.

2. METHOD

We model online learning to rank for IR as a cycle of interactions between users and a retrieval system. Users submit queries, to which the system responds with ranked result lists. The user interacts with the result lists, and these interactions allow the search engine to update its ranking model to improve performance over time. We address the problem of learning a ranking function that generalizes over queries and documents, and assume that queries are independent of each other and of previously presented results.

Learning in this setting is implemented as stochastic gradient descent to learn a weight vector w for a linear combination of ranking features. Ranking features X encode the relationship between a query and the documents in a document collection (e.g., tf-idf, PageRank, etc.). Given a weight vector w and ranking features X, candidate documents are scored using s = wX. Sorting the documents by these scores produces a result list for the given w.

Our baseline method learns weight vectors using the dueling bandit gradient descent (DBGD, [8]) algorithm. This algorithm maintains a current best weight vector, and learns by generating candidate weight vectors that are compared to the current best. When a candidate is found to improve over the current best weight vector, the weights are updated.
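To make this learning loop concrete, the sketch below combines the linear scoring model with a DBGD-style update. It is a minimal sketch under stated assumptions, not the authors' implementation: the `interleaved_comparison` callback, the step sizes `delta` and `gamma`, and all function names are illustrative.

```python
import numpy as np

def rank(X, w):
    """Score documents with the linear model s = wX and sort by descending score."""
    scores = X.dot(w)
    return np.argsort(-scores)

def dbgd(num_features, queries, interleaved_comparison, delta=1.0, gamma=0.1):
    """Minimal sketch of a dueling bandit gradient descent (DBGD) loop.

    `interleaved_comparison(w_best, w_cand, query)` is assumed to run one live
    interleaved comparison and return True if the candidate ranker wins.
    """
    w_best = np.zeros(num_features)  # current best weight vector
    for query in queries:
        # Propose a candidate by perturbing the current best weights
        # in a uniformly random direction with exploration step delta.
        u = np.random.randn(num_features)
        u /= np.linalg.norm(u)
        w_cand = w_best + delta * u

        # Compare candidate and current best on a live query; the outcome
        # is inferred from clicks on the interleaved result list.
        if interleaved_comparison(w_best, w_cand, query):
            # The candidate wins: move the current best by step size gamma.
            w_best = w_best + gamma * u
    return w_best
```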
User feedback is interpreted using interleaved comparison methods [7]. These methods can infer unbiased relative feedback about ranker quality from implicit feedback, such as user clicks. In particular, they combine the result lists produced by the two rankers into one result ranking, which is then shown to the user. Clicks on the documents contributed by each ranker can then be interpreted as votes for that ranker. Our baseline interleaved comparison methods are Balanced Interleave (BI) and Team Draft (TD) [7]. Our extensions for reusing historical data are enabled by Probabilistic Interleave (PI) [4].

Based on DBGD and PI, we can now define our two approaches for reusing historical data to speed up online learning to rank.

Reliable Historical Comparison (RHC). RHC is based on the intuition that repeating comparisons on historical data should provide additional information to complement live comparisons, which can make estimates of relative performance more reliable. This is expected to reduce noise and lead to faster learning. Reusing historical interaction data for additional comparisons is possible using PI, but estimates may be biased. To remove bias, we use importance sampling as proposed in [5]. We combine the resulting historical estimates with the original live estimate using the Graybill-Deal estimator [2]. This combined estimator weights the two estimates by the ratio of their variances.
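To illustrate how a live estimate and an importance-sampled historical estimate could be combined, the sketch below implements the Graybill-Deal combination: each estimate is weighted by the inverse of its estimated variance, so the noisier source contributes less. The function name, variable names, and toy data are assumptions made for illustration; computing unbiased historical estimates via importance sampling follows [5] and is not shown here.

```python
import numpy as np

def graybill_deal(mean_live, var_live, mean_hist, var_hist):
    """Combine two unbiased estimates of the same comparison outcome.

    Each estimate is weighted by the inverse of its estimated variance
    (equivalently, the estimates are weighted by the ratio of their variances).
    """
    w_live = 1.0 / var_live
    w_hist = 1.0 / var_hist
    return (w_live * mean_live + w_hist * mean_hist) / (w_live + w_hist)

# Toy example: a few noisy live comparison outcomes and a larger set of
# importance-weighted outcomes replayed from historical interaction data.
live_outcomes = np.array([1.0, -1.0, 1.0])
hist_outcomes = np.array([0.4, 0.6, 0.5, 0.3, 0.7])

combined = graybill_deal(
    live_outcomes.mean(), live_outcomes.var(ddof=1) / len(live_outcomes),
    hist_outcomes.mean(), hist_outcomes.var(ddof=1) / len(hist_outcomes),
)
```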
Candidate Pre-Selection (CPS). Our second approach for reusing historical data to speed up online learning to rank for IR uses historical data to improve candidate generation. Instead of randomly generating a candidate ranker to test in each comparison, it generates a pool of candidate rankers, and selects the most promising one using historical data. We hypothesize that historical data can be used to identify promising rankers, and that the increased quality of candidate rankers can speed up learning.
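A minimal sketch of this pre-selection step is shown below, under the following assumptions: a hypothetical `estimate_on_history` function replays logged interaction data (e.g., using probabilistic interleaving with importance sampling) and returns an estimate of how a candidate compares to the current best ranker, and the pool size and step size are arbitrary illustrative values rather than the settings used in [6].

```python
import numpy as np

def preselect_candidate(w_best, history, estimate_on_history,
                        pool_size=6, delta=1.0):
    """Sketch of candidate pre-selection (CPS).

    Instead of testing a single random candidate, generate a pool of perturbed
    weight vectors and use historical interaction data to pick the most
    promising one for the next live comparison.
    """
    pool = []
    for _ in range(pool_size):
        u = np.random.randn(len(w_best))
        u /= np.linalg.norm(u)
        pool.append((w_best + delta * u, u))

    # Score each candidate against the current best on logged interactions;
    # higher estimates are assumed to indicate more promising candidates.
    scored = [(estimate_on_history(w_best, w_cand, history), w_cand, u)
              for w_cand, u in pool]
    _, w_cand, u = max(scored, key=lambda item: item[0])
    return w_cand, u
```

In this sketch, the selected candidate is then compared to the current best in a live interleaved comparison as before, so CPS only changes how candidates are generated, not how they are evaluated.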
3. EXPERIMENTS AND RESULTS

Our experiments are designed to investigate whether online learning to rank for IR can be sped up by using historical data. They are based on an existing simulation framework, which combines fully annotated learning to rank data sets with probabilistic user models to simulate user interactions with a search engine that learns online.

We conduct our experiments on the 9 data sets provided as LETOR 3.0 and 4.0. These data sets implement retrieval tasks that range from navigational (e.g., home page finding) to informational (e.g., literature search). They range in size from 50 to 1700 queries, 45 to 64 features, and up to 1000 judged documents per topic. For each data set, we simulate user queries by uniform sampling from the provided queries. After the retrieval system returns a ranked result list, user feedback is generated using the Dependent Click Model (DCM) [3], an extension of the Cascade Model [1] that has been shown to be effective in explaining users' click behavior in web search. We instantiate the user model with three levels of noise. The perfect click model provides reliable feedback. The navigational and informational models reflect two types of search tasks. Our experiments compare and contrast three baseline runs (BI: balanced interleave, TD: team draft, and PI: probabilistic interleave) and our proposed methods for reusing historical interaction data, RHC and CPS.

Over all data sets, we find that the performance of the baseline methods substantially degrades with noise, as expected. Comparing the performance of these baseline methods to that of RHC and CPS answers our research question of whether reusing historical interaction data can compensate for this noise.

Performance for our method RHC confirms our hypothesis. Under perfect user feedback, the method's performance is equivalent to that of the baseline methods that use live data only. However, its relative performance improves with increased noise. Under the informational click model, the method performs significantly better than the best baseline method on five of the nine data sets. Performance is still equivalent on two data sets, and decreases on the remaining two.

Best performance under all click models is achieved by our CPS method. While we expected performance improvements under noisy click feedback, this method achieves significant improvements over the baseline methods even when click feedback is perfect. We attribute this improvement to the more exhaustive local exploration enabled by this approach. Performance improvements are highest under noisy feedback. An example is shown in Figure 1. This graph shows the offline performance in terms of NDCG on the held-out test folds for the data set NP2003 over the number of iterations (queries). We see that the baseline methods BI, TD, and PI learn slowly as the amount of available feedback increases. With RHC, learning is significantly and substantially faster, because complementing comparisons with historical data makes feedback for learning more reliable. Finally, CPS is able to compensate for most of the noise in user feedback, leading to significantly faster learning.

[Figure 1: Offline performance in NDCG (vertical axis, computed on held-out test queries after each learning step) on the NP2003 data set, for the informational click model over 1K queries. Curves are shown for BI, TD, PI, RHC, and CPS.]

4. CONCLUSION

In this paper, we investigated whether and how historical data can be reused to speed up online learning to rank for IR. We proposed the first two online learning to rank approaches that can reuse historical interaction data. RHC uses historical interaction data to make feedback inferred from user interactions more reliable. CPS uses this data to preselect candidate rankers so that the quality of the rankers compared in live interactions is improved.

We found that both proposed methods can improve the reliability of online learning to rank for IR under noisy user feedback. Best performance was observed using the CPS method, which can outperform all other methods significantly and substantially under all levels of noise. Performance gains of CPS were particularly high when click feedback was noisy. This result demonstrates that CPS is effective in compensating for noise in click feedback.

This work is the first to show that historical data can be used to significantly and substantially improve online performance in online learning to rank for IR. These methods are expected to make online learning with noisy feedback more reliable and therefore more widely applicable.

Acknowledgements. This research was partially supported by the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements nr 258191 (PROMISE Network of Excellence) and 288024 (LiMoSINe project), the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 727.011.005, 612.001.116, HOR-11-10, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the Service Innovation & ICT program, the WAHSP and BILAND projects funded by the CLARIN-nl program, the Dutch national program COMMIT, by the ESF Research Network Program ELIAS, and the Elite Network Shifts project funded by the Royal Dutch Academy of Sciences.

References

[1] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In WSDM '08, pages 87–94, 2008.
[2] F. Graybill and R. Deal. Combining unbiased estimators. Biometrics, 15(4):543–550, 1959.
[3] F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search. In WSDM '09, pages 124–131, 2009.
[4] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. In CIKM '11, pages 249–258, 2011.
[5] K. Hofmann, S. Whiteson, and M. de Rijke. Estimating interleaved comparison outcomes from historical click data. In CIKM '12, 2012.
[6] K. Hofmann, A. Schuth, S. Whiteson, and M. de Rijke. Reusing historical interaction data for faster online learning to rank for IR. In WSDM '13, pages 183–192, 2013.
[7] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In CIKM '08, 2008.
[8] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML '09, pages 1201–1208, 2009.