<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Proceedings of the First International Workshop on LEARning Next gEneration Rankers, Amsterdam, The Netherlands, October 2017</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multileaving for online evaluation of rankers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Brian Brost</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Copenhagen Department of Computer Science</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1</volume>
      <issue>2017</issue>
      <abstract>
        <p>In online learning to rank we are faced with a tradeoff between exploring new, potentially superior rankers, and exploiting our pre-existing knowledge of which rankers have performed well in the past. Multileaving methods offer an attractive approach to this problem since they can efficiently use online feedback to simultaneously evaluate a potentially arbitrary number of rankers. In this talk we discuss some of the main challenges in multileaving and outline promising areas for future research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        Online evaluation of rankers is the evaluation of rankers in a fully
functioning system based on implicit measurement of real users’
experiences of the system in a natural usage environment [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        By evaluating in a natural usage environment, i.e. based on how
people use the system in their day-to-day lives, we avoid a key
problem of offline evaluation methods, namely that they can only
approximate a real user’s feedback, and we avoid the distortions
that can arise from the less natural usage environment of a
lab-based study. Furthermore, user behaviour can be logged easily,
with no additional effort from the user. This provides online
evaluation methods with inexpensive access to large amounts of
timely training data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. On the other hand, the implicit
measurement of feedback, meaning the logging of clicks and other
user behaviours, is noisy and difficult to interpret. As a result, these
large amounts of training data are necessary to reliably infer quality
differences between rankers. A click during a web search session
can be a mistake, and even when it is not, it cannot be interpreted
as an absolute signal of quality. Instead, a click on a given document
may only support the relative judgement that this document was
more useful than the other documents inspected by the user.
      </p>
      <p>
        One of the key drawbacks of online evaluation methods is that
the outputs of new, potentially poor, rankers are presented to
actual users. If a new ranker is poor, users will be presented with
poor results and, in the worst case, might abandon the service [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Conversely, if new rankers are never presented, there is a risk of
overlooking better rankers in the pool of rankers. In online learning,
the question of determining a proper level of exploration is known as
the exploration-exploitation tradeoff. The problem of managing this
tradeoff for multileaving methods was first addressed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
        <fn>
          <p>∗Also with Spotify Research.</p>
        </fn>
      </p>
      <p>
        The gold standard for online evaluation of rankers is A/B testing
of the rankers on separate random subsets of the users or queries
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A/B testing allows for rankers to be compared on real users,
according to the exact, specific use case that the experimenter wishes
to examine, and according to the exact metric by which the
experimenter measures success. The primary cost associated with A/B
testing is the number of user impressions that are required to
reliably distinguish performance. Since we can measure exactly what
we want with A/B testing, the goal of alternative online evaluation
methods should be to replicate the expected outcomes of A/B tests,
while requiring fewer user impressions than A/B testing.
      </p>
      <p>
        In online evaluation, it is often easier for users to make relative
judgements than absolute judgements. For example, it is easier for a
user to say that document A is more relevant for a certain query than
document B, than to say how relevant each document is. This intuition
partly motivates the introduction of interleaving as a method to
compare rankers. Interleaving methods compare pairs of rankers in two
stages: first, the ranked lists produced by the two rankers are
combined into a single ranked list, which is displayed to the user;
second, the method infers which ranker is better from implicit
feedback, e.g. clicks, collected from the user. This approach has the
benefit that the comparison is carried out on the same user,
eliminating the between-user variance which would affect a comparison
between rankers A and B on separate users. Interleaving methods were
found to require 1-2 orders of magnitude less interaction data than
absolute metrics to detect even small differences in retrieval
quality [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Additionally, it has
been shown that the credit inference stage of interleaving methods
can be tuned so that their outcomes agree well with the relative
outcomes of A/B testing [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
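      <p>The two stages can be made concrete with a small sketch. The
following illustrates team-draft interleaving, one common instantiation
of this scheme, using hypothetical document identifiers; it is a minimal
sketch, not the exact algorithm of any of the cited papers.</p>
      <preformat>
```python
import random

def team_draft_interleave(rank_a, rank_b, length, rng=random):
    """Stage one: combine two ranked lists into one interleaved list.

    Team-draft interleaving: when team sizes are equal, a coin flip
    decides which ranker picks next; the picking ranker contributes its
    highest-ranked document not yet shown, and that document joins the
    ranker's "team". Assumes length does not exceed the number of
    distinct documents across both lists.
    """
    interleaved = []
    team_a, team_b = set(), set()
    while len(interleaved) != length:
        if len(team_b) > len(team_a):
            pick_a = True
        elif len(team_a) > len(team_b):
            pick_a = False
        else:
            pick_a = rng.random() > 0.5  # coin flip on equal team sizes
        source, team = (rank_a, team_a) if pick_a else (rank_b, team_b)
        doc = next(d for d in source if d not in interleaved)
        interleaved.append(doc)
        team.add(doc)
    return interleaved, team_a, team_b

def infer_preference(clicked, team_a, team_b):
    """Stage two: credit each ranker with clicks on its team's documents;
    the ranker whose team collected more clicks wins the impression."""
    credit_a = sum(1 for d in clicked if d in team_a)
    credit_b = sum(1 for d in clicked if d in team_b)
    if credit_a > credit_b:
        return "A"
    if credit_b > credit_a:
        return "B"
    return "tie"

shown, team_a, team_b = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d2"], 4, random.Random(0))
outcome = infer_preference(["d3", "d5"], team_a, team_b)  # the user's clicks
```
      </preformat>
      <p>Aggregating such per-impression outcomes over many users yields the
inferred pairwise preference between the two rankers.</p>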
      <p>
        Multileaving is a generalisation of interleaving that allows more
than two rankers to be simultaneously compared [
        <xref ref-type="bibr" rid="ref2 ref7 ref9">2, 7, 9</xref>
        ]. In this
case, K &gt; 2 rankers are compared by creating a new ranked
results list that consists of documents selected from the documents
retrieved by the K rankers and then inferring based on the user’s
clicks how good each ranker is. Multileaving has been shown to
use click feedback more efficiently than interleaving [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
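      <p>A minimal sketch of how such a multileaved comparison might be
constructed, assuming a simple round-robin generalisation of team-draft
interleaving to K rankers (an illustrative choice; the cited multileaving
papers use their own, more refined sampling and credit schemes):</p>
      <preformat>
```python
import random

def multileave(rankings, length, rng=random):
    """Combine K ranked lists into one multileaved list.

    Round-robin sketch: shuffle the ranker order, then let each ranker
    in turn contribute its best document not yet in the combined list,
    remembering which ranker contributed which document ("teams").
    Assumes length does not exceed the number of distinct documents.
    """
    combined = []
    teams = [set() for _ in rankings]
    order = list(range(len(rankings)))
    while len(combined) != length:
        rng.shuffle(order)  # fresh random ranker order each round
        for i in order:
            if len(combined) == length:
                break
            doc = next((d for d in rankings[i] if d not in combined), None)
            if doc is not None:
                combined.append(doc)
                teams[i].add(doc)
    return combined, teams

def assign_credit(clicked, teams):
    """Credit each ranker with the clicks on documents it contributed."""
    return [sum(1 for d in clicked if d in team) for team in teams]

rankings = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d2", "d3"]]
combined, teams = multileave(rankings, 5, random.Random(1))
credits = assign_credit(["d2", "d5"], teams)  # per-ranker click credit
```
      </preformat>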
      <p>Like interleaving, multileaving methods have two distinct stages:
the first stage samples the documents to be displayed to the user,
and the second stage assigns credit to the rankers based on the
user’s clicks. The sampling stage is often a straightforward
generalization of those proposed in the interleaving literature; for
example, one method is to randomly order the rankers and then, in
turns, sample the top remaining document from each ranker. These
sampling strategies are not uniform, i.e. some documents are much
more likely to be sampled than others. The expected outcome of
the credit assignment stage is affected by the probabilities with
which documents are sampled during the first stage. Specifically, the
ranker quality estimates in multileaving methods are skewed by
artefacts of the sampling process, and this can cause substantial
errors in multileaved estimates of ranker quality.</p>
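      <p>This non-uniformity is easy to see in simulation. The following toy
experiment (hypothetical rankers and documents, not a result from the
paper) repeatedly draws a round-robin multileaved list and counts how
often each document is shown:</p>
      <preformat>
```python
import random
from collections import Counter

def sample_multileaved(rankings, length, rng):
    """Draw one multileaved list via round-robin over a random ranker order."""
    combined = []
    order = list(range(len(rankings)))
    while len(combined) != length:
        rng.shuffle(order)
        for i in order:
            if len(combined) == length:
                break
            doc = next((d for d in rankings[i] if d not in combined), None)
            if doc is not None:
                combined.append(doc)
    return combined

# Three hypothetical rankers over six documents.
rankings = [["d1", "d2", "d3"], ["d1", "d4", "d6"], ["d5", "d1", "d6"]]
rng = random.Random(42)
trials = 10000
counts = Counter()
for _ in range(trials):
    counts.update(sample_multileaved(rankings, 3, rng))

# d1 (top-ranked by two rankers) and d5 (top-ranked by one) appear in
# every sample, d2 and d4 each only about half the time, and d3 and d6
# never: clicks aggregated over such samples inherit this skew, biasing
# the resulting ranker-quality estimates.
inclusion = {d: counts[d] / trials for d in sorted(counts)}
```
      </preformat>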
      <p>Multileaving offers dramatically improved efficiency over
interleaving, allowing large numbers of rankers to be compared with
very little interaction data; however, this comes at the price of the
problem described above, which can affect the accuracy of the
comparisons<sup>1</sup>.</p>
    </sec>
    <sec id="sec-2">
      <title>2 FUTURE CHALLENGES</title>
      <p>This talk identified several challenges and areas for future work in
multileaving.</p>
      <list list-type="bullet">
        <list-item>
          <p>An important challenge for multileaving methods is to maximize
agreement with A/B testing. Ideally, the goal should be to provide
theoretical guarantees that the outcome of a multileaving experiment
agrees with that of A/B testing for a broad class of A/B testing
metrics, and under as weak a set of assumptions as possible.</p>
        </list-item>
        <list-item>
          <p>There are tradeoffs between the efficiency with which we can
learn and the quality of the list displayed to the user. For example,
if all the rankers being compared agree that a given document is
relevant, it is probably a good idea from a user-experience point of
view to display this document to users, but we learn nothing about the
relative quality of the rankers from clicks on it. How can this
tradeoff between information gain and document quality be managed in
an optimal manner?</p>
        </list-item>
        <list-item>
          <p>Can document selection be done in a more intelligent manner
than those currently employed by multileaving methods? In particular,
is it possible to aggregate information across rankers during learning
so that the multileaved list can be expected to be better than that of
most of the individual rankers?</p>
        </list-item>
        <list-item>
          <p>How can counterfactual methods and multileaving methods be
combined to minimize the deterioration in user experience during
evaluation?</p>
        </list-item>
      </list>
      <p><sup>1</sup>This introduction is adapted from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Brost</surname>
          </string-name>
          .
          <article-title>Online Evaluation of Rankers Using Multileaving</article-title>
          . University of Copenhagen,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Brost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Seldin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          .
          <article-title>An improved multileaving algorithm for online ranker evaluation</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>745</fpage>
          -
          <lpage>748</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Brost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Seldin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Cox</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          .
          <article-title>Multi-dueling bandits and their application to online ranker evaluation</article-title>
          .
          <source>In Proceedings of the 25th International ACM Conference on Information and Knowledge Management</source>
          , pages
          <fpage>2161</fpage>
          -
          <lpage>2166</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yue</surname>
          </string-name>
          .
          <article-title>Large-scale validation and analysis of interleaved search evaluation</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS)</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <fpage>6</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          , et al.
          <article-title>Online evaluation for information retrieval</article-title>
          .
          <source>Foundations and Trends® in Information Retrieval</source>
          ,
          <volume>10</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>117</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whiteson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          .
          <article-title>Balancing exploration and exploitation in learning to rank online</article-title>
          .
          <source>In Advances in Information Retrieval</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>263</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.-J.</given-names>
            <surname>Bruintjes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Büttner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>van Doorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Groenland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Oosterhuis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Veeling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>van der Velde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wechsler</surname>
          </string-name>
          , et al.
          <article-title>Probabilistic multileave for online retrieval evaluation</article-title>
          .
          <source>In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>955</fpage>
          -
          <lpage>958</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          .
          <article-title>Predicting search satisfaction metrics with interleaved comparisons</article-title>
          .
          <source>In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>463</fpage>
          -
          <lpage>472</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sietsma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whiteson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lefortier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          .
          <article-title>Multileaved comparisons for fast online evaluation</article-title>
          .
          <source>In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management</source>
          , pages
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>