<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unbiased Counterfactual Estimation of Ranking Metrics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haining Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon</institution>
          ,
          <addr-line>Seattle WA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a novel method to estimate metrics for a ranking policy, based on behavioral signal data (e.g., clicks or viewing of video content) generated by a second, different policy. Building on [1], we prove that the counterfactual estimator is unbiased and discuss its low-variance property. The estimator can be used to evaluate ranking model performance offline, to validate and select positional bias models, and to serve as a learning objective when training new models.</p>
      </abstract>
      <kwd-group>
        <kwd>Learning to rank</kwd>
        <kwd>presentation bias</kwd>
        <kwd>counterfactual inference</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Ranking algorithms power large-scale information retrieval systems. They rank web pages when users look for information in search engines, or products when users shop on e-retailers’ websites. Such systems process billions of queries on a daily basis; they also generate a large amount of logs. The logs capture online user behavior (e.g., clicking URLs or viewing video content) and can be used to improve ranking algorithms. As a result, training new ranking models from logs is a central task in learning-to-rank theory and application; it is also often referred to as learning from “implicit feedback” or “counterfactual learn-to-rank” in the literature (e.g., [<xref ref-type="bibr" rid="ref1">1</xref>]).<fn><p>Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021. Contact: hainiy@amazon.com (H. Yu). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p></fn></p>
      <p>Counterfactual learn-to-rank is complicated by the presence of “positional bias”. Ranking algorithms determine the position of ranked documents. If the search result page has a “vertical list” layout, the document with rank 1 is at the top of the page; if the result page has a horizontal layout, the document with rank 1 is in the top left corner. When positional bias is present, a document has a higher chance of being examined by the user when it is ranked higher. As a result, when a user clicks a document, the click (the “behavioral signal”) can be due to one of two reasons: either the document is relevant for the given query, or the document is at the top of the list. Document ranking and relevancy thus jointly determine behavioral signals, making the signal a noisy proxy for relevancy, the primary goal of ranking optimization.</p>
      <p>In the context of counterfactual learn-to-rank, we refer to the algorithm generating the log data as the “behavioral policy”. Data generated by the behavioral policy is used to train a hopefully better algorithm, called the “target policy”. Research work in counterfactual learn-to-rank can be loosely grouped into training and evaluation. For training, the question is how to properly use knowledge of positional bias to train a target policy and maximize relevancy. To start, positional bias models estimate the probability that a document is examined by a user in a given position; the estimation is based on different user behavioral models. Such models, often called “click models”, have become widely available; see [<xref ref-type="bibr" rid="ref2">2</xref>][<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>][<xref ref-type="bibr" rid="ref5">5</xref>][<xref ref-type="bibr" rid="ref6">6</xref>][<xref ref-type="bibr" rid="ref7">7</xref>][<xref ref-type="bibr" rid="ref8">8</xref>][<xref ref-type="bibr" rid="ref9">9</xref>][<xref ref-type="bibr" rid="ref10">10</xref>][<xref ref-type="bibr" rid="ref11">11</xref>][<xref ref-type="bibr" rid="ref12">12</xref>]. Built on positional bias models, the seminal work of [<xref ref-type="bibr" rid="ref1">1</xref>] and [<xref ref-type="bibr" rid="ref13">13</xref>] established a framework to optimize relevancy using noisy behavioral signal data, proving unbiasedness results for ranking metrics with additive form. For evaluation, the question is how to evaluate the target policy once it is trained. For industry ranking applications, the gold standard for evaluation is to A/B test the target policy against the behavioral policy, collect data on both, and compare ranking metrics such as Average Precision and NDCG. This approach is restricted by limited experimentation time. As an alternative, offline evaluation such as [<xref ref-type="bibr" rid="ref9">9</xref>] predicts target policy ranking metrics using data from the behavioral policy.</p>
      <p>The research discussed above, in particular [<xref ref-type="bibr" rid="ref1">1</xref>] and [<xref ref-type="bibr" rid="ref9">9</xref>], has significantly advanced our understanding of counterfactual learn-to-rank. Meanwhile, each line of research has its pros and cons. Let us use [<xref ref-type="bibr" rid="ref1">1</xref>] and [<xref ref-type="bibr" rid="ref9">9</xref>] to highlight them. First, the two lines of research focus on different subjects in a causal relationship. Taking a causal lens in which relevancy and positional bias jointly drive behavioral signals, [<xref ref-type="bibr" rid="ref1">1</xref>] focuses on relevancy, the “cause”, while [<xref ref-type="bibr" rid="ref9">9</xref>] focuses on behavioral signals, the “effect”. It is an open question whether the approach in [<xref ref-type="bibr" rid="ref1">1</xref>] can be extended to optimize behavioral signal-based metrics (e.g., clicks). Second, the two lines of research also differ in validation: once developed, models in [<xref ref-type="bibr" rid="ref9">9</xref>] can be validated by comparing offline evaluation with online experimentation measurements. For [<xref ref-type="bibr" rid="ref1">1</xref>], even if we can optimize relevancy, we cannot easily evaluate how much improvement is made, even with online experimentation. This is because evaluating relevancy (and its improvement) ultimately requires manual annotation; for large-scale online search engines that process billions of queries daily, such effort is costly. Last but not least, [<xref ref-type="bibr" rid="ref9">9</xref>] relies on high-variance Inverse Propensity Scoring (IPS) techniques based on the entire ranking permutation (or “action”).</p>
      <p>In ranking, the action space is large; for example, there are 100!/(100 − 20)! ≈ 1.3 × 10<sup>39</sup> ways to select and order 20 documents out of 100. As a result, action probabilities are small. The ratio between two small probabilities can be extremely small or extremely large (high variance), making the technique challenging to implement in practical situations. The rank-based propensity adjustment in [<xref ref-type="bibr" rid="ref1">1</xref>], which uses positional effect models, has a more desirable variance property. This difference is key to the accuracy of offline evaluation.</p>
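      <p>The size of the action space quoted above can be checked with a few lines of Python. This is a minimal sketch, assuming Python 3.8 or later for math.perm; the helper name ordered_selections and the uniform-policy illustration are assumptions introduced here, not part of the paper.</p>
      <preformat>
```python
import math


def ordered_selections(n_docs: int, k: int) -> int:
    """Count ordered selections of k documents out of n_docs, i.e. n!/(n-k)!."""
    return math.perm(n_docs, k)


# Size of the action space when ranking 20 out of 100 documents.
action_count = ordered_selections(100, 20)
print(f"{action_count:.2e}")

# Hypothetical illustration: a policy that picks rankings uniformly at random
# would give each ranking a probability of 1/action_count.
uniform_probability = 1 / action_count
print(uniform_probability)
```
</preformat>
      <p>With probabilities of this magnitude in both the numerator and the denominator, a ratio of action probabilities is very sensitive to small estimation errors, which is the variance concern raised above.</p>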
      <p>This paper brings the two lines of research together.</p>
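      <p>The main contribution is the unbiased estimation of ranking metrics for behavioral signals. In this sense, it is part of the study of the “effect” of ranking dynamics. By focusing on the “effect”, it can be validated by offline evaluation and experimentation. Meanwhile, it retains the desirable unbiasedness properties of [<xref ref-type="bibr" rid="ref1">1</xref>] and [<xref ref-type="bibr" rid="ref9">9</xref>], but replaces high-variance Inverse Propensity Scoring adjustments with positional biases, borrowing the key insight from [<xref ref-type="bibr" rid="ref1">1</xref>]. Since the focus switches from cause to effect, this requires new techniques and yields unbiased estimators of a new kind. This unbiased estimator can serve as the learning goal for a new target policy and enables offline/online evaluation. It can also be used to establish a method to validate and select positional bias models, a key input to the counterfactual estimation framework.</p>
      <sec id="sec-1-1">
        <title>2. Problem Set Up</title>
        <p>Let <inline-formula><tex-math>q</tex-math></inline-formula> be a random query. For <inline-formula><tex-math>q</tex-math></inline-formula>, the set of documents to rank is <inline-formula><tex-math>D = [d_1, d_2, \ldots]</tex-math></inline-formula>. A ranking policy <inline-formula><tex-math>\pi</tex-math></inline-formula> assigns ranks <inline-formula><tex-math>R = [R(d_1), R(d_2), \ldots]</tex-math></inline-formula> to the documents in <inline-formula><tex-math>D</tex-math></inline-formula>. <inline-formula><tex-math>R</tex-math></inline-formula>, a random permutation of <inline-formula><tex-math>[1, \ldots, \|D\|]</tex-math></inline-formula>, determines the position of the documents in <inline-formula><tex-math>D</tex-math></inline-formula> on the web page. For example, in a “vertical list” layout, the document with <inline-formula><tex-math>R(d) = 1</tex-math></inline-formula> is at the top of the page. After presenting <inline-formula><tex-math>D</tex-math></inline-formula> in the order of <inline-formula><tex-math>R</tex-math></inline-formula> to the user, we observe the behavioral signal <inline-formula><tex-math>S</tex-math></inline-formula>. It is a binary vector <inline-formula><tex-math>S = [S(d_1), S(d_2), \ldots]</tex-math></inline-formula>, where <inline-formula><tex-math>S(d) = 1</tex-math></inline-formula> if and only if the user engages with document <inline-formula><tex-math>d \in D</tex-math></inline-formula> (e.g., clicking a web page or watching a video). Given a ranking vector <inline-formula><tex-math>R</tex-math></inline-formula> and the behavioral signal vector <inline-formula><tex-math>S</tex-math></inline-formula>, we define a ranking metric of interest <inline-formula><tex-math>M = M(R, S)</tex-math></inline-formula>, such as Precision or DCG.</p>
        <p>Table 1 shows a hypothetical example. For <inline-formula><tex-math>q = 1</tex-math></inline-formula>, <inline-formula><tex-math>D = [100, 200, 300]</tex-math></inline-formula> represents three documents to rank. The behavioral policy <inline-formula><tex-math>b</tex-math></inline-formula> ranks them as <inline-formula><tex-math>R^{b} = [1, 2, 3]</tex-math></inline-formula>, i.e., it shows document 100 first and document 300 last. Seeing the list, the user ignores the top document 100 and clicks the other two, i.e., <inline-formula><tex-math>S^{b} = [0, 1, 1]</tex-math></inline-formula>. If we use Precision@3 to measure the performance of the ranking policy, we get <inline-formula><tex-math>M(R^{b}, S^{b}) = 0.667</tex-math></inline-formula>. Saved in the log, the data is later used to train a target policy <inline-formula><tex-math>t</tex-math></inline-formula>. The new policy ranks differently, i.e., <inline-formula><tex-math>R^{t} = [3, 1, 2]</tex-math></inline-formula>. This seems an improvement. But the user never sees the documents in the order [200, 300, 100], and we do not know whether they would click differently. In other words, <inline-formula><tex-math>S^{t}</tex-math></inline-formula> is missing data. This is an example of causal inference, hence the name counterfactual. Estimating <inline-formula><tex-math>M^{t} = M(R^{t}, S^{t})</tex-math></inline-formula> without observing <inline-formula><tex-math>S^{t}</tex-math></inline-formula> is the central task of this paper.</p>
        <p>All quantities defined above are random variables (or vectors). The dependencies among them are as follows: the query <inline-formula><tex-math>q</tex-math></inline-formula> determines the document set <inline-formula><tex-math>D</tex-math></inline-formula>, i.e., <inline-formula><tex-math>D = D(q)</tex-math></inline-formula>. <inline-formula><tex-math>q</tex-math></inline-formula>, <inline-formula><tex-math>D</tex-math></inline-formula>, and the ranking policy <inline-formula><tex-math>\pi</tex-math></inline-formula> jointly determine the rank vector <inline-formula><tex-math>R = R(q, D, \pi)</tex-math></inline-formula>. <inline-formula><tex-math>q</tex-math></inline-formula>, <inline-formula><tex-math>D</tex-math></inline-formula>, and <inline-formula><tex-math>R</tex-math></inline-formula> then jointly determine the behavioral signal vector <inline-formula><tex-math>S = S(q, D, R)</tex-math></inline-formula>. Last but not least, <inline-formula><tex-math>R</tex-math></inline-formula> and <inline-formula><tex-math>S</tex-math></inline-formula> determine <inline-formula><tex-math>M = M(R, S)</tex-math></inline-formula>. The table below visualizes the structural causal model. The randomness in the system comes from multiple sources: the distribution of <inline-formula><tex-math>q</tex-math></inline-formula>, the conditional distribution of the candidate set <inline-formula><tex-math>D \mid q</tex-math></inline-formula>, the conditional distribution of <inline-formula><tex-math>R \mid q, D, \pi</tex-math></inline-formula>, and the conditional distribution of <inline-formula><tex-math>S \mid q, D, R</tex-math></inline-formula>. The only exception is that no distribution is needed on <inline-formula><tex-math>M \mid R, S</tex-math></inline-formula>: given <inline-formula><tex-math>R</tex-math></inline-formula> and <inline-formula><tex-math>S</tex-math></inline-formula>, the value of <inline-formula><tex-math>M</tex-math></inline-formula> is deterministic for most practical ranking systems and metrics.</p>
        <p>Because the analysis involves two policies, we always specify the policy generating the data. That is, we use <inline-formula><tex-math>R^{\pi}</tex-math></inline-formula> to denote the ranks generated by policy <inline-formula><tex-math>\pi</tex-math></inline-formula>, <inline-formula><tex-math>S^{\pi}</tex-math></inline-formula> to denote the behavioral signal generated when showing <inline-formula><tex-math>D</tex-math></inline-formula> in the order of <inline-formula><tex-math>R^{\pi}</tex-math></inline-formula> to the user, and <inline-formula><tex-math>M^{\pi} = M(R^{\pi}, S^{\pi})</tex-math></inline-formula> to denote the resulting ranking metric. This helps distinguish between random variables under different policies. We omit other dependencies in the notation when confusion can be avoided.</p>
      </sec>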
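      <p>The hypothetical example of Table 1 can be written down as data, which makes the missing-data problem concrete. This is a minimal sketch; the function names precision_at_k and display_order and all variable names are assumptions introduced here for illustration.</p>
      <preformat>
```python
from typing import Optional, Sequence


def precision_at_k(ranks: Sequence[int], signals: Sequence[int], k: int) -> float:
    """Share of the top-k positions whose document received a signal."""
    hits = sum(s for r, s in zip(ranks, signals) if k >= r)
    return hits / k


def display_order(doc_ids: Sequence[int], ranks: Sequence[int]) -> list:
    """Documents sorted by the rank a policy assigned to them."""
    return [doc for _, doc in sorted(zip(ranks, doc_ids))]


doc_ids = [100, 200, 300]
ranks_b = [1, 2, 3]      # behavioral policy: ranks found in the log
signals_b = [0, 1, 1]    # clicks logged under the behavioral policy
ranks_t = [3, 1, 2]      # target policy: ranks can be computed offline
signals_t: Optional[Sequence[int]] = None  # never observed, hence counterfactual

metric_b = precision_at_k(ranks_b, signals_b, k=3)
print(round(metric_b, 3))               # 0.667
print(display_order(doc_ids, ranks_t))  # [200, 300, 100]
```
</preformat>
      <p>The metric for the behavioral policy follows directly from the log, whereas the same call for the target policy cannot be made because its signal vector does not exist.</p>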
    </sec>
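    <sec id="sec-3">
      <title>3. The Unbiased Counterfactual Estimator</title>
      <p>In this section we define the counterfactual estimator and prove its unbiasedness. We first present the main result in Section 3.1, prove unbiasedness in Section 3.2, and discuss technical details in Section 3.3.</p>
      <sec id="sec-3-1-1">
        <title>3.1. Main Result</title>
        <p>We first state the assumptions needed to define the estimator and prove its unbiasedness.</p>
        <p>Assumption 3.1. For a policy <inline-formula><tex-math>\pi</tex-math></inline-formula>, conditional on the ranking vector <inline-formula><tex-math>R^{\pi}</tex-math></inline-formula> and the behavioral signal <inline-formula><tex-math>S^{\pi}</tex-math></inline-formula>, the metric <inline-formula><tex-math>M^{\pi}(R^{\pi}, S^{\pi})</tex-math></inline-formula> is conditionally independent of <inline-formula><tex-math>q</tex-math></inline-formula> and <inline-formula><tex-math>D</tex-math></inline-formula>, i.e., <inline-formula><tex-math>M^{\pi}(R^{\pi}, S^{\pi}) \perp\!\!\!\perp q, D \mid R^{\pi}, S^{\pi}</tex-math></inline-formula>.</p>
        <p>This is easily satisfied for most ranking metrics such as MRR, MAP, Precision, and NDCG. Next, similar to [<xref ref-type="bibr" rid="ref1">1</xref>][<xref ref-type="bibr" rid="ref9">9</xref>][<xref ref-type="bibr" rid="ref13">13</xref>], we assume the ranking metric of interest is linearly decomposable:</p>
        <p>Assumption 3.2. <inline-formula><tex-math>M^{\pi}(R^{\pi}, S^{\pi}) = \sum_{d \in D} \omega(R^{\pi}(d)) S^{\pi}(d)</tex-math></inline-formula>, where <inline-formula><tex-math>\omega(r)</tex-math></inline-formula> is a deterministic function of rank <inline-formula><tex-math>r</tex-math></inline-formula>.</p>
        <p>For Precision@3, <inline-formula><tex-math>\omega(r) = 1</tex-math></inline-formula> if <inline-formula><tex-math>r \le 3</tex-math></inline-formula>, and 0 otherwise, up to the normalizing constant 1/3.</p>
        <p>For <inline-formula><tex-math>d \in D</tex-math></inline-formula>, <inline-formula><tex-math>S^{\pi}(d)</tex-math></inline-formula> is a binary random variable and <inline-formula><tex-math>\mathbb{E}[S^{\pi}(d)]</tex-math></inline-formula> is the click probability. Similar to [<xref ref-type="bibr" rid="ref9">9</xref>] and [<xref ref-type="bibr" rid="ref10">10</xref>], we make the following assumption based on the position-based click model (PBM) [<xref ref-type="bibr" rid="ref6">6</xref>]:</p>
        <p>Assumption 3.3. <inline-formula><tex-math>\mathbb{E}_{S^{\pi} \mid q, D, R^{\pi}}[S^{\pi}(d)] = \theta(R^{\pi}(d)) \gamma(q, d)</tex-math></inline-formula>, where <inline-formula><tex-math>\theta(r) &gt; 0</tex-math></inline-formula> is the probability of examining a certain rank <inline-formula><tex-math>r</tex-math></inline-formula>, and <inline-formula><tex-math>\gamma(q, d)</tex-math></inline-formula> is the probability of a click conditional on being examined.</p>
        <p>When using data generated by the behavioral policy <inline-formula><tex-math>b</tex-math></inline-formula> to train a new policy <inline-formula><tex-math>t</tex-math></inline-formula>, we also assume that <inline-formula><tex-math>b</tex-math></inline-formula> and <inline-formula><tex-math>t</tex-math></inline-formula> share nothing in common except the inputs <inline-formula><tex-math>q</tex-math></inline-formula> and <inline-formula><tex-math>D</tex-math></inline-formula>. For example, the output of one policy is not used as input to another:</p>
        <p>Assumption 3.4. Given two policies <inline-formula><tex-math>b</tex-math></inline-formula> and <inline-formula><tex-math>t</tex-math></inline-formula>, <inline-formula><tex-math>R^{b}</tex-math></inline-formula> and <inline-formula><tex-math>R^{t}</tex-math></inline-formula> are conditionally independent given <inline-formula><tex-math>q</tex-math></inline-formula> and <inline-formula><tex-math>D</tex-math></inline-formula>, i.e., <inline-formula><tex-math>R^{b} \perp\!\!\!\perp R^{t} \mid q, D</tex-math></inline-formula>.</p>
        <p>The main result of the paper states that:</p>
        <p>Theorem 3.1. Define
          <disp-formula><tex-math>Y(R^{t}, R^{b}, S^{b}) = \sum_{d \in D,\; S^{b}(d) = 1} \omega(R^{t}(d)) \frac{\theta(R^{t}(d))}{\theta(R^{b}(d))} \tag{1}</tex-math></disp-formula>
          Let <inline-formula><tex-math>Q</tex-math></inline-formula> be queries randomly sampled from the query universe where policy <inline-formula><tex-math>b</tex-math></inline-formula> is applied. Under Assumptions 3.1, 3.2, 3.3, and 3.4,
          <disp-formula><tex-math>\frac{1}{\|Q\|} \sum_{q \in Q} Y(R^{t}, R^{b}, S^{b}) \tag{2}</tex-math></disp-formula>
          is an unbiased estimator for <inline-formula><tex-math>\mathbb{E}_{q, D, R^{t}, S^{t}}[M^{t}(R^{t}, S^{t})]</tex-math></inline-formula>, i.e.,
          <disp-formula><tex-math>\mathbb{E}_{q, D, R^{t}, S^{t}}[M^{t}(R^{t}, S^{t})] = \mathbb{E}\left[\frac{1}{\|Q\|} \sum_{q \in Q} Y(R^{t}, R^{b}, S^{b})\right] \tag{3}</tex-math></disp-formula>
          where the expectation on the right-hand side is taken over the query set <inline-formula><tex-math>Q</tex-math></inline-formula>, the document set <inline-formula><tex-math>D</tex-math></inline-formula> for every <inline-formula><tex-math>q \in Q</tex-math></inline-formula>, the rank vectors <inline-formula><tex-math>R^{t}</tex-math></inline-formula> and <inline-formula><tex-math>R^{b}</tex-math></inline-formula> given <inline-formula><tex-math>q</tex-math></inline-formula> and <inline-formula><tex-math>D</tex-math></inline-formula>, and the signal <inline-formula><tex-math>S^{b}</tex-math></inline-formula> given <inline-formula><tex-math>q</tex-math></inline-formula>, <inline-formula><tex-math>D</tex-math></inline-formula>, and <inline-formula><tex-math>R^{b}</tex-math></inline-formula>.</p>
        <p>Let us use the same example in Table 1 to illustrate how the estimator is computed. Assume we have the following positional bias estimates: <inline-formula><tex-math>\theta(1) = 0.9</tex-math></inline-formula>, <inline-formula><tex-math>\theta(2) = 0.7</tex-math></inline-formula>, <inline-formula><tex-math>\theta(3) = 0.5</tex-math></inline-formula> (as a reminder, such estimates can be obtained via statistical estimation procedures; see [<xref ref-type="bibr" rid="ref12">12</xref>] and references within for implementation). Recall that the ranking vector is <inline-formula><tex-math>R^{b} = [1, 2, 3]</tex-math></inline-formula> and the behavioral signal is <inline-formula><tex-math>S^{b} = [0, 1, 1]</tex-math></inline-formula>. For the metric Precision@3, <inline-formula><tex-math>M^{b} = 0.667</tex-math></inline-formula>. For policy <inline-formula><tex-math>t</tex-math></inline-formula>, we observed <inline-formula><tex-math>R^{t} = [3, 1, 2]</tex-math></inline-formula> but not <inline-formula><tex-math>S^{t}</tex-math></inline-formula>. So we use <inline-formula><tex-math>R^{t}</tex-math></inline-formula>, <inline-formula><tex-math>R^{b}</tex-math></inline-formula>, and <inline-formula><tex-math>S^{b}</tex-math></inline-formula> and equation (1) to compute the following estimate: <inline-formula><tex-math>Y = \left(0 + 1 \times \frac{\theta(1)}{\theta(2)} + 1 \times \frac{\theta(2)}{\theta(3)}\right) / 3 = 0.895</tex-math></inline-formula>. Averaging the <inline-formula><tex-math>Y</tex-math></inline-formula>s over the queries in <inline-formula><tex-math>Q</tex-math></inline-formula> yields the counterfactual estimator (2).</p>
      </sec>
      <sec id="sec-3-3-1">
        <title>3.2. Proof of Unbiasedness</title>
        <p>We now set up a series of unbiasedness results, eventually leading to the proof of Theorem 3.1.</p>
        <p>Lemma 3.2. Let <inline-formula><tex-math>b</tex-math></inline-formula> and <inline-formula><tex-math>t</tex-math></inline-formula> be two stochastic policies. Under Assumptions 3.1, 3.2, 3.3, and 3.4, <inline-formula><tex-math>Y(R^{t}, R^{b}, S^{b})</tex-math></inline-formula> is an unbiased estimator for <inline-formula><tex-math>\mathbb{E}_{S^{t} \mid q, D, R^{t}}[M^{t}]</tex-math></inline-formula>, i.e.,
          <disp-formula><tex-math>\mathbb{E}_{S^{t} \mid q, D, R^{t}}[M(R^{t}, S^{t})] = \mathbb{E}_{S^{b} \mid q, D, R^{t}, R^{b}}[Y(R^{t}, R^{b}, S^{b})] \tag{4}</tex-math></disp-formula>
        </p>
        <p>Proof. Via Assumptions 3.1, 3.2 and 3.3,
          <disp-formula><tex-math>\mathbb{E}_{S^{t} \mid q, D, R^{t}}[M^{t}] = \sum_{d \in D} \omega(R^{t}(d)) \, \mathbb{E}_{S^{t} \mid q, D, R^{t}}[S^{t}(d)] \tag{5}</tex-math></disp-formula>
          By Assumption 3.3, <inline-formula><tex-math>\mathbb{E}_{S^{t} \mid q, D, R^{t}}[S^{t}(d)] = \theta(R^{t}(d)) \gamma(q, d)</tex-math></inline-formula>. Defining the shorthand
          <disp-formula><tex-math>\Psi = \frac{\omega(R^{t}(d)) \, \theta(R^{t}(d))}{\omega(R^{b}(d)) \, \theta(R^{b}(d))},</tex-math></disp-formula>
          it follows that
          <disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathbb{E}_{S^{t} \mid q, D, R^{t}}[M^{t}]
&= \sum_{d \in D} \omega(R^{t}(d)) \, \theta(R^{t}(d)) \, \gamma(q, d) \\
&= \sum_{d \in D} \omega(R^{b}(d)) \, \theta(R^{b}(d)) \, \gamma(q, d) \, \Psi \\
&= \sum_{d \in D} \omega(R^{b}(d)) \, \mathbb{E}_{S^{b} \mid q, D, R^{b}}[S^{b}(d)] \, \Psi \\
&= \sum_{d \in D} \omega(R^{b}(d)) \, \mathbb{E}_{S^{b} \mid q, D, R^{t}, R^{b}}[S^{b}(d)] \, \Psi \\
&= \mathbb{E}_{S^{b} \mid q, D, R^{t}, R^{b}}\left[\sum_{d \in D} \omega(R^{b}(d)) \, S^{b}(d) \, \Psi\right] \\
&= \mathbb{E}_{S^{b} \mid q, D, R^{t}, R^{b}}\left[\sum_{d \in D,\; S^{b}(d) = 1} \omega(R^{t}(d)) \frac{\theta(R^{t}(d))}{\theta(R^{b}(d))}\right] \\
&= \mathbb{E}_{S^{b} \mid q, D, R^{t}, R^{b}}[Y(R^{t}, R^{b}, S^{b})]
\end{aligned}
]]></tex-math></disp-formula>
          The third step in the derivation is due to Assumption 3.3; the fourth step is due to Assumption 3.4.</p>
        <p>Lemma 3.3. Under Assumptions 3.1, 3.2, 3.3, and 3.4,
          <disp-formula><tex-math>\mathbb{E}_{q, D, R^{t}, S^{t}}[M(R^{t}, S^{t})] = \mathbb{E}_{q, D, R^{t}, R^{b}, S^{b}}[Y(R^{t}, R^{b}, S^{b})]</tex-math></disp-formula>
        </p>
        <p>Proof. By Assumption 3.4, <inline-formula><tex-math>R^{b}</tex-math></inline-formula> and <inline-formula><tex-math>R^{t}</tex-math></inline-formula> are conditionally independent. As a result, <inline-formula><tex-math>M^{t}(R^{t}, S^{t})</tex-math></inline-formula> is also conditionally independent of <inline-formula><tex-math>R^{b}</tex-math></inline-formula>. Therefore
          <disp-formula><tex-math>\mathbb{E}_{S^{t} \mid q, D, R^{t}}[M^{t}(R^{t}, S^{t})] = \mathbb{E}_{S^{t} \mid q, D, R^{t}, R^{b}}[M^{t}(R^{t}, S^{t})]</tex-math></disp-formula>
          Combining this equation with Lemma 3.2 yields
          <disp-formula><tex-math>\mathbb{E}_{S^{t} \mid q, D, R^{t}, R^{b}}[M^{t}(R^{t}, S^{t})] = \mathbb{E}_{S^{b} \mid q, D, R^{t}, R^{b}}[Y(R^{t}, R^{b}, S^{b})]</tex-math></disp-formula>
          The expectations on both sides of the above equation are conditioned on the same joint distribution of <inline-formula><tex-math>q, D, R^{t}, R^{b}</tex-math></inline-formula>. Taking the expectation over both sides of the equation yields
          <disp-formula><tex-math>\mathbb{E}_{q, D, R^{t}, R^{b}, S^{t}}[M^{t}(R^{t}, S^{t})] = \mathbb{E}_{q, D, R^{t}, R^{b}, S^{b}}[Y(R^{t}, R^{b}, S^{b})] \tag{6}</tex-math></disp-formula>
          Again using Assumption 3.4, we can remove <inline-formula><tex-math>R^{b}</tex-math></inline-formula> from the expectation of <inline-formula><tex-math>M^{t}</tex-math></inline-formula> on the left-hand side of equation (6). This yields
          <disp-formula><tex-math>\mathbb{E}_{q, D, R^{t}, S^{t}}[M^{t}(R^{t}, S^{t})] = \mathbb{E}_{q, D, R^{t}, R^{b}, S^{b}}[Y(R^{t}, R^{b}, S^{b})]</tex-math></disp-formula>
        </p>
        <p>Theorem 3.1 can now be proved as follows:</p>
        <p>Proof. Since <inline-formula><tex-math>\frac{1}{\|Q\|} \sum_{q \in Q} Y</tex-math></inline-formula> is a sample mean of <inline-formula><tex-math>Y</tex-math></inline-formula>, it is an unbiased estimator of the true mean <inline-formula><tex-math>\mathbb{E}_{q, D, R^{t}, R^{b}, S^{b}}[Y]</tex-math></inline-formula>, which, by Lemma 3.3, is equal to <inline-formula><tex-math>\mathbb{E}_{q, D, R^{t}, S^{t}}[M^{t}(R^{t}, S^{t})]</tex-math></inline-formula>. Thus it is an unbiased estimator of <inline-formula><tex-math>\mathbb{E}_{q, D, R^{t}, S^{t}}[M^{t}(R^{t}, S^{t})]</tex-math></inline-formula>.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.3. Technical Details</title>
        <p>Theorem 3.1 holds when both <inline-formula><tex-math>b</tex-math></inline-formula> and <inline-formula><tex-math>t</tex-math></inline-formula> are deterministic policies, without Assumption 3.4. The proof is omitted due to space limits. In practical ranking systems, the output of one ranker is frequently incorporated into another. This violates Assumption 3.4, which requires the two policies to share nothing except inputs.</p>
        <p>The unbiased estimator looks different from its counterpart in equation (4) of [<xref ref-type="bibr" rid="ref1">1</xref>], where the positional bias appears only once. The difference is easy to understand with a causal view: the common assumption in [<xref ref-type="bibr" rid="ref1">1</xref>] and the current work is that relevancy and positional bias jointly drive behavioral signals. When it comes to estimation, we are interested in different subjects. [<xref ref-type="bibr" rid="ref1">1</xref>] is interested in estimating relevancy (the cause) from clicks (the effect), so it has the <inline-formula><tex-math>1/\theta</tex-math></inline-formula> factor to cancel out the positional bias from the behavioral policy. The present work is interested in estimating metrics defined on the behavioral signal (the effect) under the target policy, from data generated by a behavioral policy (a second effect). Two positional bias terms are thus needed to cancel the effect.</p>
        <p>The counterfactual estimator (2) aims to avoid the high-variance challenge facing other IPS estimators, e.g., in [<xref ref-type="bibr" rid="ref9">9</xref>]. It is common practice to use IPS estimators to construct estimates for metrics of interest. While such estimators enjoy the desirable property of unbiasedness, their variance profile is of concern. The core of any IPS estimator is the ratio of the probabilities for a ranking <inline-formula><tex-math>R</tex-math></inline-formula> to be selected by two different policies <inline-formula><tex-math>t</tex-math></inline-formula> and <inline-formula><tex-math>b</tex-math></inline-formula>, i.e., <inline-formula><tex-math>\Pr(R \mid q, t) / \Pr(R \mid q, b)</tex-math></inline-formula>. In practice, the ranking space is (combinatorially) large and action probabilities are small. Dividing one small number by another can generate extremely small or large ratios. When any policy is deterministic, <inline-formula><tex-math>\Pr(R \mid q, \pi)</tex-math></inline-formula> is ill-defined. The problem gets worse when the behavioral and target policies differ significantly, i.e., when accurate offline performance evaluation is needed most. As a result, the ratio can have high variance; this prevents IPS estimators from being useful in industry applications. The current approach solves this problem. Counterfactual estimation using equation (2) no longer needs the high-variance action probability ratio. Instead it uses the ratio between positional bias estimates (<inline-formula><tex-math>\theta</tex-math></inline-formula>), a function of rank positions. The ratio of <inline-formula><tex-math>\theta</tex-math></inline-formula> empirically has much less variance than the estimated action probability ratio.</p>
        <p>The current framework can be generalized in three different ways. First, it naturally extends to contextual ranking problems, where <inline-formula><tex-math>q</tex-math></inline-formula> represents not only the search query, but also all context information available to the ranker. Second, it can be generalized to optimize query/document-specific rewards. This helps when different documents have different economic value. Assumption 3.2 can be relaxed to <inline-formula><tex-math>M(R, S) = \sum_{d \in D} \omega(R(d), q, D) S(d)</tex-math></inline-formula>, where <inline-formula><tex-math>\omega(r, q, D)</tex-math></inline-formula> is a deterministic function of rank <inline-formula><tex-math>r</tex-math></inline-formula>, query <inline-formula><tex-math>q</tex-math></inline-formula>, and document set <inline-formula><tex-math>D</tex-math></inline-formula>. Last but not least, the probability of examination <inline-formula><tex-math>\theta</tex-math></inline-formula> and the conditional click probability <inline-formula><tex-math>\gamma</tex-math></inline-formula> can depend on the query <inline-formula><tex-math>q</tex-math></inline-formula> and the candidate document set <inline-formula><tex-math>D</tex-math></inline-formula>. That is, Assumption 3.3 can be relaxed to <inline-formula><tex-math>\mathbb{E}_{S \mid q, D, R}[S(d)] = \theta(R(d), q, D) \gamma(q, d, D)</tex-math></inline-formula>. The same is true for Assumption 3.4.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Validating and Selecting Positional Bias Models</title>
      <p>The unbiased counterfactual estimator has three potential uses: to evaluate ranking performance offline, to validate and select positional bias estimates, and to serve as the loss (or the reward in a reinforcement learning setting) in training new ranking models. Some have been covered by the literature. See [<xref ref-type="bibr" rid="ref9">9</xref>] for a discussion of offline ranking performance evaluation and [<xref ref-type="bibr" rid="ref1">1</xref>] for a discussion of training loss improvement. The rest of this section focuses on validating and selecting positional bias models, an area not covered in past works. Positional bias models can be developed in many ways, depending on theory (e.g., the underlying statistical model, the causal structure, the inclusion and exclusion of predictive features), data, and estimation processes. When there is one model, the question is how correct it is. When there are multiple models, the question is how to select the best one for a specific use case.</p>
      <p>Using the counterfactual estimator, a method can be developed to validate and select positional bias models. It is based on the following idea: we already have one unbiased estimator of <inline-formula><tex-math>\mathbb{E}[M^{t}]</tex-math></inline-formula> that uses positional bias estimates as input, namely the <inline-formula><tex-math>Y</tex-math></inline-formula>s defined in equations (1) and (2). If we find a second unbiased estimator of <inline-formula><tex-math>\mathbb{E}[M^{t}]</tex-math></inline-formula> that does not use positional bias estimates, the difference between the two estimators can be used to evaluate the correctness of positional bias models. Two unbiased estimates of the same quantity (the population mean) should converge. In fact, if we run policy <inline-formula><tex-math>t</tex-math></inline-formula> on a set of queries <inline-formula><tex-math>Q'</tex-math></inline-formula>, <inline-formula><tex-math>\mathbb{E}[M^{t}]</tex-math></inline-formula> can be directly estimated as <inline-formula><tex-math>\frac{1}{\|Q'\|} \sum_{q \in Q'} M(R^{t}, S^{t})</tex-math></inline-formula>.</p>
      <p>The model validation process takes three steps: data collection, estimation, and testing. The first step is to collect data via an online ranking experiment. The experiment should have two treatment groups (C and T), each with a different ranking policy. We then observe behavioral signals (e.g., clicks) for both groups. For every query in T, we also rank the documents with policy C in “shadow mode” and log the ranking from C, even though we do not know which documents would have been clicked had policy C been applied. In the notation of Section 3, policy T acts as the behavioral policy <inline-formula><tex-math>b</tex-math></inline-formula> and policy C as the target policy <inline-formula><tex-math>t</tex-math></inline-formula>. The second step is to use the data to compute the two unbiased estimators previously defined. In the third step, we use the two estimates to construct a model validation test. A simple approach is to treat the sample mean estimator as the “ground truth”, as long as the sample size is big enough. The difference between the two estimators can thus be used to quantify the correctness of the model. A method with more statistical rigor is to treat the two estimates as group means of random variables with estimated standard deviations. Standard hypothesis testing readily applies.</p>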
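      <p>The estimation and testing steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: the names counterfactual_estimate, precision_at_3_weight and two_sample_z are hypothetical, the 1/3 normalization of Precision@3 is folded into the weight function, and a plain two-sample z statistic stands in for whichever hypothesis test is preferred.</p>
      <preformat>
```python
import math
from statistics import mean, stdev
from typing import Callable, Mapping, Sequence


def counterfactual_estimate(
    ranks_t: Sequence[int],
    ranks_b: Sequence[int],
    signals_b: Sequence[int],
    weight: Callable[[int], float],
    theta: Mapping[int, float],
) -> float:
    """Equation (1): sum over clicked documents of
    weight(R_t(d)) * theta(R_t(d)) / theta(R_b(d))."""
    return sum(
        weight(rt) * theta[rt] / theta[rb]
        for rt, rb, s in zip(ranks_t, ranks_b, signals_b)
        if s == 1
    )


def precision_at_3_weight(rank: int) -> float:
    # Assumption: the 1/3 normalization is folded into the weight.
    return 1 / 3 if 3 >= rank else 0.0


def two_sample_z(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Difference of group means divided by its estimated standard error.
    Each sample needs at least two values and some variation."""
    standard_error = math.sqrt(
        stdev(sample_a) ** 2 / len(sample_a) + stdev(sample_b) ** 2 / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Worked example from Section 3.1: positional bias estimates and logged data.
theta = {1: 0.9, 2: 0.7, 3: 0.5}
y = counterfactual_estimate(
    ranks_t=[3, 1, 2],
    ranks_b=[1, 2, 3],
    signals_b=[0, 1, 1],
    weight=precision_at_3_weight,
    theta=theta,
)
print(round(y, 3))  # 0.895
```
</preformat>
      <p>In a validation run, the first sample passed to two_sample_z would hold one counterfactual estimate per query in group T (computed from T's logged signals and the shadow rankings from C), and the second sample would hold the metric observed per query in group C. A statistic far from zero suggests that the positional bias estimates in theta are off.</p>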
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>We built a counterfactual estimator for ranking metrics defined on behavioral signals. The estimator is unbiased and has low variance. We discuss its usage in selecting and validating positional bias models. This estimator can be applied to ranking models with a strong counterfactual effect.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          , T. Schnabel,
          <article-title>Unbiased learning-to-rank with biased feedback</article-title>
          ,
          <source>in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>781</fpage>
          -
          <lpage>789</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>Optimizing search engines using clickthrough data</article-title>
          ,
          <source>in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02</source>
          ,
          <year>2002</year>
          , p.
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Granka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Hembrooke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gay</surname>
          </string-name>
          ,
          <article-title>Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>25</volume>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zoeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ramsey</surname>
          </string-name>
          ,
          <article-title>An experimental comparison of click position-bias models</article-title>
          ,
          <source>in: Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM '08</source>
          ,
          <year>2008</year>
          , p.
          <fpage>87</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A dynamic bayesian network click model for web search ranking</article-title>
          ,
          <source>in: Proceedings of the 18th International Conference on World Wide Web, WWW '09</source>
          ,
          <year>2009</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chuklin</surname>
          </string-name>
          , I. Markov, M. de Rijke,
          <article-title>Click models for web search</article-title>
          ,
          <source>Synthesis Lectures on Information Concepts, Retrieval, and Services</source>
          <volume>7</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Borisov</surname>
          </string-name>
          , I. Markov, M. de Rijke,
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          ,
          <article-title>A neural click model for web search</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on World Wide Web, WWW '16</source>
          ,
          <year>2016</year>
          , p.
          <fpage>531</fpage>
          -
          <lpage>541</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schnabel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Frazier</surname>
          </string-name>
          , T. Joachims,
          <article-title>Unbiased comparative evaluation of ranking functions</article-title>
          ,
          <year>2016</year>
          . arXiv:
          <volume>1604</volume>
          .
          <fpage>07209</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Abbasi-Yadkori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kveton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muthukrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vinay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Offline evaluation of ranking policies with click models</article-title>
          ,
          <source>in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1685</fpage>
          -
          <lpage>1694</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Golbandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Position bias estimation for unbiased learning to rank in personal search</article-title>
          ,
          <source>in: Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zaitsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>Consistent position bias estimation without online interventions for learning-to-rank</article-title>
          ,
          <year>2018</year>
          . arXiv:1806.03555.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zaitsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>Estimating position bias without intrusive interventions</article-title>
          ,
          <source>in: WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takatsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zaitsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>A general framework for counterfactual learning-to-rank</article-title>
          ,
          <source>in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>