<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Contextual Position Bias Estimation Using a Single Stochastic Logging Policy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Di Benedetto</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Buchholz</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben London</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matej Jakimov</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannik Stein</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Malte Lichtenberg</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Bellini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Ruffini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thorsten Joachims</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cornell University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Addressing position bias is of pivotal importance for performing unbiased off-policy training and evaluation in Learning To Rank (LTR). This requires accurate estimates of the probabilities of the users examining the slots where items are displayed, which in many applications are likely to depend on multiple factors, e.g. the screen size. This leads to a position-bias curve that is no longer constant, but depends on the context. Existing position-bias estimators are either non-contextual or require multiple deployed ranking policies. We propose a novel contextual position-bias estimator that only requires propensities logged from a single stochastic logging policy. Empirical evaluations assess the accuracy of the model in recovering the position-bias curves as well as the impact on off-policy evaluation, showing how a contextual position-bias estimator can deliver better reward estimates that are more robust to non-stationarity compared to a non-contextual one.</p>
      </abstract>
      <kwd-group>
        <kwd>Position-based model</kwd>
        <kwd>contextual position bias</kwd>
        <kwd>off-policy evaluation</kwd>
        <kwd>non-stationarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recommender systems have large catalogs from which to source content to users, and users are usually served with a list of items from which they can choose which items to consume. Optimizing the ranking of presented items heavily impacts the success of recommendation, since users typically only interact with items at the top of a ranking. Industrial systems can leverage vast quantities of past user interactions, which can be used to train new ranking policies and evaluate them offline, before deploying them. Most of the time, these logged interactions only provide implicit feedback that is subject to different sources of bias [1], which need to be addressed both in training and evaluation. For instance, when considering clicks, which arguably constitute the most abundant signal in recommender systems, one cannot directly interpret a non-click as the user not being interested in the recommended item. In fact, when users are presented a list of items to interact with, they can only click on items that the production policy decided to present to the user (i.e., selection bias), and they are more likely to examine top positions than bottom ones (i.e., position bias) [2]. These biases can be addressed using click models that describe how the user interacts with the recommended items [3, 4, 5]. By incorporating these modelling assumptions, we can perform unbiased off-policy training [6, 7, 8] and evaluation [9].</p>
      <p>One of the most popular click models is the position-based model, which models the click as the realisation of two independent events: examination of the position and relevance of the item (see Section 2.1 for more details). To apply this click model to off-policy training and evaluation, one must estimate the vector of examination probabilities for each displayed position, also called the position-bias curve. The first methods that appeared in the literature provided estimators for a single position-bias curve to be used for every query [10, 11, 12]. However, in many applications, the examination probabilities are influenced by many factors: the size and shape of the user’s screen; the time of day or day of week; the willingness of a user to explore the recommended options; the type of subscription to a paid service, which could limit the number of arbitrary interactions (e.g. the number of on-demand streams in a streaming media service) and hence push the user to explore the available options more carefully. One strategy to tackle this dependency consists of partitioning the data and estimating a separate position-bias curve for each combination of factors. Unfortunately, this solution would not scale, since (i) the number of combinations grows exponentially with the amount of contextual information, and (ii) for some combinations there might not be enough data for a sufficiently accurate estimate. On the other hand, contextual information can be encoded as features in a parametric model, and recent works [11, 13] have proposed such contextual position-bias estimators to provide examination probabilities at a query level. However, existing models present some limitations, as they either require multiple deployed rankers, or they require accurately estimating the items’ relevances, which is arguably as difficult as solving the ranking problem itself.</p>
      <p>In this work, we extend the contextual estimator from [14], requiring only a single stochastic policy to be deployed, and for which propensities are known. The contributions of the paper can be summarised as follows:</p>
      <list list-type="bullet">
        <list-item><p>We propose Policy-Aware Contextual Intervention Harvesting (PA-C-IH), a contextual position-bias estimator, which only requires propensities logged from a single stochastic policy.</p></list-item>
        <list-item><p>We empirically confirm that the position-bias curve can be accurately recovered when there is dependence on contextual information.</p></list-item>
        <list-item><p>We explore the impact of contextual position-bias estimation in off-policy evaluation, when using reward estimators relying on the PBM assumption. In particular, we show that contextual position-bias estimation can provide off-policy evaluations that are more accurate and more robust to non-stationarity in the context distribution compared to non-contextual estimation.</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>The process of selecting the best ranking policy to be deployed can be costly and time consuming. Running A/B tests to compare multiple models can negatively affect the user experience, as well as requiring operational effort and time to gather enough data. In addition, A/B testing does not scale when there are many policies to be compared; for example, when considering a large set of hyper-parameter configurations for a neural network-based policy. Off-policy evaluation greatly simplifies this process, allowing comparison of multiple policies using data logged by a previously deployed policy, without the risk of impacting the user experience. However, obtaining accurate off-policy evaluation requires methods to de-bias the estimated rewards. Many estimators have been developed over the past decades [15, 16, 17, 18]. For ranking, these estimators often rely on assumptions about users’ click behaviour [19, 9, 7].</p>
      <sec id="sec-2-1">
        <title>The process of selecting the best ranking policy to be</title>
        <p>deployed can be costly and time consuming. Running
A/B tests to compare multiple models can negatively
affect the user experience, as well as requiring operational 3. Related work
efort and time to gather enough data. In addition, A/B
testing does not scale when there are many policies to Position-bias estimation plays a central role in developing
be compared; for example, when considering a large set ranking policies for recommendation and information
of hyper-parameter configurations for a neural network- retrieval, as it provides the weights used to de-bias losses
based policy. Of-policy evaluation greatly simplifies this in of-policy training and rewards in of-policy
evaluaprocess, allowing comparison of multiple policies using tion. Diferent estimators have been proposed over the
data logged by a previously deployed policy, without the years, starting from the simplest approach proposed by
risk of impacting the user experience. However, obtain- Joachims et al. [10], which requires items to be randomly
ing accurate of-policy evaluation requires methods to swapped in order to estimate the examination
probabilde-bias the estimated rewards. Many estimators have ities. Following the PBM assumption, when uniformly
been developed over the past decades [15, 16, 17, 18]. swapping items in two positions,  and ′, the
diferFor ranking, these estimators often rely on assumptions ence in the CTR logged at those position is due to the
about users’ click behaviour [19, 9, 7]. diference in the expected examination of the positions;
hence, we have /′ = CTR/CTR′ . Pivoting on
2.1. The Position-Based Model a specific position, e.g. the first position, it is possible
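        <p>As a toy illustration of the variance issue, consider the following sketch of IPS re-weighting for a single logged action (a hypothetical example with made-up numbers, not taken from the paper):</p>
        <preformat>
# Toy IPS example: evaluate a target policy that always plays action 1,
# using data logged by a policy that plays it with probability 0.1.
# The importance weight 1/0.1 = 10 makes individual terms large, which is
# the source of the variance discussed above. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
log_prop = np.array([0.9, 0.1])          # logging policy pi_0
actions = rng.choice(2, size=n, p=log_prop)
rewards = rng.binomial(1, np.where(actions == 1, 0.5, 0.2))

weights = (actions == 1) / log_prop[actions]       # pi(a)/pi_0(a) for target
print(rewards.mean(), (rewards * weights).mean())  # logged vs. IPS estimate
        </preformat>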
        <p>Among the most popular click models, the Position-Based Model (PBM) [21, 2, 19] assumes that clicks on the ranked items are independent, and only characterized by the relevance of the item and the probability of the user examining the position where the item was displayed. Specifically, given a context x, the probability of a click on an item a in position k is</p>
        <disp-formula>P(C = 1 \mid x, a, k) = P(E = 1 \mid x, k) \, \mathrm{rel}(x, a),</disp-formula>
        <p>where E denotes the examination random variable, and rel(x, a) is the relevance of the item a given the context x (i.e. the probability of clicking on that item conditional on having observed it). The object of interest is the examination probability p_k(x) = P(E = 1 \mid x, k), and for many position-bias estimators the problem is simplified by assuming that there is no dependence of the examination probabilities on the context, reducing the problem to estimating a vector of K (the number of visible slots) probabilities p = (p_1, \ldots, p_K). Contextual position-bias estimation instead focuses on the general case, with the goal of estimating a position-bias curve p(x) for each query defined by a context vector x \in \mathcal{X}.</p>
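        <p>As a small numeric sketch of this factorization (the examination values and relevances below are made up for illustration):</p>
        <preformat>
# PBM click simulation: P(C = 1 | x, a, k) = P(E = 1 | x, k) * rel(x, a).
import numpy as np

rng = np.random.default_rng(0)
p_exam = np.array([0.9, 0.6, 0.4, 0.25, 0.15])  # examination per position k
rel = np.array([0.8, 0.1, 0.5, 0.2, 0.05])      # relevance of the shown items

click_prob = p_exam * rel   # independence of examination and relevance
clicks = rng.binomial(1, click_prob)
        </preformat>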
      </sec>
      <sec id="sec-2-2">
        <title>Another stream of research worth mentioning focuses</title>
        <p>on regression-based estimation.</p>
        <p>Wang et al. [8]
propose estimators that use Expectation-Maximization (EM),
and in [22, 13] this method was extended for contextual
position-bias estimation. The regression approach has
the advantage of not needing randomized data, nor
interventions, but at the cost of requiring accurate relevance
estimates for the ranked items. The latter requirement is
very challenging in practice, and is arguably as hard as
solving the ranking problem itself.</p>
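      <p>As a quick numeric sketch of the swap identity used by [10] (the values below are illustrative):</p>
      <preformat>
# Under the PBM, uniformly swapping two items between positions k and k'
# equalizes the expected relevance shown at both slots, so the CTR ratio
# recovers p_k / p_k'.
import numpy as np

p_true = np.array([1.0, 0.5, 0.25])  # true examination probabilities
rel_a, rel_b = 0.6, 0.2              # the two items swapped between slots 1, 2

ctr = p_true[:2] * 0.5 * (rel_a + rel_b)  # expected CTR at slots 1 and 2
print(ctr[1] / ctr[0])                    # = p_2 / p_1 = 0.5
      </preformat>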
    </sec>
    <sec id="sec-3">
      <title>4. Contextual position-bias estimator</title>
      <p>Like [14], our proposed method does not require explicit interventions, but rather harvests them from already deployed policies. The estimator in [14] requires multiple different policies; each query is served by one of the policies with a pre-defined probability. Here we propose an estimator that instead uses the propensities of a single stochastic logging policy \pi_0. For each position pair k, k′, the intervention sets are defined as</p>
      <disp-formula>I_{k,k'} = \{ (x, a) : \pi_0(a, k \mid x) \, \pi_0(a, k' \mid x) &gt; 0 \},</disp-formula>
      <p>and the logging policy \pi_0 is required to satisfy I_{k,k'} \neq \emptyset for all position pairs k \neq k′. This assumption boils down to requiring that for every context x and every pair of positions k, k′, there exists at least one action that can be displayed in both positions by the logging policy. This differs from [14], where the intervention sets consisted of items that could have been placed in both positions under the multiple logging policies. As in the case of explicit interventions, the CTRs in the intervention sets can be used to estimate the position bias, with the caveat that in this case the position bias depends on the context. For each observation in the set of click logs D = \{(x_\ell, a_\ell, k_\ell, c_\ell)\}_{\ell=1}^{n}, we can define the propensity-weighted click labels as follows:</p>
      <disp-formula>\hat{c}^{\,\ell}_{k,k'} := \mathbb{1}\{(x_\ell, a_\ell) \in I_{k,k'}\} \, \mathbb{1}\{k_\ell = k\} \, \frac{c_\ell}{\pi_0(a_\ell, k_\ell \mid x_\ell)}, \qquad
\neg\hat{c}^{\,\ell}_{k,k'} := \mathbb{1}\{(x_\ell, a_\ell) \in I_{k,k'}\} \, \mathbb{1}\{k_\ell = k\} \, \frac{1 - c_\ell}{\pi_0(a_\ell, k_\ell \mid x_\ell)}.</disp-formula>
      <p>Conditioned on the context x, in expectation \hat{c}^{\,\ell}_{k,k'} is proportional to the examination probability times the average relevance of the intervention set I_{k,k'}. These two quantities are hence modelled by two neural networks h(x, k) and e(k, k′, x) respectively, for which E[\hat{c}_{k,k'}(x, k)] = h(x, k) \, e(k, k', x) and E[\neg\hat{c}_{k,k'}(x, k)] = 1 - h(x, k) \, e(k, k', x) hold. It is worth noting that e(k, k′, x) aims at estimating the average relevance of the items that can appear in positions k and k′ under the context x, rather than trying to regress on the relevance of each item. The two neural networks can be optimized by maximizing the objective</p>
      <disp-formula>\mathcal{L}(h, e, D) = \sum_{\ell \in D} \sum_{k \neq k'} \Big( \hat{c}^{\,\ell}_{k,k'} \log\big( h(x_\ell, k) \, e(k, k', x_\ell) \big) + \neg\hat{c}^{\,\ell}_{k,k'} \log\big( 1 - h(x_\ell, k) \, e(k, k', x_\ell) \big) \Big).</disp-formula>
      <p>The contextual position-bias estimator PA-C-IH is thus \hat{p}_k(x) = h^*(x, k), where (h^*, e^*) = \arg\max_{h,e} \mathcal{L}(h, e, D). Following analogous steps to Proposition 1 in [14], it can be proven that the objective \mathcal{L} is equivalent to a weighted cross-entropy loss:</p>
      <disp-formula>\sum_{x \in D} \sum_{k \neq k'} \hat{w}_{k,k'}(x) \Big[ \hat{c}_{k,k'}(x, k) \log\big( h(x, k) \, e(k, k', x) \big) + \neg\hat{c}_{k,k'}(x, k) \log\big( 1 - h(x, k) \, e(k, k', x) \big) \Big],</disp-formula>
      <p>where</p>
      <disp-formula>\hat{w}_{k,k'}(x) := \sum_{\ell \in D} \mathbb{1}\{x_\ell = x\} \, \mathbb{1}\{(x_\ell, a_\ell) \in I_{k,k'}\}, \qquad
\hat{c}_{k,k'}(x, k) = \frac{\sum_{\ell \in D} \mathbb{1}\{x_\ell = x\} \, \hat{c}^{\,\ell}_{k,k'}}{\hat{w}_{k,k'}(x)}, \qquad
\neg\hat{c}_{k,k'}(x, k) = \frac{\sum_{\ell \in D} \mathbb{1}\{x_\ell = x\} \, \neg\hat{c}^{\,\ell}_{k,k'}}{\hat{w}_{k,k'}(x)}.</disp-formula>
      <p>Analogous to [14], in our experiments both neural networks h(x, k) and e(k, k′, x) have one hidden layer with a sigmoid activation function in order to force the output to be in the unit interval. The average relevance network e(k, k′, x) has an additional hidden layer to ensure that the output is a symmetric matrix; namely, e(k, k', x) = \frac{1}{2}\big( e_1(k, k', x) + e_1(k', k, x) \big), where e_1 denotes the output of the first layer of the network.</p>
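      <p>For concreteness, the following is a minimal PyTorch sketch of the two networks and of the propensity-weighted objective above. It is an illustrative sketch under stated assumptions (layer sizes, batching, and all identifiers are our own choices), not the exact code used in our experiments.</p>
      <preformat>
# Minimal sketch of PA-C-IH: two networks h(x, k) and e(k, k', x), trained by
# minimizing the negative of the weighted log-likelihood L(h, e, D).
# Hyper-parameters and names are illustrative assumptions.
import torch
import torch.nn as nn

K = 5      # number of display positions
D_CTX = 5  # context dimension

class Examination(nn.Module):
    """h(x, k): examination probability; sigmoid output keeps it in (0, 1)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_CTX + K, hidden), nn.Sigmoid(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, k_onehot):
        return self.net(torch.cat([x, k_onehot], dim=-1)).squeeze(-1)

class AvgRelevance(nn.Module):
    """e(k, k', x): average relevance of the intervention set, symmetrized
    so that e(k, k', x) equals e(k', k, x)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_CTX + 2 * K, hidden), nn.Sigmoid(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, k1, k2):
        e1 = self.net(torch.cat([x, k1, k2], dim=-1))
        e2 = self.net(torch.cat([x, k2, k1], dim=-1))
        return (0.5 * (e1 + e2)).squeeze(-1)

def pa_c_ih_loss(h, e, x, k, kprime, c, prop):
    """x: contexts; k, kprime: one-hot positions of a pair whose intervention
    set contains the logged (x, a); c: clicks; prop: pi_0(a, k | x)."""
    p = (h(x, k) * e(x, k, kprime)).clamp(1e-6, 1.0 - 1e-6)
    c_hat = c / prop             # propensity-weighted click label
    notc_hat = (1.0 - c) / prop  # propensity-weighted non-click label
    return -(c_hat * p.log() + notc_hat * (1.0 - p).log()).mean()
      </preformat>
      <p>After training, the position-bias estimate for a query with context x is read off as h(x, k) for each position k, as in the definition of PA-C-IH above.</p>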
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>In this section, we empirically compare our contextual estimator, PA-C-IH, against its non-contextual counterpart, PA-IH [12]. We use synthetic data consisting of 200K queries, with 5 items to be ranked, of which two are relevant. Each query is described by a context vector x \in \mathbb{R}^5 sampled from a mixture of three Gaussian distributions N(\mu_i, 0.1) with cluster means \mu_1 = (0, 1, -1, 0, 0.5), \mu_2 = (1, 0.2, -0.2, 0.2, 1), \mu_3 = (0.2, 0, 1, 0.3, -0.4), and mixture weights (w_1, w_2, w_3) = (0.3, 0.3, 0.4). Following the experimental setup in [14, 13], the examination probabilities for position k, given context x, are defined as</p>
      <disp-formula>P(E = 1 \mid x, k) = \frac{1}{k^{\max(0, \langle \theta, x \rangle + 1)}}.</disp-formula>
      <p>The parameter \theta \in \mathbb{R}^5 determines the dependence of the examination probability on the context. Its entries are sampled from Uniform(-0.5, 0.5), and are then fixed for all queries. The logging policy is a deterministic policy selecting the same ranking for all queries, perturbed by random swaps such that each item maintains its original rank with probability 0.55, or with probability 0.45 is swapped uniformly at random with one of the other items. Clicks are generated according to the contextual PBM.</p>
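      <p>A minimal sketch of this data-generation process is given below; it follows the description above, but the reading of N(\mu_i, 0.1) as having variance 0.1 and all function names are our own assumptions.</p>
      <preformat>
# Sketch of the synthetic data generator: Gaussian-mixture contexts,
# contextual examination curve, and a randomly perturbed logging ranking.
import numpy as np

rng = np.random.default_rng(0)
K = 5  # items / display positions
MEANS = np.array([[0.0, 1.0, -1.0, 0.0, 0.5],
                  [1.0, 0.2, -0.2, 0.2, 1.0],
                  [0.2, 0.0, 1.0, 0.3, -0.4]])
WEIGHTS = [0.3, 0.3, 0.4]
THETA = rng.uniform(-0.5, 0.5, size=5)  # fixed for all queries

def sample_context():
    cluster = rng.choice(3, p=WEIGHTS)
    return rng.normal(MEANS[cluster], np.sqrt(0.1))  # 0.1 read as variance

def examination(x):
    # P(E = 1 | x, k) = 1 / k ** max(0, dot(theta, x) + 1), for k = 1..K
    expo = max(0.0, float(THETA @ x) + 1.0)
    return 1.0 / np.arange(1, K + 1) ** expo

def logged_ranking():
    # Deterministic base ranking; each item keeps its rank with prob. 0.55,
    # otherwise it is swapped uniformly at random with another item.
    rank = np.arange(K)
    for i in range(K):
        if rng.choice([True, False], p=[0.45, 0.55]):
            j = rng.choice([t for t in range(K) if t != i])
            rank[i], rank[j] = rank[j], rank[i]
    return rank

def sample_clicks(x, rel_by_position):
    # Contextual PBM: a click requires examination and relevance.
    return rng.binomial(1, examination(x) * rel_by_position)
      </preformat>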
      <sec id="sec-4-1">
        <title>5.1. Position-bias curve estimation</title>
        <p>In order to estimate the position-bias curve, we first tune the hyper-parameters of the two estimators: the optimization parameters for PA-IH and PA-C-IH, and the number of hidden layers for the two neural networks in PA-C-IH. Figure 1 qualitatively shows that the contextual position-bias estimator is able to recover the position-bias curve in each cluster by using the context information, while the non-contextual estimator only fits a position-bias curve that averages across the clusters’ position-bias curves.</p>
        <p>To quantify the accuracy of the position-bias estimates, we compute the relative error</p>
        <disp-formula>\mathrm{RelError}(\hat{p}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{K} \sum_{k=1}^{K} \left| 1 - \frac{\hat{p}_k(x_i)}{p_k(x_i)} \right|,</disp-formula>
        <p>where n is the number of queries, K is the number of slots in the displayed ranking, and p_k(x_i) and \hat{p}_k(x_i) are the true and estimated examination probabilities for position k in request i with context x_i, respectively. Since position-bias estimates are used in off-policy training and evaluation as inverse propensity scores, this metric, as it uses ratios instead of absolute values, better quantifies how the accuracy of position-bias estimation would affect the accuracy of off-policy evaluation. Table 1 shows the relative error of PA-C-IH and PA-IH on the synthetic data, showing that the contextual position-bias estimator can lead to significantly improved accuracy.</p>
        <table-wrap id="tbl-1">
          <label>Table 1</label>
          <caption><p>Relative error of the position-bias estimates on the synthetic data, with 95% confidence intervals.</p></caption>
          <table>
            <thead>
              <tr><th>Estimator</th><th>RelError</th><th>95% CI</th></tr>
            </thead>
            <tbody>
              <tr><td>PA-C-IH</td><td>0.0556</td><td>[0.0556, 0.0557]</td></tr>
              <tr><td>PA-IH</td><td>0.3434</td><td>[0.3427, 0.3443]</td></tr>
            </tbody>
          </table>
        </table-wrap>
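        <p>The metric translates directly into code; a minimal sketch (the array shapes are our own convention):</p>
        <preformat>
# RelError: mean absolute relative deviation of the estimated curve.
import numpy as np

def rel_error(p_hat, p_true):
    """p_hat, p_true: arrays of shape (n, K) with estimated and true
    examination probabilities per query and position."""
    return np.mean(np.abs(1.0 - p_hat / p_true))
        </preformat>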
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Off-policy evaluation</title>
        <p>Among the off-policy estimators developed in the literature (see [9] for a comprehensive overview), an unbiased reward estimator that leverages the PBM assumption in off-policy evaluation is given by</p>
        <disp-formula>\hat{R}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{a} c_i(a) \, \frac{\langle p(x_i), \pi(\cdot, a \mid x_i) \rangle}{\langle p(x_i), \pi_0(\cdot, a \mid x_i) \rangle}, \qquad (1)</disp-formula>
        <p>where \pi and \pi_0 are the target and logging policies, respectively; c_i(a) is the reward logged for item a; p(x_i) is the position-bias curve for the request with context x_i; and \pi(\cdot, a \mid x) denotes the vector of propensities for ranking action a at each of the K positions, given context x. We use this reward estimator below to compare different position-bias estimators for off-policy evaluation.</p>
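        <p>A minimal NumPy sketch of the estimator in Eq. (1) follows; the tensor layout (per-item propensity vectors over the K positions) and all names are assumptions made for illustration:</p>
        <preformat>
# PBM reward estimator of Eq. (1).
import numpy as np

def pbm_reward_estimate(clicks, p_ctx, pi_target, pi_logging):
    """clicks: (n, m) logged rewards c_i(a); p_ctx: (n, K) curves p(x_i);
    pi_target, pi_logging: (n, m, K) propensities of placing each item
    at each of the K positions under the target and logging policies."""
    # dot(p(x_i), pi(., a | x_i)): expected examination of item a
    num = (p_ctx[:, None, :] * pi_target).sum(axis=2)
    den = (p_ctx[:, None, :] * pi_logging).sum(axis=2)
    return np.mean(np.sum(clicks * num / den, axis=1))
        </preformat>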
        <sec id="sec-4-1-1">
          <title>5.2.1. Stationary environment</title>
          <p>In the first experiment, off-policy evaluation was run on the same data source used for position-bias estimation. This setting is realistic under the assumption that the environment does not change over time: under stationarity, one can expect that a position-bias curve estimated on past data will still be valid when used in the future for off-policy training and evaluation. The target policy to be evaluated here is a deterministic policy that selects among three different rankings, serving the same ranking for all queries within the same cluster. Figure 2 shows the reward estimated on the different clusters, and on the full data set. The PA-C-IH estimator provides much more accurate reward estimates for each of the clusters, as well as for the overall data set, compared to PA-IH, which suffers from the bias introduced by using a single, non-contextual position-bias curve when examination probabilities are in fact contextual.</p>
        </sec>
        <sec id="sec-4-2-1">
          <title>5.2.2. Non-stationary environment</title>
          <p>A less restrictive, and more realistic, setting is one where the distribution of queries shifts over time. Position-bias estimation requires data to be collected from a randomized policy, without interventions that can affect the accuracy of the logged propensities (e.g. promotion rules that alter the ranking produced by the policy, thereby invalidating the logged propensities). Such requirements could be difficult to fulfill in real-world applications, thus preventing us from collecting a constant stream of randomized data with which to update position-bias estimates. In addition, it is reasonable to assume that shifts in the context distribution can occur over time, for instance a change in the distribution of devices used, or of users adhering to different subscription plans, or more generally the non-stationarity induced by the launch of a new user interface. It is therefore interesting to analyze how robust position-bias estimators are under non-stationarity when used for off-policy evaluation. In the synthetic experiment presented, we induce non-stationarity by using a second data set, generated using the same procedure as the data used for position-bias estimation, but with different cluster proportions: while in the training data the cluster weights are (0.3, 0.3, 0.4), in the test data they are set to (0.15, 0.1, 0.75). In order to isolate the effect of non-stationarity in the context distribution, we evaluate a simpler policy than the one used in the previous experiment. Here, the target policy is the deterministic version of the data generation policy, namely the logging policy without the random swaps used in the data generation step. It is worth recalling that this target policy always selects the same ranking regardless of context.</p>
          <p>Figure 3 shows the error in the off-policy estimation on the test data, using PA-IH and PA-C-IH position-bias curves estimated on the training data with a different cluster distribution. [Figure 3: Off-policy evaluation on the synthetic data under non-stationarity using position-bias estimates from PA-C-IH and PA-IH. Bias of the estimated rewards with 95% CI is reported for each cluster and for the full data.] PA-C-IH proves to be robust to such distribution shifts, providing more accurate position-bias estimates, which translate into more accurate off-policy reward estimates, both within clusters and on the full data set. PA-IH, on the other hand, estimates an overall average position-bias curve, which does not reflect the actual average position-bias curve due to the context distribution shift between the two data sets.</p>
          <p>In this experiment the rankings selected by the logging and target policies do not depend on the context. Yet even in this very simple scenario, if the position bias is contextual, a shift in the context distribution can cause systematic bias in off-policy evaluation when using a non-contextual position-bias estimator.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>We have proposed a new contextual position-bias estimator, PA-C-IH, which does not require multiple rankers to be deployed, but rather a single stochastic ranker for which propensities are known. The latter is commonly adopted in recommender systems in order to ensure a certain level of exploration [23, 24, 25, 12], and our estimator exploits the randomness of the logging policy to provide a contextual estimate of the position-bias curve. We have empirically shown that the PA-C-IH estimator provides better position-bias estimates (compared to a non-contextual estimator) when there is dependence on contextual information, and we explored the impact this can have on off-policy evaluation. We further demonstrated how PA-C-IH can yield more robust off-policy estimates in the presence of non-stationary distributions.</p>
      <p>As part of future work, there are several directions that can be investigated: (i) extend the evaluation of our methods to real-world data; (ii) assess the impact of our estimator in off-policy training of LTR algorithms [26, 6]; (iii) generalize our approach to incorporate other types of click noise, such as trust bias.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] T. Joachims, L. Granka, B. Pan, H. Hembrooke, G. Gay, Accurately interpreting clickthrough data as implicit feedback, in: ACM SIGIR Forum, volume 51, ACM, New York, NY, USA, 2017, pp. 4–11.</p>
      <p>[2] N. Craswell, O. Zoeter, M. Taylor, B. Ramsey, An experimental comparison of click position-bias models, in: Proceedings of the 2008 International Conference on Web Search and Data Mining, 2008, pp. 87–94.</p>
      <p>[3] F. Guo, C. Liu, Y. M. Wang, Efficient multiple-click models in web search, in: Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009, pp. 124–131.</p>
      <p>[4] G. E. Dupret, B. Piwowarski, A user browsing model to predict search engine click data from past observations, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 331–338.</p>
      <p>[5] O. Chapelle, Y. Zhang, A dynamic bayesian network click model for web search ranking, in: Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 1–10.</p>
      <p>[6] A. Agarwal, K. Takatsu, I. Zaitsev, T. Joachims, A general framework for counterfactual learning-to-rank, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 5–14.</p>
      <p>[7] H. Oosterhuis, M. de Rijke, Policy-aware unbiased learning to rank for top-k rankings, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 489–498.</p>
      <p>[8] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, M. Najork, Position bias estimation for unbiased learning to rank in personal search, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 610–618.</p>
      <p>[9] S. Li, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vinay, Z. Wen, Offline evaluation of ranking policies with click models, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, KDD ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1685–1694. doi:10.1145/3219819.3220028.</p>
      <p>[10] T. Joachims, A. Swaminathan, T. Schnabel, Unbiased learning-to-rank with biased feedback, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 781–789.</p>
      <p>[11] A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, T. Joachims, Estimating position bias without intrusive interventions, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 474–482.</p>
      <p>[12] M. Ruffini, V. Bellini, A. Buchholz, G. Di Benedetto, Y. Stein, Modeling position bias ranking for streaming media services, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 72–76.</p>
      <p>[13] O. B. Mayor, V. Bellini, A. Buchholz, G. Di Benedetto, D. M. Granziol, M. Ruffini, Y. Stein, Ranker-agnostic contextual position bias estimation, arXiv preprint arXiv:2107.13327 (2021).</p>
      <p>[14] Z. Fang, A. Agarwal, T. Joachims, Intervention harvesting for context-dependent examination-bias estimation, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 825–834.</p>
      <p>[15] E. L. Ionides, Truncated importance sampling, Journal of Computational and Graphical Statistics 17 (2008) 295–311.</p>
      <p>[16] M. Dudík, J. Langford, L. Li, Doubly robust policy evaluation and learning, arXiv preprint arXiv:1103.4601 (2011).</p>
      <p>[17] M. Farajtabar, Y. Chow, M. Ghavamzadeh, More robust doubly robust off-policy evaluation, in: International Conference on Machine Learning, PMLR, 2018, pp. 1447–1456.</p>
      <p>[18] A. Swaminathan, T. Joachims, The self-normalized estimator for counterfactual learning, Advances in Neural Information Processing Systems 28 (2015).</p>
      <p>[19] A. Chuklin, I. Markov, M. de Rijke, Click models for web search, Springer Nature, 2022.</p>
      <p>[20] G. W. Imbens, D. B. Rubin, Causal inference in statistics, social, and biomedical sciences, Cambridge University Press, 2015.</p>
      <p>[21] M. Richardson, E. Dominowska, R. Ragno, Predicting clicks: estimating the click-through rate for new ads, in: Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 521–530.</p>
      <p>[22] Z. Qin, S. J. Chen, D. Metzler, Y. Noh, J. Qin, X. Wang, Attribute-based propensity for unbiased learning in recommender systems: Algorithm and case studies, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 2359–2367.</p>
      <p>[23] K. Hofmann, S. Whiteson, M. de Rijke, Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval, Information Retrieval 16 (2013) 63–90.</p>
      <p>[24] B. Ermis, P. Ernst, Y. Stein, G. Zappella, Learning to rank in the position based model with bandit feedback, in: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management, 2020, pp. 2405–2412.</p>
      <p>[25] J. McInerney, B. Lacker, S. Hansen, K. Higley, H. Bouchard, A. Gruson, R. Mehrotra, Explore, exploit, and explain: personalizing explainable recommendations with bandits, in: Proceedings of the 12th ACM Conference on Recommender Systems, 2018, pp. 31–39.</p>
      <p>[26] K. Xiao, X. Cao, P. Huang, S. Chen, X. Zhou, Y. Xian, Learning-to-rank with context-aware position debiasing (2018).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>