<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Contextual Position Bias Estimation Using a Single Stochastic Logging Policy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Di Benedetto</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Buchholz</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben London</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matej Jakimov</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannik Stein</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Malte Lichtenberg</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Bellini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Ruffini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thorsten Joachims</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cornell University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Addressing position bias is of pivotal importance for performing unbiased off-policy training and evaluation in Learning To Rank (LTR). This requires accurate estimates of the probabilities of the users examining the slots where items are displayed, which in many applications are likely to depend on multiple factors, e.g. the screen size. This leads to a position-bias curve that is no longer constant, but depends on the context. Existing position-bias estimators are either non-contextual or require multiple deployed ranking policies. We propose a novel contextual position-bias estimator that only requires propensities logged from a single stochastic logging policy. Empirical evaluations assess the accuracy of the model in recovering the position-bias curves as well as the impact on off-policy evaluation, showing how a contextual position-bias estimator can deliver better reward estimates that are more robust to non-stationarity compared to a non-contextual one.</p>
      </abstract>
      <kwd-group>
        <kwd>Position-based model</kwd>
        <kwd>contextual position bias</kwd>
        <kwd>off-policy evaluation</kwd>
        <kwd>non-stationarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recommender systems have large catalogs from which to source content to users, and users are usually served with a list of items from which they can choose which items to consume. Optimizing the ranking of presented items heavily impacts the success of recommendation, since users typically only interact with items at the top of a ranking. Industrial systems can leverage vast quantities of past user interactions, which can be used to train new ranking policies and evaluate them offline, before deploying them. Most of the time, these logged interactions only provide implicit feedback that is subject to different sources of bias [1], which need to be addressed both in training and evaluation. For instance, when considering clicks, which arguably constitute the most abundant signal in recommender systems, one cannot directly interpret a non-click as the user not being interested in the recommended item. In fact, when users are presented a list of items to interact with, they can only click on items that the production policy decided to present to the user (i.e., selection bias), and they are more likely to examine top positions than bottom ones (i.e., position bias) [2]. These biases can be addressed using click models that describe how the user interacts with the recommended items [3, 4, 5]. By incorporating these modelling assumptions, we can perform unbiased off-policy training [6, 7, 8] and evaluation [9].</p>
      <p>One of the most popular click models is the position-based model, which models the click as the realisation of two independent events: examination of the position and relevance of the item (see Section 2.1 for more details). To apply this click model to off-policy training and evaluation, one must estimate the vector of examination probabilities for each displayed position, also called the position-bias curve. The first methods that appeared in the literature provided estimators for a single position-bias curve to be used for every query [10, 11, 12]. However, in many applications, the examination probabilities are influenced by many factors: the size and shape of the user’s screen; the time of day or day of week; the willingness of a user to explore the recommended options; the type of subscription to a paid service, which could limit the number of arbitrary interactions (e.g. the number of on-demand streams in a streaming media service) and hence push the user to explore the available options more carefully. One strategy to tackle this dependency consists of partitioning the data and estimating a separate position-bias curve for each combination of factors. Unfortunately, this solution would not scale, since (i) the number of combinations grows exponentially with the amount of contextual information, and (ii) for some combinations there might not be enough data for a sufficiently accurate estimate. On the other hand, contextual information can be encoded as features in a parametric model, and recent works [11, 13] have proposed such contextual position-bias estimators to provide examination probabilities at a query level. However, existing models present some limitations, as they either require multiple deployed rankers, or they require accurately estimating the items’ relevances, which is arguably as difficult as solving the ranking problem itself.</p>
      <p>In this work, we extend the contextual estimator from [14], requiring only a single stochastic policy to be deployed, and for which propensities are known. The contributions of the paper can be summarised as follows:</p>
      <list list-type="bullet">
        <list-item><p>We propose Policy-Aware Contextual Intervention Harvesting (PA-C-IH), a contextual position-bias estimator, which only requires propensities logged from a single stochastic policy.</p></list-item>
        <list-item><p>We empirically confirm that the position-bias curve can be accurately recovered when there is dependence on contextual information.</p></list-item>
        <list-item><p>We explore the impact of contextual position-bias estimation in off-policy evaluation, when using reward estimators relying on the PBM assumption. In particular, we show that contextual position-bias estimation can provide off-policy evaluations that are more accurate and more robust to non-stationarity in the context distribution compared to non-contextual estimation.</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>The process of selecting the best ranking policy to be deployed can be costly and time consuming. Running A/B tests to compare multiple models can negatively affect the user experience, as well as requiring operational effort and time to gather enough data. In addition, A/B testing does not scale when there are many policies to be compared; for example, when considering a large set of hyper-parameter configurations for a neural network-based policy. Off-policy evaluation greatly simplifies this process, allowing comparison of multiple policies using data logged by a previously deployed policy, without the risk of impacting the user experience. However, obtaining accurate off-policy evaluation requires methods to de-bias the estimated rewards. Many estimators have been developed over the past decades [15, 16, 17, 18]. For ranking, these estimators often rely on assumptions about users’ click behaviour [19, 9, 7].</p>
      <sec id="sec-2-1">
        <title>The process of selecting the best ranking policy to be</title>
        <p>deployed can be costly and time consuming. Running
A/B tests to compare multiple models can negatively
affect the user experience, as well as requiring operational 3. Related work
efort and time to gather enough data. In addition, A/B
testing does not scale when there are many policies to Position-bias estimation plays a central role in developing
be compared; for example, when considering a large set ranking policies for recommendation and information
of hyper-parameter configurations for a neural network- retrieval, as it provides the weights used to de-bias losses
based policy. Of-policy evaluation greatly simplifies this in of-policy training and rewards in of-policy
evaluaprocess, allowing comparison of multiple policies using tion. Diferent estimators have been proposed over the
data logged by a previously deployed policy, without the years, starting from the simplest approach proposed by
risk of impacting the user experience. However, obtain- Joachims et al. [10], which requires items to be randomly
ing accurate of-policy evaluation requires methods to swapped in order to estimate the examination
probabilde-bias the estimated rewards. Many estimators have ities. Following the PBM assumption, when uniformly
been developed over the past decades [15, 16, 17, 18]. swapping items in two positions,  and ′, the
diferFor ranking, these estimators often rely on assumptions ence in the CTR logged at those position is due to the
about users’ click behaviour [19, 9, 7]. diference in the expected examination of the positions;
hence, we have /′ = CTR/CTR′ . Pivoting on
2.1. The Position-Based Model a specific position, e.g. the first position, it is possible
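        <p>As a toy illustration of the variance issue, consider the following sketch of IPS re-weighting for a single logged action (a hypothetical example with made-up numbers, not taken from the paper):</p>
        <preformat>
# Toy IPS example: evaluate a target policy that always plays action 1,
# using data logged by a policy that plays it with probability 0.1.
# The importance weight 1/0.1 = 10 makes individual terms large, which is
# the source of the variance discussed above. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
log_prop = np.array([0.9, 0.1])          # logging policy pi_0
actions = rng.choice(2, size=n, p=log_prop)
rewards = rng.binomial(1, np.where(actions == 1, 0.5, 0.2))

weights = (actions == 1) / log_prop[actions]       # pi(a)/pi_0(a) for target
print(rewards.mean(), (rewards * weights).mean())  # logged vs. IPS estimate
        </preformat>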
        <p>Among the most popular click models, the Position-Based Model (PBM) [21, 2, 19] assumes that clicks on the ranked items are independent, and only characterized by the relevance of the item and the probability of the user examining the position where the item was displayed. Specifically, given a context x, the probability of a click on an item a in position k is</p>
        <disp-formula>P(C = 1 \mid x, a, k) = P(E = 1 \mid x, k) \, \mathrm{rel}(x, a),</disp-formula>
        <p>where E denotes the examination random variable, and rel(x, a) is the relevance of the item a given the context x (i.e. the probability of clicking on that item conditional on having observed it). The object of interest is the examination probability p_k(x) = P(E = 1 \mid x, k), and for many position-bias estimators the problem is simplified by assuming that there is no dependence of the examination probabilities on the context, reducing the problem to estimating a vector of K (the number of visible slots) probabilities p = (p_1, \ldots, p_K). Contextual position-bias estimation instead focuses on the general case, with the goal of estimating a position-bias curve p(x) for each query defined by a context vector x \in \mathcal{X}.</p>
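        <p>As a small numeric sketch of this factorization (the examination values and relevances below are made up for illustration):</p>
        <preformat>
# PBM click simulation: P(C = 1 | x, a, k) = P(E = 1 | x, k) * rel(x, a).
import numpy as np

rng = np.random.default_rng(0)
p_exam = np.array([0.9, 0.6, 0.4, 0.25, 0.15])  # examination per position k
rel = np.array([0.8, 0.1, 0.5, 0.2, 0.05])      # relevance of the shown items

click_prob = p_exam * rel   # independence of examination and relevance
clicks = rng.binomial(1, click_prob)
        </preformat>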
      </sec>
      <sec id="sec-2-2">
        <title>Another stream of research worth mentioning focuses</title>
        <p>on regression-based estimation.</p>
        <p>Wang et al. [8]
propose estimators that use Expectation-Maximization (EM),
and in [22, 13] this method was extended for contextual
position-bias estimation. The regression approach has
the advantage of not needing randomized data, nor
interventions, but at the cost of requiring accurate relevance
estimates for the ranked items. The latter requirement is
very challenging in practice, and is arguably as hard as
solving the ranking problem itself.</p>
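      <p>As a quick numeric sketch of the swap identity used by [10] (the values below are illustrative):</p>
      <preformat>
# Under the PBM, uniformly swapping two items between positions k and k'
# equalizes the expected relevance shown at both slots, so the CTR ratio
# recovers p_k / p_k'.
import numpy as np

p_true = np.array([1.0, 0.5, 0.25])  # true examination probabilities
rel_a, rel_b = 0.6, 0.2              # the two items swapped between slots 1, 2

ctr = p_true[:2] * 0.5 * (rel_a + rel_b)  # expected CTR at slots 1 and 2
print(ctr[1] / ctr[0])                    # = p_2 / p_1 = 0.5
      </preformat>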
    </sec>
    <sec id="sec-3">
      <title>4. Contextual position-bias estimator</title>
      <p>Like [14], our proposed method does not require explicit interventions, but rather harvests them from already deployed policies. The estimator in [14] requires multiple different policies; each query is served by one of the policies with a pre-defined probability. Here we propose an estimator that instead uses the propensities of a single stochastic logging policy \pi_0. For each position pair k, k′, the intervention sets are defined as</p>
      <disp-formula>I_{k,k'} = \{ (x, a) : \pi_0(a, k \mid x) \, \pi_0(a, k' \mid x) &gt; 0 \},</disp-formula>
      <p>and the logging policy \pi_0 is required to satisfy I_{k,k'} \neq \emptyset for all position pairs k \neq k′. This assumption boils down to requiring that for every context x and every pair of positions k, k′, there exists at least one action that can be displayed in both positions by the logging policy. This differs from [14], where the intervention sets consisted of items that could have been placed in both positions under the multiple logging policies. As in the case of explicit interventions, the CTRs in the intervention sets can be used to estimate the position bias, with the caveat that in this case the position bias depends on the context. For each observation in the set of click logs D = \{(x_\ell, a_\ell, k_\ell, c_\ell)\}_{\ell=1}^{n}, we can define the propensity-weighted click labels as follows:</p>
      <disp-formula>\hat{c}^{\,\ell}_{k,k'} := \mathbb{1}\{(x_\ell, a_\ell) \in I_{k,k'}\} \, \mathbb{1}\{k_\ell = k\} \, \frac{c_\ell}{\pi_0(a_\ell, k_\ell \mid x_\ell)}, \qquad
\neg\hat{c}^{\,\ell}_{k,k'} := \mathbb{1}\{(x_\ell, a_\ell) \in I_{k,k'}\} \, \mathbb{1}\{k_\ell = k\} \, \frac{1 - c_\ell}{\pi_0(a_\ell, k_\ell \mid x_\ell)}.</disp-formula>
      <p>Conditioned on the context x, in expectation \hat{c}^{\,\ell}_{k,k'} is proportional to the examination probability times the average relevance of the intervention set I_{k,k'}. These two quantities are hence modelled by two neural networks h(x, k) and e(k, k′, x) respectively, for which E[\hat{c}_{k,k'}(x, k)] = h(x, k) \, e(k, k', x) and E[\neg\hat{c}_{k,k'}(x, k)] = 1 - h(x, k) \, e(k, k', x) hold. It is worth noting that e(k, k′, x) aims at estimating the average relevance of the items that can appear in positions k and k′ under the context x, rather than trying to regress on the relevance of each item. The two neural networks can be optimized by maximizing the objective</p>
      <disp-formula>\mathcal{L}(h, e, D) = \sum_{\ell \in D} \sum_{k \neq k'} \Big( \hat{c}^{\,\ell}_{k,k'} \log\big( h(x_\ell, k) \, e(k, k', x_\ell) \big) + \neg\hat{c}^{\,\ell}_{k,k'} \log\big( 1 - h(x_\ell, k) \, e(k, k', x_\ell) \big) \Big).</disp-formula>
      <p>The contextual position-bias estimator PA-C-IH is thus \hat{p}_k(x) = h^*(x, k), where (h^*, e^*) = \arg\max_{h,e} \mathcal{L}(h, e, D). Following analogous steps to Proposition 1 in [14], it can be proven that the objective \mathcal{L} is equivalent to a weighted cross-entropy loss:</p>
      <disp-formula>\sum_{x \in D} \sum_{k \neq k'} \hat{w}_{k,k'}(x) \Big[ \hat{c}_{k,k'}(x, k) \log\big( h(x, k) \, e(k, k', x) \big) + \neg\hat{c}_{k,k'}(x, k) \log\big( 1 - h(x, k) \, e(k, k', x) \big) \Big],</disp-formula>
      <p>where</p>
      <disp-formula>\hat{w}_{k,k'}(x) := \sum_{\ell \in D} \mathbb{1}\{x_\ell = x\} \, \mathbb{1}\{(x_\ell, a_\ell) \in I_{k,k'}\}, \qquad
\hat{c}_{k,k'}(x, k) = \frac{\sum_{\ell \in D} \mathbb{1}\{x_\ell = x\} \, \hat{c}^{\,\ell}_{k,k'}}{\hat{w}_{k,k'}(x)}, \qquad
\neg\hat{c}_{k,k'}(x, k) = \frac{\sum_{\ell \in D} \mathbb{1}\{x_\ell = x\} \, \neg\hat{c}^{\,\ell}_{k,k'}}{\hat{w}_{k,k'}(x)}.</disp-formula>
      <p>Analogous to [14], in our experiments both neural networks h(x, k) and e(k, k′, x) have one hidden layer with a sigmoid activation function in order to force the output to be in the unit interval. The average relevance network e(k, k′, x) has an additional hidden layer to ensure that the output is a symmetric matrix; namely, e(k, k', x) = \frac{1}{2}\big( e_1(k, k', x) + e_1(k', k, x) \big), where e_1 denotes the output of the first layer of the network.</p>
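      <p>For concreteness, the following is a minimal PyTorch sketch of the two networks and of the propensity-weighted objective above. It is an illustrative sketch under stated assumptions (layer sizes, batching, and all identifiers are our own choices), not the exact code used in our experiments.</p>
      <preformat>
# Minimal sketch of PA-C-IH: two networks h(x, k) and e(k, k', x), trained by
# minimizing the negative of the weighted log-likelihood L(h, e, D).
# Hyper-parameters and names are illustrative assumptions.
import torch
import torch.nn as nn

K = 5      # number of display positions
D_CTX = 5  # context dimension

class Examination(nn.Module):
    """h(x, k): examination probability; sigmoid output keeps it in (0, 1)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_CTX + K, hidden), nn.Sigmoid(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, k_onehot):
        return self.net(torch.cat([x, k_onehot], dim=-1)).squeeze(-1)

class AvgRelevance(nn.Module):
    """e(k, k', x): average relevance of the intervention set, symmetrized
    so that e(k, k', x) equals e(k', k, x)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_CTX + 2 * K, hidden), nn.Sigmoid(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, k1, k2):
        e1 = self.net(torch.cat([x, k1, k2], dim=-1))
        e2 = self.net(torch.cat([x, k2, k1], dim=-1))
        return (0.5 * (e1 + e2)).squeeze(-1)

def pa_c_ih_loss(h, e, x, k, kprime, c, prop):
    """x: contexts; k, kprime: one-hot positions of a pair whose intervention
    set contains the logged (x, a); c: clicks; prop: pi_0(a, k | x)."""
    p = (h(x, k) * e(x, k, kprime)).clamp(1e-6, 1.0 - 1e-6)
    c_hat = c / prop             # propensity-weighted click label
    notc_hat = (1.0 - c) / prop  # propensity-weighted non-click label
    return -(c_hat * p.log() + notc_hat * (1.0 - p).log()).mean()
      </preformat>
      <p>After training, the position-bias estimate for a query with context x is read off as h(x, k) for each position k, as in the definition of PA-C-IH above.</p>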
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>In this section, we empirically compare our contextual estimator, PA-C-IH, against its non-contextual counterpart, PA-IH [12]. We use synthetic data consisting of 200K queries, with 5 items to be ranked, of which two are relevant. Each query is described by a context vector x \in \mathbb{R}^5 sampled from a mixture of three Gaussian distributions N(\mu_i, 0.1) with cluster means \mu_1 = (0, 1, -1, 0, 0.5), \mu_2 = (1, 0.2, -0.2, 0.2, 1), \mu_3 = (0.2, 0, 1, 0.3, -0.4), and mixture weights (w_1, w_2, w_3) = (0.3, 0.3, 0.4). Following the experimental setup in [14, 13], the examination probabilities for position k, given context x, are defined as</p>
      <disp-formula>P(E = 1 \mid x, k) = \frac{1}{k^{\max(0, \langle \theta, x \rangle + 1)}}.</disp-formula>
      <p>The parameter \theta \in \mathbb{R}^5 determines the dependence of the examination probability on the context. Its entries are sampled from Uniform(-0.5, 0.5), and are then fixed for all queries. The logging policy is a deterministic policy selecting the same ranking for all queries, perturbed by random swaps such that each item maintains its original rank with probability 0.55, or with probability 0.45 is swapped uniformly at random with one of the other items. Clicks are generated according to the contextual PBM.</p>
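      <p>A minimal sketch of this data-generation process is given below; it follows the description above, but the reading of N(\mu_i, 0.1) as having variance 0.1 and all function names are our own assumptions.</p>
      <preformat>
# Sketch of the synthetic data generator: Gaussian-mixture contexts,
# contextual examination curve, and a randomly perturbed logging ranking.
import numpy as np

rng = np.random.default_rng(0)
K = 5  # items / display positions
MEANS = np.array([[0.0, 1.0, -1.0, 0.0, 0.5],
                  [1.0, 0.2, -0.2, 0.2, 1.0],
                  [0.2, 0.0, 1.0, 0.3, -0.4]])
WEIGHTS = [0.3, 0.3, 0.4]
THETA = rng.uniform(-0.5, 0.5, size=5)  # fixed for all queries

def sample_context():
    cluster = rng.choice(3, p=WEIGHTS)
    return rng.normal(MEANS[cluster], np.sqrt(0.1))  # 0.1 read as variance

def examination(x):
    # P(E = 1 | x, k) = 1 / k ** max(0, dot(theta, x) + 1), for k = 1..K
    expo = max(0.0, float(THETA @ x) + 1.0)
    return 1.0 / np.arange(1, K + 1) ** expo

def logged_ranking():
    # Deterministic base ranking; each item keeps its rank with prob. 0.55,
    # otherwise it is swapped uniformly at random with another item.
    rank = np.arange(K)
    for i in range(K):
        if rng.choice([True, False], p=[0.45, 0.55]):
            j = rng.choice([t for t in range(K) if t != i])
            rank[i], rank[j] = rank[j], rank[i]
    return rank

def sample_clicks(x, rel_by_position):
    # Contextual PBM: a click requires examination and relevance.
    return rng.binomial(1, examination(x) * rel_by_position)
      </preformat>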
      <sec id="sec-4-1">
        <title>5.1. Position-bias curve estimation</title>
        <p>In order to estimate the position-bias curve, we first tune the hyper-parameters of the two estimators: the optimization parameters for PA-IH and PA-C-IH, and the number of hidden layers for the two neural networks in PA-C-IH. Figure 1 qualitatively shows that the contextual position-bias estimator is able to recover the position-bias curve in each cluster by using the context information, while the non-contextual estimator only fits a position-bias curve that averages across the clusters’ position-bias curves.</p>
        <p>To quantify the accuracy of the position-bias estimates, we compute the relative error</p>
        <disp-formula>\mathrm{RelError}(\hat{p}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{K} \sum_{k=1}^{K} \left| 1 - \frac{\hat{p}_k(x_i)}{p_k(x_i)} \right|,</disp-formula>
        <p>where n is the number of queries, K is the number of slots in the displayed ranking, and p_k(x_i) and \hat{p}_k(x_i) are the true and estimated examination probabilities for position k in request i with context x_i, respectively. Since position-bias estimates are used in off-policy training and evaluation as inverse propensity scores, this metric, as it uses ratios instead of absolute values, better quantifies how the accuracy of position-bias estimation would affect the accuracy of off-policy evaluation. Table 1 shows the relative error of PA-C-IH and PA-IH on the synthetic data, showing that the contextual position-bias estimator can lead to significantly improved accuracy.</p>
        <table-wrap id="tbl-1">
          <label>Table 1</label>
          <caption><p>Relative error of the position-bias estimates on the synthetic data, with 95% confidence intervals.</p></caption>
          <table>
            <thead>
              <tr><th>Estimator</th><th>RelError</th><th>95% CI</th></tr>
            </thead>
            <tbody>
              <tr><td>PA-C-IH</td><td>0.0556</td><td>[0.0556, 0.0557]</td></tr>
              <tr><td>PA-IH</td><td>0.3434</td><td>[0.3427, 0.3443]</td></tr>
            </tbody>
          </table>
        </table-wrap>
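        <p>The metric translates directly into code; a minimal sketch (the array shapes are our own convention):</p>
        <preformat>
# RelError: mean absolute relative deviation of the estimated curve.
import numpy as np

def rel_error(p_hat, p_true):
    """p_hat, p_true: arrays of shape (n, K) with estimated and true
    examination probabilities per query and position."""
    return np.mean(np.abs(1.0 - p_hat / p_true))
        </preformat>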
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Off-policy evaluation</title>
        <p>Among the off-policy estimators developed in the literature (see [9] for a comprehensive overview), an unbiased reward estimator that leverages the PBM assumption in off-policy evaluation is given by</p>
        <disp-formula>\hat{R}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{a} c_i(a) \, \frac{\langle p(x_i), \pi(\cdot, a \mid x_i) \rangle}{\langle p(x_i), \pi_0(\cdot, a \mid x_i) \rangle}, \qquad (1)</disp-formula>
        <p>where \pi and \pi_0 are the target and logging policies, respectively; c_i(a) is the reward logged for item a; p(x_i) is the position-bias curve for the request with context x_i; and \pi(\cdot, a \mid x) denotes the vector of propensities for ranking action a at each of the K positions, given context x. We use this reward estimator below to compare different position-bias estimators for off-policy evaluation.</p>
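        <p>A minimal NumPy sketch of the estimator in Eq. (1) follows; the tensor layout (per-item propensity vectors over the K positions) and all names are assumptions made for illustration:</p>
        <preformat>
# PBM reward estimator of Eq. (1).
import numpy as np

def pbm_reward_estimate(clicks, p_ctx, pi_target, pi_logging):
    """clicks: (n, m) logged rewards c_i(a); p_ctx: (n, K) curves p(x_i);
    pi_target, pi_logging: (n, m, K) propensities of placing each item
    at each of the K positions under the target and logging policies."""
    # dot(p(x_i), pi(., a | x_i)): expected examination of item a
    num = (p_ctx[:, None, :] * pi_target).sum(axis=2)
    den = (p_ctx[:, None, :] * pi_logging).sum(axis=2)
    return np.mean(np.sum(clicks * num / den, axis=1))
        </preformat>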
        <sec id="sec-4-1-1">
          <title>5.2.1. Stationary environment</title>
          <p>In the first experiment, off-policy evaluation was run on the same data source used for position-bias estimation. This setting is realistic under the assumption that the environment does not change over time: under stationarity, one can expect that a position-bias curve estimated on past data will still be valid when used in the future for off-policy training and evaluation. The target policy to be evaluated here is a deterministic policy that selects among three different rankings, serving the same ranking for all queries within the same cluster. Figure 2 shows the reward estimated on the different clusters, and on the full data set. The PA-C-IH estimator provides much more accurate reward estimates for each of the clusters, as well as for the overall data set, compared to PA-IH, which suffers from the bias introduced by using a single, non-contextual position-bias curve when examination probabilities are in fact contextual.</p>
        </sec>
        <sec id="sec-4-2-1">
          <title>5.2.2. Non-stationary environment</title>
          <p>A less restrictive, and more realistic, setting is one where the distribution of queries shifts over time. Position-bias estimation requires data to be collected from a randomized policy, without interventions that can affect the accuracy of the logged propensities (e.g. promotion rules that alter the ranking produced by the policy, thereby invalidating the logged propensities). Such requirements could be difficult to fulfill in real-world applications, thus preventing us from collecting a constant stream of randomized data with which to update position-bias estimates. In addition, it is reasonable to assume that shifts in the context distribution can occur over time, for instance a change in the distribution of devices used, or of users adhering to different subscription plans, or more generally the non-stationarity induced by the launch of a new user interface. It is therefore interesting to analyze how robust position-bias estimators are under non-stationarity when used for off-policy evaluation. In the synthetic experiment presented, we induce non-stationarity by using a second data set, generated using the same procedure as the data used for position-bias estimation, but with different cluster proportions: while in the training data the cluster weights are (0.3, 0.3, 0.4), in the test data they are set to (0.15, 0.1, 0.75). In order to isolate the effect of non-stationarity in the context distribution, we evaluate a simpler policy than the one used in the previous experiment. Here, the target policy is the deterministic version of the data generation policy, namely the logging policy without the random swaps used in the data generation step. It is worth recalling that this target policy always selects the same ranking regardless of context.</p>
          <p>Figure 3 shows the error in the off-policy estimation on the test data, using PA-IH and PA-C-IH position-bias curves estimated on the training data with a different cluster distribution. [Figure 3: Off-policy evaluation on the synthetic data under non-stationarity using position-bias estimates from PA-C-IH and PA-IH. Bias of the estimated rewards with 95% CI is reported for each cluster and for the full data.] PA-C-IH proves to be robust to such distribution shifts, providing more accurate position-bias estimates, which translate into more accurate off-policy reward estimates, both within clusters and on the full data set. PA-IH, on the other hand, estimates an overall average position-bias curve, which does not reflect the actual average position-bias curve due to the context distribution shift between the two data sets.</p>
          <p>In this experiment the rankings selected by the logging and target policies do not depend on the context. Yet even in this very simple scenario, if the position bias is contextual, a shift in the context distribution can cause systematic bias in off-policy evaluation when using a non-contextual position-bias estimator.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>We have proposed a new contextual position-bias estimator, PA-C-IH, which does not require multiple rankers to be deployed, but rather a single stochastic ranker for which propensities are known. The latter is commonly adopted in recommender systems in order to ensure a certain level of exploration [23, 24, 25, 12], and our estimator exploits the randomness of the logging policy to provide a contextual estimate of the position-bias curve. We have empirically shown that the PA-C-IH estimator provides better position-bias estimates (compared to a non-contextual estimator) when there is dependence on contextual information, and we explored the impact this can have on off-policy evaluation. We further demonstrated how PA-C-IH can yield more robust off-policy estimates in the presence of non-stationary distributions.</p>
      <p>As part of future work, there are several directions that can be investigated: (i) extend the evaluation of our methods to real-world data; (ii) assess the impact of our estimator in off-policy training of LTR algorithms [26, 6]; (iii) generalize our approach to incorporate other types of click noise, such as trust bias.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] T. Joachims, L. Granka, B. Pan, H. Hembrooke, G. Gay, Accurately interpreting clickthrough data as implicit feedback, in: ACM SIGIR Forum, volume 51, ACM, New York, NY, USA, 2017, pp. 4–11.</p>
      <p>[2] N. Craswell, O. Zoeter, M. Taylor, B. Ramsey, An experimental comparison of click position-bias models, in: Proceedings of the 2008 International Conference on Web Search and Data Mining, 2008, pp. 87–94.</p>
      <p>[3] F. Guo, C. Liu, Y. M. Wang, Efficient multiple-click models in web search, in: Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009, pp. 124–131.</p>
      <p>[4] G. E. Dupret, B. Piwowarski, A user browsing model to predict search engine click data from past observations, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 331–338.</p>
      <p>[5] O. Chapelle, Y. Zhang, A dynamic bayesian network click model for web search ranking, in: Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 1–10.</p>
      <p>[6] A. Agarwal, K. Takatsu, I. Zaitsev, T. Joachims, A general framework for counterfactual learning-to-rank, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 5–14.</p>
      <p>[7] H. Oosterhuis, M. de Rijke, Policy-aware unbiased learning to rank for top-k rankings, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 489–498.</p>
      <p>[8] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, M. Najork, Position bias estimation for unbiased learning to rank in personal search, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 610–618.</p>
      <p>[9] S. Li, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vinay, Z. Wen, Offline evaluation of ranking policies with click models, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, KDD ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1685–1694. doi:10.1145/3219819.3220028.</p>
      <p>[10] T. Joachims, A. Swaminathan, T. Schnabel, Unbiased learning-to-rank with biased feedback, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 781–789.</p>
      <p>[11] A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, T. Joachims, Estimating position bias without intrusive interventions, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 474–482.</p>
      <p>[12] M. Ruffini, V. Bellini, A. Buchholz, G. Di Benedetto, Y. Stein, Modeling position bias ranking for streaming media services, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 72–76.</p>
      <p>[13] O. B. Mayor, V. Bellini, A. Buchholz, G. Di Benedetto, D. M. Granziol, M. Ruffini, Y. Stein, Ranker-agnostic contextual position bias estimation, arXiv preprint arXiv:2107.13327 (2021).</p>
      <p>[14] Z. Fang, A. Agarwal, T. Joachims, Intervention harvesting for context-dependent examination-bias estimation, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 825–834.</p>
      <p>[15] E. L. Ionides, Truncated importance sampling, Journal of Computational and Graphical Statistics 17 (2008) 295–311.</p>
      <p>[16] M. Dudík, J. Langford, L. Li, Doubly robust policy evaluation and learning, arXiv preprint arXiv:1103.4601 (2011).</p>
      <p>[17] M. Farajtabar, Y. Chow, M. Ghavamzadeh, More robust doubly robust off-policy evaluation, in: International Conference on Machine Learning, PMLR, 2018, pp. 1447–1456.</p>
      <p>[18] A. Swaminathan, T. Joachims, The self-normalized estimator for counterfactual learning, Advances in Neural Information Processing Systems 28 (2015).</p>
      <p>[19] A. Chuklin, I. Markov, M. de Rijke, Click models for web search, Springer Nature, 2022.</p>
      <p>[20] G. W. Imbens, D. B. Rubin, Causal inference in statistics, social, and biomedical sciences, Cambridge University Press, 2015.</p>
      <p>[21] M. Richardson, E. Dominowska, R. Ragno, Predicting clicks: estimating the click-through rate for new ads, in: Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 521–530.</p>
      <p>[22] Z. Qin, S. J. Chen, D. Metzler, Y. Noh, J. Qin, X. Wang, Attribute-based propensity for unbiased learning in recommender systems: Algorithm and case studies, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 2359–2367.</p>
      <p>[23] K. Hofmann, S. Whiteson, M. de Rijke, Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval, Information Retrieval 16 (2013) 63–90.</p>
      <p>[24] B. Ermis, P. Ernst, Y. Stein, G. Zappella, Learning to rank in the position based model with bandit feedback, in: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management, 2020, pp. 2405–2412.</p>
      <p>[25] J. McInerney, B. Lacker, S. Hansen, K. Higley, H. Bouchard, A. Gruson, R. Mehrotra, Explore, exploit, and explain: personalizing explainable recommendations with bandits, in: Proceedings of the 12th ACM Conference on Recommender Systems, 2018, pp. 31–39.</p>
      <p>[26] K. Xiao, X. Cao, P. Huang, S. Chen, X. Zhou, Y. Xian, Learning-to-rank with context-aware position debiasing (2018).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>