=Paper=
{{Paper
|id=Vol-2911/paper4
|storemode=property
|title=Unbiased Counterfactual Estimation of Ranking Metrics
|pdfUrl=https://ceur-ws.org/Vol-2911/paper4.pdf
|volume=Vol-2911
|authors=Haining Yu
}}
==Unbiased Counterfactual Estimation of Ranking Metrics==
Haining Yu, Amazon, Seattle WA, USA

Abstract: We propose a novel method to estimate metrics for a ranking policy, based on behavioral signal data (e.g., clicks or viewing of video content) generated by a second, different policy. Building on [1], we prove the counterfactual estimator is unbiased and discuss its low-variance property. The estimator can be used to evaluate ranking model performance offline, to validate and select positional bias models, and to serve as a learning objective when training new models.

Keywords: Learning to rank, presentation bias, counterfactual inference

Presented at the Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021. Contact: hainiy@amazon.com (H. Yu).

1. Introduction

Ranking algorithms power large-scale information retrieval systems. They rank web pages when users look for information in search engines, or products when users shop on e-retailers' websites. Such systems process billions of queries on a daily basis; they also generate large amounts of logs. The logs capture online user behavior (e.g., clicking URLs or viewing video content) and can be used to improve ranking algorithms. As a result, training new ranking models from logs is a central task in learning-to-rank theory and application; it is also often referred to as learning from "implicit feedback" or "counterfactual learning to rank" in the literature (e.g., [1]).

Counterfactual learning-to-rank is complicated by the presence of "positional bias". Ranking algorithms determine the position of ranked documents. If the search result page has a "vertical list" layout, the document with rank 1 is on top of the page; if the result page has a horizontal layout, the document with rank 1 is in the top left corner. When positional bias is present, a document has a higher chance of being examined by the user when ranked higher. As a result, when a user clicks a document, the click (the "behavioral signal") can be due to one of two reasons: either the document is relevant for the given query, or the document is at the top of the list. Document ranking and relevancy thus jointly determine behavioral signals, making the signal a noisy proxy for relevancy, the primary goal of ranking optimization.

In the context of counterfactual learning-to-rank, we refer to the algorithm generating the log data as the "behavioral policy". Data generated by the behavioral policy is used to train a hopefully better algorithm, called the "target policy". Research work in counterfactual learning-to-rank can be loosely grouped into training and evaluation. For training, the question is how to properly use knowledge of positional bias to train a target policy and maximize relevancy. To start, positional bias models estimate the probability for a document to be examined by a user in a given position; the estimation is based on different user behavioral models. Such models, often called "click models", have become widely available; see [2][3][4][5][6][7][8][9][10][11][12]. Built on positional bias models, the seminal work of [1] and [13] established a framework to optimize relevancy using noisy behavioral signal data, proving unbiasedness results for ranking metrics with additive form. For evaluation, the question is how to evaluate the target policy once it is trained. For industry ranking applications, the gold standard for evaluation is to A/B test the target policy against the behavioral policy, collect data on both, and compare ranking metrics such as Average Precision and NDCG. This approach is restricted by limited experimentation time. As an alternative, offline evaluation like [9] predicts target policy ranking metrics using data from the behavioral policy.

The research discussed above, in particular [1] and [9], has significantly advanced our understanding of counterfactual learning-to-rank. Meanwhile, each line of research has its pros and cons; let us use [1] and [9] to highlight them. First of all, the two works focus on different subjects in a causal relationship. Borrowing a causal lens where relevancy and positional bias jointly drive behavioral signals, [1] focuses on relevancy, the "cause", while [9] focuses on behavioral signals, the "effect". It is an open question whether the approach in [1] can be extended to optimize behavioral signal-based metrics (e.g., clicks). Secondly, the two works also differ in validation: once developed, models in [9] can be validated by comparing offline evaluation with online experimentation measurements. For [1], even if we can optimize relevancy, we cannot easily evaluate how much improvement is made, even with online experimentation. This is because evaluating relevancy (and its improvement) ultimately requires manual annotation; for large-scale online search engines that process billions of queries daily, such effort is costly. Last but not least, [9] relies on high-variance Inverse Propensity Scoring (IPS) techniques based on the entire ranking permutation (or "action"). In ranking, the action space is large; for example, there are 100!/(100 − 20)! ≈ 1.3 × 10^39 ways to select and order 20 documents out of 100. As a result, action probabilities are small, and the ratio between two small probabilities can be extremely small or large (high variance), making the technique challenging to implement in practical situations. The rank-based propensity adjustment in [1], which uses positional bias models, has a more desirable variance property. Such a difference is key to the accuracy of offline evaluation.

This paper brings the two lines of research together. The main contribution is unbiased estimation of ranking metrics defined on behavioral signals. In this sense, it is part of the study of the "effect" of ranking dynamics. By focusing on the "effect", it can be validated by offline evaluation and experimentation. Meanwhile, it retains the desirable unbiasedness properties of [1] and [9], but replaces high-variance Inverse Propensity Scoring adjustments with positional biases, borrowing the key insight from [1]. Since the focus switches from cause to effect, this requires new techniques and yields unbiased estimators of a new kind. The unbiased estimator can serve as the learning goal for a new target policy and enables offline/online evaluation. It can also be used to establish a method to validate and select positional bias models, a key input to the counterfactual estimation framework.
2. Problem Set Up

Let q be a random query. For q, the set of documents to rank is D = [d_1, d_2, ...]. A ranking policy C assigns ranks R = [r(d_1), r(d_2), ...] to the documents in D. R, a random permutation of [1, ..., ‖D‖], determines the position of the documents of D on the web page. For example, in a "vertical list" layout, the document with r = 1 is on top of the page. After presenting D in the order of R to the user, we observe the behavioral signal B, a binary vector B = [b(d_1), b(d_2), ...], where b(d) = 1 if and only if the user engages with document d ∈ D (e.g., clicks a web page or watches a video). Given a ranking vector R and a behavioral signal vector B, we define a ranking metric of interest M = M(R, B), such as Precision or DCG.

Table 1 shows a hypothetical example. For q = 1, D = [100, 200, 300] represents three documents to rank. The behavioral policy C ranks them as R^C = [1, 2, 3], i.e., document 100 is shown first and 300 last. Seeing the list, the user ignores the top document 100 and clicks the other two, i.e., B^C = [0, 1, 1]. If we use Precision@3 to measure the performance of the ranking policy, we get M(R^C, B^C) = 0.667. Saved in the log, the data is later used to train a target policy T. The new policy ranks differently, i.e., R^T = [3, 1, 2]. This seems an improvement, but the user never sees the documents in the order [200, 300, 100], and we do not know whether they would click differently. In other words, B^T is missing data. This is an instance of counterfactual inference, hence the name. Estimating M^T = M(R^T, B^T) without observing B^T is the central task of this paper.

Table 1: Illustrative Data Sample

Query  Document  Rank by C  Clicked? (1=yes)  Rank by T
1      100       1          0                 3
1      200       2          1                 1
1      300       3          1                 2

All quantities defined above are random variables (or vectors). The dependencies among them are as follows: query q determines the document set D, i.e., D = D(q); q, D, and the ranking policy C jointly determine the rank vector R = R(q, D, C); q, D, and R then jointly determine the behavioral signal vector B = B(q, D, R); last but not least, R and B determine M = M(B, R). Figure 1 visualizes the structural causal model. The randomness in the system comes from multiple sources: the distribution of q, the conditional distribution of the candidate set D|q, the conditional distribution of R|q, D, C, and the conditional distribution of B|q, D, R. The only exception is that no distribution is needed for M|R, B: given R and B, the value of M is deterministic for most practical ranking systems and metrics.

Figure 1: Causal relationship among the random variables.

Because the analysis involves two policies, we always specify the policy generating the data. That is, we use R^T to denote ranks generated by policy T, B^T to denote the behavioral signal generated when showing D to the user in the order of R^T, and M^T = M(B^T, R^T) to denote the resulting ranking metric. This helps distinguish between random variables under different policies. We omit other dependencies in the notation when confusion can be avoided.
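As a concrete illustration of the setup and the Table 1 example above, the following minimal Python sketch (my own illustration, not from the paper; helper names such as precision_at_k are hypothetical) encodes the logged data and computes the additive metric M(R^C, B^C) for the behavioral policy.

```python
# Minimal sketch of the problem setup, assuming the Table 1 example.
# Ranks and clicks are stored per document id; the metric is additive in clicks.

def precision_at_k(ranks, clicks, k=3):
    """Precision@k as an additive metric: sum of L(r(d)) * b(d) with L(r) = 1/k for r <= k."""
    return sum((1.0 / k) * clicks[d] for d in ranks if ranks[d] <= k)

# Behavioral policy C: document -> rank, and the observed clicks (behavioral signal B^C).
ranks_C = {100: 1, 200: 2, 300: 3}
clicks_C = {100: 0, 200: 1, 300: 1}

# Target policy T re-ranks the same documents; its clicks B^T are never observed.
ranks_T = {100: 3, 200: 1, 300: 2}

print(round(precision_at_k(ranks_C, clicks_C), 3))  # 0.667, the observed M(R^C, B^C)
```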
3. The Unbiased Counterfactual Estimator

In this section we define the counterfactual estimator and prove its unbiasedness. We first present the main result in Section 3.1, prove the unbiasedness results in Section 3.2, and discuss technical details in Section 3.3.

3.1. Main Result

We first state the assumptions needed to define the estimator and prove unbiasedness.

Assumption 3.1. For a policy C, conditional on the ranking vector R^C and the behavioral signal B^C, the metric M^C(R^C, B^C) is conditionally independent of q and D, i.e., M^C(R^C, B^C) ⊥ q, D | R^C, B^C.

This is easily satisfied for most ranking metrics such as MRR, MAP, Precision, and NDCG. Next, similar to [1][9][13], we assume the ranking metric of interest is linearly decomposable:

Assumption 3.2. M^C(R^C, B^C) = \sum_{d \in D} L(r(d)) b^C(d), where L(r) is a deterministic function of rank r. For Precision@3, L(r) = 1/3 if r ≤ 3, and 0 otherwise.

For d ∈ D, b^C(d) is a binary random variable and E[b^C(d)] is the click probability. Similar to [9] and [10], we make the following assumption based on the position-based click model (PBM) [6]:

Assumption 3.3. E_{B^C|q,D}[b^C(d)] = η(r^C(d)) γ(d, q), where η(r) > 0 is the probability of examining rank r and γ(d, q) is the probability of a click conditional on examination.

When using data generated by behavioral policy C to train a new policy T, we also assume T and C share nothing in common except the inputs q and D. For example, the output of one policy is not used as input to the other:

Assumption 3.4. Given two policies T and C, R^T and R^C are conditionally independent given q and D, i.e., R^T ⊥ R^C | q, D.

The main result of the paper states that:

Theorem 3.1. Define

Y(R^C, R^T, B^C) = \sum_{d \in D:\, b^C(d)=1} L(r^T(d)) \frac{\eta(r^T(d))}{\eta(r^C(d))}    (1)

Let Q^C be queries randomly sampled from the query universe where policy C is applied. Under Assumptions 3.1, 3.2, 3.3, and 3.4,

\frac{1}{\|Q^C\|} \sum_{q \in Q^C} Y(R^C, R^T, B^C)    (2)

is an unbiased estimator of E_{q,D,R^T,B^T}[M^T(R^T, B^T)], i.e.,

E_{q,D,R^T,B^T}[M^T(R^T, B^T)] = E\Big[ \frac{1}{\|Q^C\|} \sum_{q \in Q^C} Y(R^C, R^T, B^C) \Big]    (3)

where the expectation on the right-hand side is taken over the query set Q^C, over D for every q ∈ Q^C, over R^T given q, D, over B^T given q, D, R^T, and over R^C given q, D.

Let us use the example in Table 1 to illustrate how the estimator is computed. Assume we have the following positional bias estimates: η(1) = 0.9, η(2) = 0.7, η(3) = 0.5 (as a reminder, such estimates can be obtained via statistical estimation procedures; see [12] and references therein for implementation). Recall that the ranking vector is R^C = [1, 2, 3] and the behavioral signal is B^C = [0, 1, 1]. For the metric Precision@3, M^C = 0.667. For policy T, we observe R^T but not B^T, so we use R^T, R^C, and B^C with equation (1) to compute the estimate Y = (0 + 1 × η(1)/η(2) + 1 × η(2)/η(3)) / 3 = 0.895. Averaging the Y values over the queries in Q^C yields the counterfactual estimator (2).
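To make equations (1) and (2) concrete, here is a minimal Python sketch (illustrative only; the helper names are hypothetical and not from the paper) that computes Y for the Table 1 example and averages Y over logged queries. With η(1) = 0.9, η(2) = 0.7, η(3) = 0.5 it reproduces the value 0.895 from the worked example above.

```python
# Counterfactual estimate Y (equation (1)) and its sample mean (equation (2)), a sketch.

eta = {1: 0.9, 2: 0.7, 3: 0.5}  # assumed positional bias estimates eta(r)

def L(rank, k=3):
    """Additive weight for Precision@k."""
    return 1.0 / k if rank <= k else 0.0

def counterfactual_Y(ranks_C, ranks_T, clicks_C, eta):
    """Equation (1): sum over clicked documents of L(r^T(d)) * eta(r^T(d)) / eta(r^C(d))."""
    return sum(
        L(ranks_T[d]) * eta[ranks_T[d]] / eta[ranks_C[d]]
        for d in clicks_C if clicks_C[d] == 1
    )

def counterfactual_estimate(logged_queries, eta):
    """Equation (2): average Y over queries logged under the behavioral policy C."""
    ys = [counterfactual_Y(rc, rt, bc, eta) for (rc, rt, bc) in logged_queries]
    return sum(ys) / len(ys)

# Table 1 example: a single logged query.
ranks_C = {100: 1, 200: 2, 300: 3}
ranks_T = {100: 3, 200: 1, 300: 2}
clicks_C = {100: 0, 200: 1, 300: 1}

print(round(counterfactual_Y(ranks_C, ranks_T, clicks_C, eta), 3))             # 0.895
print(round(counterfactual_estimate([(ranks_C, ranks_T, clicks_C)], eta), 3))  # 0.895
```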
3.2. Proof of Unbiasedness

We now establish a series of unbiasedness results, eventually leading to the proof of Theorem 3.1.

Lemma 3.2. Let T and C be two stochastic policies. Under Assumptions 3.1, 3.2, 3.3, and 3.4, Y(R^C, R^T, B^C) is an unbiased estimator of E_{B^T|q,D,R^T}[M^T], i.e.,

E_{B^T|q,D,R^T}[M(R^T, B^T)] = E_{B^C|q,D,R^C,R^T}[Y(R^C, R^T, B^C)]    (4)

Proof. Via Assumptions 3.1 and 3.2,

E_{B^T|q,D,R^T}[M^T] = \sum_{d \in D} L(r^T(d)) \, E_{B^T|q,D,R^T}[b^T(d)]    (5)

By Assumption 3.3,

E_{B^T|q,D,R^T}[b^T(d)] = \eta(r^T(d)) \gamma(d, q)

Defining the shorthand

\Psi = \frac{L(r^T(d)) \, \eta(r^T(d))}{L(r^C(d)) \, \eta(r^C(d))}

it follows that

E_{B^T|q,D,R^T}[M^T]
  = \sum_{d \in D} L(r^T(d)) \eta(r^T(d)) \gamma(d, q)
  = \sum_{d \in D} L(r^C(d)) \eta(r^C(d)) \gamma(d, q) \Psi
  = \sum_{d \in D} L(r^C(d)) \, E_{B^C|q,D,R^C}[b^C(d)] \, \Psi
  = \sum_{d \in D} L(r^C(d)) \, E_{B^C|q,D,R^T,R^C}[b^C(d)] \, \Psi
  = E_{B^C|q,D,R^T,R^C}\Big[ \sum_{d \in D} L(r^C(d)) \, b^C(d) \, \Psi \Big]
  = E_{B^C|q,D,R^T,R^C}\Big[ \sum_{d \in D:\, b^C(d)=1} L(r^T(d)) \frac{\eta(r^T(d))}{\eta(r^C(d))} \Big]
  = E_{B^C|q,D,R^T,R^C}[Y(R^C, R^T, B^C)]

The third step in the derivation is due to Assumption 3.3; the fourth step is due to Assumption 3.4.

Lemma 3.3. Under Assumptions 3.1, 3.2, 3.3, and 3.4,

E_{q,D,R^T,B^T}[M(R^T, B^T)] = E_{q,D,R^C,B^C,R^T}[Y(R^C, R^T, B^C)]

Proof. By Assumption 3.4, R^T and R^C are conditionally independent. As a result, M^T(R^T, B^T) is also conditionally independent of R^C. Therefore

E_{B^T|q,D,R^T}[M^T(R^T, B^T)] = E_{B^T|q,D,R^T,R^C}[M^T(R^T, B^T)]

Combining this equation with Lemma 3.2 yields

E_{B^T|q,D,R^T,R^C}[M^T(R^T, B^T)] = E_{B^C|q,D,R^C,R^T}[Y(R^C, R^T, B^C)]

The expectations on both sides of the above equation are conditioned on the same joint distribution of q, D, R^T, R^C. Taking the expectation of both sides yields

E_{q,D,R^T,R^C,B^T}[M^T(R^T, B^T)] = E_{q,D,R^T,R^C,B^C}[Y(R^C, R^T, B^C)]    (6)

Again using Assumption 3.4, we can remove R^C from the expectation of M^T on the left-hand side of equation (6). This yields

E_{q,D,R^T,B^T}[M^T(R^T, B^T)] = E_{q,D,R^T,R^C,B^C}[Y(R^C, R^T, B^C)]

Theorem 3.1 can now be proved as follows:

Proof. Since (1/‖Q^C‖) \sum_{q \in Q^C} Y is the sample mean of Y, it is an unbiased estimator of the true mean E_{q,D,R^T,R^C,B^C}[Y], which, by Lemma 3.3, equals E_{q,D,R^T,B^T}[M^T(R^T, B^T)]. Thus it is an unbiased estimator of E_{q,D,R^T,B^T}[M^T(R^T, B^T)].
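The unbiasedness claim can also be sanity-checked numerically. Below is a small Monte Carlo sketch (my own illustration with assumed toy parameters, not from the paper) that simulates clicks under the position-based model of Assumption 3.3 for two independently randomized policies, and compares the average counterfactual estimate Y against the average of the directly simulated M^T.

```python
import random

# Monte Carlo sanity check of Theorem 3.1 under the PBM (Assumption 3.3).
# Hypothetical toy parameters: 5 documents, examination probabilities by rank,
# and per-document click probabilities given examination, for one fixed query.
random.seed(0)
eta = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35, 5: 0.25}   # eta(r)
gamma = {0: 0.9, 1: 0.6, 2: 0.4, 3: 0.2, 4: 0.1}    # gamma(d, q)
docs = list(gamma)

def L(rank, k=3):
    return 1.0 / k if rank <= k else 0.0

def random_ranking():
    order = docs[:]
    random.shuffle(order)
    return {d: pos + 1 for pos, d in enumerate(order)}   # document -> rank

def simulate_clicks(ranks):
    return {d: int(random.random() < eta[ranks[d]] * gamma[d]) for d in docs}

direct, counterfactual = [], []
for _ in range(200_000):
    ranks_C, ranks_T = random_ranking(), random_ranking()   # independent (Assumption 3.4)
    clicks_C, clicks_T = simulate_clicks(ranks_C), simulate_clicks(ranks_T)
    direct.append(sum(L(ranks_T[d]) * clicks_T[d] for d in docs))            # M^T
    counterfactual.append(sum(L(ranks_T[d]) * eta[ranks_T[d]] / eta[ranks_C[d]]
                              for d in docs if clicks_C[d]))                 # Y
print(sum(direct) / len(direct), sum(counterfactual) / len(counterfactual))  # nearly equal
```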
3.3. Technical Discussion

Theorem 3.1 also holds when both T and C are deterministic policies, without Assumption 3.4; the proof is omitted due to space limits. In practical ranking systems, the output of one ranker is frequently incorporated into another. This violates Assumption 3.4, which requires the two policies to share nothing except their inputs.

The unbiased estimator looks different from its counterpart in equation (4) of [1], where the positional bias appears only once. The difference is easy to understand with a causal view: the assumption common to [1] and the current work is that relevancy and positional bias jointly drive behavioral signals. When it comes to estimation, however, we are interested in different subjects. [1] estimates relevancy (the cause) from clicks (the effect), so it uses a single 1/Q propensity factor to cancel out the positional bias of the behavioral policy. The present work estimates metrics defined on the behavioral signal of the target policy (the effect), from data generated by the behavioral policy (a second effect). Two positional bias terms are thus needed to cancel the effect.

The counterfactual estimator (2) aims to avoid the high-variance challenge facing other IPS estimators, e.g., in [9]. It is common practice to use IPS estimators to construct estimates for metrics of interest. While such estimators enjoy the desirable property of unbiasedness, their variance profile is a concern. The core of any IPS estimator is the ratio of the probabilities of a ranking R being selected by the two policies T and C, i.e., Pr_T(R|q, D) / Pr_C(R|q, D). In practice, the ranking space is (combinatorially) large and action probabilities are small; dividing one small number by another can produce extremely small or large ratios. When either policy is deterministic, the ratio is ill-defined. The problem gets worse when the behavioral and target policies differ significantly, i.e., exactly when accurate offline performance evaluation is needed most. As a result, the ratio can have high variance; this prevents IPS estimators from being useful in industry applications. The current approach avoids this problem: counterfactual estimation with equation (2) no longer needs the high-variance action probability ratio. Instead it uses the ratio between positional bias estimates (η), a function of rank positions. The ratio of η values empirically has much lower variance than an estimated action probability ratio.

The current framework can be generalized in three different ways. First, it naturally extends to contextual ranking problems, where q represents not only the search query but also all context information available to the ranker. Secondly, it can be generalized to optimize query/document-specific rewards; this is useful when different documents have different economic value. Assumption 3.2 can be relaxed to M(R, B) = \sum_{d \in D} L(r(d), q, D) b(d), where L(r, q, D) is a deterministic function of rank r, query q, and document set D. Last but not least, the probability of examination η and the conditional click probability γ can depend on the query q and the candidate document set D. That is, Assumption 3.3 can be relaxed to E_{B^T|q,D}[b^T(d)] = η(r^T(d), q, D) γ(d, q, D). The same is true for Assumption 3.4.
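The variance contrast can be made tangible with a toy computation. The sketch below is my own illustration, not from the paper: it assumes Plackett-Luce policies purely as a stand-in for stochastic rankers and an assumed 1/r examination curve, and shows that the action-probability ratio for a realistic ranking is astronomically small while the per-document η ratio stays within fixed bounds.

```python
import math, random

# Toy contrast between the IPS ratio Pr_T(R)/Pr_C(R) and the positional-bias ratio
# eta(r^T)/eta(r^C). Plackett-Luce policies and the 1/r examination curve are assumptions.
random.seed(1)

def plackett_luce_logprob(order, scores):
    """Log-probability of showing documents in `order` under a Plackett-Luce policy."""
    logp, remaining = 0.0, list(order)
    for d in order:
        logp += scores[d] - math.log(sum(math.exp(scores[x]) for x in remaining))
        remaining.remove(d)
    return logp

n = 20
scores_C = [random.gauss(0, 1) for _ in range(n)]
scores_T = [random.gauss(0, 1) for _ in range(n)]      # a noticeably different policy
order = sorted(range(n), key=lambda d: -scores_C[d])   # a ranking C is likely to produce

log_ratio = plackett_luce_logprob(order, scores_T) - plackett_luce_logprob(order, scores_C)
print("IPS ratio Pr_T(R)/Pr_C(R):", math.exp(log_ratio))        # typically tiny or huge

eta = [1.0 / (r + 1) for r in range(n)]                          # assumed examination curve
print("eta ratio bounds:", eta[-1] / eta[0], eta[0] / eta[-1])   # stays within [1/20, 20]
```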
4. Validating and Selecting Positional Bias Models

The unbiased counterfactual estimator has three potential uses: to evaluate offline ranking performance, to validate and select positional bias estimates, and to serve as a loss (or reward, in a reinforcement learning setting) when training new ranking models. Some of these have been covered in the literature: see [9] for a discussion of offline ranking performance evaluation and [1] for a discussion of training loss improvement. The rest of this section focuses on validating and selecting positional bias models, an area not covered in past work. Positional bias models can be developed in many ways, depending on theory (e.g., the underlying statistical model, the causal structure, the inclusion and exclusion of predictive features), data, and estimation processes. When there is one model, the question is how correct it is. When there are multiple models, the question is how to select the best one for a specific use case.

Using the counterfactual estimator, a method can be developed to validate and select positional bias models. It is based on the following idea: we already have one unbiased estimator of E[M^T] that uses positional bias estimates as input, namely the Y values defined in equations (1) and (2). If we find a second unbiased estimator of E[M^T] that does not use positional bias estimates, the difference between the two estimators can be used to evaluate the correctness of the positional bias model, since two unbiased estimates of the same quantity (the population mean) should converge. In fact, if we run policy T on a set of queries Q^T, E[M^T] can be directly estimated as

E_{Q^T}\Big[ \frac{1}{\|Q^T\|} \sum_{q \in Q^T} M(R^T, B^T) \Big]

The model validation process takes three steps: data collection, estimation, and testing. The first step is to collect data via an online ranking experiment. The experiment should have two treatment groups (C and T), each with a different ranking policy, and we observe behavioral signals (e.g., clicks) for both groups. For every query in group C, we also rank the documents with policy T in "shadow mode" and log the ranking from T, even though we do not know which documents would have been clicked had policy T been applied. The second step is to use the data to compute the two unbiased estimators defined previously. In the third step, we use the two estimates to construct a model validation test. A simple approach is to treat the direct sample mean estimator (which does not use positional bias estimates) as the "ground truth", as long as the sample size is big enough; the difference between the two estimators then quantifies the correctness of the model. A more statistically rigorous method is to treat the two estimates as group means of random variables with estimated standard deviations; standard hypothesis testing readily applies.
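The testing step can be implemented as a standard two-sample comparison. The sketch below is illustrative only; the per-query values, helper names, and the normal-approximation test are my assumptions. It compares the counterfactual estimate computed from the C group (per-query Y values) against the direct estimate from the T group (per-query M(R^T, B^T) values).

```python
import math

# Validation test sketch: compare the counterfactual estimate (group C, using a
# candidate positional bias model) with the direct estimate (group T).
# `ys` are per-query Y values from equation (1); `ms` are per-query M(R^T, B^T) values.

def mean_and_se(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

def two_sample_z(ys, ms):
    """Normal-approximation test of H0: the two estimators share the same mean."""
    mean_y, se_y = mean_and_se(ys)
    mean_m, se_m = mean_and_se(ms)
    z = (mean_y - mean_m) / math.sqrt(se_y ** 2 + se_m ** 2)
    return mean_y, mean_m, z

# Hypothetical per-query values; in practice these come from the logged experiment.
ys = [0.895, 0.70, 0.41, 0.66, 0.88, 0.52]
ms = [0.667, 0.667, 0.333, 0.667, 1.000, 0.333]
print(two_sample_z(ys, ms))  # a large |z| casts doubt on the positional bias model
```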
5. Conclusion

We built a counterfactual estimator for ranking metrics defined on behavioral signals. The estimator is unbiased and has low variance. We discussed its use in selecting and validating positional bias models. The estimator can be applied to ranking models with strong counterfactual effects.

References

[1] T. Joachims, A. Swaminathan, T. Schnabel, Unbiased learning-to-rank with biased feedback, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17), 2017, pp. 781–789.
[2] T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), 2002, pp. 133–142.
[3] T. Joachims, L. A. Granka, B. Pan, H. A. Hembrooke, F. Radlinski, G. Gay, Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search, ACM Transactions on Information Systems 25 (2007).
[4] N. Craswell, O. Zoeter, M. Taylor, B. Ramsey, An experimental comparison of click position-bias models, in: Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM '08), 2008, pp. 87–94.
[5] O. Chapelle, Y. Zhang, A dynamic bayesian network click model for web search ranking, in: Proceedings of the 18th International Conference on World Wide Web (WWW '09), 2009, pp. 1–10.
[6] A. Chuklin, I. Markov, M. de Rijke, Click models for web search, Synthesis Lectures on Information Concepts, Retrieval, and Services 7 (2015) 1–115.
[7] A. Borisov, I. Markov, M. de Rijke, P. Serdyukov, A neural click model for web search, in: Proceedings of the 25th International Conference on World Wide Web (WWW '16), 2016, pp. 531–541.
[8] T. Schnabel, A. Swaminathan, P. Frazier, T. Joachims, Unbiased comparative evaluation of ranking functions, 2016. arXiv:1604.07209.
[9] S. Li, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vishwa, Z. Wen, Offline evaluation of ranking policies with click models, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 1685–1694.
[10] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, M. Najork, Position bias estimation for unbiased learning to rank in personal search, in: Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM '18), 2018, pp. 610–618.
[11] A. Agarwal, I. Zaitsev, T. Joachims, Consistent position bias estimation without online interventions for learning-to-rank, 2018. arXiv:1806.03555.
[12] A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, T. Joachims, Estimating position bias without intrusive interventions, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19), 2019.
[13] A. Agarwal, K. Takatsu, I. Zaitsev, T. Joachims, A general framework for counterfactual learning-to-rank, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 5–14.