=Paper=
{{Paper
|id=Vol-2911/paper4
|storemode=property
|title=Unbiased Counterfactual Estimation of Ranking Metrics
|pdfUrl=https://ceur-ws.org/Vol-2911/paper4.pdf
|volume=Vol-2911
|authors=Haining Yu
}}
==Unbiased Counterfactual Estimation of Ranking Metrics==
Haining Yu, Amazon, Seattle WA, USA

Abstract: We propose a novel method to estimate metrics for a ranking policy, based on behavioral signal data (e.g., clicks or viewing of video content) generated by a second, different policy. Building on [1], we prove the counterfactual estimator is unbiased and discuss its low-variance property. The estimator can be used to evaluate ranking model performance offline, to validate and select positional bias models, and to serve as a learning objective when training new models.

Keywords: Learning to rank, presentation bias, counterfactual inference

Presented at the Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021. Contact: hainiy@amazon.com (H. Yu).

1. Introduction

Ranking algorithms power large-scale information retrieval systems. They rank web pages when users look for information in search engines, or products when users shop on e-retailers' websites. Such systems process billions of queries on a daily basis; they also generate large amounts of logs. The logs capture online user behavior (e.g., clicking URLs or viewing video content) and can be used to improve ranking algorithms. As a result, training new ranking models from logs is a central task in learning-to-rank theory and application; it is also often referred to as learning from "implicit feedback" or "counterfactual learning to rank" in the literature (e.g., [1]).

Counterfactual learning-to-rank is complicated by the presence of "positional bias". Ranking algorithms determine the position of ranked documents. If the search result page has a "vertical list" layout, the document with rank 1 is on top of the page; if the result page has a horizontal layout, the document with rank 1 is in the top left corner. When positional bias is present, a document has a higher chance of being examined by the user when ranked higher. As a result, when a user clicks a document, the click (the "behavioral signal") can be due to one of two reasons: either the document is relevant for the given query, or the document is at the top of the list. Document ranking and relevancy thus jointly determine behavioral signals, making the signal a noisy proxy for relevancy, the primary goal of ranking optimization.

In the context of counterfactual learning-to-rank, we refer to the algorithm generating the log data as the "behavioral policy". Data generated by the behavioral policy is used to train a hopefully better algorithm, called the "target policy". Research work in counterfactual learning-to-rank can be loosely grouped into training and evaluation. For training, the question is how to properly use knowledge of positional bias to train a target policy and maximize relevancy. To start, positional bias models estimate the probability for a document to be examined by a user in a given position; the estimation is based on different user behavioral models. Such models, often called "click models", have become widely available; see [2][3][4][5][6][7][8][9][10][11][12]. Built on positional bias models, the seminal work of [1] and [13] established a framework to optimize relevancy using noisy behavioral signal data, proving unbiasedness results for ranking metrics with additive form. For evaluation, the question is how to evaluate the target policy once it is trained. For industry ranking applications, the gold standard for evaluation is to A/B test the target policy against the behavioral policy, collect data on both, and compare ranking metrics such as Average Precision and NDCG. This approach is restricted by limited experimentation time. As an alternative, offline evaluation like [9] predicts target policy ranking metrics using data from the behavioral policy.

The research discussed above, in particular [1] and [9], has significantly advanced our understanding of counterfactual learning-to-rank. Meanwhile, each line of research has its pros and cons; let us use [1] and [9] to highlight them. First of all, the two works focus on different subjects in a causal relationship. Borrowing a causal lens where relevancy and positional bias jointly drive behavioral signals, [1] focuses on relevancy, the "cause", while [9] focuses on behavioral signals, the "effect". It is an open question whether the approach in [1] can be extended to optimize behavioral signal-based metrics (e.g., clicks). Secondly, the two works also differ in validation: once developed, models in [9] can be validated by comparing offline evaluation with online experimentation measurements. For [1], even if we can optimize relevancy, we cannot easily evaluate how much improvement is made, even with online experimentation. This is because evaluating relevancy (and its improvement) ultimately requires manual annotation; for large-scale online search engines that process billions of queries daily, such effort is costly. Last but not least, [9] relies on high-variance Inverse Propensity Scoring (IPS) techniques based on the entire ranking permutation (or "action"). In ranking, the action space is large; for example, there are 100!/(100 − 20)! ≈ 1.3 × 10^39 ways to select and order 20 documents out of 100. As a result, action probabilities are small, and the ratio between two small probabilities can be extremely small or large (high variance), making the technique challenging to implement in practical situations. The rank-based propensity adjustment in [1], which uses positional bias models, has a more desirable variance property. Such a difference is key to the accuracy of offline evaluation.

This paper brings the two lines of research together. The main contribution is unbiased estimation of ranking metrics defined on behavioral signals. In this sense, it is part of the study of the "effect" of ranking dynamics. By focusing on the "effect", it can be validated by offline evaluation and experimentation. Meanwhile, it retains the desirable unbiasedness properties of [1] and [9], but replaces high-variance Inverse Propensity Scoring adjustments with positional biases, borrowing the key insight from [1]. Since the focus switches from cause to effect, this requires new techniques and yields unbiased estimators of a new kind. The unbiased estimator can serve as the learning goal for a new target policy and enables offline/online evaluation. It can also be used to establish a method to validate and select positional bias models, a key input to the counterfactual estimation framework.
2. Problem Set Up

Let q be a random query. For q, the set of documents to rank is D = [d_1, d_2, ...]. A ranking policy C assigns ranks R = [r(d_1), r(d_2), ...] to the documents in D. R, a random permutation of [1, ..., ‖D‖], determines the position of the documents of D on the web page. For example, in a "vertical list" layout, the document with r = 1 is on top of the page. After presenting D in the order of R to the user, we observe the behavioral signal B, a binary vector B = [b(d_1), b(d_2), ...], where b(d) = 1 if and only if the user engages with document d ∈ D (e.g., clicks a web page or watches a video). Given a ranking vector R and a behavioral signal vector B, we define a ranking metric of interest M = M(R, B), such as Precision or DCG.

Table 1 shows a hypothetical example. For q = 1, D = [100, 200, 300] represents three documents to rank. The behavioral policy C ranks them as R^C = [1, 2, 3], i.e., document 100 is shown first and 300 last. Seeing the list, the user ignores the top document 100 and clicks the other two, i.e., B^C = [0, 1, 1]. If we use Precision@3 to measure the performance of the ranking policy, we get M(R^C, B^C) = 0.667. Saved in the log, the data is later used to train a target policy T. The new policy ranks differently, i.e., R^T = [3, 1, 2]. This seems an improvement, but the user never sees the documents in the order [200, 300, 100], and we do not know whether they would click differently. In other words, B^T is missing data. This is an instance of counterfactual inference, hence the name. Estimating M^T = M(R^T, B^T) without observing B^T is the central task of this paper.

Table 1: Illustrative Data Sample

Query  Document  Rank by C  Clicked? (1=yes)  Rank by T
1      100       1          0                 3
1      200       2          1                 1
1      300       3          1                 2

All quantities defined above are random variables (or vectors). The dependencies among them are as follows: query q determines the document set D, i.e., D = D(q); q, D, and the ranking policy C jointly determine the rank vector R = R(q, D, C); q, D, and R then jointly determine the behavioral signal vector B = B(q, D, R); last but not least, R and B determine M = M(B, R). Figure 1 visualizes the structural causal model. The randomness in the system comes from multiple sources: the distribution of q, the conditional distribution of the candidate set D|q, the conditional distribution of R|q, D, C, and the conditional distribution of B|q, D, R. The only exception is that no distribution is needed for M|R, B: given R and B, the value of M is deterministic for most practical ranking systems and metrics.

Figure 1: Causal relationship among the random variables.

Because the analysis involves two policies, we always specify the policy generating the data. That is, we use R^T to denote ranks generated by policy T, B^T to denote the behavioral signal generated when showing D to the user in the order of R^T, and M^T = M(B^T, R^T) to denote the resulting ranking metric. This helps distinguish between random variables under different policies. We omit other dependencies in the notation when confusion can be avoided.
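As a concrete illustration of the setup and the Table 1 example above, the following minimal Python sketch (my own illustration, not from the paper; helper names such as precision_at_k are hypothetical) encodes the logged data and computes the additive metric M(R^C, B^C) for the behavioral policy.

```python
# Minimal sketch of the problem setup, assuming the Table 1 example.
# Ranks and clicks are stored per document id; the metric is additive in clicks.

def precision_at_k(ranks, clicks, k=3):
    """Precision@k as an additive metric: sum of L(r(d)) * b(d) with L(r) = 1/k for r <= k."""
    return sum((1.0 / k) * clicks[d] for d in ranks if ranks[d] <= k)

# Behavioral policy C: document -> rank, and the observed clicks (behavioral signal B^C).
ranks_C = {100: 1, 200: 2, 300: 3}
clicks_C = {100: 0, 200: 1, 300: 1}

# Target policy T re-ranks the same documents; its clicks B^T are never observed.
ranks_T = {100: 3, 200: 1, 300: 2}

print(round(precision_at_k(ranks_C, clicks_C), 3))  # 0.667, the observed M(R^C, B^C)
```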
3. The Unbiased Counterfactual Estimator

In this section we define the counterfactual estimator and prove its unbiasedness. We first present the main result in Section 3.1, prove the unbiasedness results in Section 3.2, and discuss technical details in Section 3.3.

3.1. Main Result

We first state the assumptions needed to define the estimator and prove unbiasedness.

Assumption 3.1. For a policy C, conditional on the ranking vector R^C and the behavioral signal B^C, the metric M^C(R^C, B^C) is conditionally independent of q and D, i.e., M^C(R^C, B^C) ⊥ q, D | R^C, B^C.

This is easily satisfied for most ranking metrics such as MRR, MAP, Precision, and NDCG. Next, similar to [1][9][13], we assume the ranking metric of interest is linearly decomposable:

Assumption 3.2. M^C(R^C, B^C) = \sum_{d \in D} L(r(d)) b^C(d), where L(r) is a deterministic function of rank r. For Precision@3, L(r) = 1/3 if r ≤ 3, and 0 otherwise.

For d ∈ D, b^C(d) is a binary random variable and E[b^C(d)] is the click probability. Similar to [9] and [10], we make the following assumption based on the position-based click model (PBM) [6]:

Assumption 3.3. E_{B^C|q,D}[b^C(d)] = η(r^C(d)) γ(d, q), where η(r) > 0 is the probability of examining rank r and γ(d, q) is the probability of a click conditional on examination.

When using data generated by behavioral policy C to train a new policy T, we also assume T and C share nothing in common except the inputs q and D. For example, the output of one policy is not used as input to the other:

Assumption 3.4. Given two policies T and C, R^T and R^C are conditionally independent given q and D, i.e., R^T ⊥ R^C | q, D.

The main result of the paper states that:

Theorem 3.1. Define

Y(R^C, R^T, B^C) = \sum_{d \in D:\, b^C(d)=1} L(r^T(d)) \frac{\eta(r^T(d))}{\eta(r^C(d))}    (1)

Let Q^C be queries randomly sampled from the query universe where policy C is applied. Under Assumptions 3.1, 3.2, 3.3, and 3.4,

\frac{1}{\|Q^C\|} \sum_{q \in Q^C} Y(R^C, R^T, B^C)    (2)

is an unbiased estimator of E_{q,D,R^T,B^T}[M^T(R^T, B^T)], i.e.,

E_{q,D,R^T,B^T}[M^T(R^T, B^T)] = E\Big[ \frac{1}{\|Q^C\|} \sum_{q \in Q^C} Y(R^C, R^T, B^C) \Big]    (3)

where the expectation on the right-hand side is taken over the query set Q^C, over D for every q ∈ Q^C, over R^T given q, D, over B^T given q, D, R^T, and over R^C given q, D.

Let us use the example in Table 1 to illustrate how the estimator is computed. Assume we have the following positional bias estimates: η(1) = 0.9, η(2) = 0.7, η(3) = 0.5 (as a reminder, such estimates can be obtained via statistical estimation procedures; see [12] and references therein for implementation). Recall that the ranking vector is R^C = [1, 2, 3] and the behavioral signal is B^C = [0, 1, 1]. For the metric Precision@3, M^C = 0.667. For policy T, we observe R^T but not B^T, so we use R^T, R^C, and B^C with equation (1) to compute the estimate Y = (0 + 1 × η(1)/η(2) + 1 × η(2)/η(3)) / 3 = 0.895. Averaging the Y values over the queries in Q^C yields the counterfactual estimator (2).
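To make equations (1) and (2) concrete, here is a minimal Python sketch (illustrative only; the helper names are hypothetical and not from the paper) that computes Y for the Table 1 example and averages Y over logged queries. With η(1) = 0.9, η(2) = 0.7, η(3) = 0.5 it reproduces the value 0.895 from the worked example above.

```python
# Counterfactual estimate Y (equation (1)) and its sample mean (equation (2)), a sketch.

eta = {1: 0.9, 2: 0.7, 3: 0.5}  # assumed positional bias estimates eta(r)

def L(rank, k=3):
    """Additive weight for Precision@k."""
    return 1.0 / k if rank <= k else 0.0

def counterfactual_Y(ranks_C, ranks_T, clicks_C, eta):
    """Equation (1): sum over clicked documents of L(r^T(d)) * eta(r^T(d)) / eta(r^C(d))."""
    return sum(
        L(ranks_T[d]) * eta[ranks_T[d]] / eta[ranks_C[d]]
        for d in clicks_C if clicks_C[d] == 1
    )

def counterfactual_estimate(logged_queries, eta):
    """Equation (2): average Y over queries logged under the behavioral policy C."""
    ys = [counterfactual_Y(rc, rt, bc, eta) for (rc, rt, bc) in logged_queries]
    return sum(ys) / len(ys)

# Table 1 example: a single logged query.
ranks_C = {100: 1, 200: 2, 300: 3}
ranks_T = {100: 3, 200: 1, 300: 2}
clicks_C = {100: 0, 200: 1, 300: 1}

print(round(counterfactual_Y(ranks_C, ranks_T, clicks_C, eta), 3))             # 0.895
print(round(counterfactual_estimate([(ranks_C, ranks_T, clicks_C)], eta), 3))  # 0.895
```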
3.2. Proof of Unbiasedness

We now establish a series of unbiasedness results, eventually leading to the proof of Theorem 3.1.

Lemma 3.2. Let T and C be two stochastic policies. Under Assumptions 3.1, 3.2, 3.3, and 3.4, Y(R^C, R^T, B^C) is an unbiased estimator of E_{B^T|q,D,R^T}[M^T], i.e.,

E_{B^T|q,D,R^T}[M(R^T, B^T)] = E_{B^C|q,D,R^C,R^T}[Y(R^C, R^T, B^C)]    (4)

Proof. Via Assumptions 3.1 and 3.2,

E_{B^T|q,D,R^T}[M^T] = \sum_{d \in D} L(r^T(d)) \, E_{B^T|q,D,R^T}[b^T(d)]    (5)

By Assumption 3.3,

E_{B^T|q,D,R^T}[b^T(d)] = \eta(r^T(d)) \gamma(d, q)

Defining the shorthand

\Psi = \frac{L(r^T(d)) \, \eta(r^T(d))}{L(r^C(d)) \, \eta(r^C(d))}

it follows that

E_{B^T|q,D,R^T}[M^T]
  = \sum_{d \in D} L(r^T(d)) \eta(r^T(d)) \gamma(d, q)
  = \sum_{d \in D} L(r^C(d)) \eta(r^C(d)) \gamma(d, q) \Psi
  = \sum_{d \in D} L(r^C(d)) \, E_{B^C|q,D,R^C}[b^C(d)] \, \Psi
  = \sum_{d \in D} L(r^C(d)) \, E_{B^C|q,D,R^T,R^C}[b^C(d)] \, \Psi
  = E_{B^C|q,D,R^T,R^C}\Big[ \sum_{d \in D} L(r^C(d)) \, b^C(d) \, \Psi \Big]
  = E_{B^C|q,D,R^T,R^C}\Big[ \sum_{d \in D:\, b^C(d)=1} L(r^T(d)) \frac{\eta(r^T(d))}{\eta(r^C(d))} \Big]
  = E_{B^C|q,D,R^T,R^C}[Y(R^C, R^T, B^C)]

The third step in the derivation is due to Assumption 3.3; the fourth step is due to Assumption 3.4.

Lemma 3.3. Under Assumptions 3.1, 3.2, 3.3, and 3.4,

E_{q,D,R^T,B^T}[M(R^T, B^T)] = E_{q,D,R^C,B^C,R^T}[Y(R^C, R^T, B^C)]

Proof. By Assumption 3.4, R^T and R^C are conditionally independent. As a result, M^T(R^T, B^T) is also conditionally independent of R^C. Therefore

E_{B^T|q,D,R^T}[M^T(R^T, B^T)] = E_{B^T|q,D,R^T,R^C}[M^T(R^T, B^T)]

Combining this equation with Lemma 3.2 yields

E_{B^T|q,D,R^T,R^C}[M^T(R^T, B^T)] = E_{B^C|q,D,R^C,R^T}[Y(R^C, R^T, B^C)]

The expectations on both sides of the above equation are conditioned on the same joint distribution of q, D, R^T, R^C. Taking the expectation of both sides yields

E_{q,D,R^T,R^C,B^T}[M^T(R^T, B^T)] = E_{q,D,R^T,R^C,B^C}[Y(R^C, R^T, B^C)]    (6)

Again using Assumption 3.4, we can remove R^C from the expectation of M^T on the left-hand side of equation (6). This yields

E_{q,D,R^T,B^T}[M^T(R^T, B^T)] = E_{q,D,R^T,R^C,B^C}[Y(R^C, R^T, B^C)]

Theorem 3.1 can now be proved as follows:

Proof. Since (1/‖Q^C‖) \sum_{q \in Q^C} Y is the sample mean of Y, it is an unbiased estimator of the true mean E_{q,D,R^T,R^C,B^C}[Y], which, by Lemma 3.3, equals E_{q,D,R^T,B^T}[M^T(R^T, B^T)]. Thus it is an unbiased estimator of E_{q,D,R^T,B^T}[M^T(R^T, B^T)].
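The unbiasedness claim can also be sanity-checked numerically. Below is a small Monte Carlo sketch (my own illustration with assumed toy parameters, not from the paper) that simulates clicks under the position-based model of Assumption 3.3 for two independently randomized policies, and compares the average counterfactual estimate Y against the average of the directly simulated M^T.

```python
import random

# Monte Carlo sanity check of Theorem 3.1 under the PBM (Assumption 3.3).
# Hypothetical toy parameters: 5 documents, examination probabilities by rank,
# and per-document click probabilities given examination, for one fixed query.
random.seed(0)
eta = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35, 5: 0.25}   # eta(r)
gamma = {0: 0.9, 1: 0.6, 2: 0.4, 3: 0.2, 4: 0.1}    # gamma(d, q)
docs = list(gamma)

def L(rank, k=3):
    return 1.0 / k if rank <= k else 0.0

def random_ranking():
    order = docs[:]
    random.shuffle(order)
    return {d: pos + 1 for pos, d in enumerate(order)}   # document -> rank

def simulate_clicks(ranks):
    return {d: int(random.random() < eta[ranks[d]] * gamma[d]) for d in docs}

direct, counterfactual = [], []
for _ in range(200_000):
    ranks_C, ranks_T = random_ranking(), random_ranking()   # independent (Assumption 3.4)
    clicks_C, clicks_T = simulate_clicks(ranks_C), simulate_clicks(ranks_T)
    direct.append(sum(L(ranks_T[d]) * clicks_T[d] for d in docs))            # M^T
    counterfactual.append(sum(L(ranks_T[d]) * eta[ranks_T[d]] / eta[ranks_C[d]]
                              for d in docs if clicks_C[d]))                 # Y
print(sum(direct) / len(direct), sum(counterfactual) / len(counterfactual))  # nearly equal
```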
3.3. Technical Discussion

Theorem 3.1 also holds when both T and C are deterministic policies, without Assumption 3.4; the proof is omitted due to space limits. In practical ranking systems, the output of one ranker is frequently incorporated into another. This violates Assumption 3.4, which requires the two policies to share nothing except their inputs.

The unbiased estimator looks different from its counterpart in equation (4) of [1], where the positional bias appears only once. The difference is easy to understand with a causal view: the assumption common to [1] and the current work is that relevancy and positional bias jointly drive behavioral signals. When it comes to estimation, however, we are interested in different subjects. [1] estimates relevancy (the cause) from clicks (the effect), so it uses a single 1/Q propensity factor to cancel out the positional bias of the behavioral policy. The present work estimates metrics defined on the behavioral signal of the target policy (the effect), from data generated by the behavioral policy (a second effect). Two positional bias terms are thus needed to cancel the effect.

The counterfactual estimator (2) aims to avoid the high-variance challenge facing other IPS estimators, e.g., in [9]. It is common practice to use IPS estimators to construct estimates for metrics of interest. While such estimators enjoy the desirable property of unbiasedness, their variance profile is a concern. The core of any IPS estimator is the ratio of the probabilities of a ranking R being selected by the two policies T and C, i.e., Pr_T(R|q, D) / Pr_C(R|q, D). In practice, the ranking space is (combinatorially) large and action probabilities are small; dividing one small number by another can produce extremely small or large ratios. When either policy is deterministic, the ratio is ill-defined. The problem gets worse when the behavioral and target policies differ significantly, i.e., exactly when accurate offline performance evaluation is needed most. As a result, the ratio can have high variance; this prevents IPS estimators from being useful in industry applications. The current approach avoids this problem: counterfactual estimation with equation (2) no longer needs the high-variance action probability ratio. Instead it uses the ratio between positional bias estimates (η), a function of rank positions. The ratio of η values empirically has much lower variance than an estimated action probability ratio.

The current framework can be generalized in three different ways. First, it naturally extends to contextual ranking problems, where q represents not only the search query but also all context information available to the ranker. Secondly, it can be generalized to optimize query/document-specific rewards; this is useful when different documents have different economic value. Assumption 3.2 can be relaxed to M(R, B) = \sum_{d \in D} L(r(d), q, D) b(d), where L(r, q, D) is a deterministic function of rank r, query q, and document set D. Last but not least, the probability of examination η and the conditional click probability γ can depend on the query q and the candidate document set D. That is, Assumption 3.3 can be relaxed to E_{B^T|q,D}[b^T(d)] = η(r^T(d), q, D) γ(d, q, D). The same is true for Assumption 3.4.
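The variance contrast can be made tangible with a toy computation. The sketch below is my own illustration, not from the paper: it assumes Plackett-Luce policies purely as a stand-in for stochastic rankers and an assumed 1/r examination curve, and shows that the action-probability ratio for a realistic ranking is astronomically small while the per-document η ratio stays within fixed bounds.

```python
import math, random

# Toy contrast between the IPS ratio Pr_T(R)/Pr_C(R) and the positional-bias ratio
# eta(r^T)/eta(r^C). Plackett-Luce policies and the 1/r examination curve are assumptions.
random.seed(1)

def plackett_luce_logprob(order, scores):
    """Log-probability of showing documents in `order` under a Plackett-Luce policy."""
    logp, remaining = 0.0, list(order)
    for d in order:
        logp += scores[d] - math.log(sum(math.exp(scores[x]) for x in remaining))
        remaining.remove(d)
    return logp

n = 20
scores_C = [random.gauss(0, 1) for _ in range(n)]
scores_T = [random.gauss(0, 1) for _ in range(n)]      # a noticeably different policy
order = sorted(range(n), key=lambda d: -scores_C[d])   # a ranking C is likely to produce

log_ratio = plackett_luce_logprob(order, scores_T) - plackett_luce_logprob(order, scores_C)
print("IPS ratio Pr_T(R)/Pr_C(R):", math.exp(log_ratio))        # typically tiny or huge

eta = [1.0 / (r + 1) for r in range(n)]                          # assumed examination curve
print("eta ratio bounds:", eta[-1] / eta[0], eta[0] / eta[-1])   # stays within [1/20, 20]
```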
4. Validating and Selecting Positional Bias Models

The unbiased counterfactual estimator has three potential uses: to evaluate offline ranking performance, to validate and select positional bias estimates, and to serve as a loss (or reward, in a reinforcement learning setting) when training new ranking models. Some of these have been covered in the literature: see [9] for a discussion of offline ranking performance evaluation and [1] for a discussion of training loss improvement. The rest of this section focuses on validating and selecting positional bias models, an area not covered in past work. Positional bias models can be developed in many ways, depending on theory (e.g., the underlying statistical model, the causal structure, the inclusion and exclusion of predictive features), data, and estimation processes. When there is one model, the question is how correct it is. When there are multiple models, the question is how to select the best one for a specific use case.

Using the counterfactual estimator, a method can be developed to validate and select positional bias models. It is based on the following idea: we already have one unbiased estimator of E[M^T] that uses positional bias estimates as input, namely the Y values defined in equations (1) and (2). If we find a second unbiased estimator of E[M^T] that does not use positional bias estimates, the difference between the two estimators can be used to evaluate the correctness of the positional bias model, since two unbiased estimates of the same quantity (the population mean) should converge. In fact, if we run policy T on a set of queries Q^T, E[M^T] can be directly estimated as

E_{Q^T}\Big[ \frac{1}{\|Q^T\|} \sum_{q \in Q^T} M(R^T, B^T) \Big]

The model validation process takes three steps: data collection, estimation, and testing. The first step is to collect data via an online ranking experiment. The experiment should have two treatment groups (C and T), each with a different ranking policy, and we observe behavioral signals (e.g., clicks) for both groups. For every query in group C, we also rank the documents with policy T in "shadow mode" and log the ranking from T, even though we do not know which documents would have been clicked had policy T been applied. The second step is to use the data to compute the two unbiased estimators defined previously. In the third step, we use the two estimates to construct a model validation test. A simple approach is to treat the direct sample mean estimator (which does not use positional bias estimates) as the "ground truth", as long as the sample size is big enough; the difference between the two estimators then quantifies the correctness of the model. A more statistically rigorous method is to treat the two estimates as group means of random variables with estimated standard deviations; standard hypothesis testing readily applies.
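The testing step can be implemented as a standard two-sample comparison. The sketch below is illustrative only; the per-query values, helper names, and the normal-approximation test are my assumptions. It compares the counterfactual estimate computed from the C group (per-query Y values) against the direct estimate from the T group (per-query M(R^T, B^T) values).

```python
import math

# Validation test sketch: compare the counterfactual estimate (group C, using a
# candidate positional bias model) with the direct estimate (group T).
# `ys` are per-query Y values from equation (1); `ms` are per-query M(R^T, B^T) values.

def mean_and_se(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

def two_sample_z(ys, ms):
    """Normal-approximation test of H0: the two estimators share the same mean."""
    mean_y, se_y = mean_and_se(ys)
    mean_m, se_m = mean_and_se(ms)
    z = (mean_y - mean_m) / math.sqrt(se_y ** 2 + se_m ** 2)
    return mean_y, mean_m, z

# Hypothetical per-query values; in practice these come from the logged experiment.
ys = [0.895, 0.70, 0.41, 0.66, 0.88, 0.52]
ms = [0.667, 0.667, 0.333, 0.667, 1.000, 0.333]
print(two_sample_z(ys, ms))  # a large |z| casts doubt on the positional bias model
```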
5. Conclusion

We built a counterfactual estimator for ranking metrics defined on behavioral signals. The estimator is unbiased and has low variance. We discussed its use in selecting and validating positional bias models. The estimator can be applied to ranking models with strong counterfactual effects.

References

[1] T. Joachims, A. Swaminathan, T. Schnabel, Unbiased learning-to-rank with biased feedback, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17), 2017, pp. 781–789.
[2] T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), 2002, pp. 133–142.
[3] T. Joachims, L. A. Granka, B. Pan, H. A. Hembrooke, F. Radlinski, G. Gay, Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search, ACM Transactions on Information Systems 25 (2007).
[4] N. Craswell, O. Zoeter, M. Taylor, B. Ramsey, An experimental comparison of click position-bias models, in: Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM '08), 2008, pp. 87–94.
[5] O. Chapelle, Y. Zhang, A dynamic bayesian network click model for web search ranking, in: Proceedings of the 18th International Conference on World Wide Web (WWW '09), 2009, pp. 1–10.
[6] A. Chuklin, I. Markov, M. de Rijke, Click models for web search, Synthesis Lectures on Information Concepts, Retrieval, and Services 7 (2015) 1–115.
[7] A. Borisov, I. Markov, M. de Rijke, P. Serdyukov, A neural click model for web search, in: Proceedings of the 25th International Conference on World Wide Web (WWW '16), 2016, pp. 531–541.
[8] T. Schnabel, A. Swaminathan, P. Frazier, T. Joachims, Unbiased comparative evaluation of ranking functions, 2016. arXiv:1604.07209.
[9] S. Li, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vishwa, Z. Wen, Offline evaluation of ranking policies with click models, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 1685–1694.
[10] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, M. Najork, Position bias estimation for unbiased learning to rank in personal search, in: Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM '18), 2018, pp. 610–618.
[11] A. Agarwal, I. Zaitsev, T. Joachims, Consistent position bias estimation without online interventions for learning-to-rank, 2018. arXiv:1806.03555.
[12] A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, T. Joachims, Estimating position bias without intrusive interventions, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19), 2019.
[13] A. Agarwal, K. Takatsu, I. Zaitsev, T. Joachims, A general framework for counterfactual learning-to-rank, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 5–14.