
Unbiased Counterfactual Estimation of Ranking Metrics

Haining Yu

Amazon, Seattle, WA, USA

We propose a novel method to estimate metrics for a ranking policy, based on behavioral signal data (e.g., clicks or viewing of video content) generated by a second, different policy. Building on [1], we prove that the counterfactual estimator is unbiased and discuss its low-variance property. The estimator can be used to evaluate ranking model performance offline, to validate and select positional bias models, and to serve as a learning objective when training new models.

Keywords: learning to rank, presentation bias, counterfactual inference

1. Introduction

Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021
hainiy@amazon.com (H. Yu)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Ranking algorithms power large-scale information retrieval systems. They rank web pages when users look for information in search engines, or products when users shop on e-retailers’ websites. Such systems process billions of queries on a daily basis; they also generate large volumes of logs. The logs capture online user behavior (e.g., clicking URLs or viewing video content) and can be used to improve ranking algorithms. As a result, training new ranking models from logs is a central task in Learn To Rank theory and application; it is also often referred to as learning from “implicit feedback” or “counterfactual learn to rank” in the literature (e.g., [1]).

Counterfactual learn-to-rank is complicated by the presence of “positional bias”. Ranking algorithms determine the position of ranked documents. If the search result page has a “vertical list” layout, the document with rank 1 is on top of the page; if the result page has a horizontal layout, the document with rank 1 is in the top left corner. When positional bias is present, a document has a higher chance of being examined by the user when ranked higher. As a result, when a user clicks a document, the click (the “behavioral signal”) can be due to one of two reasons: either the document is relevant for the given query, or the document is on top of the list. When positional bias is present, document ranking and relevancy jointly determine behavioral signals, making the signal a noisy proxy for relevancy, the primary goal of ranking optimization.

In the context of counterfactual learn-to-rank, we refer to the algorithm generating the log data as the “behavioral policy”. Data generated by the behavioral policy is used to train a hopefully better algorithm, called the “target policy”. Research work in counterfactual learn-to-rank can be loosely grouped into training and evaluation. For training, the question is how to properly use knowledge of positional bias to train a target policy and maximize relevancy. To start, positional bias models estimate the probability that a document is examined by a user in a given position; the estimation is based on different user behavioral models. Such models, often called “click models”, have become widely available; see [2][3][4][5][6][7][8][9][10][11][12]. Built on positional bias models, the seminal work of [1] and [13] established a framework to optimize relevancy using noisy behavioral signal data, proving unbiasedness results for ranking metrics with additive form. For evaluation, the question is how to evaluate the target policy, once trained. For industry ranking applications, the gold standard for evaluation is to A/B test the target policy against the behavioral policy, collect data on both, and compare ranking metrics such as Average Precision and NDCG. This approach is restricted by limited experimentation time. As an alternative, offline evaluation like [9] predicts target policy ranking metrics using data from the behavioral policy.

The research discussed above, in particular [1] and [9], has significantly advanced our understanding of counterfactual learn-to-rank. Meanwhile, each line of research has its pros and cons. Let us use [1] and [9] to highlight them. First of all, the two lines of research focus on different subjects in a causal relationship. Borrowing a causal lens where relevancy and positional bias jointly drive behavioral signals, [1] focuses on relevancy, the “cause”, while [9] focuses on behavioral signals, the “effect”. It is an open question whether the approach in [1] can be extended to optimize behavioral signal-based metrics (e.g., clicks). Secondly, the two lines of research also differ in validation: once developed, models in [9] can be validated by comparing offline evaluation with online experimentation measurements. For [1], even if we can optimize relevancy, we cannot easily evaluate how much improvement is made, even with online experimentation. This is because evaluating relevancy (and its improvement) ultimately requires manual annotation; for large-scale online search engines that process billions of queries daily, such effort is costly. Last but not least, [9] relies on high-variance Inverse Propensity Scoring techniques based on the entire ranking permutation (or “action”).

In ranking, the action space is large: for example, there are $100!/(100 - 20)! \approx 1.3 \times 10^{39}$ ways to select and order 20 documents out of 100. As a result, action probabilities are small. The ratio between two small probabilities can be extremely small or extremely large (high variance), making the technique challenging to implement in practical situations. Rank-based propensity adjustment in [1], using positional effect models, has a more desirable variance property. This difference is key to the accuracy of offline evaluation.
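The size of the action space quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the uniform policy used to turn the count into an action probability is an assumption made here for illustration and is not a policy discussed in the paper.

```python
import math

# Number of ordered selections of 20 documents out of 100: 100! / (100 - 20)!
num_rankings = math.perm(100, 20)

# Assumption for illustration only: a policy that picks every ranking uniformly
# at random would assign each individual ranking this probability.
uniform_action_prob = 1 / num_rankings

print(f"{num_rankings:.2e}")         # on the order of 1e+39
print(f"{uniform_action_prob:.2e}")  # on the order of 1e-39
```

Probabilities of this magnitude are what an action-level propensity ratio has to divide by, which is why small estimation errors can translate into very large or very small weights.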

This paper brings the two lines of research together.

The main contribution is an unbiased estimator of ranking metrics defined on behavioral signals. In this sense, it is part of the study of the “effect” side of ranking dynamics. Because it focuses on the “effect”, it can be validated by offline evaluation and experimentation. Meanwhile, it retains the desirable unbiasedness properties of [1] and [9], but replaces high-variance Inverse Propensity Scoring adjustments with positional bias ratios, borrowing the key insight from [1]. Since the focus switches from cause to effect, this requires new techniques and yields unbiased estimators of a new kind. The unbiased estimator can serve as the learning objective for a new target policy and enables offline/online evaluation. It can also be used to establish a method to validate and select positional bias models, a key input to the counterfactual estimation framework.

2. Problem Set Up

Let $q$ be a random query. For $q$, the set of documents to rank is $D = [d_1, d_2, \ldots]$. A ranking policy $\pi$ assigns ranks $R = [R(d_1), R(d_2), \ldots]$ to the documents in $D$. $R$, a random permutation of $[1, \ldots, \|D\|]$, determines the position of documents on the web page. For example, in a “vertical list” layout, the document with $R = 1$ is on top of the page. After presenting $D$ in the order of $R$ to the user, we observe the behavioral signal $B$, a binary vector $B = [B(d_1), B(d_2), \ldots]$, where $B(d) = 1$ if and only if the user engages with $d \in D$ (e.g., clicking a web page or watching a video). Given a ranking vector $R$ and the behavioral signal vector $B$, we define a ranking metric of interest $M = M(R, B)$, such as Precision or DCG.

Table 1 shows a hypothetical example. For $q = 1$, $D = [100, 200, 300]$ represents three documents to rank. The behavioral policy $\pi_b$ ranks them as $R^{\pi_b} = [1, 2, 3]$, i.e., it shows document 100 first and 300 last. Seeing the list, the user ignores the top document 100 and clicks the other two, i.e., $B^{\pi_b} = [0, 1, 1]$. If we use Precision@3 to measure the performance of the ranking policy, we get $M(R^{\pi_b}, B^{\pi_b}) = 0.667$. Saved in the log, the data is used to train a target policy later. The new policy $\pi_t$ ranks differently, i.e., $R^{\pi_t} = [3, 1, 2]$. This seems an improvement. But the user never sees the documents in the order [200, 300, 100], and we do not know whether they would click differently. In other words, $B^{\pi_t}$ is missing data. This is an example of causal inference, thus the name counterfactual. Estimating $M^{\pi_t} = M(R^{\pi_t}, B^{\pi_t})$ without observing $B^{\pi_t}$ is the central task of this paper.

All quantities defined above are random variables (or vectors). The dependencies among them are as follows: the query $q$ determines the document set $D$, i.e., $D = D(q)$. $q$, $D$, and the ranking policy $\pi$ jointly determine the rank vector $R = R(q, D, \pi)$. $q$, $D$, and $R$ then jointly determine the behavioral signal vector $B = B(q, D, R)$. Last but not least, $R$ and $B$ determine $M = M(R, B)$. The table below visualizes the structural causal model. The randomness in the system comes from multiple sources: the distribution of $q$, the conditional distribution of the candidate set $D \mid q$, the conditional distribution of $R \mid q, D, \pi$, and the conditional distribution of $B \mid q, D, R$. The only exception is that no distribution is needed on $M \mid R, B$. Given $R$ and $B$, the value of $M$ is deterministic for most practical ranking systems and metrics.

Because the analysis involves two policies, we always specify the policy generating the data. That is, we use $R^{\pi}$ to denote ranks generated by policy $\pi$, $B^{\pi}$ to denote the behavioral signal generated when showing $D$ in the order of $R^{\pi}$ to the user, and $M^{\pi} = M(R^{\pi}, B^{\pi})$ to denote the ranking metric calculated. This helps distinguish between random variables under different policies. We omit other dependencies in the notation when confusion can be avoided.
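The Table 1 example can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `precision_at_k` and the use of plain lists for the rank and click vectors are assumptions made for illustration; only the numbers come from the example.

```python
def precision_at_k(ranks, clicks, k):
    """Fraction of the top-k positions that received a click."""
    return sum(c for r, c in zip(ranks, clicks) if r <= k) / k

docs = [100, 200, 300]

# Logged data from the behavioral policy: ranks and observed clicks per document.
logged_ranks = [1, 2, 3]
logged_clicks = [0, 1, 1]
logged_metric = precision_at_k(logged_ranks, logged_clicks, k=3)
print(round(logged_metric, 3))  # 0.667

# Ranks proposed by the target policy; the clicks it would receive are unobserved.
target_ranks = [3, 1, 2]
display_order = [d for _, d in sorted(zip(target_ranks, docs))]
print(display_order)  # [200, 300, 100]
```

The metric for the target policy cannot be computed the same way, because the click vector under its display order was never logged.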

3. The Unbiased Counterfactual Estimator

In this section we define the counterfactual estimator and prove its unbiasedness. We first present the main results in Section 3.1, prove unbiasedness in Section 3.2, and discuss technical details in Section 3.3. Throughout, $\pi_b$ denotes the behavioral policy and $\pi_t$ the target policy.

3.1. Main Result

We first state the assumptions needed to define the estimator and prove its unbiasedness. First:

Assumption 3.1. For a policy $\pi$, conditional on the ranking vector $R^{\pi}$ and the behavioral signal $B^{\pi}$, $M(R^{\pi}, B^{\pi})$ is conditionally independent of $q$ and $D$, i.e., $M(R^{\pi}, B^{\pi}) \perp\!\!\!\perp q, D \mid R^{\pi}, B^{\pi}$.

This is easily satisfied for most ranking metrics such as MRR, MAP, Precision, and NDCG. Next, similar to [1][9][13], we assume the ranking metric of interest is linearly decomposable, i.e.,

Assumption 3.2. $M(R^{\pi}, B^{\pi}) = \sum_{d \in D} f(R^{\pi}(d))\, B^{\pi}(d)$, where $f(r)$ is a deterministic function of rank $r$.

For Precision@3, $f(r) = 1/3$ if $r \le 3$, and 0 otherwise.

For $d \in D$, $B(d)$ is a binary random variable and $\mathbb{E}[B(d)]$ is the click probability. Similar to [9] and [10], we make the following assumption based on the position-based click model (PBM) [6]:

Assumption 3.3. $\mathbb{E}_{B^{\pi} \mid q, D}[B^{\pi}(d)] = P(R^{\pi}(d))\, \alpha(q, d)$, where $P(r) > 0$ is the probability of examining rank $r$ and $\alpha(q, d)$ is the probability of a click conditional on being examined.

When using data generated by the behavioral policy $\pi_b$ to train a new policy $\pi_t$, we also assume that $\pi_b$ and $\pi_t$ share nothing in common except the inputs $q$ and $D$. For example, the output of one policy is not used as input to another:

Assumption 3.4. Given two policies $\pi_b$ and $\pi_t$, $R^{\pi_b}$ and $R^{\pi_t}$ are conditionally independent given $q$ and $D$, i.e., $R^{\pi_b} \perp\!\!\!\perp R^{\pi_t} \mid q, D$.

The main result of the paper states that:

Theorem 3.1. Define

$$\hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b}) = \sum_{d \in D,\; B^{\pi_b}(d) = 1} f(R^{\pi_t}(d))\, \frac{P(R^{\pi_t}(d))}{P(R^{\pi_b}(d))} \tag{1}$$

Let $Q$ be queries randomly sampled from the query universe where policy $\pi_b$ is applied. Under Assumptions 3.1, 3.2, 3.3, and 3.4,

$$\frac{1}{\|Q\|} \sum_{q \in Q} \hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b}) \tag{2}$$

is an unbiased estimator for $\mathbb{E}_{q, D, R^{\pi_t}, B^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})]$, i.e.,

$$\mathbb{E}_{q, D, R^{\pi_t}, B^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})] = \mathbb{E}\left[\frac{1}{\|Q\|} \sum_{q \in Q} \hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b})\right] \tag{3}$$

where the expectation on the right-hand side is taken over the query set $Q$ and, for every $q \in Q$, over $D$, $R^{\pi_t}$, $R^{\pi_b}$, and $B^{\pi_b}$.

Let us use the same example in Table 1 to illustrate how the estimator is computed. Assume we have the following positional bias estimates: $P(1) = 0.9$, $P(2) = 0.7$, $P(3) = 0.5$ (as a reminder, such estimates can be obtained via statistical estimation procedures; see [12] and references therein for implementation). Recall that the ranking vector is $R^{\pi_b} = [1, 2, 3]$ and the behavioral signal is $B^{\pi_b} = [0, 1, 1]$. For the metric Precision@3, $M^{\pi_b} = 0.667$. For policy $\pi_t$, we observe $R^{\pi_t} = [3, 1, 2]$ but not $B^{\pi_t}$. So we use $R^{\pi_t}$, $R^{\pi_b}$, $B^{\pi_b}$, and equation (1) to compute the following estimate: $\hat{M} = \left(0 + 1 \times \frac{P(1)}{P(2)} + 1 \times \frac{P(2)}{P(3)}\right)/3 = 0.895$. Averaging the $\hat{M}$s over queries in $Q$ yields the counterfactual estimator (2).

3.2. Proof of Unbiasedness

We now set up a series of unbiasedness results, eventually leading to the proof of Theorem 3.1.

Lemma 3.2. Let $\pi_b$ and $\pi_t$ be two stochastic policies. Under Assumptions 3.1, 3.2, 3.3, and 3.4, $\hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b})$ is an unbiased estimator for $\mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}}[M^{\pi_t}]$, i.e.,

$$\mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})] = \mathbb{E}_{B^{\pi_b} \mid q, D, R^{\pi_t}, R^{\pi_b}}[\hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b})] \tag{4}$$

Proof. Via Assumptions 3.1, 3.2 and 3.3,

$$\mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}}[M^{\pi_t}] = \sum_{d \in D} f(R^{\pi_t}(d))\, \mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}}[B^{\pi_t}(d)] \tag{5}$$

By Assumption 3.3, with $P(r)$ the probability of examining rank $r$ and $\alpha(q, d)$ the probability of a click conditional on examination (superscripts $\pi_t$ and $\pi_b$ denote the target and behavioral policies),

$$\mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}}[B^{\pi_t}(d)] = P(R^{\pi_t}(d))\, \alpha(q, d)$$

Defining a shorthand $\Psi = \frac{P(R^{\pi_t}(d))}{P(R^{\pi_b}(d))}$, it follows that

$$\begin{aligned}
\mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}}[M^{\pi_t}]
&= \sum_{d \in D} f(R^{\pi_t}(d))\, P(R^{\pi_t}(d))\, \alpha(q, d) \\
&= \sum_{d \in D} f(R^{\pi_t}(d))\, P(R^{\pi_b}(d))\, \alpha(q, d)\, \Psi \\
&= \sum_{d \in D} f(R^{\pi_t}(d))\, \mathbb{E}_{B^{\pi_b} \mid q, D, R^{\pi_b}}[B^{\pi_b}(d)]\, \Psi \\
&= \sum_{d \in D} f(R^{\pi_t}(d))\, \mathbb{E}_{B^{\pi_b} \mid q, D, R^{\pi_t}, R^{\pi_b}}[B^{\pi_b}(d)]\, \Psi \\
&= \mathbb{E}_{B^{\pi_b} \mid q, D, R^{\pi_t}, R^{\pi_b}}\left[\sum_{d \in D} f(R^{\pi_t}(d))\, B^{\pi_b}(d)\, \Psi\right] \\
&= \mathbb{E}_{B^{\pi_b} \mid q, D, R^{\pi_t}, R^{\pi_b}}\left[\sum_{d \in D,\; B^{\pi_b}(d) = 1} f(R^{\pi_t}(d))\, \frac{P(R^{\pi_t}(d))}{P(R^{\pi_b}(d))}\right] \\
&= \mathbb{E}_{B^{\pi_b} \mid q, D, R^{\pi_t}, R^{\pi_b}}[\hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b})]
\end{aligned}$$

The third step in the derivation is due to Assumption 3.3; the fourth step is due to Assumption 3.4.

Lemma 3.3. Under Assumptions 3.1, 3.2, 3.3, and 3.4,

$$\mathbb{E}_{q, D, R^{\pi_t}, B^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})] = \mathbb{E}_{q, D, R^{\pi_t}, R^{\pi_b}, B^{\pi_b}}[\hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b})]$$

Proof. By Assumption 3.4, $R^{\pi_t}$ and $R^{\pi_b}$ are conditionally independent. As a result, $M(R^{\pi_t}, B^{\pi_t})$ is also conditionally independent of $R^{\pi_b}$. Therefore

$$\mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})] = \mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}, R^{\pi_b}}[M(R^{\pi_t}, B^{\pi_t})]$$

Combining this equation with Lemma 3.2 yields

$$\mathbb{E}_{B^{\pi_t} \mid q, D, R^{\pi_t}, R^{\pi_b}}[M(R^{\pi_t}, B^{\pi_t})] = \mathbb{E}_{B^{\pi_b} \mid q, D, R^{\pi_t}, R^{\pi_b}}[\hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b})]$$

The expectations on both sides of the above equation are conditioned on the same joint distribution of $q$, $D$, $R^{\pi_t}$, $R^{\pi_b}$. Taking the expectation over both sides of the equation yields:

$$\mathbb{E}_{q, D, R^{\pi_t}, R^{\pi_b}, B^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})] = \mathbb{E}_{q, D, R^{\pi_t}, R^{\pi_b}, B^{\pi_b}}[\hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b})] \tag{6}$$

Again using Assumption 3.4, we can remove $R^{\pi_b}$ from the expectation of $M$ on the left-hand side of equation (6). This yields

$$\mathbb{E}_{q, D, R^{\pi_t}, B^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})] = \mathbb{E}_{q, D, R^{\pi_t}, R^{\pi_b}, B^{\pi_b}}[\hat{M}(R^{\pi_t}, R^{\pi_b}, B^{\pi_b})]$$

Theorem 3.1 can now be proved as follows:

Proof. Since $\frac{1}{\|Q\|} \sum_{q \in Q} \hat{M}$ is a sample mean of $\hat{M}$, it is an unbiased estimator of the true mean $\mathbb{E}_{q, D, R^{\pi_t}, R^{\pi_b}, B^{\pi_b}}[\hat{M}]$, which, by Lemma 3.3, is equal to $\mathbb{E}_{q, D, R^{\pi_t}, B^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})]$. Thus it is an unbiased estimator of $\mathbb{E}_{q, D, R^{\pi_t}, B^{\pi_t}}[M(R^{\pi_t}, B^{\pi_t})]$.

3.3. Technical Details

Theorem 3.1 holds when both $\pi_b$ and $\pi_t$ are deterministic policies, without Assumption 3.4. The proof is omitted due to space limits. In practical ranking systems, the output of one ranker is frequently incorporated into another. This violates Assumption 3.4, which requires the two policies to share nothing except inputs.

The unbiased estimator looks different from its counterpart in equation (4) of [1], where the positional bias appears only once. The difference is easy to understand with a causal view: the common assumption in [1] and the current work is that relevancy and positional bias jointly drive behavioral signals. When it comes to estimation, we are interested in different subjects. [1] is interested in estimating relevancy (the cause) from clicks (the effect), so it has the $1/P$ factor to cancel out the positional bias from the behavioral policy. The present work is interested in estimating metrics defined on the behavioral signal (the effect) under the target policy, from data generated by a behavioral policy (a second effect). Two positional bias terms are thus needed to cancel the effect.

The counterfactual estimator (2) aims to avoid the high-variance challenge facing other IPS estimators, e.g., in [9]. It is common practice to use IPS estimators to construct estimates for metrics of interest. While such estimators enjoy the desirable property of unbiasedness, their variance profile is of concern. The core of any IPS estimator is the ratio of the probabilities that a ranking is selected by two different policies $\pi_t$ and $\pi_b$, i.e., $\Pr(R \mid q, \pi_t) / \Pr(R \mid q, \pi_b)$. In practice, the ranking space is (combinatorially) large and action probabilities are small. Dividing one small number by another can generate extremely small or large ratios. When either policy is deterministic, $\Pr(R \mid q, \pi)$ is ill-defined. The problem gets worse when the behavioral and target policies differ significantly, i.e., when accurate offline performance evaluation is needed most. As a result, the ratio can have high variance; this prevents IPS estimators from being useful in industry applications. The current approach solves this problem. Counterfactual estimation using equation (2) no longer needs the high-variance action probability ratio. Instead it uses the ratio between positional bias estimates $P(r)$, a function of rank positions. The ratio of $P$ empirically has much less variance than the estimated action probability ratio.

The current framework can be generalized in three different ways. First, it naturally extends to contextual ranking problems, where $q$ represents not only the search query, but also all context information available to the ranker. Secondly, it can be generalized to optimize query/document-specific rewards. This makes it easy to handle cases where different documents have different economic value. Assumption 3.2 can be relaxed to $M(R, B) = \sum_{d \in D} f(R(d), q, D)\, B(d)$, where $f(r, q, D)$ is a deterministic function of rank $r$, query $q$, and document set $D$. Last but not least, the probability of examination $P$ and the conditional click probability $\alpha$ can depend on the query $q$ and the candidate document set $D$. That is, Assumption 3.3 can be relaxed to $\mathbb{E}_{B^{\pi} \mid q, D}[B^{\pi}(d)] = P(R^{\pi}(d), q, D)\, \alpha(q, d, D)$. The same is true for Assumption 3.4.

4. Validating and Selecting Positional Bias Models

The unbiased counterfactual estimator has three potential uses: to evaluate offline ranking performance, to validate and select positional bias estimates, and to serve as the loss (or reward in a reinforcement learning setting) in training new ranking models. Some have been covered by the literature. See [9] for discussion of offline ranking performance evaluation and [1] for discussion of training loss improvement. The rest of this section focuses on validating and selecting positional bias models, an area not covered in past work. Positional bias models can be developed in many ways, depending on theory (e.g., underlying statistical model, the causal structure, inclusion and exclusion of predictive features), data, and estimation processes. When there is one model, the question is how correct it is. When there are multiple models, the question is how to select the best one for a specific use case.

Using the counterfactual estimator, a method can be developed to validate and select positional bias models. It is based on the following idea: we already have one unbiased estimator of $\mathbb{E}[M^{\pi_t}]$ using positional bias estimates as input; they are the $\hat{M}$s defined in equations (1) and (2). If we find a second unbiased estimator for $\mathbb{E}[M^{\pi_t}]$ that does not use positional bias estimates, the difference between the two estimators can be used to evaluate the correctness of positional bias models. Two unbiased estimates of the same quantity (the population mean) should converge. In fact, if we run policy $\pi_t$ on a set of queries $Q$, $\mathbb{E}[M^{\pi_t}]$ can be directly estimated as $\frac{1}{\|Q\|} \sum_{q \in Q} M(R^{\pi_t}, B^{\pi_t})$.

The model validation process takes three steps: data collection, estimation, and testing. The first step is to collect data via an online ranking experiment. The experiment should have two treatment groups (C and T), each with a different ranking policy. We then observe behavioral signals (e.g., clicks) for both groups. For every query in T, we also rank the documents with policy C in “shadow mode” and log the ranking from C, even though we do not know which documents would have been clicked had policy C been applied. The second step is to use the data to compute the two unbiased estimators previously defined. In the third step, we use the two estimates to construct a model validation test. A simple approach is to treat the sample mean estimator as the “ground truth”, as long as the sample size is big enough. The difference between the two estimators can thus be used to quantify the correctness of the model. A method with more statistical rigor is to treat the two estimates as group means of random variables with estimated standard deviations. Standard hypothesis testing readily applies.
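The estimation and testing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the dictionary representation of the positional bias model, and the choice of Welch's t statistic as the "standard hypothesis test" are all assumptions made here. The numbers reproduce the worked example from Section 3.1 (positional bias estimates 0.9, 0.7, 0.5 and Precision@3).

```python
import math
from statistics import mean, variance

def counterfactual_estimate(target_ranks, logged_ranks, logged_clicks, exam_prob, weight):
    """Per-query estimate: over clicked documents, sum
    weight(target rank) * P(target rank) / P(logged rank)."""
    total = 0.0
    for r_target, r_logged, clicked in zip(target_ranks, logged_ranks, logged_clicks):
        if clicked == 1:
            total += weight(r_target) * exam_prob[r_target] / exam_prob[r_logged]
    return total

def precision_at_3_weight(rank):
    # Per-rank weight chosen so that the weighted click sum equals Precision@3.
    return 1 / 3 if rank <= 3 else 0.0

def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for the difference between two group means
    (one possible test; each sample needs at least two values)."""
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Worked example: logged ranks [1, 2, 3], clicks [0, 1, 1], target ranks [3, 1, 2].
exam_prob = {1: 0.9, 2: 0.7, 3: 0.5}
estimate = counterfactual_estimate(
    target_ranks=[3, 1, 2],
    logged_ranks=[1, 2, 3],
    logged_clicks=[0, 1, 1],
    exam_prob=exam_prob,
    weight=precision_at_3_weight,
)
print(round(estimate, 3))  # 0.895
```

In the validation setting, one sample would hold the per-query estimates computed on group T (using the shadow-mode rankings from policy C as `target_ranks`), and the other the per-query metrics observed directly in group C. A statistic far from zero would suggest that the positional bias model behind `exam_prob` is inaccurate.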

5. Conclusion

We built a counterfactual estimator for ranking metrics defined on behavioral signals. The estimator is unbiased and has low variance. We discuss its usage in selecting and validating positional bias models. This estimator can be applied to ranking models with strong counterfactual effect.

[1] Joachims, Swaminathan, T. Schnabel, Unbiased learning-to-rank with biased feedback, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 781-789.

[2] Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, 2002, pp. 133-142.

[3] Joachims, L. A. Granka, Pan, H. A. Hembrooke, Radlinski, G. Gay, Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search, ACM Transactions on Information Systems 25 (2007).

[4] Craswell, Zoeter, Taylor, B. Ramsey, An experimental comparison of click position-bias models, in: Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM '08, 2008, pp. 87-94.

[5] Chapelle, Y. Zhang, A dynamic bayesian network click model for web search ranking, in: Proceedings of the 18th International Conference on World Wide Web, WWW '09, 2009, pp. 1-10.

[6] Chuklin, I. Markov, M. de Rijke, Click models for web search, Synthesis Lectures on Information Concepts, Retrieval, and Services 7 (2015) 1-115.

[7] Borisov, I. Markov, M. de Rijke, Serdyukov, A neural click model for web search, in: Proceedings of the 25th International Conference on World Wide Web, WWW '16, 2016, pp. 531-541.

[8] Schnabel, Swaminathan, Frazier, T. Joachims, Unbiased comparative evaluation of ranking functions, 2016. arXiv:1604.07209.

[9] Li, Abbasi-Yadkori, Kveton, Muthukrishnan, Vishwa, Wen, Offline evaluation of ranking policies with click models, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 1685-1694.

[10] Wang, Golbandi, Bendersky, Metzler, Najork, Position bias estimation for unbiased learning to rank in personal search, in: Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), 2018, pp. 610-618.

[11] Agarwal, I. Zaitsev, T. Joachims, Consistent position bias estimation without online interventions for learning-to-rank, 2018. arXiv:1806.03555.

[12] Agarwal, Zaitsev, Wang, Li, Najork, T. Joachims, Estimating position bias without intrusive interventions, in: WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019.

[13] Agarwal, Takatsu, I. Zaitsev, Joachims, A general framework for counterfactual learning-to-rank, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 5-14.