<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM SIGIR Workshop on eCommerce, July</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Counterfactual Learning to Rank via Knowledge Distillation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ehsan Ebrahimzadeh</string-name>
          <email>eebrahimzadeh@ebay.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Cozzi</string-name>
          <email>acozzi@ebay.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Bagherjeiran</string-name>
          <email>abagherjeiran@ebay.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>eBay Inc.</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>18</volume>
      <issue>2024</issue>
      <abstract>
        <p>Knowledge distillation is a transfer learning technique to improve the performance of a student model trained on a distilled empirical risk, formed by a label distribution defined by some teacher model, which is typically trained on the same task and belongs to a hypothesis class with richer representational capacity. In this work, we study knowledge distillation in the context of counterfactual Learning To Rank (LTR) from implicit user feedback. We consider a generic partial-information search ranking scenario, where the relevancy of the items in the logged search context is observed only in the event of an explicit user engagement. The key idea behind using knowledge distillation in this counterfactual setup is to leverage the teacher's distilled knowledge, in the form of soft predicted relevance labels, to help the student with more effective list-wise comparisons, variance reduction, and improved generalization behavior. We build empirical risk estimates that rely not only on the de-biased observed user feedback via standard Inverse Propensity Weighting, but also on the teacher's distilled knowledge via potential outcome modeling. We analyze the generalization performance of the proposed empirical risk estimators from a theoretical perspective by establishing bounds on their estimation error. We also conduct rigorous counterfactual offline evaluations as well as online controlled randomized experiments for a product search ranking task on a major e-commerce platform. The primary distinction of the proposed distillation-based perspective, in contrast to standard counterfactual inference based on potential outcome modeling, is that we leverage teachers trained on different, yet related, tasks to improve the generalization power of a student model for a ranking task. Specifically, we report strong empirical results showing that the distilled knowledge from a teacher trained on expert judgments can significantly improve the generalization performance of a student ranker. We also show how explanatory click models, trained for a click prediction task with privileged encoding of the presentation context in observational data, can explain away the effect of presentation-related confounding for the LTR model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <title>1.1. Motivation</title>
        <p>Learning To Rank (LTR) from implicit user feedback is the predominant approach in large-scale information retrieval systems, where search context information and user engagement events are constantly logged and are available at desirable scale, granularity, and recency. User engagement events on the Search Engine Result Pages (SERP) are driven not only by the underlying search intent, but also by the presentation context, governed by the slotting and organizational policies of the search engine. Specifically, in a list-wise presentation of the search results, users are more likely to engage with higher-ranked items, due to their inherent sequential browsing of the page. This signifies the so-called position bias in SERP-level user engagement events, which in turn implies that it is more likely to observe the relevance of the top-ranked items in the logged SERPs than of the items ranked in lower slots.</p>
        <p>
          The standard approach to account for the inherent presentation biases in the observational search activity data is to adopt the propensity matching-based counterfactual inference framework [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The key idea behind this perspective is to hypothesize an explanatory click model based on causal constructs that govern users' browsing behavior [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and to estimate a notion of observation propensity as the probability that the user examines the relevance of an item in a given search context. The estimated propensity scores are then used to build a de-biased empirical risk estimate using a technique called Inverse Propensity Weighting (IPW). Note that, unlike standard causal inference settings for average treatment effect estimation, the examination event, which signifies the treatment in this formulation, is not fully observable, and we need to make strong assumptions to build reliable propensity score estimates.
        </p>
        <p>
          On the other hand, predicting the counterfactual probability of click in the absence of presentation context confounding is a hard problem. While there is a rich literature on developing predictive and descriptive click models based on distributed representations of the search context, e.g. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], de-confounding click models that disentangle relevance and bias modeling [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], or causal probabilistic graphical models to explain users' browsing behavior, e.g. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], the overwhelming body of research on unbiased LTR relies on slight variations of the IPW technique based on the so-called position-based examination model [5].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>1.2. Contributions</title>
        <p>
          Building reliable causal click models from observational data is challenging, particularly in the presence of heterogeneity in search context representation and user activity data. This is because, for a predictive model to be contextually discriminative and well-calibrated, one needs complex models with rich representations, while LTR models trained on listwise objectives can achieve similar ranking performance with cheaper representations. Despite recent attempts at building causal click models with richer encoding of the search context for disentangling relevance from presentation-related confounding in observational data [
          <xref ref-type="bibr" rid="ref4">4, 6</xref>
          ], there is still a fundamental theoretical gap in incorporating an arbitrarily complex click model in a counterfactual learning to rank setting with rank-discounted Information Retrieval (IR) objectives.
        </p>
        <p>• In this work, we address this theoretical gap by adopting knowledge distillation in
a counterfactual LTR framework, where a relevance teacher trained on a relevance
prediction task is used as a potential outcome model that helps improve the generalization
behavior of the student model for the LTR task.</p>
        <p>Specifically, we study multiple empirical LTR risk estimates that are built via the distilled knowledge from a teacher model as well as the de-biased observed user feedback through IPW. To improve the generalization behavior of the vanilla IPW-based and distilled empirical risk estimates, we adopt an array of diverse approaches, including the doubly robust technique [7], to build estimators that are robust to inaccuracies of either the teacher model or the propensity model. The doubly robust estimator for LTR is also developed in the excellent concurrent work [8].</p>
        <p>• We present a theoretical analysis of the estimation error of the proposed estimators by characterizing a bias-variance trade-off, which shows that well-calibrated teachers that approximate the Bayes relevance classifier generalize better.</p>
        <p>By introducing knowledge distillation in the partial-information LTR setting, we contribute to the growing literature on understanding the underlying statistical benefits of knowledge distillation. Our approach to employing knowledge distillation in counterfactual LTR is, however, different from the standard application of the technique, in that the relevance teacher is trained on a predictive/explanatory task, while the student uses the de-biased relevance probability estimates of the teacher as soft training labels for the ranking task. Essentially, through the teacher's distilled knowledge on un-engaged items, the student can learn from more informative list-wise comparisons among all pairs of training samples in a query context, rather than contrasting only the engaged items with the un-engaged ones, as in IPW-based techniques.</p>
        <p>
          While our generalization analysis is generic, we discuss specific techniques for developing effective teacher models. In particular, we first discuss how explanatory click models that encode the page context confounding independently from the relevance covariates help explain away the confounding presentation context in the logged search session [
          <xref ref-type="bibr" rid="ref4">9, 4</xref>
          ] and propose simple de-confounding techniques to use observational data for training the relevance teacher. This is very similar in nature to the IPW-based technique discussed in [8]. Beyond counterfactual click prediction teachers, we also introduce relevance teachers trained on a different, yet highly related, task.
        </p>
        <p>• We report strong empirical results on a ranking task in a major e-commerce platform
showing that a teacher model trained on expert relevance judgments can significantly
improve the generalization power of the student model.</p>
        <p>The significance of this result is multi-fold: first, it offers a new perspective on knowledge distillation as a technique to incorporate heterogeneous sources of search utility to the user in training the ranker. Second, it offers an effective way to incorporate a broader set of queries, which may not have rich historic user engagements, in training the ranker, thus alleviating the inherent selection bias of rankers that rely only on search contexts with rich historic engagements.</p>
      </sec>
      <sec id="sec-2-3">
        <title>1.3. Summary of Prior Art</title>
        <p>
          Propensity Matching The standard prevalent technique in counterfactual LTR is Inverse Propensity Weighting (IPW) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], where the observed biased clicks are appropriately weighted in the empirical estimate based on estimated probabilities of observing the item's relevancy in a given search context. In [10], it is shown that the IPW estimator fails to account for trust bias [11], which is the gap between perceived relevance and true relevance that is often modeled as a position-dependent confounding, and new estimators based on affine corrections are proposed, which have been adopted in multiple subsequent works [12, 13].
        </p>
        <p>
          Explanatory Click Models The simplest, yet most popular, explanatory model is the examination-based click model [14, 5], defined by operationalizing the causal construct of observation as a latent examination random variable. The examination-based models posit that an item is clicked if it is examined by the user and, in addition, is relevant to the user's intent. It is often further assumed that the examination event depends on the search context only through the position of the item on the SERP. Examination-based models have been enriched in various ways to incorporate more granular presentation biases given the search context. Most notably, [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] proposes a Dynamic Bayesian Network (DBN) model where new hidden variables are introduced to explain clicks and search abandonment events. More recently, two-tower models for click prediction [15] with independent relevance and observation towers have gained popularity. The relevance-observation factorization assumption is revisited in [6], and deconfounding methods are proposed in [
          <xref ref-type="bibr" rid="ref4">4, 16</xref>
          ] to disentangle relevance from observation.
        </p>
        <p>
          Propensity Estimation Standard click propensity models are often based only on the position bias [14], and there is a variety of techniques to estimate the parameters of such models, based either on online interventions, through randomization and interleaving [17], or on likelihood optimization from logged data [17, 18]. Intent-aware examination models are proposed in [
          <xref ref-type="bibr" rid="ref5">19, 20, 18</xref>
          ], where different click propensities are learned for different intent classes. A dual learning algorithm is proposed in [
          <xref ref-type="bibr" rid="ref6">21</xref>
          ] to solve the problems of unbiased learning to rank and unbiased propensity estimation simultaneously.
        </p>
        <p>
          Potential Outcome Modeling The idea of using potential outcome models for counterfactual inference is popular in the context of contextual bandits and recommendation systems, where predictive models for the reward are developed instead of, or in conjunction with, an inverse propensity weighting scheme to build empirical reward estimates [
          <xref ref-type="bibr" rid="ref7 ref8 ref9">22, 23, 24</xref>
          ]. There are a number of recent works in the context of unbiased response prediction that leverage and analyze the doubly robust technique [
          <xref ref-type="bibr" rid="ref10 ref11">25, 12, 26, 8</xref>
          ].
        </p>
        <p>
          Counterfactual Evaluation Click models are widely used for counterfactual evaluation of IR metrics, either as data-driven rank-discount functions in utility-based metrics or as estimates of the user satisfaction gain in effort-based metrics [
          <xref ref-type="bibr" rid="ref12">27</xref>
          ]. The bias of the vanilla Average-Over-All evaluation metrics is studied in [
          <xref ref-type="bibr" rid="ref13">28</xref>
          ], and a propensity-based correction is proposed for debiasing. Using predicted relevance probabilities as training labels has in fact been considered in [
          <xref ref-type="bibr" rid="ref2">2, 19</xref>
          ], but only in settings where the adopted measure of ranking loss depends only on the relative order of the labels rather than on their actual values.
        </p>
        <p>
          Knowledge Distillation There has been a growing interest in understanding the statistical underpinnings of knowledge distillation to explain its empirical success [
          <xref ref-type="bibr" rid="ref14 ref15 ref16">29, 30, 31</xref>
          ]. Most notably, the closest work to our study is [
          <xref ref-type="bibr" rid="ref14">29</xref>
          ], where distillation is studied from a statistical learning theory perspective by establishing a bias-variance trade-off for the student based on the quality of the teacher in estimating the true Bayes class probabilities in multi-label classification and retrieval. More recently, [
          <xref ref-type="bibr" rid="ref17">32</xref>
          ] extends knowledge distillation to the top-K ranking problem, where ranking at the top is preserved by matching the order of items between student and teacher, while penalizing items ranked low by the teacher. In [
          <xref ref-type="bibr" rid="ref18">33</xref>
          ], multiple distillation techniques are proposed to improve the generalization power of the trained recommender model using data from a random logging policy. A similar uniform data augmentation technique is used in [
          <xref ref-type="bibr" rid="ref19">34</xref>
          ] to guard against feature distribution shifts. A recent work that offers a similar perspective on knowledge distillation in LTR is [
          <xref ref-type="bibr" rid="ref20">35</xref>
          ], where the teacher has access to some privileged features [36] that the student does not have access to. The other closely related works are the ranking distillation settings in [37, 38], where the teacher model is trained on the same ranking task. The main distinction between this study and prior adoptions of knowledge distillation for ranking tasks is that we employ teacher models that are trained on a predictive task and are meant to provide predicted relevance probability estimates to the student.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>1.4. Notation</title>
        <p>Here is a list of the notation adopted throughout the paper. Sets and ordered sets, also referred to as lists, are represented with upper-case calligraphic symbols, such as 𝒮. Random quantities are shown in bold, such as 𝐝 with realization d. The expected value of a random variable 𝐜 is denoted by 𝔼[𝐜], and the conditional expectation of a random variable 𝐜 = f(𝐫, 𝐨) given 𝐫 is denoted by 𝔼[𝐜|𝐫]. For a measurable event ℰ, we denote by 𝐜|ℰ a random variable defined in the measurable space equipped with the regular conditional probability measure ℙ(·|ℰ) with respect to the conditioning sigma algebra σ(ℰ). For a function f ∶ 𝒜 → ℝ, the |𝒜|-dimensional array [f(a)]_{a∈𝒜} is denoted by f(𝒜).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Background and Problem Setup</title>
      <sec id="sec-3-1">
        <title>2.1. Empirical Risk Minimization for LTR</title>
        <p>We consider a generic supervised setup for learning to rank from implicit user feedback in logged search activity data. A logged search s ∈ 𝒮 is characterized by the query q and the ordered collection of items d ∈ 𝒟_s slotted by the search engine's ranking policy at logging time, as well as the click events attributed to the items on the SERP, denoted by c(𝒟_s) ∈ {0, 1}^{|𝒟_s|}. It is standard to assume a preference model with a context-item desirability distribution ℙ_{𝐫|𝐬}(r|s) that produces oracle ground-truth relevance labels r_s. We adopt a simple explanatory model of the sequential browsing and examination behavior of the user that explains a click event on an item d in the search context s, 𝐜_d = 1, based on whether the item on the SERP is examined by the user, 𝐨_d = 1, and whether it matches the user's underlying search intent, 𝐫_d = 1; asserting
ℙ(𝐜_d = 1|s) = ℙ(𝐨_d = 1|s) ℙ(𝐫_d = 1|s). (1)</p>
        <p>The conditional probability ℙ(𝐨_d = 1|s) is referred to as the propensity of observing the relevancy of the item d in the search context s. We further simplify the adopted examination model by assuming that the click propensities depend on the presentation context only through the position of the slotted item on the SERP; that is,
ℙ(𝐨_d = 1|s) = ℙ(𝐨_d = 1|π₀(d)), (2)
where π₀(d) is the rank attributed to item d by the ranking policy π₀(⋅) served in the search context s. There is a large body of literature on estimating position-based user click propensities, either through online interventions or through offline estimation techniques based on latent variable models from logged search data. The position-based observation probabilities ℙ(𝐨 = 1|rank) define a rank-discount function ℓ(rank) that explains the effect of the presentation context on the user engagement primarily through the position of the slotted items, which one can approximate by fitting a uni-variate model on the estimated propensities as a function of the ranking slot.</p>
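<p>As a concrete illustration, the last step above, fitting a uni-variate rank-discount curve to estimated position propensities, can be sketched as follows; the power-law form, the synthetic propensity values, and all names are illustrative assumptions rather than the paper's choices:</p>

```python
import numpy as np

# Synthetic estimates of the position propensities P(o = 1 | rank), e.g. obtained
# from an online intervention study; for illustration only.
ranks = np.arange(1, 11)
propensities = 1.0 / ranks**0.9 + np.random.default_rng(0).normal(0, 0.01, 10)

# Fit a power-law rank discount l(k) = k^(-gamma) by least squares in log-log space.
gamma = -np.polyfit(np.log(ranks), np.log(np.clip(propensities, 1e-6, None)), 1)[0]

def rank_discount(k):
    """Uni-variate rank-discount function fitted to the estimated propensities."""
    return np.asarray(k, dtype=float) ** (-gamma)

print(round(gamma, 2))  # close to the generating exponent 0.9
```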
        <p>Given the adopted user behavior model and the logged search activity data, our objective is to train a ranking policy by minimizing the statistical risk corresponding to a search efficiency metric, which approximates the expected number of user engagement events. Specifically, we define the statistical risk for a deterministic ranking policy π(⋅), corresponding to a scoring function f(⋅; θ) that produces a score for each individual document d ∈ 𝒟_s given the search context s, as
R(π) = 𝔼[ℒ(π(𝒟_s), r(𝒟_s))], (3)
where ℒ(π(𝒟_s), r(𝒟_s)) = −∑_{d∈𝒟_s} ℓ(π(d)) r(d) is defined based on an estimate of the expected number of clicks in the search context s, via a sequential browsing examination model, in the form of a Discounted Cumulative Gain (DCG) of the ranked list π(𝒟_s) produced by the policy with respect to the oracle labels r(𝒟_s) and the rank-discount function ℓ(⋅). We note that DCG-type information retrieval metrics are often defined as measures of gain, while, with a simple sign tweak, we consider them as measures of loss for risk minimization. While our discussion of the search efficiency-based loss function is primarily from the perspective of a likelihood model on the data, we note that our framework generalizes to any standard ranking loss function that is linear as a function of the relevance labels.</p>
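<p>The DCG-type loss above, linear in the relevance labels, can be sketched as follows; the logarithmic discount is one common choice of ℓ(⋅), and the function names are ours:</p>

```python
import numpy as np

def dcg_loss(scores, labels):
    """Negative discounted sum of relevance over the ranking induced by `scores`:
    L(pi(D_s), r(D_s)) = -sum_d l(rank_pi(d)) * r(d)."""
    order = np.argsort(-scores)                 # deterministic policy: sort by score
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    discount = 1.0 / np.log2(ranks + 1)         # rank-discount l(rank)
    return -float(np.sum(discount * labels))

scores = np.array([2.0, 0.5, 1.0])
labels = np.array([1.0, 0.0, 1.0])
# The relevant items land at ranks 1 and 2, so the loss is -(1 + 1/log2(3)).
print(dcg_loss(scores, labels))
```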
        <p>Since the underlying joint distribution of the search contexts and relevance labels is not known to the learner, the standard approach is to build an empirical risk estimate
R̂(π) = (1/|𝒮|) ∑_{s∈𝒮} ℒ(π(𝒟_s), r̂(𝒟_s)), (4)
based on a sampled set 𝒮 of search contexts, with suitably defined empirical labels r̂(𝒟_s) that approximate the ground-truth label distribution ℙ_{𝐫|𝐬}(r|s). In order to characterize key statistical properties of an empirical risk estimate, we establish bounds on the moments of the estimation error, which is the divergence of the empirical risk estimate from the Bayes statistical risk, R̂(π) − R(π), based on the divergence of the empirical relevance labels from the underlying relevance probabilities.</p>
        <p>Lemma 2.1. For any ranking policy π ∶ 𝒟 → [K] and non-negative integer n,
𝔼[(R̂(π) − R(π))ⁿ] ≤ C₁ⁿ 𝔼[‖𝔼[(r̂(𝒟_s) − r(𝒟_s)) | s]‖₂ⁿ],
where C₁ = ‖[ℓ(k)]_{k=1}^{K}‖₂.</p>
        <p>The proof follows by invoking the towering rule and a straightforward application of the Cauchy-Schwarz inequality. Generalization bounds for empirical risk estimates can be obtained by invoking concentration techniques such as Bernstein's inequality, which characterize the generalization behavior of a policy based on the bias and variance of the empirical risk estimator.</p>
        <p>There is a remarkable body of work around developing efficient algorithms for Empirical Risk Minimization (ERM) on DCG-based loss functions over a variety of different hypothesis classes. In this study, we are oblivious to the choice of the hypothesis class and the particular optimization techniques adopted for training, and focus on the generalization power of the estimators as it relates to the estimation error, oblivious to approximation and optimization errors.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Counterfactual LTR: Debiasing via Inverse Propensity Weighting</title>
        <p>Given the examination model (1) on the user browsing behavior, we assume that we have an oracle algorithm to estimate the propensity to engage with item d in the search context s. Once the click propensities ℙ̂(𝐨_d = 1|s) are estimated, an empirical risk estimate can be built using de-biased labels formed via the Inverse Propensity Weighting (IPW) technique, by simply setting
r̂_IPW(d) = c_d / ℙ̂(𝐨_d = 1|s). (5)
We note that the underlying assumptions on the user behavior data for the propensity estimation task can in general be different from and richer than the position-based model (2) adopted to define the ranking efficiency task. While our discussion can be applied to arbitrarily complex propensity estimation techniques, to simplify the presentation we assume that the propensities are estimated based on a vanilla position-bias model; that is, ℙ̂(𝐨_d = 1|s) = p̂_{π₀(d)}, where π₀(d) is the rank of the item in the logged data. Owing to the linearity of the loss function ℒ as a function of the relevance labels, it is straightforward to show that the IPW empirical risk estimate, denoted as
R̂_IPW(π) = (1/|𝒮|) ∑_{s∈𝒮} ℒ(π(𝒟_s), r̂_IPW(𝒟_s)), (6)
is an unbiased estimator of the Bayes statistical risk given the examination model, provided that the estimated propensities p̂ are unbiased. The IPW estimator is known to have high variance, especially when the estimated propensity values are small, and there is a significant body of literature around clipping and normalization techniques to reduce the variance of this estimator. In this work, to improve the generalization of the LTR estimators, we introduce a number of techniques based on potential outcome modeling and characterize their bias-variance trade-offs.</p>
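<p>A minimal sketch of the IPW labels in (5), with propensity clipping as one of the variance-control techniques mentioned above; the clipping threshold and names are our choices, not the paper's:</p>

```python
import numpy as np

def ipw_labels(clicks, est_propensities, clip=0.05):
    """De-biased labels r_IPW(d) = c_d / p_hat_d, with clipped propensities."""
    return clicks / np.maximum(est_propensities, clip)

# One toy logged SERP: clicks observed under position-dependent propensities.
clicks = np.array([1.0, 0.0, 1.0])
p_hat = np.array([1.0, 0.5, 0.25])   # estimated P(o = 1 | rank)
print(ipw_labels(clicks, p_hat))     # small propensities inflate the labels
```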
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Knowledge Distillation</title>
        <p>Knowledge distillation is a transfer learning technique where a student model is trained on a distilled empirical risk built from a label distribution provided by a teacher model. Given a teacher model t ∶ 𝒮 × 𝒟 → ℝ that produces an estimated score for the relevance probability of each individual document d ∈ 𝒟_s given the search context s, a generic distilled empirical estimate for the risk of the student can be defined as
R̂_KD(π) = (1/|𝒮|) ∑_{s∈𝒮} ℒ(π(𝒟_s), t(𝒟_s; s)). (7)
As shown in the next section, distilled empirical risk estimates have desirable variance-reduction properties compared to the vanilla empirical risk estimates based on observed feedback and inverse propensity weighting. A straightforward, yet fundamental, observation about the variance-reduction properties of the distilled risk estimator is the following.</p>
        <p>Remark 2.2. The variance of the distilled risk estimator built from the true Bayes relevance probabilities is no greater than the variance of the vanilla empirical risk estimate formed by realizations from the relevance distribution.</p>
        <p>
          The proof is trivial and can be found in [
          <xref ref-type="bibr" rid="ref14 ref8">29, 23</xref>
          ]. The improved variance in this estimator
often comes at the cost of some bias, which leads us to explore hybrid estimators that take
advantage of the observed user feedback as well as the teacher’s distilled knowledge.
        </p>
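<p>Remark 2.2 can be checked with a quick simulation: averaging the true Bayes relevance probabilities yields no more variance than averaging Bernoulli realizations drawn from them. The toy probabilities and discount weights below are our own illustrative choices:</p>

```python
import numpy as np

rng = np.random.default_rng(7)
bayes_probs = np.array([0.9, 0.6, 0.3, 0.1])   # true P(r_d = 1 | s) per item
weights = 1.0 / np.log2(np.arange(2, 6))       # fixed rank discounts l(rank)

# Vanilla per-context estimate: discounted sum of sampled binary relevance labels.
samples = rng.binomial(1, bayes_probs, size=(100_000, 4))
vanilla = samples @ weights

# Distilled estimate with a perfect teacher: discounted sum of the probabilities
# themselves -- deterministic given the context, hence zero variance here.
distilled = np.full(100_000, bayes_probs @ weights)

print(vanilla.var() > distilled.var())   # distillation removes the label noise
print(abs(vanilla.mean() - distilled.mean()) < 0.01)  # both target the same risk
```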
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Debiasing Empirical Risk with Relevance Distillation</title>
      <p>In this section, building upon the ideas discussed in the previous section, we propose empirical risk estimates that leverage the observed user feedback as well as the distilled knowledge from a suitably trained teacher. The core idea is to use the observed user feedback to improve the bias and, at the same time, use the soft labels provided by the teacher to improve the variance of the estimator. We first present our proposed distilled empirical risk estimators, oblivious to the choice of the teacher model, and characterize their generalization behavior by developing bounds on their bias and variance; we then discuss a number of techniques to develop effective teacher models.</p>
      <sec id="sec-4-1">
        <title>3.1. Distilled Empirical Risk Estimates</title>
        <sec id="sec-4-1-1">
          <title>3.1.1. Hybrid Distilled Risk Estimators</title>
          <p>The main idea behind hybrid distilled risk estimators is to define a simple trade-off for the bias/variance behavior of the estimator via a convex combination of the IPW and distilled empirical risk estimates. Formally, for a teacher t ∶ 𝒮 × 𝒟 → ℝ, some estimate p̂ of the click propensities, and any 0 ≤ α ≤ 1, a hybrid distilled estimator of the risk, R̂_HD(π), can be obtained by setting
r̂_HD(d) = α r̂_IPW(d) + (1 − α) t(d; s) (8)
in (4). Next, we analyze the bias/variance of this hybrid estimator. Note that the bias/variance analysis of the IPW and vanilla distilled empirical risk estimates can be derived as corollaries by setting α to 1 and 0, respectively. For a given search context s and a document d ∈ 𝒟_s at rank π₀(d), let r*_d denote the actual relevance probability ℙ(𝐫_d = 1|s) and p*_d denote the actual propensity score ℙ(𝐨_d = 1|s). Let us define the deviations of the estimated relevance probabilities and the estimated propensity scores from the actual quantities as Δt_d = t(d; s) − r*_d and Δp_d = p*_d/p̂_d − 1, respectively.</p>
          <p>Theorem 3.1. For any student policy π,
|𝔼[R̂_HD(π) − R(π)]| ≤ C₁ 𝔼[‖[α Δp_d r*_d + (1 − α) Δt_d]_{d∈𝒟_s}‖₂],
where C₁ is the same constant as in Lemma 2.1.</p>
          <p>The proof is straightforward: it follows by characterizing the bias of the empirical labels and invoking Lemma 2.1. Please refer to the Appendix for a complete proof. Theorem 3.1 provides insight into how to control the bias of the estimator by adjusting the parameter α, trading off the bias terms due to the IPW and distillation-based components. The variance analysis in the Appendix shows that the variance due to the IPW component is sensitive to small propensity score values and to propensity estimation errors. In this case, the variance reduction that we achieve by including the distillation-based component helps with better generalization of the hybrid estimator.</p>
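<p>A sketch of the hybrid label, a convex combination of the IPW label and the teacher's soft prediction, where α trades the IPW bias term against the distillation bias term as in Theorem 3.1; all names and numbers are illustrative:</p>

```python
import numpy as np

def hybrid_labels(clicks, p_hat, teacher_probs, alpha):
    """r_HD(d) = alpha * c_d / p_hat_d + (1 - alpha) * t(d; s), 0 <= alpha <= 1."""
    return alpha * clicks / p_hat + (1.0 - alpha) * teacher_probs

clicks = np.array([1.0, 0.0, 0.0])
p_hat = np.array([0.8, 0.4, 0.2])    # estimated position propensities
teacher = np.array([0.7, 0.5, 0.4])  # teacher's predicted relevance probabilities

# alpha = 1 recovers the pure IPW labels, alpha = 0 the pure distilled labels.
print(hybrid_labels(clicks, p_hat, teacher, 0.5))
```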
        </sec>
        <sec id="sec-4-1-2">
          <title>3.1.2. Doubly Robust Risk Estimators</title>
          <p>
            Although hybrid empirical risk estimates provide a simple and flexible way to control the bias-variance trade-off, they can incur significant estimation error if either of the risk components is inaccurate. To alleviate this, we adopt the doubly robust technique [
            <xref ref-type="bibr" rid="ref8">23</xref>
            ] to make the empirical risk estimator robust whenever either of the risk components is accurate. Formally, for a suitable operationalization assumption for the observation event 𝐨_d, a teacher t ∶ 𝒮 × 𝒟 → ℝ, and some estimate p̂ of the click propensities, a doubly robust estimator of the risk, R̂_DR(π), can be obtained by setting
r̂_DR(d) = t(d; s) + (o_d/p̂_d)(c_d − t(d; s))
in (4). The next theorem shows the double robustness property of the doubly robust estimator.
Theorem 3.2. For any student estimator π,
|𝔼[R̂_DR(π) − R(π)]| ≤ C₁ 𝔼[‖[Δp_d Δt_d]_{d∈𝒟_s}‖₂],
where C₁ is the same constant as in Lemma 2.1.
          </p>
          <p>The proof follows using techniques similar to those in the proof of Theorem 3.1 and is included in the Supplemental Materials section, along with a variance analysis. Theorem 3.2 shows the desirable double robustness of this estimator: the bias of the individual terms is small if we have good estimates of either the actual click propensities or the true relevance probabilities by the teacher.</p>
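<p>The double robustness property can be checked numerically with a label of the form r_DR(d) = t(d; s) + (o_d/p̂_d)(c_d − t(d; s)): its mean recovers the true relevance probability whenever either the propensity estimate or the teacher is exact. The simulation setup below is our own illustrative assumption:</p>

```python
import numpy as np

def dr_label(o, c, p_hat, t):
    """Doubly robust label: teacher prediction plus an IPW-corrected residual."""
    return t + (o / p_hat) * (c - t)

rng = np.random.default_rng(1)
n = 200_000
p_true, r_true = 0.4, 0.7                 # true propensity and relevance probability
o = rng.binomial(1, p_true, n)            # examination events
c = o * rng.binomial(1, r_true, n)        # click = examined AND relevant

exact_p = dr_label(o, c, p_hat=p_true, t=0.2).mean()  # biased teacher, exact propensity
exact_t = dr_label(o, c, p_hat=0.9, t=r_true).mean()  # biased propensity, exact teacher
print(round(exact_p, 2), round(exact_t, 2))  # both close to r_true = 0.7
```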
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Teacher Models</title>
        <p>
          Our discussion of the distilled risk estimators focused on their theoretical generalization
properties in terms of the estimation error, oblivious to the choice of the teacher model. The key
benefit of using knowledge distillation in counterfactual LTR is the improved generalization
properties of the student, achieved through label smoothing via the teacher’s distilled knowledge.
While the student often has to satisfy stringent inference-time constraints, the teacher model can
be arbitrarily complex, not only in terms of the representational capacity of its hypothesis class,
but also through leveraging features/representations that the student does not have access to at
inference time. Such features are referred to as privileged features in [
          <xref ref-type="bibr" rid="ref20">35</xref>
          ]. In this section,
we discuss generic techniques to develop effective teacher models, which we will exemplify in
the empirical results section.
        </p>
        <sec id="sec-4-2-1">
          <title>3.2.1. De-confounding teacher</title>
          <p>In order to train a relevance teacher using logged search data, we adopt the counterfactual
inference framework based on potential outcome modeling. The fundamental inference
challenge in this setting is that we are interested in the counterfactual relevance outcome had the item
been observed, denoted by r(o = 1), while we only observe relevancy under explicit user
engagement events. In order to estimate expectations of counterfactual quantities using
observational data, we invoke regression adjustment techniques from the potential outcome modeling
framework [39]. The core component of this technique is to control for the presentation-based
confounders, e.g. the rank of the items, which affect both the observed outcome and the
observation event.</p>
          <p>Let xd,s be the (causal) covariates used to predict the relevancy of d in the search context s, and let zd,s
be the presentation-based confounders. Controlling for confounders and making inference from
observational data amounts to justifying a few technical conditions. Specifically, we have to
ensure (a) ignorability, that there is no unmeasured confounder left out of the covariates xd,s,
as well as (b) overlap, that for any feasible value of the covariates, the probability of observing the
relevancy of an item given the covariates, i.e., the propensity score, is bounded away from zero,
ℙ(o = 1|xd,s, zd,s) &gt; 0. Moreover, we have to make sure that the definition of the observation
event is (c) stable, in the sense that the potential relevance outcome of an item does not depend
on whether the other items are observed.</p>
          <p>
            We can then compute the expectation of the counterfactual relevance outcome using
iterated expectations on observational data [39] via
𝔼[rd(od = 1)] = 𝔼[𝔼[cd|od = 1, xd,s, zd,s]],
(10)
where the observational expectations on the right-hand side can be estimated via a machine-learned
classifier trained on a dataset of items with explicit engagement events using a
cross-entropy loss[
            <xref ref-type="bibr" rid="ref3 ref4">4, 9, 3</xref>
            ]. Assuming that the presentation bias is primarily signified by the rank
of the items on the page, a vanilla instantiation of this approach is adopted in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], where
the relevance probability is estimated as the observational click probability at rank 1; that is,
ℙ̂(r = 1|xd,s) = ℙ̂(c = 1|xd,s, rank = 1). Similarly, in the GBDT-based approach in [17], the model is
trained with an interaction depth of 1 to avoid interactions between the rank and the relevance
features, and at inference time all the trees that use the rank feature are removed. In the
two-tower approach in [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], a counterfactual model is developed via independent relevance and
observation towers and only the relevance tower is used for inference. A similar approach[8],
which we adopt in our empirical experiments, follows this perspective by training a model on a
cross-entropy loss between the IPW-debiased empirical (Bernoulli) distribution of clicks and
the relevance distribution defined by the predictor; that is,
(1/|𝒮|) ∑s∈𝒮 ∑d∈s CE(μ(xd; s) ‖ ĉIPW(d)).
(11)
          </p>
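<p>A minimal sketch of this point-wise teacher objective (names are illustrative, NumPy only): the binary cross-entropy between the IPW-debiased empirical click labels, which may exceed one due to the propensity weighting, and the teacher's predicted relevance probabilities:

```python
import numpy as np

def teacher_loss(mu_probs, ipw_labels):
    """Point-wise cross-entropy between the IPW-debiased empirical click
    labels and the relevance distribution defined by the teacher."""
    eps = 1e-12
    mu = np.clip(mu_probs, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(ipw_labels * np.log(mu)
                    + (1.0 - ipw_labels) * np.log(1.0 - mu))

# an uninformed teacher (0.5) on one clicked and one unclicked item
loss = teacher_loss(np.array([0.5, 0.5]), np.array([1.0, 0.0]))
```
</p>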
        </sec>
        <sec id="sec-4-2-2">
          <title>3.2.2. Relevance Teacher Trained on Expert Judgements</title>
          <p>In contrast to the relevance teachers trained on the observed biased feedback, discussed in
the previous sub-section, we can train teachers based on relevance labels provided by expert
judgements. The fundamental premise of such relevance teachers is to approximate the marginal
query-based relevance probabilities via expert-annotated labels that are assumed to capture
the canonical intent of a given query. Knowledge distillation is crucial in this case, because the
student does not have access to such judgement-based labels in its own training data, and a
relevance teacher trained on these labels is not strong enough as a standalone ranker, because
it cannot take advantage of the more granular preferences in user engagement data.</p>
          <p>Another important advantage of using relevance teachers trained on expert judgments is that
we can use a broad range of pooling techniques, including active learning, to collect judgements
on queries for which we do not have rich user engagements, alleviating the selection bias
inherent to all the methods discussed so far, which rely strongly upon implicit user feedback.
In the next section, we provide strong empirical results by adopting such relevance teachers.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Empirical Results and Discussions</title>
      <p>We evaluate the performance of the proposed empirical risk estimator by adopting a standard
supervised counterfactual training and evaluation framework on real-world user behavior data
collected from online trafic of a major E-commerce platform. We also report the results from an
online randomized control experiment on a model trained on a distilled empirical risk estimate to
showcase the generalization power of the proposed techniques from a rigorous causal evaluation
perspective. For completeness, we also conduct offline counterfactual experiments on standard
academic datasets generated via synthetic user behavior data generation on human-judged
web-search data, following the simulation setup and code from [8].</p>
      <p>Since the focus of the paper is on the incremental value from distilled empirical risk
minimization techniques to control the Estimation error in LTR, we are oblivious to the experiment
design choices related to Approximation error, Optimization error, and parameter estimation
techniques for the observation model. As such, we fix the hypothesis class, model hyper-parameters,
training optimization techniques, and estimated/simulated propensity scores across
all estimators, and limit the set of baseline models accordingly. For the synthetic datasets, we
describe the details of the experiment setup only briefly and refer the reader to [8] for extended
discussions and alternative data synthesis choices and propensity estimation techniques.
For the experiments on the proprietary e-commerce setting, we only report lifts compared to a
standard simple baseline, with a focus on the choices relevant to the estimation error, oblivious
to model training and propensity estimation techniques.</p>
      <sec id="sec-5-1">
        <title>4.1. Experiment Setup for Synthetic Datasets</title>
        <p>
          Dataset Semantics: We use publicly available datasets with manual relevance labels, Yahoo
Webscope [40], MSLR-WEB30k [41], and Istella-S LETOR [42], and synthesize user engagement
labels with standard logging policy and user behavior generation semantics, following [
          <xref ref-type="bibr" rid="ref1">8, 1</xref>
          ].
User behavior model: We use a vanilla position-bias based examination model with the
observation probability ℙ(o = 1|rank) = (1 + (rank − 1)/5)−2, ignoring the trust bias. We
also use the same clipping function used in [8] to control the variance of the simulated click
propensities. Moreover, we assume that the parameters of the propensity model are known and
need not be estimated. Since user behavior modeling and propensity estimation are not the
focus of this paper, we avoid unnecessary comparisons with more complex techniques in the
literature.
        </p>
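<p>The simulated observation model above is easy to reproduce; the sketch below (our own function name, with an illustrative clipping floor standing in for the clipping used in [8]) computes the examination propensities used to synthesize clicks:

```python
def observation_propensity(rank, clip_floor=0.01):
    """Position-bias examination probability P(o=1|rank) = (1+(rank-1)/5)^-2,
    clipped from below to control the variance of the inverse weights."""
    p = (1.0 + (rank - 1) / 5.0) ** (-2.0)
    return max(p, clip_floor)
```

Rank 1 is always examined; deeper ranks decay polynomially until the clipping floor takes over.</p>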
        <p>Estimators: The logging policy is trained based on a standard supervised training on 1% of
training data using an empirical label distribution based on the available relevance judgments.
The logging policy is assumed to be deterministic. The Naive estimator is trained on the biased
empirical click distribution without any corrections, and the IPW estimator is trained via (6)
using the actual propensity scores used for data synthesis. For all distilled risk estimators,
we use the same pre-trained deconfounding teacher model, μ1, trained on the
IPW-debiased empirical distribution via the point-wise cross-entropy loss discussed in (11). The
KDμ1 estimator is trained via (7), and the hybrid estimator HDμ1,α=1/2 and the doubly robust estimator are
trained using the same teacher model μ1 and the actual synthesized propensity scores, via the label
distributions (8) and (9) respectively.</p>
        <p>Training: Following [8], for the hypothesis class, we use MLPs with two 32-unit hidden layers
and adopt the standard LambdaLoss optimization framework[43] to train the model on the
proposed empirical risk estimates, with 10⁶ training samples randomly sampled from the
synthesized data.</p>
        <p>Evaluation Metrics: We use the Normalized Expected Number of Clicks as the main evaluation
metric, which is aligned with the training objective. Specifically, for a given search context s
with candidate set Ds, the expected number of clicks for a policy π is
∑d∈Ds ℓ(πs(d)) r(d).</p>
        <p>By adopting the synthesized observation probabilities ℙ(o = 1|rank) from our user behavior
model as the discount function ℓ(⋅) and IPW-debiased clicks as the relevance labels r(d), and
normalizing the contribution of each individual context by the maximum attainable value, we get our
primary metric, NCTR. Moreover, by adopting the original judgment-based relevance labels
as the relevance labels r(d) and the data-agnostic standard log-rank discount for ℓ(⋅), we calculate
NDCG.</p>
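<p>The normalized-metric computation can be summarized in a short sketch (names illustrative); the same helper yields an NDCG-style score when the discount and labels are swapped accordingly. A generic reciprocal-rank discount is used here purely for the example:

```python
def expected_clicks(ranking, labels, discount):
    """Expected clicks of a ranking: discount(rank) times the relevance
    label of the item placed at that rank."""
    return sum(discount(rank) * labels[item]
               for rank, item in enumerate(ranking, start=1))

def nctr(ranking, labels, discount):
    """Normalize by the best attainable value (items sorted by label)."""
    ideal = sorted(range(len(labels)), key=lambda i: -labels[i])
    best = expected_clicks(ideal, labels, discount)
    return expected_clicks(ranking, labels, discount) / best if best > 0 else 0.0

# reversed ranking of two items under a reciprocal-rank discount
score = nctr([1, 0], [1.0, 0.5], lambda r: 1.0 / r)
```
</p>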
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Experiment Setup for Real World User Data</title>
        <p>Dataset Semantics: We collect a dataset of around 1M queries from a major e-commerce
platform in 2023 with a single click event attributed to the logged SERP, oblivious to the logging
policy and the post-click engagement events attributed to the clicked items. For training
efficiency, we sample three negative samples at random from impressed items within each
training SERP. For the test data, we use a similar query sampling strategy, but keep all the
candidate items to be re-ranked by the candidate ranker.</p>
        <p>Estimators: The baseline model for all the reported lifts is trained on the Naive estimator
corresponding to the observed clicks, without any debiasing. For the IPW estimator, and all other
estimators that rely on propensity correction, we use propensities estimated via the standard
regression-based Expectation Maximization technique on a vanilla examination-based position
bias model [17]. The teacher models are trained on a predictive task with a cross-entropy loss
over datasets with different distributions. The deconfounding teacher, μ1, is trained on the
IPW-debiased empirical distribution of labels for (query, item) pairs with the same representation
as the ranking task, sampled with more stringent conditions to ensure that the relevancy of both
the positive and the negative items is observed, satisfying the conditions explained in section 3.2.1
for training counterfactual models using observational data. Specifically, for positive examples,
we require that there be an engagement event with shopping intent associated with the
item, and for negative items we require that they appear above the last engaged item in the
SERP. The relevance teacher, μ2, is trained on (query, item) pairs
with binarized expert relevance annotations, which satisfy stringent calibration properties. We are
oblivious to the data pooling and training techniques used to train this teacher. The
hyper-parameter α of the hybrid estimators is optimized via grid search on the NCTR objective.
Evaluations: For offline experiments, we use NCTR with respect to IPW-debiased clicks as
relevance labels and a simple rank fit on the estimated propensities. We also use NDCG with
respect to naive clicks as a vanilla evaluation metric. For online experiments, we report ranking
efficiency metrics in terms of the concentration of engagements in top slots (DCG), as well as the
cumulative reward metric with respect to the share of search result pages with an engagement
event (CTR), from a randomized controlled experiment on the online search traffic of a major
E-commerce platform.</p>
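<p>As background, the examination-model EM referenced above can be sketched as follows. This is a simplified toy version with our own names, not the production regression-based EM of [17]: it models a click as examination times relevance and alternates between posterior inference and per-rank/per-item averaging:

```python
import numpy as np

def position_bias_em(clicks, ranks, items, n_ranks, n_items, iters=50):
    """Toy EM for the examination model c = o * r, with P(o=1|rank)=theta[rank]
    and P(r=1|item)=gamma[item]; a clicked impression implies o = r = 1."""
    theta = np.full(n_ranks, 0.5)
    gamma = np.full(n_items, 0.5)
    for _ in range(iters):
        t, g = theta[ranks], gamma[items]
        denom = 1.0 - t * g  # P(c = 0) under the current parameters
        # E-step: posteriors of examination and relevance given the click
        p_exam = np.where(clicks == 1, 1.0, t * (1.0 - g) / denom)
        p_rel = np.where(clicks == 1, 1.0, (1.0 - t) * g / denom)
        # M-step: average the posteriors per rank and per item
        theta = np.array([p_exam[ranks == k].mean() for k in range(n_ranks)])
        gamma = np.array([p_rel[items == i].mean() for i in range(n_items)])
    return theta, gamma

# synthetic check: clicks drawn from known position bias and relevance
rng = np.random.default_rng(0)
n = 20000
ranks = rng.integers(0, 2, n)
items = rng.integers(0, 5, n)
theta_true = np.array([0.9, 0.3])
gamma_true = np.linspace(0.2, 0.8, 5)
clicks = (rng.random(n) < theta_true[ranks] * gamma_true[items]).astype(int)
theta, gamma = position_bias_em(clicks, ranks, items, n_ranks=2, n_items=5)
```

On this synthetic data, the recovered theta preserves the ordering of the true examination probabilities across ranks, which is all that propensity weighting requires up to scale.</p>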
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Empirical Research Questions</title>
        <sec id="sec-5-3-1">
          <title>Results and Discussions on Synthetic Data</title>
          <p>Table 1 provides an overview of the performance of the estimators of interest, R̂(⋅, D), for the evaluation metrics specified as meta-columns,
over the evaluation datasets specified as columns. We observe, across all datasets, that
the IPW estimator outperforms the Naive estimator and that all distilled estimators outperform IPW.
An extended discussion on the performance of the IPW and DRμ1 estimators, with an
interesting variance analysis, can be found in [8].</p>
          <p>The most interesting observation is that the pointwise counterfactual click prediction teacher
 1 exhibits a strong ranking performance when it is used as the only target component, which
is aligned with the results from [8] where it is used as a standalone ranker. This observation
signifies that there is no meaningful heterogeneity in training data distribution across search
contexts in these public datasets and strong ranking performance can be obtained without
having to resort to listwise modeling. There is barely any performance gain in using the doubly
robust estimator, and we therefore omitted parameter tuning for the hybrid estimator.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>Counterfactual Evaluations on Product Search Data</title>
          <p>Table 2 provides the relative performance lift of the candidate estimators against a Naive Click model baseline. In stark contrast
to the results on synthetic datasets, we observe that the point-wise teachers are not very
effective for the ranking task, due to heterogeneity in the training data distribution. This is
because pointwise models spend part of their predictive power on learning that some contexts are
inherently easier/harder for relevance or engagement prediction tasks. Although the Bayes-optimal
relevance predictor is also Bayes-optimal for the list-wise ranking loss [44], when learning
a contextually discriminative and well-calibrated relevance model is complex, such models
fail to perform well in the ranking task. The performance gap (not reported in the table) is
even higher for the relevance teacher trained on expert judgments, particularly because that
model is primarily focused on the context affinity of the candidate items rather than historic
performance.</p>
          <p>We observe, however, that the hybrid risk estimators based on the relevance teacher outperform
the IPW estimator. It is interesting that the relevance teacher trained on expert judgements
significantly helps with the performance of HDμ2,α∗, signifying the importance of soft relevance
labels for all candidate items and the synergy between the relevance annotations and the
implicit user feedback. We note that the deconfounding teacher μ1 helps only marginally with
the performance of the HDμ1,α∗ estimator compared to the IPW baseline. Despite its theoretical
appeal, the empirical performance of the doubly robust estimator relies on effective assumptions
on observing the relevance of the unengaged items. In fact, as demonstrated in the table, the
doubly robust estimator based on the naive observation assumption, namely that relevance is observed
only in the event of explicit user engagements, fails to outperform even the IPW estimator in
terms of ranking efficiency metrics.</p>
          <p>Online Experiments on Product Search Ranking In section 3.2.2, we argued that a
relevance teacher can help not only with better generalization, but also with alleviating the selection
bias of the ranker by further incorporating queries with poor engagement history in training. We
also showed, through strong empirical results from offline counterfactual evaluations, that incorporating
relevance signals from a relevance teacher, trained on expert judgements, in a Hybrid distilled
risk estimator can significantly improve offline SERP efficiency metrics. To support our claims
on the generalization power of our proposed estimators, we report the results from a controlled
randomized online experiment in a major e-commerce platform, where the relevance ranking
model is replaced with a model trained on a Hybrid distilled risk that relies both on debiased user
engagement events and on a relevance teacher trained on expert judgements. Beyond significant
lifts (&gt; +2%) in search efficiency metrics as measured by rank-discounted measures, we observed
a remarkable lift (&gt; +1%) in click-through rate, confirming generalization in terms of converting
more novel queries, as well as a &gt; 0.5% reduction in search abandonment rate, which is mostly
affected by hard queries with poor historic engagements.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Concluding Remarks</title>
      <p>We proposed an effective debiasing technique for counterfactual learning to rank from
observational search activity data, using distilled knowledge from a relevance teacher to inform the
label distribution for a listwise ranking task. We established the generalization power of the
proposed estimators through rigorous empirical results in offline counterfactual evaluations as
well as online randomized controlled experiments on a ranking task in a major e-commerce
platform. We also presented a theoretical analysis of the estimation error of the proposed
estimators to justify the improved generalization from a theoretical perspective. Our contributions
highlight important insights into using potential outcome modeling from the more generic
perspective of knowledge distillation.
[4] Y. Zhang, L. Yan, Z. Qin, H. Zhuang, J. Shen, X. Wang, M. Bendersky, M. Najork, Towards
disentangling relevance and bias in unbiased learning to rank, arXiv preprint
arXiv:2212.13937 (2022).
[5] N. Craswell, O. Zoeter, M. Taylor, B. Ramsey, An experimental comparison of click
position-bias models, in: Proceedings of the 2008 international conference on web search and data
mining, 2008, pp. 87–94.
[6] L. Yan, Z. Qin, H. Zhuang, X. Wang, M. Bendersky, M. Najork, Revisiting two tower models
for unbiased learning to rank (2022).
[7] J. M. Robins, A. Rotnitzky, Semiparametric efficiency in multivariate regression models
with missing data, Journal of the American Statistical Association 90 (1995) 122–129.
[8] H. Oosterhuis, Doubly robust estimation for correcting position bias in click feedback for
unbiased learning to rank, ACM Transactions on Information Systems 41 (2023) 1–33.
[9] Z. Ovaisi, K. Vasilaky, E. Zheleva, Propensity-independent bias recovery in offline
learning-to-rank systems, in: Proceedings of the 44th International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2021, pp. 1763–1767.
[10] A. Vardasbi, H. Oosterhuis, M. de Rijke, When inverse propensity scoring does not
work: Affine corrections for unbiased learning to rank, in: Proceedings of the 29th ACM
International Conference on Information &amp; Knowledge Management, 2020, pp. 1475–1484.
[11] A. Agarwal, X. Wang, C. Li, M. Bendersky, M. Najork, Addressing trust bias for unbiased
learning-to-rank, in: The World Wide Web Conference, 2019, pp. 4–14.
[12] X. Wang, R. Zhang, Y. Sun, J. Qi, Doubly robust joint learning for recommendation on
data missing not at random, in: International Conference on Machine Learning, PMLR,
2019, pp. 6638–6647.
[13] T. Yang, C. Luo, H. Lu, P. Gupta, B. Yin, Q. Ai, Can clicks be both labels and features?
unbiased behavior feature collection and uncertainty-aware learning to rank, in:
Proceedings of the 45th international ACM SIGIR conference on research and development in
information retrieval, 2022, pp. 6–17.
[14] M. Richardson, E. Dominowska, R. Ragno, Predicting clicks: estimating the click-through
rate for new ads, in: Proceedings of the 16th international conference on World Wide
Web, 2007, pp. 521–530.
[15] H. Guo, J. Yu, Q. Liu, R. Tang, Y. Zhang, Pal: a position-bias aware learning framework for
ctr prediction in live recommender systems, in: Proceedings of the 13th ACM Conference
on Recommender Systems, 2019, pp. 452–456.
[16] M. Chen, C. Liu, Z. Liu, J. Sun, Scalar is not enough: Vectorization-based unbiased learning
to rank, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining, 2022, pp. 136–145.
[17] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, M. Najork, Position bias estimation
for unbiased learning to rank in personal search, in: Proceedings of the Eleventh ACM
International Conference on Web Search and Data Mining, ACM, 2018, pp. 610–618.
[18] Z. Fang, A. Agarwal, T. Joachims, Intervention harvesting for context-dependent
examination-bias estimation, in: Proceedings of the 42nd International ACM SIGIR
Conference on Research and Development in Information Retrieval, 2019, pp. 825–834.
[19] X. Wang, M. Bendersky, D. Metzler, M. Najork, Learning to rank with selection bias in
personal search, in: Proceedings of the 39th International ACM SIGIR conference on
Research and Development in Information Retrieval, 2016, pp. 115–124.
[36] C. Xu, Q. Li, J. Ge, J. Gao, X. Yang, C. Pei, F. Sun, J. Wu, H. Sun, W. Ou, Privileged
features distillation at taobao recommendations, in: Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 2590–2598.
[37] J. Tang, K. Wang, Ranking distillation: Learning compact ranking models with high
performance for recommender system, in: Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery &amp; Data Mining, 2018, pp. 2289–2298.
[38] Z. Qin, L. Yan, Y. Tay, H. Zhuang, X. Wang, M. Bendersky, M. Najork, Improving neural
ranking via lossless knowledge distillation, arXiv preprint arXiv:2109.15285 (2021).
[39] D. B. Rubin, Causal inference using potential outcomes: Design, modeling, decisions,</p>
      <p>Journal of the American Statistical Association 100 (2005) 322–331.
[40] O. Chapelle, Y. Chang, Yahoo! learning to rank challenge overview, in: Proceedings of the
learning to rank challenge, PMLR, 2011, pp. 1–24.
[41] T. Qin, T.-Y. Liu, Introducing letor 4.0 datasets, arXiv preprint arXiv:1306.2597 (2013).
[42] C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, F. Silvestri, S. Trani, Post-learning
optimization of tree ensembles for efficient ranking, in: Proceedings of the 39th International
ACM SIGIR conference on Research and Development in Information Retrieval, 2016, pp.
949–952.
[43] X. Wang, C. Li, N. Golbandi, M. Bendersky, M. Najork, The lambdaloss framework for
ranking metric optimization, in: Proceedings of the 27th ACM international conference
on information and knowledge management, 2018, pp. 1313–1322.
[44] D. Cossock, T. Zhang, Statistical analysis of bayes optimal subset ranking, IEEE
Transactions on Information Theory 54 (2008) 5140–5154.
</p>
    </sec>
    <sec id="sec-6-1">
      <title>A. Proof of Theorem 3.1</title>
      <p>Note that we only need to focus on the bias of the empirical label of a single sample point;
the proof then follows by applying Lemma 2.1.</p>
      <p>𝔼[ĉHD(d) − rd|s] = α 𝔼[(cd/p̂0(d)) − rd|s] + (1 − α)Δd
= α 𝔼[𝔼[(od/p̂0(d) − 1) rd|s, r]] + (1 − α)Δd
= α 𝔼[(p0(d)/p̂0(d) − 1) rd|s] + (1 − α)Δd.</p>
    </sec>
    <sec id="sec-7">
      <title>B. Proof of Theorem 3.2</title>
      <p>Note that we only need to derive a bound for the bias of the empirical label of a single sample
point and the result follows by applying Lemma 2.1.</p>
      <p>𝔼[ĉDR(d) − rd|s] = Δd + 𝔼[(od/p̂0(d)) (cd − μ(xd; s))|s]
= Δd + 𝔼[𝔼[(od/p̂0(d)) (od rd − μ(xd; s))|r, s]]
= Δd + (ℙ(od = 1|s)/p̂0(d)) 𝔼[(rd − μ(xd; s))|s]
= Δd − Δd p0(d)/p̂0(d)
= Δd (1 − p0(d)/p̂0(d)),
where p0(d) = ℙ(od = 1|s) denotes the actual click propensity.</p>
    </sec>
    <sec id="sec-8">
      <title>C. Variance analysis for Hybrid Estimator</title>
      <p>By invoking the law of total variance and using the fact that R(f) is non-stochastic, we can
write
𝕍[R̂HD(f)] = 𝔼s[𝕍[R̂HD(f) − R(f)|s]] + 𝕍s[𝔼[R̂HD(f) − R(f)|s]]
= 𝔼s[∑d∈s ℓ²(f(xd; s)) 𝕍[ĉHD(d) − rd|s]]
+ 𝕍s[∑d∈s ℓ(f(xd; s)) (α rd p0(d)/p̂0(d) + (1 − α) μ(xd; s))],
where the last two lines follow by a similar argument as in the proof of Theorem 3.1.</p>
      <p>Our bias-variance analysis provides insight into the fundamental components of the
generalization error of hybrid distilled empirical risk estimates. The first component of the variance
in the last line is due to the IPW component, and the variance penalty due to the distillation
component is
(1 − α)² 𝕍s[∑d∈s ℓ(f(xd; s)) μ(xd; s)].
(12)
We have also already observed that the bias penalty we incur by incorporating the distillation
component is
(1 − α) 𝔼s[∑d∈s ℓ(f(xd; s)) Δd].
(13)
This analysis shows the variance reduction benefits of the Hybrid Distilled Risk estimator, which
is further enhanced by increasing (1 − α), at the expense of potentially more bias due to inaccuracy
of the teacher’s estimate of the actual relevance probability.</p>
    </sec>
    <sec id="sec-9">
      <title>D. Variance analysis for Doubly Robust Estimator</title>
      <p>By invoking the law of total variance and using the fact that R(f) is non-stochastic, we can
write
𝕍[R̂DR(f)] = 𝔼s[𝕍[R̂DR(f) − R(f)|s]] + 𝕍s[𝔼[R̂DR(f) − R(f)|s]]
= 𝔼s[∑d∈s ℓ²(f(xd; s)) 𝕍[ĉDR(d) − rd|s]]
+ 𝕍s[∑d∈s ℓ(f(xd; s)) 𝔼[ĉDR(d) − rd|s]]
≤ 𝔼s[∑d∈s ℓ²(f(xd; s)) 𝔼[(ĉDR(d) − rd)²|s]]
+ 𝕍s[∑d∈s ℓ(f(xd; s)) 𝔼[ĉDR(d) − rd|s]].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          , T. Schnabel,
          <article-title>Unbiased learning-to-rank with biased feedback</article-title>
          ,
          <source>in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, ACM</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>781</fpage>
          -
          <lpage>789</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>A dynamic bayesian network click model for web search ranking</article-title>
          ,
          <source>in: Proceedings of the 18th international conference on World wide web</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Borisov</surname>
          </string-name>
          , I. Markov,
          <string-name>
            <surname>M. De Rijke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Serdyukov</surname>
          </string-name>
          ,
          <article-title>A neural click model for web search</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on World Wide Web</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>531</fpage>
          -
          <lpage>541</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          , To-
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ebrahimzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bagherjeiran</surname>
          </string-name>
          ,
          <article-title>Intent-aware propensity estimation via click pattern stratification</article-title>
          ,
          <source>in: Companion Proceedings of the ACM Web Conference</source>
          <year>2023</year>
          , WWW '23 Companion, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>751</fpage>
          -
          <lpage>755</lpage>
          . URL: https://doi.org/10.1145/3543873.3587610. doi:
          <volume>10</volume>
          .1145/3543873.3587610.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Unbiased learning to rank with unbiased propensity estimation</article-title>
          ,
          <source>in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Charlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <article-title>The deconfounded recommender: A causal inference approach to recommendation</article-title>
          ,
          <source>arXiv preprint arXiv:1808.06581</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dudík</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Doubly robust policy evaluation and learning</article-title>
          ,
          <source>arXiv preprint arXiv:1103.4601</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schnabel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chandak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>Recommendations as treatments: Debiasing learning and evaluation</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2016</year>
          , pp.
          <fpage>1670</fpage>
          -
          <lpage>1679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Approximated doubly robust search relevance estimation</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3756</fpage>
          -
          <lpage>3765</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Saito</surname>
          </string-name>
          ,
          <article-title>Doubly robust estimator for ranking metrics with post-click conversions</article-title>
          ,
          <source>in: Proceedings of the 14th ACM Conference on Recommender Systems</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>92</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chuklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>De Rijke</surname>
          </string-name>
          ,
          <article-title>Click model-based information retrieval metrics</article-title>
          ,
          <source>in: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>493</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Estrin</surname>
          </string-name>
          ,
          <article-title>Unbiased offline recommender evaluation for missing-not-at-random implicit feedback</article-title>
          ,
          <source>in: Proceedings of the 12th ACM conference on recommender systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Why distillation helps: a statistical perspective</article-title>
          ,
          <source>arXiv preprint arXiv:2005.10419</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shivanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Understanding and improving knowledge distillation</article-title>
          ,
          <source>arXiv preprint arXiv:2002.03532</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mobahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Farajtabar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Bartlett</surname>
          </string-name>
          ,
          <article-title>Self-distillation amplifies regularization in Hilbert space</article-title>
          ,
          <source>arXiv preprint arXiv:2002.05715</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Pasumarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>RankDistil: Knowledge distillation for ranking</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence and Statistics</source>
          , PMLR,
          <year>2021</year>
          , pp.
          <fpage>2368</fpage>
          -
          <lpage>2376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <article-title>A general knowledge distillation framework for counterfactual recommendation via uniform data</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>831</fpage>
          -
          <lpage>840</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Bayir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Pfeifer III</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kiciman</surname>
          </string-name>
          ,
          <article-title>Causal transfer random forest: Combining logged data and randomized experiments for robust prediction</article-title>
          ,
          <source>in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>211</fpage>
          -
          <lpage>219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanghavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rahmanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bakus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <article-title>Toward understanding privileged features distillation in learning-to-rank</article-title>
          ,
          <source>arXiv preprint arXiv:2209.08754</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>