Mitigating Targeting Bias in Content Recommendation with Causal Bandits∗

YAN ZHAO, MITCHELL GOODMAN, SAMEER KANASE, SHENGHE XU, YANNICK KIMMEL, BRENT PAYNE, SAAD KHAN, and PATRICIA GRAO, Amazon.com, Inc., USA

Recommendation systems play a central role in improving the customer experience on the Amazon retail website. Commonly, Learning-to-Rank (LTR) methods are employed to rank content; however, these methods are subject to bias inherent in the observational data they are trained on. This paper studies a domain-specific self-selection bias, called Content Targeting Bias, introduced when content is generated for specific targeted customers. When content specifically targets classes of customers who are more or less likely to take the actions associated with traditional recommendation algorithms (clicks, purchases), the resulting observations reflect a biased relationship between the content and the feedback. These observations do not account for the counterfactual condition, i.e., what would have happened if the customer had not received a recommendation. In many cases, customers have a high propensity to generate rewards independent of the recommendations shown on the website. In this work we incorporate causal uplift modeling into contextual bandits in order to use the heterogeneous treatment effect as an adjusted objective for top-k content selection. We demonstrate the performance and impact of the framework through both offline model evaluations and multiple live A/B experiments.

CCS Concepts: • Computing methodologies → Sequential decision making; Batch learning; Learning from implicit feedback; Causal reasoning and diagnostics; Learning to rank; Supervised learning by regression; • Applied computing → Online shopping; • Information systems → Content ranking; Personalization; Top-k retrieval in databases; Recommender systems; • Mathematics of computing → Bayesian computation.

Additional Key Words and Phrases: Personalization, Recommender system, Content optimization, Content ranking, Selection bias, Causal bandit, Contextual bandit, Uplift, View-through attribution, Fairness, Counterfactual learning

1 INTRODUCTION

In many e-commerce applications, customers rely on recommender systems to help them sort through large corpora of content and discover the small fraction of it that they would be interested in. Amazon's content optimization/ranking system is designed as a self-service tool which enables teams across the company to build content recommendation strategies once and run them anywhere. Such a content optimization system is challenging mostly for the following two reasons: (1) continuous learning: surfacing the right content to the right user at the right time requires a ranking system that continually adapts to users' shifting interests along with newly introduced content; (2) content bias reduction: learning the unbiased incremental value of each piece of content given the context is typically unachievable given the limits of partial observations.

To continuously learn new content and adapt to changing customer behaviors, the exploration/exploitation trade-off, in the context of reinforcement learning, is currently an active area of research. Numerous competitive techniques have emerged in the literature and shown promising results, including epsilon-greedy [26], adding random noise to parameters [8], bootstrap sampling [19], and Thompson Sampling [6].
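To make the last of these techniques concrete, below is a minimal, illustrative Thompson Sampling loop for a Bernoulli-reward bandit. This is a textbook sketch with synthetic data of our own, not the system described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 5
true_ctr = rng.uniform(0.01, 0.10, n_arms)  # hidden per-arm click rates (synthetic)
alpha = np.ones(n_arms)                     # Beta(1, 1) prior per arm
beta = np.ones(n_arms)

for t in range(10_000):
    # Thompson Sampling: draw one sample from each arm's posterior,
    # then act greedily with respect to the sampled values.
    arm = int(np.argmax(rng.beta(alpha, beta)))
    reward = rng.random() < true_ctr[arm]   # simulated binary feedback
    alpha[arm] += reward                    # conjugate posterior update
    beta[arm] += 1 - reward

print("posterior mean CTR per arm:", alpha / (alpha + beta))
```

The same draw-from-the-posterior idea reappears in Section 3, where the posterior is over the coefficients of a Bayesian linear regression rather than per-arm Beta distributions.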
Content bias is introduced by the process of making recommendations online, which influences the way users interact with the system and how the data collected from users is fed back into the system. This leads to several types of biases, such as popularity bias [5], human decision bias [7], position bias [14], or selection bias [13]. Traditional learning-to-rank approaches must contend with these biases, and most approaches focus on position bias [14] or selection bias [30].

∗Copyright 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the MORS workshop held in conjunction with the 16th ACM Conference on Recommender Systems (RecSys), 2022, in Seattle, USA.

In this paper, we identify a new type of bias, called Content Targeting Bias, which is introduced in recommendation systems at industry scale by content targeting criteria. Here, content targeting criteria, defined by content recommendation strategy owners, target only certain populations of customers or types of page contexts; a piece of content then participates in ranking competitions only when its targeting criteria are met. Content targeting criteria can be as simple as "all customers and contexts" or can be specific to a small portion of customers who have taken particular actions over the past month(s). For example, Content Targeting Bias is introduced when content owners target only signed-in customers, who are known to spend more on average than the population as a whole, regardless of the recommendations provided. Targeting bias can also arise when content owners target recommendations to some negative-profit item pages. Not accounting for such biases in ranking means we end up over- or under-estimating a content's incremental performance, leading to unfair ranking and degraded customer experiences.

For practical reasons it is nearly impossible for a ranking system to be aware of all the targeting criteria that each content owner uses, along with the detailed context and customer information available at the level of each content owner. Thus, there needs to be a feature-agnostic way to mitigate Content Targeting Bias and thereby improve ranking. On top of this, we propose a quantitative measurement of Content Targeting Bias within recommendation systems, as well as solutions for reducing it.

Our solution for reducing Content Targeting Bias was inspired by work on causal bandits [25][24]. In this work, we incorporate uplift modeling [27][20][9], using meta-learning approaches (e.g., X-learner, R-learner) [17][32], into contextual bandits [18][3]. The approach is designed to use the heterogeneous treatment effect between content that was eligible to show but not actually observed (not treated) vs. observed (treated), with the goal of giving content an equal opportunity to be shown regardless of targeting criteria, thus maximizing the customer experience. To the best of our knowledge, our work is the first to identify content targeting as a unique bias and to incorporate uplift modeling into bandit approaches, including the Bayesian Linear Regression model (BLIR) [10], to reduce such bias in the content ranking problem. During experimentation in an Amazon commercial system [15], this work achieved significant online improvements across multiple pages of the Amazon e-commerce website.

This paper is organized as follows: Section 2 describes the problem definition.
Section 3 describes the proposed solution. Section 4 covers offline model evaluation and online live experiment results, with learnings. Finally, Section 5 details conclusions and future work.

2 PROBLEM DESCRIPTION

2.1 Formalizing the problem

We define a widget group as a region of experience on Amazon's e-commerce website which can be populated with recommended content (a.k.a. widgets) provided by different teams. Note that the number of widgets rendered in a widget group is much smaller than the number of all possible candidate widgets that are generated. Eligibility for rendering widget w_i is typically determined by a combination of the widget's targeting criteria and the ranking system's valuation. The metric for measuring reward R is denoted MOI, short for "metric of interest". In our setting, MOI takes into account the short-term as well as long-term impact on the customer's shopping experience, and helps us fairly balance the multiple, differing objectives of various stakeholders. Many Learning-to-Rank systems in the literature ([2][11][29]) optimize for Click-Through Rate (CTR), while we are more interested in the site-wide MOI. Toward this end, we have adopted an attribution model using view-through attribution (VTA) [16][31], which credits widgets with all rewards following an impression within an attribution window (e.g., 100 minutes). For example, if a customer views some content recommendation A for longer than 1 second (defined as an impression) and then makes a purchase in the subsequent 100 minutes, the reward R generated by this purchase, along with all other high-value actions taken in those 100 minutes, is attributed back to the impressed content A. The drawback of this methodology is that it loosens the connection between the content and the associated reward, which makes the ranking system more vulnerable to Content Targeting Bias. Specifically, by treating feedback as a "response" without considering the counterfactual case (i.e., how the customer would have behaved had they not received a recommendation), we end up with a biased estimate. Cases where customers have a high probability of generating down-session rewards independent of the recommendations shown can lead the system to over-estimate the value of the widgets shown. This misspecification of value due to Content Targeting Bias disrupts the fairness guarantees provided by the ranking system and ultimately leads to a suboptimal customer experience.
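To illustrate the VTA scheme described above, the following sketch credits each reward event to every widget impressed within the preceding 100-minute window. It is a simplified illustration under assumptions of our own (in-memory event lists, the 1-second view threshold applied upstream), not the production attribution pipeline.

```python
from collections import defaultdict
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(minutes=100)

def vta_credit(impressions, rewards):
    """impressions: list of (widget_id, time) pairs, each a view of >= 1 second.
    rewards: list of (time, value) pairs (purchases and other high-value actions).
    Returns the total reward credited to each widget under view-through attribution."""
    credit = defaultdict(float)
    for reward_time, value in rewards:
        for widget_id, seen_at in impressions:
            # Credit the widget if the reward falls inside its attribution window.
            if seen_at <= reward_time <= seen_at + ATTRIBUTION_WINDOW:
                credit[widget_id] += value
    return dict(credit)

t0 = datetime(2022, 9, 18, 12, 0)
# A purchase 30 minutes after an impression of widget "A" is credited to "A".
print(vta_credit(impressions=[("A", t0)],
                 rewards=[(t0 + timedelta(minutes=30), 25.0)]))  # {'A': 25.0}
```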
2.2 Formalizing Content Targeting Bias and ranking fairness

To formalize Content Targeting Bias, we adopt the recently introduced idea of opportunity bias [33], a measure designed to evaluate whether different types of content receive clicks (or other engagement) in proportion to their true targeted population sizes (i.e., do contents with different targeting criteria receive similar true positive rates?). This method assumes that the contents recommended by content owners are all of relatively good quality. We believe this formalization of Content Targeting Bias is directly aligned with user satisfaction and the economic gains of content owners. To quantify the impact of Content Targeting Bias on recommendation systems, we first calculate the true positive rate for each content. Using show rate as an example, suppose content i has been exposed to customers E_i times in total; the true positive rate for i is TPR_i = E_i / A_i, where A_i is the total number of times content i was generated based on its owner's targeting criteria. We can then use the Gini coefficient [4][28] to measure the inequality in true positive rates relative to content generation:

$$Gini = \frac{\sum_{i=1}^{M} (2i - M - 1) \cdot TPR_i}{M \sum_{i=1}^{M} TPR_i} \tag{1}$$

where contents are indexed from 1 to M by targeted audience size, in non-descending order. We use -1 <= Gini <= 1 to quantify the Content Targeting Bias in a recommendation system: a Gini close to 0 indicates low bias; Gini > 0 indicates that the true positive rate is positively correlated with a content's targeted audience size; and Gini < 0 indicates that the true positive rate is negatively correlated with audience size.
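As a concrete reading of equation (1), the following short function (our own illustrative sketch, not code from the paper) computes the Gini coefficient over per-content true positive rates:

```python
import numpy as np

def gini_of_tpr(tpr):
    """Gini coefficient of true positive rates, per equation (1).
    `tpr[i]` is the TPR of content i; contents must already be sorted by
    targeted audience size in non-descending order."""
    tpr = np.asarray(tpr, dtype=float)
    m = len(tpr)
    i = np.arange(1, m + 1)  # 1-based content index
    return float(np.sum((2 * i - m - 1) * tpr) / (m * np.sum(tpr)))

# Equal TPRs -> no bias; TPR growing with audience size -> positive Gini.
print(gini_of_tpr([0.2, 0.2, 0.2]))  # 0.0
print(gini_of_tpr([0.1, 0.2, 0.6]))  # > 0
```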
3 METHODOLOGY

To address the problems above, we propose a framework employing uplift techniques with contextual bandits on top of VTA, to de-bias observations for the ranking system. In more detail, we add contextual features to an uplift model in order to estimate the Conditional Average Treatment Effect (CATE) [1] between exposure vs. non-exposure of a recommendation to customers, using either Randomized Controlled Trial (RCT) or observational data. Under this framework, we also propose a modeling architecture incorporating Bayesian Linear Regression (BLIR) with Thompson Sampling to achieve an online exploration-exploitation trade-off.

3.1 Assumptions and definitions

We divide the causal impact on reward R of showing a widget to customers into two parts: (1) request-level incrementality: the incremental value of showing the top K widgets to a customer within a request. This removes confounding factors which impact the overall down-session reward R independent of the recommendations received. (2) widget-level attribution: out of the top K widgets, each widget's contribution to the request-level incrementality. Ideally, we would have a single causal model that solves request-level incrementality and widget-level attribution at the same time, but within the scope of this paper we focus on the first problem, which is more closely related to Content Targeting Bias, namely removing the customer/context intrinsic value from the observed reward R.

Let T be a dummy variable indicating treatment status, with T = 1 if a customer receives recommended contents (treatment) for a given request and T = 0 otherwise. The observed reward is defined as $Y \equiv T \cdot Y(1) + (1 - T) \cdot Y(0)$, where Y(1) and Y(0) are the potential outcomes when people receive the treatment or not. Under the unconfoundedness assumption [23], $Y(0), Y(1) \perp T$, we can approximate Y using Y(1) if treatment is imposed, or Y(0) otherwise. In practice, instead of the simple unconfoundedness assumption, we make a conditional unconfoundedness assumption due to the non-random treatment assignment, also known as "strongly ignorable treatment assignment" [22]: $Y(0), Y(1) \perp T \mid X$. In other words, given all the covariates X, the treatment assignment is independent of the potential outcomes. Given X = x, the CATE $\tau(x)$ is then defined as

$$\tau(x) \equiv E[Y(1) - Y(0) \mid X = x]$$

In this work, the treatment group is defined as request-level exposure of the top K widgets, while the control group is defined as non-exposure of the entire widget group. This definition accounts for exogenous factors in the top-k ranking problem, where widgets at top ranks may have an impact on lower-ranked widgets' exposures.

3.2 Features

Another key point related to the unconfoundedness assumption is the set of covariates X which can support it. In the real world, finding all confounders can be intractable; in this work we simplify the problem by limiting it to two confounding factors which impact user propensities: (1) content targeting, where content is generated only for a subgroup of customers; (2) model targeting, where the model returns content non-uniformly given different page contexts. The conditional unconfoundedness assumption then holds as long as we can capture the page context and a customer's intrinsic value in X. In the Content Targeting Bias problem, a customer's intrinsic value can be related only to the candidate widgets generated for a given request. Thus, with feature representations of the candidate widgets as well as other page context and customer information, we can appropriately counteract the confounding problem.

3.3 Two-model framework estimating a pseudo-effect

We further propose a two-model framework, composed of a baseline (control) model and an uplift model using linear regression, to reduce Content Targeting Bias. Note that we use linear models in this paper, but the approach can be applied to any type of machine learning model.

3.3.1 Model 1 - Baseline Model. The baseline model fits the observations in the control group, estimating the expected down-session rewards when contents are eligible to show but not actually observed by customers, defined at the request level as $\hat{\mu}_0 = \beta^T X + \epsilon$, where $\beta$ is the vector of feature weights, found during training so as to best explain the data; $\epsilon$ is a noise term capturing unobserved variables; and X represents the features used in the baseline model, as explained in the sections above.

3.3.2 Model 2 - Uplift Model. For each request with K individual observations (widgets) in the treatment group, which are returned and actually observed by customers, we define

$$D_i^{(1)} = Y_i^{(1)} - \hat{\mu}_0(x) = \tau^{(1)} + \epsilon \tag{2}$$

where $Y_i^{(1)}$ is the observed down-session reward for a customer at request i, and $D_i^{(1)}$ is the imputed treatment effect for request i in the treated group, based on the baseline outcome estimator; $\tau^{(1)}$ is the imputed treatment-effect estimate for a given request. This "pseudo-effect" is an adjusted objective with Content Targeting Bias reduced, and it is then used as the outcome in a secondary machine learning model to obtain response functions with treatment effects estimated. To achieve an online exploration-exploitation trade-off, we utilize the Bayesian Linear Regression (BLIR) [10] approach, $\hat{\tau}_A^{(1)} \sim \mathcal{N}(\theta^T X_A, \sigma^2 I)$, where $\hat{\tau}_A^{(1)}$ is the estimated score for widget A, $\theta$ is the vector of feature coefficients, and $X_A$ represents the features used in the uplift model. Note that $X_A$ differs from X in that X contains information about all candidate widgets and is only used offline, while $X_A$ contains features only for the focal widget A. This uplift model estimates "pseudo-treatment-effects" for the observations in the treatment group and helps reduce Content Targeting Bias, since we remove the counterfactual effect using the baseline (control) model. Finally, we perform point-wise ranking online by estimating a score for every candidate widget, sorting, and returning the top K to customers. The exploration-exploitation trade-off is achieved by sampling the model parameters from their posterior distributions via Thompson Sampling.
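Below is a minimal end-to-end sketch of this two-model framework, under simplifying assumptions of our own: synthetic data, scikit-learn's BayesianRidge standing in for BLIR, request-level features reused as widget features, and a single Thompson Sampling draw from the posterior over the coefficients.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression

rng = np.random.default_rng(0)
n, d, K = 5000, 8, 3

# Synthetic request-level features X and down-session rewards Y.
X = rng.normal(size=(n, d))
intrinsic = X @ rng.normal(size=d)            # customer/context intrinsic value
treated = rng.random(n) < 0.5                 # request-level exposure indicator
true_uplift = 0.5 * X[:, 0]                   # heterogeneous treatment effect
Y = intrinsic + treated * true_uplift + rng.normal(scale=0.1, size=n)

# Model 1 (offline baseline): fit mu0_hat on control (non-exposed) requests.
baseline = LinearRegression().fit(X[~treated], Y[~treated])

# Pseudo-effect for treated observations: D = Y - mu0_hat(x), per equation (2).
D = Y[treated] - baseline.predict(X[treated])

# Model 2 (online uplift BLIR): Bayesian linear regression on the pseudo-effect.
# Here we reuse X for simplicity; in the paper X_A is a smaller, widget-focused set.
uplift = BayesianRidge().fit(X[treated], D)

# Thompson Sampling: one posterior draw of theta, then point-wise top-K ranking.
theta = rng.multivariate_normal(uplift.coef_, uplift.sigma_)
candidates = rng.normal(size=(20, d))         # 20 candidate widgets' features
scores = candidates @ theta + uplift.intercept_
top_k = np.argsort(scores)[::-1][:K]
print("top-K widget indices:", top_k)
```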
In addition to bias reduction via causal effect estimation, another benefit of this two-model framework is that the baseline model is used only offline, to generate the uplift objectives for the second model. This simplification lets us include as many features as we can in X, such as representations of all candidate widgets' features, without concern for latency or other online-system requirements, while keeping the online feature set $X_A$ relatively simple yet achieving a similar effect on bias reduction.

3.4 Log tricks on objectives

One trick we apply is to transform the reward R and the uplift estimates to log scale. Transforming rewards to log scale is a widely used trick for taming outliers, so we first introduced it on top of the baseline model estimates. Instead of log, we use $signedLog1p(v) = sign(v) \cdot log1p(|v|)$ to achieve symmetry and valid values at zero (we still write log in what follows). By Jensen's inequality, transforming the log-scaled baseline estimate $log\_\hat{\mu}_0$ back to the original scale is biased; thus, we estimate treatment effects directly in log space:

$$log\_D^{(1)} = log\_Y_i - log\_\hat{\mu}_0(x) = log\_\tau^{(1)} + \epsilon \tag{3}$$

where $log\_\hat{\mu}_0(x)$ can be treated as the geometric mean of the baseline values given covariates X, i.e., $\log\big((\prod_m \hat{\mu}_0)^{1/M}\big) \mid X$, while the treatment effect $log\_\tau^{(1)}$ becomes multiplicative, i.e., $\log\big(D^{(1)} / geoMean\_\hat{\mu}_0\big)$. Although this definition differs from the additive uplift of the section above, the sign retains its meaning: positive values indicate positive incrementality whereas negative values indicate negative incrementality. In addition, defining this relationship as multiplicative lends an intuitive semantic meaning. For example, say customers who are not signed in spend $10 on average while signed-in customers spend $100. When representing the uplift of some content A, instead of an additive uplift of $10 (i.e., $20 for a signed-out customer and $110 for a signed-in customer), a multiplicative lift ratio of 1.1 makes more sense in an e-commerce context (i.e., $11 for a signed-out customer and $110 for a signed-in customer). Results from A/B testing show that log-scaled (multiplicative) uplift performs better than additive uplift.
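A small sketch of the signedLog1p transform and the log-space pseudo-effect of equation (3), using the signed-in/signed-out example above; this is illustrative only, with `baseline_log_pred` standing in for the log-scale baseline estimate:

```python
import numpy as np

def signed_log1p(v):
    # signedLog1p(v) = sign(v) * log1p(|v|): symmetric around 0 and finite at v = 0.
    v = np.asarray(v, dtype=float)
    return np.sign(v) * np.log1p(np.abs(v))

# Log-space pseudo-effect (equation 3): log_D = log_Y - log_mu0_hat.
y_treated = np.array([110.0, 11.0])              # signed-in vs signed-out spend with content A
baseline_log_pred = signed_log1p([100.0, 10.0])  # hypothetical baseline (no-content) estimates
log_d = signed_log1p(y_treated) - baseline_log_pred

# In log space the uplift reads as an (approximate) multiplicative lift ratio:
print(np.exp(log_d))  # ~1.1x lift for both customer types, not a flat additive amount
```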
3.5 Data collection

Randomly hiding content, as in an RCT, could give us a more unbiased estimate of the CATE; however, it is costly to proactively hide content from customers. In the RCT setup, we randomly punt (do not display) the entire widget group a small percentage $\zeta$ of the time, and train the baseline model using only this punted traffic. In this way, using the RCT, we are able to remove the confounding effect resulting from customer selection bias, e.g., a customer's propensity to scroll down the page and browse content. Observational data, in contrast to RCT data, provides ample data at minimal cost, but at the expense of increased bias. In practice, when the top K widgets are returned, they are not always all shown in the customer's viewport. Our system captures this client-side impression behavior, and our proposed approach can train the baseline and uplift models on this observational data. Results show limited bias when using observational data compared to the RCT.

4 MODEL EVALUATION AND EXPERIMENTS

4.1 Ranking fairness estimation

In offline model evaluation of ranking fairness, we compared the Gini scores defined in Section 2.2 when ranking content with the production model (non-uplift linear bandits regressed directly on VTA) vs. the two-model uplift bandit approach. To measure fairness along multiple dimensions, we define TPR in two ways:

$$TPR_i = \begin{cases} \dfrac{\text{content exposure}}{\text{content generated}} & \text{(TPR of show rate)} \\[1.5ex] \dfrac{\text{observed content reward}}{\text{content exposure}} & \text{(TPR of performance)} \end{cases} \tag{4}$$

From Table 1, the uplift approach reduces bias in terms of both content show rate and average content performance, especially the latter. This indicates that our approach improves fairness based more on a content's performance than on its show-rate coverage, which better aligns with business goals.

Table 1. Uplift Gini metrics

                                 production model    uplift model
  Gini of content show rate          0.818               0.815
  Gini of content performance        0.55                0.48

4.2 Effectiveness of heterogeneous treatment effect estimation

Through analysis of real online data, we identified several widgets targeting customers with high intrinsic rewards, compared their observed scores with the predicted uplift values, and validated that uplift is able to reduce these biases. In Table 2, widgets C and D target high-value customers only, while widgets A, B, E, and F are common widgets targeting all customers. The scores are observed or estimated rewards as defined in Section 2.1, represented as tuples of (average across all customers, average across high-value customers only). When evaluating with the proposed framework, we intentionally exclude all customer-related features, so the model does not depend on customer profiles. Although widget C has the highest average observed score across all customers, its uplift prediction is not as high as widget B's after reducing Content Targeting Bias (0.15 vs. 0.24); this aligns with our observations of these widgets in online experimentation. A similar pattern can be seen for widget D.

Table 2. Effectiveness of uplift estimation (average across all customers, average across high-value customers only)

                               widget A      widget B      widget C      widget D       widget E       widget F
  average observed reward     (5.42,6.73)   (6.14,7.49)   (7.70,7.70)   (3.74,3.74)    (3.48,4.52)    (3.58,4.82)
  Control estimation          (4.88,6.13)   (5.20,6.45)   (6.37,6.37)   (3.84,3.84)    (3.97,5.02)    (4.79,6.34)
  Treatment estimation        (4.97,6.15)   (5.44,6.63)   (6.52,6.52)   (3.76,3.76)    (3.86,4.86)    (4.83,6.35)
  average uplift estimation   (0.08,0.01)   (0.24,0.18)   (0.15,0.15)   (-0.09,-0.09)  (-0.11,-0.16)  (0.03,0.02)

4.3 RCT and observational data

We also evaluated the baseline model of the two-model uplift approach trained with either RCT or observational data. Through offline analysis, we see that the observational uplift approach achieves results similar to the RCT model, without significant differences in rankings. However, the RCT baseline model gives generally lower counterfactual estimates (by 5% ∼ 10%). This gap can be interpreted through customers' selection bias: when customers intentionally do not view content (observational data), they may be attracted by other content on the page or may already have a clear shopping mission. This in turn leads to higher estimates in the observational baseline model than in the RCT one.
Estimating this gap is important, since it can be used to adjust observational modeling and improve model interpretability while avoiding the high cost imposed by an RCT, i.e., showing sub-optimal results to a certain percentage of the population.

4.4 Online experiments

Content Targeting Bias may appear in different forms on different pages. For example, on the homepage, Content Targeting Bias is mostly introduced by different targeting criteria on customer populations, while on product detail pages, biases are mostly introduced by targeting criteria on context information. To gain a thorough understanding of the proposed work, we completed five online randomized A/B experiments [12][21].

Table 3. Aggregated table for all experiment results

  Experiment Iteration   A/B Treatment                          Annualized Impact (% improvement)   Confidence interval   p-value
  EXP-1                  Observational Uplift                   +0.13%                              +0.02% ∼ +0.23%       0.020
  EXP-1                  RCT Uplift                             +0.05%                              -0.06% ∼ +0.15%       0.380
  EXP-2                  Observational Uplift                   +0.11%                              -0.09% ∼ +0.32%       0.280
  EXP-3                  Observational Uplift                   +0.07%                              +0.00% ∼ +0.13%       0.039
  EXP-4                  Observational Uplift                   +0.06%                              -0.02% ∼ +0.14%       0.170
  EXP-4                  Observational Uplift with log trick    +0.09%                              +0.01% ∼ +0.17%       0.024
  EXP-5                  Observational Uplift with log trick    +0.13%                              +0.09% ∼ +0.18%       0.000

4.4.1 Online experiment setup. In our online experimentation setting, observational units (shopping sessions) are randomly exposed to either the baseline control policy or the alternative treatment policies. We track the impact on our metric of interest, MOI, which measures the improvement in the site-wide customer shopping experience. In our results, we report the causal effect as the percentage improvement in this metric at Amazon's scale. The experiments were conducted across all of Amazon's worldwide marketplaces and product categories. The level of significance $\alpha$ for these experiments was determined by Amazon's business objectives and was set to 0.10. The duration of the experiments was estimated from statistical power analysis. We allocated equal traffic to the control and treatment groups. During the course of each experiment, the models were incrementally trained using their own logged feedback. In all experiments, the A/B control group is a non-uplift linear contextual bandit regressed on VTA.

Experiment 1 ran on slots located at the bottom of different pages (e.g., detail page, homepage, etc.), with the following treatments: (1) observational uplift: two-model uplift with observational data; (2) RCT uplift: two-model uplift with RCT data. Experiment 2 ran on slots located at the top of product detail pages, with observational uplift as the treatment group. Experiment 3 ran on cart pages, using uplift with observational data. Experiment 4 ran on the desktop product detail page, with the following treatments: (1) observational uplift; (2) observational uplift with the log-scale trick. Experiment 5 ran on the mobile-app product detail page, using observational uplift with the log-scale trick as the treatment group.

4.4.2 Online experiment results. We observed consistent improvements across experiments. Table 3 shows detailed MOI results with confidence intervals. Through these experiments, we showed: (1) improvements from heterogeneous treatment effect estimation on top of the bandit approach, on different pages including the homepage, detail pages, and cart pages across Amazon e-commerce websites.
Among these, experiments 1, 3, 4, and 5 achieved statistically significant improvements with p-values less than 0.05, with experiment 5's p-value close to 0.000. (2) The proposed method using observational data achieved a significant improvement (p-value 0.02), while RCT uplift only outperformed the production model with low confidence (p-value 0.38). This demonstrates that observational uplift modeling can achieve results similar to those from an RCT by successfully minimizing the potential bias in training examples. Conversely, RCT depends on randomly hiding content, which is guaranteed to be suboptimal some percentage of the time, hurting overall RCT performance. (3) The proposed uplift model with log tricks outperformed additive uplift, as can be seen directly in experiment 4, where uplift with log tricks achieved a significant improvement (p-value 0.024) while the additive uplift improvement was not significant (p-value 0.17). This further supports our hypothesis that the log tricks better manage outliers and that multiplicative uplift carries more reasonable semantics.

5 CONCLUSION

In this paper, we studied a new type of bias in Learning-to-Rank systems, called Content Targeting Bias. We defined this bias, proposed a quantitative measurement for it, and further proposed an online ranking approach using BLIR that incorporates contextual features into uplift modeling to reduce this bias in top-K content selection. Along the way, we introduced log tricks for estimating treatment effects between exposure vs. non-exposure of a recommendation, and compared baseline models trained using both RCT and observational data. This work demonstrates significant bias reduction as well as significant MOI improvements both offline and online. In future work, building on the current framework, we will improve uplift estimation by applying propensity-weighting-based meta-learner approaches, e.g., double ML (R-learner), to further reduce display biases in content rankings.

REFERENCES

[1] Jason Abrevaya, Yu-Chin Hsu, and Robert P Lieli. Estimating conditional average treatment effects. Journal of Business & Economic Statistics, 33(4):485–505, 2015.
[2] Aman Agarwal, Kenta Takatsu, Ivan Zaitsev, and Thorsten Joachims. A general framework for counterfactual learning-to-rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 5–14, 2019.
[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622. PMLR, 2015.
[4] Malcolm C Brown. Using Gini-style indices to evaluate the spatial patterns of health practitioners: theoretical considerations and an application based on Alberta data. Social Science & Medicine, 38(9):1243–1256, 1994.
[5] Òscar Celma and Pedro Cano. From hits to niches? Or how popular artists can bias music recommendation and discovery. In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, pages 1–8, 2008.
[6] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257, 2011.
[7] Li Chen, Marco De Gemmis, Alexander Felfernig, Pasquale Lops, Francesco Ricci, and Giovanni Semeraro. Human decision making and recommender systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 3(3):1–7, 2013.
[8] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR, 2016.
[9] Graton Gathright, Ranjan Roopesh, Vasudev Rahul, Marshall Yan, and Fan Zhang. Cross-channel attribution of consumer marketing. In Amazon Machine Learning Conference, 2017.
[10] Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML, 2010.
[11] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.
[12] Somit Gupta, Ronny Kohavi, Diane Tang, Ya Xu, Reid Andersen, Eytan Bakshy, Niall Cardin, Sumita Chandran, Nanyu Chen, Dominic Coey, et al. Top challenges from the first practical online controlled experiments summit. ACM SIGKDD Explorations Newsletter, 21(1):20–35, 2019.
[13] James J Heckman. Sample selection bias as a specification error with an application to the estimation of labor supply functions. Princeton University Press, 2014.
[14] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 781–789, 2017.
[15] Sameer Kanase, Yan Zhao, Shenghe Xu, Mitchell Goodman, Manohar Mandalapu, Benjamyn Ward, Chan Jeon, Shreya Kamath, Ben Cohen, Vlad Suslikov, Yujia Liu, Hengjia Zhang, Yannick Kimmel, Saad Khan, Brent Payne, and Patricia Grao. An application of causal bandit to content optimization. In Proceedings of the 5th Workshop on Online Recommender Systems and User Modeling (ORSUM 2022), in conjunction with the 16th ACM Conference on Recommender Systems (RecSys 2022), Seattle, WA, USA, 2022.
[16] Pavel Kireyev, Koen Pauwels, and Sunil Gupta. Do display ads influence search? Attribution and dynamics in online advertising. International Journal of Research in Marketing, 33(3):475–490, 2016.
[17] Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
[18] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
[19] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems, 29:4026–4034, 2016.
[20] Roopesh Ranjan, Narayanan Sadagopan, and Guido Imbens. A propensity matching approach to multi touch attribution. In Amazon Machine Learning Conference, 2016.
[21] Thomas S Richardson, Yu Liu, James McQueen, and Doug Hains. A Bayesian model for online activity sample sizes. In International Conference on Artificial Intelligence and Statistics, pages 1775–1785. PMLR, 2022.
[22] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[23] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
[24] Neela Sawant, Chitti Babu Namballa, Narayanan Sadagopan, and Houssam Nassif. Multi-armed bandit framework for causal effect optimization. In Amazon Machine Learning Conference, 2017.
[25] Neela Sawant, Chitti Babu Namballa, Narayanan Sadagopan, and Houssam Nassif. Contextual multi-armed bandits for causal marketing. arXiv preprint arXiv:1810.01859, 2018.
[26] Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
[27] Bo Tan, Pramod Muralidharan, Naveen Nair, Wenduo Wang, Shaurya Gupta, Jimmy Issac, Vignesh Kannappan, Prakash Bulusu, and Phil Leslie. Attribution of prime member signups to prime benefits. In Amazon Machine Learning Conference, 2016.
[28] Adam Wagstaff, Pierella Paci, and Eddy Van Doorslaer. On the measurement of inequalities in health. Social Science & Medicine, 33(5):545–557, 1991.
[29] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235–1244, 2015.
[30] Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124, 2016.
[31] Shenghe Xu, Yan Zhao, Sameer Kanase, Mitchell Goodman, Saad Khan, Brent Payne, and Patricia Grao. Machine learning attribution: Inferring item-level impact from slate recommendation in e-commerce. In KDD 2022 Workshop on First Content Understanding and Generation for e-Commerce, 2022. URL https://www.amazon.science/publications/machine-learning-attribution-inferring-item-level-impact-from-slate-recommendation-in-e-commerce.
[32] Zhenyu Zhao and Totte Harinen. Uplift modeling for multiple treatments with cost optimization. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 422–431. IEEE, 2019.
[33] Ziwei Zhu, Yun He, Xing Zhao, and James Caverlee. Popularity bias in dynamic recommendation. 2021.