An Application of Causal Bandit to Content Optimization
Sameer Kanase1 , Yan Zhao1 , Shenghe Xu1 , Mitchell Goodman1 , Manohar Mandalapu1 ,
Benjamyn Ward1 , Chan Jeon1 , Shreya Kamath1 , Ben Cohen1 , Yujia Liu1 , Hengjia Zhang1 ,
Yannick Kimmel1 , Saad Khan1 , Brent Payne1 and Patricia Grao1
1
    Amazon, USA


                                             Abstract
                                             Amazon encompasses a large number of discrete businesses such as Retail, Advertising, Fresh, Business (B2B e-commerce),
                                             and Prime Video, most of which maintain a presence across its e-commerce website. They produce content for our customers
                                             that belong to diverse content types such as merchandising (e.g. product recommendations), product advertisements (e.g.
                                             sponsored products and display ads), program adoption banners (e.g. Amazon Fresh), and consumption (e.g. Prime Video).
                                             When customers visit a web page on the website, it triggers a content allocation process where we determine the specific
                                             content to show in regions of customer shopping experience on that web page. Content produced by the aforementioned
                                             businesses then needs to be arbitrated during this process. We present a causal bandit based framework to address the
                                             problem of content optimization in this context. The framework is responsible for fairly balancing the differing objectives
                                             and methods of these businesses, and selecting the right content to display to the customers at the right time. It does so with
                                             the goal of improving the overall site-wide customer shopping experience. In this paper, we present our content optimization
                                             framework, describe its components, demonstrate the framework’s effectiveness through online randomized experiments,
                                             and share learnings from deploying and testing the framework in production.

                                             Keywords
                                             Personalization, Recommender system, Content optimization, Content ranking, Content diversity, Causal bandit, Contextual
                                             bandit, View-through attribution, Holistic optimization


1. Introduction                                                                                                                          When customers visit a web page on the website, it
                                                                                                                                      triggers a content allocation process where we deter-
Amazon encompasses a large number of discrete busi-                                                                                   mine the specific content to show in the widget groups
nesses such as Retail, Advertising, Fresh, Business (B2B                                                                              on that web page. Content produced by the aforemen-
e-commerce), and Prime Video, most of which maintain a                                                                                tioned discrete businesses then needs to be arbitrated
presence across its e-commerce (or retail) website. These                                                                             during this process. As the common integration point,
discrete businesses produce content for our customers                                                                                 Amazon’s content optimization framework is responsi-
that belong to diverse content types such as merchandis-                                                                              ble for this content arbitration. It accomplishes this by
ing (e.g. product recommendations), product advertise-                                                                                fairly balancing the differing objectives and methods of
ment (e.g. sponsored products and display ads), program                                                                               these businesses through optimization capabilities, and
adoption banner (e.g. Amazon Fresh), and consumption                                                                                  by taking into account customer, content, and shopping
(e.g. Prime Video). Each such content is rendered in the                                                                              context. This results in the right content being shown
form of a widget within independent ‘regions of customer                                                                              to the customers at the right time thereby providing a
shopping experience’ on the website, also known as wid-                                                                               consistent and personalized shopping experience. The
get groups. For instance, widgets such as ‘customers who                                                                              content optimization framework is an ecosystem which
viewed this also viewed’ and ‘customers who bought this                                                                               enables businesses to interoperate independently by en-
also bought’ are displayed on product detail pages of the                                                                             abling content creators, customer shopping experience
website alongside other organic and advertising content.                                                                              providers, and web page owners to efficiently construct
The region of customer shopping experience on the web-                                                                                and serve content for the retail website.
site where the collection of such widgets are displayed is                                                                               In this paper, we present a causal bandit based frame-
an example of a widget group. We illustrate the concept                                                                               work to address the problem of content optimization
of a product (or an item), widget, and widget group in                                                                                with the objective of improving the overall customer
(figure 1).                                                                                                                           shopping experience on Amazon’s retail website. Our
                                                                                                                                      contributions include:
ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender
Systems and User Modeling, jointly with the 16th ACM Conference on
                                                                                                                                           • application of a contextual bandit framework to
Recommender Systems, September 23rd, 2022, Seattle, WA, USA
" kanases@amazon.com (S. Kanase); yzhaoai@amazon.com                                                                                         enable introduction of new content through on-
(Y. Zhao); shenghe@amazon.com (S. Xu); migood@amazon.com                                                                                     line randomized experiments (or A/B tests) and
(M. Goodman)                                                                                                                                 to learn the value (or benefit) of new content
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                       Attribution 4.0 International (CC BY 4.0).                                                            through exploration,
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
Figure 1: Below is an example of a widget group in a shopping page on Amazon’s retail website. It is part of the checkout
experience that surfaces after an item has been added to your cart. Here, we see carousels of products (or items) each of
which is associated with a title, for instance, "Try before you buy with Prime Wardrobe". Each such carousel of products is
rendered in the form of a widget which are marked by blue lines in the figure. A single product recommendation within a
widget is marked by the green border while the widget group, which is a collection of widgets, is marked by the orange border.
Note that widget groups are regions of customer shopping experience on the website, and each web page on the website can
contain one or more widget groups. While the example in this figure illustrates a widget as a carousel of products, they are not
limited to it. Widgets can also be used to render images, banners, advertisements, and other types of content. The homepage
of Amazon’s retail website illustrates this diversity in content type, where a widget is one of the many cards shown in the
customer’s feed. For reference, we have included an example of Amazon’s homepage in Appendix A.


     • an approach to measure the reward for actions                     ranking by using cross-content interactions, and
       taken by the contextual bandit framework,                       • learnings from the deployment of a low-latency
     • application of view-through attribution (VTA) to                  learning framework in production that reduces
       attribute reward in the context of content ranking                the delay in feedback and increases the velocity
       which only requires that content be impressed,                    of our learning loop.
     • utilization of an uplift modeling framework to
       augment VTA and to optimize for incremental               Finally, we also demonstrate the effectiveness of our
       benefit,                                                  framework through online A/B tests, and share results
                                                                 and insights gathered through the same.
     • a methodology to incorporate diversity in content
2. Related Work                                                formally define the problem we address as determining
                                                               ranked set 𝑇𝑞 from 𝐶𝑞 given contexts 𝑋 and 𝑍 so as to
Application of exploration strategies in the context of rec-   maximize the expected reward ℛ. Here, reward ℛ is a
ommender systems is an active area of research. In recent      measure of improved customer shopping experience on
years, multiple exploration strategies have emerged and        the retail website. We denote the metric for measuring
shown promising results [1, 2]. They include epsilon-          reward ℛ by 𝑀 𝑂𝐼, short for ‘metric of interest’. In our
greedy [3, 4], upper confidence bound (UCB) [5, 6],            setting, 𝑀 𝑂𝐼 takes into account the short-term as well
adding random noise to parameters [7, 8, 9], and boot-         as long-term impact to the customer’s shopping experi-
strap sampling [10, 11]. We adopt the Thompson sam-            ence, and helps us to fairly balance multiple and differing
pling algorithm [12] to balance exploration with exploita-     objectives of various stakeholders. It is computed us-
tion under the contextual bandit setting. Originally in-       ing actions taken by the customer after interacting with
troduced in 1933 [13], Thompson sampling has been              content such as impressions, clicks, purchases and other
widely adopted in the context of bandit problems recently      high-value actions. Note that our problem is different
[14, 15, 16]. It has been shown to achieve state-of-the-art    from that of ranking products (or items) within a single
results on some real-world use cases and be robust to          widget for a particular recommender system.
delay [17, 18].                                                   A key challenge we face in predicting 𝑀 𝑂𝐼 using 𝑋
   Uplift modeling is a widely used approach to mea-           and 𝑍 is that of the estimate being biased due to the cold-
sure incremental effect [19, 20, 21, 22]. Our approach to      start problem. New content gets continually introduced
estimate incremental effect or benefit is similar to the       to be shown on Amazon’s retail website while existing
meta-learning approach presented in [23, 24]. In [25], the     content can be sunsetted at any point of time. Empirically,
authors presented an application of a causal bandit in tar-    we observe a propensity in customers to interact more
geting campaigns. They estimated incremental effect to         with content displayed higher up in the widget group and
optimize for clicks in email marketing campaigns and ad-       on the web page. Furthermore, we only observe reward
vertisement campaigns on Amazon’s mobile homepage.             for content that was shown to customers before, but we
In this work, we explore an application of causal bandit       only show content to customers for which we predict
for content optimization by estimating and optimizing          there will be sufficient reward. Consequently, content
for heterogeneous treatment effect [26].                       with few or no prior observations is unlikely to be ranked
                                                               higher or chosen to be shown to the customers even if
                                                               it could generate a high-reward in the counterfactual
3. Problem Description                                         event where a customer were to interact with it. Here,
Let 𝒫 be the set of all web pages and 𝒬 be the set of          we could use aggregate-level features to partially address
all widget groups on Amazon’s retail website. Here, a          the cold-start problem but cannot fully solve it. Moreover,
widget group refers to real estate or region of customer       we observe that customer preferences and their interac-
shopping experience on the website which can be pop-           tions with content change over time. To address these
ulated with content 𝑐 in the form of a widget. Content         challenges, we use a contextual bandit based framework
𝑐 can belong to diverse types of content such as prod-         to create a learning loop for new content that has never
uct advertisements (e.g. sponsored products and display        been shown before to the customers and to dynamically
ads), merchandising (e.g. product recommendations),            adapt to changing customer preferences.
program adoption banners (e.g. Amazon Fresh), and con-
sumption (e.g. Prime Video). Each widget group 𝑞 ∈ 𝒬           4. Methodology
in turn can render (or display) a set of ranked content
𝑇𝑞 = {𝑐𝑟 | 𝑐𝑟 ∈ 𝐶𝑞 𝑎𝑛𝑑 𝑟 ∈ {1, . . . , 𝑘𝑞 }} where 𝑟 is        In this section, we present a causal bandit based frame-
the rank of content rendered in widget group 𝑞, 𝑘𝑞 is          work to address the problem of content optimization.
the total number of content that can be rendered in 𝑞,
and 𝐶𝑞 is the set of all possible candidate content that is
eligible to be rendered in 𝑞. Here, the cardinality of set
                                                               4.1. Features
𝐶𝑞 >> 𝑘𝑞 . Eligibility for rendering content 𝑐 in widget       When a customer visits web page 𝑝 ∈ 𝒫 on Amazon’s
group 𝑞 is typically determined by business rules and          retail website, we receive customer and shopping con-
content creators.                                              text 𝑋. Context 𝑍 corresponding to each candidate con-
   When a customer visits web page 𝑝 ∈ 𝒫 on Amazon’s           tent can also be generated separately. We then com-
retail website, a request is generated with customer and       bine contexts 𝑋 and 𝑍 non-linearly to form a single
shopping context 𝑋 to optimize and display content for         d-dimensional vector 𝐵 ∈ R𝑑 . We also include second-
widget group 𝑞 on page 𝑝. Context for candidate content        and third-order interaction terms between the explana-
𝑐 ∈ 𝐶𝑞 can be constructed and is denoted by 𝑍. We now          tory variables observed in the context. For reference, we
include a few examples of context below:                     we observe that results from such an approach can be
                                                             mixed in that it may improve customer shopping expe-
       • Shopping context: region, web page type, widget rience on some web pages of the website but not all of
         group id, page item, metadata of page item, and them.
         search query                                           To address these challenges, we propose optimizing
       • Customer context: recent interaction events, cus- directly for overall down-session value generated after
         tomer signed-in status, and prime membership customer has interacted with content. In this approach,
         status                                              once content has been ranked and rendered, we record
       • Content context: widget id, widget meta informa- customer’s interaction events with it such as impressions,
         tion, and content attributes                        clicks, purchases and other high-value actions. There-
                                                             after, we measure the aggregate value generated from
4.2. Ranking Model                                           these events over a subsequent time horizon to compute
                                                             our metric of interest 𝑀 𝑂𝐼. The measured value is at-
We formulate the problem of content optimization as that tributed to content as reward, if it meets a predefined
of learning to rank the set of eligible content 𝐶𝑞 . Our aim criteria. Content ranking models then learn to predict for
is to determine the rank of each eligible candidate content this down-session value of showing content to customers
𝑐 ∈ 𝐶𝑞 and return the 𝑡𝑜𝑝 − 𝐾 ranked content 𝑇𝑞 so given a context, and make ranking decisions based on
as to render them in widget group 𝑞. To do so, we need the predicted value. This approach enables us to measure
a utility function using which we can evaluate eligible and attribute site-wide impact across all devices, apps,
content and rank them. We propose using reward ℛ to widget groups, and web pages from the moment a cus-
be generated over a subsequent time horizon in the event tomer has interacted with content. We call this approach
content were to be shown to a customer, 𝒮 ∈ {0, 1}, to define reward and rank diverse type of content using
as our utility function. We model it using a generalized aggregate down-session value as holistic optimization.
linear model,

                𝐸(𝑅|𝑆, 𝐵) = 𝑔(𝐵 ⊤ 𝑊 )                   (1)    4.4. Attribution
where, g is the link function. Since reward ℛ takes            The predefined criteria used to attribute aggregate down-
continuous values in our problem setting, we choose            session value as reward also defines the form of attri-
an identity link function. We use the set of past obser-       bution such as click-through attribution (CTA) or view-
vations 𝐻 made up of triplets of context, action and           through attribution (VTA). The distinction between these
reward {(𝑋𝜏 , 𝐴𝜏 , 𝑅𝜏 ), 𝜏 = 1, . . . , 𝑡 − 1} to train the    two forms of attributions is the customer interaction
ranking model and estimate regression parameters using         event that triggers the measurement of reward. In CTA,
a Bayesian framework.                                          reward is measured after a click event with content oc-
                                                               curs while in VTA reward is measured after a view event
                                                               with content occurs. Note that both VTA and CTA are
4.3. Reward                                                    a form of equal credit attribution model. Likewise, the
A fundamental challenge in our problem setting is that         time horizon over which the reward is measured is called
of defining and measuring reward so as to evaluate di-         an attribution window. The window is triggered after a
verse types of content together on an equal footing [27].      customer interaction event with content occurs. We de-
When content optimization systems seek to maximize             termine attribution windows by performing exploratory
the attributed value (or reward) to individual content,        data analysis of the length of customer shopping sessions
we observe that it leads to development and launch of          and use multiple windows in practice to cater to varied
bespoke recommender systems that optimize for individ-         use cases. We illustrate the concept of VTA and CTA
ual objectives and cater to page specific use cases. For       with an attribution window using the example in (figure
instance, recommender systems displayed on different           2).
web pages can optimize for increasing customer inter-              A key drawback of CTA is that it cannot attribute re-
actions with themselves through view, clicks, purchases        ward to content that cannot be clicked or where clicking
and other high-value actions without being complemen-          on content does not necessarily indicate a positive cus-
tary (or incremental) to the customer’s current shopping       tomer shopping experience. We observe that CTA also
intent. This often results in a poor customer shopping         leads ranking models to favor content that has a high
experience which in turn leads to a negative impact to         click propensity. Consequently, such models promote
business metrics such as revenue. An alternative here is       content which at times is not relevant to customer’s on-
to attribute value to individual content only if interaction   going shopping mission. This distracts the customer from
with it is in addition to purchase of the page item (or        their mission which in turn results in a negative impact
product) wherever applicable. In empirical evaluation,         to their shopping experience. VTA on the other hand
Figure 2: Here, content (c1) corresponding to view (v1) will be attributed with MOI of 30 (= 10 + 20) under view-through
attribution. Likewise, under click-through attribution content (c2) corresponding to click (cl-1) will be attributed with MOI of
30 (= 10 + 20) while content (c3) corresponding to click (cl-2) will be attributed with MOI of 50 (= 20 + 30). Note that content
(c1) corresponding to view (v1) will be attributed MOI=0 under click-through attribution.


allows us to capture both the positive and negative im-          observed down-session value as reward without account-
pact of presenting content to customers. It enables us           ing for the counterfactual outcome could lead to models
to capture the value of showing content which inspires           overestimating the predicted benefit at inference time.
customer shopping missions including scenarios where             To address these challenges, we use an uplift modeling
customers can compare selection without requiring di-            framework. It estimates Conditional Average Treatment
rect interaction with content. Furthermore, it is closer in      Effect (CATE) [28] between exposure and non-exposure
alignment with how an experimentation framework for              of content to customers using observational data. We
conducting online A/B tests may measure and attribute            assume conditional unconfoundedness in our problem
aggregate downstream impact after a customer has been            setting [29, 30]. Then,
exposed to a new shopping experience (or treatment).
Thus, it can enable parity in the methodology used to               𝐶𝐴𝑇 𝐸       ≡     𝐸[𝑅(1) − 𝑅(0)|𝐵 = 𝑏]                   (2)
attribute reward in the content optimization and online                         =     𝐸[𝑅(1)|𝐵 = 𝑏] − 𝐸[𝑅(0)|𝐵 = 𝑏]
experimentation systems.
                                                                 where, 𝐵 is the d-dimensional feature vector. Here,
                                                                 𝐸[𝑅(1)|𝐵 = 𝑏] is the mean of the treated group
4.5. Uplift Modeling Framework
                                                                 where content 𝑐 is shown in the shopping session, and
Both VTA and CTA assume a causal relationship between            𝐸[𝑅(0)|𝐵 = 𝑏] is the mean of the untreated group where
customers interacting with shown content through views           content 𝑐 is not shown in the shopping session. We have
and clicks (cause), and observed reward (effect). In the         explored two approaches to estimate the latter: i) using
case of CTA, there is a strong connection between the            the mean of untreated group calculated from our ranking
cause and effect as often times a click is an intentional        logs as a biased estimate for 𝐸[𝑅(0)|𝐵 = 𝑏], and ii) by
action on the part of a customer. However, with VTA              estimating 𝐸[𝑅(0)|𝐵 = 𝑏] from randomized controlled
we cannot establish this direct connection between a             trials.
customer viewing content and the observed reward. As                The uplift modeling framework is then defined using
such, we assume a causal relationship which introduces           a two-part model. First, a baseline model estimates the
noise in our observations. Models incrementally trained          expected counterfactual reward when content 𝑐 is ranked
using such observations are likely to have a high variance       but not shown in the shopping session. We illustrate the
in the predictions.                                              underlying theory using a linear regression model:
   In addition, the attribution model described in the pre-
vious section does not capture the incremental value of                    𝜇0 = 𝐵 ⊤ 𝛽 + 𝜖                  (3)
showing content. Customers can have an underlying
propensity to shop products or consume content based Treatment or incremental effect for each observation in
on prior exposure or affinity. As a result, attributing the treated group where content 𝑐 is ranked and shown
in the shopping session is then estimated as:             parameters by a small fraction of their existing value
                    (1)                                   to account for changes in the environment [31]. This
                   𝐷𝑖 = 𝑅𝑖 − 𝜇
                             ˆ 0 (𝑏)                  (4)
                                                          completes the feedback loop which allows us to contin-
where, 𝑅𝑖 is the observed down-session reward for ob- uously explore actions and expand our knowledge for
                   (1)
servation 𝑖, and 𝐷𝑖 is the imputed incremental effect making better decisions in the future. This in turn en-
for observation 𝑖 in the treated group. In the second ables us to support running online A/B tests using which
part, pseudo-effect 𝐷 is used as the target variable in content creators can introduce new content across Ama-
our ranking model, described in (eqn. 1), to predict the zon’s retail website with the goal of improving customer
incremental benefit of showing content 𝑐 to a customer shopping experience and measure its benefit while doing
in widget group 𝑞.                                        so.

4.6. Exploration Strategy                                      4.7. Incorporating Diversity in Ranking
The exploration component of our content optimization           Showing high-relevance content without taking content
framework explores content with few observations from           diversity into account leads to monotony and tends to
the past. To do so, it aims at solving a contextual bandit      make the holistic shopping experience less meaningful
problem. Here, we use Thompson sampling, an algo-               for the customers. Optimizing for the whole widget group
rithm widely used to balance exploration and exploita-          involves balancing relevance and diversity of the content
tion. It suggests to randomly play each arm according to        therein, where the whole-widget group effect is repre-
its probability of being optimal. In our problem setting, it    sented using the amount of similar content displayed
means choosing content proportional to the probability          in it. One approach followed here is to model this as a
of it being optimal. This implies we won’t be necessarily       submodular optimization problem [32]. In [33, 34], the
choosing content with the highest expected incremental          authors propose using submodular functions which have
benefit at each time step. It is a trade-off we make to ex-     a diminishing returns property. In their approach, the
plore content with few observations from the past which         total score for a content is derived from its relevance
have high uncertainty but ultimately may drive a higher         while also accounting for the decreasing utility of show-
reward. In practice, we apply the Thompson sampling             ing multiple content of the same type. As a result, the
algorithm by sampling model parameters 𝑊      ˆ 𝑡 from their    value of selecting content from a given category or type
posterior distributions followed by choosing content that       decreases as a function of the number of content belong-
maximizes the reward.                                           ing to that type already selected. A key shortcoming of
                                                                this approach is that it lacks a feedback loop and param-
Algorithm 1 Thompson Sampling Algorithm for Con- eters of the diversity scoring function aren’t learned to
tent Optimization                                               optimize for the same objective as the relevance scoring
  1: for 𝑡 = 1, . . . , 𝑇 do                                    function.
  2:     for all 𝑟𝑎𝑛𝑘 = 1, . . . , 𝑘𝑞 do ◁ 𝑘𝑞 is the value         Instead we propose a two-stage model for incorporat-
     of 𝐾 corresponding to widget group 𝑞                       ing diversity into content. We first rank all the eligible
  3:         Receive context 𝑋𝑡                                 content 𝑐 ∈ 𝐶𝑞 to be shown in widget group 𝑞 using our
  4:         Sample 𝑊    ˆ 𝑡 from the posterior distribution underlying ranking model (eqn. 1). Thereafter, we iter-
     Pr(𝑊 |𝐻𝑡 )                                                 atively re-rank content at each position 𝑟 in the widget
  5:         Select 𝐴𝑡𝑟𝑎𝑛𝑘 = argmax𝐴 𝐵 ⊤ 𝑊      ˆ𝑡              group by taking into account the content that is already
  6:     end for                                                ranked in the previous 𝑟 − 1 positions. This is accom-
  7:     Choose 𝑡𝑜𝑝 − 𝐾 arms and observe reward 𝑅               plished by using a second ranking model which includes
  8:     𝐻𝑡                    =              𝐻𝑡−1           ∪ cross-content interaction features. To capture these in-
     {(𝑥𝑡𝑟𝑎𝑛𝑘 , 𝑎𝑡𝑟𝑎𝑛𝑘 , 𝑟𝑡𝑟𝑎𝑛𝑘 ), 𝑟𝑎𝑛𝑘 = 1, . . . , 𝑘𝑞 }       teractions, we categorize each content 𝑐 in to one of 𝑚
  9: end for                                                    distinct categories or types. The goal is to then select
                                                                an optimal number of highly relevant widgets in each
   Candidates that are ranked and chosen to be displayed category. For a given widget group 𝑞 ∈ 𝒬, the overall
by the ranking model are then logged along with their value of widget group 𝑞 is represented as:
observed reward in the form of triplets (𝑥𝑡 , 𝑎𝑡 , 𝑟𝑡 ). There-                              𝑘
after, we estimate the incremental effect for each obser-
                                                                                            ∑︁
                                                                             𝑓 (𝑐𝑘 |𝑞, 𝑋) =      𝑓 (𝑐𝑟 |𝑇𝑞(𝑟−1) , 𝑋, 𝑊 ) (5)
vation in the logged feedback using our uplift modeling                                     𝑟=1
framework. The incremental effects are subsequently               𝑓 (𝑐𝑟 |𝑇𝑞(𝑟−1) , 𝑋, 𝑊 ) = 𝑔(𝐵 ⊤ 𝑊 )                    (6)
used as target variables to incrementally train our rank-
ing model using a batch update under the Thompson Sam- where, 𝑐𝑟 is the content at rank 𝑟 in widget group 𝑞,
pling framework. While doing so, we decay the model 𝑇𝑞(𝑟−1) is the set of content allocated in the top 𝑟 − 1
positions, 𝑋 is the request and customer context received than before as we retrained the models at a faster ca-
as before, 𝑊 are the model parameters, and 𝑔 is a gener- dence. This impacts the model’s learning process in two
alized linear model.                                         ways: i) outliers in the dataset can cause the model to
                                                             incorrectly associate higher potential reward for some
                                                             content despite winsorization techniques, and ii) insuf-
5. Low Latency Learning                                      ficient data can limit the model’s ability to learn about
     Framework                                               content’s reward distribution especially in regions with
                                                             small amount of traffic. Empirically, we observe that both
In production, we observed that our content optimiza- of these scenarios lead to over-exploration of content.
tion framework suffered from a delay in feedback as it          We address this challenge using Bayesian Regulariza-
needed multiple days on average to complete the learning tion. The use of Gaussian priors has been established in
loop. This involves logging the feedback after customers [35] as a form of L2 regularization wherein the following
are shown with content on the website, measuring and equivalence is explored:
attributing reward in our data pipelines, incrementally
training the models at a daily cadence, and deploying
                                                                                             [︁ 𝑃 (𝐵|𝑊 )𝑃 (𝑊 ) ]︁
                                                                 (𝐵𝐵 𝑇 + 𝛼𝐼)−1 𝐵 𝑇 𝑅 = E                              (7)
the retrained models in production. A delay in feedback                                             𝑃 (𝐵)
has the following consequences in production for our
content optimization framework:                              Here, instead of initiating the feature weights from a
                                                             static prior (i.e. mean 0 and variance 1), we derive a
      • When new content is introduced into the ecosys- prior distribution from previously learned weight distri-
         tem, the optimization framework is not able to butions in order to allow for a more pessimistic explo-
         effectively estimate its potential benefit, and the ration regime. For instance, by using a prior representing
         content is subject to exploration as expected. Due 20th percentile mean of all features and 75th percentile
         to the delay in feedback, this can result in new variance of all features, we can reduce the chances of
         content being explored at a higher show-rate over-exposure for new content during the learning pe-
         for the duration of the delay during the initial riod. This improvement in turn has enabled us to launch
         learning period before sufficient observations are the L3 framework in production and reduced the delay
         logged for the model to learn from its own feed- in feedback by 90%.
         back. This in turn can result in sub-optimal deci-
         sion making and introduction of poor customer
         shopping experience during the learning period. 6. Experiments
      • While the cost of exploration may be amortized
                                                             We first evaluate our content optimization framework
         over long running content campaigns, a longer
                                                             using both traditional offline evaluation and off-policy
         feedback loop induces limitations in realizing ben-
                                                             evaluation methodologies [36, 37]. This allows us to eval-
         efit during high-value events such as Cyber Mon-
                                                             uate and prune alternative treatment policies before intro-
         day where new content may be introduced for a
                                                             ducing them in online randomized experiments [38, 39].
         short period of time. In such cases, some content
                                                             Here, we use regression and ranking metrics to evalu-
         promoting sales or other events will be turned off
                                                             ate the framework quantitatively, and content’s share-of-
         even before their benefit is effectively learned by
                                                             voice and ranking distributions to evaluate it qualitatively
         the model.
                                                             using domain knowledge. Subsequently, we demonstrate
      • A longer feedback loop also decreases the velocity
                                                             the effectiveness of our framework through five online
         of running online A/B tests where new content
                                                             A/B tests.
         and improvements to optimization framework
         may be introduced with the aim of improving
         customer shopping experience.                       6.1. Online Experiment Setting
To address these challenges, we developed a Low Latency In our online experimentation setting, observational
Learning (L3) framework which has reduced the learning  units (or shopping sessions) are randomly exposed to
loop for our content optimization framework by 90% from either the baseline control policy or the alternative treat-
multiple days to a couple of hours.                     ment policies. Here, we track the impact to our metric
                                                        of interest 𝑀 𝑂𝐼, which is a measure of improved site-
                                                        wide customer shopping experience. In the results, we
5.1. Bayesian Regularization                            include the causal effect w.r.t percentage improvement
A key challenge we encountered in the development of in this metric at Amazon’s scale. The experiments are
L3 framework was that of the number of samples avail- conducted across all of Amazon’s world-wide market-
able for incremental training being significantly lower places and product categories. Level of significance 𝛼 for
Table 1                                                    three sizes – small, medium and large. Here, the size of a
Online experiment results.                                 rendered image can influence the customer’s understand-
                                                           ing of the product. Hence, we want to select and render
       Experiment Incremental Impact p-value
                                                           an optimal size of the same product image so as to help
                        (% improvement)
                                                           the customers evaluate products better especially for high
       EXP-1                 +0.16%           0.02         consideration purchases. This is a use case where click-
       EXP-2A                +0.01%           0.12         through attribution cannot be used as clicking on the
       EXP-2B                +0.01%           0.09         content does not necessarily indicate a positive customer
       EXP-3                 +0.09%           0.02
                                                           shopping experience. We formulate the task of optimal
       EXP-4                 +0.05%           0.19
       EXP-5                 +0.11%           0.00         image size selection as a learning to rank problem, and
                                                           use view-through attribution to measure and attribute
                                                           reward to the rendered image size. To demonstrate the
                                                           effectiveness of this approach, we ran two experiments –
these experiments was determined by Amazon’s business
                                                           one each for desktop and mobile surfaces. In the control
objectives and was set to 0.10. Duration for these experi-
                                                           group, image size was selected by a rule-based system
ments was estimated from statistical power analysis. We
                                                           while in the treatment group, our framework ranked the
allocated equal traffic to both the control and treatment
                                                           image size variations and chose the top ranked variation
groups. During the course of the experiment, the models
                                                           to render. In both the experiments (EXP-2A and EXP-2B),
were incrementally trained using their own set of logged
                                                           we observe an improvement in the 𝑀 𝑂𝐼 metric which
feedback.
                                                           is practically significant at Amazon’s scale.

6.2. Experiment 1: Application of the                          6.4. Experiment 3: Application of the
     Holistic Optimization Framework                                Causal Bandit Framework
We first test the effectiveness of our holistic optimization
                                                               After demonstrating the effectiveness of VTA, we tested
framework to rank content in a widget group on product
                                                               the utility of the uplift modeling framework. The frame-
detail pages of Amazon’e retail website. This is a region
                                                               work allows us to measure and optimize for incremental
of customer shopping experience on the website where
                                                               value generated by content, and reduces the observa-
we usually see organic content such as ‘customers who
                                                               tional bias in data. We conducted an experiment in a
viewed this also viewed’ and ‘customers who bought this
                                                               widget group which is located at the bottom of product
also bought’ widgets being displayed alongside adver-
                                                               detail pages on the desktop retail website where person-
tising content. A key challenge in dynamically ranking
                                                               alized content that is usually generated by taking recent
content in this setting was that of attribution of reward
                                                               browsing history into account is shown. This in turn
to diverse type of content which were generated by con-
                                                               allowed us to test our hypothesis that customers can
tent creators who optimized for differing business ob-
                                                               have an underlying propensity to shop products or con-
jectives. As such, our framework needed to arbitrate
                                                               sume content based on prior exposure or affinity, and
content during the content allocation process and fairly
                                                               optimizing for incremental benefit can result in a posi-
balance the differing objectives. Since holistic optimiza-
                                                               tive customer shopping experience. In the control group,
tion framework measures reward using the aggregate
                                                               content was ranked by a linear bandit without using the
down-session value after customer has interacted with
                                                               uplift modeling framework while rewards were measured
content, we wanted to test its effectiveness in addressing
                                                               and attributed using CTA. In the treatment group, con-
this problem. In the control group, content was statically
                                                               tent was ranked using a linear causal bandit with VTA.
ranked by a rule-based system while in the treatment
                                                               In the results (EXP-3), we observe that the linear causal
group, our framework dynamically ranked content using
                                                               bandit using VTA performed better than the linear ban-
the holistic optimization framework. In the results (EXP-
                                                               dit which did not use uplift modeling framework. The
1), we observe a practically and statistically significant
                                                               improvement in 𝑀 𝑂𝐼 metric was both practically and
improvement in the 𝑀 𝑂𝐼 metric which is a measure of
                                                               statistically significant.
site-wide improvement in customer shopping experience.

                                                               6.5. Experiment 4: Application of
6.3. Experiment 2: Application of
                                                                    Incorporating Diversity in Ranking
     View-through Attribution
                                                               Subsequently, we ran an experiment (EXP-4) on the desk-
In this experiment, we applied our content optimization        top homepage of Amazon’s retail website to test the im-
framework to the image size selection problem. Usually,        pact of incorporating diversity in content. In the control
product display images on Amazon’s detail page exist in        group, content was ranked using just the single baseline
ranking model, while in the treatment group, content           learnings from the deployment of a low-latency learning
was ranked using the two-stage ranking model – first           framework in production that has reduced the delay in
using the baseline model followed by a re-ranking model        feedback and shortened the learning loop by 90%. Here,
which incorporates diversity using cross-content interac-      we described our application of Gaussian prior as a form
tion features. Here, we observe a practically significant      of L2 regularization which in turn enabled the launch
improvement in the 𝑀 𝑂𝐼 metric. Based on the results,          of the L3 framework. We then demonstrated the effec-
we infer that incorporating diversity into content ranking     tiveness of our methodology through multiple online
can lead to a better customer shopping experience.             experiments, and shared results and insights gathered
                                                               through the same. Finally, we believe our methodology
6.6. Experiment 5: Application of the                          and learnings are generic and can be extended to con-
                                                               tent optimization problems in other domains. It can also
     Low Latency Learning Framework                            be extended to rank items (or products) within a single
L3 pipeline has shortened the delay in feedback for our        widget for a product recommendation system.
contextual bandit based framework by 90%. As a result,
we expect the bandit retrained at hourly cadence to con-
verge sooner and perform better than the one retrained         References
at a daily cadence. To test the benefit and measure the         [1] C. Riquelme, G. Tucker, J. Snoek, Deep bayesian
impact of low latency learning, we ran an experiment                bandits showdown: An empirical comparison of
on the mobile homepage of Amazon’s retail website. In               bayesian deep networks for thompson sampling, in:
the control group, content was ranked by a linear bandit            International Conference on Learning Representa-
incrementally trained at a slower cadence with a learning           tions, 2018.
loop of multiple days, while in the treatment group, con-       [2] A. Bietti, A. Agarwal, J. Langford, A contextual
tent was ranked by a linear bandit incrementally trained            bandit bake-off, arXiv preprint arXiv:1802.04064
at a faster cadence with a learning loop of a few hours.            (2018).
In the results (EXP-5), we observe that the bandit with         [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Ve-
a shorter delay in feedback performed better w.r.t our              ness, M. G. Bellemare, A. Graves, M. A. Riedmiller,
metric of interest 𝑀 𝑂𝐼 where the improvement was                   A. Fidjeland, G. Ostrovski, S. Petersen, C. Beat-
both practically and statistically significant. Based on the        tie, A. Sadik, I. Antonoglou, H. King, D. Kumaran,
results, we infer that reducing the delay in feedback and           D. Wierstra, S. Legg, D. Hassabis, Human-level con-
increasing the velocity of learning loop has a positive             trol through deep reinforcement learning, Nature
impact on customer shopping experience.                             518 (2015) 529–533. doi:10.1038/nature14236.
                                                                [4] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Priori-
7. Conclusion                                                       tized experience replay, in: 4th International Con-
                                                                    ference on Learning Representations, ICLR 2016,
In this paper, we presented a causal bandit framework               San Juan, Puerto Rico, May 2-4, 2016, Conference
to address the problem of content optimization with the             Track Proceedings, 2016.
objective of improving the overall customer shopping            [5] T. Lai, H. Robbins, Asymptotically efficient adaptive
experience on Amazon’s e-commerce (or retail) website.              allocation rules, Advances in Applied Mathematics
Therein, we introduced a holistic optimization frame-               6 (1985) 4–22. doi:https://doi.org/10.1016/
work that enables us to define reward and rank diverse              0196-8858(85)90002-8.
types of content using aggregate down-session value;            [6] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time
presented the concept of view-through attribution; dis-             analysis of the multiarmed bandit problem, Mach.
cussed how it addresses some of the shortcomings of                 Learn. 47 (2002) 235–256. doi:10.1023/A:101368
click-through attribution; and presented applications of            9704352.
VTA in ranking content belonging to diverse type. To ad-        [7] Y. Gal, Z. Ghahramani, Dropout as a bayesian ap-
dress the shortcomings of view-through attribution, we              proximation: Representing model uncertainty in
used an Uplift modeling framework which has enabled                 deep learning, in: Proceedings of the 33rd Interna-
us to rank content using incremental or causal bene-                tional Conference on International Conference on
fit instead of overall value. Subsequently, we proposed             Machine Learning - Volume 48, ICML’16, JMLR.org,
a two-stage model to incorporate diversity in content               2016, p. 1050–1059.
ranking by using cross-content interaction features. It         [8] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y.
helps us to balance relevance with diversity in content             Chen, X. Chen, T. Asfour, P. Abbeel, M. Andrychow-
shown on Amazon’s retail website and provide a mean-                icz, Parameter space noise for exploration, in: Inter-
ingful experience to our customers. Thereafter, we shared
     national Conference on Learning Representations,              ternational Conference on Predictive Applications
     2018.                                                         and APIs, PMLR, 2017, pp. 1–13.
 [9] M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hes-    [23] S. R. Künzel, J. S. Sekhon, P. J. Bickel, B. Yu, Met-
     sel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Has-         alearners for estimating heterogeneous treatment
     sabis, O. Pietquin, C. Blundell, S. Legg, Noisy net-          effects using machine learning, Proceedings of the
     works for exploration, in: International Conference           national academy of sciences 116 (2019) 4156–4165.
     on Learning Representations, 2018.                       [24] Z. Zhao, T. Harinen, Uplift modeling for multiple
[10] I. Osband, B. Van Roy, Bootstrapped thompson sam-             treatments with cost optimization, in: 2019 IEEE
     pling and deep exploration, 2015. doi:10.48550/A              International Conference on Data Science and Ad-
     RXIV.1507.00300.                                              vanced Analytics (DSAA), IEEE, 2019, pp. 422–431.
[11] I. Osband, C. Blundell, A. Pritzel, B. Van Roy, Deep     [25] N. Sawant, C. B. Namballa, N. Sadagopan, H. Nas-
     exploration via bootstrapped dqn, in: Advances in             sif, Contextual multi-armed bandits for causal
     Neural Information Processing Systems 29, Curran              marketing, in: ICML 2018, 2018. URL: https:
     Associates, Inc., 2016, pp. 4026–4034.                        //www.amazon.science/publications/contextu
[12] D. Russo, B. V. Roy, A. Kazerouni, I. Osband, A tuto-         al-multi-armed-bandits-for-causal-marketing.
     rial on thompson sampling, CoRR abs/1707.02038           [26] Y. Zhao, M. Goodman, S. Kanase, S. Xu, Y. Kimmel,
     (2017). arXiv:1707.02038.                                     B. Payne, S. Khan, P. Grao, Mitigating targeting
[13] W. R. Thompson, On the likelihood that one un-                bias in content recommendation with causal ban-
     known probability exceeds another in view of the              dits, in: Proceedings of the 2nd Workshop on Multi-
     evidence of two samples, Biometrika 25 (1933) 285–            Objective Recommender Systems (MORS 2022), in
     294.                                                          conjunction with the 16th ACM Conference on
[14] M. Strens, A bayesian framework for reinforcement             Recommender Systems (RecSys 2022), Seattle, WA,
     learning, in: ICML, volume 2000, 2000, pp. 943–950.           USA, 2022.
[15] S. L. Scott, A modern bayesian look at the multi-        [27] S. Xu, Y. Zhao, S. Kanase, M. Goodman, S. Khan,
     armed bandit, Applied Stochastic Models in Busi-              B. Payne, P. Grao, Machine learning attribution:
     ness and Industry 26 (2010) 639–658.                          Inferring item-level impact from slate recommen-
[16] L. Li, W. Chu, J. Langford, R. E. Schapire, A                 dation in e-commerce, in: KDD 2022 Workshop
     contextual-bandit approach to personalized news               on First Content Understanding and Generation for
     article recommendation, in: Proceedings of the                e-Commerce, 2022. URL: https://www.amazon.sci
     19th International Conference on World Wide Web,              ence/publications/machine-learning-attribution-i
     WWW ’10, Association for Computing Machinery,                 nferring-item-level-impact-from-slate-recomm
     New York, NY, USA, 2010, p. 661–670. doi:10.114               endation-in-e-commerce.
     5/1772690.1772758.                                       [28] S. Athey, G. Imbens, Recursive partitioning for
[17] O. Chapelle, L. Li, An empirical evaluation of                heterogeneous causal effects, Proceedings of the
     thompson sampling, in: Proceedings of the 24th                National Academy of Sciences 113 (2016) 7353–7360.
     International Conference on Neural Information                doi:10.1073/pnas.1510489113.
     Processing Systems, NIPS’11, Curran Associates           [29] D. Rubin, Estimating causal effects of treatments in
     Inc., Red Hook, NY, USA, 2011, p. 2249–2257.                  randomized and nonrandomized studies., Journal
[18] S. Agrawal, N. Goyal, Thompson sampling for con-              of Educational Psychology 66 (1974) 688–701.
     textual bandits with linear payoffs, in: International   [30] G. W. Imbens, D. B. Rubin, Causal Inference for
     Conference on Machine Learning, 2013, pp. 127–                Statistics, Social, and Biomedical Sciences: An
     135.                                                          Introduction, Cambridge University Press, 2015.
[19] B. Hansotia, B. Rukstales, Incremental value mod-             doi:10.1017/CBO9781139025751.
     eling, Journal of Interactive Marketing 16 (2002)        [31] T. Graepel, J. Q. n. Candela, T. Borchert, R. Herbrich,
     35.                                                           Web-scale bayesian click-through rate prediction
[20] V. S. Y. Lo, The true lift model: A novel data mining         for sponsored search advertising in microsoft’s bing
     approach to response modeling in database mar-                search engine, in: Proceedings of the 27th Interna-
     keting, SIGKDD Explor. Newsl. 4 (2002) 78–86.                 tional Conference on International Conference on
     doi:10.1145/772862.772872.                                    Machine Learning, ICML’10, Omnipress, Madison,
[21] N. Radcliffe, Using control groups to target on               WI, USA, 2010, p. 13–20.
     predicted lift: Building and assessing uplift model,     [32] Y. Yue, C. Guestrin, Linear submodular bandits and
     Direct Marketing Analytics Journal (2007) 14–21.              their application to diversified retrieval, in: Ad-
[22] P. Gutierrez, J.-Y. Gérardy, Causal inference and             vances in Neural Information Processing Systems,
     uplift modelling: A review of the literature, in: In-         volume 24, Curran Associates, Inc., 2011.
[33] C. H. Teo, H. Nassif, D. Hill, S. Srinivasan, M. Good-        D. Coey, M. Curtis, A. Deng, W. Duan, P. Forbes,
     man, V. Mohan, S. Vishwanathan, Adaptive, person-             B. Frasca, T. Guy, G. W. Imbens, G. Saint Jacques,
     alized diversity for visual discovery, in: Proceed-           P. Kantawala, I. Katsev, M. Katzwer, M. Konut-
     ings of the 10th ACM Conference on Recommender                gan, E. Kunakova, M. Lee, M. Lee, J. Liu, J. Mc-
     Systems, RecSys ’16, Association for Computing                Queen, A. Najmi, B. Smith, V. Trehan, L. Vermeer,
     Machinery, New York, NY, USA, 2016, p. 35–38.                 T. Walker, J. Wong, I. Yashkov, Top challenges from
[34] H. Nassif, K. O. Cansizlar, M. Goodman, S. V. N.              the first practical online controlled experiments
     Vishwanathan, Diversifying music recommenda-                  summit, SIGKDD Explor. Newsl. 21 (2019) 20–35.
     tions, in: ICML 2016, 2016.                                   doi:10.1145/3331651.3331655.
[35] M. A. Figueiredo, Adaptive sparseness for super-         [39] T. S. Richardson, Y. Liu, J. Mcqueen, D. Hains, A
     vised learning, IEEE transactions on pattern analy-           bayesian model for online activity sample sizes, in:
     sis and machine intelligence 25 (2003) 1150–1159.             Proceedings of The 25th International Conference
[36] A. Swaminathan, T. Joachims, The self-normalized              on Artificial Intelligence and Statistics, volume 151
     estimator for counterfactual learning, in: Advances           of Proceedings of Machine Learning Research, PMLR,
     in Neural Information Processing Systems, vol-                2022, pp. 1775–1785.
     ume 28, Curran Associates, Inc., 2015.
[37] T. Schnabel, A. Swaminathan, A. Singh, N. Chan-
     dak, T. Joachims, Recommendations as treatments:
     Debiasing learning and evaluation, in: Proceedings
                                                              APPENDIX
     of the 33rd International Conference on Interna-
     tional Conference on Machine Learning - Volume
                                                              A. An Illustration of Diverse Type of
     48, ICML’16, JMLR.org, 2016, p. 1670–1679.                  Content on Amazon’s Homepage
[38] S. Gupta, R. Kohavi, D. Tang, Y. Xu, R. Ander-           Below, (figure 3) illustrates diverse type of content being
     sen, E. Bakshy, N. Cardin, S. Chandran, N. Chen,         shown on the homepage of Amazon’s retail website.
Figure 3: Homepage of Amazon’s Retail Website.