An Application of Causal Bandit to Content Optimization

Sameer Kanase, Yan Zhao, Shenghe Xu, Mitchell Goodman, Manohar Mandalapu, Benjamyn Ward, Chan Jeon, Shreya Kamath, Ben Cohen, Yujia Liu, Hengjia Zhang, Yannick Kimmel, Saad Khan, Brent Payne and Patricia Grao
Amazon, USA

Abstract
Amazon encompasses a large number of discrete businesses such as Retail, Advertising, Fresh, Business (B2B e-commerce), and Prime Video, most of which maintain a presence across its e-commerce website. They produce content for our customers that belongs to diverse content types such as merchandising (e.g. product recommendations), product advertisements (e.g. sponsored products and display ads), program adoption banners (e.g. Amazon Fresh), and consumption (e.g. Prime Video). When customers visit a web page on the website, it triggers a content allocation process where we determine the specific content to show in regions of customer shopping experience on that web page. Content produced by the aforementioned businesses then needs to be arbitrated during this process. We present a causal bandit based framework to address the problem of content optimization in this context. The framework is responsible for fairly balancing the differing objectives and methods of these businesses, and selecting the right content to display to the customers at the right time. It does so with the goal of improving the overall site-wide customer shopping experience. In this paper, we present our content optimization framework, describe its components, demonstrate the framework's effectiveness through online randomized experiments, and share learnings from deploying and testing the framework in production.

Keywords
Personalization, Recommender system, Content optimization, Content ranking, Content diversity, Causal bandit, Contextual bandit, View-through attribution, Holistic optimization
1. Introduction

Amazon encompasses a large number of discrete businesses such as Retail, Advertising, Fresh, Business (B2B e-commerce), and Prime Video, most of which maintain a presence across its e-commerce (or retail) website. These discrete businesses produce content for our customers that belongs to diverse content types such as merchandising (e.g. product recommendations), product advertisements (e.g. sponsored products and display ads), program adoption banners (e.g. Amazon Fresh), and consumption (e.g. Prime Video). Each such content is rendered in the form of a widget within independent 'regions of customer shopping experience' on the website, also known as widget groups. For instance, widgets such as 'customers who viewed this also viewed' and 'customers who bought this also bought' are displayed on product detail pages of the website alongside other organic and advertising content. The region of customer shopping experience on the website where the collection of such widgets is displayed is an example of a widget group. We illustrate the concept of a product (or an item), widget, and widget group in figure 1.

When customers visit a web page on the website, it triggers a content allocation process where we determine the specific content to show in the widget groups on that web page. Content produced by the aforementioned discrete businesses then needs to be arbitrated during this process. As the common integration point, Amazon's content optimization framework is responsible for this content arbitration. It accomplishes this by fairly balancing the differing objectives and methods of these businesses through optimization capabilities, and by taking into account customer, content, and shopping context. This results in the right content being shown to the customers at the right time, thereby providing a consistent and personalized shopping experience. The content optimization framework is an ecosystem which enables businesses to interoperate independently by enabling content creators, customer shopping experience providers, and web page owners to efficiently construct and serve content for the retail website.

In this paper, we present a causal bandit based framework to address the problem of content optimization with the objective of improving the overall customer shopping experience on Amazon's retail website. Our contributions include:

• application of a contextual bandit framework to enable introduction of new content through online randomized experiments (or A/B tests) and to learn the value (or benefit) of new content through exploration,

ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender Systems and User Modeling, jointly with the 16th ACM Conference on Recommender Systems, September 23rd, 2022, Seattle, WA, USA. kanases@amazon.com (S. Kanase); yzhaoai@amazon.com (Y. Zhao); shenghe@amazon.com (S. Xu); migood@amazon.com (M. Goodman). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Figure 1: Below is an example of a widget group in a shopping page on Amazon's retail website. It is part of the checkout experience that surfaces after an item has been added to your cart. Here, we see carousels of products (or items) each of which is associated with a title, for instance, "Try before you buy with Prime Wardrobe". Each such carousel of products is rendered in the form of a widget, which are marked by blue lines in the figure. A single product recommendation within a widget is marked by the green border, while the widget group, which is a collection of widgets, is marked by the orange border. Note that widget groups are regions of customer shopping experience on the website, and each web page on the website can contain one or more widget groups.
While the example in this figure illustrates a widget as a carousel of products, widgets are not limited to it. Widgets can also be used to render images, banners, advertisements, and other types of content. The homepage of Amazon's retail website illustrates this diversity in content type, where a widget is one of the many cards shown in the customer's feed. For reference, we have included an example of Amazon's homepage in Appendix A.

• an approach to measure the reward for actions taken by the contextual bandit framework,
• application of view-through attribution (VTA) to attribute reward in the context of content ranking, which only requires that content be impressed,
• utilization of an uplift modeling framework to augment VTA and to optimize for incremental benefit,
• a methodology to incorporate diversity in content ranking by using cross-content interactions, and
• learnings from the deployment of a low-latency learning framework in production that reduces the delay in feedback and increases the velocity of our learning loop.

Finally, we also demonstrate the effectiveness of our framework through online A/B tests, and share results and insights gathered through the same.

2. Related Work

Application of exploration strategies in the context of recommender systems is an active area of research. In recent years, multiple exploration strategies have emerged and shown promising results [1, 2]. They include epsilon-greedy [3, 4], upper confidence bound (UCB) [5, 6], adding random noise to parameters [7, 8, 9], and bootstrap sampling [10, 11]. We adopt the Thompson sampling algorithm [12] to balance exploration with exploitation under the contextual bandit setting. Originally introduced in 1933 [13], Thompson sampling has been widely adopted in the context of bandit problems recently [14, 15, 16]. It has been shown to achieve state-of-the-art results on some real-world use cases and to be robust to delay [17, 18].

Uplift modeling is a widely used approach to measure incremental effect [19, 20, 21, 22]. Our approach to estimate incremental effect or benefit is similar to the meta-learning approach presented in [23, 24]. In [25], the authors presented an application of a causal bandit in targeting campaigns. They estimated incremental effect to optimize for clicks in email marketing campaigns and advertisement campaigns on Amazon's mobile homepage. In this work, we explore an application of causal bandit for content optimization by estimating and optimizing for heterogeneous treatment effect [26].

3. Problem Description

Let 𝒫 be the set of all web pages and 𝒬 be the set of all widget groups on Amazon's retail website. Here, a widget group refers to real estate or a region of customer shopping experience on the website which can be populated with content 𝑐 in the form of a widget. Content 𝑐 can belong to diverse types of content such as product advertisements (e.g. sponsored products and display ads), merchandising (e.g. product recommendations), program adoption banners (e.g. Amazon Fresh), and consumption (e.g. Prime Video). Each widget group 𝑞 ∈ 𝒬 in turn can render (or display) a set of ranked content 𝑇𝑞 = {𝑐𝑟 | 𝑐𝑟 ∈ 𝐶𝑞 and 𝑟 ∈ {1, . . . , 𝑘𝑞}}, where 𝑟 is the rank of content rendered in widget group 𝑞, 𝑘𝑞 is the total number of content that can be rendered in 𝑞, and 𝐶𝑞 is the set of all possible candidate content that is eligible to be rendered in 𝑞.

We formally define the problem we address as determining the ranked set 𝑇𝑞 from 𝐶𝑞, given contexts 𝑋 and 𝑍, so as to maximize the expected reward ℛ. Here, reward ℛ is a measure of improved customer shopping experience on the retail website. We denote the metric for measuring reward ℛ by 𝑀𝑂𝐼, short for 'metric of interest'. In our setting, 𝑀𝑂𝐼 takes into account the short-term as well as long-term impact to the customer's shopping experience, and helps us to fairly balance multiple and differing objectives of various stakeholders. It is computed using actions taken by the customer after interacting with content, such as impressions, clicks, purchases and other high-value actions. Note that our problem is different from that of ranking products (or items) within a single widget for a particular recommender system.

A key challenge we face in predicting 𝑀𝑂𝐼 using 𝑋 and 𝑍 is that of the estimate being biased due to the cold-start problem. New content gets continually introduced to be shown on Amazon's retail website, while existing content can be sunsetted at any point of time. Empirically, we observe a propensity in customers to interact more with content displayed higher up in the widget group and on the web page. Furthermore, we only observe reward for content that was shown to customers before, but we only show content to customers for which we predict there will be sufficient reward. Consequently, content with few or no prior observations is unlikely to be ranked higher or chosen to be shown to the customers, even if it could generate a high reward in the counterfactual event where a customer were to interact with it. Here, we could use aggregate-level features to partially address the cold-start problem but cannot fully solve it. Moreover, we observe that customer preferences and their interactions with content change over time. To address these challenges, we use a contextual bandit based framework to create a learning loop for new content that has never been shown before to the customers and to dynamically adapt to changing customer preferences.

4. Methodology

In this section, we present a causal bandit based framework to address the problem of content optimization.
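To make the formulation concrete, the selection problem of Section 3 — choose the ranked set 𝑇𝑞 from candidates 𝐶𝑞 so as to maximize predicted reward — can be sketched as below. This is a minimal illustration: the candidate names and scores are hypothetical, and the scoring function stands in for the learned ranking model developed in this section.

```python
def select_ranked_set(candidates, k_q, score):
    """Return the ranked set T_q: the top-k_q candidates from C_q,
    ordered by predicted reward (MOI) under the given scoring function."""
    return sorted(candidates, key=score, reverse=True)[:k_q]

# Hypothetical candidate content and stand-in predicted MOI values; in
# the paper's setting these scores come from the learned ranking model.
C_q = ["sponsored-products", "also-viewed", "fresh-banner", "prime-video"]
predicted_moi = {"sponsored-products": 0.12, "also-viewed": 0.30,
                 "fresh-banner": 0.07, "prime-video": 0.21}

T_q = select_ranked_set(C_q, k_q=2, score=predicted_moi.get)
# T_q == ["also-viewed", "prime-video"]
```

In practice 𝐶𝑞 is much larger than 𝑘𝑞, and the scores are uncertain — which is what motivates the exploration machinery described next.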
4.1. Features

When a customer visits web page 𝑝 ∈ 𝒫 on Amazon's retail website, a request is generated with customer and shopping context 𝑋 to optimize and display content for widget group 𝑞 on page 𝑝. Context 𝑍 corresponding to each candidate content 𝑐 ∈ 𝐶𝑞 can be constructed separately. Note that the cardinality of set 𝐶𝑞 ≫ 𝑘𝑞, and that eligibility for rendering content 𝑐 in widget group 𝑞 is typically determined by business rules and content creators. We then combine contexts 𝑋 and 𝑍 non-linearly to form a single d-dimensional vector 𝐵 ∈ ℝᵈ. We also include second- and third-order interaction terms between the explanatory variables observed in the context. For reference, we include a few examples of context below:

• Shopping context: region, web page type, widget group id, page item, metadata of page item, and search query
• Customer context: recent interaction events, customer signed-in status, and prime membership status
• Content context: widget id, widget meta information, and content attributes

4.2. Ranking Model

We formulate the problem of content optimization as that of learning to rank the set of eligible content 𝐶𝑞. Our aim is to determine the rank of each eligible candidate content 𝑐 ∈ 𝐶𝑞 and return the top-𝐾 ranked content 𝑇𝑞 so as to render them in widget group 𝑞. To do so, we need a utility function using which we can evaluate eligible content and rank them. We propose using the reward ℛ to be generated over a subsequent time horizon in the event content were to be shown to a customer, 𝒮 ∈ {0, 1}, as our utility function. We model it using a generalized linear model,

    E(𝑅 | 𝑆, 𝐵) = 𝑔(𝐵⊤𝑊)    (1)

where 𝑔 is the link function. Since reward ℛ takes continuous values in our problem setting, we choose an identity link function. We use the set of past observations 𝐻, made up of triplets of context, action and reward {(𝑋𝜏, 𝐴𝜏, 𝑅𝜏), 𝜏 = 1, . . . , 𝑡 − 1}, to train the ranking model and estimate regression parameters using a Bayesian framework.

4.3. Reward

A fundamental challenge in our problem setting is that of defining and measuring reward so as to evaluate diverse types of content together on an equal footing [27]. When content optimization systems seek to maximize the attributed value (or reward) to individual content, we observe that it leads to development and launch of bespoke recommender systems that optimize for individual objectives and cater to page-specific use cases. For instance, recommender systems displayed on different web pages can optimize for increasing customer interactions with themselves through views, clicks, purchases and other high-value actions without being complementary (or incremental) to the customer's current shopping intent. This often results in a poor customer shopping experience, which in turn leads to a negative impact to business metrics such as revenue. An alternative here is to attribute value to individual content only if interaction with it is in addition to purchase of the page item (or product), wherever applicable. In empirical evaluation, we observe that results from such an approach can be mixed, in that it may improve customer shopping experience on some web pages of the website but not all of them.

To address these challenges, we propose optimizing directly for the overall down-session value generated after a customer has interacted with content. In this approach, once content has been ranked and rendered, we record the customer's interaction events with it, such as impressions, clicks, purchases and other high-value actions. Thereafter, we measure the aggregate value generated from these events over a subsequent time horizon to compute our metric of interest 𝑀𝑂𝐼. The measured value is attributed to content as reward if it meets a predefined criteria. Content ranking models then learn to predict this down-session value of showing content to customers given a context, and make ranking decisions based on the predicted value. This approach enables us to measure and attribute site-wide impact across all devices, apps, widget groups, and web pages from the moment a customer has interacted with content. We call this approach to define reward and rank diverse types of content using aggregate down-session value as holistic optimization.

4.4. Attribution

The predefined criteria used to attribute aggregate down-session value as reward also defines the form of attribution, such as click-through attribution (CTA) or view-through attribution (VTA). The distinction between these two forms of attribution is the customer interaction event that triggers the measurement of reward. In CTA, reward is measured after a click event with content occurs, while in VTA reward is measured after a view event with content occurs. Note that both VTA and CTA are a form of equal credit attribution model. Likewise, the time horizon over which the reward is measured is called an attribution window. The window is triggered after a customer interaction event with content occurs. We determine attribution windows by performing exploratory data analysis of the length of customer shopping sessions, and use multiple windows in practice to cater to varied use cases. We illustrate the concept of VTA and CTA with an attribution window using the example in figure 2.

Figure 2: Here, content (c1) corresponding to view (v1) will be attributed with MOI of 30 (= 10 + 20) under view-through attribution. Likewise, under click-through attribution, content (c2) corresponding to click (cl-1) will be attributed with MOI of 30 (= 10 + 20), while content (c3) corresponding to click (cl-2) will be attributed with MOI of 50 (= 20 + 30). Note that content (c1) corresponding to view (v1) will be attributed MOI=0 under click-through attribution.

A key drawback of CTA is that it cannot attribute reward to content that cannot be clicked, or where clicking on content does not necessarily indicate a positive customer shopping experience. We observe that CTA also leads ranking models to favor content that has a high click propensity. Consequently, such models promote content which at times is not relevant to the customer's ongoing shopping mission. This distracts the customer from their mission, which in turn results in a negative impact to their shopping experience. VTA on the other hand allows us to capture both the positive and negative impact of presenting content to customers. It enables us to capture the value of showing content which inspires customer shopping missions, including scenarios where customers can compare selection without requiring direct interaction with content. Furthermore, it is closer in alignment with how an experimentation framework for conducting online A/B tests may measure and attribute aggregate downstream impact after a customer has been exposed to a new shopping experience (or treatment).

Attributing observed down-session value as reward without accounting for the counterfactual outcome could, however, lead to models overestimating the predicted benefit at inference time. To address these challenges, we use an uplift modeling framework. It estimates the Conditional Average Treatment Effect (CATE) [28] between exposure and non-exposure of content to customers using observational data. We assume conditional unconfoundedness in our problem setting [29, 30].
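The attribution mechanics of figure 2 can be sketched as below — a minimal illustration in which the event timestamps are hypothetical and the window of 5 time units is chosen so that the numbers reproduce the figure's example:

```python
# Hypothetical event log consistent with Figure 2: one view (v1) on c1,
# clicks cl-1 on c2 and cl-2 on c3, followed by down-session MOI events.
interactions = [
    {"t": 0, "content": "c1", "kind": "view"},   # v1
    {"t": 1, "content": "c2", "kind": "click"},  # cl-1
    {"t": 3, "content": "c3", "kind": "click"},  # cl-2
]
moi_events = [(2, 10), (4, 20), (6, 30)]  # (time, MOI value)

def attribute(interactions, moi_events, window, trigger_kind):
    """Equal-credit attribution: each triggering interaction (view for
    VTA, click for CTA) is credited with all MOI generated strictly
    inside its attribution window."""
    reward = {}
    for i in interactions:
        if i["kind"] != trigger_kind:
            continue
        total = sum(v for (t, v) in moi_events if i["t"] < t < i["t"] + window)
        reward[i["content"]] = total
    return reward

vta = attribute(interactions, moi_events, window=5, trigger_kind="view")
cta = attribute(interactions, moi_events, window=5, trigger_kind="click")
# vta == {"c1": 30}; cta == {"c2": 30, "c3": 50}; c1 earns nothing under CTA.
```

In production the windows come from exploratory analysis of session lengths, and multiple windows are used side by side; this sketch uses a single window for clarity.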
Thus, it can enable parity in the methodology used to attribute reward in the content optimization and online experimentation systems.

4.5. Uplift Modeling Framework

Both VTA and CTA assume a causal relationship between customers interacting with shown content through views and clicks (cause), and observed reward (effect). In the case of CTA, there is a strong connection between the cause and effect, as oftentimes a click is an intentional action on the part of a customer. However, with VTA we cannot establish this direct connection between a customer viewing content and the observed reward. As such, we assume a causal relationship, which introduces noise in our observations. Models incrementally trained using such observations are likely to have a high variance in the predictions.

In addition, the attribution model described in the previous section does not capture the incremental value of showing content. Customers can have an underlying propensity to shop products or consume content based on prior exposure or affinity. Formally,

    𝐶𝐴𝑇𝐸 ≡ E[𝑅(1) − 𝑅(0) | 𝐵 = 𝑏]    (2)
          = E[𝑅(1) | 𝐵 = 𝑏] − E[𝑅(0) | 𝐵 = 𝑏]

where 𝐵 is the d-dimensional feature vector. Here, E[𝑅(1) | 𝐵 = 𝑏] is the mean of the treated group, where content 𝑐 is shown in the shopping session, and E[𝑅(0) | 𝐵 = 𝑏] is the mean of the untreated group, where content 𝑐 is not shown in the shopping session. We have explored two approaches to estimate the latter: i) using the mean of the untreated group calculated from our ranking logs as a biased estimate for E[𝑅(0) | 𝐵 = 𝑏], and ii) estimating E[𝑅(0) | 𝐵 = 𝑏] from randomized controlled trials.

The uplift modeling framework is then defined using a two-part model. First, a baseline model estimates the expected counterfactual reward when content 𝑐 is ranked but not shown in the shopping session. We illustrate the underlying theory using a linear regression model:

    𝜇₀ = 𝐵⊤𝛽 + 𝜖    (3)

The treatment (or incremental) effect for each observation in the treated group, where content 𝑐 is ranked and shown in the shopping session, is then estimated as:

    𝐷ᵢ⁽¹⁾ = 𝑅ᵢ − 𝜇̂₀(𝑏)    (4)

where 𝑅ᵢ is the observed down-session reward for observation 𝑖, and 𝐷ᵢ⁽¹⁾ is the imputed incremental effect for observation 𝑖 in the treated group. In the second part, the pseudo-effect 𝐷 is used as the target variable in our ranking model, described in (eqn. 1), to predict the incremental benefit of showing content 𝑐 to a customer in widget group 𝑞.

4.6. Exploration Strategy

The exploration component of our content optimization framework explores content with few observations from the past. To do so, it aims at solving a contextual bandit problem. Here, we use Thompson sampling, an algorithm widely used to balance exploration and exploitation. It suggests randomly playing each arm according to its probability of being optimal. In our problem setting, this means choosing content proportional to the probability of it being optimal. This implies we won't necessarily be choosing the content with the highest expected incremental benefit at each time step. It is a trade-off we make to explore content with few observations from the past, which have high uncertainty but ultimately may drive a higher reward. In practice, we apply the Thompson sampling algorithm by sampling model parameters 𝑊̂𝑡 from their posterior distributions, followed by choosing content that maximizes the reward.

Algorithm 1 Thompson Sampling Algorithm for Content Optimization
1: for 𝑡 = 1, . . . , 𝑇 do
2:   for all 𝑟𝑎𝑛𝑘 = 1, . . . , 𝑘𝑞 do  ◁ 𝑘𝑞 is the value of 𝐾 corresponding to widget group 𝑞
3:     Receive context 𝑋𝑡
4:     Sample 𝑊̂𝑡 from the posterior distribution Pr(𝑊 | 𝐻𝑡)
5:     Select 𝐴𝑡,𝑟𝑎𝑛𝑘 = argmax_𝐴 𝐵⊤𝑊̂𝑡
6:   end for
7:   Choose top-𝐾 arms and observe reward 𝑅
8:   𝐻𝑡 = 𝐻𝑡−1 ∪ {(𝑥𝑡,𝑟𝑎𝑛𝑘, 𝑎𝑡,𝑟𝑎𝑛𝑘, 𝑟𝑡,𝑟𝑎𝑛𝑘), 𝑟𝑎𝑛𝑘 = 1, . . . , 𝑘𝑞}
9: end for

This completes the feedback loop which allows us to continuously explore actions and expand our knowledge for making better decisions in the future. This in turn enables us to support running online A/B tests, using which content creators can introduce new content across Amazon's retail website with the goal of improving customer shopping experience, and measure its benefit while doing so.

4.7. Incorporating Diversity in Ranking

Showing high-relevance content without taking content diversity into account leads to monotony and tends to make the holistic shopping experience less meaningful for the customers. Optimizing for the whole widget group involves balancing relevance and diversity of the content therein, where the whole-widget-group effect is represented using the amount of similar content displayed in it. One approach followed here is to model this as a submodular optimization problem [32]. In [33, 34], the authors propose using submodular functions which have a diminishing returns property. In their approach, the total score for a content is derived from its relevance while also accounting for the decreasing utility of showing multiple content of the same type. As a result, the value of selecting content from a given category or type decreases as a function of the number of content belonging to that type already selected. A key shortcoming of this approach is that it lacks a feedback loop, and the parameters of the diversity scoring function aren't learned to optimize for the same objective as the relevance scoring function.

Instead, we propose a two-stage model for incorporating diversity into content ranking. We first rank all the eligible content 𝑐 ∈ 𝐶𝑞 to be shown in widget group 𝑞 using our underlying ranking model (eqn. 1). Thereafter, we iteratively re-rank content at each position 𝑟 in the widget group by taking into account the content that is already ranked in the previous 𝑟 − 1 positions. This is accomplished by using a second ranking model which includes cross-content interaction features.
To capture these in- {(𝑥𝑡𝑟𝑎𝑛𝑘 , 𝑎𝑡𝑟𝑎𝑛𝑘 , 𝑟𝑡𝑟𝑎𝑛𝑘 ), 𝑟𝑎𝑛𝑘 = 1, . . . , 𝑘𝑞 } teractions, we categorize each content 𝑐 in to one of 𝑚 9: end for distinct categories or types. The goal is to then select an optimal number of highly relevant widgets in each Candidates that are ranked and chosen to be displayed category. For a given widget group 𝑞 ∈ 𝒬, the overall by the ranking model are then logged along with their value of widget group 𝑞 is represented as: observed reward in the form of triplets (𝑥𝑡 , 𝑎𝑡 , 𝑟𝑡 ). There- 𝑘 after, we estimate the incremental effect for each obser- ∑︁ 𝑓 (𝑐𝑘 |𝑞, 𝑋) = 𝑓 (𝑐𝑟 |𝑇𝑞(𝑟−1) , 𝑋, 𝑊 ) (5) vation in the logged feedback using our uplift modeling 𝑟=1 framework. The incremental effects are subsequently 𝑓 (𝑐𝑟 |𝑇𝑞(𝑟−1) , 𝑋, 𝑊 ) = 𝑔(𝐵 ⊤ 𝑊 ) (6) used as target variables to incrementally train our rank- ing model using a batch update under the Thompson Sam- where, 𝑐𝑟 is the content at rank 𝑟 in widget group 𝑞, pling framework. While doing so, we decay the model 𝑇𝑞(𝑟−1) is the set of content allocated in the top 𝑟 − 1 positions, 𝑋 is the request and customer context received than before as we retrained the models at a faster ca- as before, 𝑊 are the model parameters, and 𝑔 is a gener- dence. This impacts the model’s learning process in two alized linear model. ways: i) outliers in the dataset can cause the model to incorrectly associate higher potential reward for some content despite winsorization techniques, and ii) insuf- 5. Low Latency Learning ficient data can limit the model’s ability to learn about Framework content’s reward distribution especially in regions with small amount of traffic. Empirically, we observe that both In production, we observed that our content optimiza- of these scenarios lead to over-exploration of content. tion framework suffered from a delay in feedback as it We address this challenge using Bayesian Regulariza- needed multiple days on average to complete the learning tion. 
The use of Gaussian priors has been established in loop. This involves logging the feedback after customers [35] as a form of L2 regularization wherein the following are shown with content on the website, measuring and equivalence is explored: attributing reward in our data pipelines, incrementally training the models at a daily cadence, and deploying [︁ 𝑃 (𝐵|𝑊 )𝑃 (𝑊 ) ]︁ (𝐵𝐵 𝑇 + 𝛼𝐼)−1 𝐵 𝑇 𝑅 = E (7) the retrained models in production. A delay in feedback 𝑃 (𝐵) has the following consequences in production for our content optimization framework: Here, instead of initiating the feature weights from a static prior (i.e. mean 0 and variance 1), we derive a • When new content is introduced into the ecosys- prior distribution from previously learned weight distri- tem, the optimization framework is not able to butions in order to allow for a more pessimistic explo- effectively estimate its potential benefit, and the ration regime. For instance, by using a prior representing content is subject to exploration as expected. Due 20th percentile mean of all features and 75th percentile to the delay in feedback, this can result in new variance of all features, we can reduce the chances of content being explored at a higher show-rate over-exposure for new content during the learning pe- for the duration of the delay during the initial riod. This improvement in turn has enabled us to launch learning period before sufficient observations are the L3 framework in production and reduced the delay logged for the model to learn from its own feed- in feedback by 90%. back. This in turn can result in sub-optimal deci- sion making and introduction of poor customer shopping experience during the learning period. 6. 
• While the cost of exploration may be amortized over long-running content campaigns, a longer feedback loop limits the benefit that can be realized during high-value events such as Cyber Monday, where new content may be introduced for only a short period of time. In such cases, content promoting sales or other events can be turned off before its benefit is effectively learned by the model.
• A longer feedback loop also decreases the velocity of running online A/B tests, in which new content and improvements to the optimization framework may be introduced with the aim of improving the customer shopping experience.

To address these challenges, we developed a Low Latency Learning (L3) framework which has reduced the learning loop for our content optimization framework by 90%, from multiple days to a couple of hours.

5.1. Bayesian Regularization

A key challenge we encountered in the development of the L3 framework was that the number of samples available for each incremental training run is significantly lower at an hourly cadence than at a daily one. We addressed this by placing a Gaussian prior on the model weights, which acts as a form of L2 regularization and in turn enabled the launch of the L3 framework.

6. Experiments

We first evaluate our content optimization framework using both traditional offline evaluation and off-policy evaluation methodologies [36, 37]. This allows us to evaluate and prune alternative treatment policies before introducing them in online randomized experiments [38, 39]. Here, we use regression and ranking metrics to evaluate the framework quantitatively, and content's share-of-voice and ranking distributions to evaluate it qualitatively using domain knowledge. Subsequently, we demonstrate the effectiveness of our framework through five online A/B tests.

6.1. Online Experiment Setting

In our online experimentation setting, observational units (or shopping sessions) are randomly exposed to either the baseline control policy or the alternative treatment policies. Here, we track the impact on our metric of interest MOI, which is a measure of improved site-wide customer shopping experience. In the results, we report the causal effect as the percentage improvement in this metric at Amazon's scale. The experiments are conducted across all of Amazon's world-wide marketplaces and product categories. The level of significance α for these experiments was determined by Amazon's business objectives and was set to 0.10. The duration of these experiments was estimated from statistical power analysis. We allocated equal traffic to the control and treatment groups. During the course of each experiment, the models were incrementally trained using their own set of logged feedback.

Table 1
Online experiment results.

  Experiment   Incremental Impact (% improvement)   p-value
  EXP-1        +0.16%                               0.02
  EXP-2A       +0.01%                               0.12
  EXP-2B       +0.01%                               0.09
  EXP-3        +0.09%                               0.02
  EXP-4        +0.05%                               0.19
  EXP-5        +0.11%                               0.00

6.2. Experiment 1: Application of the Holistic Optimization Framework

We first test the effectiveness of our holistic optimization framework to rank content in a widget group on product detail pages of Amazon's retail website. This is a region of customer shopping experience on the website where we usually see organic content such as 'customers who viewed this also viewed' and 'customers who bought this also bought' widgets displayed alongside advertising content. A key challenge in dynamically ranking content in this setting was the attribution of reward to diverse types of content generated by content creators who optimized for differing business objectives. As such, our framework needed to arbitrate content during the content allocation process and fairly balance the differing objectives. Since the holistic optimization framework measures reward using the aggregate down-session value after a customer has interacted with content, we wanted to test its effectiveness in addressing this problem. In the control group, content was statically ranked by a rule-based system, while in the treatment group, our framework dynamically ranked content using the holistic optimization framework. In the results (EXP-1), we observe a practically and statistically significant improvement in the MOI metric, which is a measure of site-wide improvement in customer shopping experience.

6.3. Experiment 2: Application of View-through Attribution

In this experiment, we applied our content optimization framework to the image size selection problem. Usually, product display images on Amazon's detail page exist in three sizes – small, medium and large. Here, the size of a rendered image can influence the customer's understanding of the product. Hence, we want to select and render an optimal size of the same product image so as to help customers evaluate products better, especially for high-consideration purchases. This is a use case where click-through attribution cannot be used, as clicking on the content does not necessarily indicate a positive customer shopping experience. We formulate the task of optimal image size selection as a learning-to-rank problem, and use view-through attribution to measure and attribute reward to the rendered image size. To demonstrate the effectiveness of this approach, we ran two experiments – one each for desktop and mobile surfaces. In the control group, image size was selected by a rule-based system, while in the treatment group, our framework ranked the image size variations and chose the top-ranked variation to render. In both experiments (EXP-2A and EXP-2B), we observe an improvement in the MOI metric which is practically significant at Amazon's scale.

6.4. Experiment 3: Application of the Causal Bandit Framework

After demonstrating the effectiveness of VTA, we tested the utility of the uplift modeling framework. The framework allows us to measure and optimize for the incremental value generated by content, and reduces the observational bias in data. We conducted an experiment in a widget group located at the bottom of product detail pages on the desktop retail website, where personalized content, usually generated by taking recent browsing history into account, is shown. This in turn allowed us to test our hypothesis that customers can have an underlying propensity to shop products or consume content based on prior exposure or affinity, and that optimizing for incremental benefit can result in a positive customer shopping experience. In the control group, content was ranked by a linear bandit without the uplift modeling framework, with rewards measured and attributed using CTA. In the treatment group, content was ranked by a linear causal bandit with VTA. In the results (EXP-3), we observe that the linear causal bandit using VTA performed better than the linear bandit which did not use the uplift modeling framework. The improvement in the MOI metric was both practically and statistically significant.

6.5. Experiment 4: Application of Incorporating Diversity in Ranking

Subsequently, we ran an experiment (EXP-4) on the desktop homepage of Amazon's retail website to test the impact of incorporating diversity in content. In the control group, content was ranked using just the single baseline ranking model, while in the treatment group, content was ranked using the two-stage ranking model – first the baseline model, followed by a re-ranking model which incorporates diversity using cross-content interaction features. Here, we observe a practically significant improvement in the MOI metric. Based on the results, we infer that incorporating diversity into content ranking can lead to a better customer shopping experience.

6.6. Experiment 5: Application of the Low Latency Learning Framework

The L3 pipeline has shortened the delay in feedback for our contextual bandit based framework by 90%. As a result, we expect the bandit retrained at an hourly cadence to converge sooner and perform better than one retrained at a daily cadence. To test the benefit and measure the impact of low latency learning, we ran an experiment on the mobile homepage of Amazon's retail website. In the control group, content was ranked by a linear bandit incrementally trained at a slower cadence with a learning loop of multiple days, while in the treatment group, content was ranked by a linear bandit incrementally trained at a faster cadence with a learning loop of a few hours. In the results (EXP-5), we observe that the bandit with a shorter delay in feedback performed better w.r.t. our metric of interest MOI, and the improvement was both practically and statistically significant. Based on the results, we infer that reducing the delay in feedback and increasing the velocity of the learning loop has a positive impact on customer shopping experience.

7. Conclusion

In this paper, we presented a causal bandit framework to address the problem of content optimization with the objective of improving the overall customer shopping experience on Amazon's e-commerce (or retail) website. Therein, we introduced a holistic optimization framework that enables us to define reward for, and rank, diverse types of content using aggregate down-session value; presented the concept of view-through attribution; discussed how it addresses some of the shortcomings of click-through attribution; and presented applications of VTA in ranking content of diverse types. To address the shortcomings of view-through attribution, we used an uplift modeling framework which has enabled us to rank content using incremental (or causal) benefit instead of overall value. Subsequently, we proposed a two-stage model to incorporate diversity in content ranking by using cross-content interaction features. It helps us balance relevance with diversity in the content shown on Amazon's retail website and provide a meaningful experience to our customers. Thereafter, we shared learnings from the deployment of a low-latency learning framework in production that has reduced the delay in feedback and shortened the learning loop by 90%. Here, we described our application of a Gaussian prior as a form of L2 regularization, which in turn enabled the launch of the L3 framework. We then demonstrated the effectiveness of our methodology through multiple online experiments, and shared results and insights gathered through the same. Finally, we believe our methodology and learnings are generic and can be extended to content optimization problems in other domains. They can also be extended to rank items (or products) within a single widget for a product recommendation system.

References

[1] C. Riquelme, G. Tucker, J. Snoek, Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling, in: International Conference on Learning Representations, 2018.
[2] A. Bietti, A. Agarwal, J. Langford, A contextual bandit bake-off, arXiv preprint arXiv:1802.04064 (2018).
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533. doi:10.1038/nature14236.
[4] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized experience replay, in: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
[5] T. Lai, H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics 6 (1985) 4–22. doi:10.1016/0196-8858(85)90002-8.
[6] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time analysis of the multiarmed bandit problem, Mach. Learn. 47 (2002) 235–256. doi:10.1023/A:1013689704352.
[7] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, JMLR.org, 2016, pp. 1050–1059.
[8] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, M. Andrychowicz, Parameter space noise for exploration, in: International Conference on Learning Representations, 2018.
[9] M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, S. Legg, Noisy networks for exploration, in: International Conference on Learning Representations, 2018.
[10] I. Osband, B. Van Roy, Bootstrapped thompson sampling and deep exploration, 2015. doi:10.48550/ARXIV.1507.00300.
[11] I. Osband, C. Blundell, A. Pritzel, B. Van Roy, Deep exploration via bootstrapped dqn, in: Advances in Neural Information Processing Systems 29, Curran Associates, Inc., 2016, pp. 4026–4034.
[12] D. Russo, B. V. Roy, A. Kazerouni, I. Osband, A tutorial on thompson sampling, CoRR abs/1707.02038 (2017). arXiv:1707.02038.
[13] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (1933) 285–294.
[14] M. Strens, A bayesian framework for reinforcement learning, in: ICML, volume 2000, 2000, pp. 943–950.
[15] S. L. Scott, A modern bayesian look at the multi-armed bandit, Applied Stochastic Models in Business and Industry 26 (2010) 639–658.
[16] L. Li, W. Chu, J. Langford, R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, in: Proceedings of the 19th International Conference on World Wide Web, WWW '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 661–670. doi:10.1145/1772690.1772758.
[17] O. Chapelle, L. Li, An empirical evaluation of thompson sampling, in: Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, Curran Associates Inc., Red Hook, NY, USA, 2011, pp. 2249–2257.
[18] S. Agrawal, N. Goyal, Thompson sampling for contextual bandits with linear payoffs, in: International Conference on Machine Learning, 2013, pp. 127–135.
[19] B. Hansotia, B. Rukstales, Incremental value modeling, Journal of Interactive Marketing 16 (2002) 35.
[20] V. S. Y. Lo, The true lift model: A novel data mining approach to response modeling in database marketing, SIGKDD Explor. Newsl. 4 (2002) 78–86. doi:10.1145/772862.772872.
[21] N. Radcliffe, Using control groups to target on predicted lift: Building and assessing uplift model, Direct Marketing Analytics Journal (2007) 14–21.
[22] P. Gutierrez, J.-Y. Gérardy, Causal inference and uplift modelling: A review of the literature, in: International Conference on Predictive Applications and APIs, PMLR, 2017, pp. 1–13.
[23] S. R. Künzel, J. S. Sekhon, P. J. Bickel, B. Yu, Metalearners for estimating heterogeneous treatment effects using machine learning, Proceedings of the National Academy of Sciences 116 (2019) 4156–4165.
[24] Z. Zhao, T. Harinen, Uplift modeling for multiple treatments with cost optimization, in: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2019, pp. 422–431.
[25] N. Sawant, C. B. Namballa, N. Sadagopan, H. Nassif, Contextual multi-armed bandits for causal marketing, in: ICML 2018, 2018. URL: https://www.amazon.science/publications/contextual-multi-armed-bandits-for-causal-marketing.
[26] Y. Zhao, M. Goodman, S. Kanase, S. Xu, Y. Kimmel, B. Payne, S. Khan, P. Grao, Mitigating targeting bias in content recommendation with causal bandits, in: Proceedings of the 2nd Workshop on Multi-Objective Recommender Systems (MORS 2022), in conjunction with the 16th ACM Conference on Recommender Systems (RecSys 2022), Seattle, WA, USA, 2022.
[27] S. Xu, Y. Zhao, S. Kanase, M. Goodman, S. Khan, B. Payne, P. Grao, Machine learning attribution: Inferring item-level impact from slate recommendation in e-commerce, in: KDD 2022 Workshop on First Content Understanding and Generation for e-Commerce, 2022. URL: https://www.amazon.science/publications/machine-learning-attribution-inferring-item-level-impact-from-slate-recommendation-in-e-commerce.
[28] S. Athey, G. Imbens, Recursive partitioning for heterogeneous causal effects, Proceedings of the National Academy of Sciences 113 (2016) 7353–7360. doi:10.1073/pnas.1510489113.
[29] D. Rubin, Estimating causal effects of treatments in randomized and nonrandomized studies, Journal of Educational Psychology 66 (1974) 688–701.
[30] G. W. Imbens, D. B. Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press, 2015. doi:10.1017/CBO9781139025751.
[31] T. Graepel, J. Quiñonero Candela, T. Borchert, R. Herbrich, Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft's bing search engine, in: Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, Omnipress, Madison, WI, USA, 2010, pp. 13–20.
[32] Y. Yue, C. Guestrin, Linear submodular bandits and their application to diversified retrieval, in: Advances in Neural Information Processing Systems, volume 24, Curran Associates, Inc., 2011.
[33] C. H. Teo, H. Nassif, D. Hill, S. Srinivasan, M. Goodman, V. Mohan, S. Vishwanathan, Adaptive, personalized diversity for visual discovery, in: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 35–38.
[34] H. Nassif, K. O. Cansizlar, M. Goodman, S. V. N. Vishwanathan, Diversifying music recommendations, in: ICML 2016, 2016.
[35] M. A. Figueiredo, Adaptive sparseness for supervised learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 1150–1159.
[36] A. Swaminathan, T. Joachims, The self-normalized estimator for counterfactual learning, in: Advances in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015.
[37] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, T. Joachims, Recommendations as treatments: Debiasing learning and evaluation, in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, JMLR.org, 2016, pp. 1670–1679.
[38] S. Gupta, R. Kohavi, D. Tang, Y. Xu, R. Andersen, E. Bakshy, N. Cardin, S. Chandran, N. Chen, D. Coey, M. Curtis, A. Deng, W. Duan, P. Forbes, B. Frasca, T. Guy, G. W. Imbens, G. Saint Jacques, P. Kantawala, I. Katsev, M. Katzwer, M. Konutgan, E. Kunakova, M. Lee, M. Lee, J. Liu, J. McQueen, A. Najmi, B. Smith, V. Trehan, L. Vermeer, T. Walker, J. Wong, I. Yashkov, Top challenges from the first practical online controlled experiments summit, SIGKDD Explor. Newsl. 21 (2019) 20–35. doi:10.1145/3331651.3331655.
[39] T. S. Richardson, Y. Liu, J. Mcqueen, D. Hains, A bayesian model for online activity sample sizes, in: Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 1775–1785.

APPENDIX

A. An Illustration of Diverse Types of Content on Amazon's Homepage

Figure 3 illustrates the diverse types of content shown on the homepage of Amazon's retail website.

Figure 3: Homepage of Amazon's Retail Website.
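The paper does not spell out its uplift model, but one common construction from the metalearner literature it draws on [23] is a T-learner: fit separate outcome models on treated (content shown) and control (content withheld) sessions, and score content by the predicted difference. A hedged sketch with closed-form ridge regression on synthetic data (all names and numbers are ours):

```python
import numpy as np

def ridge_fit(X, y, l2=1.0):
    """Closed-form ridge regression; the L2 term corresponds to a
    Gaussian prior on the weights (cf. Section 5.1)."""
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ y)

def t_learner_uplift(X_t, y_t, X_c, y_c, X_new):
    """T-learner [23]: separate outcome models for treated and control
    sessions; uplift = predicted treated outcome minus control outcome."""
    w_t = ridge_fit(X_t, y_t)
    w_c = ridge_fit(X_c, y_c)
    return X_new @ w_t - X_new @ w_c

# Synthetic sessions where showing the content adds +1.0 of value on
# top of a shared baseline trend.
rng = np.random.default_rng(7)
X_t = np.c_[np.ones(200), rng.normal(size=200)]
X_c = np.c_[np.ones(200), rng.normal(size=200)]
y_t = 2.0 + 0.5 * X_t[:, 1] + 0.1 * rng.normal(size=200)  # content shown
y_c = 1.0 + 0.5 * X_c[:, 1] + 0.1 * rng.normal(size=200)  # content withheld
print(t_learner_uplift(X_t, y_t, X_c, y_c, np.c_[np.ones(3), rng.normal(size=3)]))
```

Ranking by this incremental estimate, instead of by the raw predicted outcome, is what lets a causal bandit avoid crediting content for purchases the customer would have made anyway.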
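The two-stage ranking tested in EXP-4 first scores content with a baseline model and then re-ranks for diversity using cross-content interaction features. A simple greedy re-ranker of that general shape is sketched below; the trade-off weight and pairwise similarity function are our own illustrative assumptions, not the production model.

```python
def rerank_with_diversity(scored, similarity, trade_off=0.3):
    """Greedily build a ranking that trades baseline relevance against
    similarity to content already selected above.
    scored: list of (content_id, baseline_score) pairs."""
    remaining = dict(scored)
    ranking = []
    while remaining:
        # Penalize each candidate by its strongest interaction with
        # content already placed higher in the ranking.
        best = max(
            remaining,
            key=lambda c: remaining[c] - trade_off * max(
                (similarity(c, s) for s in ranking), default=0.0),
        )
        ranking.append(best)
        del remaining[best]
    return ranking

# Two ads and one organic widget: after the top ad is placed, the
# second ad is penalized and the organic widget moves up.
category = {"ad1": "ads", "ad2": "ads", "org1": "organic"}
sim = lambda a, b: 1.0 if category[a] == category[b] else 0.0
print(rerank_with_diversity([("ad1", 0.9), ("ad2", 0.85), ("org1", 0.8)], sim))
```

With `trade_off=0` this degenerates to the baseline ranking, so the weight directly controls how much relevance is exchanged for diversity.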
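EXP-5 compares the same incremental learner at two training cadences. For a linear bandit of the kind the paper describes, incremental training reduces to cheap sufficient-statistic updates, so a shorter learning loop mainly changes how fresh those statistics are. A minimal sketch, assuming a Gaussian prior over weights and Thompson sampling for ranking (the class and its details are ours, not the production system):

```python
import numpy as np

class LinearTSBandit:
    """Bayesian linear bandit keeping its posterior as sufficient
    statistics A (precision) and b, so each batch of logged feedback --
    hourly or daily -- is folded in with a rank-k update."""

    def __init__(self, dim, prior_precision=1.0):
        self.A = prior_precision * np.eye(dim)  # Gaussian prior => L2 regularization
        self.b = np.zeros(dim)

    def update(self, X, rewards):
        """Incremental training step on one batch of (features, reward)."""
        X = np.atleast_2d(np.asarray(X, dtype=float))
        self.A += X.T @ X
        self.b += X.T @ np.asarray(rewards, dtype=float)

    def rank(self, candidates, rng):
        """Thompson sampling: draw weights from the posterior and order
        candidate content by sampled score (best first)."""
        cov = np.linalg.inv(self.A)
        theta = rng.multivariate_normal(cov @ self.b, cov)
        X = np.atleast_2d(np.asarray(candidates, dtype=float))
        return list(np.argsort(-(X @ theta)))
```

Because `update` touches only `A` and `b`, shortening the loop from daily to hourly changes nothing in the model itself; it only delivers feedback to the posterior sooner, which is the effect EXP-5 isolates.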
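The contrast between click-through and view-through attribution can be made concrete with a toy session-level attribution pass. The event schema, window size, and function below are hypothetical illustrations of the idea only; the paper's production attribution is more involved.

```python
def attribute_reward(events, outcome_value, window=3):
    """Toy attribution for one shopping session ending in a positive
    outcome worth `outcome_value`. CTA credits only clicked content;
    VTA also credits content viewed within the last `window` events,
    since a view (e.g. of a well-sized product image) can improve the
    experience without any click."""
    cta = {e["content"]: outcome_value
           for e in events if e["action"] == "click"}
    vta = {e["content"]: outcome_value
           for e in events[-window:] if e["action"] in ("view", "click")}
    return {"CTA": cta, "VTA": vta}

session = [
    {"content": "image_large", "action": "view"},
    {"content": "banner_fresh", "action": "view"},
    {"content": "widget_rec", "action": "click"},
]
print(attribute_reward(session, outcome_value=1.0))
```

Under CTA only the clicked widget earns reward, so purely visual content such as the image-size variations in EXP-2A/2B could never learn; VTA assigns them reward whenever a positive outcome follows their exposure.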