<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Mitigating Targeting Bias in Content Recommendation with Causal Bandits</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>YAN ZHAO</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>MITCHELL GOODMAN</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>SAMEER KANASE</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>SHENGHE XU</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>YANNICK KIMMEL</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>BRENT PAYNE</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>SAAD KHAN</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>PATRICIA GRAO</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon</institution>
          <addr-line>Additional Key Words and Phrases: Personalization, Recommender system, Content optimization, Content ranking, Selection bias, Causal bandit, Contextual bandit, Uplift, View-through attribution, Fairness, Counterfactual learning</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>Recommendation systems play a central role in improving the customer experience on the Amazon retail website. Commonly, Learning-to-Rank (LTR) methods are employed to rank content; however, these methods are subject to bias inherent in the observational data they use for training. This paper studies a domain-specific self-selection bias, called Content Targeting Bias, introduced when content is generated for specific targeted customers. When content specifically targets classes of customers who are more or less likely to take the actions associated with traditional recommendation algorithms (clicks, purchases), the resulting observations reflect a biased relationship between the content and the feedback. These observations do not account for the counterfactual condition, or what would have happened if the customer had not received a recommendation. In many cases, customers have a high propensity to generate rewards independent of the recommendations shown on the website. In this work we incorporate causal uplift modeling with contextual bandits in order to use the heterogeneous treatment effect as an adjusted objective for top-k content selection. We demonstrate the performance and impact of the framework through both offline model evaluations and multiple live A/B experiments. CCS Concepts: • Computing methodologies → Sequential decision making; Batch learning; Learning from implicit feedback; Causal reasoning and diagnostics; Learning to rank; Supervised learning by regression; • Applied computing → Online shopping; • Information systems → Content ranking; Personalization; Top-k retrieval in databases; Recommender systems; • Mathematics of computing → Bayesian computation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>In many e-commerce applications, customers rely on recommender systems to help sort through large corpora
of content in order to discover the small fraction of content that they would be interested in. Amazon’s content
optimization/ranking system is designed as a self-service tool which enables teams across the company to build content
recommendation strategies once and run them anywhere. Such a content optimization system is challenging mostly for
the following two reasons: (1) continuous learning: surfacing the right content to the right user at the right time requires a
ranking system to continually adapt to users’ shifting interests along with newly introduced content; (2) content bias
reduction: learning the unbiased incremental value of each piece of content given the context is typically unachievable
given the limits of partial observations.</p>
      <p>
        To continuously learn new content and adapt to changing customer behaviors, the exploration/exploitation trade-off,
in the context of Reinforcement Learning, is currently an active area of research. Numerous competitive techniques have
emerged in the literature and shown promising results. These include epsilon-greedy ([
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]), adding random
noise to parameters [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], bootstrap sampling [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and Thompson Sampling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Content bias is introduced by the process of making recommendations online, which influences the way users interact
with the system and how the data collected from users is fed back into the system. This leads to several types of bias,
such as popularity bias [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], human decision bias [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], position bias [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], or selection bias [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Traditional learning-to-rank
approaches must contend with these biases, and most approaches focus on position [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or selection bias [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
      </p>
      <p>In this paper, we identify a new type of bias, called Content Targeting Bias, introduced in recommendation systems
at industry scale by content targeting criteria. Here, content targeting criteria, defined by content recommendation
strategy owners, target only certain populations of customers or types of page contexts. Content owners then
participate in ranking competitions only when their targeting criteria are met. Content targeting criteria can be as simple as
"all customers and contexts" or can be specific to a small portion of customers who have taken particular actions over the
past month(s). For example, Content Targeting Bias is introduced when content owners target only signed-in customers,
who are known to spend more on average than the population as a whole, regardless of the recommendations provided.
This targeting bias can also arise when content owners target recommendations to some negative-profit
item pages. Not accounting for such biases in ranking means we end up over- or under-estimating a content's incremental
performance, producing unfair rankings and degraded customer experiences. For practical reasons it is nearly impossible for a
ranking system to be aware of all the targeting criteria that each content owner uses, along with the
detailed context and customer information at the level of each content owner. Thus, there needs to be a feature-agnostic
way to mitigate Content Targeting Bias and thereby improve ranking.</p>
      <p>
        On top of this, we further propose a quantitative measurement of Content Targeting Bias within recommendation
systems, and propose solutions for reducing such bias. Our solution to reduce Content Targeting Bias is inspired
by work on causal bandits[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. In this work, we incorporate uplift modeling [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ][
        <xref ref-type="bibr" rid="ref20">20</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], using meta-learner approaches
(e.g. X-learner, R-learner)[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], into contextual bandits[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This approach is designed to consider the
heterogeneous treatment effect between content that is eligible to show but not actually observed (not treated) vs.
observed (treated), with the goal of giving content an equal opportunity to be shown regardless of targeting criteria,
thus maximizing customer experience. To the best of our knowledge, our work is the first to identify content targeting
as a unique bias and to incorporate uplift modeling into bandit approaches, including the Bayesian Linear Regression Model
(BLIR [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]), to reduce such bias in the content ranking problem. During experimentation in an Amazon commercial
system[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], this work achieved significant online improvements across multiple pages of the Amazon e-commerce
website.
      </p>
      <p>This paper is organized as follows: Section 2 describes the problem definitions. Section 3 describes the proposed
solution. Section 4 covers offline model evaluation and online live experiment results with learnings. Finally, Section 5
details conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 PROBLEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Formalizing problems</title>
      <p>We define a widget group as a region of experience on Amazon’s e-commerce website which can be populated with
recommended content (a.k.a. widgets) provided by different teams. Note that the number of content pieces rendered in a widget
group is much smaller than the number of all possible candidate content pieces generated. Eligibility for rendering a widget
w is typically determined by a combination of the widget’s targeting criteria and the ranking system’s valuation.</p>
      <p>
        The metric for measuring reward r is the MOI, short for ‘metric of interest’. In our setting, the MOI takes
into account the short-term as well as long-term impact on the customer’s shopping experience, and helps us to fairly
balance the multiple, differing objectives of various stakeholders. Many Learning-to-Rank systems in the literature
([
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ][
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]) optimize for Click-through Rate (CTR), while we are more interested in the site-wide MOI. Towards this end,
we have adopted attribution modeling using view-through attribution (VTA)[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ][
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], which credits widgets for all
rewards following an impression within an attribution window (e.g. 100 minutes). For example, if a customer views
some content recommendation c for longer than 1 second (defined as an impression) and then makes a purchase in the
subsequent 100 minutes, the reward r generated by this purchase, along with all other high-value actions taken in these
100 minutes, will be attributed back to the impressed content c.
      </p>
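As a concrete illustration of VTA, the following minimal Python sketch (function and data names are ours; only the 1-second impression rule and the 100-minute example window come from the text) credits every reward inside the attribution window to each qualifying impression:

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(minutes=100)  # example window from the text
MIN_VIEW_SECONDS = 1.0                       # dwell time that counts as an impression

def attribute_rewards(impressions, rewards):
    """View-through attribution: credit every reward to each content piece
    impressed within the preceding attribution window.

    impressions: list of (content_id, timestamp, view_seconds)
    rewards:     list of (timestamp, value)
    Returns {content_id: total attributed reward}.
    """
    credit = {}
    for content_id, imp_ts, view_seconds in impressions:
        if view_seconds < MIN_VIEW_SECONDS:
            continue  # too short to count as an impression
        for reward_ts, value in rewards:
            if imp_ts <= reward_ts <= imp_ts + ATTRIBUTION_WINDOW:
                credit[content_id] = credit.get(content_id, 0.0) + value
    return credit

t0 = datetime(2022, 1, 1, 12, 0)
imps = [("widget_a", t0, 2.5), ("widget_b", t0, 0.4)]
rews = [(t0 + timedelta(minutes=30), 25.0)]
print(attribute_rewards(imps, rews))  # {'widget_a': 25.0}: widget_b's view was too short
```

Note how the same purchase would be credited to every impressed widget in the window, which is exactly the loosened content-reward connection discussed next.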
      <p>The drawback of this methodology is that it loosens the connection between the content and the associated reward,
which makes the ranking system more vulnerable to Content Targeting Bias. Specifically, by treating feedback as a
“response” without considering counterfactual cases (i.e. how the customer would have behaved had they not received
a recommendation), we end up with a biased estimate. Cases where customers have a high probability of generating
down-session rewards independent of the recommendations shown can lead the system to over-estimate the value of the
widgets shown. This misspecification of value due to Content Targeting Bias disrupts the fairness guarantees provided by
the ranking system and ultimately leads to a suboptimal customer experience.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Formalizing Content Targeting Bias and ranking fairness</title>
      <p>
        To formalize Content Targeting Bias, we adopt the recently introduced idea of opportunity bias[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], a formulation designed
to evaluate whether different types of content receive clicks (or other engagement metrics) proportional to their true
targeted population sizes (i.e. do contents with different targeting criteria receive similar true positive rates?). This
method assumes that the content recommended by content owners is all of relatively good quality. We believe this
formalization of Content Targeting Bias is directly aligned with user satisfaction and the economic gains of content owners.
      </p>
      <p>
        To quantify the impact of Content Targeting Bias on a recommendation system, we first need to calculate the true
positive rate for each content. Using show rate as an example, suppose content i has been exposed to customers s_i
times in total; the true positive rate for i is TPR_i = s_i / g_i, where g_i is the total number of times content i is generated based
on content owners’ targeting criteria. Then, we can use the Gini Coefficient[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] to measure the inequality in true
positive rates relative to content generation
      </p>
      <p>G = (Σ_i (2i − n − 1) · TPR_i) / (n · Σ_i TPR_i)  (1)
where contents are indexed from 1 to n by targeted audience size, non-descending. We use −1 ≤ G ≤ 1 to quantify
the Content Targeting Bias in a recommendation system: a G close to 0 indicates low bias; G &gt; 0 indicates
that the true positive rate is positively correlated with content targeted audience size; and G &lt; 0 indicates that the true
positive rate is negatively correlated with audience size.</p>
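The Gini-style measure in Equation (1) can be sketched in Python as follows (the function name and sample numbers are illustrative):

```python
def targeting_bias_gini(tpr, audience_size):
    """Gini-style coefficient of Eq. (1): contents sorted by targeted
    audience size (non-descending), inequality measured over their TPRs."""
    order = sorted(range(len(tpr)), key=lambda i: audience_size[i])
    x = [tpr[i] for i in order]
    n = len(x)
    return sum((2 * (i + 1) - n - 1) * xi for i, xi in enumerate(x)) / (n * sum(x))

# Equal TPRs regardless of audience size -> no targeting bias.
print(targeting_bias_gini([0.2, 0.2, 0.2], [10, 100, 1000]))  # 0.0
# TPR grows with audience size -> G > 0 (bias toward broadly targeted content).
print(targeting_bias_gini([0.1, 0.2, 0.6], [10, 100, 1000]) > 0)  # True
```

A negative G would arise in the mirror case, where narrowly targeted content captures disproportionately high true positive rates.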
    </sec>
    <sec id="sec-5">
      <title>3 METHODOLOGY</title>
      <p>
        To address the problems above, we propose a framework employing uplift techniques with contextual bandits on top of
VTA, to de-bias observations for the ranking system. In more detail, we add contextual features to an uplift
model in order to estimate the Conditional Average Treatment Effect (CATE [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) between exposure vs. non-exposure of
a recommendation to customers, using either Randomized Controlled Trial (RCT) or observational data. Under this
framework, we also propose a modeling architecture incorporating Bayesian Linear Regression (BLIR) with Thompson
Sampling to achieve an online exploration-exploitation trade-off.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Assumptions and definitions</title>
      <p>We divide the causal impact of showing widgets to customers on reward Y into two parts: (1) request-level
incrementality, the incremental value of showing the top k widgets to a customer within a request; this removes confounding
factors which impact the overall down-session reward Y independent of the recommendations received. (2) widget-level
attribution: out of the top k widgets, each widget’s contribution to request-level incrementality. Ideally, we want
a single causal model that solves request-level incrementality and widget-level attribution at the same time,
but for the scope of this paper we focus on the first problem, which is more related to the Content Targeting Bias issue, as
removing customer/context intrinsic value from the observed reward Y.</p>
      <p>
        Let W be a dummy variable indicating treatment status, with W = 1 if a customer receives recommended contents
(treatment) for a given request and W = 0 otherwise. The observed reward is defined as Y ≡ W · Y(1) + (1 − W) · Y(0),
where Y(1) and Y(0) are the potential outcomes when people receive the treatment or not. Under the unconfoundedness
assumption[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], Y(0), Y(1) ⊥ W, we can approximate Y using Y(1) if treatment is imposed, or Y(0) otherwise. In
practice, instead of the simple unconfoundedness assumption, we make a conditional unconfoundedness assumption
due to the non-random treatment assignment, which is also known as “strongly ignorable treatment assignment”[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ],
Y(0), Y(1) ⊥ W | X. In other words, given all the covariates X, the treatment assignment is independent of the
potential outcomes. Given X = x, the CATE τ(x) is then defined as τ(x) ≡ E[Y(1) − Y(0) | X = x].
      </p>
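A minimal numeric sketch of the CATE definition, using a naive difference of treated and control means within a covariate cell (valid only under the conditional unconfoundedness assumption above; all names and numbers are hypothetical):

```python
def cate_estimate(data, x_value):
    """Naive estimate of tau(x) = E[Y(1) - Y(0) | X = x]: under conditional
    unconfoundedness, the difference of treated and control means in the cell.

    data: list of (x, w, y) tuples with treatment indicator w in {0, 1}.
    """
    treated = [y for x, w, y in data if x == x_value and w == 1]
    control = [y for x, w, y in data if x == x_value and w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Signed-in customers spend heavily with or without treatment, so the
# incremental effect is small even though raw treated rewards look large.
obs = [("signed_in", 1, 105.0), ("signed_in", 1, 115.0),
       ("signed_in", 0, 100.0), ("signed_in", 0, 100.0)]
print(cate_estimate(obs, "signed_in"))  # 10.0
```

Here a ranker trained on raw rewards would score this segment near 110, while the causal uplift is only 10.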
      <p>In this work, the treatment group is defined as request-level exposure of the top k widgets, while the control group is defined
as non-exposure of the entire widget group. This definition accounts for exogenous factors in the top-k ranking
problem, where widgets in top ranks may have an impact on lower-ranked widgets’ exposures.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Features</title>
      <p>Another key point related to the unconfoundedness assumption is the set of covariates X which can support it. In the
real world, finding all confounders can be intractable; however, in this work we simplify the problem by limiting it to two
confounding factors which impact user propensities: (1) content targeting, where content is only generated for a subgroup
of customers; (2) model targeting, where the model returns content non-uniformly given different page contexts. The
conditional unconfoundedness assumption then holds as long as we can capture the page context and a customer’s
intrinsic values in X. In the Content Targeting Bias problem, a customer’s intrinsic values can be related only to the
candidate widgets generated for a given request. Thus, with feature representations of candidate widgets as well as
other page context and customer information, we can appropriately counteract the confounding problem.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3 Two-model framework estimating pseudo-effect</title>
      <p>
        We further propose a two-model framework, composed of a baseline (control) model and an uplift model using Linear
Regression, to reduce Content Targeting Bias. Note that we use a linear model in this paper, but the approach can be applied
to any type of Machine Learning model.
3.3.1 Model 1 - Baseline Model. The baseline model fits the observations in the control group, estimating the
expected down-session rewards of various contents that were eligible to show but not actually observed by customers,
defined at the request level as ŷ_0 = βᵀX + ε, where β is the vector of feature weights, which we fit during training to best explain the
data, ε is a noise term capturing unobserved variables, and X represents the features used in the baseline model as explained
in the sections above.
3.3.2 Model 2 - Uplift Model. For each request i with an individual observation (widget) in the treatment group, which
is returned and actually observed by customers, we define
D_i(1) = Y_i − ŷ_0(X_i)  (2)
where Y_i is the observed down-session reward for a customer at request i, and D_i(1) is the imputed treatment effect for
request i in the treated group, based on the baseline outcome estimator. This “pseudo-effect” is an adjusted objective with
Content Targeting Bias reduced, and it is then used as the outcome in a secondary machine learning model to obtain the
response functions with treatment effects estimated. To achieve the online exploration-exploitation trade-off, we use the
Bayesian Linear Regression Model (BLIR) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] approach, D̂_w(1) ∼ N(θᵀx_w, σ²), where D̂_w(1) is the estimated score for widget w, θ is the vector of
coefficients, and x_w represents the features used in the uplift model. Note that x is different from X in that X contains all
candidate widget information and is only used offline, while x only contains features for the focal widget w. This
uplift model estimates “pseudo-treatment-effects” for the observations in the treatment group, and can help reduce
Content Targeting Bias, since we remove the counterfactual effect using the baseline (control) model. Finally, we do
point-wise ranking online by estimating a score for every candidate widget, sorting, and returning the top-K to customers.
The exploration-exploitation trade-off is achieved by sampling model parameters from their posterior distributions
through Thompson Sampling.
      </p>
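The two-model framework can be sketched end-to-end as follows, assuming NumPy is available; the ridge baseline, the Gaussian prior, and all synthetic data are illustrative stand-ins, not the production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_baseline(X0, y0, reg=1.0):
    """Model 1 (offline): ridge-regularized linear fit of control rewards,
    standing in for the paper's baseline regression y ~ beta^T X."""
    d = X0.shape[1]
    return np.linalg.solve(X0.T @ X0 + reg * np.eye(d), X0.T @ y0)

def fit_blir(X1, D, noise_var=1.0, prior_var=10.0):
    """Model 2: Bayesian linear regression on pseudo-effects D, returning the
    posterior N(mu, Sigma) over the coefficients theta."""
    d = X1.shape[1]
    Sigma = np.linalg.inv(X1.T @ X1 / noise_var + np.eye(d) / prior_var)
    mu = Sigma @ X1.T @ D / noise_var
    return mu, Sigma

def rank_top_k(mu, Sigma, candidates, k):
    """Thompson sampling: draw one coefficient vector from the posterior,
    score all candidate widgets, and return the top-k indices."""
    theta = rng.multivariate_normal(mu, Sigma)
    return np.argsort(candidates @ theta)[::-1][:k]

# Synthetic control (non-exposed) traffic for the baseline model.
X0 = rng.normal(size=(200, 3))
y0 = X0 @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
beta = fit_baseline(X0, y0)

# Treated traffic: the true uplift lives entirely on feature 2.
X1 = rng.normal(size=(200, 3))
y1 = X1 @ np.array([1.0, 0.5, 0.0]) + 2.0 * X1[:, 2]
D = y1 - X1 @ beta                    # pseudo-effect, Eq. (2)
mu, Sigma = fit_blir(X1, D)

top = rank_top_k(mu, Sigma, np.eye(3), k=1)  # three unit-feature "widgets"
print(top[0])  # the widget loading on feature 2 wins
```

Subtracting the baseline prediction strips the intrinsic customer/context value shared by both potential outcomes, so the BLIR posterior concentrates on the genuinely incremental feature.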
      <p>In addition to bias reduction with causal effect estimation, another benefit of this two-model framework is that the
baseline model is only used offline, generating uplift objectives for the second model. This simplification allows us
to include as many features as we can in X, without concern for latency or other requirements of online systems, such
as representations of all candidate widgets’ features, while keeping the online feature set x relatively simple yet achieving a
similar effect on bias reduction.</p>
    </sec>
    <sec id="sec-9">
      <title>3.4 Log-tricks on objectives</title>
      <p>One trick we performed is to transform the reward Y and the uplift estimations into log-scale. Transforming rewards into
log-scale is a widely used trick for taming outliers, so we first introduced it on top of the baseline model estimations.
Here, instead of using Y, we used log1p(Y) = sign(Y) ∗ log(1 + |Y|) to achieve symmetry and valid values at
zero (we will still denote it Y in the following). Due to Jensen’s inequality, transforming the log-scaled baseline
estimation log_ŷ_0 back is biased; thus, we directly perform treatment effect estimation in log-scale by
log_D_i(1) = log_Y_i − log_ŷ_0(X_i)  (3)
where exp(log_ŷ_0(X)) can be treated as the geometric mean of the baseline values given covariates X,
while the treatment effect log_D_i(1) becomes multiplicative, like Y_i / ŷ_0. Although this definition differs from the
additive uplift in the section above, the sign still has meaning, in that positive values indicate positive incrementality
whereas negative values indicate negative incrementality. In addition, defining this relationship as multiplicative also
lends an intuitive semantic meaning. For example, say customers who are not signed in spend $10 on average while
signed-in customers spend $100. When representing uplift for some content c, instead of an additive uplift of $10 (thus
$20 for signed-out customers and $110 for signed-in customers), a multiplicative lift ratio of 1.1 makes more sense in an
e-commerce context (thus $11 for signed-out customers and $110 for signed-in customers). Results from A/B
testing show that log-scaled (multiplicative) uplift performs better than additive uplift.</p>
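A small sketch of the symmetric log1p transform and the resulting multiplicative uplift (function names are ours):

```python
import math

def symlog1p(y):
    """Symmetric log1p: sign(y) * log(1 + |y|), valid at zero and
    symmetric for negative rewards."""
    return math.copysign(math.log1p(abs(y)), y)

def lift_ratio(y, baseline):
    """Multiplicative uplift implied by Eq. (3): exp(log Y - log y0_hat)."""
    return math.exp(symlog1p(y) - symlog1p(baseline))

# A reward of $110 against a $100 baseline and $11 against $10 both map to
# (nearly) the same lift ratio, unlike a flat additive $10 uplift.
print(round(lift_ratio(110.0, 100.0), 3), round(lift_ratio(11.0, 10.0), 3))
print(symlog1p(-3.0))  # -log(4): symmetric around zero
```

The two ratios agree only approximately because of the +1 inside log1p; the approximation tightens as rewards grow relative to 1.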
    </sec>
    <sec id="sec-10">
      <title>3.5 Data collection</title>
      <p>Randomly hiding data, as in an RCT, could provide a more unbiased estimate of CATE; however, it is costly to
proactively hide content from customers. In the RCT setup, we randomly punt (do not display) the entire widget group
a small percentage p of the time, and train the baseline model using only this punted traffic. In this way, using RCT, we
are able to remove the confounder effect resulting from customer selection bias, e.g. customers’ propensities to scroll
down the page and browse content. Observational data, in contrast to RCT, can provide sufficient data with minimal
cost, but at the expense of increased data bias. In practice, when the top K widgets are returned, they are not
always all shown in the viewport to customers. Our system is able to capture this client-side impression behavior, and
our proposed work is able to train the baseline and uplift models using this observational data. Results show limited bias
using observational data compared to RCT.</p>
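One possible way to hold out the small punted percentage p described above is a deterministic hash-based split (the 2% rate and all names here are assumptions, not the production configuration):

```python
import hashlib

PUNT_RATE = 0.02  # assumed small holdout percentage p

def is_punted(request_id, punt_rate=PUNT_RATE):
    """Deterministic hash-based holdout: punt (hide) the entire widget group
    for a small share of requests; only this traffic trains the baseline model."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < punt_rate * 10_000

punted = sum(is_punted(f"req-{i}") for i in range(100_000))
print(0.015 < punted / 100_000 < 0.025)  # True: close to the configured rate
```

Hashing the request identifier keeps the assignment reproducible for logging and joins, while remaining effectively random with respect to customer covariates.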
    </sec>
    <sec id="sec-11">
      <title>4 MODEL EVALUATION AND EXPERIMENTS</title>
    </sec>
    <sec id="sec-12">
      <title>4.1 Ranking fairness estimation</title>
      <p>In offline model evaluation of ranking fairness, we compared the G scores defined in Section 2.2 when ranking content
using the production model (non-uplift linear bandits regressed directly on VTA) vs. the two-model uplift bandit approach.
Also, in order to measure fairness along multiple dimensions, we defined the TPR in two ways</p>
      <p>TPR of show rate = Σ content exposures / Σ content generated;  TPR of performance = Σ observed reward / Σ content exposures  (4)</p>
      <p>From the resulting G scores, the uplift approach reduces bias in terms of both content show rate and content average performance,
especially the latter, which indicates that our approach improves fairness based more on content performance than on show-rate
coverage, which better aligns with business goals.
Through analysis of real online data, we identified several widgets targeting customers with high intrinsic rewards,
compared their observed scores with predicted uplift values, and validated that uplift is able to reduce those biases. In
Table 2, widgets C and D target high-value customers only, while widgets A, B, E, and F are common
widgets targeting all customers. The scores are observed or estimated rewards as defined in Section 2.1, and are
represented as tuples of (average across all customers, average across high-value customers only).
When evaluating with the proposed framework, we intentionally exclude all customer-related features, so the model does not
depend on customer profiles. We can see that although widget C has the highest average observed score across all customers,
its uplift prediction is not as high as widget B’s after reducing Content Targeting Bias (0.15 vs. 0.24); this aligns
with our observations of these widgets through online experimentation. A similar pattern can be found for widget D.</p>
    </sec>
    <sec id="sec-13">
      <title>4.3 RCT and observational data</title>
      <p>We also evaluated the baseline model in the two-model uplift approach, trained with either RCT or observational data. Through
offline analysis, we see that the observational uplift approach is able to achieve results similar to RCT, with a gap
in the estimates of counterfactuals (5% ∼ 10%). This gap can be interpreted through customers’ selection biases: when customers
intentionally do not view content (observational data), they might be attracted by other content on the page or already
have a clear shopping mission. This in turn leads to higher estimates in observational baseline models vs. RCT. Estimating
this gap is important, since it can be used to adjust observational modeling and improve model interpretability while
avoiding the high cost imposed by RCT, e.g. showing sub-optimal results to a certain percentage of the population.</p>
    </sec>
    <sec id="sec-14">
      <title>4.4 Online experiments</title>
      <p>
        Content Targeting Bias may appear in different forms across different pages. For example, on the homepage, Content
Targeting Bias is mostly introduced by differing targeting criteria on customer populations, while on product detail
pages, biases are mostly introduced by targeting criteria on context information. To gain a thorough understanding
of this proposed work, we completed five online randomized A/B experiments[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
The following treatments were performed: (1) observational uplift: two-model uplift with observational data; (2) RCT uplift:
two-model uplift with RCT data. In Experiment 2, we ran on slots located at the top of product detail pages, with the
treatment group as observational uplift. In Experiment 3, we ran on cart pages, using uplift with observational data.
In Experiment 4, we ran on the desktop product detail page, where the following treatments were performed: (1) observational
uplift; (2) observational uplift with the log-scale trick. Experiment 5 ran on the mobile app product detail page, using
observational uplift with the log-scale trick as the treatment group.
4.4.2 Online experiment results. We observed consistent improvements across experiments. Table 3 shows detailed
MOI results with confidence intervals. Through these experiments, we demonstrated: (1) improvements using heterogeneous
treatment effect estimation on top of a bandit approach, on different pages including the homepage, detail pages, and cart pages,
across Amazon e-commerce websites. Out of these, experiments 1, 3, 4, and 5 achieved statistically significant improvements
with p-values less than 0.05, with experiment 5 having a p-value close to 0.000. (2) The proposed method using observational
data achieved significant improvements (p-value 0.02), while RCT only outperformed the production model with low
confidence (p-value 0.38). This demonstrates that observational uplift modeling can achieve results similar to
using RCT, by successfully minimizing potential bias in training examples. Conversely, RCT depends on randomly hiding
content, which is guaranteed to be suboptimal some percentage of the time, hurting overall RCT
performance. (3) The proposed uplift model with the log trick outperforms additive uplift, which can be observed directly in
experiment 4, where uplift with the log trick achieved significant improvements (p-value 0.024) while the additive uplift
improvement was not significant (p-value 0.17). This further supports our hypothesis that the log trick better manages
outliers and that multiplicative uplift offers more reasonable semantic meaning.
      </p>
    </sec>
    <sec id="sec-15">
      <title>5 CONCLUSION</title>
      <p>In this paper, we studied a new type of bias in Learning-to-Rank systems, called Content Targeting Bias. We defined
this bias, proposed a quantitative measurement for it, and further proposed an online ranking approach using BLIR that
incorporates contextual features into uplift modeling to reduce the bias in top-K content selection. Through this work, we
introduced log-tricks for treatment effect estimation between exposure vs. non-exposure of a recommendation, and
compared baseline models trained using both RCT and observational data. This work demonstrates significant bias
reduction as well as significant MOI improvements both offline and online. In future work, building on the current
framework, we will improve uplift estimation by applying propensity-weighting-based meta-learner approaches, e.g.
double ML (R-learner), to further reduce display biases in content rankings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Abrevaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu-Chin</given-names>
            <surname>Hsu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Robert P</given-names>
            <surname>Lieli</surname>
          </string-name>
          .
          <article-title>Estimating conditional average treatment effects</article-title>
          .
          <source>Journal of Business &amp; Economic Statistics</source>
          ,
          <volume>33</volume>
          (
          <issue>4</issue>
          ):
          <fpage>485</fpage>
          -
          <lpage>505</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Aman</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , Kenta Takatsu, Ivan Zaitsev, and
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>A general framework for counterfactual learning-to-rank</article-title>
          .
          <source>In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Charles</given-names>
            <surname>Blundell</surname>
          </string-name>
          , Julien Cornebise, Koray Kavukcuoglu, and
          <string-name>
            <given-names>Daan</given-names>
            <surname>Wierstra</surname>
          </string-name>
          .
          <article-title>Weight uncertainty in neural network</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          , pages
          <fpage>1613</fpage>
          -
          <lpage>1622</lpage>
          . PMLR,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Malcolm C.</given-names>
            <surname>Brown</surname>
          </string-name>
          .
          <article-title>Using gini-style indices to evaluate the spatial patterns of health practitioners: theoretical considerations and an application based on alberta data</article-title>
          .
          <source>Social science &amp; medicine</source>
          ,
          <volume>38</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1243</fpage>
          -
          <lpage>1256</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Òscar</given-names>
            <surname>Celma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Cano</surname>
          </string-name>
          .
          <article-title>From hits to niches? or how popular artists can bias music recommendation and discovery</article-title>
          .
          <source>In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Chapelle</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lihong</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>An empirical evaluation of thompson sampling</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>24</volume>
          :
          <fpage>2249</fpage>
          -
          <lpage>2257</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Li</given-names>
            <surname>Chen</surname>
          </string-name>
          , Marco De Gemmis, Alexander Felfernig, Pasquale Lops, Francesco Ricci, and
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Semeraro</surname>
          </string-name>
          .
          <article-title>Human decision making and recommender systems</article-title>
          .
          <source>ACM Transactions on Interactive Intelligent Systems (TiiS)</source>
          ,
          <volume>3</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Yarin</given-names>
            <surname>Gal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zoubin</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Dropout as a bayesian approximation: Representing model uncertainty in deep learning</article-title>
          .
          <source>In international conference on machine learning</source>
          , pages
          <fpage>1050</fpage>
          -
          <lpage>1059</lpage>
          . PMLR,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Graton</given-names>
            <surname>Gathright</surname>
          </string-name>
          , Roopesh Ranjan, Vasudev Rahul, Marshall Yan, and Fan Zhang.
          <article-title>Cross-channel attribution of consumer marketing</article-title>
          .
          <source>In Amazon Machine Learning Conference</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Thore</given-names>
            <surname>Graepel</surname>
          </string-name>
          , Joaquin Quinonero Candela, Thomas Borchert, and
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Herbrich</surname>
          </string-name>
          .
          <article-title>Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft's bing search engine</article-title>
          . In ICML,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Huifeng</given-names>
            <surname>Guo</surname>
          </string-name>
          , Ruiming Tang, Yunming Ye,
          <string-name>
            <given-names>Zhenguo</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiuqiang</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>Deepfm: a factorization-machine based neural network for ctr prediction</article-title>
          .
          <source>arXiv preprint arXiv:1703.04247</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Somit</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Ronny Kohavi, Diane Tang, Ya Xu, Reid Andersen, Eytan Bakshy, Niall Cardin, Sumita Chandran, Nanyu Chen,
          <string-name>
            <given-names>Dominic</given-names>
            <surname>Coey</surname>
          </string-name>
          , et al.
          <article-title>Top challenges from the first practical online controlled experiments summit</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>21</volume>
          (
          <issue>1</issue>
          ):
          <fpage>20</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>James J</given-names>
            <surname>Heckman</surname>
          </string-name>
          .
          <article-title>Sample selection bias as a specification error with an application to the estimation of labor supply functions</article-title>
          . Princeton University Press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          , Adith Swaminathan, and
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Schnabel</surname>
          </string-name>
          .
          <article-title>Unbiased learning-to-rank with biased feedback</article-title>
          .
          <source>In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining</source>
          , pages
          <fpage>781</fpage>
          -
          <lpage>789</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Kanase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yan</given-names>
            <surname>Zhao</surname>
          </string-name>
          , Shenghe Xu, Mitchell Goodman, Manohar Mandalapu, Benjamyn Ward, Chan Jeon, Shreya Kamath, Ben Cohen, Vlad Suslikov, Yujia Liu, Hengjia Zhang, Yannick Kimmel, Saad Khan, Brent Payne, and
          <string-name>
            <given-names>Patricia</given-names>
            <surname>Grao</surname>
          </string-name>
          .
          <article-title>An application of causal bandit to content optimization</article-title>
          .
          <source>In Proceedings of the 5th Workshop on Online Recommender Systems and User Modeling (ORSUM 2022), in conjunction with the 16th ACM Conference on Recommender Systems (RecSys 2022)</source>
          , Seattle, WA, USA,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Kireyev</surname>
          </string-name>
          , Koen Pauwels, and
          <string-name>
            <given-names>Sunil</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>Do display ads influence search? attribution and dynamics in online advertising</article-title>
          .
          <source>International Journal of Research in Marketing</source>
          ,
          <volume>33</volume>
          (
          <issue>3</issue>
          ):
          <fpage>475</fpage>
          -
          <lpage>490</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Sören R</given-names>
            <surname>Künzel</surname>
          </string-name>
          , Jasjeet S Sekhon, Peter J Bickel, and
          <string-name>
            <given-names>Bin</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Metalearners for estimating heterogeneous treatment effects using machine learning</article-title>
          .
          <source>Proceedings of the national academy of sciences</source>
          ,
          <volume>116</volume>
          (
          <issue>10</issue>
          ):
          <fpage>4156</fpage>
          -
          <lpage>4165</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Lihong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Chu</surname>
          </string-name>
          , John Langford, and
          <string-name>
            <given-names>Robert E</given-names>
            <surname>Schapire</surname>
          </string-name>
          .
          <article-title>A contextual-bandit approach to personalized news article recommendation</article-title>
          .
          <source>In Proceedings of the 19th international conference on World wide web</source>
          , pages
          <fpage>661</fpage>
          -
          <lpage>670</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Osband</surname>
          </string-name>
          , Charles Blundell, Alexander Pritzel, and Benjamin Van Roy.
          <article-title>Deep exploration via bootstrapped dqn</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>29</volume>
          :
          <fpage>4026</fpage>
          -
          <lpage>4034</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Roopesh</given-names>
            <surname>Ranjan</surname>
          </string-name>
          , Narayanan Sadagopan, and
          <string-name>
            <given-names>Guido</given-names>
            <surname>Imbens</surname>
          </string-name>
          .
          <article-title>A propensity matching approach to multi touch attribution</article-title>
          .
          <source>In Amazon Machine Learning Conference</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Thomas S</given-names>
            <surname>Richardson</surname>
          </string-name>
          , Yu Liu, James McQueen, and
          <string-name>
            <given-names>Doug</given-names>
            <surname>Hains</surname>
          </string-name>
          .
          <article-title>A bayesian model for online activity sample sizes</article-title>
          .
          <source>In International Conference on Artificial Intelligence and Statistics</source>
          , pages
          <fpage>1775</fpage>
          -
          <lpage>1785</lpage>
          . PMLR,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Paul R</given-names>
            <surname>Rosenbaum</surname>
          </string-name>
          and
          <string-name>
            <given-names>Donald B</given-names>
            <surname>Rubin</surname>
          </string-name>
          .
          <article-title>The central role of the propensity score in observational studies for causal effects</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>70</volume>
          (
          <issue>1</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Donald B</given-names>
            <surname>Rubin</surname>
          </string-name>
          .
          <article-title>Estimating causal effects of treatments in randomized and nonrandomized studies</article-title>
          .
          <source>Journal of educational Psychology</source>
          ,
          <volume>66</volume>
          (
          <issue>5</issue>
          ):
          <fpage>688</fpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Neela</given-names>
            <surname>Sawant</surname>
          </string-name>
          , Chitti Babu Namballa, Narayanan Sadagopan, and
          <string-name>
            <given-names>Houssam</given-names>
            <surname>Nassif</surname>
          </string-name>
          .
          <article-title>Multi-armed bandit framework for causal effect optimization</article-title>
          .
          <source>In Amazon Machine Learning Conference</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Neela</given-names>
            <surname>Sawant</surname>
          </string-name>
          , Chitti Babu Namballa, Narayanan Sadagopan, and
          <string-name>
            <given-names>Houssam</given-names>
            <surname>Nassif</surname>
          </string-name>
          .
          <article-title>Contextual multi-armed bandits for causal marketing</article-title>
          .
          <source>arXiv preprint arXiv:1810.01859</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Bradly C</given-names>
            <surname>Stadie</surname>
          </string-name>
          , Sergey Levine, and
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          .
          <article-title>Incentivizing exploration in reinforcement learning with deep predictive models</article-title>
          .
          <source>arXiv preprint arXiv:1507.00814</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Bo</given-names>
            <surname>Tan</surname>
          </string-name>
          , Pramod Muralidharan, Naveen Nair, Wenduo Wang, Shaurya Gupta, Jimmy Issac, Vignesh Kannappan, Prakash Bulusu, and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Leslie</surname>
          </string-name>
          .
          <article-title>Attribution of prime member signups to prime benefits</article-title>
          .
          <source>In Amazon Machine Learning Conference</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Wagstaff</surname>
          </string-name>
          , Pierella Paci, and Eddy Van Doorslaer.
          <article-title>On the measurement of inequalities in health</article-title>
          .
          <source>Social science &amp; medicine</source>
          ,
          <volume>33</volume>
          (
          <issue>5</issue>
          ):
          <fpage>545</fpage>
          -
          <lpage>557</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Hao</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Naiyan</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dit-Yan</given-names>
            <surname>Yeung</surname>
          </string-name>
          .
          <article-title>Collaborative deep learning for recommender systems</article-title>
          .
          <source>In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining</source>
          , pages
          <fpage>1235</fpage>
          -
          <lpage>1244</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Xuanhui</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bendersky</surname>
          </string-name>
          , Donald Metzler, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Najork</surname>
          </string-name>
          .
          <article-title>Learning to rank with selection bias in personal search</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>115</fpage>
          -
          <lpage>124</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Shenghe</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yan</given-names>
            <surname>Zhao</surname>
          </string-name>
          , Sameer Kanase, Mitchell Goodman, Saad Khan, Brent Payne, and
          <string-name>
            <given-names>Patricia</given-names>
            <surname>Grao</surname>
          </string-name>
          .
          <article-title>Machine learning attribution: Inferring item-level impact from slate recommendation in e-commerce</article-title>
          .
          <source>In KDD 2022 Workshop on First Content Understanding and Generation for e-Commerce</source>
          ,
          <year>2022</year>
          . URL https://www.amazon.science/publications/machine-learning-attribution-inferring-item-level-impact-from-slate-recommendation-in-ecommerce.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Zhenyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Totte</given-names>
            <surname>Harinen</surname>
          </string-name>
          .
          <article-title>Uplift modeling for multiple treatments with cost optimization</article-title>
          .
          <source>In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)</source>
          , pages
          <fpage>422</fpage>
          -
          <lpage>431</lpage>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Ziwei</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Yun He,
          <string-name>
            <given-names>Xing</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James</given-names>
            <surname>Caverlee</surname>
          </string-name>
          .
          <article-title>Popularity bias in dynamic recommendation</article-title>
          .
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>