Augmented Two-Stage Bandit Framework: Practical Approaches for Improved Online Ad Selection Seowon Han† , Ryan Lakritz† and Hanxiao Wu† Reddit Inc., 303 2nd Street, South Tower, Floor 5, San Francisco, CA 94107 Abstract In online advertising, maximizing user engagement and advertiser performance hinges on effective ad selection algorithms. Algorithms that tackle Multi-armed bandit problems, such as Thompson Sampling, excel in exploration, but their utilization of contextual information remains limited. Conversely, contextual bandit approaches personalize ad selection by leveraging user and ad-specific features. However, they perform poorly in contexts with limited data and often encounter cold start problems for new ad groups. To address this dilemma, we propose a novel bandit framework that combines context-free and context-aware rewards and is augmented with historical predicted performance, for which we use predicted click-through rate (pCTR) scores. We will refer to this bandit framework as the Augmented Two-Stage Bandit Framework. Our bandit framework is comprised of two stages. In the first stage, the framework applies context-free Thompson Sampling augmented by historical pCTR scores for initial exploration. The non-contextual bandit algorithm and generalized patterns recognized by our pCTR model allow for effective mitigation of the cold start problem. In the second stage, the framework shifts to a contextual bandit algorithm for refined exploration and exploitation. We demonstrate the efficacy of our proposed method using extensive simulation and experiments conducted on a real-world ads marketplace at Reddit. Compared to traditional bandit algorithms, our historical pCTR augmented Two-Stage Bandit framework achieves significant improvements in click-through rate. These findings underscore the ability of an Augmented Two-Stage Bandit Framework to enhance online ad selection and improve key performance metrics. Keywords Online advertising, bandit algorithms, reinforcement learning, contextual bandits, multi-armed bandits, deep neural networks, ad optimization, ad selection, ad retrieval 1. Introduction improvements in click-through rate and other key perfor- mance metrics, underscoring its potential to improve online This paper introduces a novel Augmented Two-Stage Ban- ad selection. dit Framework that improves click-through-rate prediction. We delve into the challenges of data sparsity and cold Our framework aims to optimize ad performance while han- starts, showcase the benefits of our pCTR-augmented ap- dling the inherent data sparsity that is introduced by newly proach, and present the compelling results of our exper- formed ads. Our framework brings two major novelties: iments. By bridging the gap between exploration and a two-stage approach and pCTR (predicted click-through exploitation, while enabling personalization and context- rate) augmentation. The former helps performance in data- awareness, the proposed method solves a significant issue sparse environments while the latter further helps with in intelligent ad selection today, ultimately benefiting both data-sparsity and improves overall outcomes. users and advertisers in the online advertising landscape. In the initial stage, we leverage a context-free bandit al- gorithm enhanced by historical pCTR scores to effectively explore candidate ads, even with limited data. The context- 2. Related Work free bandit algorithm we use is Thompson Sampling, but the general framework is adaptable to any context-free bandit Extensive research exists on exploration versus exploitation algorithm. This mitigates the cold start problem for new algorithms. ads, allowing them to quickly learn and adapt to user pref- Multi-Armed Bandit Algorithms (MABs): Many solu- erences. tions to the online ad selection problem fall squarely within As data accumulates, the framework transitions to a con- the domain of bandit algorithms. Traditional algorithms that textual bandit algorithm. We use Linear Thompson Sam- address the Multi-Armed Bandit problem, like Thompson pling to incorporate user and ad-specific features for refined Sampling, offer efficient exploration-exploitation trade-offs exploration and exploitation. This ensures personalized ad for selecting ads in dynamic environments. However, the selection that maximizes click-through rates and improves lack of context-awareness limits their ability to personalize overall campaign performance. recommendations based on user and ad-specific features Our proposed two-stage approach effectively addresses [1, 2]. data sparsity and imbalance while reaping the benefits of Contextual Bandit Algorithms: Recognizing the limita- personalized, context-aware selection with prior knowledge tions of MABs, contextual bandit solutions, such as LinUCB about predicted performance. We demonstrate the efficacy and Linear Thompson Sampling, are able to leverage ad- of this method through extensive simulations and real-world ditional information, such as user demographics and ad experiments on a large-scale ads marketplace. Compared to context, for personalized selection[1]. These features are traditional contextual bandit algorithms, the proposed Aug- typically integrated through embedding techniques or neu- mented Two-Stage Bandit Framework achieves significant ral networks, such as deep neural networks (DNNs) [3], which enable the model to learn complex interactions be- AdKDD’24: Barcelona, Spain tween the contextual information and ad performance [4]. † These authors contributed equally. While demonstrating superior performance compared to $ samantha.han@reddit.com (S. Han); ryan.lakritz@reddit.com MABs, their reliance on sufficient per-context data can hin- (R. Lakritz); sylvia.wu@reddit.com (H. Wu) der their effectiveness in data-scarce scenarios [5], which © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings introduces cold start issues in ad selection. model to all candidate ads. As such, the ad selection model Cold Start Solutions: Various approaches have been is an important component of our ad auction framework explored to address the cold start problem in bandit problem and requires a unique modeling solution. formulations[6, 7, 8, 2, 5]. Bayesian approaches leverage prior information from similar contexts to inform initial 3.2. Problem Framing decision-making. Others utilize Thompson Sampling with confidence bounds to prioritize exploration for new items. Ad selection can be formulated as a contextual bandit prob- However, these methods often rely on strong assumptions lem. At time t, an ad impression is requested by a user. The about data similarity or require careful parameter tuning, decision-making agent of the bandit will observe the feature limiting their ability to generalize [5]. vector of the impression, which is considered as context Deep Learning Integration: Recently, the integration 𝑠𝑡 ∈ 𝑆, and be presented with a set of eligible ads to choose of deep neural networks (DNNs) with bandit algorithms from action space 𝐴. The chosen ad, 𝑎𝑡 , is the action taken has shown promising results for improving ad selection by the agent, where 𝑎𝑡 ∈ 𝐴. The agent uses the learned performance [9, 10]. Proposed frameworks integrate these selection policy 𝜋 to choose an ad 𝑎𝑡 = 𝜋(𝑠𝑡 ) for the ad features in a non-linear manner, enabling principled explo- impression at time 𝑡 to maximize the expected click. The ration through techniques like neural collaborative filter- system collects the clicks generated by the agent’s chosen ing recommendation and inference time dropout, leading ad, denoted as reward 𝑟(𝑠, 𝑎). The given policy’s expected to improved performance in various applications [11, 12]. reward is denoted as 𝑟𝜋 (𝑠, 𝑎) = 𝐸[𝑟(𝑠, 𝑎)]. For each obser- However, their computational complexity and potential for vation at time 𝑡 agent improves its selection policy based overfitting remain challenges to be addressed. on the observation tuple < 𝑠𝑡 , 𝑎𝑡 , 𝑟𝜋 (𝑠𝑡 , 𝑎𝑡 ) >. Our proposed Augmented Two-Stage Bandit Framework The objective is to find an optimal ∑︀ policy 𝜋 that maxi- combines the strengths of context-free bandits for explo- mizes the expected total reward 𝐸[ 𝑇𝑡=1 𝑟(𝑠𝑡 , 𝑎𝑡 )]. During ration and historical pCTR scores for cold start mitigation, ad selection, the agent will select an ad that maximizes the and we aim to: expected reward given the observed context of an impres- sion: 𝑎𝑡 := arg max𝑎𝑡 ∈𝐴 𝐸[ 𝑇𝑡=1 𝑟(𝑠𝑡 , 𝑎𝑡 )] . The subse- ∑︀ • Bridge the gap between exploration and ex- quent sections will delve into the methodologies employed ploitation: Our approach effectively explores the ad to enhance the policy, ultimately aiming for improved click space in the initial phase with minimal data, while performance. seamlessly transitioning to context-aware decision- making as data accumulates [13]. • Address the cold start problem: The two-stage 4. Methods approach with pCTR augmentation mitigates the data sparsity issue for new agents, enabling them to 4.1. Two-Stage Bandit Framework quickly learn and adapt to user preferences. One notable constraint of contextual bandit algorithms is • Improve overall ad performance: By personal- high variance at the initial stages of learning and, as a result, izing recommendations based on user and ad fea- can over-emphasize exploration. Excessive exploration is tures, our framework seeks to achieve significant im- costly in an ads marketplace, resulting in sub-optimal ad provements in click-through rates and other key per- performance. formance metrics compared to existing approaches. To tackle this challenge, we introduce a two-stage bandit We selected features with a total cardinality of less framework that optimizes 𝑟𝑇 𝑆 (𝑠, 𝑎). The framework con- than 101 across all features, based on offline feature sists of a context-free reward 𝑟𝑀 𝐴𝐵 (𝑎) at the initial stage, importance analysis and online experimentation. and a context-aware reward 𝑟𝐶𝐵 (𝑠, 𝑎), which it switches The proposed bandit framework also has some drawbacks to as data is accumulated. The motivation behind this ap- to consider. It does not solve the problem that bandit algo- proach is that in ad selection, the best performing ad for rithms, in general, do not scale well with a large number of the overall marketplace is likely a better-than-average can- features. Another thing to consider is that it’s more difficult didate under different contexts. By initially relying on the to do an unbiased offline evaluation of bandit performance, context-free policy’s rewards when the context information compared to prediction models. This might lead to more is sparse, and then transitioning to the context-aware pol- work to develop a trustworthy offline evaluation framework icy’s rewards once it outperforms the context-free bandit or run more online experimentation, which comes with a policy, our proposed method aims to improve the perfor- cost. mance at the early stage while preserving the advantages of personalized recommendations in the long run. In order to switch between the context-free and con- 3. Problem textual reward, the variance is introduced to the frame- work as a measure of uncertainty and a degree of explo- 3.1. Background on Ad Auction Funnel ration. We utilize variance of the expected contextual re- wards collected from the first observation through time 𝑡 − We apply an ad selection model as part of our ad auction 1: 𝑉 𝑎𝑟𝑡 (𝑠, 𝑎) = 𝑉 𝑎𝑟1≤𝑖≤𝑡−1,(𝑠𝑖 ,𝑎𝑖 )=(𝑠,𝑎) (𝑟𝐶𝐵 (𝑠𝑖 , 𝑎𝑖 )). funnel. Its primary function is to select a subset of candi- This yields the variance of the contextual rewards for the date ads that will proceed to lower-funnel steps, including given state-action pair up to time 𝑡 − 1. inference by a heavy ranking (pCTR) model and final auc- The 𝑟𝑇 𝑆 (𝑠, 𝑎) uses context-free reward until 𝑉 𝑎𝑟𝑡 (𝑠, 𝑎)) tion ranking. The ad selection stage is necessary due to is below a threshold, denoted as 𝜏 . In practice, the threshold infrastructure constraints and the amount of computation 𝜏 was tuned using offline evaluation and online experimen- resources necessary to complete the lower-funnel steps. It is tation, such that it maximized the total reward of the Two- not cost-efficient or technically feasible to apply the pCTR {︃ Stage framework. We started with a wide range of testable 𝜏 𝑟𝐴−𝑀 𝐴𝐵 (𝑎𝑡 ) if 𝑉 𝑎𝑟𝑡 (𝑠, 𝑎) > 𝜏 𝑟𝐴−𝑇 𝑆 (𝑠𝑡 , 𝑎𝑡 ) = values and narrowed down to values that indicated stability. 𝑟𝐶𝐵 (𝑠𝑡 , 𝑎𝑡 ) otherwise From that point, we conducted online experiments which (3) tested smaller adjustments in 𝜏 to find an optimally tuned value. {︃ 5. Model Training and Serving 𝑟𝑀 𝐴𝐵 (𝑎𝑡 ) if 𝑉 𝑎𝑟𝑡 (𝑠, 𝑎) > 𝜏 𝑟𝑇 𝑆 (𝑠𝑡 , 𝑎𝑡 ) = (1) 𝑟𝐶𝐵 (𝑠𝑡 , 𝑎𝑡 ) otherwise 5.1. Training Establishing an online Reinforcement Learning environ- 4.2. Historical Predicted Click Performance ment for linear bandit algorithms is a significant infras- tructure investment that is difficult to balance with serving Our ad auction pipeline relies on the pCTR model to pre- latency constraints at scale. To mitigate these risks, we dict click-through rates, which has a deep neural network opted for a mini-batch training approach, with a training (DNN) model architecture. This architecture allows us to intervals of one hour. This training interval was decided capture complex interactions between user and ad features based on an offline simulation. and adapt to changing user behavior to enhance the accu- The offline simulation compares the performance of racy of click-through rate predictions. agents that are retrained at different frequencies. Figure In the heavy ranking stage, the pCTR model is used be- 1 below shows the normalized average reward, i.e. the sim- cause it can achieve higher accuracy with sufficient data ulated reward of the given agents divided by the simulated and features. It is more efficient at incorporating contexts, reward of an agent that is retrained at real-time. One hour such as user-specific and interaction features. However, a provided the best performance-to-cost balance, showing complex model like pCTR has its drawbacks, including la- marginal drop-off from 15 minutes but a large improvement tency and cost, as it requires more time and computational over daily retraining. resources to train, maintain, and use for inference. Infer- ence with the pCTR model in real-time for all ads entering the auction pipeline is computationally infeasible, given the latency constraints. In the ad selection stage of the early auction pipeline, the pCTR models — as well as other non-bandit algorithms — can suffer from feedback loops and selection bias [14]. Ban- dit algorithms are well-suited for ad selection in this early stage, where exploration is crucial, due to their emphasis on exploration versus exploitation. Additionally, bandit algo- rithms are advantageous in ad marketplaces where ads can be created or changed at any time, due to their real-time adaptability. These reasons are what lead us to utilizing a bandit algorithm, while incorporating some key strengths of the pCTR model. Figure 1: Average Reward at Given Retrain Frequencies 4.3. Incorporating Two-Stage Approach and pCTR into our Framework To address this limitation, we explored an alternative ap- 5.2. Serving proach that augments our two-stage framework with the pCTR model. We observed a remarkably high correlation As noted, the auction pipeline has very tight serving latency (r-squared = 0.9907) between pCTR and the estimated CTR constraints in an online ad serving environment. Our model of the following day, suggesting that the pCTR scores from inference service, which is called at every impression, con- the previous day can still effectively capture the underly- tains a framework that gets features from our online feature ing patterns and trends in user behavior. By incorporating store, routes requests between different models to compute pCTR scores as weights in our ad selection bandit frame- model inferences from our Augmented Two-Stage Bandit work, we harness the strengths of both models to improve Framework. the accuracy and efficiency of our ad auction pipeline. With our framework, we are able to achieve sub-2ms We introduce the previous day’s, pre-computed pCTR model inference time while serving in production. scores as weights to the policy’s reward with a multiplica- tive application for the context-free stage. Once the variance of the contextual bandit agent crosses the threshold 𝜏 , the 6. Experimentation framework switches to the context-aware stage. pCTR aug- mentation is no longer used at this stage in order to reduce 6.1. Methodology and Dataset infrastructure cost and improve latency. In order to evaluate the impact of our proposed method in The final reward function of our pCTR-augmented two- a real-world setting, we conduct an online experiment. Our stage bandit framework, denoted 𝑟𝐴−𝑇 𝑆 (𝑠, 𝑎) is defined control variant is the Contextual Bandit algorithm without as: alterations. We have three treatment variants: Two-Stage Bandit, Augmented Contextual Bandit, and Augmented Two- 𝑟𝐴−𝑀 𝐴𝐵 (𝑎𝑡 ) = 𝑟𝑀 𝐴𝐵 (𝑎𝑡 ) * 𝑝𝐶𝑇 𝑅(𝑎𝑡 ) (2) Stage Bandit. All of the models are initialized from a cold- start and are updated at an hourly cadence. The details of the highest overall performance across all ad groups. Our each model are illustrated in the next section. bandit framework effectively mitigates the cold-start prob- We use data collected in real-time from Reddit’s ad im- lem, enabling better CTR performance even during early pressions. To handle business logic within the auction sys- exploration. In contrast, the Augmented Contextual Bandit tem, the model is designed to evaluate at the ad group level; exhibited slight negative lift, highlighting the significance the model chooses an ad with the highest expected reward of integrating the two-stage bandit framework in achieving within an ad group for all ad groups passing through this these gains. Ultimately, the combination of historical pCTR part of the ad funnel. As such, we only include ad groups augmentation and the two-stage bandit framework demon- with more than one ad for the analysis. Then we evaluate strates a combined effect that surpasses the performance of the lift in click-through rate (CTR) of proposed methods either approach in isolation. versus control variant. The A/B experiment occurred over 7 days to achieve 7.2. CTR Lift in Data-Scarce Ad Groups statistical power and significance on the key metrics and to account for weekly seasonality. Figure 2 provides important insight into performance in cold start and data-scarcity by segmenting the performance using impressions per ad within the ad group. The ad groups 6.2. Model Variants with low number of impressions per ad indicate that the The following models are included in the online experiment. algorithms have less data to train and are prone from heavy Contextual Bandit: In the control variant, a contextual exploration in contextual bandit algorithms. In Figure 2, bandit algorithm, specifically, Linear Thompson Sampling each dot presents the lift for ad groups within the percentile [15], is used to select one ad per ad group for the ad selec- of cumulative impressions per ad within the ad group. tion requests. The control model has the same contextual The Augmented Two-Stage Bandit performs significantly awareness features and click-based reward compared to the better than baseline, particularly for low impression ad following treatment variants, but it does not apply two-stage groups. Specifically, for the "p10 Ad Groups" - which com- framework or augmentation. prise of ad groups with impressions per ad below the 10th Two-Stage Bandit: This is the bandit framework intro- percentile - our approach achieves 10.95% improvement in duced in Section 4.1 Equation 4.1. It relies on Thompson CTR compared to the baseline. Sampling during the initial stage and transitions to Linear Thompson Sampling as the variance of the contextual re- ward is below the threshold 𝜏 . Augmented Contextual Bandit: This is the Linear Thompson Sampling algorithm, with rewards that are aug- mented by pCTR. 𝑟𝐴−𝐶𝐵 (𝑠, 𝑎) = 𝑟𝐶𝐵 (𝑠, 𝑎) * 𝑝𝐶𝑇 𝑅(𝑎) (4) Augmented Two-Stage Bandit: This is the proposed bandit framework introduced in Section 4.3 Equation 3. This framework applies the pCTR augmentation to context-free Thompson Sampling and switches to Linear Thompson Sam- pling when the variance of the contextual reward is below the threshold 𝜏 . Figure 2: CTR Lift at Ad Group Percentile Based on Impressions per Ad 7. Results 7.1. Aggregate Results Table 1 summarizes the lift in click-through rate, a key per- 7.3. Cumulative CTR Lift Over Time formance metric in ads marketplace, and compares three Figure 3 shows the cumulative daily lift achieved by our pro- test variants against the control model. The results indicate posed methods. Notably, the Augmented Two-Stage Bandit a marginal decline in lift for the Augmented Contextual exhibits a pronounced lift during the initial two days of the Bandit, whereas the Augmented Two-Stage Bandit exhibits experiment where most of the ad groups have low impres- a modest increase in click-through rate (CTR). sion counts. This highlights the improved cold-start per- formance at the onset of model deployment. As more data CTR Lift is accumulated, the Augmented Two-Stage Bandit Frame- Control - work switches to Contextual Bandit and the lift decreases Augmented Two-Stage Bandit 0.97% as expected. Two-Stage Bandit 0.49% Augmented Contextual Bandit -0.12% Table 1 Relative CTR lift versus Control Notably, our proposed method, which integrates the two- stage bandit framework and pCTR augmentation, achieves 9. Acknowledgements Thank you to Bee Massi, Simon Kim, and Josh Cherry for reviewing and advising on the underlying two-stage and augmentation approaches that are utilized in this paper. References [1] J. Langford, T. Zhang, The epoch-greedy algorithm for multi-armed bandits with side information, in: J. Platt, D. Koller, Y. Singer, S. Roweis (Eds.), Advances in Neural Information Processing Systems, volume 20, Curran Associates, Inc., 2007. URL: https: //proceedings.neurips.cc/paper_files/paper/2007/file/ Figure 3: Cumulative CTR Lift versus Control 4b04a686b0ad13dce35fa99fa4161c65-Paper.pdf. [2] S. Caron, S. Bhagat, Mixing bandits: a recipe for im- proved cold-start recommendations in a social net- 7.4. Cumulative Click Volume Lift Over work, in: Proceedings of the 7th Workshop on Social Time Network Mining and Analysis, SNAKDD ’13, Associa- tion for Computing Machinery, New York, NY, USA, In addition to the lift in click-through rate, Figure 4 presents 2013. URL: https://doi.org/10.1145/2501025.2501029. a comparative analysis of click volume across the different doi:10.1145/2501025.2501029. variants, further reinforcing the benefits of the proposed [3] J. Chen, B. Sun, H. Li, H. Lu, X.-S. Hua, Deep approach. Our proposed strategy has consistently generated ctr prediction in display advertising, in: Proceed- the highest click volume throughout the experiment. The ings of the 24th ACM International Conference on increase in click volume, coupled with the lift in CTR un- Multimedia, MM ’16, Association for Computing derscores the holistic performance improvement achieved Machinery, New York, NY, USA, 2016, p. 811–820. by the proposed solution. URL: https://doi.org/10.1145/2964284.2964325. doi:10. 1145/2964284.2964325. [4] D. Chakrabarti, D. Agarwal, V. Josifovski, Contex- tual advertising by combining relevance with click feedback, in: Proceedings of the 17th International Conference on World Wide Web, WWW ’08, Associa- tion for Computing Machinery, New York, NY, USA, 2008, p. 417–426. URL: https://doi.org/10.1145/1367497. 1367554. doi:10.1145/1367497.1367554. [5] N. Silva, T. Silva, H. Werneck, L. Rocha, A. Pereira, User cold-start problem in multi-armed bandits: When the first recommendations guide the user’s experience, ACM Trans. Recomm. Syst. 1 (2023). URL: https://doi. org/10.1145/3554819. doi:10.1145/3554819. [6] L. Guo, J. Jin, H. Zhang, Z. Zheng, Z. Yang, Z. Xing, F. Pan, L. Niu, F. Wu, H. Xu, C. Yu, Y. Jiang, X. Zhu, Figure 4: Cumulative Lift in Click Volume versus Control We know what you want: An advertising strategy recommender system for online advertising, in: Pro- ceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, As- 8. Conclusion sociation for Computing Machinery, New York, NY, USA, 2021, p. 2919–2927. URL: https://doi.org/10.1145/ In this paper, we define a framework for improving upon 3447548.3467175. doi:10.1145/3447548.3467175. linear contextual bandit algorithms for online ad selection, [7] F. Pan, S. Li, X. Ao, P. Tang, Q. He, Warm up cold-start particularly by focusing on performance in cold-start and advertisements: Improving ctr predictions via learning data-scarce scenarios. The Augmented Two-Stage Bandit to learn id embeddings, in: Proceedings of the 42nd Framework is a novel approach to selecting personalized International ACM SIGIR Conference on Research and ads while leveraging exploration to address the cold-start Development in Information Retrieval, SIGIR’19, As- problem present in many personalized recommendation sociation for Computing Machinery, New York, NY, models. Our framework showed a significant CTR lift in ex- USA, 2019, p. 695–704. URL: https://doi.org/10.1145/ periments, with especially large improvements in ad groups 3331184.3331268. doi:10.1145/3331184.3331268. with fewer impressions. Our framework offers practical ap- [8] H. T. Nguyen, J. Mary, P. Preux, Cold-start problems plication to online serving with low-latency requirements in recommendation systems via contextual-bandit al- significantly improving key performance metrics in our ad gorithms, CoRR abs/1405.7544 (2014). URL: http: marketplace. //arxiv.org/abs/1405.7544. arXiv:1405.7544. [9] D. Zhou, L. Li, Q. Gu, Neural contextual bandits with UCB-based exploration, in: H. D. III, A. Singh (Eds.), Proceedings of the 37th International Con- ference on Machine Learning, volume 119 of Pro- ceedings of Machine Learning Research, PMLR, 2020, pp. 11492–11502. URL: https://proceedings.mlr.press/ v119/zhou20a.html. [10] Q. Shi, F. Xiao, D. Pickard, I. Chen, L. Chen, Deep neural network with linucb: A contextual bandit ap- proach for personalized recommendation, in: Com- panion Proceedings of the ACM Web Conference 2023, WWW ’23 Companion, Association for Computing Machinery, New York, NY, USA, 2023, p. 778–782. URL: https://doi.org/10.1145/3543873.3587684. doi:10. 1145/3543873.3587684. [11] M. Collier, H. U. Llorens, Deep contextual multi-armed bandits, CoRR abs/1807.09809 (2018). URL: http://arxiv. org/abs/1807.09809. arXiv:1807.09809. [12] M. Unger, A. Tuzhilin, A. Livne, Context-aware rec- ommendations based on deep learning frameworks, ACM Trans. Manage. Inf. Syst. 11 (2020). URL: https: //doi.org/10.1145/3386243. doi:10.1145/3386243. [13] S. Groman, B. Massi, S. Mathias, D. Lee, J. Taylor, Model-free and model-based influences in addiction- related behaviors, Biological Psychiatry 85 (2019). doi:10.1016/j.biopsych.2018.12.017. [14] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, X. He, Bias and debias in recommender sys- tem: A survey and future directions, CoRR abs/2010.03240 (2020). URL: https://arxiv.org/abs/2010. 03240. arXiv:2010.03240. [15] S. Agrawal, N. Goyal, Thompson sampling for contextual bandits with linear payoffs, CoRR abs/1209.3352 (2012). URL: http://arxiv.org/abs/1209. 3352. arXiv:1209.3352.