Augmented Two-Stage Bandit Framework: Practical Approaches
                         for Improved Online Ad Selection
                         Seowon Han† , Ryan Lakritz† and Hanxiao Wu†
                         Reddit Inc., 303 2nd Street, South Tower, Floor 5, San Francisco, CA 94107


                                         Abstract
                                         In online advertising, maximizing user engagement and advertiser performance hinges on effective ad selection algorithms. Algorithms
                                         that tackle Multi-armed bandit problems, such as Thompson Sampling, excel in exploration, but their utilization of contextual information
                                         remains limited. Conversely, contextual bandit approaches personalize ad selection by leveraging user and ad-specific features. However,
                                         they perform poorly in contexts with limited data and often encounter cold start problems for new ad groups. To address this dilemma,
                                         we propose a novel bandit framework that combines context-free and context-aware rewards and is augmented with historical predicted
                                         performance, for which we use predicted click-through rate (pCTR) scores. We will refer to this bandit framework as the Augmented
                                         Two-Stage Bandit Framework.
                                             Our bandit framework is comprised of two stages. In the first stage, the framework applies context-free Thompson Sampling
                                         augmented by historical pCTR scores for initial exploration. The non-contextual bandit algorithm and generalized patterns recognized
                                         by our pCTR model allow for effective mitigation of the cold start problem. In the second stage, the framework shifts to a contextual
                                         bandit algorithm for refined exploration and exploitation.
                                             We demonstrate the efficacy of our proposed method using extensive simulation and experiments conducted on a real-world ads
                                         marketplace at Reddit. Compared to traditional bandit algorithms, our historical pCTR augmented Two-Stage Bandit framework
                                         achieves significant improvements in click-through rate. These findings underscore the ability of an Augmented Two-Stage Bandit
                                         Framework to enhance online ad selection and improve key performance metrics.

                                         Keywords
                                         Online advertising, bandit algorithms, reinforcement learning, contextual bandits, multi-armed bandits, deep neural networks, ad
                                         optimization, ad selection, ad retrieval


                         1. Introduction                                                                                            improvements in click-through rate and other key perfor-
                                                                                                                                    mance metrics, underscoring its potential to improve online
                         This paper introduces a novel Augmented Two-Stage Ban-                                                     ad selection.
                         dit Framework that improves click-through-rate prediction.                                                    We delve into the challenges of data sparsity and cold
                         Our framework aims to optimize ad performance while han-                                                   starts, showcase the benefits of our pCTR-augmented ap-
                         dling the inherent data sparsity that is introduced by newly                                               proach, and present the compelling results of our exper-
                         formed ads. Our framework brings two major novelties:                                                      iments. By bridging the gap between exploration and
                         a two-stage approach and pCTR (predicted click-through                                                     exploitation, while enabling personalization and context-
                         rate) augmentation. The former helps performance in data-                                                  awareness, the proposed method solves a significant issue
                         sparse environments while the latter further helps with                                                    in intelligent ad selection today, ultimately benefiting both
                         data-sparsity and improves overall outcomes.                                                               users and advertisers in the online advertising landscape.
                            In the initial stage, we leverage a context-free bandit al-
                         gorithm enhanced by historical pCTR scores to effectively
                         explore candidate ads, even with limited data. The context-                                                2. Related Work
                         free bandit algorithm we use is Thompson Sampling, but the
                         general framework is adaptable to any context-free bandit                                                  Extensive research exists on exploration versus exploitation
                         algorithm. This mitigates the cold start problem for new                                                   algorithms.
                         ads, allowing them to quickly learn and adapt to user pref-                                                   Multi-Armed Bandit Algorithms (MABs): Many solu-
                         erences.                                                                                                   tions to the online ad selection problem fall squarely within
                            As data accumulates, the framework transitions to a con-                                                the domain of bandit algorithms. Traditional algorithms that
                         textual bandit algorithm. We use Linear Thompson Sam-                                                      address the Multi-Armed Bandit problem, like Thompson
                         pling to incorporate user and ad-specific features for refined                                             Sampling, offer efficient exploration-exploitation trade-offs
                         exploration and exploitation. This ensures personalized ad                                                 for selecting ads in dynamic environments. However, the
                         selection that maximizes click-through rates and improves                                                  lack of context-awareness limits their ability to personalize
                         overall campaign performance.                                                                              recommendations based on user and ad-specific features
                            Our proposed two-stage approach effectively addresses                                                   [1, 2].
                         data sparsity and imbalance while reaping the benefits of                                                     Contextual Bandit Algorithms: Recognizing the limita-
                         personalized, context-aware selection with prior knowledge                                                 tions of MABs, contextual bandit solutions, such as LinUCB
                         about predicted performance. We demonstrate the efficacy                                                   and Linear Thompson Sampling, are able to leverage ad-
                         of this method through extensive simulations and real-world                                                ditional information, such as user demographics and ad
                         experiments on a large-scale ads marketplace. Compared to                                                  context, for personalized selection[1]. These features are
                         traditional contextual bandit algorithms, the proposed Aug-                                                typically integrated through embedding techniques or neu-
                         mented Two-Stage Bandit Framework achieves significant                                                     ral networks, such as deep neural networks (DNNs) [3],
                                                                                                                                    which enable the model to learn complex interactions be-
                          AdKDD’24: Barcelona, Spain                                                                                tween the contextual information and ad performance [4].
                         †
                           These authors contributed equally.                                                                       While demonstrating superior performance compared to
                          $ samantha.han@reddit.com (S. Han); ryan.lakritz@reddit.com                                               MABs, their reliance on sufficient per-context data can hin-
                          (R. Lakritz); sylvia.wu@reddit.com (H. Wu)                                                                der their effectiveness in data-scarce scenarios [5], which
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                     Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
introduces cold start issues in ad selection.                     model to all candidate ads. As such, the ad selection model
   Cold Start Solutions: Various approaches have been             is an important component of our ad auction framework
explored to address the cold start problem in bandit problem      and requires a unique modeling solution.
formulations[6, 7, 8, 2, 5]. Bayesian approaches leverage
prior information from similar contexts to inform initial         3.2. Problem Framing
decision-making. Others utilize Thompson Sampling with
confidence bounds to prioritize exploration for new items.        Ad selection can be formulated as a contextual bandit prob-
However, these methods often rely on strong assumptions           lem. At time t, an ad impression is requested by a user. The
about data similarity or require careful parameter tuning,        decision-making agent of the bandit will observe the feature
limiting their ability to generalize [5].                         vector of the impression, which is considered as context
   Deep Learning Integration: Recently, the integration           𝑠𝑡 ∈ 𝑆, and be presented with a set of eligible ads to choose
of deep neural networks (DNNs) with bandit algorithms             from action space 𝐴. The chosen ad, 𝑎𝑡 , is the action taken
has shown promising results for improving ad selection            by the agent, where 𝑎𝑡 ∈ 𝐴. The agent uses the learned
performance [9, 10]. Proposed frameworks integrate these          selection policy 𝜋 to choose an ad 𝑎𝑡 = 𝜋(𝑠𝑡 ) for the ad
features in a non-linear manner, enabling principled explo-       impression at time 𝑡 to maximize the expected click. The
ration through techniques like neural collaborative filter-       system collects the clicks generated by the agent’s chosen
ing recommendation and inference time dropout, leading            ad, denoted as reward 𝑟(𝑠, 𝑎). The given policy’s expected
to improved performance in various applications [11, 12].         reward is denoted as 𝑟𝜋 (𝑠, 𝑎) = 𝐸[𝑟(𝑠, 𝑎)]. For each obser-
However, their computational complexity and potential for         vation at time 𝑡 agent improves its selection policy based
overfitting remain challenges to be addressed.                    on the observation tuple < 𝑠𝑡 , 𝑎𝑡 , 𝑟𝜋 (𝑠𝑡 , 𝑎𝑡 ) >.
   Our proposed Augmented Two-Stage Bandit Framework                 The objective is to find an optimal
                                                                                                       ∑︀ policy 𝜋 that maxi-
combines the strengths of context-free bandits for explo-         mizes the expected total reward 𝐸[ 𝑇𝑡=1 𝑟(𝑠𝑡 , 𝑎𝑡 )]. During
ration and historical pCTR scores for cold start mitigation,      ad selection, the agent will select an ad that maximizes the
and we aim to:                                                    expected reward given the observed      context of an impres-
                                                                  sion: 𝑎𝑡 := arg max𝑎𝑡 ∈𝐴 𝐸[ 𝑇𝑡=1 𝑟(𝑠𝑡 , 𝑎𝑡 )] . The subse-
                                                                                                 ∑︀
     • Bridge the gap between exploration and ex-                 quent sections will delve into the methodologies employed
       ploitation: Our approach effectively explores the ad       to enhance the policy, ultimately aiming for improved click
       space in the initial phase with minimal data, while        performance.
       seamlessly transitioning to context-aware decision-
       making as data accumulates [13].
     • Address the cold start problem: The two-stage              4. Methods
       approach with pCTR augmentation mitigates the
       data sparsity issue for new agents, enabling them to       4.1. Two-Stage Bandit Framework
       quickly learn and adapt to user preferences.               One notable constraint of contextual bandit algorithms is
     • Improve overall ad performance: By personal-               high variance at the initial stages of learning and, as a result,
       izing recommendations based on user and ad fea-            can over-emphasize exploration. Excessive exploration is
       tures, our framework seeks to achieve significant im-      costly in an ads marketplace, resulting in sub-optimal ad
       provements in click-through rates and other key per-       performance.
       formance metrics compared to existing approaches.             To tackle this challenge, we introduce a two-stage bandit
       We selected features with a total cardinality of less      framework that optimizes 𝑟𝑇 𝑆 (𝑠, 𝑎). The framework con-
       than 101 across all features, based on offline feature     sists of a context-free reward 𝑟𝑀 𝐴𝐵 (𝑎) at the initial stage,
       importance analysis and online experimentation.            and a context-aware reward 𝑟𝐶𝐵 (𝑠, 𝑎), which it switches
   The proposed bandit framework also has some drawbacks          to as data is accumulated. The motivation behind this ap-
to consider. It does not solve the problem that bandit algo-      proach is that in ad selection, the best performing ad for
rithms, in general, do not scale well with a large number of      the overall marketplace is likely a better-than-average can-
features. Another thing to consider is that it’s more difficult   didate under different contexts. By initially relying on the
to do an unbiased offline evaluation of bandit performance,       context-free policy’s rewards when the context information
compared to prediction models. This might lead to more            is sparse, and then transitioning to the context-aware pol-
work to develop a trustworthy offline evaluation framework        icy’s rewards once it outperforms the context-free bandit
or run more online experimentation, which comes with a            policy, our proposed method aims to improve the perfor-
cost.                                                             mance at the early stage while preserving the advantages
                                                                  of personalized recommendations in the long run.
                                                                     In order to switch between the context-free and con-
3. Problem                                                        textual reward, the variance is introduced to the frame-
                                                                  work as a measure of uncertainty and a degree of explo-
3.1. Background on Ad Auction Funnel                              ration. We utilize variance of the expected contextual re-
                                                                  wards collected from the first observation through time 𝑡 −
We apply an ad selection model as part of our ad auction          1: 𝑉 𝑎𝑟𝑡 (𝑠, 𝑎) = 𝑉 𝑎𝑟1≤𝑖≤𝑡−1,(𝑠𝑖 ,𝑎𝑖 )=(𝑠,𝑎) (𝑟𝐶𝐵 (𝑠𝑖 , 𝑎𝑖 )).
funnel. Its primary function is to select a subset of candi-      This yields the variance of the contextual rewards for the
date ads that will proceed to lower-funnel steps, including       given state-action pair up to time 𝑡 − 1.
inference by a heavy ranking (pCTR) model and final auc-             The 𝑟𝑇 𝑆 (𝑠, 𝑎) uses context-free reward until 𝑉 𝑎𝑟𝑡 (𝑠, 𝑎))
tion ranking. The ad selection stage is necessary due to          is below a threshold, denoted as 𝜏 . In practice, the threshold
infrastructure constraints and the amount of computation          𝜏 was tuned using offline evaluation and online experimen-
resources necessary to complete the lower-funnel steps. It is     tation, such that it maximized the total reward of the Two-
not cost-efficient or technically feasible to apply the pCTR
                                                                                        {︃
Stage framework. We started with a wide range of testable 𝜏                               𝑟𝐴−𝑀 𝐴𝐵 (𝑎𝑡 ) if 𝑉 𝑎𝑟𝑡 (𝑠, 𝑎) > 𝜏
                                                                    𝑟𝐴−𝑇 𝑆 (𝑠𝑡 , 𝑎𝑡 ) =
values and narrowed down to values that indicated stability.                              𝑟𝐶𝐵 (𝑠𝑡 , 𝑎𝑡 ) otherwise
From that point, we conducted online experiments which                                                                        (3)
tested smaller adjustments in 𝜏 to find an optimally tuned
value.
                      {︃                                          5. Model Training and Serving
                        𝑟𝑀 𝐴𝐵 (𝑎𝑡 ) if 𝑉 𝑎𝑟𝑡 (𝑠, 𝑎) > 𝜏
    𝑟𝑇 𝑆 (𝑠𝑡 , 𝑎𝑡 ) =                                   (1)
                        𝑟𝐶𝐵 (𝑠𝑡 , 𝑎𝑡 ) otherwise                  5.1. Training
                                                                  Establishing an online Reinforcement Learning environ-
4.2. Historical Predicted Click Performance                       ment for linear bandit algorithms is a significant infras-
                                                                  tructure investment that is difficult to balance with serving
Our ad auction pipeline relies on the pCTR model to pre-
                                                                  latency constraints at scale. To mitigate these risks, we
dict click-through rates, which has a deep neural network
                                                                  opted for a mini-batch training approach, with a training
(DNN) model architecture. This architecture allows us to
                                                                  intervals of one hour. This training interval was decided
capture complex interactions between user and ad features
                                                                  based on an offline simulation.
and adapt to changing user behavior to enhance the accu-
                                                                     The offline simulation compares the performance of
racy of click-through rate predictions.
                                                                  agents that are retrained at different frequencies. Figure
   In the heavy ranking stage, the pCTR model is used be-
                                                                  1 below shows the normalized average reward, i.e. the sim-
cause it can achieve higher accuracy with sufficient data
                                                                  ulated reward of the given agents divided by the simulated
and features. It is more efficient at incorporating contexts,
                                                                  reward of an agent that is retrained at real-time. One hour
such as user-specific and interaction features. However, a
                                                                  provided the best performance-to-cost balance, showing
complex model like pCTR has its drawbacks, including la-
                                                                  marginal drop-off from 15 minutes but a large improvement
tency and cost, as it requires more time and computational
                                                                  over daily retraining.
resources to train, maintain, and use for inference. Infer-
ence with the pCTR model in real-time for all ads entering
the auction pipeline is computationally infeasible, given the
latency constraints.
   In the ad selection stage of the early auction pipeline, the
pCTR models — as well as other non-bandit algorithms —
can suffer from feedback loops and selection bias [14]. Ban-
dit algorithms are well-suited for ad selection in this early
stage, where exploration is crucial, due to their emphasis on
exploration versus exploitation. Additionally, bandit algo-
rithms are advantageous in ad marketplaces where ads can
be created or changed at any time, due to their real-time
adaptability. These reasons are what lead us to utilizing a
bandit algorithm, while incorporating some key strengths
of the pCTR model.
                                                                  Figure 1: Average Reward at Given Retrain Frequencies
4.3. Incorporating Two-Stage Approach and
     pCTR into our Framework
To address this limitation, we explored an alternative ap-        5.2. Serving
proach that augments our two-stage framework with the
pCTR model. We observed a remarkably high correlation             As noted, the auction pipeline has very tight serving latency
(r-squared = 0.9907) between pCTR and the estimated CTR           constraints in an online ad serving environment. Our model
of the following day, suggesting that the pCTR scores from        inference service, which is called at every impression, con-
the previous day can still effectively capture the underly-       tains a framework that gets features from our online feature
ing patterns and trends in user behavior. By incorporating        store, routes requests between different models to compute
pCTR scores as weights in our ad selection bandit frame-          model inferences from our Augmented Two-Stage Bandit
work, we harness the strengths of both models to improve          Framework.
the accuracy and efficiency of our ad auction pipeline.              With our framework, we are able to achieve sub-2ms
   We introduce the previous day’s, pre-computed pCTR             model inference time while serving in production.
scores as weights to the policy’s reward with a multiplica-
tive application for the context-free stage. Once the variance
of the contextual bandit agent crosses the threshold 𝜏 , the
                                                                  6. Experimentation
framework switches to the context-aware stage. pCTR aug-
mentation is no longer used at this stage in order to reduce      6.1. Methodology and Dataset
infrastructure cost and improve latency.                          In order to evaluate the impact of our proposed method in
   The final reward function of our pCTR-augmented two-           a real-world setting, we conduct an online experiment. Our
stage bandit framework, denoted 𝑟𝐴−𝑇 𝑆 (𝑠, 𝑎) is defined          control variant is the Contextual Bandit algorithm without
as:                                                               alterations. We have three treatment variants: Two-Stage
                                                                  Bandit, Augmented Contextual Bandit, and Augmented Two-
         𝑟𝐴−𝑀 𝐴𝐵 (𝑎𝑡 ) = 𝑟𝑀 𝐴𝐵 (𝑎𝑡 ) * 𝑝𝐶𝑇 𝑅(𝑎𝑡 )          (2)    Stage Bandit. All of the models are initialized from a cold-
start and are updated at an hourly cadence. The details of      the highest overall performance across all ad groups. Our
each model are illustrated in the next section.                 bandit framework effectively mitigates the cold-start prob-
   We use data collected in real-time from Reddit’s ad im-      lem, enabling better CTR performance even during early
pressions. To handle business logic within the auction sys-     exploration. In contrast, the Augmented Contextual Bandit
tem, the model is designed to evaluate at the ad group level;   exhibited slight negative lift, highlighting the significance
the model chooses an ad with the highest expected reward        of integrating the two-stage bandit framework in achieving
within an ad group for all ad groups passing through this       these gains. Ultimately, the combination of historical pCTR
part of the ad funnel. As such, we only include ad groups       augmentation and the two-stage bandit framework demon-
with more than one ad for the analysis. Then we evaluate        strates a combined effect that surpasses the performance of
the lift in click-through rate (CTR) of proposed methods        either approach in isolation.
versus control variant.
   The A/B experiment occurred over 7 days to achieve           7.2. CTR Lift in Data-Scarce Ad Groups
statistical power and significance on the key metrics and to
account for weekly seasonality.                                 Figure 2 provides important insight into performance in
                                                                cold start and data-scarcity by segmenting the performance
                                                                using impressions per ad within the ad group. The ad groups
6.2. Model Variants
                                                                with low number of impressions per ad indicate that the
The following models are included in the online experiment.     algorithms have less data to train and are prone from heavy
   Contextual Bandit: In the control variant, a contextual      exploration in contextual bandit algorithms. In Figure 2,
bandit algorithm, specifically, Linear Thompson Sampling        each dot presents the lift for ad groups within the percentile
[15], is used to select one ad per ad group for the ad selec-   of cumulative impressions per ad within the ad group.
tion requests. The control model has the same contextual           The Augmented Two-Stage Bandit performs significantly
awareness features and click-based reward compared to the       better than baseline, particularly for low impression ad
following treatment variants, but it does not apply two-stage   groups. Specifically, for the "p10 Ad Groups" - which com-
framework or augmentation.                                      prise of ad groups with impressions per ad below the 10th
   Two-Stage Bandit: This is the bandit framework intro-        percentile - our approach achieves 10.95% improvement in
duced in Section 4.1 Equation 4.1. It relies on Thompson        CTR compared to the baseline.
Sampling during the initial stage and transitions to Linear
Thompson Sampling as the variance of the contextual re-
ward is below the threshold 𝜏 .
   Augmented Contextual Bandit: This is the Linear
Thompson Sampling algorithm, with rewards that are aug-
mented by pCTR.

          𝑟𝐴−𝐶𝐵 (𝑠, 𝑎) = 𝑟𝐶𝐵 (𝑠, 𝑎) * 𝑝𝐶𝑇 𝑅(𝑎)            (4)

   Augmented Two-Stage Bandit: This is the proposed
bandit framework introduced in Section 4.3 Equation 3. This
framework applies the pCTR augmentation to context-free
Thompson Sampling and switches to Linear Thompson Sam-
pling when the variance of the contextual reward is below
the threshold 𝜏 .
                                                                Figure 2: CTR Lift at Ad Group Percentile Based on Impressions
                                                                per Ad
7. Results
7.1. Aggregate Results
Table 1 summarizes the lift in click-through rate, a key per-   7.3. Cumulative CTR Lift Over Time
formance metric in ads marketplace, and compares three          Figure 3 shows the cumulative daily lift achieved by our pro-
test variants against the control model. The results indicate   posed methods. Notably, the Augmented Two-Stage Bandit
a marginal decline in lift for the Augmented Contextual         exhibits a pronounced lift during the initial two days of the
Bandit, whereas the Augmented Two-Stage Bandit exhibits         experiment where most of the ad groups have low impres-
a modest increase in click-through rate (CTR).                  sion counts. This highlights the improved cold-start per-
                                                                formance at the onset of model deployment. As more data
                                             CTR Lift           is accumulated, the Augmented Two-Stage Bandit Frame-
                 Control                          -             work switches to Contextual Bandit and the lift decreases
        Augmented Two-Stage Bandit             0.97%            as expected.
            Two-Stage Bandit                   0.49%
        Augmented Contextual Bandit           -0.12%
Table 1
Relative CTR lift versus Control

   Notably, our proposed method, which integrates the two-
stage bandit framework and pCTR augmentation, achieves
                                                                   9. Acknowledgements
                                                                   Thank you to Bee Massi, Simon Kim, and Josh Cherry for
                                                                   reviewing and advising on the underlying two-stage and
                                                                   augmentation approaches that are utilized in this paper.


                                                                   References
                                                                    [1] J. Langford, T. Zhang, The epoch-greedy algorithm
                                                                        for multi-armed bandits with side information,
                                                                        in: J. Platt, D. Koller, Y. Singer, S. Roweis (Eds.),
                                                                        Advances in Neural Information Processing Systems,
                                                                        volume 20, Curran Associates, Inc., 2007. URL: https:
                                                                        //proceedings.neurips.cc/paper_files/paper/2007/file/
Figure 3: Cumulative CTR Lift versus Control
                                                                        4b04a686b0ad13dce35fa99fa4161c65-Paper.pdf.
                                                                    [2] S. Caron, S. Bhagat, Mixing bandits: a recipe for im-
                                                                        proved cold-start recommendations in a social net-
7.4. Cumulative Click Volume Lift Over                                  work, in: Proceedings of the 7th Workshop on Social
     Time                                                               Network Mining and Analysis, SNAKDD ’13, Associa-
                                                                        tion for Computing Machinery, New York, NY, USA,
In addition to the lift in click-through rate, Figure 4 presents        2013. URL: https://doi.org/10.1145/2501025.2501029.
a comparative analysis of click volume across the different             doi:10.1145/2501025.2501029.
variants, further reinforcing the benefits of the proposed          [3] J. Chen, B. Sun, H. Li, H. Lu, X.-S. Hua, Deep
approach. Our proposed strategy has consistently generated              ctr prediction in display advertising, in: Proceed-
the highest click volume throughout the experiment. The                 ings of the 24th ACM International Conference on
increase in click volume, coupled with the lift in CTR un-              Multimedia, MM ’16, Association for Computing
derscores the holistic performance improvement achieved                 Machinery, New York, NY, USA, 2016, p. 811–820.
by the proposed solution.                                               URL: https://doi.org/10.1145/2964284.2964325. doi:10.
                                                                        1145/2964284.2964325.
                                                                    [4] D. Chakrabarti, D. Agarwal, V. Josifovski, Contex-
                                                                        tual advertising by combining relevance with click
                                                                        feedback, in: Proceedings of the 17th International
                                                                        Conference on World Wide Web, WWW ’08, Associa-
                                                                        tion for Computing Machinery, New York, NY, USA,
                                                                        2008, p. 417–426. URL: https://doi.org/10.1145/1367497.
                                                                        1367554. doi:10.1145/1367497.1367554.
                                                                    [5] N. Silva, T. Silva, H. Werneck, L. Rocha, A. Pereira,
                                                                        User cold-start problem in multi-armed bandits: When
                                                                        the first recommendations guide the user’s experience,
                                                                        ACM Trans. Recomm. Syst. 1 (2023). URL: https://doi.
                                                                        org/10.1145/3554819. doi:10.1145/3554819.
                                                                    [6] L. Guo, J. Jin, H. Zhang, Z. Zheng, Z. Yang, Z. Xing,
                                                                        F. Pan, L. Niu, F. Wu, H. Xu, C. Yu, Y. Jiang, X. Zhu,
Figure 4: Cumulative Lift in Click Volume versus Control                We know what you want: An advertising strategy
                                                                        recommender system for online advertising, in: Pro-
                                                                        ceedings of the 27th ACM SIGKDD Conference on
                                                                        Knowledge Discovery & Data Mining, KDD ’21, As-
8. Conclusion                                                           sociation for Computing Machinery, New York, NY,
                                                                        USA, 2021, p. 2919–2927. URL: https://doi.org/10.1145/
In this paper, we define a framework for improving upon                 3447548.3467175. doi:10.1145/3447548.3467175.
linear contextual bandit algorithms for online ad selection,        [7] F. Pan, S. Li, X. Ao, P. Tang, Q. He, Warm up cold-start
particularly by focusing on performance in cold-start and               advertisements: Improving ctr predictions via learning
data-scarce scenarios. The Augmented Two-Stage Bandit                   to learn id embeddings, in: Proceedings of the 42nd
Framework is a novel approach to selecting personalized                 International ACM SIGIR Conference on Research and
ads while leveraging exploration to address the cold-start              Development in Information Retrieval, SIGIR’19, As-
problem present in many personalized recommendation                     sociation for Computing Machinery, New York, NY,
models. Our framework showed a significant CTR lift in ex-              USA, 2019, p. 695–704. URL: https://doi.org/10.1145/
periments, with especially large improvements in ad groups              3331184.3331268. doi:10.1145/3331184.3331268.
with fewer impressions. Our framework offers practical ap-          [8] H. T. Nguyen, J. Mary, P. Preux, Cold-start problems
plication to online serving with low-latency requirements               in recommendation systems via contextual-bandit al-
significantly improving key performance metrics in our ad               gorithms, CoRR abs/1405.7544 (2014). URL: http:
marketplace.                                                            //arxiv.org/abs/1405.7544. arXiv:1405.7544.
                                                                    [9] D. Zhou, L. Li, Q. Gu, Neural contextual bandits
                                                                        with UCB-based exploration, in: H. D. III, A. Singh
     (Eds.), Proceedings of the 37th International Con-
     ference on Machine Learning, volume 119 of Pro-
     ceedings of Machine Learning Research, PMLR, 2020,
     pp. 11492–11502. URL: https://proceedings.mlr.press/
     v119/zhou20a.html.
[10] Q. Shi, F. Xiao, D. Pickard, I. Chen, L. Chen, Deep
     neural network with linucb: A contextual bandit ap-
     proach for personalized recommendation, in: Com-
     panion Proceedings of the ACM Web Conference 2023,
     WWW ’23 Companion, Association for Computing
     Machinery, New York, NY, USA, 2023, p. 778–782.
     URL: https://doi.org/10.1145/3543873.3587684. doi:10.
     1145/3543873.3587684.
[11] M. Collier, H. U. Llorens, Deep contextual multi-armed
     bandits, CoRR abs/1807.09809 (2018). URL: http://arxiv.
     org/abs/1807.09809. arXiv:1807.09809.
[12] M. Unger, A. Tuzhilin, A. Livne, Context-aware rec-
     ommendations based on deep learning frameworks,
     ACM Trans. Manage. Inf. Syst. 11 (2020). URL: https:
     //doi.org/10.1145/3386243. doi:10.1145/3386243.
[13] S. Groman, B. Massi, S. Mathias, D. Lee, J. Taylor,
     Model-free and model-based influences in addiction-
     related behaviors, Biological Psychiatry 85 (2019).
     doi:10.1016/j.biopsych.2018.12.017.
[14] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang,
     X. He,      Bias and debias in recommender sys-
     tem: A survey and future directions,             CoRR
     abs/2010.03240 (2020). URL: https://arxiv.org/abs/2010.
     03240. arXiv:2010.03240.
[15] S. Agrawal, N. Goyal, Thompson sampling for
     contextual bandits with linear payoffs,          CoRR
     abs/1209.3352 (2012). URL: http://arxiv.org/abs/1209.
     3352. arXiv:1209.3352.