<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Prediction Models with Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karol Radziszewski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piotr Ociepka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ringier Axel Springer Polska</institution>
          ,
          <addr-line>Warsaw/Kraków</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Warsaw University of Technology</institution>
          ,
          <addr-line>Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a large-scale news recommendation system implemented at Ringier Axel Springer Polska, focusing on enhancing prediction models with reinforcement learning techniques. The system, named Aureus, integrates a variety of algorithms, including multi-armed bandit methods and deep learning models based on large language models (LLMs). We detail the architecture and implementation of Aureus, emphasizing the significant improvements in online metrics achieved by combining ranking prediction models with reinforcement learning. The paper further explores the impact of mixing different models on key business performance indicators. Our approach effectively balances the need for personalized recommendations with the ability to adapt to rapidly changing news content, addressing common challenges such as the cold start problem and content freshness. The results of online evaluation demonstrate the effectiveness of the proposed system in a real-world production environment.</p>
      </abstract>
      <kwd-group>
        <kwd>Personalization</kwd>
        <kwd>News recommendations</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The complexity of news recommendation systems exceeds that of other systems due to the rapid
data updates and content obsolescence, with thousands of articles published daily. A key challenge is
addressing cold start users, as many visitors rely on cookie IDs without logging in.</p>
      <p>
        Traditional collaborative filtering methods, such as matrix factorization [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], face difficulties due to
the cold start problem, as they require multiple observations for each user. To overcome this, models using
external features have been proposed. Wang et al. introduced RippleNet, a deep learning model that
leverages external data via knowledge graphs, enabling recommendations even with minimal prior user
interactions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Reinforcement learning, particularly multi-armed bandit algorithms, offers another solution to the cold start problem.
This approach has been successfully used in our systems for several years [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Content-based filtering, another common methodology for building recommendation systems, is
crucial, particularly in news recommendations. Recent advances in Natural Language Processing (NLP),
such as pretrained models like GPT [6] and PolBERT [7], have enhanced the generation of personalized
recommendations through embeddings.</p>
      <p>However, these models are often large and costly. Wu et al. addressed this by introducing NewsBERT
[8], a distilled version of BERT tailored for the news domain, reducing model size and complexity.</p>
      <p>The core aim of recommendation systems is to boost user satisfaction, often measured by clicks
or time spent on the platform. While click modeling is straightforward, time-based metrics are more
complex. Covington et al. proposed a method to weight clicks based on time spent, implemented in
YouTube’s system [9].</p>
      <p>Our approach uniquely integrates bandit algorithms with traditional ranking models, creating an
adaptive news recommendation engine that combines the strengths of both multi-armed bandits and
deep learning models within a unified architecture.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>Over time, Aureus has expanded to incorporate a variety of recommendation algorithms and methods,
each characterized by distinct capabilities and limitations. This section provides a detailed overview of
several of these methods, followed by the introduction of a novel approach for aggregating multiple
recommendations into a unified output. This approach leverages the unique strengths of each constituent
algorithm while effectively mitigating the specific drawbacks associated with individual methods.</p>
      <sec id="sec-3-1">
        <title>3.1. Reinforcement Learning</title>
        <p>The initial application of Aureus was to automate the curation process for the Onet.pl news feed. The
method required the capability to rapidly collect user feedback, identify both short- and long-term
popularity trends, and recommend content that was both highly popular and engaging. Additionally,
the system needed to adapt to emerging articles as well as those experiencing a decline in user
engagement over time. Given the primary objective of automating the existing editorial workflow, the
recommendations were designed to be population-wide, independent of individual user preferences or
tastes.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Multi-armed Bandits</title>
          <p>Considering the outlined requirements, we selected multi-armed bandit algorithms as the foundation of
our approach. This class of methods is particularly suited for balancing the trade-off between
exploration – acquiring knowledge regarding each article’s performance and popularity – and exploitation –
recommending the highest-performing content. Moreover, multi-armed bandit algorithms possess the
capability to optimize a wide range of business-related Key Performance Indicators (KPIs), including
both continuous and discrete metrics. This flexibility makes them an ideal choice for the dynamic and
demanding environment of the publishing industry. Following extensive offline and online evaluations,
we identified Upper Confidence Bound [10] and Thompson Sampling [11] as the most effective bandit
methods for this application.</p>
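          <p>As a concrete illustration, Thompson Sampling over click/no-click feedback can be sketched with one Beta posterior per article. This is a minimal sketch under the usual Beta–Bernoulli assumptions; the counts and uniform prior below are illustrative, not Aureus internals.</p>

```python
import numpy as np

def thompson_select(clicks, impressions, rng):
    """Pick an article index by sampling each article's Beta posterior.

    clicks[i] and impressions[i] are observed counts for article i;
    Beta(1 + clicks, 1 + misses) is the posterior under a uniform prior.
    """
    clicks = np.asarray(clicks, dtype=float)
    misses = np.asarray(impressions, dtype=float) - clicks
    samples = rng.beta(1.0 + clicks, 1.0 + misses)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
# Three articles with 30/1000, 80/1000 and 10/200 observed clicks.
picks = [thompson_select([30, 80, 10], [1000, 1000, 200], rng) for _ in range(2000)]
best_share = picks.count(1) / len(picks)   # share of draws won by the 8% CTR article
```

          <p>Repeated selection concentrates traffic on the best-performing article while still occasionally exploring the noisier 10/200 arm, which is exactly the exploration–exploitation balance described above.</p>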
          <p>Nevertheless, the exclusion of individual user preferences emerged as a significant limitation of the
selected approach. To overcome this constraint, while preserving the robustness, simplicity,
trend-responsiveness, and cost- and time-efficiency of the bandit-based recommender system, we introduced
the concept of user segmentation.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. User Segmentation</title>
          <p>Segmentation involves dividing the entire user population into smaller, more homogeneous groups, each
consisting of users with similar tastes. By applying multi-armed bandit algorithms separately within
each segment, the recommendation process remains primarily popularity-based. However, through
segmentation, each user is presented with a set of articles that are most popular among individuals
with comparable interests, thereby enhancing the overall user experience.</p>
          <p>
            The initial approach to user segmentation was based on topic modeling [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Specifically, each article
was transformed into a simplified embedding using Latent Dirichlet Allocation (LDA). Subsequently,
user interest profiles were generated by averaging the LDA embeddings of the articles read by each
individual user. Then, user interest profiles were clustered using the k-Means algorithm.
          </p>
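          <p>A minimal numpy sketch of this pipeline (averaging article embeddings into user profiles, then clustering the profiles with k-means), using toy 2-D vectors in place of real LDA or Item2Vec embeddings; the data, dimensions, and cluster count are illustrative only:</p>

```python
import numpy as np

def user_profiles(article_emb, reading_history):
    """Average the embeddings of each user's read articles into one profile."""
    return np.stack([article_emb[ids].mean(axis=0) for ids in reading_history])

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every profile to every centroid, then reassign and update.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Toy data: 2-D "embeddings" of 4 articles covering two topics.
article_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
history = [[0, 1], [0], [2, 3], [3]]          # article ids read per user
profiles = user_profiles(article_emb, history)
_, segments = kmeans(profiles, k=2)           # one segment per user
```

          <p>Each segment then runs its own bandit instance, so the popularity signal is computed only over users with similar reading behaviour.</p>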
          <p>Although successful and efective, this method was soon enhanced by substituting LDA modeling with
Item2Vec embeddings [12]. This modification significantly simplified and accelerated the segmentation
process by eliminating the need for text analysis, thereby rendering the method language-agnostic.
Consequently, this improvement allows for the deployment of Aureus across digital publishers regardless
of the language in which they publish.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prediction Model</title>
        <p>Our models are based on the articles that a user has read within the last N days. In our experiments, we
use an arbitrary value of N = 30 days. We compute the user representation by averaging the embeddings
of these articles, generated by the pretrained PolBERT language model [7]. Subsequently, we develop two
types of models:</p>
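        <p>The user-representation step can be sketched as follows; the short toy vectors stand in for PolBERT article embeddings, and the cosine scoring anticipates the similarity model described next:</p>

```python
import numpy as np

def user_embedding(article_embs):
    """User representation: mean of the embeddings of recently read articles."""
    return np.mean(article_embs, axis=0)

def cosine_scores(user_vec, candidates):
    """Cosine similarity between the user vector and each candidate article."""
    user_norm = user_vec / np.linalg.norm(user_vec)
    cand_norms = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return cand_norms @ user_norm

read = np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])   # stand-ins for PolBERT vectors
candidates = np.array([[1.0, 0.1, 0.0], [0.0, 0.0, 1.0]])
scores = cosine_scores(user_embedding(read), candidates)
ranking = np.argsort(scores)[::-1]                     # best-scoring article first
```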
        <sec id="sec-3-2-1">
          <title>3.2.1. Similarity Model</title>
          <p>A simple model that compares the user embedding to article embeddings using cosine similarity. This was
our initial approach.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Deep Model</title>
          <p>We created a trainable model with user clicks as the target variable. Given our large and
imbalanced dataset, we sampled an equal number of clicked and unclicked articles to ensure balanced
data for evaluation. Using the neural network architecture shown in Figure 2, we seamlessly integrated
additional features into our recommender system, such as article length and other parameters. Since
our business KPI is a continuous variable, we also trained models with clicks weighted by this KPI,
similar to the approach described in [9]. Weighting by the business KPI resulted in an increase in this
KPI in online tests.</p>
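          <p>The balanced sampling and KPI weighting can be sketched as follows. The click and KPI arrays are toy data, and the weighting mirrors the YouTube-style scheme of [9] rather than reproducing our exact training code:</p>

```python
import numpy as np

def balanced_sample(clicked_idx, unclicked_idx, rng):
    """Downsample the majority class so clicks and non-clicks are balanced."""
    n = min(len(clicked_idx), len(unclicked_idx))
    pos = rng.choice(clicked_idx, size=n, replace=False)
    neg = rng.choice(unclicked_idx, size=n, replace=False)
    return np.concatenate([pos, neg])

def kpi_weights(clicks, kpi_values):
    """Weight each clicked example by its continuous business KPI
    (e.g. dwell time); unclicked examples keep unit weight."""
    return np.where(clicks == 1, kpi_values, 1.0)

rng = np.random.default_rng(0)
clicks = np.array([1, 1, 0, 0, 0, 0])
kpi = np.array([12.0, 3.0, 0.0, 0.0, 0.0, 0.0])   # e.g. seconds spent after click
sample = balanced_sample(np.where(clicks == 1)[0], np.where(clicks == 0)[0], rng)
weights = kpi_weights(clicks[sample], kpi[sample])  # per-example training weights
```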
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Ensemble Architecture</title>
        <p>We previously outlined two key components of our recommendation system. The reinforcement
learning module identifies popular and trending articles, while the prediction model captures individual
user preferences. For optimal user satisfaction, the recommendation system must integrate both
aspects. Relying solely on a popularity-based model neglects individual user preferences, whereas a
user-preference model may overlook trending articles, which are crucial in the news domain.</p>
        <p>We evaluated several methods for combining multiple recommendations, two of which advanced to
the online testing phase and are now employed in daily operations:
• Proportional Random Mixer — In this approach, each recommendation method is assigned
a target proportion within the final content set (e.g., 40% of recommendations from a bandit
algorithm and 60% from a deep learning model). For the k-th position in the final recommendation
list, an article is selected randomly from the k-th positions of the input recommendations, with
the selection probability proportional to the assigned target share.
• Weighted Average Mixer — In this method, each content item from the input recommendations
is associated with a score from the corresponding model. These scores are normalized to the range
[0.0, 1.0] to ensure equity. Each content item is then assigned a new score, which is a weighted
average of the scores from the input models, and the final recommendation list is ordered based
on these new scores.</p>
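        <p>A minimal sketch of the Weighted Average Mixer described above; the model names, raw scores, and mixing weights are illustrative:</p>

```python
import numpy as np

def minmax(scores):
    """Normalize raw model scores to the [0.0, 1.0] range."""
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def weighted_average_mix(model_scores, weights):
    """Combine per-item score vectors from several models into one ranking.

    model_scores: dict mapping model name to an array of scores over the
    same candidate items; weights: dict mapping model name to its share.
    Returns (item indices ordered best-first, mixed scores).
    """
    mixed = sum(weights[m] * minmax(s) for m, s in model_scores.items())
    return np.argsort(mixed)[::-1], mixed

scores = {
    "bandit": np.array([120.0, 80.0, 100.0]),   # e.g. raw popularity scores
    "deep":   np.array([0.9, 0.7, 0.1]),        # e.g. click probabilities
}
order, mixed = weighted_average_mix(scores, {"bandit": 0.4, "deep": 0.6})
```

        <p>Normalizing before averaging is what makes scores from heterogeneous models (raw counts vs. probabilities) comparable.</p>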
        <p>Online testing proved that the weighted average mixer performed significantly better. Consequently,
all results and conclusions presented in this paper are based on the weighted average mixer.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>In this section, we present a comprehensive evaluation of our proposed recommendation system. The
evaluation is divided into three subsections: Offline Evaluation, Online Evaluation, and Results.</p>
      <p>Offline Evaluation describes the performance metrics derived from historical data, allowing us to
assess the model’s predictive accuracy in a controlled environment. This process helps identify the best
model for subsequent online testing.</p>
      <p>Online Evaluation involves deploying the model in a live setting on Onet.pl, where we measure its
real-time effectiveness, user engagement metrics, and business KPI metrics.</p>
      <p>Finally, the Results subsection synthesizes the findings from both offline and online evaluations.</p>
      <sec id="sec-4-1">
        <title>4.1. Offline Setup</title>
        <sec id="sec-4-1-1">
          <title>4.1.1. Baselines</title>
          <p>Each time we develop a new architecture or introduce a new feature, we evaluate the models against at
least two baselines:
• random model — If our model does not outperform random recommendations, we conclude
that it is unsuited for production deployment.
• current production model — Our goal is to match or exceed the results of the current production
model. If the new model achieves comparable or better results, we proceed to test it in the online
environment.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Offline Evaluation Metrics</title>
          <p>We implemented both standard and custom ranking evaluation metrics on historical data. Each metric
is calculated with different values of k (primarily 3, 5, 10, 15, and 30). Our goal is to train a single model
that can be deployed for an extended period. Therefore, we validate our model on three different days:
one day, seven days, and thirty days after training. We utilize the following metrics:
• Standard Metrics — NDCG, Precision, Recall, Coverage and AUC
• Custom Metrics — We aim to optimize a continuous business KPI with a click prediction model,
so we calculate the average value of this KPI for a given ranking at k as our custom metric. These
are:
– Average Label Value — This metric considers all articles in the list.
– Average Positive Label Value — This metric considers only articles that were clicked.</p>
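          <p>Under the definitions above, the two custom metrics can be sketched as follows; the ranking and KPI labels are toy values:</p>

```python
import numpy as np

def average_label_value_at_k(label_values, ranking, k):
    """Mean KPI label over the top-k ranked articles, clicked or not."""
    top = ranking[:k]
    return float(np.mean(label_values[top]))

def average_positive_label_value_at_k(label_values, clicks, ranking, k):
    """Mean KPI label over the clicked articles within the top-k."""
    top = ranking[:k]
    clicked = top[clicks[top] == 1]
    return float(np.mean(label_values[clicked])) if len(clicked) else 0.0

labels = np.array([10.0, 0.0, 5.0, 2.0])   # toy continuous KPI per article
clicks = np.array([1, 0, 1, 0])
ranking = np.array([0, 2, 1, 3])           # model's predicted order
alv = average_label_value_at_k(labels, ranking, k=3)
aplv = average_positive_label_value_at_k(labels, clicks, ranking, k=3)
```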
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Online AB Tests</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Testing Setup</title>
          <p>A critical component of the Aureus system, alongside the recommendation models, is the A/B testing
engine. This engine facilitates statistically significant and fair online testing of multiple recommendation
approaches. Users are randomly and stably assigned to one of the testing variants, independent of user
agent, demographic factors, or other variables that might influence the test results. During the test
period, each user is exclusively presented with recommendations generated by the model associated
with their assigned variant. Key performance indicator (KPI) values for content are collected and
recorded according to the testing variants, enabling subsequent analysis and comparison.</p>
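          <p>Stable, user-agent-independent assignment of this kind is commonly implemented by hashing a persistent user identifier into a variant bucket. The sketch below shows the idea; the hash choice, salt, and variant names are illustrative, not the Aureus implementation:</p>

```python
import hashlib

def assign_variant(user_id, variants, salt="experiment-42"):
    """Deterministically map a user id to a test variant.

    The same user always lands in the same bucket, and the hash makes the
    split independent of user agent, geography, or sign-up order.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The assignment is stable across repeated visits by the same cookie id.
v1 = assign_variant("cookie-abc123", ["bandit", "bandit+deep"])
v2 = assign_variant("cookie-abc123", ["bandit", "bandit+deep"])
```

          <p>Changing the salt reshuffles all users, which is how independent experiments avoid inheriting each other's splits.</p>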
          <p>It is important to note that the online tests presented in this paper were conducted on a curated sample
of users and focused specifically on a designated section of the webpage (recommendations displayed
beneath articles). As such, the results may not fully generalize to similar experiments conducted under
different conditions or in other areas of the webpage.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Online Monitoring</title>
          <p>When the model is deployed in a production environment, we continuously monitor its performance
with respect to business KPIs and latency. We enforce a stringent latency threshold, beyond which
the recommendations generated by the model would not be utilized. To track these online metrics, we
employ AWS QuickSight for business-related metrics and Grafana for technical metrics.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>We implemented our models in two production environments: the Onet.pl homepage and article pages
(with recommendations below each article). The model that achieves the status of
“king of the hill” is determined by the results of online testing. This approach allows for an evaluation of not
only the model’s performance but also its alignment with the actual needs of users. In the following
section, we compare several models employed by Aureus:
• random sample from the set of articles,
• Thompson Sampling bandit (our gold standard of recommenders),
• Thompson Sampling bandit with user segmentation enabled,
• items’ cosine similarity to the currently read article,
• segmented bandit mixed with item-to-item similarity model,
• segmented bandit mixed with user-to-item deep model,
• segmented bandit mixed with both item-to-item similarity and user-to-item deep model.
For comparison, we use two main metrics:
• uplift – the percentage difference between the average business KPI value of pieces of content
returned by the tested model and that returned by the baseline model,
• latency – the median response time, measured in milliseconds, of the tested model; this auxiliary
metric serves as a sanity check to ensure that the news website provides users with reasonably
responsive performance.</p>
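        <p>The uplift metric defined above is simply a relative difference against the baseline; for example:</p>

```python
def uplift_percent(kpi_tested, kpi_baseline):
    """Percentage difference of the tested model's average KPI vs. the baseline."""
    return 100.0 * (kpi_tested - kpi_baseline) / kpi_baseline

# Toy values: tested model averages 11.5 on the KPI, baseline averages 10.0.
u = uplift_percent(11.5, 10.0)
```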
        <p>Table 2 presents the results observed during our online testing process. The data clearly demonstrate
the synergy effect of the ensembled models, which consistently outperform the individual models. It
is also noteworthy that the offline evaluations differ slightly from the online test results, where the
combination of similarity-based methods with bandits slightly outperformed the deep model mixed with
bandits. From our experience, this discrepancy is common in the context of news and time-sensitive
content, where deep models alone may struggle to capture the temporal dynamics.</p>
        <p>In terms of latency, deep models substantially increase the response times of the Aureus
recommender. However, this increase remains within acceptable limits and does not negatively impact the
user experience. Furthermore, incorporating more than two models in the mixing process does not
significantly extend response times.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>We demonstrated an enhancement of recommendation systems by integrating multiple models into a
unified architecture. This hybrid approach facilitates the seamless incorporation of new recommendation
scores, enabling the modeling of diverse recommendation aspects. Future work will involve the
exploration of additional features, different mixing strategies, and various embedding models to further
refine the system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Wirtualnemedia.pl, Strony główne po zmianach w Mediapanelu. WP wyprzedziła Onet, 2024. URL: https://www.wirtualnemedia.pl/artykul/najpopularniejsze-serwisy-strony-glowne-wp-plonet-pl-interia-gazeta-pl, last visited on 2024-08-30.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] similarweb.com, Top websites ranking. Most visited news &amp; media publishers websites, 2024. URL: https://www.similarweb.com/top-websites/news-and-media/, last visited on 2024-09-06.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, 2008, pp. 263–272. doi:10.1109/ICDM.2008.22.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, M. Guo, RippleNet: Propagating user preferences on the knowledge graph for recommender systems, 2018, pp. 417–426. doi:10.1145/3269206.3271739.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Misztal-Radecka, D. Rusiecki, M. Żmuda, A. Bujak, Trend-responsive user segmentation enabling traceable publishing insights. A case study of a real-world large-scale news recommendation system, in: Proceedings of the 7th International Workshop on News Recommendation and Analytics in conjunction with the 13th ACM Conference on Recommender Systems, INRA @ RecSys 2019, Copenhagen, Denmark, September 20, 2019, volume 2554 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 53–62. URL: http://ceur-ws.org/Vol-2554/paper_08.pdf.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] OpenAI, New embedding models and API updates, 2024. URL: https://openai.com/index/new-embedding-models-and-api-updates/, last visited on 2024-08-30.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. Kłeczek, Polbert: Attacking Polish NLP tasks with transformers, in: M. Ogrodniczuk, Ł. Kobyliński (Eds.), Proceedings of the PolEval 2020 Workshop, Institute of Computer Science, Polish Academy of Sciences, 2020.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] C. Wu, F. Wu, Y. Yu, T. Qi, Y. Huang, Q. Liu, NewsBERT: Distilling pre-trained language model for intelligent news application, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 3285–3295. URL: https://aclanthology.org/2021.findings-emnlp.280. doi:10.18653/v1/2021.findings-emnlp.280.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA, 2016.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] P. Auer, Using confidence bounds for exploitation-exploration trade-offs, Journal of Machine Learning Research (2002) 397–422.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. Agrawal, N. Goyal, Analysis of Thompson sampling for the multi-armed bandit problem, in: S. Mannor, N. Srebro, R. C. Williamson (Eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23, 2012, pp. 39.1–39.26. URL: https://proceedings.mlr.press/v23/agrawal12.html.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] O. Barkan, N. Koenigstein, Item2vec: Neural item embedding for collaborative filtering, in: 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016, pp. 1–6. doi:10.1109/MLSP.2016.7738886.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>