<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Enhancing Prediction Models with Reinforcement Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Karol</forename><surname>Radziszewski</surname></persName>
							<email>karol.radziszewski@ringieraxelspringer.pl</email>
							<affiliation key="aff0">
								<orgName type="department">Ringier Axel Springer Polska</orgName>
								<address>
									<settlement>Warsaw, Kraków</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Warsaw University of Technology</orgName>
								<address>
									<settlement>Warsaw</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Piotr</forename><surname>Ociepka</surname></persName>
							<email>piotr.ociepka@ringieraxelspringer.pl</email>
							<affiliation key="aff0">
								<orgName type="department">Ringier Axel Springer Polska</orgName>
								<address>
									<settlement>Warsaw, Kraków</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Enhancing Prediction Models with Reinforcement Learning</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0CF17538B53FA24976683E875A705B79</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:29+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Personalization</term>
					<term>News recommendations</term>
					<term>Reinforcement Learning</term>
					<term>Deep Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present a large-scale news recommendation system implemented at Ringier Axel Springer Polska, focusing on enhancing prediction models with reinforcement learning techniques. The system, named Aureus, integrates a variety of algorithms, including multi-armed bandit methods and deep learning models based on large language models (LLMs). We detail the architecture and implementation of Aureus, emphasizing the significant improvements in online metrics achieved by combining ranking prediction models with reinforcement learning. The paper further explores the impact of mixing different models on key business performance indicators. Our approach effectively balances the need for personalized recommendations with the ability to adapt to rapidly changing news content, addressing common challenges such as the cold start problem and content freshness. The results of online evaluation demonstrate the effectiveness of the proposed system in a real-world production environment.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Ringier Axel Springer Polska is among the largest media companies in Poland, operating the news website www.onet.pl. Onet.pl attracts approximately 6 million unique users monthly, representing about 20% of Polish internet users <ref type="bibr" target="#b0">[1]</ref>. According to SimilarWeb, Onet is the 18th largest news website globally <ref type="bibr" target="#b1">[2]</ref>. Our recommendation system, called Aureus, processes over a thousand requests per second, necessitating low latency to ensure users experience minimal wait times for website loading.</p><p>Aureus comprises a variety of recommendation components, including user segmentation and reinforcement learning (a popularity-based component). Over the past year, we implemented additional modules responsible for content similarity and deep learning models based on large language models (LLMs) to capture individual preferences.</p><p>In this article, we focus on describing the architecture of a real-world large-scale news recommendation system, in particular:</p><p>• we show that combining ranking prediction models with reinforcement learning significantly improves online metrics, • we further analyze different aspects of model configuration and training objectives concerning multiple business KPIs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The complexity of news recommendation systems exceeds that of other systems due to the rapid data updates and content obsolescence, with thousands of articles published daily. A key challenge is addressing cold start users, as many visitors rely on cookie IDs without logging in.</p><p>Traditional collaborative filtering methods, such as matrix factorization <ref type="bibr" target="#b2">[3]</ref>, face difficulties due to the cold start problem, requiring multiple observations for each user. To overcome this, models using external features have been proposed. Wang et al. introduced RippleNet, a deep learning model that leverages external data via knowledge graphs, enabling recommendations with minimal prior user interactions <ref type="bibr" target="#b3">[4]</ref>.</p><p>Reinforcement learning, particularly multi-armed bandit algorithms, offers another solution to the cold start problem. This approach has been successfully used in our systems for several years <ref type="bibr" target="#b4">[5]</ref>.</p><p>Content-based filtering, another core methodology for building recommendation systems, is particularly important in news recommendations. Recent Natural Language Processing (NLP) advances, such as pretrained models like GPT <ref type="bibr" target="#b5">[6]</ref> and Polbert <ref type="bibr" target="#b6">[7]</ref>, have enhanced the generation of personalized recommendations through embeddings.</p><p>However, these models are often large and costly. Wu et al. addressed this by introducing NewsBERT <ref type="bibr" target="#b7">[8]</ref>, a distilled version of BERT tailored for the news domain, reducing model size and complexity.</p><p>The core aim of recommendation systems is to boost user satisfaction, often measured by clicks or time spent on the platform. While click modeling is straightforward, time-based metrics are more complex. Covington et al. 
proposed a method to weight clicks based on time spent, implemented in YouTube's system <ref type="bibr" target="#b8">[9]</ref>.</p><p>Our approach uniquely integrates bandit algorithms with traditional ranking models, creating an adaptive news recommendation engine that combines the strengths of both multi-armed bandits and deep learning models within a unified architecture.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed Approach</head><p>Over time, Aureus has expanded to incorporate a variety of recommendation algorithms and methods, each characterized by distinct capabilities and limitations. This section provides a detailed overview of several of these methods, followed by the introduction of a novel approach for aggregating multiple recommendations into a unified output. This approach leverages the unique strengths of each constituent algorithm while effectively mitigating the specific drawbacks associated with individual methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Reinforcement Learning</head><p>The initial application of Aureus was to automate the curation process for the Onet.pl news feed. The method required the capability to rapidly collect user feedback, identify both short-and long-term popularity trends, and recommend content that was both highly popular and engaging. Additionally, the system needed to adapt to emerging articles as well as those experiencing a decline in user engagement over time. Given the primary objective of automating the existing editorial workflow, the recommendations were designed to be population-wide, independent of individual user preferences or tastes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Multi-armed Bandits</head><p>Considering the outlined requirements, we selected multi-armed bandit algorithms as the foundation of our approach. This class of methods is particularly suited for balancing the trade-off between exploration (acquiring knowledge regarding each article's performance and popularity) and exploitation (recommending the highest-performing content). Moreover, multi-armed bandit algorithms possess the capability to optimize a wide range of business-related Key Performance Indicators (KPIs), including both continuous and discrete metrics. This flexibility makes them an ideal choice for the dynamic and demanding environment of the publishing industry. Following extensive offline and online evaluations, we identified Upper Confidence Bound <ref type="bibr" target="#b9">[10]</ref> and Thompson Sampling <ref type="bibr" target="#b10">[11]</ref> as the most effective bandit methods for this application.</p><p>Nevertheless, the exclusion of individual user preferences emerged as a significant limitation of the selected approach. To overcome this constraint, while preserving the robustness, simplicity, trend-responsiveness, and cost- and time-efficiency of the bandit-based recommender system, we introduced the concept of user segmentation.</p></div>
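To make the exploration-exploitation mechanics concrete, the following is a minimal Beta-Bernoulli Thompson Sampling sketch that ranks articles by sampled click-through rate. It is an illustrative simplification under assumed binary (click/no-click) feedback, not the production Aureus implementation, and the class and method names are hypothetical.

```python
import random

class ThompsonSamplingRecommender:
    """Minimal Beta-Bernoulli Thompson Sampling sketch (illustrative only)."""

    def __init__(self):
        # Per-article Beta posterior parameters, starting from a uniform prior.
        self.alpha = {}  # 1 + observed clicks
        self.beta = {}   # 1 + observed non-clicks

    def add_article(self, article_id):
        # New articles start with an uninformative Beta(1, 1) prior,
        # which naturally encourages exploration of fresh content.
        self.alpha.setdefault(article_id, 1.0)
        self.beta.setdefault(article_id, 1.0)

    def recommend(self, k):
        # Draw one plausible CTR per article from its posterior,
        # then rank articles by the sampled values.
        samples = {a: random.betavariate(self.alpha[a], self.beta[a])
                   for a in self.alpha}
        return sorted(samples, key=samples.get, reverse=True)[:k]

    def update(self, article_id, clicked):
        # Bayesian update of the article's click-rate posterior.
        if clicked:
            self.alpha[article_id] += 1
        else:
            self.beta[article_id] += 1
```

Popular articles accumulate high alpha counts and dominate the ranking, while uncertain (new) articles still occasionally sample high and get exposure.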
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">User Segmentation</head><p>Segmentation involves dividing the entire user population into smaller, more homogeneous groups, each consisting of users with similar tastes. By applying multi-armed bandit algorithms separately within each segment, the recommendation process remains primarily popularity-based. However, through segmentation, each user is presented with a set of articles that are most popular among individuals with comparable interests, thereby enhancing the overall user experience.</p><p>The initial approach to user segmentation was based on topic modeling <ref type="bibr" target="#b4">[5]</ref>. Specifically, each article was transformed into a simplified embedding using Latent Dirichlet Allocation (LDA). Subsequently, user interest profiles were generated by averaging the LDA embeddings of the articles read by each individual user. Then, user interest profiles were clustered using the k-Means algorithm.</p><p>Although successful and effective, this method was soon enhanced by substituting LDA modeling with Item2Vec embeddings <ref type="bibr" target="#b11">[12]</ref>. This modification significantly simplified and accelerated the segmentation process by eliminating the need for text analysis, thereby rendering the method language-agnostic. Consequently, this improvement allows for the deployment of Aureus across digital publishers regardless of the language in which they publish.</p></div>
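The two per-user steps of this pipeline (averaging article embeddings into an interest profile, then assigning the profile to a cluster) can be sketched as follows. The sketch assumes precomputed Item2Vec article embeddings and already-fitted k-Means centroids; in practice the embeddings and centroids would come from dedicated training jobs, and the helper names are illustrative.

```python
from math import dist

def user_profile(read_article_ids, article_embeddings):
    """Average the embeddings of the articles a user has read
    into a single interest-profile vector."""
    vecs = [article_embeddings[a] for a in read_article_ids]
    n = len(vecs)
    return [sum(component) / n for component in zip(*vecs)]

def assign_segment(profile, centroids):
    """Assign a user profile to the nearest k-Means centroid
    (Euclidean distance), i.e. pick the user's segment."""
    return min(range(len(centroids)),
               key=lambda i: dist(profile, centroids[i]))
```

Each segment then runs its own bandit instance, so recommendations stay popularity-based but are computed within groups of like-minded users.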
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Prediction Model</head><p>Our models are based on articles that a user has read within the last N days. In our experiments, we use an arbitrary value of N = 30 days. We calculate the user representation by averaging the embeddings of these articles, created with the pretrained Polbert language model <ref type="bibr" target="#b6">[7]</ref>. Subsequently, we develop two types of models:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Similarity Model</head><p>A simple model that compares user embedding to article embeddings using cosine similarity. This was our initial approach.</p></div>
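A minimal sketch of this similarity model, assuming the averaged user embedding and candidate article embeddings are already available; function names are illustrative.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_similarity(user_embedding, candidates):
    """candidates: {article_id: embedding}. Return article IDs
    ordered by cosine similarity to the user embedding."""
    return sorted(candidates,
                  key=lambda a: cosine(user_embedding, candidates[a]),
                  reverse=True)
```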
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Deep Model</head><p>We trained a model with user clicks as the target variable. Given our large and imbalanced dataset, we sampled an equal number of clicked and unclicked articles to ensure balanced data for evaluation. Using the neural network architecture shown in Figure <ref type="figure" target="#fig_1">2</ref>, we seamlessly integrated additional features into our recommender system, such as article length and other parameters. Since our business KPI is a continuous variable, we also trained models with clicks weighted by this KPI, similar to the approach described in <ref type="bibr" target="#b8">[9]</ref>. Weighting by the business KPI resulted in an increase in this KPI in online tests.</p></div>
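The KPI-weighting idea can be illustrated with a stand-in logistic regression trained by SGD on a weighted log-loss; the actual model (Figure 2) is a deeper neural network, so this is only a sketch of how a continuous KPI scales each click's contribution to the gradient, in the spirit of the weighting in [9].

```python
import math
import random

def train_weighted_click_model(samples, dim, epochs=5, lr=0.1):
    """Logistic-regression sketch of KPI-weighted click training.

    samples: list of (features, clicked, kpi_weight) tuples, where
    `clicked` is 0/1 and `kpi_weight` scales that sample's loss so
    high-KPI clicks influence the model more. Illustrative only.
    """
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        random.shuffle(samples)
        for x, y, weight in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient of weighted log-loss: a click effectively
            # counts `weight` times.
            g = weight * (p - y)
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b
```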
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Model Ensemble Architecture</head><p>We previously outlined two key components of our recommendation system. The reinforcement learning module identifies popular and trending articles, while the prediction model captures individual user preferences. For optimal user satisfaction, the recommendation system must integrate both aspects. Relying solely on a popularity-based model neglects individual user preferences, whereas a user-preference model may overlook trending articles, which are crucial in the news domain.</p><p>We evaluated several methods for combining multiple recommendations, two of which advanced to the online testing phase and are now employed in daily operations:</p><p>• Proportional Random Mixer -In this approach, each recommendation method is assigned a target proportion within the final content set (e.g., 40% of recommendations from a bandit algorithm and 60% from a deep learning model). For the k-th position in the final recommendation list, an article is selected randomly from the k-th positions of the input recommendations, with the selection probability proportional to the assigned target share. • Weighted Average Mixer -In this method, each content item from the input recommendations is associated with a score from the corresponding model. These scores are normalized to the range [0.0, 1.0] to make them comparable across models. Each content item is then assigned a new score, which is a weighted average of the scores from the input models, and the final recommendation list is ordered based on these new scores.</p><p>Online testing proved that the weighted average mixer performed significantly better. Consequently, all results and conclusions presented in this paper are based on the weighted average mixer.</p></div>
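A minimal sketch of the Weighted Average Mixer, assuming each input model scores the same candidate set and the model weights sum to 1 (so the weighted sum coincides with a weighted average); min-max normalization is one natural reading of the [0.0, 1.0] scaling described above, and the function name is illustrative.

```python
def weighted_average_mix(model_scores, model_weights):
    """model_scores: {model_name: {article_id: raw_score}}
    model_weights: {model_name: weight}, assumed to sum to 1.

    Min-max normalize each model's scores to [0.0, 1.0], combine
    them per article with a weighted average, and return article
    IDs ordered by the combined score."""
    combined = {}
    for name, scores in model_scores.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on constant scores
        for article, s in scores.items():
            norm = (s - lo) / span
            combined[article] = combined.get(article, 0.0) \
                + model_weights[name] * norm
    return sorted(combined, key=combined.get, reverse=True)
```

For example, with weights {bandit: 0.6, deep: 0.4}, an article topping the bandit ranking outranks one topping only the deep-model ranking.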
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Evaluation</head><p>In this section, we present a comprehensive evaluation of our proposed recommendation system. The evaluation is divided into three subsections: Offline Evaluation, Online Evaluation, and Results.</p><p>Offline Evaluation describes the performance metrics derived from historical data, allowing us to assess the model's predictive accuracy in a controlled environment. This process helps identify the best model for subsequent online testing. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The diagram of the segments calculation process.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The diagram of the deep model architecture. Input embeddings are calculated with pretrained models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The diagram of the Aureus recommendation system illustrates the following components: Inputs consist of the user ID, a set of content items, and online business KPI metrics. The system integrates two submodels: a deep learning-based user interest model and a multi-armed bandit content popularity model. These submodels are combined using a specified combination strategy.</figDesc></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Online Evaluation involves deploying the model in a live setting on Onet.pl, where we measure its real-time effectiveness, user engagement metrics, and business KPI metrics.</p><p>Finally, the Results subsection synthesizes the findings from both offline and online evaluations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Offline Setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Baselines</head><p>Each time we develop a new architecture or introduce a new feature, we evaluate the models against at least two baselines:</p><p>• random model -If our model does not outperform random recommendations, we conclude that it is unsuited for production deployment. • current production model -Our goal is to match or exceed the results of the current production model. If the new model achieves comparable or better results, we proceed to test it in the online environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Offline Evaluation Metrics</head><p>We implemented both standard and custom ranking evaluation metrics on historical data. Each metric is calculated with different values of k (primarily 3, 5, 10, 15, and 30). Our goal is to train a single model that can be deployed for an extended period. Therefore, we validate our model on three different days: one day, seven days, and thirty days after training. We utilize the following metrics:</p><p>• Standard Metrics -NDCG, Precision, Recall, Coverage and AUC • Custom Metrics -We aim to optimize a continuous business KPI with a click prediction model, so we calculate the average value of this KPI for a given ranking at k as our custom metric. These are:</p><p>-Average Label Value -This metric considers all articles in the list.</p><p>-Average Positive Label Value -This metric considers all articles that are clicked.</p></div>
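The two custom metrics can be sketched as follows, assuming a ranked list of article IDs, a per-article continuous KPI label, and the set of clicked articles; function names are illustrative.

```python
def average_label_value_at_k(ranked, labels, k):
    """Mean KPI label over the top-k ranked articles
    (Average Label Value at k)."""
    top = ranked[:k]
    return sum(labels[a] for a in top) / len(top)

def average_positive_label_value_at_k(ranked, labels, clicked, k):
    """Mean KPI label over the clicked articles within the top-k
    (Average Positive Label Value at k). Returns 0.0 if no top-k
    article was clicked."""
    top = [a for a in ranked[:k] if a in clicked]
    return sum(labels[a] for a in top) / len(top) if top else 0.0
```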
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Online AB Tests</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Testing Setup</head><p>A critical component of the Aureus system, alongside the recommendation models, is the A/B testing engine. This engine facilitates statistically significant and fair online testing of multiple recommendation approaches. Users are randomly and stably assigned to one of the testing variants, independent of user agent, demographic factors, or other variables that might influence the test results. During the test period, each user is exclusively presented with recommendations generated by the model associated with their assigned variant. Key performance indicator (KPI) values for content are collected and recorded according to the testing variants, enabling subsequent analysis and comparison.</p><p>It is important to note that the online tests presented in this paper were conducted on a curated sample of users and focused specifically on a designated section of the webpage (recommendations displayed beneath articles). As such, the results may not fully generalize to similar experiments conducted under different conditions or in other areas of the webpage.</p></div>
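Stable random assignment of this kind is commonly implemented by hashing the user ID together with an experiment-specific salt, so the split is uniform across users yet constant for any one user throughout the test. The source does not specify Aureus's exact mechanism, so the following is a sketch under that common assumption; the salt value is hypothetical.

```python
import hashlib

def assign_variant(user_id, variants, salt="experiment-1"):
    """Deterministically map a user ID to one of the test variants.

    Hashing (salt, user_id) yields an assignment that is effectively
    random across the population, independent of user agent or
    demographics, but stable for a given user within one experiment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Changing the salt reshuffles users into fresh groups for the next experiment without any stored assignment table.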
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Online Monitoring</head><p>When the model is deployed in a production environment, we continuously monitor its performance with respect to business KPIs and latency. We enforce a stringent latency threshold, beyond which the recommendations generated by the model would not be utilized. To track these online metrics, we employ AWS QuickSight for business-related metrics and Grafana for technical metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Results</head><p>Table <ref type="table">1</ref> presents the offline performance metrics of three predictive models: the random baseline, the similarity model, and the deep learning model. The deep learning model demonstrates superior performance, surpassing the similarity model by approximately 65.7% in NDCG, around 16.3% in AvgLabelValue and around 0.9% in AvgPositiveLabelValue. This indicates that the deep learning model is more effective in terms of business KPIs and has been deployed in online tests as the user-to-item recommendation model. We implemented our models in two production environments: the Onet.pl homepage and article pages (with recommendations below each article). The determination of the model that achieves the status of "king of the hill" is based on the results of online testing. This approach allows for an evaluation of not only the model's performance but also its alignment with the actual needs of users. 
In the following section, we compare several models employed by Aureus:</p><p>• random sample from the set of articles, • Thompson Sampling bandit (our gold standard among recommenders), • Thompson Sampling bandit with user segmentation enabled, • items' cosine similarity to the currently read article, • segmented bandit mixed with item-to-item similarity model, • segmented bandit mixed with user-to-item deep model, • segmented bandit mixed with both item-to-item similarity and user-to-item deep model.</p><p>For comparison, we use two main metrics:</p><p>• uplift -the percentage difference between the average business KPI value of pieces of content returned by the tested model and that returned by the baseline model, • latency -the median response time, measured in milliseconds, of the tested model; this auxiliary metric serves as a sanity check to ensure that the news website provides users with reasonably responsive performance.</p><p>Table <ref type="table">2</ref> presents the results observed during our online testing process. The data clearly demonstrate the synergy effect of the ensembled models, which consistently outperform the individual models. It is also noteworthy that the offline evaluations differ slightly from the online test results, where the combination of similarity-based methods with bandits slightly outperformed the deep model mixed with bandits. From our experience, this discrepancy is common in the context of news and time-sensitive content, where deep models alone may struggle to capture the temporal dynamics.</p><p>In terms of latency, deep models substantially increase the response times of the Aureus recommender. However, this increase remains within acceptable limits and does not negatively impact the user experience. Furthermore, incorporating more than two models in the mixing process does not significantly extend response times.</p></div>
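The uplift metric defined above reduces to a percentage difference of means; a minimal sketch, assuming per-recommendation KPI observations collected for the tested and baseline variants:

```python
def uplift(test_kpi_values, baseline_kpi_values):
    """Percentage difference between the mean business-KPI value of
    content returned by the tested model and by the baseline model."""
    test_mean = sum(test_kpi_values) / len(test_kpi_values)
    base_mean = sum(baseline_kpi_values) / len(baseline_kpi_values)
    return 100.0 * (test_mean - base_mean) / base_mean
```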
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Work</head><p>We demonstrated an enhancement of recommendation systems by integrating multiple models into a unified architecture. This hybrid approach facilitates the seamless incorporation of new recommendation scores, enabling the modeling of diverse recommendation aspects. Future work will involve the exploration of additional features, different mixing strategies and various embedding models to further refine the system.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><surname>Wirtualnemedia</surname></persName>
		</author>
		<ptr target="https://www.wirtualnemedia.pl/artykul/najpopularniejsze-serwisy-strony-glowne-wp-pl-onet-pl-interia-gazeta-pl" />
		<title level="m">Strony główne po zmianach w Mediapanelu. WP wyprzedziła Onet</title>
				<imprint>
			<date type="published" when="2024-08-30">2024. 2024-08-30</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><surname>Similarweb</surname></persName>
		</author>
		<ptr target="https://www.similarweb.com/top-websites/news-and-media/" />
		<title level="m">Top websites ranking. Most visited news &amp; media publishers websites</title>
				<imprint>
			<date type="published" when="2024-09-06">2024. 2024-09-06</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Collaborative filtering for implicit feedback datasets</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Koren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Volinsky</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDM.2008.22</idno>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="263" to="272" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Ripplenet: Propagating user preferences on the knowledge graph for recommender systems</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guo</surname></persName>
		</author>
		<idno type="DOI">10.1145/3269206.3271739</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="417" to="426" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Trend-responsive user segmentation enabling traceable publishing insights. A case study of a real-world large-scale news recommendation system</title>
		<author>
			<persName><forename type="first">J</forename><surname>Misztal-Radecka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rusiecki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Żmuda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bujak</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-2554/paper_08.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th International Workshop on News Recommendation and Analytics in conjunction with the 13th ACM Conference on Recommender Systems, INRA @ RecSys 2019</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 7th International Workshop on News Recommendation and Analytics in conjunction with the 13th ACM Conference on Recommender Systems, INRA @ RecSys 2019<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-09-20">September 20, 2019. 2019</date>
			<biblScope unit="volume">2554</biblScope>
			<biblScope unit="page" from="53" to="62" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="https://openai.com/index/new-embedding-models-and-api-updates/" />
		<title level="m">OpenAI, New embedding models and API updates</title>
				<imprint>
			<date type="published" when="2024-08-30">2024. 2024-08-30</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Polbert: Attacking polish nlp tasks with transformers</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kłeczek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the PolEval 2020 Workshop</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Ogrodniczuk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Łukasz</forename><surname>Kobyliński</surname></persName>
		</editor>
		<meeting>the PolEval 2020 Workshop</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>Institute of Computer Science, Polish Academy of Sciences</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">NewsBERT: Distilling pre-trained language model for intelligent news application</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-emnlp.280</idno>
		<ptr target="https://aclanthology.org/2021.findings-emnlp.280" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3285" to="3295" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Deep neural networks for youtube recommendations</title>
		<author>
			<persName><forename type="first">P</forename><surname>Covington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Adams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sargin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th ACM Conference on Recommender Systems</title>
				<meeting>the 10th ACM Conference on Recommender Systems<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Using confidence bounds for exploitation-exploration trade-offs</title>
		<author>
			<persName><forename type="first">P</forename><surname>Auer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="page" from="397" to="422" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Analysis of thompson sampling for the multi-armed bandit problem</title>
		<author>
			<persName><forename type="first">S</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v23/agrawal12.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th Annual Conference on Learning Theory</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Mannor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Srebro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Williamson</surname></persName>
		</editor>
		<meeting>the 25th Annual Conference on Learning Theory</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page">26</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Item2vec: Neural item embedding for collaborative filtering</title>
		<author>
			<persName><forename type="first">O</forename><surname>Barkan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Koenigstein</surname></persName>
		</author>
		<idno type="DOI">10.1109/MLSP.2016.7738886</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP)</title>
				<imprint>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
