<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Non-Stationary Multi-Armed Bandits for News Recommendations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Noah</forename><surname>Daniëls</surname></persName>
							<email>noah.daniels@uantwerpen.be</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Antwerp</orgName>
								<address>
									<settlement>Antwerp</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Froomle</orgName>
								<address>
									<settlement>Antwerp</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bart</forename><surname>Goethals</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Antwerp</orgName>
								<address>
									<settlement>Antwerp</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Froomle</orgName>
								<address>
									<settlement>Antwerp</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Monash University</orgName>
								<address>
									<settlement>Melbourne</settlement>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Non-Stationary Multi-Armed Bandits for News Recommendations</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">74275C303A52EB45D116499D42E755E7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:29+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>news</term>
					<term>recommender system</term>
					<term>multi-armed bandit problem</term>
					<term>non-stationarity</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the dynamic landscape of online news, recommending relevant articles is key to enhancing user engagement and satisfaction. Traditional Multi-Armed Bandit (MAB) approaches, however, struggle with the specific nonstationary nature of news consumption, where article relevance shifts rapidly due to emerging trends and breaking news. Through simulations, we explore the effectiveness of non-stationary adaptations such as sliding windows, exponentially weighted averages, and discounting, revealing some limitations in their performance. Furthermore, we introduce a novel algorithm, Optimistic/Pessimistic-Greedy (OP-Greedy), which leverages external signals such as article page views to optimize exploration. Our findings demonstrate that tailoring MAB algorithms to the unique dynamics and opportunities of news recommendations can significantly improve their adaptability and effectiveness.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In the rapidly evolving landscape of online news consumption, the ability to effectively recommend relevant articles has become crucial for user engagement and satisfaction. This task, characterized by the inherent dilemma of balancing exploration (discovering new, potentially relevant articles) and exploitation (leveraging known popular articles), is well-modeled by the Multi-Armed Bandit (MAB) problem. Based on our experience in running recommendation systems for several large online news media sites, we noticed that traditional MAB approaches fall short in addressing the unique challenges presented by the non-stationary nature of news consumption, where article relevance fluctuates drastically over short periods due to emerging trends, breaking news, and shifting user interests.</p><p>This study introduces practical enhancements to conventional MAB algorithms, specifically tailored to the domain of news recommendations. Unlike existing methodologies that predominantly focus on contextual <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref> or general non-stationary settings <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>, our approach deals explicitly with the specific non-stationary dynamics inherent in online news recommendations. For this study, we focus on the non-contextual setting before investigating the more complex contextual version. 
As the largest proportion of traffic on a typical news media site is from anonymous, or cold-start, visitors, any improvement might already have a significant impact on the news website.</p><p>Our contributions are twofold: First, we evaluate established MAB algorithms in their ability to handle the dynamic nature of news content and show some surprising results. We run experiments in a simulated environment and investigate how mechanisms such as sliding windows can help or hinder algorithms in their ability to deal with the non-stationary behavior of news. Second, we describe a novel algorithm called Optimistic/Pessimistic-Greedy (OP-Greedy) which leverages external signals, such as article page views, to guide the exploration process more efficiently, reducing the need for indiscriminate exploration and focusing efforts on articles with higher potential for user engagement.</p><p>By addressing the unique challenges of the news recommendation domain, our study seeks to enhance the practical utility of MAB algorithms, ultimately contributing to the development of more adaptive, effective, and user-centric news recommendation systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>In the classical multi-armed bandit problem <ref type="bibr" target="#b8">[9]</ref>, an agent is faced with a row of slot machines (arms), each with an unknown reward distribution. The goal is to maximize the total reward gained over a series of pulls. This requires balancing the exploration of different arms to learn their reward distributions with the exploitation of arms that seem to provide the highest rewards. When applied to news recommendation systems, the arms represent different news articles and the act of pulling an arm corresponds to recommending the associated article. An article's reward distribution is characterized by its click-through rate (CTR) and the agent typically receives binary feedback indicating whether the article was clicked or not. The agent must simultaneously identify articles with high CTRs and maximize aggregate clicks.</p><p>Our investigation is anchored in the non-stationary multi-armed bandit (NS-MAB) problem <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>, where the unknown reward distributions change over time. This introduces additional complexity, as there is no longer a single optimal arm. In general, when the rewards can change arbitrarily, no theoretical guarantees can be provided. For this reason, researchers study classes of problems in which the amount of change is bounded, for example by having a limited number of time points at which the rewards change <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref> or a limited total variation of the expected rewards <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b7">8]</ref>.</p><p>While these research efforts provide interesting theoretical guarantees for large classes of nonstationary problems, they sometimes fail to produce effective algorithms in practice. 
Our work diverges from this existing literature by addressing the specific non-stationary dynamics inherent in online news recommendations. By adapting MAB algorithms to this context, we contribute to closing the gap between theoretical bandit solutions and their applicability in the dynamic, real-world scenario of news recommendations.</p><p>Concretely, we make the following key assumptions about the characteristics of the news recommendation environment:</p><p>• Decreasing CTRs: due to the rapid turnover inherent to news content, article CTRs tend to decrease as more recent news becomes available and old news loses relevancy <ref type="bibr" target="#b9">[10]</ref>. • Large and dynamic item pool: the number of news articles is large but not all articles are available from the start. Instead, new articles are introduced periodically while old articles are removed after their fixed lifetime. • Low signal-to-noise ratio: The large item pool and short lifespans result in few recommendation that actually receive clicks. The CTRs for most items is lower than 1%, which necessitates extensive data collection. • External signals: A news site might share their articles on social-media, send push-notification or news letters, and generate clicks through other curated or automated recommendations. These external sources of article clicks and page view lead to signals about an article's popularity which we should take advantage of.</p><p>These assumptions motivate the design of our simulated environment and they are crucial for explaining and understanding our results. Furthermore, the opportunity represented by external signals is the motivation behind our novel OP-Greedy algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Related Work</head><p>The theoretical foundations of different variations of the multi-armed bandit problem have been extensively studied, from the classical problem <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> to variants including contextual <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref> and non-stationary <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b7">8]</ref> problems. Besides these theoretical works, a lot of research has also been done on applying these techniques to recommendation systems. The most well-known of these efforts is LinUCB <ref type="bibr" target="#b16">[17]</ref>, which was conceptualized in the domain of personalized news recommendations but did not consider the non-stationarity of news. In general, the focus of recommendation system bandit research has been on contextual bandits, where user and/or item features can be used to create personalized recommendations. For this reason, recent research considering non-stationarity in recommendation systems is applied to contextual bandits <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. Moreover, this research primarily focuses on changing user interests rather than changing item relevance.</p><p>We believe that the non-contextual case can be equally important for the news domain as a large portion of news traffic is often from anonymous users. Therefore, we decided to diverge from existing literature and focus on this overlooked case, and leave the much more complex contextual version for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>In this section, we describe popular MAB algorithms in the context of news recommendations and highlight possible adaptations to make them more suitable for the news NS-MAB problem. We also detail how our experiments were conducted, including a detailed description of our simulation environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Base Algorithms</head><p>𝜖-Greedy (EG), Upper Confidence Bound (UCB) <ref type="bibr" target="#b10">[11]</ref>, and Thompson Sampling (TS) <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref> are three popular MAB algorithms, all of which base their decisions on estimates computed from observed clicks and the number of impressions (recommendations). To ease notation, let n_i(t) and c_i(t) denote the number of times article i has been recommended and clicked, respectively, up until time t. The estimated CTR of item i at time t is then given by μ̄_i(t) = c_i(t)/n_i(t). During each time step, after the agent has chosen a set I_t of items to recommend and obtained feedback for them in the form of boolean rewards indicating clicks {r_{i,t} | i ∈ I_t}, the statistics are updated as follows:</p><formula xml:id="formula_0">n_i(t) = n_i(t − 1) + 1[i ∈ I_t],  c_i(t) = c_i(t − 1) + r_{i,t}</formula><p>with n_i(0) = c_i(0) = 0 and r_{i,t} being zero if item i was not recommended at time t.</p></div>
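The bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration of the counting rule, not the authors' implementation; the `ItemStats` name and dictionary-based representation are our own.

```python
from collections import defaultdict

class ItemStats:
    """Per-item impression count n_i(t), click count c_i(t), and CTR estimate."""

    def __init__(self):
        self.n = defaultdict(int)  # impressions per item
        self.c = defaultdict(int)  # clicks per item

    def update(self, recommended, rewards):
        """recommended: iterable of item ids I_t; rewards: {item: 0/1 click}."""
        for i in recommended:
            self.n[i] += 1
            self.c[i] += rewards.get(i, 0)  # r_{i,t} is 0 if not clicked

    def ctr(self, i):
        """Estimated CTR mu_i(t) = c_i(t) / n_i(t); 0 for unseen items."""
        return self.c[i] / self.n[i] if self.n[i] > 0 else 0.0
```

All three base algorithms can then read their estimates from this shared statistics object.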
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Epsilon Greedy</head><p>The most basic, yet surprisingly effective algorithm is 𝜖-greedy. It handles the exploration/exploitation trade-off by picking a random set of items with probability 𝜖 and greedily picking the items with the highest estimated CTR otherwise. If 𝜖 is chosen properly, this approach will exploit the best items often while not getting stuck on sub-optimal items.</p></div>
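The selection rule can be illustrated as follows; the extension from a single arm to a top-k slate mirrors the paper's top-3 setting, but the function name and signature are hypothetical.

```python
import random

def epsilon_greedy_select(ctr, items, k, epsilon, rng=random):
    """With probability epsilon pick k random items, else the k with highest CTR."""
    items = list(items)
    if rng.random() < epsilon:
        return rng.sample(items, k)  # explore: uniform random slate
    # exploit: greedy slate by estimated CTR
    return sorted(items, key=lambda i: ctr.get(i, 0.0), reverse=True)[:k]
```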
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">Upper Confidence Bound</head><p>Instead of exploring randomly as EG does, the UCB algorithm guides exploration by considering the uncertainty of its estimates. Concretely, UCB looks at the confidence intervals for the CTR estimates and always selects the article with the highest upper confidence bound. This way, items with high CTR get recommended, but so do those that have not been explored sufficiently and have the potential for a high CTR. In practice, UCB assigns a score q_i(t) to each item and picks the highest-scoring items. The scores are calculated as follows:</p><formula xml:id="formula_1">q_i(t) = μ̄_i(t) + c √(ln n(t) / n_i(t))</formula><p>where c is a hyper-parameter regulating the required confidence and n(t) is the total impression count before time t over all available items.</p></div>
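The score can be computed directly from the counts. A small sketch, with one assumption the text leaves implicit: items with no impressions get an infinite score so they are tried at least once (a common convention).

```python
import math

def ucb_score(clicks, impressions, total_impressions, c=1.0):
    """q_i(t) = mu_i(t) + c * sqrt(ln n(t) / n_i(t)).

    clicks/impressions are c_i(t) and n_i(t) for one item;
    total_impressions is n(t) over all available items.
    """
    if impressions == 0:
        return float("inf")  # force at least one recommendation per item
    mu = clicks / impressions
    return mu + c * math.sqrt(math.log(total_impressions) / impressions)
```

At each step the agent would compute `ucb_score` for every available item and recommend the highest-scoring ones.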
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.3.">Thompson Sampling</head><p>TS guides its exploration by recommending each item in proportion to the probability that it is optimal given the previous evidence. Similarly to UCB, this causes more exploration of items with uncertain yet promising estimates. In practice, the probability of item i being optimal is modeled using a beta distribution Beta(1 + c_i(t), 1 + n_i(t) − c_i(t)). At recommendation time, each item's distribution is sampled and the items with the highest samples are recommended.</p></div>
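The sample-then-rank step can be sketched with the standard library's Beta sampler; the function name and dict-based counts are our own.

```python
import random

def thompson_select(clicks, impressions, items, k, rng=None):
    """Sample Beta(1 + c_i, 1 + n_i - c_i) per item; recommend the k largest samples."""
    rng = rng or random.Random()
    samples = {
        i: rng.betavariate(1 + clicks.get(i, 0),
                           1 + impressions.get(i, 0) - clicks.get(i, 0))
        for i in items
    }
    return sorted(items, key=lambda i: samples[i], reverse=True)[:k]
```

Unseen items fall back to Beta(1, 1), i.e. a uniform prior over the CTR, which is what drives their exploration.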
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Adaptations for Non-Stationary Settings</head><p>Three notable ways to adapt MAB algorithms for non-stationary settings are to use sliding windows, discounting and weighted averages. All these approaches aim to increase the accuracy of the CTR estimates by either discarding old observations (sliding windows) or placing more importance on recent observations (weighted averages and discounting).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Sliding windows</head><p>A global (or time-based) sliding window will only consider the last 𝜏 interactions (or last 𝜏 minutes of interactions) when determining the click and impression counts <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b6">7]</ref>. For UCB, this also means changing the calculation of 𝑛(𝑡). This is the most common form of sliding windows because it ensures that items which have stopped being recommended will eventually be tried again as their observations fall outside the window and they are treated as new items again. This is necessary in cases where the relevance of items can change arbitrarily. However, in news, the relevance of items is expected to decrease, which would lead a global sliding window to keep exploring unnecessarily.</p><p>To resolve this, we can apply the sliding windows locally, making the algorithm only consider the last 𝜏 interactions of each item individually. When an item is deemed irrelevant by an algorithm using this approach, it is far less likely to be recommended again because the window will stop advancing as the item stops receiving recommendations. On the other hand, relevant items which receive many recommendations do advance the window, allowing the estimates to update quickly if they start to become less relevant.</p></div>
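The local variant is easy to express with a bounded deque per item: the window only advances when that item is recommended, exactly the property described above. Names are ours; this is a sketch, not the evaluated implementation.

```python
from collections import defaultdict, deque

class LocalWindowStats:
    """Per-item sliding window over the item's own last tau observations."""

    def __init__(self, tau):
        # deque(maxlen=tau) silently evicts the oldest reward once full
        self.windows = defaultdict(lambda: deque(maxlen=tau))

    def update(self, item, reward):
        """Record one impression of `item` with binary click feedback."""
        self.windows[item].append(reward)

    def ctr(self, item):
        """CTR estimate over the item's local window; 0 for unseen items."""
        w = self.windows[item]
        return sum(w) / len(w) if w else 0.0
```

Because nothing is appended while an item goes unrecommended, its estimate is frozen rather than forgotten, which is the key difference from a global window.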
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Weighted averages</head><p>With exponential recency-weighted averages <ref type="bibr" target="#b8">[9]</ref>, the estimated CTR μ̄_i(t) of item i is not calculated from the click and impression counts but instead iteratively updated each time item i is recommended according to the following update rule:</p><formula xml:id="formula_2">Q_i(n) = Q_i(n − 1) + α(r_{i,n} − Q_i(n − 1))</formula><p>where α is the learning rate and r_{i,n} denotes the reward obtained from the n-th recommendation. The estimated CTR is always set to the latest value: μ̄_i(t) = Q_i(n_i(t)).</p><p>With this update rule, more weight is given to more recent observations, and similar to local sliding windows, this technique does not change CTR estimates while an item is not recommended.</p><p>For TS, where the beta distribution is parameterized by the click and impression counts rather than the estimated CTR, we first calculate the number of clicks that would produce the estimated CTR given the number of impressions:</p><formula xml:id="formula_3">Beta(1 + n_i(t)μ̄_i(t), 1 + n_i(t)(1 − μ̄_i(t)))</formula><p>A possible extra hyper-parameter when using weighted averages is the starting estimate Q_i(0). When it is set higher than the expected true CTR we call it an optimistic starting value, while if it is set lower we call it a pessimistic starting value. Sometimes, optimistic weighted averages are used in combination with 𝜖-greedy, in which case 𝜖 can be set to zero as the optimism alone can drive the exploration of new items.</p></div>
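The update rule is a single line of code; a short sketch (the function name is ours):

```python
def ewa_update(q, reward, alpha):
    """Exponential recency-weighted average: Q(n) = Q(n-1) + alpha * (r - Q(n-1)).

    alpha in (0, 1] is the learning rate; larger alpha weights recent
    rewards more heavily. Called only when the item is recommended.
    """
    return q + alpha * (reward - q)
```

Starting the recursion from an optimistic or pessimistic Q(0) is what distinguishes the two variants discussed above.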
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3.">Discounting</head><p>A technique related to weighted averages is discounting <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b15">16]</ref>, where the impression and click counts are individually updated after each time step with the following update rule:</p><formula xml:id="formula_4">n_i(t) = αn_i(t − 1) + 1[i ∈ I_t],  c_i(t) = αc_i(t − 1) + r_{i,t}</formula><p>While this technique gives more weight to more recent observations, like weighted averages, it does so in a way where all estimates are updated after each time step, which makes it more similar to global sliding windows.</p></div>
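The crucial difference from weighted averages is that the decay touches every item each step, recommended or not. A sketch of one such step (function name and dict representation are ours):

```python
def discount_step(n, c, recommended, rewards, alpha):
    """One discounting step: decay ALL counts by alpha, then add new observations.

    n, c: {item: impression count}, {item: click count}
    recommended: set of items in I_t; rewards: {item: 0/1 click}
    """
    items = set(n) | set(recommended)
    new_n, new_c = {}, {}
    for i in items:
        new_n[i] = alpha * n.get(i, 0.0) + (1.0 if i in recommended else 0.0)
        new_c[i] = alpha * c.get(i, 0.0) + float(rewards.get(i, 0))
    return new_n, new_c
```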
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Optimistic/Pessimistic-Greedy</head><p>On typical news portals, a bandit-based recommender system will only be a small part of a larger ecosystem. A news site might share its articles on social media, send push-notifications or newsletters, and generate clicks through other curated or automated recommendations. These other sources of article clicks and page views lead to signals about an article's popularity which we should take advantage of.</p><p>An interesting option presents itself when we use such popularity signals to decide whether an article warrants exploration or not. In this subsection, we describe a novel algorithm for doing so, called optimistic/pessimistic-greedy (OP-greedy), which is based on weighted averages and greedy item selection. The idea behind our approach is to use external popularity information to retroactively decide whether we should use optimistic or pessimistic starting values.</p><p>Our algorithm expects as input a ranking of items based on the popularity signal. Note that we expect this ranking to contain valuable information, but not to be good enough to use without modification. Based on an item's rank rank(i), we interpolate between an optimistic starting value Q̂(0) and a pessimistic starting value Q̌(0) as follows:</p><formula xml:id="formula_5">Q_i(0) = Q̌(0) + (Q̂(0) − Q̌(0)) exp(−rank(i)/γ)</formula><p>where γ is a hyper-parameter controlling how large the rank can become before it is treated as pessimistic.</p><p>By writing the weighted average update rule as a direct formula, we see how the starting value can be retroactively changed:</p><formula xml:id="formula_6">Q_i(n) = (1 − α)^n Q_i(0) + Σ_{k=0}^{n−1} (1 − α)^k α r_{i,t−k}</formula><p>The effect of the starting value can be included in the CTR estimate by using weighted averages as normal (without a special starting value) and adding Q_i(0) multiplied by (1 − α)^{n_i(t)} to obtain the final result.</p></div>
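Both pieces of the algorithm can be sketched directly from the formulas above: the rank-based interpolation of the starting value, and folding a retroactively chosen prior into a running average that was computed with Q(0) = 0. Function names and the rank-starts-at-0 convention are our assumptions.

```python
import math

def op_starting_value(rank, q_opt, q_pess, gamma):
    """Q_i(0) = Q_pess(0) + (Q_opt(0) - Q_pess(0)) * exp(-rank(i) / gamma).

    rank 0 (most popular, by assumption) gets the fully optimistic value;
    large ranks decay toward the pessimistic value.
    """
    return q_pess + (q_opt - q_pess) * math.exp(-rank / gamma)

def op_estimate(q_running, n, q0, alpha):
    """Fold the starting value into a weighted average run with Q(0) = 0.

    After n updates the prior's remaining weight is (1 - alpha)^n, so the
    starting value can be chosen (or changed) retroactively.
    """
    return q_running + (1 - alpha) ** n * q0
```

For example, one reward r folded into a zero-started average gives αr, and `op_estimate(αr, 1, q0, α)` recovers (1 − α)q0 + αr, the same value the recursion produces when started from q0.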
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Experimental Setup</head><p>Each experiment operates on a fixed number of episodes 𝑇 = 2000. During each episode, the agent under evaluation must make a number of top-3 recommendations from a set of available items. After each recommendation, the simulator provides binary feedback for each of the 3 recommended items, indicating whether each item was clicked or not. The feedback is based on simulated ground truth CTRs which remain constant during the episode. After 𝑃 recommendations have been made (the episode length 𝑃 depends on the simulation), the next episode starts with possibly changed CTRs and item availability.</p><p>The main metric we look at is the average CTR, which is the total number of obtained clicks divided by the total number of impressions. Since the number of impressions is constant for all agents, this effectively measures how many clicks each agent obtained. To normalize the results across environments, we report the relative CTR compared to an omniscient agent which always picks the 3 articles with the highest ground truth CTR. Thus, a relative CTR of 80% means the agent missed out on 20% of the clicks it could have received had it picked optimally.</p><p>We repeat each experiment 𝑁 times with different seeds and report the 95% confidence intervals of the results. The seed determines the evolution of the CTRs, and thus each run of the experiment corresponds to an entirely different environment with different items. To further improve the accuracy of our results, we count the expected number of clicks instead of the obtained number of clicks to calculate the CTR. In the simulations we have access to the ground truth CTR, which is equivalent to the click probability. While the agent receives boolean feedback based on these probabilities, data collection can use the CTRs directly for more accurate results.</p><p>All algorithms were tuned by evaluating each parameter combination on 15 runs. 
The best performing combinations were selected and used for the final 100 runs on different seeds. Those final results are the ones we report.</p></div>
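The evaluation metric described above can be sketched as follows: summing ground-truth click probabilities instead of sampled binary feedback, then normalizing against the omniscient agent. Function names are ours, and this is an illustration of the metric rather than the paper's evaluation code.

```python
def expected_clicks(recommendations, true_ctr):
    """Sum ground-truth click probabilities over all recommended slots.

    recommendations: list of slates (lists of item ids), one per decision.
    Using probabilities instead of sampled 0/1 clicks reduces variance.
    """
    return sum(true_ctr[i] for slate in recommendations for i in slate)

def relative_ctr(agent_recs, oracle_recs, true_ctr):
    """Agent's expected clicks as a fraction of the omniscient agent's.

    Impressions cancel out because both agents recommend the same number
    of items per decision.
    """
    return expected_clicks(agent_recs, true_ctr) / expected_clicks(oracle_recs, true_ctr)
```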
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Simulation</head><p>In this subsection, we describe how the ground truth CTRs and their dynamics are determined in our simulation. As a reminder, the CTR of an item should be highest in the first half of its lifetime and decrease as time goes on. Nevertheless, the simulation should leave room for variation. Some items may keep their CTR constant or slightly rising for a while, and some may exhibit multiple peaks of high CTR or reach their maximum near the middle of their lifetime.</p><p>To achieve this behavior, we model the base relevance f_{i,t} of item i during episode t of its lifetime using exponential decay as follows</p><formula xml:id="formula_7">f_{i,t} = b_i exp(−(σ_i t)²)</formula><p>where b_i and σ_i are item-dependent simulation parameters controlling the item's maximum relevance and evolution speed respectively.</p><p>Only using the base relevance would result in all CTR curves simply trending toward 0, which is not what we want. Instead, the base relevance is modified by looking at the rank rank_t(i) of item i's relevance w.r.t. all other items. This gives rise to the following formula for the ground truth CTR of item i:</p><formula xml:id="formula_8">μ_{i,t} = g( f_{i,t} / ln(a · rank_t(i) + e) )</formula><p>where a ≥ 0 is a constant regulating how much relevant articles are boosted compared to less relevant articles and g(·) denotes Gaussian smoothing. Using this formula, the items with the highest base relevance have a boosted CTR, but it can suddenly drop when their underlying relevance falls below that of other items. Figure <ref type="figure" target="#fig_0">1</ref> shows that the CTR curves described above indeed follow the desired behaviour. 
The CTRs are generally highest in the beginning and generally decrease, but some items have multiple peaks, no peaks at all, or only peak after a while.</p><p>To define sensible simulation parameter values, we assume an episode corresponds to 5 minutes of real time, meaning the time horizon covers about a week. We set the lifespan of items to 900 episodes (3 days) and use 300 items in total (135 items on average available at any time). To achieve a low signal-to-noise ratio, we set the maximum relevance to 0.005 ± 0.002, uniformly distributed across items. The evolution speed range, a, and the smoothing parameters were chosen by trial and error to achieve the desired CTR curve shapes.</p><p>Regarding the episode length 𝑃, we consider two simulation versions with identical parameters as described above but different amounts of throughput. This allows us to study how algorithms are affected by different amounts of available information. • Simulation A: we set 𝑃 = 100, corresponding to 1 item recommendation per second of real time (600,000 in total). • Simulation B: we set 𝑃 = 1000, corresponding to 10 item recommendations per second of real time (6,000,000 in total).</p></div>
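The two formulas above can be sketched as follows. This is a simplified illustration: the Gaussian smoothing g(·) is omitted, and we assume ranks start at 0 so the top-ranked item keeps its full base relevance (since ln(e) = 1); the paper does not specify these details.

```python
import math

def base_relevance(b, sigma, t):
    """f_{i,t} = b_i * exp(-(sigma_i * t)^2): relevance peaks early, then decays."""
    return b * math.exp(-(sigma * t) ** 2)

def ground_truth_ctr(relevances, a=1.0):
    """mu_{i,t} = f_{i,t} / ln(a * rank_t(i) + e), without the smoothing step.

    relevances: list of f_{i,t} values for all items at one time step.
    Higher-ranked (more relevant) items are divided by a smaller factor,
    so they are boosted relative to the rest.
    """
    order = sorted(range(len(relevances)), key=lambda i: relevances[i], reverse=True)
    rank = {i: r for r, i in enumerate(order)}  # rank 0 = most relevant
    return [f / math.log(a * rank[i] + math.e) for i, f in enumerate(relevances)]
```

This makes the rank-dependent drop visible: once an item's base relevance falls below a competitor's, its divisor grows and its CTR falls faster than the decay alone would suggest.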
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Evaluating Non-stationary Adaptations</head><p>Table <ref type="table" target="#tab_0">1</ref> shows the performance on Simulation A of various algorithms combined with various nonstationarity adaptations. The best combination for each algorithm is highlighted in bold. Table <ref type="table" target="#tab_1">2</ref> shows the same but for Simulation B. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Sliding windows</head><p>From the results, we can see that global sliding windows are not beneficial for algorithms which employ guided exploration, such as UCB and TS. The reason is that, by design, such sliding windows forget previously explored items, resulting in those items being explored again. In arbitrary non-stationary environments, this can be desirable as those items could have become relevant since they were last recommended. In the case of news, however, the relevancy of items tends to decrease, making such exploration efforts unnecessary. For random exploration strategies such as 𝜖-greedy, this is not a concern because the exploration is purely random. Instead, global sliding windows provide an advantage in this case as items that are starting to perform poorly get identified more quickly.</p><p>Local sliding windows are beneficial to all algorithms, especially in Simulation B. Local sliding windows enable algorithms to more quickly detect when the items they are recommending start to under-perform, without incurring the excessive exploration that global sliding windows do.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Weighted averages</head><p>The results also show that weighted averages are beneficial for news. Surprisingly, pessimistic starting values perform the best with optimistic starting values even being slightly worse than most non-adapted algorithms in Simulation A. The success of pessimistic starting values further supports the view that limiting exploration is the best strategy in this environment. In fact, with pessimistic starting values, new items will generally not be explored as much when they are added. Only when the best performing items (which are being recommended) start to lose relevance and drop below the starting value, will more items be explored. There are too many new items to explore effectively so algorithms should only explore when necessary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3.">Pessimistic greedy</head><p>Pessimistic greedy (𝜖-greedy with 𝜖 = 0 and pessimistic weighted averages) takes the previous idea one step further and does not explore new items at all until the currently best performing items perform worse than the starting value. Despite presumably missing many good items this way, pessimistic greedy is the best performing algorithm, tied with UCB, in both simulations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.4.">Discounting</head><p>The results show that discounting is the least effective strategy. As with global sliding windows, discounting encourages unwanted continued exploration in guided strategies. However, unlike global sliding windows, which caused the greedy algorithm to perform better, discounting has no such benefit. The reason is that with global windows the estimated CTR of items which are not being recommended can increase when impressions fall outside the window, which can lead to the greedy action changing. With discounting, the estimates only ever decrease while an item is not being recommended, causing the algorithm to get stuck on suboptimal items as in the non-adapted version of greedy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Optimistic/Pessimistic-greedy</head><p>Table <ref type="table" target="#tab_2">3</ref> shows the performance of optimistic/pessimistic-greedy (OP-greedy). To simulate the external popularity signal, we apply varying amounts of noise to the ground truth CTRs. The items are then ranked by their perturbed CTR and this ranking is provided as popularity information to the algorithm. To validate that this popularity signal is not too accurate (which would make increased performance trivial), we report the performance of a pure popularity algorithm which simply recommends the top-3 items according to the signal. We calibrated the amount of noise based on the performance of this popularity algorithm and report results for low, medium and high amounts of noise representing good, medium and poor quality signals.</p><p>The results show that leveraging popularity signals to guide exploration can be very beneficial. Even with high amounts of noise the performance is comparable to that of the best algorithms in our other results, showing that OP-greedy is robust to poor signal quality. Furthermore, when the amount of noise is low, OP-greedy significantly outperforms the other algorithms we tested. Knowing that pure popularity approaches already typically perform surprisingly well in practice, this approach is very promising.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Impression plots</head><p>To gain further insight into why these algorithms perform the way they do, we investigated which items are recommended over time. To facilitate this, we evaluated some of the algorithms on a fixed simulation seed (the same seed as in Figure <ref type="figure" target="#fig_0">1</ref>) so that the items are the same across runs. From these runs, we created plots showing the impression rate of the items. This was done by plotting a bar chart for every 10 episodes which shows how often each item was recommended in that time span. By placing these charts next to each other, we obtain a plot where the x-axis represents time and the height of items on the y-axis represents their impression rate.</p><p>As there are hundreds of items, we identify them using a sequence of 20 repeating colors. The ordering of items is fixed and based on the order in which they were added to the item pool, with older items appearing lower in the bar charts. To help interpret the plot, we also show the optimal impression distribution according to the omniscient agent above the plots. As each recommendation consists of three items, the optimal distribution shows three items on average. How closely an algorithm's impression plot matches the optimal impressions directly determines its performance. Figure <ref type="figure" target="#fig_2">2</ref> shows an example impression plot for the random algorithm, which always selects random available items. The plot shows how each item gets an equal share of the impressions. The colouring gives a striped appearance evoking downward movement. This is explained by the introduction of new items, which are added to the top of the bars, and old items leaving the item pool, which are removed from the bottom. This highlights that only an item's height shows the impression rate; its position is irrelevant.</p><p>The plots show how the non-adapted versions suffer from not being able to detect when an item stops being relevant. For example, both algorithms continue to recommend the dark green item (optimal in the beginning) long after it stopped being optimal. In contrast, the local sliding window versions adapt much faster and switch to recommending relevant items much sooner.</p></div>
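The binning behind these plots can be sketched as a short routine. The function below is an illustrative reconstruction of the construction described above, not the plotting code actually used for the figures.

```python
import numpy as np

def impression_matrix(episode_log, n_items, bin_size=10):
    """Bin a per-episode log of recommended item ids into the stacked-bar
    matrix behind an impression plot. Returns shape (n_items, n_bins);
    each column holds that interval's impression shares."""
    n_bins = len(episode_log) // bin_size
    counts = np.zeros((n_items, n_bins))
    for episode, items in enumerate(episode_log[: n_bins * bin_size]):
        for item in items:
            counts[item, episode // bin_size] += 1
    # Normalise each 10-episode interval so bar heights are shares.
    return counts / counts.sum(axis=0, keepdims=True)

# 20 episodes always recommending items 0, 1 and 2: two bars, equal shares.
shares = impression_matrix([[0, 1, 2]] * 20, n_items=4)
```

Plotting the columns of this matrix as stacked bars side by side yields the time axis of the impression plots.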
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">Sliding windows</head><p>As explained previously, the global sliding window adaptation works well for the unguided 𝜖-greedy algorithm, but falls apart for TS because it forces TS to keep exploring all items. This results in an impression plot which resembles random recommendations. The 𝜖-greedy plots of both sliding window types, on the other hand, look nearly identical: the exploration is unaffected, so both versions give the same benefit of faster change detection.</p></div>
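A minimal sketch of the two window types, under the assumption that "global" means observations expire after a fixed number of episodes regardless of the item, while "local" keeps each item's own last observations:

```python
from collections import deque

class LocalWindowArm:
    """Local sliding window: keeps this item's own last `size`
    observations; nothing expires while the item is not recommended."""
    def __init__(self, size):
        self.obs = deque(maxlen=size)

    def update(self, clicked):
        self.obs.append(int(clicked))

    def ctr(self):
        return sum(self.obs) / len(self.obs) if self.obs else 0.0

class GlobalWindowArm:
    """Global sliding window: observations older than `size` episodes are
    dropped, so the estimate changes even while the item is not played."""
    def __init__(self, size):
        self.size = size
        self.obs = deque()  # (episode, clicked) pairs

    def update(self, episode, clicked):
        self.obs.append((episode, int(clicked)))

    def ctr(self, episode):
        while self.obs and self.obs[0][0] <= episode - self.size:
            self.obs.popleft()
        return sum(c for _, c in self.obs) / len(self.obs) if self.obs else 0.0
```

With the global variant, an unplayed item's evidence eventually expires entirely, which is what drives guided strategies like TS to keep re-exploring every item.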
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2.">Greedy strategies</head><p>Figure <ref type="figure" target="#fig_4">4</ref> shows impression plots for various greedy strategies using weighted averages on Simulation B. The first two plots are for optimistic and pessimistic starting values and the last plot is for the optimistic/pessimistic-greedy algorithm (OP-greedy), which decides between these variants based on an external signal. The plots show that all variants perform well and match the optimal distribution closely, but there are still some differences.</p><p>The most obvious difference is that optimistic greedy allocates a considerable number of impressions to new items. This can be seen in the plots by the items at the top of the bar charts extending down much farther for optimistic greedy than for the other algorithms. These impressions are "wasted" in the sense that the other algorithms are able to gather enough information to determine the optimal items without needing so much exploration.</p><p>The difference between pessimistic- and OP-greedy is more subtle. The amount of exploration of new items is about the same for both algorithms, as most items are seen pessimistically. However, items which the external signal indicates are popular will be explored more by OP-greedy. In some cases this is unwarranted and results in impressions being wasted on new items (the gray item in the top right). On the other hand, sometimes the extra exploration pays off and results in optimal items being discovered slightly sooner (the optimal dark blue item in the middle receives more attention sooner).</p></div>
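The distinction between optimistic and pessimistic starting values can be sketched with the standard incremental weighted-average update; the step size and initial values below are assumed for illustration, not taken from the experiments.

```python
ALPHA = 0.05  # assumed step size of the recency-weighted average

class GreedyArm:
    def __init__(self, initial_value):
        self.q = initial_value

    def update(self, clicked):
        # standard incremental update: q <- q + alpha * (reward - q)
        self.q += ALPHA * (float(clicked) - self.q)

# Optimistic: starts high, so greedy keeps trying the item until its
# estimate drops below the alternatives (forced early exploration).
optimistic = GreedyArm(initial_value=1.0)
# Pessimistic: starts low, so the item is only picked up once there is
# positive evidence from elsewhere (e.g. the random exploration slice).
pessimistic = GreedyArm(initial_value=0.0)
optimistic.update(clicked=False)  # q: 1.0 -> 0.95
```

OP-greedy, as described above, would choose between these two initialisations per item based on the external popularity signal.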
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3.">Discounting</head><p>Figure <ref type="figure" target="#fig_5">5</ref> shows impression plots for UCB and TS using discounting on Simulation B. We can see that the effect of discounting is similar to that of a global sliding window: it improves the change detection speed, but also incurs extra exploration costs. These graphs further make it clear that the extra exploration happens for old items, as the bottom parts of the plots show old items being recommended often.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Conclusion</head><p>In terms of the exploration/exploitation trade-off for news recommendation, it seems that the best strategies consist of being pessimistic and doing as little exploration as possible. The first reason is that the CTR of articles generally decreases over time, so exploring old articles which have started to perform poorly is unnecessary. The second reason is that there are simply too many new items to explore. It is better to search for well-performing items than to chase the optimal items, as those will quickly lose relevance anyway.</p><p>In terms of handling the non-stationary behavior of news, we conclude that the focus should be on quickly detecting when the best-performing articles stop performing well. Weighted average updates and local sliding windows are excellent for this. Importantly, globally applied sliding windows and discounting can be counterproductive in combination with guided exploration strategies, as they lead to excessive continued exploration.</p><p>When external information indicating article popularity is available, we can use it to decide which articles warrant exploration. We introduced optimistic/pessimistic-greedy, which uses this data effectively and outperformed all other algorithms in our simulated experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion and Future Work</head><p>Our findings highlight the importance of considering decreasing relevancy and article lifespan in the news recommendation domain. The superior performance of local sliding windows and weighted averages compared to global sliding windows or discounting underscores the potential benefits of incorporating domain-tailored adaptations into recommendation strategies. However, while these results are encouraging, they also open up several avenues for further investigation.</p><p>The evaluation is based on a simulated environment that, despite being designed to mimic real-world scenarios, cannot capture the full complexity of actual user interactions. Another limitation of our experiments is the idealized evaluation procedure: in real recommendation systems, feedback is not obtained immediately and updates are performed in batches. For these reasons, one of the primary discussions arising from this study concerns the applicability of these results in live settings. Future investigations, and especially deployments of these algorithms on real-world platforms, would provide valuable insights into their practical effectiveness and user satisfaction.</p><p>Another point for discussion is the computational efficiency of the adapted algorithms. While local sliding windows demonstrate comparable and in some cases superior performance to weighted averages, the optimal window size for most algorithms was quite large, which can pose challenging infrastructure problems. Weighted averages, on the other hand, are in principle easier to implement due to their iterative nature, but can still present difficulties in large distributed systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Future Work</head><p>Building on the foundation laid by this study, several directions for future work can be identified:</p><p>• Real-world Implementation: Deploying the adapted algorithms in a live news recommendation system would allow for the collection of empirical data, providing a clearer picture of their performance and user reception.
• Personalization: Integrating user-specific context, such as historical interactions and preferences, could enhance the personalization of recommendations, potentially leading to even higher engagement rates.
• Longitudinal Studies: Conducting long-term studies on the impact of adapted MAB algorithms on user engagement and satisfaction would provide deeper insights into their effectiveness over time.</p><p>In conclusion, this study contributes valuable insights into the application of multi-armed bandit algorithms in the context of online news recommendation. The promising results pave the way for future research and development in this area, with the potential to significantly enhance the user experience on news platforms.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Simulated CTR curves.</figDesc><graphic coords="7,128.41,75.57,338.47,218.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3</head><label>3</label><figDesc>Figure 3 shows impression plots for 𝜖-greedy and TS and their adapted versions with global and local sliding windows on Simulation B.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Impression plot for random recommendations. Optimal distribution is also shown (unlabeled, top). Each vertical slice shows how the impressions for that time interval were distributed across the items. Each color represents an item, with colors repeating and items always occurring in the same order from oldest to newest, top to bottom.</figDesc><graphic coords="10,207.38,65.61,180.51,153.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Impression plots for 𝜖-greedy (left) and TS (right). Standard version (center top), global sliding windows (center bottom) and local sliding windows (bottom) are shown, as well as the optimal distribution (unlabeled, top).</figDesc><graphic coords="11,117.13,65.61,361.02,381.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Impression plots for optimistic greedy (center top), pessimistic greedy (center bottom) and optimistic/pessimistic-greedy (bottom). Optimal distribution is also shown (unlabeled, top).</figDesc><graphic coords="12,207.38,65.60,180.51,387.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Impression plots for UCB (center) and TS (bottom) with discounting. Optimal distribution is also shown (unlabeled, top).</figDesc><graphic coords="13,207.38,65.61,180.51,270.37" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Relative CTR of various algorithms for Simulation A.</figDesc><table><row><cell>Base algorithm</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Relative CTR of various algorithms for Simulation B.</figDesc><table><row><cell>Base algorithm</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Relative CTR of popularity and optimistic/pessimistic-greedy with varying amounts of noise.</figDesc><table><row><cell></cell><cell>Simulation A</cell><cell></cell></row><row><cell>low</cell><cell>medium</cell><cell>high</cell></row><row><cell cols="3">Popularity 0.735 ± 0.014 0.510 ± 0.015 0.375 ± 0.009</cell></row><row><cell cols="3">OP-Greedy 0.814 ± 0.008 0.731 ± 0.011 0.690 ± 0.012</cell></row><row><cell></cell><cell>Simulation B</cell><cell></cell></row><row><cell>low</cell><cell>medium</cell><cell>high</cell></row><row><cell cols="3">Popularity 0.727 ± 0.013 0.523 ± 0.014 0.375 ± 0.010</cell></row><row><cell cols="3">OP-Greedy 0.924 ± 0.003 0.884 ± 0.005 0.867 ± 0.006</cell></row></table></figure>
		</body>
		<back>
		</back>
	</text>
</TEI>
