A Ranked Bandit Approach for Multi-stakeholder Recommender Systemsβˆ—

TAHEREH ARABGHALIZI, University of Pittsburgh, USA
ALEXANDROS LABRINIDIS, University of Pittsburgh, USA

βˆ— Copyright 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the MORS workshop held in conjunction with the 16th ACM Conference on Recommender Systems (RecSys), 2022, in Seattle, USA.

Recommender systems traditionally find the most relevant products or services for users, tailored to their needs or interests, but they ignore the interests of the other sides of the market (aka stakeholders). In this paper, we propose to use a Ranked Bandit approach for an online multi-stakeholder recommender system that sequentially selects the top π‘˜ items according to the relevance and priority of all the involved stakeholders. We present three different criteria to consider the priority of each stakeholder when evaluating our approach. Our extensive experimental results on a movie dataset show that contextual multi-armed bandits with a relevance function achieve a higher level of satisfaction for all involved stakeholders in the long term.

Keywords: Multi-stakeholder Recommender Systems; Multi-armed Bandits; Ranked Bandit

CCS Concepts: β€’ Information systems β†’ Recommender systems.

1 INTRODUCTION

With information technology permeating all aspects of our modern life, recommender systems have become an integral part of personalizing the user experience. Conventional recommender systems only consider the needs and interests of the end users and overlook the preferences of the other sides of the marketplace, also known as stakeholders, who might benefit from the recommendation selections.

1.1 Motivating Example

In this paper, we address the problem of how to best recommend coupons, offered by local businesses such as coffee shops, to passengers awaiting the arrival of a bus at a nearby bus stop. The coupons are offered when the next bus is expected to be full, in order to encourage the bus passengers to take advantage of the recommended incentive rather than taking a full bus [3]. The goal of this coupon recommender system is to take the preferences of all involved stakeholders into account and to be able to prioritize different stakeholders at different times. It is worth noting that the preferences of local businesses towards the bus passengers reflect their marketing goals (e.g., special coupons for college students). Although our motivating application targets people arriving at bus stops, our proposed solution can be used in a wide range of applications in which people receive recommendations and there are multiple stakeholders (beyond just the people) whose preferences need to be considered.

1.2 Background and Related Work

A traditional user-centric recommender system recommends items based on the interests and preferences of the user. However, the user is not the sole stakeholder in many real-world applications. In those cases, the recommendations could benefit other individuals or organizations [1, 2]. Moreover, online recommendation tasks, which ingest data one observation at a time, are not well served by traditional offline recommendation techniques such as Collaborative Filtering [17], which rely on historical data. Multi-armed bandits (MAB), a classic reinforcement learning problem, have therefore recently attracted attention for online recommendation tasks.
Bandit algorithms provide a solution for the dilemma between exploration and exploitation that maximizes the expected cumulative payoff over the long term [5]. Depending on whether side information (aka context) is taken into account, bandit algorithms fall into two categories: context-free and contextual. In context-free bandits, the observed reward depends only on the selected arm, while in contextual bandits it is both the selected arm and its context that determine the observed reward [13]. While contextual bandits have been used extensively for online user-centric recommender systems [11, 20] and much research has been conducted on multi-objective multi-armed bandit algorithms [7, 21, 23], multi-stakeholder recommender systems have received less attention. Most recently, Mehrotra et al. [14] proposed a multi-objective contextual bandit approach to maximize the long-term payoffs for different objectives, such as diversity, for a multi-stakeholder recommender system on a music streaming platform. Although there are several existing works that consider multiple stakeholders in their recommendation generation [15, 19], to the best of our knowledge, there is no work that considers the contextual information of users together with the preferences and priority of all stakeholders in an online recommender system.

1.3 Contributions

In this work, we aim to provide long-term, acceptable levels of satisfaction for all the stakeholders involved in a recommender system, according to the given context (if available) and priorities. We propose a bandit-based approach that sequentially selects the top candidate items and then accepts the best candidates that maximize the total payoff, given the relevance and priority of each stakeholder. Our contributions are as follows:
β€’ we propose to use a Ranked Bandit approach to recommend multiple items to a user and provide acceptable levels of satisfaction for all stakeholders in a recommender system over time.
β€’ we introduce three different criteria, namely Deterministic, Probabilistic, and Multi-sided Relevance Function, to consider the priority of each stakeholder when evaluating our approach.
β€’ we define and use different metrics to evaluate the satisfaction of stakeholders and compare different benchmarks.
β€’ we use real-world and synthetic datasets and multiple sensitivity analyses to experimentally evaluate the performance of our approach.

2 PROBLEM STATEMENT

In this paper, we address the problem of recommending items to users in an online multi-stakeholder platform, considering the contextual information of users and items (if available) and the relevance and priority of the involved stakeholders, in order to provide them with an acceptable level of satisfaction in the long term. In the application of coupon recommendation, the recommender system needs to recommend the top π‘˜ coupons (from the finite set of available coupons at each bus stop) to a bus passenger whose upcoming bus is going to be full. This system should take the preferences of all stakeholders, such as bus passengers, coupon suppliers, and minority-owned businesses (i.e., another stakeholder group that we want to prioritize), into account and offer the relevant coupons based on the given priority (weight) of the stakeholders. This recommender system aims to maximize the total payoff and strike a good balance between the satisfaction of all stakeholders over time.
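To make this setting concrete, the following minimal Python sketch (our illustration, not part of the original system; all names are assumptions) shows one way the inputs of such a problem instance could be represented: a finite set of coupons available at a bus stop, the number π‘˜ of items to recommend, and per-stakeholder weights that lie in [0, 1] and sum to 1.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MultiStakeholderInstance:
    coupons: List[str]        # finite set of coupons at a bus stop (the arms)
    k: int                    # number of coupons to recommend per passenger
    weights: Dict[str, float] # priority of each stakeholder, summing to 1


# A hypothetical instance: 100 coupons, 3 recommendation slots, and weights
# that prioritize the passenger over the supplier and the minority stakeholder.
instance = MultiStakeholderInstance(
    coupons=[f"coupon_{i}" for i in range(100)],
    k=3,
    weights={"passenger": 0.6, "supplier": 0.3, "minority_owned": 0.1},
)
assert abs(sum(instance.weights.values()) - 1.0) < 1e-9

# Each round the system observes a passenger (and their context, if available),
# recommends k coupons, and receives per-stakeholder feedback on the selection;
# the goal is to maximize the weighted payoff accumulated over all rounds.
```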
3 PROPOSED SOLUTION

To address this problem, we propose to utilize a Ranked Bandit approach that sequentially selects the top π‘˜ items and accepts the best candidates that maximize the payoff, given the relevance and priority of each stakeholder.

In the Ranked Bandit algorithm, which was first introduced by [16, 18], one multi-armed bandit algorithm is instantiated for each slot of a ranked list with π‘˜ slots, to learn the greedy-optimal solution. If a user clicks on slot 𝑖, that slot gains a reward of 1, and all other slots 𝑗 where 𝑗 < 𝑖 receive a reward of 0. In a multi-stakeholder platform, however, the algorithm should consider the relevance and priority (if applicable) of all involved stakeholders when deciding whether a slot obtains a reward of 1. To this end, we introduce three different criteria, namely deterministic, probabilistic, and multi-sided relevance function (see Section 4.1), apply one of them after displaying the selected items to a user, and assign rewards accordingly.

Our proposed approach is described in Algorithm 1. At each round, the Ranked Bandit algorithm plays a multi-armed bandit instance 𝑀𝑖 for each ranking slot 𝑖. If the arm 𝑠𝑖 was already selected at a higher ranking slot, an arbitrary unselected arm is chosen instead. This process is repeated sequentially for the top π‘˜ ranking slots (Lines 3-9). The algorithm then receives feedback for each multi-armed bandit instance 𝑀𝑖 by running a criterion. If the arm in ranking slot 𝑖 is the first arm that is relevant to the involved stakeholders, based on the used criterion and the given priority (weight) of the stakeholders, then bandit 𝑀𝑖 receives a reward of 1 and all higher bandits 𝑀𝑗 (𝑗 < 𝑖) receive a reward of 0 (Lines 12-17). Each bandit instance updates its reward estimates afterwards.

Algorithm 1 Ranked Bandit for Multi-stakeholder Recommender Systems
1: Inputs: M: base bandits, A: arms, T: rounds, k: number of slots, 𝐢𝑒: user context (if any), πΉπ‘Ž: arm features (if any), C(𝑀1, 𝑀2, ..., 𝑀𝑛): criterion (Deterministic, Probabilistic, or Multi-sided Relevance Function), where 𝑀𝑛 is the weight of the 𝑛-th stakeholder.
2: for 𝑑 = 1, 2, . . . , 𝑇 do
3:   for 𝑖 = 1, 2, . . . , π‘˜ do
4:     𝑠𝑖(𝑑) ← 𝑀𝑖.selectArm(A, 𝑒𝑑, 𝐢𝑒𝑑, πΉπ‘Žπ‘‘)
5:     if 𝑠𝑖(𝑑) ∈ {𝑠1(𝑑), 𝑠2(𝑑), ..., π‘ π‘–βˆ’1(𝑑)} then
6:       𝑠𝑖(𝑑) ← arbitrary unselected arm
7:     end if
8:     𝑆(𝑑) ← βˆͺ𝑖 𝑠𝑖(𝑑)
9:   end for
10:  Display 𝑆(𝑑) to the user and receive feedback for each 𝑀𝑖:
11:  for 𝑖 = 1, 2, . . . , π‘˜ do
12:    Apply criterion C(𝑀1, 𝑀2, ..., 𝑀𝑛)
13:    if 𝑠𝑖(𝑑) is the first arm relevant to the stakeholders (based on the used criterion) then
14:      π‘Ÿπ‘’π‘€π‘Žπ‘Ÿπ‘‘π‘ π‘–(𝑑) = 1
15:    else
16:      π‘Ÿπ‘’π‘€π‘Žπ‘Ÿπ‘‘π‘ π‘–(𝑑) = 0
17:    end if
18:    𝑀𝑖.UpdateReward(𝑒𝑑, 𝐢𝑒𝑑, πΉπ‘Žπ‘‘, 𝑠𝑖(𝑑), π‘Ÿπ‘’π‘€π‘Žπ‘Ÿπ‘‘π‘ π‘–(𝑑))
19:  end for
20: end for

4 EVALUATION

The training process in online algorithms, such as bandits, occurs incrementally over time. However, offline evaluation methods can be used to evaluate the performance of such algorithms using historical datasets. One of the most popular offline evaluation methods is called Replay. In this methodology, the bandit algorithm selects an arm for each record in the historical data. If this arm has been seen by the user before, the replay methodology accepts the arm; otherwise, the arm is rejected by the replay method [12].
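To make Algorithm 1 and its criterion-based feedback concrete, the following minimal Python sketch is our own illustration (not the authors' released code). The base-bandit interface (`select_arm` / `update`) and the `is_relevant` callback, which stands in for one of the criteria of Section 4.1, are assumptions chosen for readability; an epsilon-greedy bandit is used as a stand-in for each 𝑀𝑖, but any of the base bandits from Section 5.2 could be plugged in.

```python
import random
from typing import Callable, List, Sequence


class EpsilonGreedyBandit:
    """A simple context-free base bandit used as a stand-in for M_i."""
    def __init__(self, n_arms: int, epsilon: float = 0.15):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def ranked_bandit_round(bandits: Sequence[EpsilonGreedyBandit],
                        n_arms: int,
                        is_relevant: Callable[[int], bool]) -> List[int]:
    """One round of Algorithm 1: pick k distinct arms, reward the first relevant slot."""
    chosen: List[int] = []
    for bandit in bandits:                    # one base bandit per ranking slot
        arm = bandit.select_arm()
        if arm in chosen:                     # arm already used in a higher slot
            arm = random.choice([a for a in range(n_arms) if a not in chosen])
        chosen.append(arm)
    # Feedback: the first slot whose arm is relevant to the stakeholders
    # (according to the chosen criterion) gets reward 1; all other slots get 0.
    rewarded = False
    for slot, arm in enumerate(chosen):
        reward = 1.0 if (not rewarded and is_relevant(arm)) else 0.0
        rewarded = rewarded or reward == 1.0
        bandits[slot].update(arm, reward)
    return chosen


# Example usage: 100 arms, 3 slots, and a toy relevance oracle.
slots = [EpsilonGreedyBandit(n_arms=100) for _ in range(3)]
for _ in range(1000):
    ranked_bandit_round(slots, n_arms=100, is_relevant=lambda a: a % 7 == 0)
```

In a replay-style offline evaluation, `is_relevant` would only be consulted for arms that appear in the logged record for the current user, mirroring the accept/reject behavior described above.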
The replay method works well in the offline evaluation of bandits employed in user-centric recommender systems, which focus on user satisfaction obtained from user clicks. In a multi-stakeholder recommender system such as coupon recommendation, however, the replay evaluation should be able to consider the relevance of a selected arm to multiple stakeholders, given the priority of those stakeholders. To this end, we propose three criteria, called Deterministic, Probabilistic, and Multi-sided Relevance Function, to address the relevance of a selected arm to multiple weighted stakeholders in the replay evaluation.

4.1 Selection/Evaluation Criteria

Deterministic: based on this criterion, the replay method accepts an arm if it is relevant to the most prioritized stakeholder, i.e., the one with the highest weight.

Probabilistic: this criterion first partitions the [0, 1] interval into ranges according to the weights of the different stakeholders and then draws a random number in [0, 1], which maps to one of the ranges and therefore to one of the stakeholders. The replay method then accepts the arm if it is relevant to that stakeholder.

Multi-sided Relevance Function: the multi-sided relevance function is defined as the weighted sum of the relevance scores of each stakeholder for a selected arm π‘Ž. The relevance scores can be calculated based on the feedback of users and other stakeholders about the quality of the recommended items. The relevance function is defined as follows:

π‘Ÿπ‘Ž(𝑀, 𝑠) = 𝑀1 𝑠1π‘Ž + ... + 𝑀𝑛 π‘ π‘›π‘Ž   (1)

where π‘Ÿπ‘Ž ∈ [0, 1] is the estimated multi-sided relevance score, 𝑛 is the number of stakeholders, 𝑠1π‘Ž, 𝑠2π‘Ž, ..., π‘ π‘›π‘Ž are the relevance scores of arm π‘Ž for each stakeholder (each score is either 1, relevant, or 0, non-relevant), and 𝑀𝑖 is the given priority or weight of each stakeholder, with 𝑀𝑖 ∈ [0, 1] and βˆ‘ 𝑀𝑖 = 1. We also define a relevance threshold, 𝛿, so that the replay evaluation accepts an arm only if its multi-sided relevance score is greater than this threshold [4].

4.2 Evaluation Metrics

We use three different types of metrics to evaluate the performance of our proposed approach:
β€’ Average Rewards: we accumulate the reward values of the accepted arms in each round and return the average of the rewards over all rounds. The ultimate goal of a bandit algorithm is to maximize the rewards in the long term.
β€’ Average NDCG: Normalized Discounted Cumulative Gain (NDCG) measures the usefulness of an item based on its position in the recommendation list [9]. We compute the Average NDCG at π‘˜ for each stakeholder over time as a representation of their satisfaction level.
β€’ Balance-Relevance Rate: this metric computes the product of the balance and relevance rates, which are defined as follows:
– Balance Rate: to measure how similar the weights of the stakeholders and their satisfaction levels (i.e., Average NDCG) are, we compute the cosine similarity between the vector of weights and the vector of the corresponding Average NDCG values.
– Relevance Rate: this measure calculates the number of times (out of the total rounds) that the selected arms are relevant to the stakeholders (using one of the evaluation criteria) and are accepted by the replay evaluation.

5 EXPERIMENTAL EVALUATION

In this section, we conduct several experiments on movie recommendation data to evaluate the performance of our proposed approach.
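As a concrete reference for the evaluation that follows, the snippet below sketches the three selection/evaluation criteria of Section 4.1 in Python. This is our illustration only: the function names and the binary score vectors are assumptions, while the example weights and threshold follow the defaults later listed in Table 1. Each function receives the per-stakeholder binary relevance scores of a selected arm and the stakeholder weights, and returns whether the replay evaluation should accept the arm.

```python
import random
from typing import Sequence


def deterministic_accept(scores: Sequence[int], weights: Sequence[float]) -> bool:
    """Accept if the arm is relevant to the single most prioritized stakeholder."""
    top = max(range(len(weights)), key=lambda i: weights[i])
    return scores[top] == 1


def probabilistic_accept(scores: Sequence[int], weights: Sequence[float]) -> bool:
    """Partition [0, 1] by the weights, draw a point, and accept if the arm is
    relevant to the stakeholder whose range the point falls into."""
    u, cumulative = random.random(), 0.0
    for score, weight in zip(scores, weights):
        cumulative += weight
        if u <= cumulative:
            return score == 1
    return scores[-1] == 1          # guard against floating-point round-off


def multi_sided_accept(scores: Sequence[int], weights: Sequence[float],
                       delta: float = 0.5) -> bool:
    """Eq. (1): accept if the weighted sum of relevance scores exceeds delta."""
    r = sum(w * s for w, s in zip(weights, scores))
    return r > delta


# Worked example with the default weights (0.6, 0.3, 0.1) and delta = 0.5:
# an arm relevant only to the user (scores [1, 0, 0]) yields r = 0.6 > 0.5 and
# is accepted, while an arm relevant only to the supplier and the minority
# stakeholder (scores [0, 1, 1]) yields r = 0.3 + 0.1 = 0.4 <= 0.5 and is rejected.
weights = [0.6, 0.3, 0.1]
print(multi_sided_accept([1, 0, 0], weights), multi_sided_accept([0, 1, 1], weights))
```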
5.1 Dataset

Since there are currently no public datasets available for a multi-stakeholder platform in the coupon recommendation domain, we merged two real-world datasets, MovieLens (1M) [8] and IMDB (81k+) [10], and also generated a synthetic dataset. In our experiments, we use a scenario where three stakeholders are involved in the coupon recommendation system. In this mapping, movies are used as coupons, users as bus passengers (first stakeholder), movie production companies as local businesses/coupon suppliers (second stakeholder), and movies of a specific genre as minority-owned businesses (third stakeholder). We created the following datasets for training and offline evaluation of the bandit algorithms:

Movies: we use the top m movies with the highest number of ratings as arms.

Movies' Features: the genres of the movies are used as the context information of the arms.

Users' Features: users' demographic features, such as age range and gender, along with the normalized average of the genres of the movies that each user rated, are used as the context information of the users.

Users Data: the users' ratings of the movies (integers between 1 and 5) are used as user feedback in the bandit evaluation. If the rating of a selected movie is greater than 3.0, the movie is assumed to be relevant to the preferences of that user, who will then click on it. It should be noted that we used the Singular Value Decomposition (SVD) technique to fill in the missing user ratings.

Suppliers Data: as there are no ratings from production companies towards users in the MovieLens or IMDB datasets, we generated a synthetic dataset. First, we clustered all users using the KMeans algorithm; then, for each production company, we used a truncated normal distribution with mean 3 and standard deviation 1 to generate random integers between 1 and 5 as the ratings of that production company towards the users in the same cluster. If a production company has given a rating greater than 3.0 to the current user, that user is assumed to be relevant to that company's preferences.

Minority Data: we consider 'Sci-Fi' movies as minority-owned businesses (since only about 7% of all movies in the MovieLens data are 'Sci-Fi'). If a selected movie is Sci-Fi, it is assumed to be relevant to minority-owned businesses.

Observation Data: we use a stream of users who rated the top movies (arms) in the past to evaluate the bandits.

5.2 Base Bandit Algorithms

In this section, we describe the base multi-armed bandit algorithms that are used in our approach:
β€’ Hybrid LinUCB: as mentioned earlier, contextual bandits are a variant of the multi-armed bandit problem that utilizes contextual information. In many applications, the arms are not entirely distinct from each other and have shared features. For example, in a coupon recommendation system where the coupons are the arms recommended to users, there may be some similarities between coupons, and thus they share similar features. Li et al. [11] introduced LinUCB with a hybrid linear model, where the payoff of each arm is a linear function of shared and non-shared components. We use the contextual information of users along with the arm features for this algorithm. LinUCB has a hyper-parameter, 𝛼, which should be tuned in advance.
β€’ Thompson Sampling: a Bayesian multi-armed bandit algorithm in which a posterior distribution is used to summarize the reward values inferred from past data. Under this distribution, the algorithm selects each arm in proportion to its probability of being optimal [22].
β€’ Epsilon-greedy: in each round, this algorithm selects a random arm with probability πœ–, and chooses the arm with the highest empirical mean with probability 1 βˆ’ πœ– [6]. πœ– is a hyper-parameter and needs to be tuned properly.
β€’ UCB1: an upper confidence bound is calculated for each arm, and the algorithm chooses the arm with the highest bound in each round [5].
β€’ Random: it randomly selects an arm to pull at each round.

5.3 Experimental Results

In this section we present and discuss the results of our experiments.

Results with Different Criteria: we conducted multiple experiments for all base bandit algorithms with the default settings (see Table 1) to compare the impact of our proposed evaluation criteria in terms of Balance-Relevance Rate. As Table 2 shows for the Hybrid LinUCB algorithm, the multi-sided relevance function outperforms the other two criteria (i.e., Deterministic and Probabilistic) in almost all experiments with different weight combinations. This shows that, given the priority of the stakeholders, the multi-sided relevance function can provide a better trade-off between the satisfaction of all stakeholders in the long term. For this reason, we carried out the next experiments using the multi-sided relevance function as the evaluation criterion.

Table 1. Default experimental settings
Number of Rounds (𝑇) | 5000
Number of Arms (|𝐴|) | 100
Number of Slots (π‘˜) | 3
Stakeholders' Weights (𝑀𝑖) | All triplet combinations of 0.6, 0.3, 0.1
Relevance Threshold (𝛿) | 0.5
LinUCB Hyper-parameter (𝛼) | 0.5
EG Hyper-parameter (πœ–) | 0.15

Table 2. Comparing evaluation criteria for Hybrid LinUCB in terms of balance-relevance rate
Stakeholders' Weights | Prioritized Stakeholder | Deterministic | Probabilistic | Multi-sided Relevance Function
(0.33, 0.33, 0.33) | - | 0.442 | 0.755 | 0.900
(0.6, 0.3, 0.1) | user | 0.840 | 0.725 | 0.851
(0.6, 0.1, 0.3) | user | 0.818 | 0.695 | 0.841
(0.3, 0.6, 0.1) | supplier | 0.965 | 0.789 | 0.967
(0.1, 0.6, 0.3) | supplier | 0.917 | 0.834 | 0.921
(0.3, 0.1, 0.6) | minority | 0.967 | 0.823 | 0.967
(0.1, 0.3, 0.6) | minority | 0.954 | 0.862 | 0.948

Results with Multi-sided Relevance Function: in this section we describe and illustrate the results of several experiments with the default settings (see Table 1) and compare the base bandit algorithms in terms of average rewards and average NDCG while prioritizing different stakeholders. As shown in Figure 1, Hybrid LinUCB achieves the highest average rewards compared to the context-free bandit algorithms when we prioritize users (Figure 1a) or suppliers (Figure 1b). This indicates that utilizing the contextual features of users and items can yield higher average rewards for all stakeholders when applying the Multi-sided Relevance Function as the evaluation criterion. It should be mentioned that our results also show that the Hybrid LinUCB algorithm outperforms the other base bandits when we prioritize the minority stakeholder (the third stakeholder).
Furthermore, Figure 2 illustrates the Average NDCG for the different stakeholders when running the different bandit algorithms with the multi-sided relevance function. As one can see, Hybrid LinUCB provides a higher Average NDCG for the prioritized stakeholder and achieves a good balance rate between the Average NDCG and the weights of the stakeholders in all scenarios, whether we prioritize users (a), suppliers (b), or the minority stakeholder (c).

Fig. 1. Average rewards for 100 arms and 3 ranking slots when prioritizing users (a; weights: user 0.6, supplier 0.3, minority 0.1) and suppliers (b; weights: user 0.3, supplier 0.6, minority 0.1).
Fig. 2. Average NDCG per stakeholder (user, supplier, minority) for HybridLinUCB, TS, EG, UCB1, and RAND when prioritizing users (a; weights 0.6, 0.3, 0.1), suppliers (b; weights 0.3, 0.6, 0.1), and minority (c; weights 0.1, 0.3, 0.6).

Sensitivity Analysis: we performed multiple sensitivity analyses to see how the Balance-Relevance Rate changes with different sets of stakeholders' weights, different numbers of arms, and different numbers of ranking slots:
β€’ Stakeholders' Weights: we set the weights of the stakeholders to different values (e.g., 0.5, 0.25, 0.25) whose sum is equal to 1. The results of these experiments showed a similar pattern to the results of the experiments with the default weights.
β€’ Number of Arms: we set the number of arms to 100 (default), 200, and 300 for the different bandits while prioritizing different stakeholders. As the numbers in Table 3 show, Hybrid LinUCB with the multi-sided relevance function outperforms the other bandit algorithms in terms of balance-relevance rate even when we change the number of arms.
β€’ Number of Slots: we change the number of ranking slots (offered items) to 1, 3 (default), and 5 while prioritizing one stakeholder at a time for 100 arms. As the results of these experiments in Table 4 show, Hybrid LinUCB with the multi-sided relevance function works better than the other algorithms in terms of balance-relevance rate for various numbers of slots. There are only two scenarios where Thompson Sampling works slightly better than Hybrid LinUCB, but the difference between their balance-relevance rates is negligible (<0.006). It is worth noting that balance-relevance rates grow when the number of offered items (slots) increases, which is reasonable: when the recommender system offers a higher number of items to users, there is a higher probability that the offered items are relevant to the stakeholders.

Table 3. Comparing bandits in terms of balance-relevance rate when changing the number of arms
Stakeholders' Weights | Prioritized Stakeholder | Number of Arms | Hybrid LinUCB | TS | EG | UCB1 | RAND
(0.6, 0.3, 0.1) | User | 100 | 0.851 | 0.657 | 0.715 | 0.723 | 0.429
(0.6, 0.3, 0.1) | User | 200 | 0.835 | 0.527 | 0.698 | 0.575 | 0.348
(0.6, 0.3, 0.1) | User | 300 | 0.834 | 0.525 | 0.704 | 0.466 | 0.293
(0.3, 0.6, 0.1) | Supplier | 100 | 0.967 | 0.724 | 0.801 | 0.758 | 0.424
(0.3, 0.6, 0.1) | Supplier | 200 | 0.930 | 0.716 | 0.821 | 0.653 | 0.444
(0.3, 0.6, 0.1) | Supplier | 300 | 0.934 | 0.678 | 0.870 | 0.588 | 0.434
(0.1, 0.3, 0.6) | Minority | 100 | 0.948 | 0.942 | 0.888 | 0.852 | 0.262
(0.1, 0.3, 0.6) | Minority | 200 | 0.966 | 0.955 | 0.891 | 0.780 | 0.250
(0.1, 0.3, 0.6) | Minority | 300 | 0.953 | 0.942 | 0.887 | 0.694 | 0.201

Table 4. Comparing bandits in terms of balance-relevance rate when changing the number of slots
Stakeholders' Weights | Prioritized Stakeholder | Number of Slots | Hybrid LinUCB | TS | EG | UCB1 | RAND
(0.6, 0.3, 0.1) | User | 1 | 0.808 | 0.557 | 0.593 | 0.64 | 0.316
(0.6, 0.3, 0.1) | User | 3 | 0.851 | 0.657 | 0.715 | 0.723 | 0.429
(0.6, 0.3, 0.1) | User | 5 | 0.864 | 0.65 | 0.757 | 0.732 | 0.473
(0.3, 0.6, 0.1) | Supplier | 1 | 0.953 | 0.631 | 0.701 | 0.664 | 0.270
(0.3, 0.6, 0.1) | Supplier | 3 | 0.967 | 0.724 | 0.801 | 0.758 | 0.424
(0.3, 0.6, 0.1) | Supplier | 5 | 0.967 | 0.741 | 0.836 | 0.773 | 0.466
(0.1, 0.3, 0.6) | Minority | 1 | 0.935 | 0.938 | 0.84 | 0.792 | 0.154
(0.1, 0.3, 0.6) | Minority | 3 | 0.948 | 0.942 | 0.888 | 0.852 | 0.262
(0.1, 0.3, 0.6) | Minority | 5 | 0.949 | 0.955 | 0.955 | 0.857 | 0.307

6 CONCLUSIONS

In this paper, we addressed the problem of recommending coupons in a multi-stakeholder platform where stakeholders can be prioritized. We proposed to use a Ranked Bandit approach that sequentially selects the top π‘˜ items and accepts the best candidates that maximize the payoff, given the relevance and priority of each stakeholder. We introduced three different criteria, namely Deterministic, Probabilistic, and Multi-sided Relevance Function, to consider the priority of each stakeholder when evaluating our approach. We also defined different metrics to evaluate the satisfaction of stakeholders and compare different benchmarks. Our experimental results on real-world and synthetic datasets showed that contextual multi-armed bandits (i.e., Hybrid LinUCB) with a multi-sided relevance function outperform context-free bandits under any evaluation criterion.

ACKNOWLEDGMENTS

This work is part of the PittSmartLiving project, which is supported by NSF award CNS-1739413.

REFERENCES
[1] Himan Abdollahpouri, Gediminas Adomavicius, Robin Burke, Ido Guy, Dietmar Jannach, Toshihiro Kamishima, Jan Krasnodebski, and Luiz Pizzato. 2020. Multistakeholder recommendation: Survey and research directions. User Modeling and User-Adapted Interaction 30, 1 (2020), 127–158.
[2] Himan Abdollahpouri and Robin Burke. 2022. Multistakeholder recommender systems. In Recommender Systems Handbook. Springer, 647–677.
[3] Tahereh Arabghalizi and Alexandros Labrinidis. 2020. Data-driven bus crowding prediction models using context-specific features. ACM Transactions on Data Science 1, 3 (2020), 1–33.
[4] Tahereh Arabghalizi and Alexandros Labrinidis. 2022. Context-aware Multi-stakeholder Recommender Systems. In The International FLAIRS Conference Proceedings, Vol. 35.
[5] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2 (2002), 235–256.
[6] Nicolo Cesa-Bianchi and Paul Fischer. 1998. Finite-Time Regret Bounds for the Multiarmed Bandit Problem. In ICML, Vol. 98. 100–108.
[7] Madalina M. Drugan and Ann Nowe. 2013. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
[8] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 1–19.
[9] Kalervo JΓ€rvelin and Jaana KekΓ€lΓ€inen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[10] Stefano Leone. 2019. IMDb movies extensive dataset. https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset
[11] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. 661–670.
[12] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining. 297–306.
[13] Tyler Lu, DΓ‘vid PΓ‘l, and Martin PΓ‘l. 2010. Contextual multi-armed bandits. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 485–492.
[14] Rishabh Mehrotra, Niannan Xue, and Mounia Lalmas. 2020. Bandit based Optimization of Multiple Objectives on a Music Streaming Platform. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3224–3233.
[15] Phong Nguyen, John Dines, and Jan Krasnodebski. 2017. A multi-objective learning to re-rank approach to optimize online marketplaces for multiple stakeholders. arXiv preprint arXiv:1708.00651 (2017).
[16] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning. 784–791.
[17] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer, 1–35.
[18] Matthew Streeter, Daniel Golovin, and Andreas Krause. 2009. Online learning of assignments. Advances in Neural Information Processing Systems 22 (2009).
[19] Γ–zge SΓΌrer, Robin Burke, and Edward C. Malthouse. 2018. Multistakeholder recommendation with provider constraints. In Proceedings of the 12th ACM Conference on Recommender Systems. 54–62.
[20] Liang Tang, Yexi Jiang, Lei Li, and Tao Li. 2014. Ensemble contextual bandits for personalized recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems. 73–80.
[21] Cem Tekin and Eralp Turğay. 2018. Multi-objective contextual multi-armed bandit with a dominant objective. IEEE Transactions on Signal Processing 66, 14 (2018), 3799–3813.
[22] William R. Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3-4 (1933), 285–294.
[23] Saba Q. Yahyaa, Madalina M. Drugan, and Bernard Manderick. 2014. The scalarized multi-objective multi-armed bandit problem: An empirical study of its exploration vs. exploitation tradeoff. In 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, 2290–2297.