<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Meta-Learning and MAB Approach for Context-Specific Multi-Objective Recommendation Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tiago Cunha</string-name>
          <email>tsacunha@expediagroup.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Marchini</string-name>
          <email>amarchini@expediagroup.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Expedia Group</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Expedia Group</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recommender systems in online marketplaces face the challenge of balancing multiple objectives to satisfy various stakeholders, including customers, providers, and the platform itself. This paper introduces Juggler-MAB, a hybrid approach that combines meta-learning with Multi-Armed Bandits (MAB) to address the limitations of existing multi-stakeholder recommendation systems. Our method extends the Juggler framework, which uses meta-learning to predict optimal weights for utility and compensation adjustments, by incorporating a MAB component for real-time, context-specific refinements. We present a two-stage approach where Juggler provides initial weight predictions, followed by MAB-based adjustments that adapt to rapid changes in user behavior and market conditions. Our system leverages contextual features such as device type and brand to make fine-grained weight adjustments based on specific segments. To evaluate our approach, we developed a simulation framework using a dataset of 0.6 million searches from Expedia's lodging booking platform. Results show that Juggler-MAB outperforms the original Juggler model across all metrics, with NDCG improvements of 2.9%, a 13.7% reduction in regret, and a 9.8% improvement in best arm selection rate.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recommender systems often focus solely on user satisfaction. However, in many real-world applications,
particularly in online marketplaces, multiple stakeholders’ interests need to be considered [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. These
stakeholders typically include users (customers), item providers (e.g., hotel owners), and the platform.
Multi-stakeholder recommenders aim to balance these diverse and often conflicting objectives [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        The Juggler framework [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was introduced to address this multi-stakeholder recommendation problem
by using meta-learning [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] to predict optimal weights for utility and compensation adjustments in
real-time scoring. Deployed in production, Juggler has been an integral part of the Lodging Ranking
stack at Expedia. However, Juggler’s reliance on a pre-configured set of five options for relevance and
compensation limits its ability to fine-tune recommendations for specific contexts. Additionally, its
infrequent training cycles make it less responsive to rapid changes in traffic patterns across segments.
      </p>
      <p>To address these limitations, we propose a two-step approach that combines meta-learning (Juggler)
with Multi-Armed Bandits (MAB) for real-time weight adjustments in multi-stakeholder recommendations.
This approach aims to: 1) provide more granular weight adjustments based on specific segments (e.g.,
device type, brand) and 2) adapt quickly to changes in traffic patterns without requiring frequent
retraining of the main Juggler model. Our research questions are:
• Can the integration of MAB with Juggler improve the performance and adaptability of
multi-stakeholder recommendations in online marketplaces?
• Are contextual features useful to improve the MAB’s effectiveness at making the right decisions?</p>
      <p>SURE workshop held in conjunction with the 18th ACM Conference on Recommender Systems (RecSys), 2024, in Bari, Italy.</p>
      <p>The rest of this paper is organized as follows: Section 2 presents the related work, while Section 3
introduces the proposed hybrid solution. Section 4 covers the experimental setup used to validate the
proposal, while Section 5 reports the results for the research questions. Lastly, Section 6 highlights
the main conclusions and avenues for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        The Juggler framework [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was introduced as a meta-learning approach to address the multi-stakeholder
recommendation problem. It dynamically predicts the ideal weights for utility (user relevance) and
compensation (platform revenue) for each search query. The meta-model leverages a collection of
historical search queries and learns the mapping between the search context and the ideal utility and
compensation weights, learned via offline simulations. Juggler selects from five pre-configured options,
each representing a different balance between relevance and compensation: 1) Lower relevance, lower
compensation, 2) Lower relevance, higher compensation, 3) Neutral relevance, neutral compensation,
4) Higher relevance, lower compensation and 5) Higher relevance, higher compensation. The
pre-configured options refer to sections of the search space which are explored to identify different
directions of improvement, while reducing the number of options to ultimately choose from. It is noteworthy
that although the pre-configured options are fixed, the actual instantiation of weights for each option
depends on the ranking problem characteristics and Juggler framework hyper-parameters. While
Juggler has shown success in production, its reliance on these fixed options and infrequent training
cycles limits its adaptability to rapid changes in user behavior and market conditions.
      </p>
      <p>
        Multi-Armed Bandits (MAB) are a class of reinforcement learning algorithms that balance exploration
and exploitation in decision-making processes [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In the context of recommenders, MABs have been
used to address the exploration-exploitation dilemma and to adapt to changing user preferences [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ].
      </p>
      <p>
        The integration of meta-learning and bandit algorithms has been explored in other domains, such as
algorithm selection [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and hyperparameter optimization [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Our work extends these ideas to the
realm of multi-stakeholder recommendations, addressing the unique challenges of online marketplaces.
      </p>
      <p>
        Several studies have addressed the challenge of balancing multiple objectives in recommender
systems. Rodriguez et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed a multi-objective optimization approach for job recommendations.
Nguyen et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced a multi-objective learning to re-rank approach to optimize online
marketplaces for multiple stakeholders. Sürer et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] explored multi-stakeholder recommendation with
provider constraints. These approaches provide valuable insights into balancing multiple objectives, but
our proposed method aims to extend their capabilities by combining meta-learning with multi-armed
bandits for enhanced adaptability in dynamic online marketplaces.
      </p>
      <p>
        Recent developments in industry have led to the creation of self-service platforms for deploying
contextual bandits, such as AdaptEx [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. These platforms provide powerful tools for optimizing user
experiences at scale, which we leverage in our hybrid approach to combine the strengths of meta-learning
and MAB algorithms. To evaluate our approach, we utilized a custom simulation framework based on
real-world data from an online travel marketplace. This allowed us to assess the performance of our
system in a controlled yet realistic setting, similar to other sophisticated simulation environments [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Juggler with MAB</title>
      <p>
        We present a hybrid approach that combines the Juggler framework’s meta-learning capabilities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
with a MAB system powered by the AdaptEx SDK [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This approach, which we call “Juggler-MAB”, aims to address the limitations of the original Juggler
system while leveraging the adaptive capabilities of contextual bandits. The Juggler-MAB system operates in two stages:
1. Juggler Stage: The meta-learning model predicts initial utility and compensation weights based
on the search context.
2. MAB Stage: A contextual MAB refines these weights in real-time based on user interactions and
search features.
      </p>
      <p>
        The Juggler framework selects from five pre-configured options for utility and compensation weights,
providing a coarse adjustment of the recommendation strategy based on the search context. These
options range from lower relevance and compensation to higher relevance and compensation, as
described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and aim to tackle the main issues in multi-objective optimization.
      </p>
      <p>The MAB component introduces fine-grained adjustments to the Juggler-predicted weights. Each
arm of the bandit represents a small corrective measure to be applied to the utility and compensation
weights to improve relevance.</p>
      <p>The key features of our MAB implementation include:
1. Contextual arms: The contextual bandits consider contextual features (e.g., device type, brand)
when selecting arms.
2. Reward function: We use Normalized Discounted Cumulative Gain (NDCG) as a proxy for
Conversion Rate, allowing for offline simulation and evaluation.</p>
      <p>
        3. Exploration strategy: We employ epsilon-greedy and Thompson Sampling for its ability to
balance exploration and exploitation effectively [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The integration of Juggler and MAB is achieved through an additive approach in the scoring function:
w′_u = w_u + δ_u (1)
w′_c = w_c + δ_c (2)
score = w′_u ⋅ utility + w′_c ⋅ compensation
where w_u and w_c are the weights predicted by Juggler, and δ_u and δ_c are the
corrective weights determined by the MAB.</p>
      <p>We formulate our contextual MAB problem as follows: let A be the set of arms, where each arm
a ∈ A represents a pair of corrective weights (δ_u, δ_c). The context x_t ∈ X at time t includes
features such as device or brand. The reward r_t is defined as the NDCG of the resulting ranking. The
goal is to find a policy π : X → A that maximizes the expected cumulative reward:
max_π E[ Σ_{t=1}^{T} r_t(x_t, π(x_t)) ]
where T is the time horizon.</p>
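To make the policy π : X → A concrete, here is a minimal contextual epsilon-greedy sketch that keeps one running mean reward per (context, arm) pair. It is an illustration only; the experiments use the AdaptEx SDK rather than this code.

```python
import random
from collections import defaultdict

# Minimal contextual epsilon-greedy policy (illustrative, not AdaptEx).
class EpsilonGreedyPolicy:
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # pulls per (context, arm)
        self.means = defaultdict(float)   # running mean reward per (context, arm)

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore uniformly
        # exploit: arm with the best estimated reward in this context
        return max(self.arms, key=lambda a: self.means[(context, a)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]
```

With epsilon = 0.1, one call in ten explores a random arm; the rest exploit the best known arm for the current context.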
      <p>We explored various methods to combine Juggler’s predictions with MAB corrections, ultimately
settling on the additive approach described above. We carefully selected contextual variables that
would help identify under-performing segments in the Juggler model, such as device type and brand.
Balancing multiple objectives in a single reward function required careful consideration. We chose
NDCG as an initial approach due to its widely accepted usage, with plans to explore more complex
multi-objective reward functions in future work.</p>
      <p>To evaluate our hybrid approach, we developed a custom simulator that allows us to test various
configurations offline using historical data. The simulator, built on Expedia data, enables us to:
1. Replay historical searches and user interactions. Data is loaded on a daily basis, consisting of
data for each property in each search and the respective user clicks and bookings.
2. Apply the Juggler-MAB model to generate new rankings. The MAB is sampled (potentially
using contextual data) and the retrieved arm is included in the ranking formula, yielding the
simulated score and the final ranking.
3. Evaluate the performance using both immediate (e.g., clicks) and delayed (e.g., bookings) feedback.</p>
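The three simulator steps can be sketched as a daily replay loop. Every name below is a hypothetical stand-in; the data access, ranking, and reward logic are injected as callables rather than drawn from the real simulator's API.

```python
# Day-by-day replay loop following the three simulator steps above.
# All names are illustrative; callables are injected for self-containment.
def replay(days, mab, load_day, rank_with_arm, reward_fn):
    total, n = 0.0, 0
    for day in days:
        for search in load_day(day):                 # 1. replay logged searches
            context = (search["device"], search["brand"])
            arm = mab.select(context)                # 2. sample a corrective arm
            ranking = rank_with_arm(search, arm)     #    and re-rank with it
            reward = reward_fn(ranking, search)      # 3. score clicks/bookings
            mab.update(context, arm, reward)         #    and update the bandit
            total, n = total + reward, n + 1
    return total / max(n, 1)                         # average reward over replay
```

In the paper's setup the injected `reward_fn` would be NDCG over the logged feedback.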
      <p>The reward function evaluates the simulated rankings; the sampled arm, the reward, and the contextual
information (if any) are then provided to the MAB to update its internal state.</p>
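The NDCG reward can be sketched as a standard NDCG@k over the logged relevance labels (e.g. clicks/bookings) reordered by the simulated ranking; this is a textbook formulation, not the simulator's exact code.

```python
import math

# Standard NDCG@k: relevances are logged labels in simulated-ranking order.
def ndcg(relevances, k=10):
    def dcg(rels):
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that places the clicked property first scores 1.0; pushing it down discounts the reward logarithmically.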
      <p>The simulation framework provides a safe environment to test and refine our approach before
considering online deployment.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>We used a dataset of 0.6 million searches from Expedia’s lodging booking platform, covering a period of
31 consecutive days. The data contains over 600,000 distinct properties across approximately 41,000 distinct
destinations, with feedback sparsity over 96%. The dataset includes features such as device type, brand,
destination, and historical user interactions.</p>
      <p>
        We compared several variants of the proposed Juggler-MAB hybrid approach against the original
Juggler model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We tested several MAB algorithms, ranging from classical (i.e. no contextual features)
to contextual bandits:
      </p>
      <p>• Gaussian Thompson (GT): a classical bandit using Thompson Sampling, assuming a Gaussian
distribution of the reward value.
• ε-greedy: a classical bandit using a vanilla implementation of the canonical algorithm. We used
ε = 0.1 and ε = 0.3.
• Recursive Least Squares with Thompson Sampling (RLS): a contextual bandit using a linear model
with a vector of means and a matrix of variances-covariances.</p>
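A minimal sketch of the Gaussian Thompson (GT) variant: one Gaussian posterior per arm, and selection by highest posterior sample. This is an illustrative context-free version with a unit-variance likelihood, not the AdaptEx implementation.

```python
import math
import random

# Illustrative Gaussian Thompson Sampling bandit (context-free).
class GaussianThompson:
    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        self.stats = {arm: (0.0, 0) for arm in arms}  # (posterior mean, pulls)

    def select(self):
        def sample(arm):
            mean, n = self.stats[arm]
            # posterior std shrinks as the arm is pulled more often
            return self.rng.gauss(mean, 1.0 / math.sqrt(n + 1))
        return max(self.stats, key=sample)

    def update(self, arm, reward):
        mean, n = self.stats[arm]
        self.stats[arm] = ((mean * n + reward) / (n + 1), n + 1)
```

Early on, wide posteriors make every arm plausible (exploration); as pulls accumulate, the posteriors concentrate and the best arm dominates (exploitation).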
      <p>
        The experiments use the actual production Juggler model predictions for each search. This improves
the reliability of Juggler’s predictions, which in turn leads to more robust estimates of the MAB’s
effect. We then implemented the MAB component using the AdaptEx SDK [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], with the following
configuration:
• Arm space: we explore 3 different values for each corrective weight, namely δ_u ∈ {−0.3, 0.0, 0.3} and
δ_c ∈ {−0.2, 0.0, 0.2}. The selected weights are determined via domain knowledge, also ensuring
non-zero weights.
• Contextual features: several low-cardinality categorical search features were tested, with 3 being
identified as the most important: brand, user device and geographical categorization of the search
destination, e.g. neighborhood vs city.
• Exploration strategy: Thompson Sampling and ε-greedy.
• Reward: Normalized Discounted Cumulative Gain (NDCG), to determine how well MAB
algorithms can correct towards relevance and expected conversion rate improvement.
      </p>
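For concreteness, the 3×3 arm space described above can be enumerated as the cross product of the two delta sets (variable names here are illustrative):

```python
from itertools import product

# The arm space from the configuration above: 3 utility deltas x 3
# compensation deltas = 9 arms; (0.0, 0.0) keeps Juggler's weights unchanged.
UTILITY_DELTAS = (-0.3, 0.0, 0.3)
COMPENSATION_DELTAS = (-0.2, 0.0, 0.2)
ARMS = list(product(UTILITY_DELTAS, COMPENSATION_DELTAS))
```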
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>Our Juggler-MAB hybrid approach outperformed the Juggler baseline across all metrics for all bandits
proposed. The NDCG improvements range from +0.8% for the GT bandit all the way to +2.9% for several
RLS bandits. In terms of regret, we achieve a reduction of 13.7% and an improvement in best arm
selection rate of 9.8%.</p>
      <p>The ε-greedy algorithms provide very strong baselines, especially when ε = 0.1. The GT bandit is clearly
the worst bandit, yet still useful since it outperforms the baseline. Among the contextual bandits, the
best one across all metrics is the RLS. Interestingly, when using more contextual features, we did
not achieve better performance. Further investigation is required to identify what matters to define
the context.</p>
      <p>
        We performed Wilcoxon signed-rank tests and observed no statistical difference between all RLS
bandits. The Critical Difference [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] diagram for the remaining bandits is shown in Figure 1. The results
show no statistically significant difference between RLS and ε = 0.1, hinting that contextual features are
not meaningful. However, all RLS bandits are better than the baselines. Note as well how all bandits
are better than the Juggler baseline - this is a testament to the value of the hybrid approach proposed.
      </p>
      <p>Figure 2 shows the learning dynamics for all bandits across all days in the data sample. To improve
interpretation, we include only the best contextual bandit. The Juggler-MAB demonstrated fast
adaptation to changing conditions. We observed that the MAB component was able to make fine-grained
adjustments to the Juggler predictions, resulting in improved performance.</p>
      <p>We now inspect Juggler-MAB’s effect on lodging ranking top-10 average statistics in Table 2. The
results are reported as differences to the Juggler baseline, as we cannot expose the sensitive raw data.</p>
      <p>Table 2 reports the following metrics: daily price, guest rating, star rating, margin %, and margin $.</p>
      <p>The results show a clear pattern for all bandits: the average daily price decreases and guest and star
ratings increase as NDCG improves. On the contrary, margin % and margin $ decrease, which could
pose problems for the marketplace objectives and long-term health. The expectation, to be validated
via an A/B test, is that the increase in relevance will lead to an improvement in conversion rate which can
offset the impact on profit per transaction.</p>
      <p>Diving deeper into the arm selection per bandit, we present Figure 3. The results show a clear
and expected preference towards arms with lower compensation weights, as these are not aligned with the
NDCG reward. However, it is interesting to observe that the best bandit has learned that not only is it
ideal to decrease compensation, but also to increase or decrease relevance depending on the context.</p>
      <p>
        Despite the overall positive results, we identified two limitations. First, the reward function considers
only a single dimension of the problem (i.e. relevance), which explains the impact on the
compensation component. Future work will address this limitation by using multi-objective optimization
techniques [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Second, our current simulations use historical interactions with a deterministic logging
policy, introducing bias. To address this, we will implement off-policy evaluation techniques [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this paper, we presented a novel hybrid approach combining Meta-Learning with Multi-Armed
Bandits for multi-stakeholder recommendations in online travel marketplaces. Our Juggler-MAB
system demonstrated significant improvements over existing methods. Key contributions of our work
include 1) an integration of meta-learning and contextual bandits for recommendation systems and 2)
empirical evidence of the effectiveness of our approach in a large-scale, real-world setting. Based on
our findings and the limitations identified, we propose the following directions for future research:
1. Online testing: Conduct A/B tests in a production environment to validate the performance of
Juggler-MAB under real-world conditions and user behaviors.</p>
      <p>
2. Dynamic arm space: Explore methods for dynamically adjusting the arm space of the MAB
component based on observed performance and changing market conditions.
3. Fairness considerations: Incorporate explicit fairness constraints or objectives into the MAB
formulation to ensure equitable treatment of different provider segments [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
4. Long-term value optimization: Extend the approach to consider long-term user value, potentially
using reinforcement learning techniques for sequential decision-making.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdollahpouri</surname>
          </string-name>
          , G. Adomavicius,
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Guy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kamishima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krasnodebski</surname>
          </string-name>
          , L. Pizzato,
          <article-title>Multistakeholder recommendation: Survey and research directions</article-title>
          ,
          <source>User Modeling and User-Adapted Interaction 30</source>
          (
          <year>2020</year>
          )
          <fpage>127</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdollahpouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          <article-title>, Multi-stakeholder recommendation and its connection to multi-sided fairness</article-title>
          , in: Workshop on Recommendation in Multi-stakeholder Environments
          (RMSE'19),
          <source>in Conjunction with the 13th ACM Conference on Recommender Systems, RecSys'19</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mehrotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <article-title>Recommendations in a marketplace</article-title>
          ,
          <source>in: Proceedings of the 13th ACM Conference on Recommender Systems</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>580</fpage>
          -
          <lpage>581</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          , G. Adomavicius,
          <article-title>Recommendations with a purpose</article-title>
          ,
          <source>in: Proceedings of the 10th ACM Conference on Recommender Systems</source>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cunha</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Partalas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Juggler: Multi-stakeholder ranking with meta-learning</article-title>
          ,
          <source>in: Proceedings of the MORS workshop at the 15th ACM Conference on Recommender Systems, CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cunha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          , A. C. de Carvalho,
          <article-title>Metalearning and recommender systems: A literature review and empirical study on the algorithm selection problem for collaborative filtering</article-title>
          ,
          <source>Information Sciences 423</source>
          (
          <year>2018</year>
          )
          <fpage>128</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cunha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          , A. C. de Carvalho,
          <article-title>Cf4cf: Recommending collaborative filtering algorithms using collaborative filtering</article-title>
          ,
          <source>in: RecSys 2018 - Proceedings of the 12th ACM Conference on Recommender Systems</source>
          ,
          <year>2018</year>
          , p.
          <fpage>357</fpage>
          -
          <lpage>361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lattimore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvári</surname>
          </string-name>
          , Bandit algorithms, Cambridge University Press (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Schapire</surname>
          </string-name>
          ,
          <article-title>A contextual-bandit approach to personalized news article recommendation</article-title>
          ,
          <source>in: Proceedings of the 19th international conference on World wide web</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>661</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Factorization bandits for interactive recommendation</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>31</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          ,
          <article-title>Sequential model-based optimization for general algorithm configuration</article-title>
          ,
          <source>International conference on learning and intelligent optimization</source>
          (
          <year>2011</year>
          )
          <fpage>507</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Falkner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Bohb: Robust and efficient hyperparameter optimization at scale</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1437</fpage>
          -
          <lpage>1446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Posse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Multiple objective optimization in recommender systems</article-title>
          ,
          <source>in: Proceedings of the sixth ACM conference on Recommender systems</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krasnodebski</surname>
          </string-name>
          ,
          <article-title>A multi-objective learning to re-rank approach to optimize online marketplaces for multiple stakeholders</article-title>
          ,
          <source>arXiv preprint arXiv:1708.00651</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Sürer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Malthouse</surname>
          </string-name>
          ,
          <article-title>Multistakeholder recommendation with provider constraints</article-title>
          ,
          <source>in: Proceedings of the 12th ACM Conference on Recommender Systems</source>
          , ACM,
          <year>2018</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ilhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Markeviciute</surname>
          </string-name>
          ,
          <article-title>Adaptex: A self-service contextual bandit platform</article-title>
          ,
          <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>839</fpage>
          -
          <lpage>842</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Navrekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Boutilier</surname>
          </string-name>
          ,
          <article-title>Recsim: A configurable simulation platform for recommender systems</article-title>
          ,
          <source>arXiv preprint arXiv:1909.04847</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Demšar</surname>
          </string-name>
          ,
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>7</volume>
          (
          <year>2006</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dudík</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Doubly robust policy evaluation and learning</article-title>
          ,
          <source>ICML'11</source>
          , Omnipress, Madison, WI, USA,
          <year>2011</year>
          , pp.
          <fpage>1097</fpage>
          -
          <lpage>1104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>The self-normalized estimator for counterfactual learning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sonboli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ordoñez-Gauger</surname>
          </string-name>
          ,
          <article-title>Balanced neighborhoods for multi-sided fairness in recommendation</article-title>
          ,
          <source>in: Conference on Fairness, Accountability and Transparency, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>