<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Hybrid Meta-Learning and MAB Approach for Context-Specific Multi-Objective Recommendation Optimization</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tiago</forename><surname>Cunha</surname></persName>
							<email>tsacunha@expediagroup.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Expedia Group</orgName>
								<address>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrea</forename><surname>Marchini</surname></persName>
							<email>amarchini@expediagroup.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Expedia Group</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department" key="dep1">18th</orgName>
								<orgName type="department" key="dep2">ACM Conference on Recommender Systems (RecSys)</orgName>
								<address>
									<postCode>2024</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Hybrid Meta-Learning and MAB Approach for Context-Specific Multi-Objective Recommendation Optimization</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">B207510EB70EC56E9E741A29BFA14BA4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:44+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Multi-Stakeholder</term>
					<term>Multi-Armed bandits</term>
					<term>Meta-Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Recommender systems in online marketplaces face the challenge of balancing multiple objectives to satisfy various stakeholders, including customers, providers, and the platform itself. This paper introduces Juggler-MAB, a hybrid approach that combines meta-learning with Multi-Armed Bandits (MAB) to address the limitations of existing multi-stakeholder recommendation systems. Our method extends the Juggler framework, which uses meta-learning to predict optimal weights for utility and compensation adjustments, by incorporating a MAB component for real-time, context-specific refinements. We present a two-stage approach where Juggler provides initial weight predictions, followed by MAB-based adjustments that adapt to rapid changes in user behavior and market conditions. Our system leverages contextual features such as device type and brand to make fine-grained weight adjustments based on specific segments. To evaluate our approach, we developed a simulation framework using a dataset of 0.6 million searches from Expedia's lodging booking platform. Results show that Juggler-MAB outperforms the original Juggler model across all metrics, with NDCG improvements of 2.9%, a 13.7% reduction in regret, and a 9.8% improvement in best arm selection rate.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Recommender systems often focus solely on user satisfaction. However, in many real-world applications, particularly in online marketplaces, multiple stakeholders' interests need to be considered <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. These stakeholders typically include users (customers), item providers (e.g., hotel owners), and the platform. Multi-stakeholder recommenders aim to balance these diverse and often conflicting objectives <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>The Juggler framework <ref type="bibr" target="#b4">[5]</ref> was introduced to address this multi-stakeholder recommendation problem by using meta-learning <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref> to predict optimal weights for utility and compensation adjustments in real-time scoring. Deployed in production, Juggler has been an integral part of the Lodging Ranking stack at Expedia. However, Juggler's reliance on a pre-configured set of five options for relevance and compensation limits its ability to fine-tune recommendations for specific contexts. Additionally, its infrequent training cycles make it less responsive to rapid changes in traffic patterns across segments.</p><p>To address these limitations, we propose a two-step approach that combines meta-learning (Juggler) with Multi-Armed Bandits (MAB) for real-time weight adjustments in multi-stakeholder recommendations. This approach aims to: 1) provide more granular weight adjustments based on specific segments (e.g., device type, brand) and 2) adapt quickly to changes in traffic patterns without requiring frequent retraining of the main Juggler model. Our research questions are:</p><p>• Can the integration of MAB with Juggler improve the performance and adaptability of multi-stakeholder recommendations in online marketplaces? 
• Are contextual features useful for improving the MAB's effectiveness at making the right decisions?</p><p>The rest of this paper is organized as follows: Section 2 presents the related work, while Section 3 introduces the proposed hybrid solution. Section 4 covers the experimental setup used to validate the proposal, while Section 5 reports the results for the research questions. Lastly, Section 6 highlights the main conclusions and avenues for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Related Work</head><p>The Juggler framework <ref type="bibr" target="#b4">[5]</ref> was introduced as a meta-learning approach to address the multi-stakeholder recommendation problem. It dynamically predicts the ideal weights for utility (user relevance) and compensation (platform revenue) for each search query. The meta-model leverages a collection of historical search queries and learns the mapping between the search context and the ideal utility and compensation weights, which are obtained via offline simulations. Juggler selects from five pre-configured options, each representing a different balance between relevance and compensation: 1) Lower relevance, lower compensation, 2) Lower relevance, higher compensation, 3) Neutral relevance, neutral compensation, 4) Higher relevance, lower compensation and 5) Higher relevance, higher compensation. The pre-configured options refer to sections of the search space which are explored to identify different directions of improvement, while reducing the number of options to ultimately choose from. It is noteworthy that although the pre-configured options are fixed, the actual instantiation of weights for each option depends on the ranking problem characteristics and the Juggler framework's hyper-parameters. While Juggler has shown success in production, its reliance on these fixed options and infrequent training cycles limit its adaptability to rapid changes in user behavior and market conditions.</p><p>Multi-Armed Bandits (MAB) are a class of reinforcement learning algorithms that balance exploration and exploitation in decision-making processes <ref type="bibr" target="#b7">[8]</ref>. 
In the context of recommenders, MABs have been used to address the exploration-exploitation dilemma and to adapt to changing user preferences <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>.</p><p>The integration of meta-learning and bandit algorithms has been explored in other domains, such as algorithm selection <ref type="bibr" target="#b10">[11]</ref> and hyperparameter optimization <ref type="bibr" target="#b11">[12]</ref>. Our work extends these ideas to the realm of multi-stakeholder recommendations, addressing the unique challenges of online marketplaces.</p><p>Several studies have addressed the challenge of balancing multiple objectives in recommender systems. Rodriguez et al. <ref type="bibr" target="#b12">[13]</ref> proposed a multi-objective optimization approach for job recommendations. Nguyen et al. <ref type="bibr" target="#b13">[14]</ref> introduced a multi-objective learning to re-rank approach to optimize online marketplaces for multiple stakeholders. Sürer et al. <ref type="bibr" target="#b14">[15]</ref> explored multi-stakeholder recommendation with provider constraints. These approaches provide valuable insights into balancing multiple objectives, but our proposed method aims to extend their capabilities by combining meta-learning with multi-armed bandits for enhanced adaptability in dynamic online marketplaces.</p><p>Recent developments in industry have led to the creation of self-service platforms for deploying contextual bandits, such as AdaptEx <ref type="bibr" target="#b15">[16]</ref>. These platforms provide powerful tools for optimizing user experiences at scale, which we leverage in our hybrid approach to combine the strengths of meta-learning and MAB algorithms. To evaluate our approach, we utilized a custom simulation framework based on real-world data from an online travel marketplace. 
This allowed us to assess the performance of our system in a controlled yet realistic setting, similar to other sophisticated simulation environments <ref type="bibr" target="#b16">[17]</ref>.</p></div>
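The exploration-exploitation trade-off discussed above can be illustrated with a minimal Gaussian Thompson Sampling loop. The class below is an illustrative sketch only (not the paper's or AdaptEx's implementation), using a simplified posterior width that shrinks with the pull count:

```python
import random

class GaussianThompsonBandit:
    """Toy Gaussian Thompson Sampling: keep a running mean per arm,
    sample a plausible mean reward from a Gaussian whose width shrinks
    with the number of pulls, and pull the arm with the best sample."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select_arm(self):
        # Wide posteriors (few pulls) encourage exploration;
        # narrow posteriors (many pulls) encourage exploitation.
        samples = [
            random.gauss(self.means[a], 1.0 / (self.counts[a] + 1))
            for a in range(len(self.counts))
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Incremental running-mean update.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

With two arms whose true rewards are 0.2 and 0.8, the loop quickly concentrates its pulls on the better arm while still occasionally exploring the other.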
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Juggler with MAB</head><p>We present a hybrid approach that combines the Juggler framework's meta-learning capabilities <ref type="bibr" target="#b4">[5]</ref> with a MAB system powered by the AdaptEx SDK <ref type="bibr" target="#b15">[16]</ref>. This approach, which we call "Juggler-MAB", aims to address the limitations of the original Juggler system while leveraging the adaptive capabilities of contextual bandits. The Juggler-MAB system operates in two stages:</p><p>1. Juggler Stage: The meta-learning model predicts initial utility and compensation weights based on the search context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">MAB Stage:</head><p>A contextual MAB refines these weights in real-time based on user interactions and search features.</p><p>The Juggler framework selects from five pre-configured options for utility and compensation weights, providing a coarse adjustment of the recommendation strategy based on the search context. These options range from lower relevance and compensation to higher relevance and compensation, as described in <ref type="bibr" target="#b4">[5]</ref>, and aim to tackle the main issues in multi-objective optimization.</p><p>The MAB component introduces fine-grained adjustments to the Juggler-predicted weights. Each arm of the bandit represents a small corrective measure to be applied to the utility and compensation weights to improve relevance.</p><p>The key features of our MAB implementation include:</p><p>1. Contextual arms: The contextual bandits consider contextual features (e.g., device type, brand) when selecting arms. 2. Reward function: We use Normalized Discounted Cumulative Gain (NDCG) as a proxy for Conversion Rate, allowing for offline simulation and evaluation. 3. Exploration strategy: We employ epsilon-greedy and Thompson Sampling for their ability to balance exploration and exploitation effectively <ref type="bibr" target="#b7">[8]</ref>.</p><p>The integration of Juggler and MAB is achieved through an additive approach in the scoring function:</p><formula xml:id="formula_0">sortScore = (w_{utility}^{Juggler} + w_{utility}^{MAB}) ⋅ utilityScore + (w_{comp}^{Juggler} + w_{comp}^{MAB}) ⋅ compensationScore<label>(1)</label></formula><p>where w_{utility}^{Juggler} and w_{comp}^{Juggler} are the weights predicted by Juggler, and w_{utility}^{MAB} and w_{comp}^{MAB} are the corrective weights determined by the MAB.</p><p>We formulate our contextual MAB problem as follows: let 𝒜 be the set of arms, where each arm 𝑎 ∈ 𝒜 represents a pair of corrective weights (w_{utility}^{MAB}, w_{comp}^{MAB}). The context 𝑥_𝑡 ∈ 𝒳 at time 𝑡 includes features such as device or brand. The reward 𝑟_𝑡 is defined as the NDCG of the resulting ranking. The goal is to find a policy 𝜋 ∶ 𝒳 → 𝒜 that maximizes the expected cumulative reward:</p><formula xml:id="formula_1">max_𝜋 𝔼[∑_{t=1}^{T} 𝑟_𝑡(𝑥_𝑡, 𝜋(𝑥_𝑡))]<label>(2)</label></formula><p>where 𝑇 is the time horizon. We explored various methods to combine Juggler's predictions with MAB corrections, ultimately settling on the additive approach described above. We carefully selected contextual variables that could help identify under-performing segments in the Juggler model, such as device type and brand. Balancing multiple objectives in a single reward function required careful consideration. We chose NDCG as an initial approach due to its wide acceptance, with plans to explore more complex multi-objective reward functions in future work.</p><p>To evaluate our hybrid approach, we developed a custom simulator that allows us to test various configurations offline using historical data. The simulator, built on Expedia data, enables us to:</p><p>1. Replay historical searches and user interactions. Data is loaded on a daily basis, consisting of data for each property in each search and the respective user clicks and bookings. 2. Apply the Juggler-MAB model to generate new rankings. The MAB is sampled (potentially using contextual data) and the retrieved arm is included in the ranking formula, yielding the simulated score and the final ranking. 3. 
Evaluate the performance using both immediate (e.g., clicks) and delayed (e.g., bookings) feedback.</p><p>The reward function evaluates the simulated rankings; the sampled arm, the reward, and the contextual information (if any) are then provided to the MAB to update its internal state.</p><p>The simulation framework provides a safe environment to test and refine our approach before considering online deployment.</p></div>
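The simulation steps above can be sketched as follows. This is an illustrative toy, not the production simulator: the bandit is a simple 𝜖-greedy over the corrective-weight arm space from Section 4, the search and property field names are hypothetical, and a click-based top-k reward stands in for the NDCG reward used in the paper:

```python
import itertools
import random

# Corrective-weight arm space: 3 utility values x 3 compensation values.
ARMS = list(itertools.product([-0.3, 0.0, 0.3], [-0.2, 0.0, 0.2]))

class EpsilonGreedy:
    """Toy epsilon-greedy bandit over the corrective-weight arms."""
    def __init__(self, arms, epsilon=0.1):
        self.arms, self.epsilon = arms, epsilon
        self.counts = [0] * len(arms)
        self.values = [0.0] * len(arms)

    def select_arm(self):
        if random.random() < self.epsilon:       # explore
            return random.randrange(len(self.arms))
        return max(range(len(self.arms)),        # exploit
                   key=self.values.__getitem__)

    def update(self, idx, reward):
        self.counts[idx] += 1
        self.values[idx] += (reward - self.values[idx]) / self.counts[idx]

def sort_score(prop, w_u_juggler, w_c_juggler, w_u_mab, w_c_mab):
    # Additive combination of Juggler and MAB weights, as in equation (1).
    return ((w_u_juggler + w_u_mab) * prop["utility_score"]
            + (w_c_juggler + w_c_mab) * prop["compensation_score"])

def replay_search(bandit, search):
    """One simulator step: sample an arm, re-rank, score, update the MAB."""
    idx = bandit.select_arm()
    w_u_mab, w_c_mab = bandit.arms[idx]
    ranked = sorted(
        search["properties"],
        key=lambda p: sort_score(p, search["w_utility"], search["w_comp"],
                                 w_u_mab, w_c_mab),
        reverse=True,
    )
    # Stand-in reward: fraction of the top-3 results that were clicked
    # (the paper uses NDCG of the simulated ranking instead).
    top = ranked[:3]
    reward = sum(p["clicked"] for p in top) / len(top)
    bandit.update(idx, reward)
    return ranked, reward
```

Replaying one day of logged searches then amounts to calling `replay_search` once per search, letting the bandit's value estimates drift toward the corrective weights that yield the best rankings.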
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>We used a dataset of 0.6 million searches from Expedia's lodging booking platform, covering a period of 31 consecutive days. The data contains over 600,000 distinct properties across approximately 41,000 distinct destinations, with feedback sparsity over 96%. The dataset includes features such as device type, brand, destination, and historical user interactions.</p><p>We compared several variants of the proposed Juggler-MAB hybrid approach against the original Juggler model <ref type="bibr" target="#b4">[5]</ref>. We tested several MAB algorithms, ranging from classical (i.e., no contextual features) to contextual bandits:</p><p>• Gaussian Thompson (GT): a classical bandit using Thompson Sampling, assuming a Gaussian distribution of the reward values. • 𝜖-greedy: a classical bandit using a vanilla implementation of the canonical algorithm. We used 𝜖 = 0.1 and 𝜖 = 0.3. • Recursive Least Squares with Thompson Sampling (RLS): a contextual bandit using a linear model with a vector of means and a variance-covariance matrix.</p><p>The experiments use the actual production Juggler model predictions for each search. This grounds Juggler's predictions in real production behavior, which in turn leads to more robust estimates of the MAB's effect. We then implemented the MAB component using the AdaptEx SDK <ref type="bibr" target="#b15">[16]</ref>, with the following configuration:</p><p>• Arm space: we explore 3 different values for each corrective weight, namely w_{utility}^{MAB} ∈ {−0.3, 0.0, 0.3} and w_{comp}^{MAB} ∈ {−0.2, 0.0, 0.2}. The candidate values were determined via domain knowledge, while also ensuring the combined weights remain non-zero.</p><p>• Contextual features: several low-cardinality categorical search features were tested, with 3 being identified as the most important: brand, user device and geographical categorization of the search destination, e.g., neighborhood vs. city. 
• Exploration strategy: Thompson Sampling and 𝜖-greedy. • Reward: Normalized Discounted Cumulative Gain (NDCG), to determine how well MAB algorithms can correct towards relevance and expected conversion-rate improvement.</p></div>
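The NDCG reward above can be computed directly from the relevance labels of the simulated ranking. A minimal sketch with binary relevance and the standard log2 position discount follows (the paper does not specify its exact NDCG variant, so this is one common formulation):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the standard log2 position discount:
    a relevant item at position i contributes rel / log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """NDCG: DCG of the observed ranking divided by the DCG of the ideal
    ranking (relevant items first). Returns 0 when nothing is relevant."""
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking with all relevant items on top scores exactly 1.0, while pushing a relevant item down the list lowers the score, which is what makes NDCG a usable per-search reward signal for the bandit.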
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>Table <ref type="table" target="#tab_0">1</ref> summarizes the main results. For each bandit, we report the average reward, the average regret and the percentage of best arm selections across all searches, aggregated for all bandits and baselines. The best results per metric are highlighted in bold. Notice that regret is best when lowest, while the remaining metrics are better when maximized.</p><p>Our Juggler-MAB hybrid approach outperformed the Juggler baseline across all metrics for all proposed bandits. The NDCG improvements range from +0.8% for the GT bandit up to +2.9% for several RLS bandits. In terms of regret, we achieve a reduction of 13.7% and an improvement in best arm selection rate of 9.8%.</p><p>The 𝜖-greedy algorithms provide very strong baselines, especially when 𝜖 = 0.1. The GT bandit is clearly the worst, yet still useful since it outperforms the baseline. Among the contextual bandits, the best one across all metrics is RLS_{brand}. Interestingly, using more contextual features did not yield better performance. Further investigation is required to identify which features matter when defining the context.</p><p>We performed Wilcoxon signed-rank tests and observed no statistically significant difference among the RLS bandits. The Critical Difference <ref type="bibr" target="#b17">[18]</ref> diagram for the remaining bandits is shown in Figure <ref type="figure" target="#fig_0">1</ref>. The results show no statistically significant difference between RLS and 𝜖-greedy (0.1), hinting that the contextual features are not meaningful. However, all RLS bandits are better than the baselines. Note as well that all bandits are better than the Juggler baseline, a testament to the value of the proposed hybrid approach. 
Figure <ref type="figure" target="#fig_1">2</ref> shows the learning dynamics for all bandits across all days in the data sample. To improve interpretation, we include only the best contextual bandit. Juggler-MAB demonstrated fast adaptation to changing conditions, and we observed that the MAB component was able to make fine-grained adjustments to the Juggler predictions, resulting in improved performance.</p><p>We now inspect Juggler-MAB's effect on average top-10 lodging ranking statistics in Table <ref type="table" target="#tab_1">2</ref>. The results are reported as differences from the Juggler baseline, as we cannot expose the sensitive raw data.</p></div>
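The three aggregates reported in Table 1 can be computed from a per-search log of the obtained reward and the best achievable reward. A sketch follows, assuming regret on each search is the gap to the best arm's reward and "best arm selected" means that gap is zero (the paper does not give the exact formulas, so these definitions are an assumption):

```python
def aggregate_metrics(plays):
    """plays: list of (chosen_reward, best_reward) pairs, one per search.
    Returns (avg_reward, avg_regret, best_arm_rate)."""
    n = len(plays)
    avg_reward = sum(r for r, _ in plays) / n
    # Per-search regret: gap between the best achievable and obtained reward.
    avg_regret = sum(best - r for r, best in plays) / n
    # Best-arm rate: fraction of searches with zero gap (exact match here;
    # a tolerance could be used for noisy rewards).
    best_rate = sum(r == best for r, best in plays) / n
    return avg_reward, avg_regret, best_rate
```

Under these definitions, avg_reward + avg_regret equals the average best achievable reward, which makes the two columns of Table 1 easy to sanity-check against each other.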
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The results show a clear pattern for all bandits: the average daily price decreases while guest and star ratings increase as NDCG improves. On the contrary, margin % and margin $ decrease, which could pose problems for the marketplace's objectives and long-term health. The expectation, to be validated via an A/B test, is that the increase in relevance will lead to an improvement in conversion rate that can offset the impact on profit per transaction.</p><p>Diving deeper into the arm selections per bandit, we present Figure <ref type="figure" target="#fig_2">3</ref>. The results show a clear and expected preference towards arms with lower compensation weights, as compensation is not aligned with the NDCG reward. However, it is interesting to observe that the best bandit has learned not only that it is ideal to decrease compensation, but also to increase or decrease relevance depending on the context.</p><p>Despite the overall positive results, we identified two limitations. First, the reward function considers only a single dimension of the problem (i.e., relevance), which explains the impact on the compensation component. Future work will address this limitation by using multi-objective optimization techniques <ref type="bibr" target="#b12">[13]</ref>. Second, our current simulations use historical interactions with a deterministic logging policy, introducing bias. To address this, we will implement off-policy evaluation techniques <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>In this paper, we presented a novel hybrid approach combining Meta-Learning with Multi-Armed Bandits for multi-stakeholder recommendations in online travel marketplaces. Our Juggler-MAB system demonstrated significant improvements over existing methods. Key contributions of our work include 1) an integration of meta-learning and contextual bandits for recommendation systems and 2) empirical evidence of the effectiveness of our approach in a large-scale, real-world setting. Based on our findings and the limitations identified, we propose the following directions for future research:</p><p>1. Online testing: Conduct A/B tests in a production environment to validate the performance of Juggler-MAB under real-world conditions and user behaviors. 2. Dynamic arm space: Explore methods for dynamically adjusting the arm space of the MAB component based on observed performance and changing market conditions. 3. Fairness considerations: Incorporate explicit fairness constraints or objectives into the MAB formulation to ensure equitable treatment of different provider segments <ref type="bibr" target="#b20">[21]</ref>. 4. Long-term value optimization: Extend the approach to consider long-term user value, potentially using reinforcement learning techniques for sequential decision-making.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Critical Difference diagram shows superiority of bandits over multiple baselines, including simpler bandits.</figDesc><graphic coords="5,159.40,267.26,276.48,53.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Multiple metrics per bandit over time.</figDesc><graphic coords="6,77.32,65.61,440.64,245.81" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Arm pulls per bandit over time.</figDesc><graphic coords="6,177.60,349.86,240.08,153.97" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>Bandit</cell><cell cols="3">avg(reward) avg(regret) best arm %</cell></row><row><cell>Juggler</cell><cell>0.1776</cell><cell>0.0373</cell><cell>0.7515</cell></row><row><cell>GT</cell><cell>0.1791</cell><cell>0.0358</cell><cell>0.7866</cell></row><row><cell>𝜖-greedy (0.3)</cell><cell>0.1811</cell><cell>0.0339</cell><cell>0.8095</cell></row><row><cell>𝜖-greedy (0.1)</cell><cell>0.1824</cell><cell>0.0325</cell><cell>0.8218</cell></row><row><cell>𝑅𝐿𝑆 𝑏𝑟𝑎𝑛𝑑</cell><cell>0.1827</cell><cell>0.0322</cell><cell>0.8252</cell></row><row><cell>𝑅𝐿𝑆 𝑑𝑒𝑣𝑖𝑐𝑒</cell><cell>0.1822</cell><cell>0.0327</cell><cell>0.8200</cell></row><row><cell>𝑅𝐿𝑆 𝑔𝑒𝑜</cell><cell>0.1825</cell><cell>0.0325</cell><cell>0.8228</cell></row><row><cell>𝑅𝐿𝑆 𝑔𝑒𝑜, 𝑏𝑟𝑎𝑛𝑑</cell><cell>0.1827</cell><cell>0.0323</cell><cell>0.8246</cell></row><row><cell>𝑅𝐿𝑆 𝑑𝑒𝑣𝑖𝑐𝑒, 𝑏𝑟𝑎𝑛𝑑</cell><cell>0.1827</cell><cell>0.0322</cell><cell>0.8228</cell></row><row><cell>𝑅𝐿𝑆 𝑔𝑒𝑜, 𝑑𝑒𝑣𝑖𝑐𝑒</cell><cell>0.1827</cell><cell>0.0322</cell><cell>0.8247</cell></row><row><cell>𝑅𝐿𝑆 𝑔𝑒𝑜, 𝑑𝑒𝑣𝑖𝑐𝑒, 𝑏𝑟𝑎𝑛𝑑</cell><cell>0.1826</cell><cell>0.0323</cell><cell>0.8246</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Differences in several metrics in top-10 positions.</figDesc><table><row><cell></cell><cell></cell><cell>3) 𝜖-greedy (0.1)</cell><cell>RLS</cell></row><row><cell>daily price</cell><cell>-0.7278</cell><cell>-0.8324</cell><cell>-0.8595</cell></row><row><cell>guest rating</cell><cell>0.0416</cell><cell>0.0572</cell><cell>0.0604</cell></row><row><cell>star rating</cell><cell>0.0499</cell><cell>0.0747</cell><cell>0.0796</cell></row><row><cell>margin %</cell><cell>-0.0034</cell><cell>-0.0045</cell><cell>-0.0048</cell></row><row><cell>margin $</cell><cell>-0.6285</cell><cell>-0.8222</cell><cell>-0.8633</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Multistakeholder recommendation: Survey and research directions</title>
		<author>
			<persName><forename type="first">H</forename><surname>Abdollahpouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Adomavicius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Burke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Guy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jannach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kamishima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Krasnodebski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pizzato</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">User Modeling and User-Adapted Interaction</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="127" to="158" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Multi-stakeholder recommendation and its connection to multi-sided fairness</title>
		<author>
			<persName><forename type="first">H</forename><surname>Abdollahpouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Burke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conjunction with the 13th ACM Conference on Recommender Systems, RecSys&apos;19</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>Workshop on Recommendation in Multi-stakeholder Environments (RMSE&apos;19)</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Recommendations in a marketplace</title>
		<author>
			<persName><forename type="first">R</forename><surname>Mehrotra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Carterette</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 13th ACM Conference on Recommender Systems</title>
				<meeting>the 13th ACM Conference on Recommender Systems</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="580" to="581" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Recommendations with a purpose</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jannach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Adomavicius</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th ACM Conference on Recommender Systems</title>
				<meeting>the 10th ACM Conference on Recommender Systems</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="7" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Juggler: Multi-stakeholder ranking with meta-learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Cunha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Partalas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nguyen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the MORS workshop at the 15th ACM Conference on Recommender Systems, CEUR Workshop Proceedings</title>
				<meeting>the MORS workshop at the 15th ACM Conference on Recommender Systems, CEUR Workshop Proceedings</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Metalearning and recommender systems: A literature review and empirical study on the algorithm selection problem for collaborative filtering</title>
		<author>
			<persName><forename type="first">T</forename><surname>Cunha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>De Carvalho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">423</biblScope>
			<biblScope unit="page" from="128" to="144" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Cf4cf: Recommending collaborative filtering algorithms using collaborative filtering</title>
		<author>
			<persName><forename type="first">T</forename><surname>Cunha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>De Carvalho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">RecSys 2018 -Proceedings of the 12th ACM Conference on Recommender Systems</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="357" to="361" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Lattimore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Szepesvári</surname></persName>
		</author>
		<title level="m">Bandit algorithms</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A contextual-bandit approach to personalized news article recommendation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Langford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">E</forename><surname>Schapire</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th International Conference on World Wide Web</title>
				<meeting>the 19th International Conference on World Wide Web</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="661" to="670" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Factorization bandits for interactive recommendation</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">31</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Sequential model-based optimization for general algorithm configuration</title>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Hoos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Leyton-Brown</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on learning and intelligent optimization</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="507" to="523" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">BOHB: Robust and efficient hyperparameter optimization at scale</title>
		<author>
			<persName><forename type="first">S</forename><surname>Falkner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1437" to="1446" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Multiple objective optimization in recommender systems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Posse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth ACM Conference on Recommender Systems</title>
				<meeting>the Sixth ACM Conference on Recommender Systems</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="11" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dines</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Krasnodebski</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.00651</idno>
		<title level="m">A multi-objective learning to re-rank approach to optimize online marketplaces for multiple stakeholders</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Multistakeholder recommendation with provider constraints</title>
		<author>
			<persName><forename type="first">Ö</forename><surname>Sürer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Burke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">C</forename><surname>Malthouse</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th ACM Conference on Recommender Systems</title>
				<meeting>the 12th ACM Conference on Recommender Systems</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="54" to="62" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Adaptex: A self-service contextual bandit platform</title>
		<author>
			<persName><forename type="first">W</forename><surname>Black</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ilhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Marchini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Markeviciute</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th ACM Conference on Recommender Systems</title>
				<meeting>the 17th ACM Conference on Recommender Systems</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="839" to="842" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Ie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Navrekar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-T</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chandra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Boutilier</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.04847</idno>
		<title level="m">RecSim: A configurable simulation platform for recommender systems</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Statistical comparisons of classifiers over multiple data sets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Demšar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Doubly robust policy evaluation and learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dudík</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Langford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML&apos;11</title>
				<meeting><address><addrLine>Madison, WI, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Omnipress</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1097" to="1104" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">The self-normalized estimator for counterfactual learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Swaminathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Balanced neighborhoods for multi-sided fairness in recommendation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Burke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sonboli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ordoñez-Gauger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Fairness, Accountability and Transparency, PMLR</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="202" to="214" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
