Juggler: Multi-Stakeholder Ranking with Meta-Learning∗

TIAGO CUNHA, Expedia Group, Switzerland
IOANNIS PARTALAS, Expedia Group, Switzerland
PHONG NGUYEN, Expedia Group, Switzerland
Online marketplaces must optimize recommendations with regard to multiple objectives in order to fulfill the expectations of a variety of stakeholders. This problem is typically addressed using Pareto Theory, which explores multiple objectives in a domain and identifies the objective vectors that yield the best performance. However, such an approach is computationally expensive and commonly available only through domain-specific solutions, which is not ideal for online marketplaces and their ever-changing business dynamics. We tackle these limitations by proposing a Meta-Learning framework for the Multi-Stakeholder recommendation problem, which dynamically predicts the ideal settings for how business rules should be mingled into the final recommendations. The framework is designed to be generic enough to be leveraged in any item ranking domain and requires only the definition of a policy, i.e. a set of multi-objective metrics the meta-model should optimize for. The model finds the mapping between the search context and the corresponding best objective vectors. This way, the model can predict in real time the best solution for any unforeseen search, and therefore adapt the recommendations at the search level. We show that under this framework, the range of models one can build depends only on how many policies can be defined, thus offering a virtually unlimited way to address multi-objective problems. The experimental results showcase the generalization abilities of this framework and its highly predictive performance. Furthermore, the simulation results confirm its ability to approximate a policy's expectation in most cases and hint at the potential to use this framework in many other item recommendation problems.


1   INTRODUCTION
In recent years, a surge in electronic marketplaces has been observed as a consequence of the unprecedented growth of e-Commerce, which brings together sellers and buyers on a common platform. Examples of marketplaces are online travel agencies like Expedia Group, or Amazon in the retail domain. Typically in these scenarios, the platforms provide recommendation systems that help match users and suppliers. While in traditional recommendation systems one optimizes for customer utility, in marketplaces one needs to take multiple stakeholders into account. This problem, also known as the Multi-Stakeholder recommendation problem [2, 28, 44], is commonly addressed through a multi-objective optimization problem, where the collection of stakeholders' goals is optimized simultaneously. An extensive collection of works is available, with examples ranging from revenue [6, 27, 31] to fairness [4, 9, 17, 28].
    Despite progress in the domain, existing solutions mainly focus on a few, domain-dependent objectives, which makes it hard to transfer knowledge to other domains. Our framework addresses both limitations by providing a way to consider any and as many objectives as the practitioner desires. The practitioner only needs to define a meaningful policy to optimize for, meaning a collection of relevant stakeholder objectives. We show that depending on the policy, the model commonly yields predictions aligned with the policy defined, therefore approaching stakeholders' expectations. With such a flexible framework, it is possible to define as many and as different policies as the business demands, leaving the practitioner with the sole responsibility of exploring policies, rather than developing new custom-built algorithms.
    To accomplish such a generic framework, one needs to revisit how multi-objective problems are tackled. Typical
approaches try to find the Pareto front [18, 26, 41, 45], which refers to the set of non-dominated solutions in a domain,

∗ Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Presented at the MORS workshop held in conjunction with the 15th ACM Conference on Recommender Systems (RecSys), 2021, in Amsterdam,
Netherlands.


i.e. objective vectors for which maximizing one objective is only possible at the detriment of the others. Although effective with few objectives, this poses substantial problems when considering many: 1) as the number of objectives increases, almost all solutions become non-dominated and 2) the number of solutions required to approximate the Pareto front increases exponentially with the number of objectives [21]. Such constraints impose substantial restrictions when applied to real-time inference: 1) computational requirements make it extremely difficult to find tailored solutions for each problem instance, forcing reliance on sub-optimal global solutions, and 2) even if the best solutions are efficiently found, the practitioner is still required to make a manual decision on which solution from the Pareto front to use [41].
    To address all these issues, we argue that the ideal approach is to forego the expensive computation of the Pareto front and to focus on a set of diverse and meaningful solutions. A predictive model is then charged with deciding which is the most appropriate solution, depending on the context in which a search occurs. Such prediction is then used to dynamically adjust the multi-objective solution. To do so, we draw inspiration from the research area of Meta-Learning for algorithm selection [8, 39], where one learns about the learning process in order to predict the best algorithm for a new dataset [24, 36–38, 40]. Likewise, we leverage the Meta-Learning framework to predict the best objective vector amongst a set of diverse and meaningful policy-dependent solutions for any new search. Thus, in this work we make the following contributions: 1) we propose a general-purpose framework for Multi-Stakeholder recommendations leveraging Meta-Learning and 2) we validate the procedure in the hospitality domain using Brand Expedia data.
    This document is organized as follows: Section 2 presents a summary of the relevant literature; next, Section 3
introduces the proposed framework while Section 4 shows the experimental setup used to validate it. Section 5 presents
and discusses our findings, and Section 6 highlights our conclusions and future work.


2     RELATED WORK
2.1    Multi-Stakeholder Recommendations
We define a stakeholder in the recommendation space as any individual or group that can affect or be affected by the
delivery of recommendations to users [1]. In the context of online marketplaces, they typically refer to three groups:
consumers (customers/users which have a need to be met by the marketplace and which receive the recommendations
to fulfill it), providers (entities which provide goods/services to the customer through the marketplace) and the system
(the platform that matches consumers to providers).
    In Multi-Stakeholder recommendation, each stakeholder has a different expectation that the platform has to meet, or at least satisfy; this scenario is more realistic in the real world than the basic task of optimizing for customer utility common in academic research [22]. Since such a problem is not trivial, much research has been devoted to this topic [28, 44]. More recently, there has been a surge of proposed approaches in industrial settings, mainly due to e-Commerce growth: balancing semantic match and job-seeking intent features [35], learning how to balance hotel relevance and compensation [31] and balancing customer relevance and advertising revenue in music platforms [3, 4].
    The Multi-Stakeholder problem can be formulated through the multi-objective optimization setup, where all stakeholders' goals are considered as one single super-set of objectives to optimize for. Formally, the problem is defined as [18]: given 𝑚 > 1 objective functions 𝜃_1 : 𝑋 → R, . . . , 𝜃_𝑚 : 𝑋 → R, which map a decision space 𝑋 into R, a multi-objective problem is defined by:

                                             min 𝜃_1 (𝑥), . . . , min 𝜃_𝑚 (𝑥),   𝑥 ∈ 𝑋                                              (1)
   Traditionally, the multi-objective problem is tackled with Pareto Theory methods, which aim to find the Pareto front, i.e. the set of non-dominated vectors for any particular problem. Such a set describes all vectors for which it is only possible to perform better in one dimension if accompanied by a decrease in performance in another. Such a front is ideal to uncover the relationship between objectives and, consequently, the best solution. The quintessential notion in Pareto Theory is dominance, i.e. how to compare two instances to make sure one is better than the other. It is defined as [18]: given two vectors in the objective space 𝑦^(1) ∈ R^𝑚 and 𝑦^(2) ∈ R^𝑚, the point 𝑦^(1) Pareto dominates 𝑦^(2) if and only if ∀𝑖 ∈ {1, . . . , 𝑚} : 𝑦^(1)_𝑖 ≤ 𝑦^(2)_𝑖 and ∃𝑗 ∈ {1, . . . , 𝑚} : 𝑦^(1)_𝑗 < 𝑦^(2)_𝑗. More plainly, the first vector is not worse than the second in any objective and better in at least one [41].
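   As a concrete illustration of this definition, consider the following minimal Python sketch (ours, not from the paper), which checks dominance under minimization:

    # Hedged sketch of Pareto dominance for minimization objectives: y1
    # dominates y2 iff y1 is no worse in every objective and strictly
    # better in at least one.
    def dominates(y1, y2):
        no_worse = all(a <= b for a, b in zip(y1, y2))
        strictly_better = any(a < b for a, b in zip(y1, y2))
        return no_worse and strictly_better

    # Example: (1.0, 2.0) dominates (1.0, 3.0), but not (0.5, 3.0).
    assert dominates([1.0, 2.0], [1.0, 3.0])
    assert not dominates([1.0, 2.0], [0.5, 3.0])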
   The typical approach to solve the multi-objective problem consists of two steps: to simplify the objective function
and to apply an optimization algorithm to find the solution. To simplify the problem, practitioners use scalarization
functions, which are mathematical constructs used to convert multiple objective functions into a single one. This way,
the problem can be solved using standard optimization algorithms [18]. Then, to find the solution, practitioners usually
employ evolutionary algorithms, as these are capable of finding well-distributed solutions along the Pareto front [18].
Such algorithms take advantage of natural evolution concepts of selection, mutation and combination to explore the
objective space until a near-optimal solution is found. For more details, see [26, 41, 45].

2.2    Meta-learning
The No Free Lunch Theorem [43] refutes the existence of a so-called super-algorithm: a single best, universal learning
algorithm able to obtain the best possible performance for every instance of a given task [39]. The justification lies in
the observation that if all possible data distributions are equally likely, any pair of learning algorithms will, on average,
have the same performance. Therefore, for any algorithm, superior performance over one group of task instances is
compensated with inferior performance over another.
   Researchers have then decided to focus on understanding each algorithm's behavior and bias in order to ascertain when it will be most successful. Meta-Learning is one of the existing tools to tackle this problem: it focuses on using Machine Learning to understand Machine Learning algorithms (and their configurations) in order to improve their results in future applications [32, 36]. This is commonly addressed by learning meta-models which find the relationships between data characteristics and learning performance [36, 38]. This algorithm selection problem was first formalized by Rice [34], and it states that given:

      • the problem space 𝒫, representing the set of problem instances for which predictions will be made;
      • the feature space ℱ containing measurable characteristics for each instance of 𝒫;
      • the algorithm space 𝒜 as the set of all available algorithms for solving the problem;
      • the performance space 𝒴 that shows the mapping from algorithms to performance metrics,

   the problem can be stated as: for a given problem instance 𝑥 ∈ 𝒫, with features 𝑓(𝑥) ∈ ℱ, find the selection mapping 𝑀(𝑓(𝑥)) into the algorithm space 𝒜, such that the selected algorithm 𝛼 ∈ 𝒜 maximizes the performance mapping 𝑦(𝛼(𝑥)) ∈ 𝒴 [37].
   Using Meta-Learning and Recommender Systems simultaneously is not a new topic: there are works that focus on
the selection of the best recommendation algorithm for a new dataset [5, 12–14, 16], selection of the best algorithm for
each user [10, 11, 33] and even how to use Recommender Systems to tackle algorithm selection tasks [15, 30]. However,
to the best of our knowledge, this is the first documented solution to tackle the Multi-Stakeholder recommendation
problem using Meta-Learning.


3     JUGGLER
We focus on the problem of item ranking in online marketplaces, namely how to rank multiple items given the customer’s
query or based on the customer profile. In this setting, a ranking 𝜋 is defined by decreasingly sorting items 𝑖 ∈ 𝐼 by
their score 𝑠 (𝑖).
    In this work, we define our objective function as a sum of all stakeholders' objectives, building on previous work [31]. The score 𝑠 for any item 𝑖 in this context is given as:

                                                          𝑠(𝑖) = Σ_{𝑘=1}^{𝐾} 𝑎_𝑘(𝑖)                                                   (2)
    where 𝑎_𝑘 refers to each of 𝐾 adjustments, i.e. functions which score each item with regard to some specific objective(s). Usually these are mathematical formulas that enforce business rules, but they can be more complex: for instance, the predictions from other Machine Learning models. Thus, this formulation is flexible enough to allow any number and nature of adjustments, which is advantageous in ever-changing online platforms.
    This formulation offers a straightforward way to address the multi-objective optimization problem, namely through
a linear scalarization [18], i.e. by introducing an individual weight per adjustment. This is achieved by modifying
Equation 2 accordingly:

                                                      𝑤𝑠(𝑖) = Σ_{𝑘=1}^{𝐾} 𝑤_𝑘 × 𝑎_𝑘(𝑖)                                                (3)
    where each parameter 𝑤_𝑘 identifies a specific adjustment's weight. From such a definition, it is possible to define a ranking of items, simply by sorting the items for a given user in decreasing order of score. Thus, the ranking 𝜋 over a subset of items 𝑝 ⊆ 𝐼 referring to a search is the sequence of items given by 𝜋 = (𝑖_1, . . . , 𝑖_𝑛) with 𝑤𝑠(𝑖_1) ≥ · · · ≥ 𝑤𝑠(𝑖_𝑛), ∀𝑖 ∈ 𝑝. Now, inspired by Equation 1, we can state our objective function to optimize the weighted item scores in a way that all objectives are maximized:


                                                    max 𝜃_1 (𝜋), . . . , max 𝜃_𝑚 (𝜋)                                                (4)
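   To make the scoring and ranking of Equations 2-4 concrete, here is a minimal Python sketch of ours, under the assumption that adjustments are plain callables over item attributes:

    # Minimal sketch of Equations 2-3: items are dicts of attributes,
    # adjustments are callables a_k(item) -> float, weights scale each one.
    def weighted_score(item, adjustments, weights):
        """Equation 3: ws(i) = sum_k w_k * a_k(i)."""
        return sum(w * a(item) for w, a in zip(weights, adjustments))

    def rank(items, adjustments, weights):
        """Ranking pi: items sorted by decreasing weighted score."""
        return sorted(items,
                      key=lambda i: weighted_score(i, adjustments, weights),
                      reverse=True)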



3.1    Framework overview
The main hypothesis in this paper is that Multi-Stakeholder recommendations can be addressed through Meta-Learning.
We take inspiration from algorithm selection task and adapt it to our problem, by considering problem instances 𝑥
as individual searches 𝑝 and replacing algorithm 𝛼 by objective vector 𝑤𝑘 . This parallelism will become evident in
Section 3.3, when we explain how we convert objective vectors into a discrete set of meta-labels.
    This way, the Meta-learning problem addressed is: for a given search 𝑝 ∈ 𝒫, with features 𝒻(𝑠) ∈ ℱ, the meta-model
aims to find the selection mapping 𝑀 (𝑓 (𝑠)) → 𝒜, such that the selected objective vector 𝑤𝑘 ∈ 𝒜 maximizes the
performance mapping 𝑦 (𝑤𝑘 (𝑠)) ∈ 𝒴. Figure 1 presents an overview of the training and inference workflows in the
proposed Juggler framework, in order to illustrate how it can be leveraged in online marketplaces.
    In training, we leverage historical searches to build the meta-examples, i.e. a set of points in the meta-feature space,
with an associated label. Each point refers to a search and it is defined by the meta-feature extraction process; the
labels are constructed via simulations, by assessing which is the ideal multi-objective vector calibration for each search.

[Figure 1 overview: in training, historical searches feed both a simulation step (producing meta-examples) and a meta-feature extraction step, and model fitting over these yields the meta-model; in inference, a new search passes through the same meta-feature extraction, the meta-model predicts the weights to apply, and the items are re-ranked.]

                                                  Fig. 1. Juggler framework.




With such meta-examples, the meta-model is fitted using a standard Machine Learning procedure. When placed into production, it is ready to perform real-time inference.
   The inference step then simply involves submitting a new search to the same meta-feature extraction process used in training, in order to place the new search in the meta-feature space. The meta-model then predicts the best multi-objective vector for this particular search, based on patterns found in the simulations employed in training. The prediction is used to re-rank the items, following the predefined scoring system.
   In the remainder of this section, we discuss how we perform simulations (by introducing the concept of policy and how policies are used to define meta-labels) and how we characterize searches through meta-features.
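   As an illustration only, a hedged sketch of this inference path, reusing the rank helper sketched above and assuming a hypothetical extract_meta_features function plus the per-policy weight table produced in training:

    # Hedged sketch of the inference flow in Figure 1; meta_model, solution
    # (one weight vector per section, cf. Table 2) and extract_meta_features
    # are assumed to come from training. All names are illustrative.
    def rerank(search_items, meta_model, solution, adjustments):
        features = extract_meta_features(search_items)   # same extraction as training
        section = meta_model.predict([features])[0]      # e.g. "low/high"
        weights = solution[section]                      # predicted objective vector
        return rank(search_items, adjustments, weights)  # Eq. 3 scoring + sort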


3.2   Meta-performance: defining policies
We now describe the set of metrics 𝒴 that we want to optimize with Meta-Learning. As established in the Multi-Stakeholder literature, there are many objectives one can optimize for at any given time: the question is then which matter most in a specific use case. Common solutions focus on a few objectives, highly dependent on the problem at hand. We address this limitation by defining the concept of policy.
   A policy Γ refers to a set of multi-objective metrics. Such metrics are already naturally defined in each multi-objective function 𝜃_𝑚 in Equation 4, i.e. Γ = {𝜃_1, . . . , 𝜃_𝑚}. In our setup there is no constraint on which metrics nor how many metrics one chooses to optimize for, as long as a valid set of metrics is provided. Following theoretical considerations on ranking metrics [42], we consider a metric to be valid if it possesses the distinguishability property, i.e. it can distinguish two different rankings. We require only that its output be a continuous scalar value, with higher values meaning better.


   Many widely-accepted ranking metrics follow such constraints: for instance, Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision and Mean Reciprocal Rank [20]. These, however, are not able to capture all the dynamics of the multi-objective recommendation problem, for instance in terms of fairness or how well aligned the outcome is with a specific business objective. To address this gap, we propose 2 different metric classes for rankings, to be used in combination or separately in our framework: Ranking Correlations and Ranking Fairness. The goal is to provide a skeleton that enables optimizing for objectives for which there are no known metrics yet.


   Ranking Correlations. This metric class offers a way to understand how well a specific ranking performs on a given attribute, when compared against the ideal setting. For instance, how correlated a given search's ranking is with the increasing price-aware sorting of the same items tells us how close we are to a ranking with cheaper properties in the top positions. Similarly, such an operation can be used for any item attribute from which a ranking of items can be derived. Since we operate in a ranking setup, we take advantage of Kendall's 𝜏 ranking correlation to measure correlations with the ideal scenario. Formally, the correlation between a ranking 𝜋 and the ideal ranking 𝜋*, given by a ranking function over a particular attribute, is calculated by first assessing how many concordant 𝑛_𝑐 and discordant 𝑛_𝑑 pairs there are when comparing pairwise items in both rankings. Kendall's 𝜏 is thus given by:

                                                      𝜏 = (𝑛_𝑐 − 𝑛_𝑑) / (𝑛(𝑛 − 1)/2)                                                (5)
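   For reference, a minimal sketch of this computation (in practice one could also use scipy.stats.kendalltau; the helper below is illustrative):

    # Hedged sketch of Kendall's tau (Equation 5) between two rankings of
    # the same items, each given as a mapping item -> position.
    from itertools import combinations

    def kendall_tau(rank_a, rank_b):
        items = list(rank_a)
        nc = nd = 0
        for i, j in combinations(items, 2):
            concord = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if concord > 0:
                nc += 1      # pair ordered the same way in both rankings
            elif concord < 0:
                nd += 1      # pair ordered differently
        n = len(items)
        return (nc - nd) / (n * (n - 1) / 2)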

   Ranking Fairness. One type of objective that has recently received substantial attention in the literature is fairness and diversity [2, 9, 17, 19]. Such approaches try to ensure the distribution of predicted items per class is respected regardless of the context in which the prediction occurs. For instance, we could optimize for a ranking to show all existing property types in the destination region, while trying not to hurt the relevance of predictions. To that effect, we build on a Group Fairness metric [29], originally designed to ensure fairness of item popularity, and generalize it to ensure fairness over any item attribute. Consider an attribute which can be divided into 𝐾 classes, with each group represented by 𝐼_𝑗, 𝑗 ∈ {1, . . . , 𝐾}. One can then calculate the Group Fairness metric by considering the frequency of items in each class:

                                                 𝐺𝐹(𝑠) = Σ_{𝑗=1}^{𝐾} √|𝐼_𝑗|                                              (6)
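   A minimal sketch of this generalized metric over the top positions of a ranking (the attribute name and top-k cutoff are illustrative choices of ours):

    # Hedged sketch of Group Fairness (Equation 6): sum over attribute
    # classes of the square root of each class frequency in the top-k items.
    import math
    from collections import Counter

    def group_fairness(ranked_items, attribute, top_k=30):
        counts = Counter(item[attribute] for item in ranked_items[:top_k])
        return sum(math.sqrt(freq) for freq in counts.values())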
   With such metric classes, one can define a variety of metrics, which in turn means it is trivial to define multiple policies. One key advantage of this formulation is that multiple meta-models can be inferred simply by switching the policy. This is particularly noteworthy since it means any policy can be used, even though it might at times lead to sub-optimal performance: it is up to the practitioner to decide which are the meaningful objectives to optimize for and to confirm through simulations and A/B testing whether the expected outcome is met. Alternatively, such a policy could be defined by product or analytics teams, based on domain knowledge and business dynamics.


3.3   Meta-labels: choosing weights
Finding the objective vectors 𝑤_𝑘 to be used as meta-labels is no trivial task and depends heavily on the policy Γ: it is unlikely that two policies should share exactly the same objective vectors. To that end, we define a procedure which aims to explore a constrained objective space in order to select a few good candidates. These candidates constitute all available options the model has to dynamically weight the adjustments at inference time, thus reducing the problem from finding ideal weights to predicting the best candidate. The procedure to find the meta-labels follows these steps:
   (1) To constrain each objective into a range: each adjustment 𝑎𝑘 must be constrained within a lower and higher
       boundary, i.e. [min(𝑤𝑘 ), max(𝑤𝑘 )]. The boundaries should be defined in such a way that the adjustment is
       allowed to only vary so much from the original value (i.e. 𝑤𝑘 = 1) - otherwise, it may lead to unforeseen negative
       effects on the overall ranking. Such range should be tuned through experimentation and/or domain knowledge.
   (2) To partition the objective space into sections: instead of exploring all possible combinations in the feasible
       region, we focus on the most interesting sections: the extreme ends of the bounded objective space (to explore
       the objective space in search of meaningful changes in ranking) or close to the default values (to find the best
       fit for the current default setup). Thus, each section is defined as a polygon covering 1/3 of each dimension, i.e.
       (max(𝑤_𝑘) − min(𝑤_𝑘))/3. The outer polygons start from the extreme ends of the objective range, while the balanced
       section is centered on the origin point. An example of the partition of an objective space for 2 objectives
       is shown in Figure 2. Notice this procedure generates 5 meta-labels when using 2 objectives, which is a good
       trade-off for the classification algorithm we aim to employ.

Fig. 2. Meta-label sections in an objective space for 2 objective functions. Notice it identifies 5 sections (shaded in blue), which will translate into the same number of meta-labels.


   (3) To discretise the space into representative candidates: we sample some examples from each section to serve
       as candidates since, once again, there is not necessarily any change in ranking between 2 contiguous objective
       vectors. To do so, we employ a grid-search approach, similarly to another procedure in the literature [35].
   (4) To create solutions using candidates: a solution 𝑐 refers to a set of objective vectors, containing exactly one
       candidate per section, i.e. 𝑐 = {𝑤_𝑘^1, . . . , 𝑤_𝑘^5} with 𝑤_𝑘^𝑁 referring to one selected objective vector in each of the 5
       available sections. The set of all solutions 𝐶 refers to all combinations of candidates across sections.
   (5) To evaluate solutions: all solutions are evaluated on policy Γ over all searches. First, each objective vector in each
       solution is submitted to Equation 3 in order to score and, afterwards, rank each item within a search. Then, each
       candidate is evaluated individually on each objective 𝜃_𝑚 with regard to each search, which enables calculating
       its average ranking across all objectives.
   (6) To identify the best solution per policy: the best solution is simply the one which achieves the lowest average
       rank across all objectives. In the presence of ties, the candidate closer to the default weights is selected, since
       there is no gain in deviating from the default behaviour. Algorithm 1 summarizes the evaluation procedure
       explained in the previous two points.


Algorithm 1 Evaluate and select meta-labels
Require: set of searches 𝒫, set of solutions 𝐶, policy Γ
 1: for 𝑐 ∈ 𝐶 do
 2:     for 𝑝 ∈ 𝒫 do
 3:         for 𝑤_𝑘^𝑙 ∈ 𝑐 do
 4:             for 𝑖 ∈ 𝑝 do
 5:                 𝑤𝑠(𝑖) ← Σ_{𝑘=1}^{𝐾} 𝑤_𝑘^𝑙 × 𝑎_𝑘(𝑖)                                   ⊲ Score items
 6:             𝜋 ← 𝑟𝑎𝑛𝑘(𝑤𝑠(𝑖))                                                        ⊲ Rank items
 7:             for 𝜃_𝑚 ∈ Γ do
 8:                 𝒴_{𝜃_𝑚}(𝑝, 𝑤_𝑘^𝑙) ← 𝜃_𝑚(𝜋)                                          ⊲ Evaluate candidate
 9:         𝒴(𝑐, 𝑝) ← (Σ_{𝑚=1}^{|Γ|} 𝒴_{𝜃_𝑚}(𝑝, 𝑤_𝑘^𝑙)) / |Γ|                           ⊲ Score solution in search
10: 𝒴(𝐶) ← (Σ_{𝑝=1}^{|𝒫|} 𝒴(𝑐, 𝑝)) / |𝒫|                                              ⊲ Average over all searches
11: return arg min 𝑟𝑎𝑛𝑘(𝒴(𝐶))                                                          ⊲ Select best solution
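   A hedged Python sketch of Algorithm 1, reusing the rank helper from earlier in this section (a policy is taken as a list of metric callables over a ranking; all names are illustrative):

    # Hedged sketch of Algorithm 1: evaluate each solution (one candidate
    # weight vector per section) across searches and the policy's metrics,
    # then select the best-scoring solution.
    import numpy as np

    def select_best_solution(searches, solutions, policy, adjustments):
        solution_scores = []
        for candidates in solutions:
            search_scores = []
            for items in searches:
                label_scores = []
                for weights in candidates:
                    pi = rank(items, adjustments, weights)  # Eq. 3 scoring + sort
                    label_scores.append(np.mean([theta(pi) for theta in policy]))
                search_scores.append(np.mean(label_scores))
            solution_scores.append(np.mean(search_scores))  # average over searches
        # Metrics are higher-is-better, so the rank-1 (best) solution is the
        # one with the highest average score.
        return solutions[int(np.argmax(solution_scores))]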




3.4   Meta-features: characterizing context
One must now define the feature space ℱ which will allow us to describe the context in which each search occurs. The Meta-Learning literature presents several approaches to describe a problem, but as far as we are aware, there are no standard solutions to describe rankings. We propose two meta-feature classes, similarly to what was done for metrics in Section 3.2, which can be used on any ranking: Ranking histograms and Ranking comparisons.

   Ranking histograms. Taking advantage of histograms in Meta-Learning is not new [23]; they are appreciated since they are capable of summarizing a given feature 𝑓 using a single summary function, i.e. the ratio of elements per bucket. This requires, however, a significant amount of data in order to extract meaningful patterns - to that end, we extract data from a specific ranking threshold 𝜓, large enough to be meaningful. Thus,


                                       𝒻𝑅𝐻 = Histogram({𝑓 (𝑖)}), ∀𝑓 ∈ 𝐹, ∀𝑖 ∈ 𝑠 : 𝑟𝑎𝑛𝑘 (𝑖) < 𝜓                                  (7)

   Ranking comparisons. Here, we shift the attention from item features to item scores: we achieve this by comparing the same item set in a search under different scoring mechanisms, i.e. using subsets of the adjustments from Equations 2 and 3. Thus, considering two rankings 𝜋_1 and 𝜋_2 of the same items, constructed with different scoring functions, we employ multiple ranking comparison functions to extract summary statistics of their similarity. Examples include Kendall's 𝜏 (see Equation 5), Cosine similarity and Pearson correlation. Formally, for a set of ranking correlation functions Ω, we have:


                                                     𝒻𝑅𝐶 = {𝜔 (𝜋1, 𝜋2 )}, ∀𝜔 ∈ Ω                                                (8)
   The full set of meta-features is then ℱ = 𝒻_𝑅𝐻 ∪ 𝒻_𝑅𝐶, which may translate into hundreds or even thousands of meta-features, depending on how many different parameters are used to instantiate the meta-feature classes. To tackle this issue, the procedure assumes the application of a feature selection process, which shall find the best set of meta-features per policy. We make no assumptions on the feature selection procedure, but focus particularly on correlation feature selection.
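   A hedged sketch of both meta-feature classes (the 𝜓 = 30 threshold and the comparison functions follow the experimental setup in Section 4.1; the helpers themselves are illustrative):

    # Hedged sketch of Equations 7-8: histogram ratios of item features over
    # the top-psi positions, and similarity statistics between two scorings.
    import numpy as np
    from scipy.stats import kendalltau, pearsonr

    def ranking_histograms(ranked_items, feature_names, psi=30, bins=5):
        """f_RH: per-bucket ratios of each item feature in the top-psi items."""
        top = ranked_items[:psi]
        meta = {}
        for f in feature_names:
            values = np.array([item[f] for item in top], dtype=float)
            hist, _ = np.histogram(values, bins=bins)
            meta.update({f"{f}_bin{b}": count / len(top)
                         for b, count in enumerate(hist)})
        return meta

    def ranking_comparisons(scores_1, scores_2):
        """f_RC: similarity between two scorings of the same items."""
        cosine = np.dot(scores_1, scores_2) / (np.linalg.norm(scores_1)
                                               * np.linalg.norm(scores_2))
        return {"kendall": kendalltau(scores_1, scores_2)[0],
                "pearson": pearsonr(scores_1, scores_2)[0],
                "cosine": cosine}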
4     EXPERIMENTAL STUDY
We validate Juggler in the context of top-N property recommendations on Brand Expedia. We use both private data and proprietary recommendation algorithms, which have been designed and are continuously tuned to address Multi-Stakeholder recommendations. In this context, each hotel ℎ is scored using the following rule:

                                       𝑠(ℎ) = utility(ℎ) + compensation(ℎ) + . . .                                      (9)

    where utility refers to the relevance each item holds to meet the customer's needs and compensation refers to an adjustment responsible for penalizing lower-margin properties [31]. Notice the formula is incomplete, meaning there are many hidden adjustments in place to account for many other marketplace objectives. We account for all adjustments in the results presented in this document, which makes the problem both more realistic and more difficult to solve. Notice, for instance, that otherwise we could not use the same weight on both adjustments, as it would always yield the same ranking outcome, rendering some combinations worthless.

4.1    Data and meta-features
The data used for meta-model fitting refers to a sample of 6 million searches from 2019. The data is split into training/tuning/validation subsets by randomly sampling searches according to a 70/10/20 split, respectively.
    Each search is a collection of properties, and it has both search-specific attributes and property-specific attributes.
The search is also accompanied by the utility scores per property and has all features required to compute the remaining
adjustments. The complete list of attributes can be organized as follows:

      • Search features: id, destination (identifier, country and type), check-in/out derived signals (day of week, week of
        year, month), booking window and length of stay. These require no pre-processing, thus becoming meta-features
        by default.
      • Property features: price, margin, star rating, guest rating, distance to center, production statistics, etc. We use
        these attributes as features 𝑓 in 𝒻𝑅𝐻 meta-features, with 𝜓 = 30. No other thresholds have been used for
        simplicity.
      • Property scores: utility, compensation and all their variations are calculated for all properties. Then, the following
        𝜔 functions are used to compare rankings: Kendall's 𝜏, Pearson's correlation and Cosine similarity.

    Afterwards, models are simulated using a more recent data sample of 100,000 searches, covering the period from January 2020 until March 2021. The features are exactly the same as before.

4.2    Policies
Following the convention from Equation 3, we study the Multi-Stakeholder problem through the following rule:


                                𝑤𝑠 (ℎ) = 𝑤𝑢 × utility(ℎ) + 𝑤𝑐 × compensation(ℎ) + . . .                                 (10)
    with variables 𝑤_𝑢 and 𝑤_𝑐 responsible for weighting the utility and compensation scores, respectively.
    We take advantage of NDCG as a measure of customer utility, given it is a standard choice in Recommender Systems.
Furthermore, we explore Ranking Correlations and Ranking Fairness through Kendall’s 𝜏 and Group Fairness metrics,
presented in Equations 5 and 6. These will be used to further explore the rankings produced during simulations, in an
attempt to understand how well we perform on specific business objectives.


   In Ranking Correlations, we focus on specific attributes to create the ideal rankings: lowest price, highest margin, lowest distance, highest guest rating and highest number of reviews. In order to ensure all objectives are optimized, the dimensions which aim for lower values - price and distance - are ranked in increasing order. For simplicity, we refer to these as 𝜏_𝑝𝑟𝑖𝑐𝑒, 𝜏_𝑚𝑎𝑟𝑔𝑖𝑛, 𝜏_𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒, 𝜏_𝑟𝑎𝑡𝑖𝑛𝑔, 𝜏_𝑟𝑒𝑣𝑖𝑒𝑤𝑠, respectively.
   Group Fairness focuses on specific marketplace components we wish to improve by offering a more balanced solution per category, namely: markets within destination, property types, property recency in the marketplace (in years), property production statistics and branded versus non-branded properties. These will be denoted as 𝐺𝐹_𝑚𝑎𝑟𝑘𝑒𝑡, 𝐺𝐹_𝑡𝑦𝑝𝑒, 𝐺𝐹_𝑟𝑒𝑐𝑒𝑛𝑐𝑦, 𝐺𝐹_𝑝𝑟𝑜𝑑, 𝐺𝐹_𝑏𝑟𝑎𝑛𝑑 henceforth.
   The objectives can thus be assigned to the following stakeholders:
     • Customers: associated with conversion and property quality metrics such as NDCG, 𝜏_𝑝𝑟𝑖𝑐𝑒, 𝜏_𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒, 𝜏_𝑟𝑎𝑡𝑖𝑛𝑔, 𝜏_𝑟𝑒𝑣𝑖𝑒𝑤𝑠.
     • Providers: which aim mostly for a fair exposure rate, regardless of their characteristics, are represented by all Group Fairness metrics, i.e. 𝐺𝐹_𝑚𝑎𝑟𝑘𝑒𝑡, 𝐺𝐹_𝑡𝑦𝑝𝑒, 𝐺𝐹_𝑟𝑒𝑐𝑒𝑛𝑐𝑦, 𝐺𝐹_𝑝𝑟𝑜𝑑 and 𝐺𝐹_𝑏𝑟𝑎𝑛𝑑.
     • System: which focuses mainly on conversion and revenue - namely through NDCG and 𝜏_𝑚𝑎𝑟𝑔𝑖𝑛 - although its interest also extends to Group Fairness metrics to maintain marketplace health.
   To design policies that address multi-objective problems in a meaningful way, it is important to focus on competing objectives, ideally covering all stakeholders' objectives. One way to ascertain such competition is to ensure low correlation between objectives. The correlation matrix for all objectives proposed is shown in Figure 3. With the exception of the Group Fairness objectives, all other candidates have low correlation amongst themselves.

                                                       Fig. 3. Correlation matrix for all objectives.


   Under this context, we explore the following policies:
     • Policy I: focuses on Customer and System objectives only: NDCG, 𝜏_𝑝𝑟𝑖𝑐𝑒, 𝜏_𝑚𝑎𝑟𝑔𝑖𝑛, 𝜏_𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒, 𝜏_𝑟𝑒𝑣𝑖𝑒𝑤𝑠, 𝜏_𝑟𝑎𝑡𝑖𝑛𝑔. The rationale for this policy is to understand what happens if we disregard Providers altogether.
     • Policy II: aims to find the best ranking in terms of customer relevance and marketplace fairness, thus tending to customer and partner interests, i.e. NDCG, 𝐺𝐹_𝑚𝑎𝑟𝑘𝑒𝑡, 𝐺𝐹_𝑡𝑦𝑝𝑒, 𝐺𝐹_𝑟𝑒𝑐𝑒𝑛𝑐𝑦, 𝐺𝐹_𝑝𝑟𝑜𝑑, 𝐺𝐹_𝑏𝑟𝑎𝑛𝑑. In this case, we forego the System's own interests, attempting to straighten the relationship between Consumers and Providers.
     • Policy III: focuses simply on the metrics which are directly optimized by the utility and compensation adjustments, i.e. NDCG, 𝜏_𝑚𝑎𝑟𝑔𝑖𝑛. The goal is to understand whether optimizing for a few metrics directly correlated with the adjustments is an easier task than optimizing for a larger set of objectives with diverse goals.
     • Policy IV: focuses mostly on customer relevance, with no direct metric aiming for the compensation adjustment, i.e. NDCG, 𝜏_𝑝𝑟𝑖𝑐𝑒, 𝜏_𝑟𝑎𝑡𝑖𝑛𝑔. Here we attempt to observe how the process behaves when we disregard both System and Provider objectives.
     • Policy V: aims to optimize only partners' fairness objectives, i.e. 𝐺𝐹_𝑚𝑎𝑟𝑘𝑒𝑡, 𝐺𝐹_𝑡𝑦𝑝𝑒, 𝐺𝐹_𝑟𝑒𝑐𝑒𝑛𝑐𝑦, 𝐺𝐹_𝑝𝑟𝑜𝑑, 𝐺𝐹_𝑏𝑟𝑎𝑛𝑑. Here, we disregard any objective regarding Customer and System, thus trying to understand how much fairness can be improved using the proposed approach and what its effects are on all stakeholders.
     • Policy VI: tries to find the best overall result for all objectives, thus testing the outcome of using as many objectives as desired, i.e. NDCG, 𝜏_𝑝𝑟𝑖𝑐𝑒, 𝜏_𝑚𝑎𝑟𝑔𝑖𝑛, 𝜏_𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒, 𝜏_𝑟𝑎𝑡𝑖𝑛𝑔, 𝜏_𝑟𝑒𝑣𝑖𝑒𝑤𝑠, 𝐺𝐹_𝑚𝑎𝑟𝑘𝑒𝑡, 𝐺𝐹_𝑡𝑦𝑝𝑒, 𝐺𝐹_𝑟𝑒𝑐𝑒𝑛𝑐𝑦, 𝐺𝐹_𝑏𝑟𝑎𝑛𝑑, 𝐺𝐹_𝑝𝑟𝑜𝑑.
     • Policy VII: aims to find a good balance between conversion and fairness, while making sure the results are price sensitive, i.e. NDCG, 𝜏_𝑝𝑟𝑖𝑐𝑒, 𝐺𝐹_𝑡𝑦𝑝𝑒, 𝐺𝐹_𝑟𝑒𝑐𝑒𝑛𝑐𝑦, 𝐺𝐹_𝑏𝑟𝑎𝑛𝑑. The goal is to understand how the framework behaves without any direct optimization for System objectives. An illustrative encoding of these policies follows below.
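   Since a policy is just a set of metrics, switching meta-models amounts to switching a configuration entry; a hedged illustration (metric names keying into a registry of callables is our own convention, not the paper's):

    # Illustrative encoding of the seven policies as lists of metric names;
    # a registry would map each name to a metric callable (e.g. the sketches
    # in Section 3.2).
    POLICIES = {
        "I":   ["NDCG", "tau_price", "tau_margin", "tau_distance",
                "tau_reviews", "tau_rating"],
        "II":  ["NDCG", "GF_market", "GF_type", "GF_recency", "GF_prod", "GF_brand"],
        "III": ["NDCG", "tau_margin"],
        "IV":  ["NDCG", "tau_price", "tau_rating"],
        "V":   ["GF_market", "GF_type", "GF_recency", "GF_prod", "GF_brand"],
        "VI":  ["NDCG", "tau_price", "tau_margin", "tau_distance", "tau_rating",
                "tau_reviews", "GF_market", "GF_type", "GF_recency",
                "GF_brand", "GF_prod"],
        "VII": ["NDCG", "tau_price", "GF_type", "GF_recency", "GF_brand"],
    }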

4.3    Meta-labels
Following the procedure from Section 3.3, we have set the following ranges: 𝑤_𝑢 ∈ [0.8, 1.7] and 𝑤_𝑐 ∈ [0.3, 1.5]. This decision has been made purely from domain knowledge and our wish not to deviate considerably from the default settings. Furthermore, considering there are 2 adjustments, the sections defined follow the example presented in Figure 2. Table 1 presents the respective weight ranges per section, with increments of 0.3 and 0.4 for 𝑤_𝑢 and 𝑤_𝑐, respectively.

                                       low/low      low/high      med/med       high/low      high/high
                                𝑤𝑢     [0.8, 1.1]   [0.8, 1.1]    [1.1, 1.4]    [1.4, 1.7]     [1.4, 1.7]
                                𝑤𝑐     [0.3, 0.7]   [1.1, 1.5]    [0.7, 1.1]    [0.3, 0.7]     [1.1, 1.5]
                         Table 1. Utility and compensation weight ranges per section in the objective space.



   Each section is explored with a grid size of 0.1 × 0.1, leading to 12 candidate points per section. The best solution per
policy is presented in Table 2, using the notation 𝑤𝑢 /𝑤𝑐 .

                                Pol.   low/low      low/high      med/med       high/low      high/high
                                   I    0.8/0.3       0.8/1.5      1.0/0.8       1.4/0.3        1.4/1.5
                                  II    0.8/0.3       0.8/1.5      1.0/0.8       1.2/0.3        1.2/1.5
                                 III    0.8/0.6       0.8/1.5      1.0/1.0       1.4/0.6        1.4/1.5
                                 IV     0.8/0.3       0.8/1.4      1.0/0.7       1.4/0.3        1.4/1.4
                                 V      0.8/0.3       0.8/1.5      1.0/0.7       1.2/0.3        1.2/1.1
                                 VI     0.8/0.4       0.8/1.5      1.0/0.8       1.4/0.4        1.4/1.5
                                VII     0.8/0.3       0.8/1.1      1.0/0.7       1.4/0.3        1.4/1.3
Table 2. Best solution per policy, with all candidate meta-labels the model can choose from at inference time. These are selected from
the ranges presented in Table 1.
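   For illustration, a hedged sketch of the grid discretisation that produces such candidates (step 3 of Section 3.3), using the Table 1 ranges; the half-open endpoint handling is our assumption, chosen to match the reported 12 points per section:

    # Hedged sketch of the candidate grid per section: a 0.1 x 0.1 grid over
    # the Table 1 (w_u, w_c) ranges, half-open so that 3 x 4 = 12 points result.
    import itertools
    import numpy as np

    SECTIONS = {  # ((w_u range), (w_c range)) per section, from Table 1
        "low/low":   ((0.8, 1.1), (0.3, 0.7)),
        "low/high":  ((0.8, 1.1), (1.1, 1.5)),
        "med/med":   ((1.1, 1.4), (0.7, 1.1)),
        "high/low":  ((1.4, 1.7), (0.3, 0.7)),
        "high/high": ((1.4, 1.7), (1.1, 1.5)),
    }

    def candidates(section, step=0.1):
        (u_lo, u_hi), (c_lo, c_hi) = SECTIONS[section]
        n_u = round((u_hi - u_lo) / step)   # 3 w_u values per section
        n_c = round((c_hi - c_lo) / step)   # 4 w_c values per section
        u_vals = np.round(u_lo + step * np.arange(n_u), 1)
        c_vals = np.round(c_lo + step * np.arange(n_c), 1)
        return list(itertools.product(u_vals, c_vals))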


                                                                    11
                                                                                    Tiago Cunha, Ioannis Partalas, and Phong Nguyen


    The results show there are indeed different candidate preferences depending on the policy, which validates the procedure employed. However, a couple of interesting results must be highlighted: 1) 𝑤_𝑢 = 0.8 is always used for the low/low and low/high sections and 2) the medium/medium section always assigns 𝑤_𝑢 = 1, but with different 𝑤_𝑐. This shows two things: the lowest end of the utility range could be even lower, to allow further discrimination amongst policies (we refrain from doing so in order to ensure a minimum customer relevance score), and the default setting 𝑤_𝑢 = 𝑤_𝑐 = 1 is not always the ideal scenario across policies - it seems to work for utility but not for compensation.


4.4    Meta-model
The meta-models are fitted using LightGBM [25] and tuned using hyperopt [7]. Notice only this algorithm's performance is reported, since we wish to evaluate the framework's overall behaviour rather than find the best-tuned model. Thus, we have chosen a model which has provided good performance overall in offline testing, although we do not exclude the hypothesis that another model could perform better, depending on the policy employed. Furthermore, each meta-model is compared only against a majority-voting baseline, which always predicts the most frequent label. This is the only baseline we use because we could not find another direct baseline for this problem. In the simulations, we compare against the default settings (𝑤_𝑢 = 𝑤_𝑐 = 1).
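   A hedged sketch of this fitting step (the search space, split sizes and data handling are illustrative, not the exact configuration used here):

    # Hedged sketch of meta-model fitting: a LightGBM classifier over the
    # meta-features, tuned with hyperopt; space and split are illustrative.
    import lightgbm as lgb
    from hyperopt import fmin, hp, tpe
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    def fit_meta_model(X, y):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.125)

        def objective(params):
            model = lgb.LGBMClassifier(num_leaves=int(params["num_leaves"]),
                                       learning_rate=params["learning_rate"])
            model.fit(X_tr, y_tr)
            return -f1_score(y_val, model.predict(X_val), average="macro")

        space = {"num_leaves": hp.quniform("num_leaves", 15, 255, 1),
                 "learning_rate": hp.loguniform("learning_rate", -5, 0)}
        best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
        final = lgb.LGBMClassifier(num_leaves=int(best["num_leaves"]),
                                   learning_rate=best["learning_rate"])
        return final.fit(X_tr, y_tr)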


5     RESULTS
5.1    Meta-model predictive performance
Table 3 presents the performance of Juggler and baseline models in each policy, across multiple classification predictive
performance metrics: precision, recall, F1, False Positive Rate (FPR) and Area Under Curve (AUC).



                             Policy    Model        Precision      Recall    F1    FPR     AUC
                               I      Juggler         0.40          0.41    0.39   0.16    0.72
                                      Baseline        0.06          0.25    0.10   0.25     0.5
                                II    Juggler         0.35          0.37    0.35   0.18    0.68
                                      Baseline        0.07          0.27    0.12   0.27     0.5
                               III    Juggler         0.41          0.45    0.37   0.35    0.63
                                      Baseline        0.17          0.42    0.25   0.42     0.5
                               IV     Juggler         0.40          0.45    0.38   0.26    0.67
                                      Baseline        0.11          0.33    0.17   0.33     0.5
                                V     Juggler         0.40          0.44    0.40   0.22    0.73
                                      Baseline        0.13          0.36    0.19   0.36     0.5
                               VI     Juggler         0.40          0.43    0.39   0.17    0.72
                                      Baseline        0.09          0.29    0.13   0.29     0.5
                               VII    Juggler         0.38          0.41    0.37   0.21    0.68
                                      Baseline        0.09          0.30    0.14   0.30     0.5
                                          Table 3. Meta-model scores across policies.




    The results show Juggler consistently outperforms the baseline on all metrics, showing it is the better option for correctly predicting the assigned meta-labels. We inspect the value of such predictions in Section 5.3.
5.2   Meta-feature importance
Table 4 presents the distribution of meta-features in Juggler for each policy. The procedure counts how frequently each feature appears in tree nodes, with a higher value meaning the feature is more important. The results are grouped by meta-feature class for readability purposes.

                               ℱ           I              II          III              IV         V            VI         VII
                             Search      81.60          74.35       75.20            79.43      71.38         79.77      70.79
                              𝒻𝑅𝐻        18.10          25.65       21.80            18.55      28.62         20.23      29.21
                              𝒻𝑅𝐶        0.30             0          3.00             2.02        0             0          0
                                      Table 4. Percent share from each meta-feature class.



   The results show the search features are the most frequent (> 70%) in all Juggler models, an expected result given the segmentation ability of multiple categorical variables - in fact, the destination identifier accounts for the majority of this frequency. On the other hand, ranking comparison meta-features 𝒻_𝑅𝐶 are almost negligible. A justification for this behaviour may come from the fixed number of properties used to create the rankings (i.e. only the top 30 properties). Future work should address this issue by considering different thresholds. The ranking histogram meta-features 𝒻_𝑅𝐻 account for 18-30% of feature frequency, establishing themselves as a valuable meta-feature class.

5.3   Simulations
Figure 4 presents the distribution of predicted meta-labels, represented by respective sections for readability purposes.



                                         Fig. 4. Predicted section distribution per policy.


   We observe most policies take advantage of all 5 sections, which is an indicator that this is a suitable way to define meta-labels. Good examples can be observed in policies I, II, V, VI and VII. Notice also how the medium/medium section is typically one of the sections with the fewest predictions. This is in itself a justification for dynamic weight prediction, the key contribution of our work.
   There are, however, some policies with significant class imbalance, most notably policy III: the vast majority of predictions fall under the low/high section. Such a problem could potentially be addressed by increasing the granularity of the grid-search when defining sections, although we cannot exclude the hypothesis that the policy is ill-defined.


5.4    Multi-Stakeholder impact
We now inspect the impact of the predictions in terms of the dimensions explored by the policies. Table 5 presents the percent changes against the default method (𝑤_𝑢 = 𝑤_𝑐 = 1) in the top 30 properties with regard to multiple dimensions.


                               Metric         I        II     III    IV        V      VI     VII
                               NDCG          0.7     -1.8    -5.6    6.0     -3.2    -2.6    4.3
                                Price       18.7      2.7   -16.2   -27.0     -4     -0.1   22.3
                               Margin        8.7    18.6      0.5   19.4    11.7     8.7    20.7
                               Rating        3.7      2.0    -3.4    6.2      0.1    0.5     5.6
                              Distance      21.3      7.6    -17    27.7     -1.2    1.7     22
                              Reviews       92.5     -6.9   -27.8   71.7    -15.1    -9.4   24.2
                               Market       -32.8   -16.6   24.3    -51.3    -1.7    -5.4   -46.2
                             Production     -14.9    -7.3   11.3    -23.6    -1.0    -2.1   -20.6
                                Type        -31.8   -15.6   23.5     -50     -1.2    -4.8   -44.5
                              Recency       -28.7    -14    21.3    -45.1    -1.2    -4.2    -40
                               Brand         -27    -13.7   20.2    -42.4    -1.4    -4.3   -38.1
Table 5. Percent changes over default (𝑤𝑢 = 𝑤𝑐 = 1), with positive values meaning an improvement. Underlined values refer to
metrics optimized in policy, for improved readability.




   The results show the following:

      • Policy I - improves on all metrics in the policy, most notably distance and price. However, the fairness metrics are negatively affected, particularly property type and sub-market. Although this policy works as expected, the impact is too negative on the Providers' side, meaning it will negatively affect our relationship with them and potentially lead to withdrawal from the System.
      • Policy II - does not work as expected: it shows a decrease in all objective metrics, especially in terms of fairness. It shows, however, improvement in price, margin, rating and distance. One possible justification is that the current set of adjustments has been designed and tuned from the System's point of view, and leads to sub-optimal performance when we try to tune it for a different perspective.
      • Policy III - improves margin, but with a negative effect on NDCG. Indirectly, it improves all fairness metrics, but conversion is severely impaired. This behaviour points to the overall strength of the compensation adjustment, which seems at times to overpower the utility score - to address this issue, one could retry the experiment with more constraints on the compensation adjustment weight range.
      • Policy IV - improves on all metrics in the policy, plus margin and reviews. It seems this policy also optimizes for the System's objectives, which is an interesting finding. However, it introduces the most negative effect on fairness metrics across policies - this behaviour may lead, as in Policy I, to the risk of Providers leaving the System, which will contribute to negative long-term value.
      • Policy V - fails on all metrics in the policy (and most of the remaining), except for margin. However, the resulting policy is the one which deviates the least from the default ranking - this indicates there may be issues in optimizing for fairness objectives without any specific fairness adjustment included in the equation. It may also mean the existing fairness metrics are not ideal to be optimized in this setup, although such an observation requires more experimental data to be supported.
      • Policy VI - improves margin, rating and distance but is worse on all other metrics. The results seem to indicate, as in Policy III, that the subset of weights selected for the compensation weight is not ideal, as it allows the compensation adjustment to overpower the remaining adjustments.
      • Policy VII - improves NDCG and price, but degrades substantially on fairness metrics. It also improves ratings, margin, distance and reviews. This policy performs well overall, although it also highlights the issue of excluding fairness-based adjustments from the equation.
    Overall, the simulation results show decent behaviour against the policy expectations, with only 2 policies failing on most or all accounts: policies II and V. These policies have in common the fact that they optimize for all fairness metrics, which may be an indication that the proposed Group Fairness function in Equation 6 is not the most suitable definition for this scope. It could also mean that the absence of a fairness adjustment for the meta-model to weight makes such optimization troublesome. It would be interesting to try different fairness definitions in future work to understand whether this issue can be addressed in such policies.
    Another interesting observation comes from the fact that not all policies are easily optimizable. For instance, while
policies I and IV optimize all metrics, policies III and VI are able to optimize only in a few objectives. This happens
because: 1) the hidden adjustments in Equation 10 have great impact on the overall results (these are most noticeable
when the weights predicted are lower than the default value) and 2) the predicted weights affect all items in a search
equally, meaning it has limited impact on how much can be optimized. Both points can be addressed by including the
remaining adjustments under this framework.
   Regarding stakeholders' impact, it seems the policies are better at improving recommendations for Customers and the
System than for Producers: in fact, only in policy III are fairness metrics greatly improved, even though they are not
included in that policy. This means the current recommendation procedure can only increase fairness by increasing
margin. To address this issue, future work could directly weight fairness adjustments through Juggler, either using
global weights or weighting each property individually across all adjustments. This would allow more granular
corrections using Juggler and potentially a better approximation to all stakeholders' objectives.
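A hedged sketch of the two options follows: a single global weight applied to every fairness adjustment versus one weight per producer property. All adjustment names and values here are hypothetical placeholders.

```python
import numpy as np

base_score = np.array([0.9, 0.7, 0.5])
# Hypothetical per-item fairness adjustments, one array per producer property.
fairness_adjustments = {
    "chain_size": np.array([0.2, 0.0, 0.1]),
    "region":     np.array([0.0, 0.3, 0.1]),
}

# Option A: one global fairness weight (coarse correction).
w_global = 0.5
score_a = base_score + w_global * sum(fairness_adjustments.values())

# Option B: one weight per property (the more granular correction).
w_per_property = {"chain_size": 0.8, "region": 0.2}
score_b = base_score + sum(w * fairness_adjustments[p]
                           for p, w in w_per_property.items())
print(score_a, score_b)
```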

6    CONCLUSIONS
This paper has presented a Meta-Learning approach to address the Multi-Stakeholder recommendation problem. We
take advantage of historical transactions and simulations to understand which are the best weights per search, given a
specific policy. The results show the models have high predictive performance, and the simulations show the approach
is able to successfully improve multiple objectives, depending on the chosen policy. There remain, however, several
avenues for improvement: replacing manually defined meta-features with embeddings; identifying more and better
metrics to be optimized in policies; using Reinforcement Learning models able to perform exploration of objective
vectors per search (for instance, Multi-Armed Bandits); and improving the policy definition mechanism with constraints
and/or primary and secondary objectives, in order to clarify the assignment of best labels.
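For concreteness, the sketch below shows the shape of the meta-level training loop described above, with LightGBM [25] as the meta-learner; the meta-features, label encoding, and sizes are illustrative placeholders rather than our exact experimental setup.

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative meta-dataset: one row per historical search, described by
# hand-crafted meta-features; the label is the index of the objective
# (weight) vector that performed best for that search under the policy.
X = rng.normal(size=(10_000, 20))      # search-level meta-features
y = rng.integers(0, 8, size=10_000)    # index of the best objective vector

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Multiclass meta-model: search context -> best objective vector.
meta_model = lgb.LGBMClassifier(n_estimators=200)
meta_model.fit(X_tr, y_tr)

# At serving time, the predicted index selects the weight vector to apply
# to the adjustments of a (possibly unseen) search.
best_vector_idx = meta_model.predict(X_va[:1])
```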






REFERENCES
 [1] Himan Abdollahpouri, Gediminas Adomavicius, Robin Burke, Ido Guy, Dietmar Jannach, Toshihiro Kamishima, Jan Krasnodebski, and Luiz Pizzato.
2020. Multistakeholder recommendation: Survey and research directions. User Modeling and User-Adapted Interaction 30, 1 (May 2020), 127–158.
 [2] Himan Abdollahpouri and Robin Burke. 2019. Multi-stakeholder Recommendation and its Connection to Multi-sided Fairness. arXiv:1907.13158
 [3] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Recommender systems as multistakeholder environments. In UMAP 2017 -
     Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. 347–348.
 [4] Himan Abdollahpouri and Steve Essinger. 2017. Multiple stakeholders in music recommender systems. arXiv:1708.00120
 [5] Gediminas Adomavicius and Jingjing Zhang. 2012. Impact of data characteristics on recommender systems performance. ACM Transactions on
     Management Information Systems 3, 1 (2012), 1–17.
 [6] Amos Azaria, Avinatan Hassidim, Sarit Kraus, Adi Eshkol, Ofer Weintraub, and Irit Netanely. 2013. Movie Recommender System for Profit Maximization.
     Technical Report.
 [7] J. Bergstra, D. Yamins, and D. D. Cox. 2013. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision
     Architectures. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (Atlanta, GA, USA).
     JMLR.org, I–115–I–123.
 [8] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta. 2009. Metalearning: Applications to Data Mining (1 ed.). Springer.
 [9] Robin Burke, Nasim Sonboli, and Aldo Ordoñez-Gauger. 2018. Balanced Neighborhoods for Multi-sided Fairness in Recommendation. Technical
     Report.
[10] Andrew Collins, Joeran Beel, and Dominika Tkaczyk. 2018. One-at-a-time: A Meta-Learning Recommender-System for Recommendation.
     arXiv:1805.12118
[11] Andrew Collins, Laura Tierney, and Joeran Beel. 2020. Per-Instance Algorithm Selection for Recommender Systems via Instance Clustering. Technical
     Report.
[12] Tiago Cunha, Carlos Soares, and André C.P.L.F. Carvalho. 2017. Metalearning for Context-aware Filtering: Selection of Tensor Factorization
     Algorithms. In Proceedings of the 11th ACM Conference on Recommender Systems. ACM, 14–22.
[13] Tiago Cunha, Carlos Soares, and André C.P.L.F. de Carvalho. 2018. Metalearning and Recommender Systems: A literature review and empirical study
     on the algorithm selection problem for Collaborative Filtering. Information Sciences 423 (2018).
[14] Tiago Cunha, Carlos Soares, and André C.P.L.F. de Carvalho. 2016. Selecting Collaborative Filtering algorithms using Metalearning. In European
     Conference on Machine Learning and Knowledge Discovery in Databases. 393–409.
[15] Tiago Cunha, Carlos Soares, and André C.P.L.F. de Carvalho. 2018. CF4CF: Recommending Collaborative Filtering Algorithms Using Collaborative
     Filtering. In Proceedings of the 12th ACM Conference on Recommender Systems. 357–361.
[16] Michael Ekstrand and John Riedl. 2012. When Recommenders Fail: Predicting Recommender Failure for Algorithm Selection and Combination. In
     Proceedings of the 6th ACM Conference on Recommender Systems. ACM, 233–236.
[17] Michael D. Ekstrand, Robin Burke, and Fernando Diaz. 2019. Fairness and discrimination in recommendation and retrieval. In Proceedings of the 13th
     ACM Conference on Recommender Systems. ACM, Inc, 576–577.
[18] Michael T M Emmerich and André H Deutz. 2018. A tutorial on multiobjective optimization: fundamentals and evolutionary methods. Natural
     Computing 17 (2018), 585–609.
[19] Ruoyuan Gao and Chirag Shah. 2019. How fair can we go: Detecting the boundaries of fairness optimization in information retrieval. In ICTIR.
     ACM, 229–236.
[20] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM
     Transactions on Information Systems 22, 1 (2004), 5–53.
[21] Hisao Ishibuchi, Noritaka Tsukamoto, and Yusuke Nojima. 2008. Evolutionary Many-Objective Optimization: A Short Review. In Proceedings of the
     2008 IEEE Congress on Evolutionary Computation. 2424–2431.
[22] Dietmar Jannach and Gediminas Adomavicius. 2016. Recommendations with a purpose. In Proceedings of the 10th ACM Conference on Recommender
     Systems. 7–10.
[23] Alexandros Kalousis. 2002. Algorithm Selection via Meta-Learning. Ph.D. Dissertation. University of Geneva, Department of Computer Science.
[24] Jorge Kanda, André de Carvalho, Eduardo Hruschka, Carlos Soares, and Pavel Brazdil. 2016. Meta-learning to select the best meta-heuristic for the
     Traveling Salesman Problem: A comparison of meta-features. Neurocomputing 205 (2016), 393–406.
[25] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient
     Boosting Decision Tree. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc.
[26] Bingdong Li, Jinlong Li, Ke Tang, and Xin Yao. 2015. Many-objective evolutionary algorithms: A survey. ACM Comput. Surv. 48, 1, Article 13 (2015).
[27] Wei Lu, Shanshan Chen, Keqian Li, and Laks V.S. Lakshmanan. 2014. Show me the money: Dynamic recommendations for revenue maximization. In
     Proceedings of the VLDB Endowment. 1785–1796.
[28] Rishabh Mehrotra and Benjamin Carterette. 2019. Recommendations in a marketplace. In Proceedings of the 13th ACM Conference on Recommender
     Systems. 580–581.



[29] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. 2018. Towards a Fair Marketplace: Counterfactual
     Evaluation of the Trade-off between Relevance, Fairness and Satisfaction in Recommendation Systems. In Proceedings of the 27th ACM International
     Conference on Information and Knowledge Management. ACM, 2243–2251.
[30] Mustafa Mısır and Michèle Sebag. 2017. Alors: An algorithm recommender system. Artificial Intelligence 244 (2017), 291–314.
[31] Phong Nguyen, John Dines, and Jan Krasnodebski. 2017. A multi-objective learning to re-rank approach to optimize online marketplaces for multiple
     stakeholders. arXiv:1708.00651
[32] Phong Nguyen, Jun Wang, Melanie Hilario, and Alexandros Kalousis. 2012. Learning Heterogeneous Similarity Measures for Hybrid-
     Recommendations in Meta-Mining. IEEE International Conference on Data Mining (2012), 1026–1031.
[33] T. Pereira, T. Cunha, and C. Soares. 2020. 𝜇-cf2vec: Representation Learning for Personalized Algorithm Selection in Recommender Systems. In
     2020 International Conference on Data Mining Workshops. IEEE Computer Society, 181–188.
[34] John Rice. 1976. The Algorithm Selection Problem. Advances in Computers 15 (1976), 65–118.
[35] Mario Rodriguez, Christian Posse, and Ethan Zhang. 2012. Multiple objective optimization in recommender systems. In Proceedings of the 6th ACM
     Conference on Recommender Systems. 11–18.
[36] André Luis Debiaso Rossi, André Carlos Ponce De Leon Ferreira de Carvalho, Carlos Soares, and Bruno Feres de Souza. 2014. MetaStream: A
     meta-learning based method for periodic algorithm selection in time-changing data. Neurocomputing 127 (March 2014), 52–64.
[37] Kate Smith-Miles. 2008. Cross-disciplinary perspectives on meta-learning for algorithm selection. Comput. Surveys 41, 1 (Dec. 2008), 1–25.
[38] Carlos Soares, Pavel B Brazdil, and Petr Kuba. 2004. A Meta-Learning Method to Select the Kernel Width in Support Vector Regression. Machine
     Learning 54, 3 (2004), 195–209.
[39] Joaquin Vanschoren. 2010. Understanding machine learning performance with experiment databases. Ph.D. Dissertation. Katholieke Universiteit
     Leuven.
[40] Joaquin Vanschoren. 2018. Meta-Learning: A Survey. CoRR abs/1810.03548 (2018). arXiv:1810.03548
[41] Christian Von Lücken, Benjamín Barán, and Carlos Brizuela. 2014. A survey on multi-objective evolutionary algorithms for many-objective problems.
     Computational Optimization and Applications 58 (2014), 707–756.
[42] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A Theoretical Analysis of NDCG Type Ranking Measures. In Proceedings of the
     26th Annual Conference on Learning Theory (Proceedings of Machine Learning Research, Vol. 30). PMLR, Princeton, NJ, USA, 25–54.
[43] David Wolpert and William Macready. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1, 1 (April
     1997), 67–82.
[44] Yong Zheng. 2019. Multi-Stakeholder Recommendations: Case Studies, Methods and Challenges. In Proceedings of the 13th ACM Conference on
     Recommender Systems. ACM.
[45] Aimin Zhou, Bo-Yang Qu, Hui Li, Shi-Zheng Zhao, Ponnuthurai Nagaratnam Suganthan, and Qingfu Zhang. 2011. Multiobjective evolutionary
     algorithms: A survey of the state of the art. Swarm and Evolutionary Computation 1 (2011), 32–49.



