<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Momentum-based Gradient Methods in Multi-Objective Recommendation∗</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>BLAGOJ MITREVSKI</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symphony</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>North Macedonia MILENA FILIPOVIC</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swisscom</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Switzerland DIEGO ANTOGNINI</string-name>
          <email>diego.antognini@epfl.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ecole Polytechnique Fédérale de Lausanne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Switzerland EMMA LEJAL GLAUDE</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swisscom</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Switzerland BOI FALTINGS</string-name>
          <email>boi.faltings@epfl.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ecole Polytechnique Fédérale de Lausanne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Switzerland CLAUDIU MUSAT</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swisscom</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Switzerland</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Authors' addresses: Blagoj Mitrevski</institution>
          ,
          <addr-line>Symphony, North</addr-line>
          <country country="MK">Macedonia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Multi-objective gradient methods are becoming the standard for solving multi-objective problems. Among others, they show promising results in developing multi-objective recommender systems with both correlated and conflicting objectives. Classic multi-gradient descent usually relies on the combination of the gradients, not including the computation of first and second moments of the gradients. This leads to a brittle behavior and misses important areas in the solution space. In this work, we create a multi-objective model-agnostic Adamize method that leverage the benefits of the Adam optimizer in single-objective problems. This corrects and stabilizes the gradients of every objective before calculating a common gradient descent vector that optimizes all the objectives simultaneously. We evaluate the benefits of Multi-objective Adamize on two multi-objective recommender systems and for three diferent objective combinations, both correlated or conflicting. We report significant improvements, measured with three diferent Pareto front metrics: hypervolume, coverage, and spacing. Finally, we show that the Adamized Pareto front strictly dominates the previous one on multiple objective pairs. Additional Key Words and Phrases: multi-objective recommender systems, gradient-based optimization methods</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>Decision-making relies on multiple factors. The world is complex, and many problems require an optimization for more
than one objective. Multi-objective problems are present in fields like engineering, economics, finance, logistics, and
many more. Multi-objective optimization is the area of decision-making in which we simultaneously optimize for
more than one objective. We distinguish two types of objectives: the correlated and the conflicting ones. When the
objectives are conflicting, the choice of the optimal decisions needs to be taken in the presence of trade-ofs: choosing
one objective usually comes at the expense of the others. In practice, the decision of choosing the best solution is left to
the domain experts or the business stakeholders. Multi-objective optimization provides a data-driven alternative.</p>
      <p>
        Recommenders are not only about relevance. One of the objectives of recommender systems is to be accurate, namely
to successfully model the user’s preferences. These systems are however not limited to accuracy. Another objective that
can improve the user’s experience with the recommender system is proposing more diverse content. It helps the user to
escape their filter bubble that can reduce user creativity, learning, and connection [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Also, promoting more recent
content [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ] can bring social value by keeping the user up to date.
      </p>
      <p>
        However, among the multiple stakeholders of the recommender system it is possible to encounter diverse and
competing objectives. For instance, increasing the revenue for a company does not always mean the user will get a
∗Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Presented at the MORS workshop held in conjunction with the 15th ACM Conference on Recommender Systems (RecSys), 2021, in Amsterdam,
Netherlands.
†Work done while at EPFL and Swisscom.
better and improved experience. If an application store puts more importance on recommending overpriced applications
it may increase its revenue, but this strategy will hurt developers of free and cost-efective applications; also it will
put a burden on the user’s budget. This becomes more frequent, as more and more companies are becoming socially
responsible [
        <xref ref-type="bibr" rid="ref21 ref22 ref7">7, 21, 22</xref>
        ], which can be unaligned with traditional business objectives.
      </p>
      <p>
        From one to multiple objectives. Prior work [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] proposed the gradient-based multi-objective optimization algorithm,
called the Stochastic Multi-Subgradient Descent Algorithm (SMSGDA), or an improved version for recommendation [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
The method computes the gradients of each objective and then constructs a common descent vector by taking a linear
combination of the individual gradients. Each gradient’s weight is computed by solving a quadratic constrained
optimization problem. Finally, the model parameters are updated in the opposite direction of the common descent vector. The
problem with stochastic optimization is the stochasticity that comes from using mini-batches or dropout regularization.
      </p>
      <p>
        In single-objective settings, this problem is solved using optimizers like Adam [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and RMSprop [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These stabilize
the computation and speed up to convergence. In a similar fashion, we introduce a simple yet efective Adamize trick
for multi-objective problems. We keep track of the first and second moments of the gradients and use the momentums
to correct the gradients and compute better gradient weights. Finally, we calculate a more stable common descent
vector using the corrected gradients.
      </p>
      <p>In this work, we thus make the following contributions: we address the recommendation task with multiple-objectives,
in which objectives can either be correlated or conflicting. We first present the Adamize trick to correct and stabilize
the gradients of every objective before aggregating them into a common gradient descent. We then show that our
novel multi-gradient descent method is model-agnostic and can be easily integrated into state-of-the-art recommender
systems. We evaluate our method using two real-world recommendation datasets with up to three objectives. We then
compare the results of the momentum-based optimization with the state-of-the-art using three diferent metrics based on
the resulting Pareto fronts. As the observed diferences are stark, we complement our analysis with visualizations that
further underline the usefulness of momentum-based multi-gradient descent in multi-objective recommender systems.
2</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        With the advances of neural approaches in other fields, they also found their way into recommendation systems.
First, the introduction of Neural network-based Collaborative Filtering [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] showed promising results. Later, the
Variational Autoencoders for Collaborative Filtering [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] became state-of-the-art and still keeps its title as one of the best
collaborative filtering based recommender.
      </p>
      <p>
        Recommender systems and ranking problems have similarities: learning a personalized recommender can be
transformed as a ranking problem [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The multi-objective ranking optimization in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is solved by label aggregation. This
method collects the multiple labels of the training examples into a single label, and then use a single-objective optimizer
to rank the aggregated label, solving the multi-objective problem.
      </p>
      <p>
        Alternatively, the gradient-based methods can solve the multi-objective optimization problem. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors
propose the Multi-Gradient Descent Algorithm (MGDA) for optimizing multi-objective based on the steepest descent
method. This algorithm is an adjustment of the classical gradient descent algorithm to work with multiple objectives.
The same authors of the MGDA algorithm extended it to Stochastic Multi-Subgradient Descent Algorithm (SMSGDA)
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The SMSGDA is a stochastic version of the MGDA that could also work with non-smooth objective functions. A
more robust gradient-based multi-objective optimization algorithm that still works in cases when the exact gradients
could not be computed is presented in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. To alleviate the inaccuracies, an additional condition is presented for the
descent direction. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed a gradient-based algorithm for optimizing multi-objective recommender systems. Their
Manuscript submitted to ACM
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>BACKGROUND 3.1</title>
      <p>
        3.1.1
There are diferent ways of solving the multi-objective optimization problem, such as re-ranking or gradient-based
solutions. In this work, we focus on the latter and base our work on the multi-gradient descent algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Definitions</title>
      <p>Multi-Objective Optimization. The multi-objective optimization of a model can be formally defined as:
where  are the model parameters,  is the dimension of the model parameters, L ( ) : R → R is a vector valued
objective function with continuously diferentiable objective functions
L ( ) : R → R.
3.1.2</p>
      <p>
        Common Descent Vector. The common descent vector [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is the core of the multi-gradient descent algorithm. It is
computed with a linear combination of the gradients:
m∈iRn L ( ) = m∈iRn L1 ( ), . . . L ( )
∇ L ( ) =

=1
Õ  ∇ L ( )
solution is based on finding a common descent vector, which is a combination of the gradients of every objective. By
taking an optimization step in the opposite direction of this common descent vector, the model is optimized for all
objectives simultaneously. We build upon this work and improve convergence and stability of the optimization.
(1)
(2)
(3)
with  ≥ 0,  ∈ {1, . . . , }, and Í
      </p>
      <p>=0  = 1, where L ( ) is the gradient of the i-th objective,  is the weight of the
i-th gradient objective,  is the number of objectives, and  are the model parameters.
3.1.3</p>
      <p>Pareto Stationary Solution. A solution  of the eq. (2) is Pareto stationary if it satisfies the Karush–Kuhn–Tucker
(KKT) conditions. In other words, there exists 1 . . .  that statisfy the three following constraints:

=0

=1
1 . . .  ≥ 0,</p>
      <p>Õ  = 1, and Õ  ∇ L ( ) = 0
3.2</p>
    </sec>
    <sec id="sec-5">
      <title>Multi-Gradient Descent Algorithm (MGDA)</title>
      <p>
        After the definition of the common descent vector and the Pareto stationary solution, we present the multi-gradient
descent algorithm (MGDA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The algorithm is deterministic and is proven to converge to a Pareto stationary solution.
For an arbitrary number of objectives, this algorithm computes the alphas (i.e., weights of gradients, see eq. (2)) to
create a common descent vector. This vector is made such that the optimization step in the opposite direction of this
common descent vector; all the objectives are simultaneously optimized. To compute the alphas, we need to solve the
following quadratic constrained optimization problem (QCOP):
1,...,  =1
      </p>
      <p>
min  Õ  ∇ L ( ) |
2 
=1
Õ  = 1,  ≥ 0






After computing the alphas, we compute the final common descent vector ∇ L ( ). If ∇ L ( ) = 0 the solution is
Pareto Stationary. Otherwise, ∇ L ( ) ≠ 0, the solution is not Pareto Stationary and thus, we apply an optimisation
step in the opposite direction of the common descent vector, improving each objective at once.</p>
      <p>
        It is worth noting that if there are two objectives, an analytical solution to the QCOP problem exists. Otherwise, the
QCOP can be solved by using the Frank-Wolfe constrained optimization algorithm as in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
3.3
      </p>
    </sec>
    <sec id="sec-6">
      <title>Stochastic Multi-Subgradient Descent Algorithm (SMSGDA)</title>
      <p>
        The previous multi-gradient descent algorithm has few drawbacks to be used in real-world problems. A first one is the
need to compute the full gradient at every optimization step which makes it computationally expensive and in some
cases infeasible. A second one, the requirements do not allow to use non-smooth loss functions as objective functions.
All of these drawbacks are solved by the Stochastic Multi-Subgradient Descent Algorithm (SMSGDA) presented in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
The Stochastic Multi-Subgradient Descent Algorithm is similar to the Multi-Gradient Descent Algorithm, with the
diference that instead of computing the gradients for every objective and then computing the alphas using the whole
dataset, we are computing them on a subset of the dataset. Therefore, the stochasticity comes from using mini-batches.
3.4
      </p>
    </sec>
    <sec id="sec-7">
      <title>Gradient Normalization</title>
      <p>In real-world use-cases, the objectives for which we are optimizing may have diferent scales. This causes a problem
for the MGDA and SMSGDA algorithms because they will favor the objectives that have a higher scale, leading to
unbalanced solutions that perform well on certain objectives, but badly on the others. To solve this problem, after
computing the gradients, the authors normalize them to interval according to the maximal empirical loss for each
objective: ∇ Lˆ ( ) = ∇ L ( ) ,where ∇ Lˆ ( ) is the resulting normalized gradient, ∇ L ( ) is the original</p>
      <p>L ( )
gradient of the -th objective, L ( ) is the initial loss for the -th objective which is used as approximation for the
maximum empirical loss for the given objective.
4</p>
    </sec>
    <sec id="sec-8">
      <title>THE ADAMIZE TRICK FOR MULTI-OBJECTIVE OPTIMIZATION</title>
      <p>
        When optimizing models on a single objective, we are usually doing it in a stochastic fashion. The stochasticity comes
from using mini-batch stochastic gradient descent where we use subsets of the data to compute the gradient, or use a
dropout regularization [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The stochasticity in the optimization algorithm introduces noise in the gradient and may
cause the algorithm to converge slower, or even diverge.
      </p>
      <p>
        There exist multiple optimizers like Adam [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and RMSprop [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] which aim to stabilize the gradients when doing
an optimization step. They achieve the stabilization by keeping a running average of the first and second moments of
the gradients and taking a step in the opposite direction of the corrected gradient by using the first and the second
momentum. For example, the corrected gradient moves faster on steep slopes and oscillates less on valleys and thus,
move faster to the optima. Following the intuition behind ADAM and RMSprop, it may be beneficial, when using the
Stochastic Multi-Subgradient Descent Algorithm (SMSGD) to smooth the gradients from the diferent objectives before
calculating alphas and combining them to get the final common descent vector. Intuitively, this may lead to more stable
alpha computations, faster convergence, and convergence to better solutions.
      </p>
      <p>The vanilla SMSGD algorithm is presented in Section 3.3. Our proposition is to use Adam based optimizers for every
objective before computing the common descent vector. We directly add the Adam computation for every objective.
Therefore, the diference with the vanilla SMSGD is that we are also keeping the running average for the gradient of
every objective, instead of keeping only the average of the common descent vector. Since these are the gradients that
afect the computations of the alphas, the final common descent vector is expected to be more stable. The pseudo-code
of the Adamize trick for the gradients is presented in Algorithm 1 and Algorithm 2. The diference between the vanilla
SMSGD and Algorithm 1 is in the bold line: instead of using the original gradients from every objective, we correct them
Manuscript submitted to ACM
using the first and second momentums, and we use the corrected gradients to compute the alphas and the common
descent vector. To overcome the cold-start problem like in the original Adam algorithm, we initialize all the parameters
to zero, and then update them very epoch.</p>
      <p>In terms of computation and memory requirement, the complexity is linear with respect to the number of objectives.
Memory-wise, we need a constant memory to save the first and second momentums of every objective. Furthermore, the
computation is constant with regard to the number of objectives, and adding an additional objective would require an
additional call of the Adamize procedure, which does a constant computations with regards to the number of objectives.
As the number of objectives is small, the overhead of our method is insignificant. Finally, this method is model-agnostic
and can be used to optimize any model in a multi-objective fashion.</p>
    </sec>
    <sec id="sec-9">
      <title>5 EXPERIMENTS</title>
      <p>
        In this section, we assess the improvement of the proposed Adamize trick. We follow the experimental design of [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
and apply our method on recommender systems, although it can be used to improve any multi-objective gradient based
solution. We experiment on two datasets and up to three correlated and conflicting objectives. 1
      </p>
    </sec>
    <sec id="sec-10">
      <title>5.1 Objectives for Recommendation</title>
      <p>
        A recommender system can be trained with diferent objectives and for diferent purposes. For example, for some
companies, there might be an economic or strategic incentive to recommend newer, instead of older, content to the users.
Other socially responsible companies would like that their recommender to learn a notion of fairness or awareness
of social biases. In this section, we present the objectives we employ in our experiments. As a use-case, we use the
state-of-the-art variational autoencoder Mult-VAEPR of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to demonstrate how to integrate our objectives into an
existing recommender training procedure. However, we emphasize that they are easily adapted to any other model that,
as recommendation, outputs a vector of probabilities across all the items.
5.1.1 Relevance Objective. This loss measures the relevance of the predicted items for the given user. The idea is to
compare the output of the model with the user’s interactions and measure how good the model can predict the user’s
interactions. The relevance loss in variational autoencoders is simply the reconstruction loss, plus the KL divergence
1For simplicity, we will use interchangeably the words objectives and losses.
between the posterior and the prior. More formally, the loss is:
      </p>
      <p>L (; ,  ) = E ( |) [log  ( |)] −  ∗  ( ( |) || ())
where  is the input vector for a user,  and  are model parameters, z is the variational parameter of the distribution,
and  is the regularizer controlling how much weight to be given to the KL term.</p>
      <p>
        We use the definition of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to measure relevance. We quantify the ratio of relevant top-k items to users with Recall@k:
Í
 @ (, ) :=  =1I[((,|) ∈|)  ] (5)
where  ( ) is the item at rank  ,  the set of held-out items that user  interacted with, and I[·] the indicator function.
5.1.2 Revenue Objective. Alongside the enhanced user experience, a company is incentivized to use a recommender to
increase simultaneously the revenue. Thus, the revenue loss can be used in the training process to boost the
recommendations of expensive items and increase the overall revenue. The loss is similar to the relevance loss of Section 5.1.1,
with a diference that the input of the model is multiplied by a weight vector, representing the prices of the items.
Before computing the log-likelihood for a given user, the input vector for a user is multiplied with the price vector:
(4)
(6)
      </p>
      <p>
        L (; ,  ) = E ( |) [log  ( ∗  |)]
where  is the input vector for a user (it has a value 1 for the items the user has interacted with),  is the price vector
containing the prices of each item, the ∗ symbol denotes element-wise multiplication between two vectors,  and  are
model parameters, and z is the variational parameter of the variational distribution.
5.1.3 Recency Objective. From our practical experience, we came across a finding that users strongly prefer to interact
with recently added content. Furthermore, the authors of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have shown that with the introduction of recency we
could get improved and more precise recommender systems. Computing a recency score for items remains an open
question. For a given dataset, we propose to leverage the timestamps of the items when they first became available. For
an item, we scale its timestamp using a min-max normalization between the first and last interaction any user had with
it, obtaining values in the range of [
        <xref ref-type="bibr" rid="ref1">0-1</xref>
        ]. However, we claim that recency is not a linear function of the time. Since we
want to promote more recent items, we propose to transform the scores according to the following function:
1, if  ≥ 0.8
 () =  (7)
0.3(0.8−)∗ 130 , otherwise

The transformation function and its numeric constants have been optimized on an in-house dataset, but have been
shown to generalize on other datasets. Based on this transformation function, we proposed the recency objective which
stimulates the model to recommend recent items. The input of the model is multiplied by a weight vector, which
represents the recency score of the items, when the loss is computed. Similarly to the other losses, before computing
the log-likelihood for a given user, the input vector for a user is multiplied with the recency vector, or mathematically:
L (; ,  ) = E ( |) [log  (  () ∗  |)] ,where  is the input vector for a user (it has a value 1 for the items the user
has interacted with),  is the recency vector containing the recency score for each item scaled in the range of [
        <xref ref-type="bibr" rid="ref1">0-1</xref>
        ]
using min-max normalization,  is the function from eq. (7), the ∗ symbol denotes element-wise multiplication between
two vectors,  and  are model parameters, z is the variational parameter of the distribution.
5.2
      </p>
    </sec>
    <sec id="sec-11">
      <title>Datasets</title>
      <p>
        In order to assess the efectiveness of our proposed model, we first carried out experiments on the well-known Amazon
Books dataset, being a subsample from the Amazon review dataset [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ]. Along with users preferences for books,
it contains the book prices which can be used as a second revenue objective (see Section 5.1.2). The recency is not
available for this dataset.
      </p>
      <p>
        We also consider the MovieLens 20M dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In terms of objectives, we use the relevance, revenue, and recency
objectives for the MovieLens 20M dataset. For the revenue objective, we enriched it with prices from the Amazon
review dataset by doing a fuzzy joining on the titles of the movies. For the recency objective, in the dataset for every
given rating, there is a timestamp indicating when the rating was given by the user. We assume that a given movie
became available when the first rating was given for it.
      </p>
      <p>We train for the following combination of objectives: 1) Relevance + Revenue objectives, 2) Relevance + Recency
objectives, 3) Revenue + Recency objectives, and 4) Relevance + Revenue + Recency objectives.
5.3</p>
    </sec>
    <sec id="sec-12">
      <title>Preprocessing</title>
      <p>In our experiments, we consider implicit feedback. We binarize ratings by converting ratings ≥ 3.5 to positive interaction,
and ratings &lt; 3.5 to negative interaction. Then, we split the data in a way that 90% of the users with their interactions
are used as training data, 5% are used as validation data, and the remaining 5% are used as testing data. Finally, we mask
20% of interactions per user in the validation and testing data. The remaining 80% of the interactions are used as input
to the model, and the masked 20% are used as ground truth, to compare the model’s output with.
5.4</p>
    </sec>
    <sec id="sec-13">
      <title>Experimental Setings</title>
      <p>
        We implemented the state-of-the-art variational autoencoder Mult-VAEPR of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for collaborative filtering, and
augmented the training loss with the objectives described in Section 5.1. Our model contains an encoder and a decoder.
The encoder consists of two linear layers of sizes 600 and 400, and the decoder also has two linear layers, both with a
size of 600. The number of latent features, the bottleneck of the model is 200. We are also normalizing the input before
we forward it through the model. As regularization, we use a dropout of 0.5 to the input. We will release the code.
5.5
      </p>
    </sec>
    <sec id="sec-14">
      <title>Pareto Front Metrics</title>
      <p>It is not straightforward to compare the multi-objective solution from diferent multi-objective algorithms and
optimization strategies. The solutions from the methods of multi-objective optimization are in the form of Pareto sets. An
initial comparison of two and three-dimensional Pareto sets is to plot them and inspect them visually. Although visual
inspection can help us to rank and compare Pareto set solutions, we seek an objective and systematic way. Therefore,
in this section, we present three metrics for measuring the quality of the Pareto set which can help us measure the
performance of the multi-objective algorithms quantitatively.</p>
      <p>
        Hypervolume [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]: One of the ways of measuring the quality of the Pareto set is to measure the area that is
dominated by it. The intuition is, the larger the area the solution can dominate, the better the solution. Using the
hypervolume to compute the area dominated by a solution, this intuition can be extended to more than two dimensions
[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Since we are interested in increasing the recommender system metrics, we are using the origin as a reference
point for computing the hypervolume.
0.425
0.420
0
l2@0.415
lc0.410
a
e
R
0.405
0.400 28
      </p>
      <p>Vanil a SMSGD
Adamized SMSGD</p>
      <p>29 Reve3n0ue@2031
(a) MovieLens 20M dataset.</p>
      <p>32</p>
      <p>
        Coverage [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]: The coverage is a metric that indicates the fraction of points from one Pareto set that are dominated
by or equal to points from another Pareto set. If a one point 1 is dominated by or equal to another point 2, than it is
said that 2 covers 1. If the coverage is 1.0, that means all the points from the second Pareto set are covered by points
from the first one. Reverse, if the coverage is 0.0, that means none of the points from the second Pareto set are covered by
points from the first set. However, a drawback of the coverage is that it cannot tell us by how much one solution is better
than the other one [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. If 1 is the first Pareto set, 2 is the second Pareto set, and with 1 ≥ 2 we denote that solution
point 1 covers solution point 2, then the coverage metric is defined as: C (1, 2 ) = | {2∈ 2;∃1∈1 :1≥2} |
|2 |
      </p>
      <p>It is important to note that the coverage metric is not symmetric, and both C (1, 2 ) and C (2, 1 ) have to be
examined when evaluating Pareto sets. In our experiments, we report both variants as we apply a pairwise comparison.</p>
      <p>
        Spacing [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: The spacing is a distance-based metric that measures the spread of a given solution. The bigger the
spacing metric is, the more diverse and the more spread are the solutions in the Pareto set. If having the best solutions
in a Pareto set is important, the diversity of the solutions captures the range of choices available to the decision-makers.
This is a concrete business advantage. If  is the Pareto set,  is the distance to the closest neighbour of the -th point
in the Pareto set, and ¯ is the average of  , then the spacing is computed as: SP ( ) = q |1|−1 Í|= 1| ( − ¯)2
6
      </p>
    </sec>
    <sec id="sec-15">
      <title>RESULTS</title>
    </sec>
    <sec id="sec-16">
      <title>6.1 Two Objectives</title>
      <p>Figure 1 shows the Pareto front of the baseline and our method. From both visualizations we can clearly observe that
Adamizing the gradients substantially improves the performance over the SMSGD algorithm. The Pareto front obtained
with our method clearly dominates the vanilla SMSGD algorithm. On MovieLens, the Pareto fronts are more spread
than in the Amazon dataset case.</p>
      <p>To further inspect and quantify the results, we also present the metrics for measuring the quality of the Pareto set in
Table 1. Our proposed algorithm outperforms the baseline substantially in terms of coverage (as can be seen on the
visualization) also in terms of hypervolume, following the visualiation. However, we observe that the spacing of our
method nearly doubles in the MovieLens dataset, but perform similarly on the Amazon Book dataset.</p>
    </sec>
    <sec id="sec-17">
      <title>6.2 Three Objectives</title>
      <p>For better visualization, we project the three-dimensional Pareto fronts on two objectives. Results are available in
Figure 2. Still, from the plots we observe an improvement in all the three combination of objectives.</p>
      <p>Manuscript submitted to ACM</p>
      <p>The Table 2 quantifies the improvement of our proposed method compared to the vanilla SMSGDA. We can see that
the Adamize trick on our method dominates approximately half the solutions found by the vanilla SMSGDA, while
being slightly more spread over the space. In terms of hypervolume, the vanilla SMSGDA performs slightly better.
However, the diference is less significant than the two objectives case because of the curse of dimensionality.</p>
      <p>Supported by the improvements on two diferent datasets, and using up to three diferent objectives, we can say that
the Adamize trick leads on average to better solutions.</p>
      <p>Vanil a SMSGD</p>
      <p>Adamized SMSGD
0.30 0.3R5ecen0c.y4@0 20 0.45
(c) Recency vs Revenue.</p>
      <p>0.50</p>
    </sec>
    <sec id="sec-18">
      <title>7 CONCLUSION</title>
      <p>In this paper we introduced a novel method for multi-gradient descent that leverages a momentum-based optimizer. We
applied the method on a problem with a growing importance - Multi-objective Recommender Systems. We benchmarked
the novel optimization method against the state-of-the-art multi-gradient descent method and reported the results
on three diferent metrics based on the resulting Pareto front: hypervolume, coverage, and spacing. The results show
that the new Pareto fronts are substantially better from all three perspectives. We complemented the analysis with a
visualization of the Pareto fronts that further emphasizes the gains obtained.</p>
      <p>To the best of our knowledge, we are the first to use a momentum-based optimizer for each objective in a
multiobjective setup. We hope that this will inspire research practitioners to test and produce other ideas in the direction of
using momentum-based optimizers per objective in a multi-objective setup. Improving the gradient-based optimization
could benefit all the multi-objective optimization problems, in all applicable fields.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>David</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Elad</given-names>
            <surname>Haramaty</surname>
          </string-name>
          , Arnon Lazerson, and
          <string-name>
            <surname>Liane</surname>
          </string-name>
          Lewin-Eytan.
          <year>2020</year>
          .
          <article-title>Multi-Objective Ranking Optimization for Product Search Using Stochastic Label Aggregation</article-title>
          .
          <source>In Proceedings of The Web Conference</source>
          <year>2020</year>
          .
          <fpage>373</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Abhijnan</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          , Saptarshi Ghosh, Niloy Ganguly, and Krishna P Gummadi.
          <year>2017</year>
          .
          <article-title>Optimizing the recency-relevancy trade-of in online news recommendations</article-title>
          .
          <source>In Proceedings of the 26th International Conference on World Wide Web</source>
          .
          <fpage>837</fpage>
          -
          <lpage>846</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Jean-Antoine Désidéri</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Multiple-gradient descent algorithm (MGDA) for multiobjective optimization</article-title>
          .
          <source>Comptes Rendus Mathematique</source>
          <volume>350</volume>
          ,
          <fpage>5</fpage>
          -
          <lpage>6</lpage>
          (
          <year>2012</year>
          ),
          <fpage>313</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yi</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xue</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Maria E</given-names>
            <surname>Orlowska</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Recency-based collaborative filtering</article-title>
          .
          <source>In Proceedings of the 17th Australasian Database Conference-</source>
          Volume
          <volume>49</volume>
          .
          <fpage>99</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P</given-names>
            <surname>Moreira Gabriel De Souza</surname>
          </string-name>
          ,
          <source>Dietmar Jannach, and Adilson Marques Da Cunha</source>
          .
          <year>2019</year>
          .
          <article-title>Contextual hybrid session-based news recommendation with recurrent neural networks</article-title>
          .
          <source>IEEE Access</source>
          <volume>7</volume>
          (
          <year>2019</year>
          ),
          <fpage>169185</fpage>
          -
          <lpage>169203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F Maxwell</given-names>
            <surname>Harper and Joseph A Konstan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The movielens datasets: History and context</article-title>
          .
          <source>Acm transactions on interactive intelligent systems (tiis) 5</source>
          ,
          <issue>4</issue>
          (
          <year>2015</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Hatcher</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>The social responsibility performance outcomes model building socially responsible companies through performance improvement outcomes</article-title>
          .
          <source>Performance Improvement</source>
          <volume>39</volume>
          ,
          <issue>7</issue>
          (
          <year>2000</year>
          ),
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ruining</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <surname>Julian McAuley</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering</article-title>
          .
          <source>In proceedings of the 25th international conference on world wide web. 507-517.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Xiangnan</given-names>
            <surname>He</surname>
          </string-name>
          , Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and
          <string-name>
            <surname>Tat-Seng Chua</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Neural collaborative filtering</article-title>
          .
          <source>In Proceedings of the 26th international conference on world wide web. 173-182.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Geofrey</surname>
            <given-names>Hinton</given-names>
          </string-name>
          ,
          <source>Nitish Srivastava, and Kevin Swersky</source>
          .
          <year>2012</year>
          .
          <article-title>Neural networks for machine learning lecture 6a overview of mini-batch gradient descent</article-title>
          .
          <source>Cited on 14, 8</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Alexandros</surname>
            <given-names>Karatzoglou</given-names>
          </string-name>
          , Linas Baltrunas, and
          <string-name>
            <given-names>Yue</given-names>
            <surname>Shi</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Learning to rank for recommender systems</article-title>
          .
          <source>In Proceedings of the 7th ACM conference on Recommender systems. 493-494.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Diederik</surname>
            <given-names>P</given-names>
          </string-name>
          <string-name>
            <surname>Kingma and Jimmy Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Dawen</surname>
            <given-names>Liang</given-names>
          </string-name>
          , Rahul G Krishnan,
          <string-name>
            <surname>Matthew D Hofman</surname>
            , and
            <given-names>Tony</given-names>
          </string-name>
          <string-name>
            <surname>Jebara</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Variational autoencoders for collaborative filtering</article-title>
          .
          <source>In Proceedings of the 2018 World Wide Web Conference</source>
          .
          <volume>689</volume>
          -
          <fpage>698</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Nikola</surname>
            <given-names>Milojkovic</given-names>
          </string-name>
          , Diego Antognini, Giancarlo Bergamin, Boi Faltings, and
          <string-name>
            <given-names>Claudiu</given-names>
            <surname>Musat</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Multi-Gradient Descent for Multi-Objective Recommender Systems</article-title>
          .
          <source>Proceedings of the AAAI</source>
          (
          <year>2020</year>
          ) - Workshop on Interactive and
          <source>Conversational Recommendation Systems (WICRS)</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Tien</surname>
            <given-names>T Nguyen</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pik-Mai Hui</surname>
            ,
            <given-names>F Maxwell</given-names>
          </string-name>
          <string-name>
            <surname>Harper</surname>
          </string-name>
          ,
          <source>Loren Terveen, and Joseph A Konstan</source>
          .
          <year>2014</year>
          .
          <article-title>Exploring the filter bubble: the efect of using recommender systems on content diversity</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World wide web. 677-686.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Tatsuya</surname>
            <given-names>Okabe</given-names>
          </string-name>
          , Yaochu Jin, and
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Sendhof</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>A critical survey of performance indices for multi-objective optimisation</article-title>
          .
          <source>In The 2003 Congress on Evolutionary Computation</source>
          ,
          <year>2003</year>
          . CEC'
          <volume>03</volume>
          ., Vol.
          <volume>2</volume>
          . IEEE,
          <fpage>878</fpage>
          -
          <lpage>885</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Peitz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Dellnitz</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Gradient-based multiobjective optimization with uncertainties</article-title>
          .
          <source>In NEO 2016</source>
          . Springer,
          <fpage>159</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Fabrice</surname>
            <given-names>Poirion</given-names>
          </string-name>
          , Quentin Mercier, and
          <string-name>
            <surname>Jean-Antoine Désidéri</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Descent algorithm for nonsmooth stochastic multiobjective optimization</article-title>
          .
          <source>Computational Optimization and Applications</source>
          <volume>68</volume>
          ,
          <issue>2</issue>
          (
          <year>2017</year>
          ),
          <fpage>317</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Ozan</given-names>
            <surname>Sener</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vladlen</given-names>
            <surname>Koltun</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Multi-task learning as multi-objective optimization</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <volume>527</volume>
          -
          <fpage>538</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Nitish</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Improving neural networks with dropout</article-title>
          .
          <source>University of Toronto 182</source>
          ,
          <issue>566</issue>
          (
          <year>2013</year>
          ),
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Mercedes</given-names>
            <surname>Varona</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Incentives to Encourage Companies to Become Socially Responsible</article-title>
          .
          <source>Nuevas Tendencias</source>
          <volume>103</volume>
          (
          <year>2020</year>
          ),
          <fpage>30</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Jolita</given-names>
            <surname>Vveinhardt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Regina</given-names>
            <surname>Andriukaitiene</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Readiness of companies to become socially responsible: social behaviour of an organization and an employee from a demographic viewpoint. Problems and perspectives in management 12, Iss. 2 (contin</article-title>
          .) (
          <year>2014</year>
          ),
          <fpage>215</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Eckart</surname>
            <given-names>Zitzler</given-names>
          </string-name>
          , Dimo Brockhof, and
          <string-name>
            <given-names>Lothar</given-names>
            <surname>Thiele</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>The hypervolume indicator revisited: On the design of Pareto-compliant indicators via weighted integration</article-title>
          .
          <source>In International Conference on Evolutionary Multi-Criterion Optimization</source>
          . Springer,
          <fpage>862</fpage>
          -
          <lpage>876</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Eckart</given-names>
            <surname>Zitzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lothar</given-names>
            <surname>Thiele</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach</article-title>
          .
          <source>IEEE transactions on Evolutionary Computation</source>
          <volume>3</volume>
          ,
          <issue>4</issue>
          (
          <year>1999</year>
          ),
          <fpage>257</fpage>
          -
          <lpage>271</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>