=Paper=
{{Paper
|id=Vol-3135/darliap_paper2
|storemode=property
|title=NeuTraL: Neural Transfer Learning for Personalized Ranking
|pdfUrl=https://ceur-ws.org/Vol-3135/darliap_paper2.pdf
|volume=Vol-3135
|authors=Rasaq Otunba 
|dblpUrl=https://dblp.org/rec/conf/edbt/Otunba22
}}
==NeuTraL: Neural Transfer Learning for Personalized Ranking==
Rasaq Otunba
4400 University Drive, Fairfax, Virginia 22030
Abstract

Personalized ranking continues to be an important aspect of many information and personalization systems. Neural networks and deep learning continue to gain popularity because of their success in different fields of artificial intelligence such as computer vision and natural language processing. Recently, researchers began to apply deep learning to personalized ranking with success. Most personalization systems exploit historical preference data for users and items in the warm-start scenario. A major challenge in personalized ranking occurs in the cold-start scenario, which arises when there is little to no historical preference information. Content information is sometimes available, and it can be used to alleviate the cold-start problem.

We propose a solution that involves transfer learning from a deep model to a shallow model for both warm-start and cold-start personalized ranking. We corroborate our proposal with experiments on publicly available datasets in comparison with baseline and state-of-the-art techniques.

Keywords
neural networks; deep learning; recommendations; personalization; cold-start; ranking
1. Introduction

Personalized ranking with adequate historical preference is referred to as warm-start, while recommendation with inadequate historical preference is referred to as cold-start. We subsequently refer to personalized ranking as ranking except where otherwise clearly stated. We propose a machine learning solution called Neural Transfer Learning for warm-start personalized ranking, referred to as NeuTraL. We then propose a cold-start version of NeuTraL referred to as NeuTraL-C. NeuTraL and NeuTraL-C use neural networks and transfer learning for warm-start and cold-start item ranking, respectively. Item cold-start personalized ranking involves ranking cold-start items, while user cold-start personalized ranking involves ranking cold-start users. There is also the full cold-start entity personalized ranking problem, where both the user and item entities have no historical preference information. Although we focus on cold-start item personalized ranking in this work, we believe the concept is extensible to both the user cold-start and full cold-start personalized ranking problems. Entity content information is sometimes used to compensate for the lack of historical preference information by learning from content information and existing preference information. Ranking can be done for implicit or explicit feedback [1]. We focus on implicit feedback in this work due to its more prevalent nature. The contributions made in this work include:

• We propose a unique approach to extracting pre-trained user latent factors from a state-of-the-art (SOTA) personalization model.
• We transfer the pre-trained user latent factors to a renowned personalization model for warm-start and cold-start ranking, respectively.
• We provide a thorough evaluation and conduct experiments comparing our proposed solutions with other SOTA and baseline techniques.

The remainder of this paper is organized as follows: in Section 2, we highlight related work. We provide pertinent background and notations for the rest of this work in Section 3. We describe our approach in Sections 4 and 5. In Section 6, we describe our experiments and discuss the results. We conclude with potential directions for future work in Section 7.

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK. Contact: rotunba@gmu.edu (R. Otunba). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Work

Personalized ranking techniques typically belong to one of the following categories: collaborative filtering (CF), content-based, or a hybrid of the two. Different CF techniques, ranging from matrix factorization (MF) [2, 3] to k-Nearest Neighbor (kNN) [4], have seen success in personalization systems research. In recent years, deep learning has also been successfully applied to personalization. He et al. replaced the typical dot product of user and item latent features with a deep learning model in their technique referred to as neural collaborative filtering, NCF [5]. NCF performs better than vanilla MF because the non-linearity of the deep learning model captures complex interactions between users and items better.
Deep representation models such as autoencoders and restricted Boltzmann machines (RBM) have been used for personalization [6, 7, 8]. These techniques have been successfully applied and demonstrated on a variety of real-world data, but they are known to suffer from the cold-start problem. Content-based techniques are typically used to tackle the cold-start problem by incorporating entity attributes [9, 10]. Entity attributes are sometimes combined with CF to compensate for the weakness of CF [11, 12] in the cold-start scenario. To alleviate the cold-start problem, some deep learning techniques have been developed that use content information, e.g., the deep content-based music recommendation work proposed by Oord et al. [13]. Most of the deep learning personalization systems proposed for cold start are hybrid in that they combine historical preference and content information [14, 15, 16, 17, 18, 19]. Some of the cold-start personalization systems [20] adopt active learning. However, there are situations where active feedback from users for cold-start items is unavailable. Transfer learning has also been used in personalization systems research [21, 22].
3. Background & Notations

The sets of users and items are denoted by $U$ and $I$, respectively. A measure of preference is recorded as a positive feedback value from some set $S$, or as a negative feedback recorded as 0. When explicitly provided, $S$ could be a set of values, e.g., $\{1, 2, \ldots, 5\}$. When implicitly provided, typically $S = \{0, 1\}$. The matrix of user-item interactions is denoted by

$$\mathbf{Y} \in (\{0\} \cup S)^{|U| \times |I|}, \qquad (1)$$

where an interaction refers to an observable action by a user, e.g., the purchase of an item. The user vector for user $u$ in $\mathbf{Y}$ is denoted as $\mathbf{y}_u$. Conversely, the item vector for item $i$ in $\mathbf{Y}$ is denoted as $\mathbf{y}_i$. The implicit feedback for a user $u \in U$ on an item $i \in I$ is

$$y_{ui} = \begin{cases} 1, & \text{if } u \text{ interacted with } i;\\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

$$I_u^+ = \{\text{set of items interacted with by user } u\}, \qquad (3)$$

$$I_u^- = I - I_u^+. \qquad (4)$$

$U_i^+$, $U_i^-$, and $U_i$ are user sets analogous to the definitions in Equations 3-4. $\mathbf{A}^U$ and $\mathbf{A}^I$ represent the $m$-dimensional user-attribute and $n$-dimensional item-attribute matrices, respectively:

$$\mathbf{A}^U \in \mathbb{R}^{|U| \times m}, \qquad (5)$$

$$\mathbf{A}^I \in \mathbb{R}^{|I| \times n}. \qquad (6)$$

Let $\mathbf{a}_u^U$ be the vector of user attributes $1 \ldots m$ for user $u$, and $\mathbf{a}_i^I$ be the vector of item attributes $1 \ldots n$ for item $i$, so that $a_{ij}^I$ is the $j$-th item attribute value and $a_{uj}^U$ is the $j$-th user attribute value. $a_{ij}^I = 0$ when the attribute is unavailable. The sets $U$ and $I$ are represented by latent feature matrices $\mathbf{U}$ and $\mathbf{I}$, respectively, where

$$\mathbf{U} \in \mathbb{R}^{|U| \times k}, \qquad (7)$$

$$\mathbf{I} \in \mathbb{R}^{|I| \times k}, \qquad (8)$$

and $k$ is the number of latent features. User $u$ and item $i$ are represented by $\mathbf{u}$ and $\mathbf{i}$, respectively. Content data sometimes contains only user attributes, only item attributes, or both. User attributes include demographic information such as age, gender, education level, etc. Social network data can also be mined for user attributes. Item attributes include physical attributes, time of production, location, etc. The task of item ranking is to estimate the relative ranking of the items for each user. We denote the predicted ranking of item $i$ for user $u$ as $\hat{y}_{ui}$, obtained from an inference function $f$:

$$\hat{y}_{ui} = f(\mathbf{u}, \mathbf{a}_u^U, \mathbf{i}, \mathbf{a}_i^I, \theta), \qquad (9)$$

where $\theta$ denotes the model parameters learned during training. Equation 9 shows that $\hat{y}_{ui}$ is a function of the input and the learned model parameters. Model parameters are typically learned via optimization such that an objective loss function is minimized or a utility function is maximized. Objective loss function minimization is expressed as

$$f_E = \arg\min_{\theta} \ell(\theta; \mathbf{Y}), \qquad (10)$$

where $\theta$ is learned from the observation matrix $\mathbf{Y}$ to optimize the estimate function $f_E$ that predicts $\hat{y}_{ui}$. Learning is usually done with machine learning techniques such as gradient descent (GD) [23] or its variants, e.g., Adaptive Moment Estimation (Adam) [24], on carefully sampled user-item pairs.
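To make the notation concrete, the following is a minimal illustrative sketch (hypothetical toy data and variable names, not code from this work) of how the implicit feedback matrix $\mathbf{Y}$ of Equation 2 and the sets $I_u^+$ and $I_u^-$ of Equations 3-4 can be built from a log of observed user-item interactions.

```python
import numpy as np

# Hypothetical toy interaction log: (user, item) pairs observed implicitly.
interactions = [(0, 1), (0, 3), (1, 0), (2, 2), (2, 3)]
num_users, num_items = 3, 4

# Equation 2: binary implicit feedback matrix Y (1 = interacted, 0 = otherwise).
Y = np.zeros((num_users, num_items), dtype=np.int8)
for u, i in interactions:
    Y[u, i] = 1

# Equations 3-4: per-user positive and negative item sets.
I_pos = {u: set(np.flatnonzero(Y[u])) for u in range(num_users)}
I_neg = {u: set(range(num_items)) - I_pos[u] for u in range(num_users)}

print(I_pos[0], I_neg[0])  # e.g. {1, 3} and {0, 2}
```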
4. NeuTraL: Neural Transfer Learning for Personalized Ranking

We provide further background on pertinent information that will aid the understanding of NeuTraL.

Figure 1: NeuTraL: left side shows the pre-trained Auto-Encoder with the transfer to MPR (diagram elements: user/item ratings vector, hidden layer to be transferred after pre-training, output layer; knowledge transfer; user embedding, item embedding, prediction function, predicted output, training, actual output).
4.1. MPR: Multi-Objective Pairwise Ranking

MPR belongs to the pairwise ranking family, where the optimization task is with respect to the actual and predicted values for a pair of items by a user. For item ranking, the pairwise prediction function for a user $u$, a preferred item $i$ and a less preferred item $j$ is expressed as

$$\hat{y}_{u(i,j)} = \hat{y}_{ui} - \hat{y}_{uj}, \qquad (11)$$

while the actual value is

$$y_{u(i,j)} = y_{ui} - y_{uj}. \qquad (12)$$

Conversely, for user ranking, the pairwise prediction function for an item $i$ preferred by user $v$ but not preferred by user $w$ is expressed as

$$\hat{y}_{i(v,w)} = \hat{y}_{iv} - \hat{y}_{iw}, \qquad (13)$$

while the actual value is

$$y_{i(v,w)} = y_{iv} - y_{iw}. \qquad (14)$$

MPR combines item ranking and user ranking. The optimization function is expressed as

$$\sum_{u \in U} \sum_{i \in I_u^+} \sum_{j \in I_u^-} \ell(\hat{y}_{u(i,j)}) + \ell(\hat{y}_{i(v,w)}), \qquad (15)$$

where the objective function $\ell$ is the log-sigmoid function

$$\ell(x) = \ln \sigma(x), \qquad (16)$$

and

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \qquad (17)$$

$\hat{y}_{ui}$ is estimated from an MF model learned with GD. $\hat{y}_{ui}$ is the dot product of the user latent vector $\mathbf{u}$ and the item latent vector $\mathbf{i}$:

$$\hat{y}_{ui} = \mathbf{u}^T \cdot \mathbf{i}. \qquad (18)$$

Assume

$$\mathbf{u} = \{u_1, u_2, \ldots, u_k\} \qquad (19)$$

and

$$\mathbf{i} = \{i_1, i_2, \ldots, i_k\}. \qquad (20)$$

Component $u_f$ of $\mathbf{u}$ represents user $u$'s affinity for an item factor $f$. Component $i_f$ of $\mathbf{i}$ represents the concentration of factor $f$ in item $i$:

$$\mathbf{u}^T \cdot \mathbf{i} = u_1 i_1 + u_2 i_2 + \ldots + u_k i_k. \qquad (21)$$

Each component product $u_f \, i_f$ represents user $u$'s affinity for factor $f$ in item $i$. We subsequently refer to this component product as the latent vector product (LVP) for ease of reference.
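For concreteness, the following is a minimal sketch (our own illustration under an MF model, not the implementation evaluated later) of the item-ranking side of Equations 11 and 15-18: the pairwise score is the difference of two dot products, passed through the log-sigmoid objective. The user-ranking side of MPR is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, k = 3, 4, 8

# MF parameters: user latent matrix U and item latent matrix I (Equations 7-8).
U = rng.normal(scale=0.1, size=(num_users, k))
I = rng.normal(scale=0.1, size=(num_items, k))

def score(u, i):
    # Equation 18: dot product of user and item latent vectors.
    return U[u] @ I[i]

def pairwise_log_sigmoid(u, i, j):
    # Equations 11, 16, 17: log-sigmoid of the score difference between a
    # preferred item i and a less preferred item j.
    x = score(u, i) - score(u, j)
    return np.log(1.0 / (1.0 + np.exp(-x)))

# One sampled triple (u, i, j): its contribution to the item-ranking term of Eq. 15.
print(pairwise_log_sigmoid(u=0, i=1, j=2))
```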
4.2. Transfer Learning

Transfer learning [25] is premised on the idea that a related pre-trained model can serve as an initializer for a main model. This initialization can be beneficial by speeding up learning and/or improving accuracy on the main task, as seen in Figure 3. Transfer learning is similar to multi-task learning (MTL), the main difference being the sequential versus simultaneous nature of the two techniques, respectively. Transfer learning has been successful in image processing [26] and natural language processing [27], among other areas of machine learning.

4.3. Auto-Encoders & Personalization

Auto-encoders have been successfully applied in personalization systems [7, 6]. Auto-encoders derive their name from the ability to encode input data with unsupervised learning. The utility of auto-encoders includes dimensionality reduction of the input while optimally ignoring noise in the input. For the purpose of personalization, entity vector data is passed as input with missing entries. The goal is to recover the original input in the output, including the missing entries. To the best of our knowledge, the pioneering research work in this area is AutoRec [7]. User vectors $\mathbf{y}_u$ or item vectors $\mathbf{y}_i$ can serve as input, where each vector component is the actual preference value or a missing entry. The authors of AutoRec stated that user vector inputs performed better than item vector inputs, and we observed the same in our experiments. Perhaps this is due to the peculiar characteristics of the datasets used, e.g., the number of users and items, ratings per item and ratings per user. Wu et al. presented a more sophisticated auto-encoder personalization technique, Collaborative Denoising Auto-Encoders (CDAE) [6], which incorporates denoising with dropout [28] and an extra identifier input. Dropout can be seen as a form of noise introduction [29].

Deep learning techniques have the advantage of being able to model linear and non-linear complex interactions between users and items. Auto-encoders for personalization are depicted in Figure 1. We denote the node vectors in the input layer as $\hat{y}_u^0$, the hidden layer as $\hat{y}_u^1$ and the output layer as $\hat{y}_u^2$, where

$$\hat{y}_u^0 = g_0(\mathbf{y}_u, \mathbf{u}), \qquad (22)$$

and $g_0$ is a concatenation function. The node vector in the hidden layer is

$$\hat{y}_u^1 = g_1(\mathbf{W}_1^T \cdot \hat{y}_u^0 + \mathbf{b}_1). \qquad (23)$$

$\mathbf{W}_1$ is the $d \times h$ weight matrix between the input and hidden layers, where $d$ and $h$ are the numbers of nodes in the input and hidden layers, respectively. $\mathbf{b}_1$ is the bias for the hidden layer and $g_1$ is an activation function.

$$\hat{y}_u^2 = g_2(\mathbf{W}_2^T \cdot \hat{y}_u^1). \qquad (24)$$

$\mathbf{W}_2$ is the $h \times d$ weight matrix between the hidden and output layers, and $g_2$ is an activation function. We use sigmoid activation functions since they produced optimal results. $\mathbf{W}_1$, $\mathbf{W}_2$ and $\mathbf{b}_1$ are model parameters. There are also hyper-parameters, such as the learning rate, batch size and objective function, that should be tuned during training with validation. We use the binary cross-entropy cost function

$$-\hat{y}_{u(i,j)} \ln y_{uij} - (1 - \hat{y}_{u(i,j)}) \ln(1 - y_{uij}), \qquad (25)$$

and backpropagation to update the model parameters.
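The following is a minimal NumPy sketch of the forward pass in Equations 22-24 (an illustration with assumed sizes; in particular, decoding back to the item dimension rather than the full concatenated input is our assumption, following CDAE's behaviour).

```python
import numpy as np

rng = np.random.default_rng(1)
num_items, k, h = 4, 8, 8           # |I|, user-embedding size, hidden nodes
d = num_items + k                   # input size after concatenation (Eq. 22)

W1 = rng.normal(scale=0.1, size=(d, h))          # input -> hidden weights (Eq. 23)
b1 = np.zeros(h)                                 # hidden-layer bias
W2 = rng.normal(scale=0.1, size=(h, num_items))  # hidden -> output weights (Eq. 24)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(y_u, user_embedding):
    y0 = np.concatenate([y_u, user_embedding])   # Eq. 22: g_0 is concatenation
    y1 = sigmoid(W1.T @ y0 + b1)                 # Eq. 23: hidden layer
    y2 = sigmoid(W2.T @ y1)                      # Eq. 24: reconstructed preferences
    return y1, y2

y_u = np.array([1.0, 0.0, 1.0, 0.0])             # one implicit feedback row of Y
hidden, reconstruction = forward(y_u, rng.normal(scale=0.1, size=k))
```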
4.4. NeuTraL Algorithm

The development of NeuTraL, as depicted in Figure 1, begins with the supposition that a more representative user embedding could improve the performance of the MF for personalized ranking. A pre-trained neural network model may be appropriate, since we are aware of the success of deep learning models in personalization systems. It has also been shown that neural networks are better at modelling complex non-linearity in user-item interactions than MF models [5]. We chose CDAE as our pre-training model based on its proven improvement over AutoRec. User latent features in MF can be considered a form of dimensionality reduction of the user preference vectors in $\mathbf{Y}$. A close look at both CDAE and MF reveals that the hidden layer nodes of CDAE are analogous to user latent features, as smaller-dimension versions of the original user vectors in $\mathbf{Y}$. This analogy implies that we can use a pre-trained $|U| \times k$ matrix $\mathbf{C}$ of hidden layer node values as the user latent feature matrix, which forms the basis for our contribution. We subsequently refer to $\mathbf{C}$ as the transfer matrix. In other words, we transfer the user vector $\mathbf{c}_u$ from $\mathbf{C}$ as the latent vector for user $u$. We leave out the algorithm for NeuTraL since it is essentially the same as the MPR algorithm with the use of the pre-trained user embedding from CDAE.
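Continuing the sketch above (same hypothetical names), the transfer step amounts to encoding every user's preference vector with the pre-trained hidden layer and stacking the activations into the $|U| \times k$ transfer matrix $\mathbf{C}$, which then initializes the user latent matrix of the ranking model.

```python
import numpy as np

def build_transfer_matrix(Y, encode):
    """Stack the pre-trained hidden activations of every user into C (|U| x k)."""
    return np.vstack([encode(Y[u]) for u in range(Y.shape[0])])

# Hypothetical usage: `encode` would be the hidden layer of the pre-trained
# CDAE-style model from the previous sketch; C then initializes the user latent
# matrix U of the MPR ranker, which is trained as usual afterwards.
# C = build_transfer_matrix(Y, encode)
# U = C.copy()
```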
Figure 2: NeuTraL-C: left side shows the pre-trained Auto-Encoder with the transfer to ATM-MPR (diagram elements: user/item ratings vector, hidden layer to be transferred after pre-training, output layer; knowledge transfer; user embedding, mapper matrix, item attributes, prediction function, predicted output, training, actual output).
5. NeuTraL-C: Neural Transfer Learning for Cold-Start Personalized Ranking

We provide further background on pertinent information that will aid the understanding of NeuTraL-C, as depicted in Figure 2.

5.1. Item Attribute-to-Feature Mappings

Cold-start items have little to no historical preference information to exploit for personalized ranking. Hence, recommending cold-start items poses a different challenge. However, both warm-start and cold-start items have item attributes that can be exploited for recommendations. An item Attribute-to-Feature Mapping (ATM) is a framework capable of providing item latent features from item attributes, i.e., a function that accepts item attributes as input and produces item latent features as output. The output can then be used in conjunction with user latent features for prediction. We consider the ATM technique presented by Gantner et al. [12], referred to as ATM-BPR in this work. ATM-MPR is an extension of the ATM-BPR technique for cold-start personalization.

5.1.1. ATM-MPR

ATM-MPR adds cold-start capability to MPR by learning a shallow linear model of latent features and attributes. The main difference between MPR and ATM-MPR is the derivation of the item latent vector $\mathbf{i}$, where

$$\mathbf{i} = \mathcal{M}(\mathbf{a}_i^I), \qquad (26)$$

and $\mathcal{M}$ is a mapping function:

$$\mathcal{M}(\mathbf{a}_i^I) = \mathbf{M} \cdot \mathbf{a}_i^I, \qquad (27)$$

where $\mathbf{M}$ is a mapper matrix to be learned, similar to how $\mathbf{U}$ and $\mathbf{I}$ are learned in MPR with GD. ATM-MPR optimizes the NeuTraL-C optimization criterion, which is the same as the MPR criterion in Equation 15. However, the respective prediction functions for user ranking and item ranking in NeuTraL-C are different. We subsequently describe the item ranking prediction function; the user ranking prediction function is analogous. The item ranking prediction function is expressed as

$$\hat{y}_{u(i,j)} = (\mathbf{u}^T \cdot \mathbf{M} \cdot \mathbf{a}_i^I) - (\mathbf{u}^T \cdot \mathbf{M} \cdot \mathbf{a}_j^I). \qquad (28)$$

With transfer learning, the prediction function becomes

$$\hat{y}_{u(i,j)} = (\mathbf{c}_u^T \cdot \mathbf{M} \cdot \mathbf{a}_i^I) - (\mathbf{c}_u^T \cdot \mathbf{M} \cdot \mathbf{a}_j^I), \qquad (29)$$

$$\frac{\partial \hat{y}_{u(i,j)}}{\partial \mathbf{M}} = \mathbf{c}_u^T (\mathbf{a}_i^I - \mathbf{a}_j^I). \qquad (30)$$

Hence, $\mathbf{M}$ is updated in GD with the following expression:

$$\mathbf{M} = \mathbf{M} + \alpha \left( \frac{\partial\, \text{NeuTraL-C-opt}}{\partial \mathbf{M}} \right), \qquad (31)$$

$$\mathbf{M} = \mathbf{M} + \alpha \left( \frac{\partial \ell(\hat{y}_{u(i,j)})}{\partial \hat{y}_{u(i,j)}} \cdot \frac{\partial \hat{y}_{u(i,j)}}{\partial \mathbf{M}} - \lambda_M \cdot \mathbf{M} \right), \qquad (32)$$

where $\lambda_M$ is a regularization hyper-parameter.

5.2. NeuTraL-C Algorithm

The NeuTraL-C algorithm is listed in Algorithm 1.

Algorithm 1 NeuTraL-C($U$, $I$, $\mathbf{A}^I$)
1: Output: optimized matrices $\mathbf{U}$ and $\mathbf{M}$
2: initialize $\mathbf{U}$ with the extracted hidden layer matrix $\mathbf{C}$ from CDAE
3: initialize $\mathbf{I}$, $\mathbf{M}$ and the remaining parameters
4: repeat
5:   draw $u, i, j$ from $U, I_u^+, I_u^-$ uniformly
6:   $\mathbf{u} \leftarrow \mathbf{u} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{u}$
     $\mathbf{M} \leftarrow \mathbf{M} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{M}$ w.r.t. $\mathbf{a}_i^I$ and $\mathbf{a}_j^I$
7:   draw $i, v, w$ from $I, U_i^+, U_i^-$ uniformly
8:   $\mathbf{M} \leftarrow \mathbf{M} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{M}$ w.r.t. $\mathbf{v}$ and $\mathbf{w}$
     $\mathbf{v} \leftarrow \mathbf{v} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{v}$
     $\mathbf{w} \leftarrow \mathbf{w} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{w}$
9: until convergence or the maximum number of iterations
10: return $\mathbf{U}$, $\mathbf{M}$
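To make Equations 27-30 and Algorithm 1 more concrete, the following is a minimal sketch (our own illustrative code; the user-ranking side and the regularization term of Equation 32 are omitted) of one stochastic update of the mapper matrix $\mathbf{M}$ from a sampled triple $(u, i, j)$.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 8, 19                               # latent features, item-attribute dimension
M = rng.normal(scale=0.01, size=(k, n))    # mapper matrix (Eq. 27)
c_u = rng.normal(scale=0.1, size=k)        # transferred user vector from C
a_i = rng.integers(0, 2, size=n).astype(float)   # attributes of preferred item i
a_j = rng.integers(0, 2, size=n).astype(float)   # attributes of less preferred item j
alpha = 0.01                               # learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eq. 29: pairwise prediction with the transferred user vector and mapped items.
x_uij = c_u @ (M @ a_i) - c_u @ (M @ a_j)

# d ln(sigma(x)) / dx = 1 - sigma(x); Eq. 30: dx/dM is the outer product c_u (a_i - a_j)^T.
grad_M = (1.0 - sigmoid(x_uij)) * np.outer(c_u, a_i - a_j)

# Gradient-ascent step on M, as in Eq. 31 (regularization term of Eq. 32 omitted).
M = M + alpha * grad_M
```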
6. Experiments

We proceed to address the following research questions:

• How does NeuTraL compare with other SOTA warm-start item personalization systems?
• How does NeuTraL-C compare with other SOTA cold-start item personalization systems?

We begin by describing our experiment setup. We subsequently describe our experiments on warm-start personalized ranking, followed by cold-start.

6.1. Experimental Repeatability

Experiment artifacts (software, datasets, etc.) for this work are available on demand. These artifacts will be made publicly available with publication. All of the techniques use GD and/or Adam for training, as is the case in NeuTraL, where we use Adam for pre-training CDAE but use GD for the actual training in the ATM-BPR framework. The benchmarks converge differently during training depending on hyperparameters, but one factor that affects the space and time requirements of each epoch is the size of the model parameters. Avoidance of bias forms the basis for model design and the other hyperparameter selections throughout our experiments. We use one hidden layer in the deep models. We use 100 factors in the MF models, and the number of nodes in the deep learning models is likewise 100. We used the tower architecture for the deep learning models. We used learning rates between 0.00001 and 0.01 and batch sizes of 10,000. We tuned model hyperparameters and stopped training early with validation.

6.2. Evaluation metrics

Evaluation is done with 5-fold cross validation. We use three popular information retrieval metrics: MRR, NDCG and AUC, which are described further in subsequent subsections. We evaluate the techniques on their ability to rank items relative to 9 and 99 other items. The ranking metrics relative to 9 other items are denoted @10; e.g., MRR@10 measures the MRR score for a technique when ranking 1 of 10 items for a user.
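As an illustration of the @10 protocol described above (our own sketch, not the evaluation code used for the reported results), the held-out item can be ranked against nine sampled negatives and the reciprocal rank averaged over test interactions.

```python
import numpy as np

def mrr_at_10(score, test_pairs, num_items, rng):
    """score(u, i) -> float; test_pairs: iterable of held-out (user, item) pairs."""
    reciprocal_ranks = []
    for u, pos in test_pairs:
        # Sample 9 other items as negatives (a fuller version would also exclude
        # items the user interacted with during training).
        negatives = [i for i in rng.permutation(num_items) if i != pos][:9]
        candidates = [pos] + negatives
        scores = np.array([score(u, i) for i in candidates])
        rank = 1 + int(np.sum(scores > scores[0]))  # rank of the held-out item
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Hypothetical usage with the MF scorer sketched in Section 4.1:
# rng = np.random.default_rng(0)
# print(mrr_at_10(score, [(0, 1), (2, 3)], num_items=4, rng=rng))
```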
Table 1: Datasets

Dataset        #Users   #Items   #Ratings
Movielens 1M    6,040    3,706   1,000,209
Eachmovie      72,916    1,628   2,811,983
Pinterest      55,187    9,916   1,500,809
Goodreads      10,000    5,000     647,458

Table 2: Movielens results on warm-start items

Metrics    IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.246   0.409   0.400   0.421   0.437
NDCG@10    0.310   0.485   0.480   0.497   0.515
MRR        0.270   0.424   0.415   0.435   0.451
NDCG       0.417   0.548   0.542   0.557   0.570
AUC        0.853   0.921   0.923   0.924   0.929

6.3. Experiments for warm-start ranking
6.3.1. Datasets

We performed experiments on four publicly available datasets. A summary of these datasets is provided in Table 1. The datasets contain explicit ratings for users on items, but we convert the ratings to implicit feedback by treating ratings greater than 0 as positive feedback. Our focus in this work is implicit feedback, but we believe NeuTraL is applicable to explicit feedback.

• Movielens 1M: the Movielens datasets [30] are made publicly available by the GroupLens Research lab at the University of Minnesota. We use the Movielens 1M dataset. The data is extracted from the Movielens website, a free website that provides personalized movie recommendations to users.
• Eachmovie dataset: this dataset [31] is made available by the Digital Equipment Corporation (DEC) Systems Research Center at Compaq. The research center ran a CF service for experimental purposes and made the data available for research.
• Goodreads dataset: this dataset [32] was collected from goodreads.com, a book social network and recommendation website.
• Pinterest dataset: this is a dataset of implicit feedback representing whether a user pinned an image on their board on the Pinterest platform at https://www.pinterest.com.

Table 3: Pinterest results on warm-start items

Metrics    IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.111   0.475   0.465   0.487   0.492
NDCG@10    0.151   0.566   0.559   0.578   0.584
MRR        0.138   0.483   0.475   0.496   0.501
NDCG       0.298   0.600   0.595   0.611   0.615
AUC        0.724   0.947   0.955   0.958   0.960

Table 4: Goodreads (books) results on warm-start items

Metrics    IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.087   0.170   0.167   0.239   0.245
NDCG@10    0.114   0.224   0.217   0.302   0.309
MRR        0.112   0.197   0.193   0.262   0.268
NDCG       0.266   0.353   0.348   0.410   0.415
AUC        0.590   0.793   0.770   0.829   0.834
6.3.2. Benchmarks

We compare our NeuTraL technique with three SOTA personalization systems and a baseline item popularity (IPop) technique. IPop recommends items based on popularity. As noted in Section 6.1, the benchmarks converge differently during training depending on hyperparameters, but one factor that affects the space and time requirements of each epoch is the size of the model parameters. We select model parameters to avoid bias throughout our experiments. The SOTA benchmarks used are described below:

• BPR [3]: Bayesian personalized ranking from implicit feedback, a pairwise ranking technique.
• Multi-objective pairwise ranking (MPR) [33]: MPR is an MTL technique that combines the item ranking and user ranking tasks. MTL learns from historical preference data from both the item and user ranking perspectives. MPR was demonstrated to be able to improve item ranking accuracy by learning from both perspectives.
• Neural Collaborative Filtering (NCF) [5]: NCF is an ensemble recommender that combines MF and deep learning. NCF was demonstrated to achieve superior performance compared to other SOTA techniques.
Table 5: Eachmovie results on warm-start items

Metrics    IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.123   0.284   0.261   0.275   0.293
NDCG@10    0.159   0.357   0.329   0.349   0.368
MRR        0.149   0.305   0.284   0.296   0.313
NDCG       0.303   0.449   0.430   0.442   0.456
AUC        0.646   0.861   0.841   0.857   0.862

6.3.3. Results

We record the best average results observed during experiments for each dataset and report them in Tables 2-5. NeuTraL significantly out-performs the other techniques based on a Wilcoxon signed-rank test with a p-value < 0.01. The winning algorithm per metric is emboldened in each row of all tables. We assume a margin of error of 0.005; hence, the winning algorithm has to be greater than the next winner by at least a margin of 0.005. All techniques are emboldened in the case of a tie on a metric. Techniques within the margin of error of the highest score are also emboldened.
6.4. Experiments for cold-start ranking

6.5. Datasets

We performed experiments on three of the four publicly available datasets used for the warm-start experiments in Section 6.3.1. We used the datasets with item attributes, hence their suitability for our experiments. A summary of these datasets is provided in Table 1. The three datasets used for the cold-start personalization experiments are highlighted below:

• Movielens 1M: item attributes in the dataset include release year and genre. The genre attribute is one-hot encoded into 18 dimensions because there are 18 possible genres. The year is an additional dimension.
• Eachmovie dataset: the items/movies in this dataset are a subset of the items in the Movielens dataset, hence we are able to use the same attribute feature engineering as described for Movielens.
• Goodreads dataset: we use the genres as book attributes for cold-start personalization. The genre attribute is one-hot encoded into 10 dimensions, one per possible genre.

6.5.1. Benchmarks

We compare our NeuTraL-C technique with four state-of-the-art cold-start personalization systems. NeuTraL-C, DropoutNet and ATM-BPR require pre-training. The benchmarks used are described below:

• Multi-layer perceptron (MLP): the MLP baseline used here predicts output from interactions between the user embedding and the item attributes with deep learning. The first hidden layer is the input combination layer that combines the user embedding input and the item attributes. The combination model is the piece-wise (element-wise) product, since this has been demonstrated to outperform concatenation or a dot product [34]. The dot product also does not allow us to assign different weights to the combined nodes. The output from this combination layer is propagated through extra hidden layers. More hidden layers can be added as needed before the final output. (A small sketch of this combination appears after this list.)
• ATM-BPR: the ATM-BPR technique used as a baseline here is described in Section 5.1, except that the pre-trained user embedding is extracted from BPR instead of from a CDAE recommender, as is used in NeuTraL-C.
• DropoutNet: DropoutNet [22] ("DropoutNet: Addressing Cold Start in Recommender Systems") is a state-of-the-art deep learning based personalization system. DropoutNet is analogous to NeuTraL and ATM-BPR, but it adopts a different transfer learning procedure compared to NeuTraL: DropoutNet transfers a pre-trained shallow model to a deep model, while NeuTraL transfers a pre-trained deep model to a shallow model. We use the MLP model described here as the deep learning model. DropoutNet allows the use of different pre-trained models, but we use pre-trained user latent features from CDAE, similar to NeuTraL-C, i.e., the DropoutNet implementation used here is a combination of the extracted user latent factors from CDAE and the MLP. Although DropoutNet is primarily a cold-start recommender, it is expected to perform relatively well on warm-start recommendations with an appropriate dropout rate. We use a maximum input dropout rate of 1.00 for our experiments with DropoutNet to maximize performance on cold start, because that is the focus of this research work. DropoutNet also allows an inference transform, but we do not apply it in our experiments because we do not consider the case of incremental item preference data collection as described in their work. We refer to DropoutNet as D-Net to conserve space in the results tables.
• Wide & Deep (W&D): Wide & Deep Learning for Recommender Systems (W&D) [19] combines the generalization and memorization capabilities of recommender systems for more robust personalization. The authors used deep learning for its demonstrated superior generalization capability. However, deep learning tends to over-generalize when the input is too sparse and high-rank. On the other hand, generalized linear models are highly capable of memorizing feature interactions through cross-product feature transformations. Hence the combination of a deep learning model and a cross-product (wide) model in W&D for personalization.
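As referenced in the MLP bullet above, the following is a minimal sketch of the input combination layer (our own construction; projecting the item attributes to the embedding dimension before the element-wise product is our assumption, since that product requires matching sizes).

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 8, 19                     # user-embedding size, item-attribute dimension

W_attr = rng.normal(scale=0.1, size=(n, k))   # assumed projection of attributes to k dims
W_h = rng.normal(scale=0.1, size=(k, k))      # one extra hidden layer
w_out = rng.normal(scale=0.1, size=k)         # final output weights

def relu(x):
    return np.maximum(x, 0.0)

def mlp_score(user_embedding, item_attributes):
    item_vec = item_attributes @ W_attr       # project attributes into embedding space
    combined = user_embedding * item_vec      # element-wise (piece-wise) product layer
    hidden = relu(combined @ W_h)             # extra hidden layer
    return float(hidden @ w_out)              # predicted preference score
```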
Table 6: Movielens results on cold-start items

Metrics    W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.043   0.050   0.070     0.053   0.083
NDCG@10    0.053   0.059   0.100     0.063   0.117
MRR        0.083   0.089   0.097     0.093   0.109
NDCG       0.244   0.249   0.257     0.252   0.269
AUC        0.604   0.610   0.629     0.617   0.656

Table 7: Goodreads results on cold-start items

Metrics    W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.030   0.036   0.057     0.054   0.077
NDCG@10    0.037   0.045   0.088     0.067   0.114
MRR        0.067   0.076   0.083     0.101   0.107
NDCG       0.228   0.238   0.245     0.264   0.271
AUC        0.570   0.603   0.588     0.672   0.689
6.5.2. Evaluation metrics for cold-start

We measured how well a recommender system is able to rank a preferred cold-start item relative to other items. The evaluation is similar to the evaluation for warm-start items. The main difference is the absence of the test items from the training dataset for cold-start personalized ranking.

6.5.3. Results

We record the best results observed during experiments for each dataset and report them in Tables 6-8. NeuTraL-C performs best overall, and we subsequently discuss the results further. The winning algorithm per metric is emboldened in each row of all tables. We assume a margin of error of 0.005; hence, the winning algorithm has to be greater than the next winner by at least a margin of 0.005. All techniques are emboldened in the case of a tie on a metric. Techniques within the margin of error of the highest score are also emboldened.

Table 8: Eachmovie results on cold-start items

Metrics    W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.031   0.032   0.052     0.032   0.055
NDCG@10    0.037   0.038   0.072     0.038   0.068
MRR        0.065   0.065   0.076     0.065   0.075
NDCG       0.221   0.222   0.232     0.221   0.237
AUC        0.490   0.492   0.507     0.481   0.525
6.6. Discussion

We begin our discussion with the results of the warm-start experiments. NeuTraL performed best overall because it has the highest number of wins, which corresponds to the number of times a technique has the highest score per dataset. We also validated this observation with a significance test. IPop has the worst performance overall. This is not surprising, since it is merely a baseline technique that ranks items based on popularity. The ranking produced by IPop is not personalized, as it does not take personal attributes, context or historical preference into account. We expect a decent personalized ranking technique to out-perform IPop. This is the case: the least performing personalized ranking technique is BPR, but it outperforms IPop. NCF performs better than BPR. This was already demonstrated by the creators of NCF in their research work [5]. NCF combines both deep learning (MLP) and a piece-wise product of interactions between user and item embeddings in a generalized matrix factorization (GMF). BPR uses a dot product of user and item embeddings to represent the interactions. The dot product assigns equal weights to the LVPs, as described in Section 4.1, while the GMF component of NCF learns different weights for the LVPs with a neural network. The MLP component of NCF also learns different weights for user and item embedding combinations. This results in a more complex representation of the interactions between users and items and better performance. MPR out-performs NCF. The MTL nature of MPR gives it an advantage. NeuTraL's superior performance buttresses the effectiveness of transfer learning: it is essentially MPR combined with transfer learning, yet it outperforms MPR. We surmise that transfer learning improved the performance of NeuTraL. We also believe that the type of pre-trained model that is transferred is significant. Our experiment here reveals that the extraction mechanism from an autoencoder-based model like CDAE is effective.

We subsequently discuss the results of our experiments on cold-start personalization. NeuTraL-C performed best overall because it has the highest number of wins, which corresponds to the number of times a technique has the highest score per dataset. We also validated this observation with a significance test. ATM-BPR is the next best performing technique. Both ATM-BPR and NeuTraL-C adopt transfer learning. However, NeuTraL-C uses a different pre-trained model: NeuTraL-C uses a pre-trained model extracted from CDAE, as described in Section 4.4, while ATM-BPR uses a pre-trained user embedding from BPR. This shows that it is not enough to just apply transfer learning; the meticulousness of the implementation is as important. The type of pre-trained model is pertinent in such a design. NeuTraL-C and ATM-BPR also differ in how they learn the "mapping function": NeuTraL-C uses MPR while ATM-BPR uses BPR. DropoutNet performs next best to ATM-BPR. DropoutNet also uses transfer learning; we used the user embedding from CDAE in DropoutNet. However, it uses deep learning to learn the interaction between the transferred embedding and the item attributes. The complex nature of DropoutNet deteriorated its performance somewhat. For instance, the transferred user embedding is propagated through hidden layers before combination with the item attributes. The output of the hidden layers is a tainted version of the user embedding. The mapping learned by DropoutNet is between this tainted version and the item attributes. We believe this is the reason for its poorer performance compared to ATM-BPR and NeuTraL-C. It is not too surprising that MLP performed worse than DropoutNet, since it is DropoutNet without transfer learning. Once again, this shows the effectiveness of transfer learning. W&D performed the worst of all the cold-start personalization systems. It does not use transfer learning, and we believe the complexity of deep learning in W&D deteriorated performance due to overfitting.

A common theme throughout our experiments is the benefit of our neural transfer learning approach. We believe that the transferred user embedding is more representative of the users as latent factors compared to the user embedding in the other models. We show a chart of loss minimization in NeuTraL with and without transfer learning on the Movielens data in Figure 3. Figure 3 shows the speed-up achieved with transfer learning in the form of a lower initial loss. Figure 3 also shows the overall lower loss with training. We know that ATM-BPR and DropoutNet adopt transfer learning as well, but they are outperformed by NeuTraL. As stated in Section 4.3, dropout is a vital component of CDAE, hence we investigated the effect of dropout during pre-training on the final results. The results show that dropout slightly enhances the effect of the transferred user embedding in NeuTraL.
           Β·104               without transfer learning             426β434. URL: http://doi.acm.org/10.1145/1401890.
       2
                               with transfer learning               1401944. doi:10.1145/1401890.1401944.
                                                                [5] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua,
                                                                    Neural collaborative filtering, in: Proceedings of
 1.5                                                                the 26th International Conference on World Wide
                                                                    Web, WWW β17, International World Wide Web
                                                                    Conferences Steering Committee, 2017, pp. 173β
                                                                    182. URL: https://doi.org/10.1145/3038912.3052569.
loss
       1
                                                                    doi:10.1145/3038912.3052569.
                                                                [6] Y. Wu, C. DuBois, A. X. Zheng, M. Ester, Col-
                                                                    laborative denoising auto-encoders for top-n rec-
 0.5                                                                ommender systems, in: Proceedings of the Ninth
                                                                    ACM International Conference on Web Search and
                                                                    Data Mining, WSDM β16, ACM, New York, NY,
                                                                    USA, 2016, pp. 153β162. URL: http://doi.acm.org/
                  5          10            15             20        10.1145/2835776.2835837. doi:10.1145/2835776.
                            epoch                                   2835837.
                                                                [7] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, Au-
Figure 3: Effect of Transfer Learning with NeuTraL-C on             torec: Autoencoders meet collaborative filtering,
Movielens dataset.                                                  in: Proceedings of the 24th International Confer-
                                                                    ence on World Wide Web, WWW β15 Compan-
                                                                    ion, ACM, New York, NY, USA, 2015, pp. 111β112.
                                                                    URL: http://doi.acm.org/10.1145/2740908.2742726.
room for future work and improvements. Potential future
                                                                    doi:10.1145/2740908.2742726.
research work include the extension of our techniques
                                                                [8] Y. Zheng, B. Tang, W. Ding, H. Zhou, A neu-
to user cold-start, full cold-start and warm-start rank-
                                                                    ral autoregressive approach to collaborative filter-
ing. Other potential future work includes investigation
                                                                    ing, in: Proceedings of the 33rd International Con-
of additional attributes and optimum fusion strategy of
                                                                    ference on International Conference on Machine
those attributes. We believe experimentation with more
                                                                    Learning - Volume 48, ICMLβ16, JMLR.org, 2016,
datasets and context attributes such as time and location
                                                                    pp. 764β773. URL: http://dl.acm.org/citation.cfm?
would also be worthwhile.
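To make the idea of the transfer concrete, the sketch below illustrates one generic way such a setup can be wired: pre-trained user factors from a source model are frozen inside a shallow ranker, and only a mapping from item attributes to the latent space is learned, so training amounts to learning the interaction between the transferred embedding and item attributes. This is an illustrative PyTorch sketch, not the exact NeuTraL-C architecture; the tensor names, dimensions, and the pairwise (BPR-style) objective are assumptions made for the example.

# Minimal, generic sketch of transferring pre-trained user factors into a
# shallow item cold-start ranker (illustrative only, not the NeuTraL-C model).
# Assumptions: `pretrained_user_factors` comes from a source model and
# `item_attrs` are content attribute vectors for items.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransferredColdStartRanker(nn.Module):
    def __init__(self, pretrained_user_factors, attr_dim, latent_dim):
        super().__init__()
        # The transferred user embeddings are frozen; only the attribute
        # mapping is learned, i.e. the interaction between the transferred
        # embedding and item attributes.
        self.user_emb = nn.Embedding.from_pretrained(
            pretrained_user_factors, freeze=True)
        self.attr_to_latent = nn.Linear(attr_dim, latent_dim)

    def score(self, users, item_attrs):
        u = self.user_emb(users)             # (batch, latent_dim), frozen
        v = self.attr_to_latent(item_attrs)  # (batch, latent_dim), learned
        return (u * v).sum(dim=-1)           # dot-product preference score

    def bpr_loss(self, users, pos_attrs, neg_attrs):
        # Pairwise objective: observed items should outrank sampled negatives.
        diff = self.score(users, pos_attrs) - self.score(users, neg_attrs)
        return -F.logsigmoid(diff).mean()

# Toy usage with random data, for illustration only.
num_users, latent_dim, attr_dim = 100, 16, 32
pretrained = torch.randn(num_users, latent_dim)  # stand-in for source-model factors
model = TransferredColdStartRanker(pretrained, attr_dim, latent_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

users = torch.randint(0, num_users, (64,))
pos_attrs = torch.randn(64, attr_dim)
neg_attrs = torch.randn(64, attr_dim)
loss = model.bpr_loss(users, pos_attrs, neg_attrs)
opt.zero_grad()
loss.backward()
opt.step()

Freezing the transferred user factors keeps the shallow model small; only the attribute mapping is updated during training.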
References

[1] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 263–272. doi:10.1109/ICDM.2008.22.
[2] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30–37. URL: http://dx.doi.org/10.1109/MC.2009.263. doi:10.1109/MC.2009.263.
[3] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, AUAI Press, Arlington, Virginia, United States, 2009, pp. 452–461. URL: http://dl.acm.org/citation.cfm?id=1795114.1795167.
[4] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, ACM, New York, NY, USA, 2008, pp. 426–434. URL: http://doi.acm.org/10.1145/1401890.1401944. doi:10.1145/1401890.1401944.
[5] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, International World Wide Web Conferences Steering Committee, 2017, pp. 173–182. URL: https://doi.org/10.1145/3038912.3052569. doi:10.1145/3038912.3052569.
[6] Y. Wu, C. DuBois, A. X. Zheng, M. Ester, Collaborative denoising auto-encoders for top-n recommender systems, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM '16, ACM, New York, NY, USA, 2016, pp. 153–162. URL: http://doi.acm.org/10.1145/2835776.2835837. doi:10.1145/2835776.2835837.
[7] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, AutoRec: Autoencoders meet collaborative filtering, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, ACM, New York, NY, USA, 2015, pp. 111–112. URL: http://doi.acm.org/10.1145/2740908.2742726. doi:10.1145/2740908.2742726.
[8] Y. Zheng, B. Tang, W. Ding, H. Zhou, A neural autoregressive approach to collaborative filtering, in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML '16, JMLR.org, 2016, pp. 764–773. URL: http://dl.acm.org/citation.cfm?id=3045390.3045472.
[9] M. Bianchi, F. Cesaro, F. Ciceri, M. Dagrada, A. Gasparin, D. Grattarola, I. Inajjar, A. M. Metelli, L. Cella, Content-based approaches for cold-start job recommendations, in: Proceedings of the Recommender Systems Challenge 2017, RecSys Challenge '17, ACM, New York, NY, USA, 2017, pp. 6:1–6:5. URL: http://doi.acm.org/10.1145/3124791.3124793. doi:10.1145/3124791.3124793.
[10] A. I. Schein, A. Popescul, L. H. Ungar, D. M. Pennock, Methods and metrics for cold-start recommendations, in: SIGIR '02, 2002.
[11] A. Arampatzis, G. Kalamatianos, Suggesting points-of-interest via content-based, collaborative, and hybrid fusion methods in mobile devices, ACM Trans. Inf. Syst. 36 (2017) 23:1–23:28. URL: http://doi.acm.org/10.1145/3125620. doi:10.1145/3125620.
[12] Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, L. Schmidt-Thieme, Learning attribute-to-feature mappings for cold-start recommendations, in: 2010 IEEE International Conference on Data Mining, 2010, pp. 176–185. doi:10.1109/ICDM.2010.129.
[13] A. v. d. Oord, S. Dieleman, B. Schrauwen, Deep content-based music recommendation, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS '13, Curran Associates Inc., USA, 2013, pp. 2643–2651. URL: http://dl.acm.org/citation.cfm?id=2999792.2999907.
[14] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA, 2016.
[15] T. T. Nguyen, H. W. Lauw, Collaborative topic regression with denoising autoencoder for content and community co-representation, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, ACM, New York, NY, USA, 2017, pp. 2231–2234. URL: http://doi.acm.org/10.1145/3132847.3133128. doi:10.1145/3132847.3133128.
[16] H. Wang, N. Wang, D.-Y. Yeung, Collaborative deep learning for recommender systems, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, ACM, New York, NY, USA, 2015, pp. 1235–1244. URL: http://doi.acm.org/10.1145/2783258.2783273. doi:10.1145/2783258.2783273.
[17] G. Sottocornola, F. Stella, M. Zanker, F. Canonaco, Towards a deep learning model for hybrid recommendation, in: Proceedings of the International Conference on Web Intelligence, WI '17, ACM, New York, NY, USA, 2017, pp. 1260–1264. URL: http://doi.acm.org/10.1145/3106426.3110321. doi:10.1145/3106426.3110321.
[18] W. Niu, J. Caverlee, H. Lu, Neural personalized ranking for image recommendation, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 423–431. URL: https://doi.org/10.1145/3159652.3159728. doi:10.1145/3159652.3159728.
[19] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, Association for Computing Machinery, New York, NY, USA, 2016, pp. 7–10. URL: https://doi.org/10.1145/2988450.2988454. doi:10.1145/2988450.2988454.
[20] Y. Zhu, J. Lin, S. He, B. Wang, Z. Guan, H. Liu, D. Cai, Addressing the item cold-start problem by attribute-driven active learning, IEEE Transactions on Knowledge and Data Engineering 32 (2020) 631–644.
[21] M. Yan, J. Sang, T. Mei, C. Xu, Friend transfer: Cold-start friend recommendation with cross-platform transfer learning of social knowledge, in: 2013 IEEE International Conference on Multimedia and Expo (ICME), 2013, pp. 1–6.
[22] M. Volkovs, G. Yu, T. Poutanen, DropoutNet: Addressing cold start in recommender systems, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 4957–4966.
[23] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in: Proceedings of the 22nd International Conference on Machine Learning, ICML '05, ACM, New York, NY, USA, 2005, pp. 89–96. URL: http://doi.acm.org/10.1145/1102351.1102363. doi:10.1145/1102351.1102363.
[24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014). URL: http://dblp.uni-trier.de/db/journals/corr/corr1412.html#KingmaB14.
[25] L. Torrey, J. Shavlik, Transfer learning, 2009.
[26] A. Quattoni, Transfer learning algorithms for image classification, Ph.D. thesis, Citeseer, 2009.
[27] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for NLP, arXiv preprint arXiv:1902.00751 (2019).
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958. URL: http://dl.acm.org/citation.cfm?id=2627435.2670313.
[29] C. M. Bishop, Training with noise is equivalent to Tikhonov regularization, Neural Computation 7 (1995) 108–116.
[30] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015) 19:1–19:19. URL: http://doi.acm.org/10.1145/2827872. doi:10.1145/2827872.
[31] P. McJones, EachMovie collaborative filtering dataset, DEC Systems Research Center, http://www.research.compaq.com/src/eachmovie/, 1997.
[32] M. Wan, J. J. McAuley, Item recommendation on monotonic behavior chains, in: S. Pera, M. D. Ekstrand, X. Amatriain, J. O'Donovan (Eds.), Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, ACM, 2018, pp. 86–94. URL: https://doi.org/10.1145/3240323.3240369. doi:10.1145/3240323.3240369.
[33] R. Otunba, R. A. Rufai, J. Lin, MPR: Multi-objective pairwise ranking, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 170–178. URL: https://doi.org/10.1145/3109859.3109903. doi:10.1145/3109859.3109903.
[34] R. Otunba, R. A. Rufai, J. Lin, Deep stacked ensemble recommender, in: Proceedings of the 31st International Conference on Scientific and Statistical Database Management, SSDBM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 197–201. URL: https://doi.org/10.1145/3335783.3335809. doi:10.1145/3335783.3335809.