=Paper=
{{Paper
|id=Vol-3135/darliap_paper2
|storemode=property
|title=NeuTraL: Neural Transfer Learning for Personalized Ranking
|pdfUrl=https://ceur-ws.org/Vol-3135/darliap_paper2.pdf
|volume=Vol-3135
|authors=Rasaq Otunba 
|dblpUrl=https://dblp.org/rec/conf/edbt/Otunba22
}}
==NeuTraL: Neural Transfer Learning for Personalized Ranking==
Rasaq Otunba
4400 University Drive, Fairfax, Virginia 22030
Abstract

Personalized ranking continues to be an important aspect of many information and personalization systems. Neural networks and deep learning continue to gain popularity because of their success in different fields of artificial intelligence such as computer vision and natural language processing. Recently, researchers began to apply deep learning to personalized ranking with success. Most personalization systems exploit historical preference data for users and items in the warm-start scenario. A major challenge in personalized ranking occurs in the cold-start scenario, which arises when there is little to no historical preference information. Content information is sometimes available, and it can be used to alleviate the cold-start problem.

We propose a solution that involves transfer learning from a deep model to a shallow model for both warm-start and cold-start personalized ranking. We corroborate our proposal with experiments on publicly available datasets in comparison with baseline and state-of-the-art techniques.

Keywords
neural networks; deep learning; recommendations; personalization; cold-start; ranking
1. Introduction

Personalized ranking with adequate historical preference is referred to as warm-start, while recommendation with inadequate historical preference is referred to as cold-start. We subsequently refer to personalized ranking as ranking except where otherwise clearly stated. We propose a machine learning solution called Neural Transfer Learning for warm-start personalized ranking, referred to as NeuTraL. We then propose a cold-start version of NeuTraL referred to as NeuTraL-C. NeuTraL and NeuTraL-C use neural networks and transfer learning for warm-start and cold-start item ranking, respectively. Item cold-start personalized ranking involves ranking cold-start items, while user cold-start personalized ranking involves ranking cold-start users. There is also the full cold-start entity personalized ranking problem, where both the user and item entities have no historical preference information. Although we focus on cold-start item personalized ranking in this work, we believe the concept is extensible to both the user cold-start and full cold-start personalized ranking problems. Entity content information is sometimes used to compensate for the lack of historical preference information by learning from content information and existing preference information. Ranking can be done for implicit or explicit feedback [1]. We focus on implicit feedback in this work due to its more prevalent nature. The contributions made in this work include:

• We propose a unique approach to extracting pre-trained user latent factors from a state-of-the-art (SOTA) personalization model.
• We transfer the pre-trained user latent factors to a renowned personalization model for warm-start and cold-start ranking, respectively.
• We provide a thorough evaluation and conduct experiments comparing our proposed solutions with other SOTA and baseline techniques.

The remainder of this paper is organized as follows: in Section 2, we highlight related work. We provide pertinent background and notations for the rest of this work in Section 3. We describe our approach in Sections 4 and 5. In Section 6, we describe our experiments and discuss the results. We conclude with potential directions for future work in Section 7.

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK. Contact: rotunba@gmu.edu (R. Otunba). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Work

Personalized ranking techniques typically belong to one of the following categories: collaborative filtering (CF), content-based, or a hybrid of the two. Different CF techniques, ranging from matrix factorization (MF) [2, 3] to k-Nearest Neighbor (kNN) [4], have seen success in personalization systems research. In recent years, deep learning has also been successfully applied to personalization. He et al. replaced the typical dot product of user and item latent features with a deep learning model in their technique referred to as neural collaborative filtering, NCF [5]. NCF performs better than vanilla MF because the non-linearity of the deep learning model captures complex interactions between users and items better.
Deep representation models such as autoencoders and restricted Boltzmann machines (RBM) have been used for personalization [6, 7, 8]. These techniques have been successfully applied and demonstrated on a variety of real-world data, but they are known to suffer from the cold-start problem. Content-based techniques are typically used to tackle the cold-start problem by incorporating entity attributes [9, 10]. Entity attributes are sometimes combined with CF to compensate for the weakness of CF [11, 12] in the cold-start scenario. To alleviate the cold-start problem, some deep learning techniques have been developed that use content information, e.g., the deep content-based music recommendation work proposed by Oord et al. [13]. Most of the deep learning personalization systems proposed for cold start are hybrid in that they combine historical preference and content information [14, 15, 16, 17, 18, 19]. Some of the cold-start personalization systems [20] adopt active learning. However, there are situations where active feedback from users for cold-start items is unavailable. Transfer learning has also been used in personalization systems research [21, 22].
3. Background & Notations

The sets of users and items are denoted by $U$ and $I$, respectively. A measure of preference is recorded as a positive feedback value from some set $S$, or as a negative feedback recorded as 0. When explicitly provided, $S$ could be a set of values, e.g., $\{1, 2, \ldots, 5\}$. When implicitly provided, typically $S = \{0, 1\}$. The matrix of user-item interactions is denoted by

$$\mathbf{Y} \in (\{0\} \cup S)^{|U| \times |I|}, \qquad (1)$$

where an interaction refers to an observable action by a user, e.g., the purchase of an item. The user vector for user $u$ in $\mathbf{Y}$ is denoted as $\mathbf{y}_u$. Conversely, the item vector for item $i$ in $\mathbf{Y}$ is denoted as $\mathbf{y}_i$. The implicit feedback for a user $u \in U$ on an item $i \in I$ is

$$y_{ui} = \begin{cases} 1, & \text{if } u \text{ interacted with } i;\\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

$$I_u^+ = \{\text{set of items interacted with by user } u\}, \qquad (3)$$

$$I_u^- = I - I_u^+. \qquad (4)$$

$U_i^+$, $U_i^-$, and $U_i$ are user sets analogous to the definitions in Equations 3-4. $\mathbf{A}^U$ and $\mathbf{A}^I$ represent the $m$-dimensional user-attribute and $n$-dimensional item-attribute matrices, respectively:

$$\mathbf{A}^U \in \mathbb{R}^{|U| \times m}, \qquad (5)$$

$$\mathbf{A}^I \in \mathbb{R}^{|I| \times n}. \qquad (6)$$

Let $\mathbf{a}_u^U$ be the vector of user attributes $1 \ldots m$ for user $u$, and $\mathbf{a}_i^I$ be the vector of item attributes $1 \ldots n$ for item $i$, so that $a_{ij}^I$ is the $j$-th item attribute value and $a_{uj}^U$ is the $j$-th user attribute value. $a_{ij}^I = 0$ when the attribute is unavailable. The sets $U$ and $I$ are represented by latent feature matrices $\mathbf{U}$ and $\mathbf{I}$, respectively, where

$$\mathbf{U} \in \mathbb{R}^{|U| \times k}, \qquad (7)$$

$$\mathbf{I} \in \mathbb{R}^{|I| \times k}, \qquad (8)$$

and $k$ is the number of latent features. User $u$ and item $i$ are represented by $\mathbf{u}$ and $\mathbf{i}$, respectively. Content data sometimes contains only user attributes, only item attributes, or both. User attributes include demographic information such as age, gender, education level, etc. Social network data can also be mined for user attributes. Item attributes include physical attributes, time of production, location, etc. The task of item ranking is to estimate the relative ranking of the items for each user. We denote the predicted ranking of item $i$ for user $u$ as $\hat{y}_{ui}$, obtained from an inference function $f$:

$$\hat{y}_{ui} = f(\mathbf{u}, \mathbf{a}_u^U, \mathbf{i}, \mathbf{a}_i^I, \theta), \qquad (9)$$

where $\theta$ denotes the model parameters learned during training. Equation 9 shows that $\hat{y}_{ui}$ is a function of the input and the learned model parameters. Model parameters are typically learned via optimization such that an objective loss function is minimized or a utility function is maximized. Objective loss function minimization is expressed as

$$f_E = \arg\min_{\theta} \ell(\theta; \mathbf{Y}), \qquad (10)$$

where $\theta$ is learned from the observation matrix $\mathbf{Y}$ to optimize the estimate function $f_E$ that predicts $\hat{y}_{ui}$. Learning is usually done with machine learning techniques such as gradient descent (GD) [23] or its variants, e.g., Adaptive Moment Estimation (Adam) [24], on carefully sampled user-item pairs.
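To make the notation concrete, the following is a minimal illustrative sketch (hypothetical toy data and variable names, not code from this work) of how the implicit feedback matrix $\mathbf{Y}$ of Equation 2 and the sets $I_u^+$ and $I_u^-$ of Equations 3-4 can be built from a log of observed user-item interactions.

```python
import numpy as np

# Hypothetical toy interaction log: (user, item) pairs observed implicitly.
interactions = [(0, 1), (0, 3), (1, 0), (2, 2), (2, 3)]
num_users, num_items = 3, 4

# Equation 2: binary implicit feedback matrix Y (1 = interacted, 0 = otherwise).
Y = np.zeros((num_users, num_items), dtype=np.int8)
for u, i in interactions:
    Y[u, i] = 1

# Equations 3-4: per-user positive and negative item sets.
I_pos = {u: set(np.flatnonzero(Y[u])) for u in range(num_users)}
I_neg = {u: set(range(num_items)) - I_pos[u] for u in range(num_users)}

print(I_pos[0], I_neg[0])  # e.g. {1, 3} and {0, 2}
```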
4. NeuTraL: Neural Transfer Learning for Personalized Ranking

We provide further background on pertinent information that will aid the understanding of NeuTraL.

Figure 1: NeuTraL: left side shows the pre-trained Auto-Encoder with the transfer to MPR (diagram elements: user/item ratings vector, hidden layer to be transferred after pre-training, output layer; knowledge transfer; user embedding, item embedding, prediction function, predicted output, training, actual output).
4.1. MPR: Multi-Objective Pairwise Ranking

MPR belongs to the pairwise ranking family, where the optimization task is with respect to the actual and predicted values for a pair of items by a user. For item ranking, the pairwise prediction function for a user $u$, a preferred item $i$ and a less preferred item $j$ is expressed as

$$\hat{y}_{u(i,j)} = \hat{y}_{ui} - \hat{y}_{uj}, \qquad (11)$$

while the actual value is

$$y_{u(i,j)} = y_{ui} - y_{uj}. \qquad (12)$$

Conversely, for user ranking, the pairwise prediction function for an item $i$ preferred by user $v$ but not preferred by user $w$ is expressed as

$$\hat{y}_{i(v,w)} = \hat{y}_{iv} - \hat{y}_{iw}, \qquad (13)$$

while the actual value is

$$y_{i(v,w)} = y_{iv} - y_{iw}. \qquad (14)$$

MPR combines item ranking and user ranking. The optimization function is expressed as

$$\sum_{u \in U} \sum_{i \in I_u^+} \sum_{j \in I_u^-} \ell(\hat{y}_{u(i,j)}) + \ell(\hat{y}_{i(v,w)}), \qquad (15)$$

where the objective function $\ell$ is the log-sigmoid function

$$\ell(x) = \ln \sigma(x), \qquad (16)$$

and

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \qquad (17)$$

$\hat{y}_{ui}$ is estimated from an MF model learned with GD. $\hat{y}_{ui}$ is the dot product of the user latent vector $\mathbf{u}$ and the item latent vector $\mathbf{i}$:

$$\hat{y}_{ui} = \mathbf{u}^T \cdot \mathbf{i}. \qquad (18)$$

Assume

$$\mathbf{u} = \{u_1, u_2, \ldots, u_k\} \qquad (19)$$

and

$$\mathbf{i} = \{i_1, i_2, \ldots, i_k\}. \qquad (20)$$

Component $u_f$ of $\mathbf{u}$ represents user $u$'s affinity for an item factor $f$. Component $i_f$ of $\mathbf{i}$ represents the concentration of factor $f$ in item $i$:

$$\mathbf{u}^T \cdot \mathbf{i} = u_1 i_1 + u_2 i_2 + \ldots + u_k i_k. \qquad (21)$$

Each component product $u_f \, i_f$ represents user $u$'s affinity for factor $f$ in item $i$. We subsequently refer to this component product as the latent vector product (LVP) for ease of reference.
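For concreteness, the following is a minimal sketch (our own illustration under an MF model, not the implementation evaluated later) of the item-ranking side of Equations 11 and 15-18: the pairwise score is the difference of two dot products, passed through the log-sigmoid objective. The user-ranking side of MPR is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, k = 3, 4, 8

# MF parameters: user latent matrix U and item latent matrix I (Equations 7-8).
U = rng.normal(scale=0.1, size=(num_users, k))
I = rng.normal(scale=0.1, size=(num_items, k))

def score(u, i):
    # Equation 18: dot product of user and item latent vectors.
    return U[u] @ I[i]

def pairwise_log_sigmoid(u, i, j):
    # Equations 11, 16, 17: log-sigmoid of the score difference between a
    # preferred item i and a less preferred item j.
    x = score(u, i) - score(u, j)
    return np.log(1.0 / (1.0 + np.exp(-x)))

# One sampled triple (u, i, j): its contribution to the item-ranking term of Eq. 15.
print(pairwise_log_sigmoid(u=0, i=1, j=2))
```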
4.2. Transfer Learning

Transfer learning [25] is premised on the idea that a related pre-trained model can serve as an initializer for a main model. This initialization can be beneficial by speeding up learning and/or improving accuracy on the main task, as seen in Figure 3. Transfer learning is similar to multi-task learning (MTL), the main difference being the sequential versus simultaneous nature of the two techniques, respectively. Transfer learning has been successful in image processing [26] and natural language processing [27], among other areas of machine learning.

4.3. Auto-Encoders & Personalization

Auto-encoders have been successfully applied in personalization systems [7, 6]. Auto-encoders derive their name from the ability to encode input data with unsupervised learning. The utility of auto-encoders includes dimensionality reduction of the input while optimally ignoring noise in the input. For the purpose of personalization, entity vector data is passed as input with missing entries. The goal is to recover the original input in the output, including the missing entries. To the best of our knowledge, the pioneering research work in this area is AutoRec [7]. User vectors $\mathbf{y}_u$ or item vectors $\mathbf{y}_i$ can serve as input, where each vector component is the actual preference value or a missing entry. The authors of AutoRec stated that user vector inputs performed better than item vector inputs, and we observed the same in our experiments. Perhaps this is due to the peculiar characteristics of the datasets used, e.g., the number of users and items, ratings per item and ratings per user. Wu et al. presented a more sophisticated auto-encoder personalization technique, Collaborative Denoising Auto-Encoders (CDAE) [6], which incorporates denoising with dropout [28] and an extra identifier input. Dropout can be seen as a form of noise introduction [29].

Deep learning techniques have the advantage of being able to model linear and non-linear complex interactions between users and items. Auto-encoders for personalization are depicted in Figure 1. We denote the node vectors in the input layer as $\hat{y}_u^0$, the hidden layer as $\hat{y}_u^1$ and the output layer as $\hat{y}_u^2$, where

$$\hat{y}_u^0 = g_0(\mathbf{y}_u, \mathbf{u}), \qquad (22)$$

and $g_0$ is a concatenation function. The node vector in the hidden layer is

$$\hat{y}_u^1 = g_1(\mathbf{W}_1^T \cdot \hat{y}_u^0 + \mathbf{b}_1). \qquad (23)$$

$\mathbf{W}_1$ is the $d \times h$ weight matrix between the input and hidden layers, where $d$ and $h$ are the numbers of nodes in the input and hidden layers, respectively. $\mathbf{b}_1$ is the bias for the hidden layer and $g_1$ is an activation function.

$$\hat{y}_u^2 = g_2(\mathbf{W}_2^T \cdot \hat{y}_u^1). \qquad (24)$$

$\mathbf{W}_2$ is the $h \times d$ weight matrix between the hidden and output layers, and $g_2$ is an activation function. We use sigmoid activation functions since they produced optimal results. $\mathbf{W}_1$, $\mathbf{W}_2$ and $\mathbf{b}_1$ are model parameters. There are also hyper-parameters, such as the learning rate, batch size and objective function, that should be tuned during training with validation. We use the binary cross-entropy cost function

$$-\hat{y}_{u(i,j)} \ln y_{uij} - (1 - \hat{y}_{u(i,j)}) \ln(1 - y_{uij}), \qquad (25)$$

and backpropagation to update the model parameters.
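The following is a minimal NumPy sketch of the forward pass in Equations 22-24 (an illustration with assumed sizes; in particular, decoding back to the item dimension rather than the full concatenated input is our assumption, following CDAE's behaviour).

```python
import numpy as np

rng = np.random.default_rng(1)
num_items, k, h = 4, 8, 8           # |I|, user-embedding size, hidden nodes
d = num_items + k                   # input size after concatenation (Eq. 22)

W1 = rng.normal(scale=0.1, size=(d, h))          # input -> hidden weights (Eq. 23)
b1 = np.zeros(h)                                 # hidden-layer bias
W2 = rng.normal(scale=0.1, size=(h, num_items))  # hidden -> output weights (Eq. 24)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(y_u, user_embedding):
    y0 = np.concatenate([y_u, user_embedding])   # Eq. 22: g_0 is concatenation
    y1 = sigmoid(W1.T @ y0 + b1)                 # Eq. 23: hidden layer
    y2 = sigmoid(W2.T @ y1)                      # Eq. 24: reconstructed preferences
    return y1, y2

y_u = np.array([1.0, 0.0, 1.0, 0.0])             # one implicit feedback row of Y
hidden, reconstruction = forward(y_u, rng.normal(scale=0.1, size=k))
```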
4.4. NeuTraL Algorithm

The development of NeuTraL, as depicted in Figure 1, begins with the supposition that a more representative user embedding could improve the performance of the MF for personalized ranking. A pre-trained neural network model may be appropriate, since we are aware of the success of deep learning models in personalization systems. It has also been shown that neural networks are better at modelling complex non-linearity in user-item interactions than MF models [5]. We chose CDAE as our pre-training model based on its proven improvement over AutoRec. User latent features in MF can be considered a form of dimensionality reduction of the user preference vectors in $\mathbf{Y}$. A close look at both CDAE and MF reveals that the hidden layer nodes of CDAE are analogous to user latent features, as smaller-dimension versions of the original user vectors in $\mathbf{Y}$. This analogy implies that we can use a pre-trained $|U| \times k$ matrix $\mathbf{C}$ of hidden layer node values as the user latent feature matrix, which forms the basis for our contribution. We subsequently refer to $\mathbf{C}$ as the transfer matrix. In other words, we transfer the user vector $\mathbf{c}_u$ from $\mathbf{C}$ as the latent vector for user $u$. We leave out the algorithm for NeuTraL since it is essentially the same as the MPR algorithm with the use of the pre-trained user embedding from CDAE.
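Continuing the sketch above (same hypothetical names), the transfer step amounts to encoding every user's preference vector with the pre-trained hidden layer and stacking the activations into the $|U| \times k$ transfer matrix $\mathbf{C}$, which then initializes the user latent matrix of the ranking model.

```python
import numpy as np

def build_transfer_matrix(Y, encode):
    """Stack the pre-trained hidden activations of every user into C (|U| x k)."""
    return np.vstack([encode(Y[u]) for u in range(Y.shape[0])])

# Hypothetical usage: `encode` would be the hidden layer of the pre-trained
# CDAE-style model from the previous sketch; C then initializes the user latent
# matrix U of the MPR ranker, which is trained as usual afterwards.
# C = build_transfer_matrix(Y, encode)
# U = C.copy()
```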
Figure 2: NeuTraL-C: left side shows the pre-trained Auto-Encoder with the transfer to ATM-MPR (diagram elements: user/item ratings vector, hidden layer to be transferred after pre-training, output layer; knowledge transfer; user embedding, mapper matrix, item attributes, prediction function, predicted output, training, actual output).
5. NeuTraL-C: Neural Transfer Learning for Cold-Start Personalized Ranking

We provide further background on pertinent information that will aid the understanding of NeuTraL-C, as depicted in Figure 2.

5.1. Item Attribute-to-Feature Mappings

Cold-start items have little to no historical preference information to exploit for personalized ranking. Hence, recommending cold-start items poses a different challenge. However, both warm-start and cold-start items have item attributes that can be exploited for recommendations. An item Attribute-to-Feature Mapping (ATM) is a framework capable of providing item latent features from item attributes, i.e., a function that accepts item attributes as input and produces item latent features as output. The output can then be used in conjunction with user latent features for prediction. We consider the ATM technique presented by Gantner et al. [12], referred to as ATM-BPR in this work. ATM-MPR is an extension of the ATM-BPR technique for cold-start personalization.

5.1.1. ATM-MPR

ATM-MPR adds cold-start capability to MPR by learning a shallow linear model of latent features and attributes. The main difference between MPR and ATM-MPR is the derivation of the item latent vector $\mathbf{i}$, where

$$\mathbf{i} = \mathcal{M}(\mathbf{a}_i^I), \qquad (26)$$

and $\mathcal{M}$ is a mapping function:

$$\mathcal{M}(\mathbf{a}_i^I) = \mathbf{M} \cdot \mathbf{a}_i^I, \qquad (27)$$

where $\mathbf{M}$ is a mapper matrix to be learned, similar to how $\mathbf{U}$ and $\mathbf{I}$ are learned in MPR with GD. ATM-MPR optimizes the NeuTraL-C optimization criterion, which is the same as the MPR criterion in Equation 15. However, the respective prediction functions for user ranking and item ranking in NeuTraL-C are different. We subsequently describe the item ranking prediction function; the user ranking prediction function is analogous. The item ranking prediction function is expressed as

$$\hat{y}_{u(i,j)} = (\mathbf{u}^T \cdot \mathbf{M} \cdot \mathbf{a}_i^I) - (\mathbf{u}^T \cdot \mathbf{M} \cdot \mathbf{a}_j^I). \qquad (28)$$

With transfer learning, the prediction function becomes

$$\hat{y}_{u(i,j)} = (\mathbf{c}_u^T \cdot \mathbf{M} \cdot \mathbf{a}_i^I) - (\mathbf{c}_u^T \cdot \mathbf{M} \cdot \mathbf{a}_j^I), \qquad (29)$$

$$\frac{\partial \hat{y}_{u(i,j)}}{\partial \mathbf{M}} = \mathbf{c}_u^T (\mathbf{a}_i^I - \mathbf{a}_j^I). \qquad (30)$$

Hence, $\mathbf{M}$ is updated in GD with the following expression:

$$\mathbf{M} = \mathbf{M} + \alpha \left( \frac{\partial\, \text{NeuTraL-C-opt}}{\partial \mathbf{M}} \right), \qquad (31)$$

$$\mathbf{M} = \mathbf{M} + \alpha \left( \frac{\partial \ell(\hat{y}_{u(i,j)})}{\partial \hat{y}_{u(i,j)}} \cdot \frac{\partial \hat{y}_{u(i,j)}}{\partial \mathbf{M}} - \lambda_M \cdot \mathbf{M} \right), \qquad (32)$$

where $\lambda_M$ is a regularization hyper-parameter.

5.2. NeuTraL-C Algorithm

The NeuTraL-C algorithm is listed in Algorithm 1.

Algorithm 1 NeuTraL-C($U$, $I$, $\mathbf{A}^I$)
1: Output: optimized matrices $\mathbf{U}$ and $\mathbf{M}$
2: initialize $\mathbf{U}$ with the extracted hidden layer matrix $\mathbf{C}$ from CDAE
3: initialize $\mathbf{I}$, $\mathbf{M}$ and the remaining parameters
4: repeat
5:   draw $u, i, j$ from $U, I_u^+, I_u^-$ uniformly
6:   $\mathbf{u} \leftarrow \mathbf{u} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{u}$
     $\mathbf{M} \leftarrow \mathbf{M} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{M}$ w.r.t. $\mathbf{a}_i^I$ and $\mathbf{a}_j^I$
7:   draw $i, v, w$ from $I, U_i^+, U_i^-$ uniformly
8:   $\mathbf{M} \leftarrow \mathbf{M} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{M}$ w.r.t. $\mathbf{v}$ and $\mathbf{w}$
     $\mathbf{v} \leftarrow \mathbf{v} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{v}$
     $\mathbf{w} \leftarrow \mathbf{w} + \alpha \cdot \partial\,\text{NeuTraL-C-opt}/\partial \mathbf{w}$
9: until convergence or the maximum number of iterations
10: return $\mathbf{U}$, $\mathbf{M}$
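To make Equations 27-30 and Algorithm 1 more concrete, the following is a minimal sketch (our own illustrative code; the user-ranking side and the regularization term of Equation 32 are omitted) of one stochastic update of the mapper matrix $\mathbf{M}$ from a sampled triple $(u, i, j)$.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 8, 19                               # latent features, item-attribute dimension
M = rng.normal(scale=0.01, size=(k, n))    # mapper matrix (Eq. 27)
c_u = rng.normal(scale=0.1, size=k)        # transferred user vector from C
a_i = rng.integers(0, 2, size=n).astype(float)   # attributes of preferred item i
a_j = rng.integers(0, 2, size=n).astype(float)   # attributes of less preferred item j
alpha = 0.01                               # learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eq. 29: pairwise prediction with the transferred user vector and mapped items.
x_uij = c_u @ (M @ a_i) - c_u @ (M @ a_j)

# d ln(sigma(x)) / dx = 1 - sigma(x); Eq. 30: dx/dM is the outer product c_u (a_i - a_j)^T.
grad_M = (1.0 - sigmoid(x_uij)) * np.outer(c_u, a_i - a_j)

# Gradient-ascent step on M, as in Eq. 31 (regularization term of Eq. 32 omitted).
M = M + alpha * grad_M
```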
6. Experiments

We proceed to address the following research questions:

• How does NeuTraL compare with other SOTA warm-start item personalization systems?
• How does NeuTraL-C compare with other SOTA cold-start item personalization systems?

We begin by describing our experiment setup. We subsequently describe our experiments on warm-start personalized ranking, followed by cold-start.

6.1. Experimental Repeatability

Experiment artifacts (software, datasets, etc.) for this work are available on demand. These artifacts will be made publicly available with publication. All of the techniques use GD and/or Adam for training, as is the case in NeuTraL, where we use Adam for pre-training CDAE but use GD for the actual training in the ATM-BPR framework. The benchmarks converge differently during training depending on hyperparameters, but one factor that affects the space and time requirements of each epoch is the size of the model parameters. Avoidance of bias forms the basis for model design and the other hyperparameter selections throughout our experiments. We use one hidden layer in the deep models. We use 100 factors in the MF models, and the number of nodes in the deep learning models is likewise 100. We used the tower architecture for the deep learning models. We used learning rates between 0.00001 and 0.01 and batch sizes of 10,000. We tuned model hyperparameters and stopped training early with validation.

6.2. Evaluation metrics

Evaluation is done with 5-fold cross validation. We use three popular information retrieval metrics: MRR, NDCG and AUC, which are described further in subsequent subsections. We evaluate the techniques on their ability to rank items relative to 9 and 99 other items. The ranking metrics relative to 9 other items are denoted @10; e.g., MRR@10 measures the MRR score for a technique when ranking 1 of 10 items for a user.
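As an illustration of the @10 protocol described above (our own sketch, not the evaluation code used for the reported results), the held-out item can be ranked against nine sampled negatives and the reciprocal rank averaged over test interactions.

```python
import numpy as np

def mrr_at_10(score, test_pairs, num_items, rng):
    """score(u, i) -> float; test_pairs: iterable of held-out (user, item) pairs."""
    reciprocal_ranks = []
    for u, pos in test_pairs:
        # Sample 9 other items as negatives (a fuller version would also exclude
        # items the user interacted with during training).
        negatives = [i for i in rng.permutation(num_items) if i != pos][:9]
        candidates = [pos] + negatives
        scores = np.array([score(u, i) for i in candidates])
        rank = 1 + int(np.sum(scores > scores[0]))  # rank of the held-out item
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Hypothetical usage with the MF scorer sketched in Section 4.1:
# rng = np.random.default_rng(0)
# print(mrr_at_10(score, [(0, 1), (2, 3)], num_items=4, rng=rng))
```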
Table 1: Datasets

Dataset        #Users   #Items   #Ratings
Movielens 1M    6,040    3,706   1,000,209
Eachmovie      72,916    1,628   2,811,983
Pinterest      55,187    9,916   1,500,809
Goodreads      10,000    5,000     647,458

Table 2: Movielens results on warm-start items

Metrics    IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.246   0.409   0.400   0.421   0.437
NDCG@10    0.310   0.485   0.480   0.497   0.515
MRR        0.270   0.424   0.415   0.435   0.451
NDCG       0.417   0.548   0.542   0.557   0.570
AUC        0.853   0.921   0.923   0.924   0.929

6.3. Experiments for warm-start ranking
6.3.1. Datasets

We performed experiments on four publicly available datasets. A summary of these datasets is provided in Table 1. The datasets contain explicit ratings for users on items, but we convert the ratings to implicit feedback by treating ratings greater than 0 as positive feedback. Our focus in this work is implicit feedback, but we believe NeuTraL is applicable to explicit feedback.

• Movielens 1M: the Movielens datasets [30] are made publicly available by the GroupLens Research lab at the University of Minnesota. We use the Movielens 1M dataset. The data is extracted from the Movielens website, a free website that provides personalized movie recommendations to users.
• Eachmovie dataset: this dataset [31] is made available by the Digital Equipment Corporation (DEC) Systems Research Center at Compaq. The research center ran a CF service for experimental purposes and made the data available for research.
• Goodreads dataset: this dataset [32] was collected from goodreads.com, a book social network and recommendation website.
• Pinterest dataset: this is a dataset of implicit feedback representing whether a user pinned an image on their board on the Pinterest platform at https://www.pinterest.com.

Table 3: Pinterest results on warm-start items

Metrics    IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.111   0.475   0.465   0.487   0.492
NDCG@10    0.151   0.566   0.559   0.578   0.584
MRR        0.138   0.483   0.475   0.496   0.501
NDCG       0.298   0.600   0.595   0.611   0.615
AUC        0.724   0.947   0.955   0.958   0.960

Table 4: Goodreads (books) results on warm-start items

Metrics    IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.087   0.170   0.167   0.239   0.245
NDCG@10    0.114   0.224   0.217   0.302   0.309
MRR        0.112   0.197   0.193   0.262   0.268
NDCG       0.266   0.353   0.348   0.410   0.415
AUC        0.590   0.793   0.770   0.829   0.834
6.3.2. Benchmarks

We compare our NeuTraL technique with three SOTA personalization systems and a baseline item popularity (IPop) technique. IPop recommends items based on popularity. As noted in Section 6.1, the benchmarks converge differently during training depending on hyperparameters, but one factor that affects the space and time requirements of each epoch is the size of the model parameters. We select model parameters to avoid bias throughout our experiments. The SOTA benchmarks used are described below:

• BPR [3]: Bayesian personalized ranking from implicit feedback, a pairwise ranking technique.
• Multi-objective pairwise ranking (MPR) [33]: MPR is an MTL technique that combines the item ranking and user ranking tasks. MTL learns from historical preference data from both the item and user ranking perspectives. MPR was demonstrated to be able to improve item ranking accuracy by learning from both perspectives.
• Neural Collaborative Filtering (NCF) [5]: NCF is an ensemble recommender that combines MF and deep learning. NCF was demonstrated to achieve superior performance compared to other SOTA techniques.
Table 5: Eachmovie results on warm-start items

Metrics    IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.123   0.284   0.261   0.275   0.293
NDCG@10    0.159   0.357   0.329   0.349   0.368
MRR        0.149   0.305   0.284   0.296   0.313
NDCG       0.303   0.449   0.430   0.442   0.456
AUC        0.646   0.861   0.841   0.857   0.862

6.3.3. Results

We record the best average results observed during experiments for each dataset and report them in Tables 2-5. NeuTraL significantly out-performs the other techniques based on a Wilcoxon signed-rank test with a p-value < 0.01. The winning algorithm per metric is emboldened in each row of all tables. We assume a margin of error of 0.005; hence, the winning algorithm has to be greater than the next winner by at least a margin of 0.005. All techniques are emboldened in the case of a tie on a metric. Techniques within the margin of error of the highest score are also emboldened.
6.4. Experiments for cold-start ranking

6.5. Datasets

We performed experiments on three of the four publicly available datasets used for the warm-start experiments in Section 6.3.1. We used the datasets with item attributes, hence their suitability for our experiments. A summary of these datasets is provided in Table 1. The three datasets used for the cold-start personalization experiments are highlighted below:

• Movielens 1M: item attributes in the dataset include release year and genre. The genre attribute is one-hot encoded into 18 dimensions because there are 18 possible genres. The year is an additional dimension.
• Eachmovie dataset: the items/movies in this dataset are a subset of the items in the Movielens dataset, hence we are able to use the same attribute feature engineering as described for Movielens.
• Goodreads dataset: we use the genres as book attributes for cold-start personalization. The genre attribute is one-hot encoded into 10 dimensions, one per possible genre.

6.5.1. Benchmarks

We compare our NeuTraL-C technique with four state-of-the-art cold-start personalization systems. NeuTraL-C, DropoutNet and ATM-BPR require pre-training. The benchmarks used are described below:

• Multi-layer perceptron (MLP): the MLP baseline used here predicts output from interactions between the user embedding and the item attributes with deep learning. The first hidden layer is the input combination layer that combines the user embedding input and the item attributes. The combination model is the piece-wise (element-wise) product, since this has been demonstrated to outperform concatenation or a dot product [34]. The dot product also does not allow us to assign different weights to the combined nodes. The output from this combination layer is propagated through extra hidden layers. More hidden layers can be added as needed before the final output. (A small sketch of this combination appears after this list.)
• ATM-BPR: the ATM-BPR technique used as a baseline here is described in Section 5.1, except that the pre-trained user embedding is extracted from BPR instead of from a CDAE recommender, as is used in NeuTraL-C.
• DropoutNet: DropoutNet [22] ("DropoutNet: Addressing Cold Start in Recommender Systems") is a state-of-the-art deep learning based personalization system. DropoutNet is analogous to NeuTraL and ATM-BPR, but it adopts a different transfer learning procedure compared to NeuTraL: DropoutNet transfers a pre-trained shallow model to a deep model, while NeuTraL transfers a pre-trained deep model to a shallow model. We use the MLP model described here as the deep learning model. DropoutNet allows the use of different pre-trained models, but we use pre-trained user latent features from CDAE, similar to NeuTraL-C, i.e., the DropoutNet implementation used here is a combination of the extracted user latent factors from CDAE and the MLP. Although DropoutNet is primarily a cold-start recommender, it is expected to perform relatively well on warm-start recommendations with an appropriate dropout rate. We use a maximum input dropout rate of 1.00 for our experiments with DropoutNet to maximize performance on cold start, because that is the focus of this research work. DropoutNet also allows an inference transform, but we do not apply it in our experiments because we do not consider the case of incremental item preference data collection as described in their work. We refer to DropoutNet as D-Net to conserve space in the results tables.
• Wide & Deep (W&D): Wide & Deep Learning for Recommender Systems (W&D) [19] combines the generalization and memorization capabilities of recommender systems for more robust personalization. The authors used deep learning for its demonstrated superior generalization capability. However, deep learning tends to over-generalize when the input is too sparse and high-rank. On the other hand, generalized linear models are highly capable of memorizing feature interactions through cross-product feature transformations. Hence the combination of a deep learning model and a cross-product (wide) model in W&D for personalization.
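As referenced in the MLP bullet above, the following is a minimal sketch of the input combination layer (our own construction; projecting the item attributes to the embedding dimension before the element-wise product is our assumption, since that product requires matching sizes).

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 8, 19                     # user-embedding size, item-attribute dimension

W_attr = rng.normal(scale=0.1, size=(n, k))   # assumed projection of attributes to k dims
W_h = rng.normal(scale=0.1, size=(k, k))      # one extra hidden layer
w_out = rng.normal(scale=0.1, size=k)         # final output weights

def relu(x):
    return np.maximum(x, 0.0)

def mlp_score(user_embedding, item_attributes):
    item_vec = item_attributes @ W_attr       # project attributes into embedding space
    combined = user_embedding * item_vec      # element-wise (piece-wise) product layer
    hidden = relu(combined @ W_h)             # extra hidden layer
    return float(hidden @ w_out)              # predicted preference score
```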
Table 6: Movielens results on cold-start items

Metrics    W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.043   0.050   0.070     0.053   0.083
NDCG@10    0.053   0.059   0.100     0.063   0.117
MRR        0.083   0.089   0.097     0.093   0.109
NDCG       0.244   0.249   0.257     0.252   0.269
AUC        0.604   0.610   0.629     0.617   0.656

Table 7: Goodreads results on cold-start items

Metrics    W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.030   0.036   0.057     0.054   0.077
NDCG@10    0.037   0.045   0.088     0.067   0.114
MRR        0.067   0.076   0.083     0.101   0.107
NDCG       0.228   0.238   0.245     0.264   0.271
AUC        0.570   0.603   0.588     0.672   0.689
6.5.2. Evaluation metrics for cold-start

We measured how well a recommender system is able to rank a preferred cold-start item relative to other items. The evaluation is similar to the evaluation for warm-start items. The main difference is the absence of the test items from the training dataset for cold-start personalized ranking.

6.5.3. Results

We record the best results observed during experiments for each dataset and report them in Tables 6-8. NeuTraL-C performs best overall, and we subsequently discuss the results further. The winning algorithm per metric is emboldened in each row of all tables. We assume a margin of error of 0.005; hence, the winning algorithm has to be greater than the next winner by at least a margin of 0.005. All techniques are emboldened in the case of a tie on a metric. Techniques within the margin of error of the highest score are also emboldened.

Table 8: Eachmovie results on cold-start items

Metrics    W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.031   0.032   0.052     0.032   0.055
NDCG@10    0.037   0.038   0.072     0.038   0.068
MRR        0.065   0.065   0.076     0.065   0.075
NDCG       0.221   0.222   0.232     0.221   0.237
AUC        0.490   0.492   0.507     0.481   0.525
6.6. Discussion

We begin our discussion with the results of the warm-start experiments. NeuTraL performed best overall because it has the highest number of wins, which corresponds to the number of times a technique has the highest score per dataset. We also validated this observation with a significance test. IPop has the worst performance overall. This is not surprising, since it is merely a baseline technique that ranks items based on popularity. The ranking produced by IPop is not personalized, as it does not take personal attributes, context or historical preference into account. We expect a decent personalized ranking technique to out-perform IPop. This is the case: the least performing personalized ranking technique is BPR, but it outperforms IPop. NCF performs better than BPR. This was already demonstrated by the creators of NCF in their research work [5]. NCF combines both deep learning (MLP) and a piece-wise product of interactions between user and item embeddings in a generalized matrix factorization (GMF). BPR uses a dot product of user and item embeddings to represent the interactions. The dot product assigns equal weights to the LVPs, as described in Section 4.1, while the GMF component of NCF learns different weights for the LVPs with a neural network. The MLP component of NCF also learns different weights for user and item embedding combinations. This results in a more complex representation of the interactions between users and items and better performance. MPR out-performs NCF. The MTL nature of MPR gives it an advantage. NeuTraL's superior performance buttresses the effectiveness of transfer learning: it is essentially MPR combined with transfer learning, yet it outperforms MPR. We surmise that transfer learning improved the performance of NeuTraL. We also believe that the type of pre-trained model that is transferred is significant. Our experiment here reveals that the extraction mechanism from an autoencoder-based model like CDAE is effective.

We subsequently discuss the results of our experiments on cold-start personalization. NeuTraL-C performed best overall because it has the highest number of wins, which corresponds to the number of times a technique has the highest score per dataset. We also validated this observation with a significance test. ATM-BPR is the next best performing technique. Both ATM-BPR and NeuTraL-C adopt transfer learning. However, NeuTraL-C uses a different pre-trained model: NeuTraL-C uses a pre-trained model extracted from CDAE, as described in Section 4.4, while ATM-BPR uses a pre-trained user embedding from BPR. This shows that it is not enough to just apply transfer learning; the meticulousness of the implementation is as important. The type of pre-trained model is pertinent in such a design. NeuTraL-C and ATM-BPR also differ in how they learn the "mapping function": NeuTraL-C uses MPR while ATM-BPR uses BPR. DropoutNet performs next best to ATM-BPR. DropoutNet also uses transfer learning; we used the user embedding from CDAE in DropoutNet. However, it uses deep learning to learn the interaction between the transferred embedding and the item attributes. The complex nature of DropoutNet deteriorated its performance somewhat. For instance, the transferred user embedding is propagated through hidden layers before combination with the item attributes. The output of the hidden layers is a tainted version of the user embedding. The mapping learned by DropoutNet is between this tainted version and the item attributes. We believe this is the reason for its poorer performance compared to ATM-BPR and NeuTraL-C. It is not too surprising that MLP performed worse than DropoutNet, since it is DropoutNet without transfer learning. Once again, this shows the effectiveness of transfer learning. W&D performed the worst of all the cold-start personalization systems. It does not use transfer learning, and we believe the complexity of deep learning in W&D deteriorated performance due to overfitting.

A common theme throughout our experiments is the benefit of our neural transfer learning approach. We believe that the transferred user embedding is more representative of the users as latent factors compared to the user embedding in the other models. We show a chart of loss minimization in NeuTraL with and without transfer learning on the Movielens data in Figure 3. Figure 3 shows the speed-up achieved with transfer learning in the form of a lower initial loss. Figure 3 also shows the overall lower loss with training. We know that ATM-BPR and DropoutNet adopt transfer learning as well, but they are outperformed by NeuTraL. As stated in Section 4.3, dropout is a vital component of CDAE, hence we investigated the effect of dropout during pre-training on the final results. The results show that dropout slightly enhances the effect of the transferred user embedding in NeuTraL.
           Β·104               without transfer learning             426β434. URL: http://doi.acm.org/10.1145/1401890.
       2
                               with transfer learning               1401944. doi:10.1145/1401890.1401944.
                                                                [5] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua,
                                                                    Neural collaborative filtering, in: Proceedings of
 1.5                                                                the 26th International Conference on World Wide
                                                                    Web, WWW β17, International World Wide Web
                                                                    Conferences Steering Committee, 2017, pp. 173β
                                                                    182. URL: https://doi.org/10.1145/3038912.3052569.
loss
       1
                                                                    doi:10.1145/3038912.3052569.
                                                                [6] Y. Wu, C. DuBois, A. X. Zheng, M. Ester, Col-
                                                                    laborative denoising auto-encoders for top-n rec-
 0.5                                                                ommender systems, in: Proceedings of the Ninth
                                                                    ACM International Conference on Web Search and
                                                                    Data Mining, WSDM β16, ACM, New York, NY,
                                                                    USA, 2016, pp. 153β162. URL: http://doi.acm.org/
                  5          10            15             20        10.1145/2835776.2835837. doi:10.1145/2835776.
                            epoch                                   2835837.
                                                                [7] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, Au-
Figure 3: Effect of Transfer Learning with NeuTraL-C on             torec: Autoencoders meet collaborative filtering,
Movielens dataset.                                                  in: Proceedings of the 24th International Confer-
                                                                    ence on World Wide Web, WWW β15 Compan-
                                                                    ion, ACM, New York, NY, USA, 2015, pp. 111β112.
                                                                    URL: http://doi.acm.org/10.1145/2740908.2742726.
room for future work and improvements. Potential future
                                                                    doi:10.1145/2740908.2742726.
research work include the extension of our techniques
                                                                [8] Y. Zheng, B. Tang, W. Ding, H. Zhou, A neu-
to user cold-start, full cold-start and warm-start rank-
                                                                    ral autoregressive approach to collaborative filter-
ing. Other potential future work includes investigation
                                                                    ing, in: Proceedings of the 33rd International Con-
of additional attributes and optimum fusion strategy of
                                                                    ference on International Conference on Machine
those attributes. We believe experimentation with more
                                                                    Learning - Volume 48, ICMLβ16, JMLR.org, 2016,
datasets and context attributes such as time and location
                                                                    pp. 764β773. URL: http://dl.acm.org/citation.cfm?
would also be worthwhile.
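To make the idea of the transfer concrete, the sketch below illustrates one generic way such a setup can be wired: pre-trained user factors from a source model are frozen inside a shallow ranker, and only a mapping from item attributes to the latent space is learned, so training amounts to learning the interaction between the transferred embedding and item attributes. This is an illustrative PyTorch sketch, not the exact NeuTraL-C architecture; the tensor names, dimensions, and the pairwise (BPR-style) objective are assumptions made for the example.

# Minimal, generic sketch of transferring pre-trained user factors into a
# shallow item cold-start ranker (illustrative only, not the NeuTraL-C model).
# Assumptions: `pretrained_user_factors` comes from a source model and
# `item_attrs` are content attribute vectors for items.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransferredColdStartRanker(nn.Module):
    def __init__(self, pretrained_user_factors, attr_dim, latent_dim):
        super().__init__()
        # The transferred user embeddings are frozen; only the attribute
        # mapping is learned, i.e. the interaction between the transferred
        # embedding and item attributes.
        self.user_emb = nn.Embedding.from_pretrained(
            pretrained_user_factors, freeze=True)
        self.attr_to_latent = nn.Linear(attr_dim, latent_dim)

    def score(self, users, item_attrs):
        u = self.user_emb(users)             # (batch, latent_dim), frozen
        v = self.attr_to_latent(item_attrs)  # (batch, latent_dim), learned
        return (u * v).sum(dim=-1)           # dot-product preference score

    def bpr_loss(self, users, pos_attrs, neg_attrs):
        # Pairwise objective: observed items should outrank sampled negatives.
        diff = self.score(users, pos_attrs) - self.score(users, neg_attrs)
        return -F.logsigmoid(diff).mean()

# Toy usage with random data, for illustration only.
num_users, latent_dim, attr_dim = 100, 16, 32
pretrained = torch.randn(num_users, latent_dim)  # stand-in for source-model factors
model = TransferredColdStartRanker(pretrained, attr_dim, latent_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

users = torch.randint(0, num_users, (64,))
pos_attrs = torch.randn(64, attr_dim)
neg_attrs = torch.randn(64, attr_dim)
loss = model.bpr_loss(users, pos_attrs, neg_attrs)
opt.zero_grad()
loss.backward()
opt.step()

Freezing the transferred user factors keeps the shallow model small; only the attribute mapping is updated during training.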
References

[1] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 263–272. doi:10.1109/ICDM.2008.22.
[2] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30–37. URL: http://dx.doi.org/10.1109/MC.2009.263. doi:10.1109/MC.2009.263.
[3] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, AUAI Press, Arlington, Virginia, United States, 2009, pp. 452–461. URL: http://dl.acm.org/citation.cfm?id=1795114.1795167.
[4] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, ACM, New York, NY, USA, 2008, pp. 426–434. URL: http://doi.acm.org/10.1145/1401890.1401944. doi:10.1145/1401890.1401944.
[5] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, International World Wide Web Conferences Steering Committee, 2017, pp. 173–182. URL: https://doi.org/10.1145/3038912.3052569. doi:10.1145/3038912.3052569.
[6] Y. Wu, C. DuBois, A. X. Zheng, M. Ester, Collaborative denoising auto-encoders for top-n recommender systems, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM '16, ACM, New York, NY, USA, 2016, pp. 153–162. URL: http://doi.acm.org/10.1145/2835776.2835837. doi:10.1145/2835776.2835837.
[7] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, AutoRec: Autoencoders meet collaborative filtering, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, ACM, New York, NY, USA, 2015, pp. 111–112. URL: http://doi.acm.org/10.1145/2740908.2742726. doi:10.1145/2740908.2742726.
[8] Y. Zheng, B. Tang, W. Ding, H. Zhou, A neural autoregressive approach to collaborative filtering, in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML '16, JMLR.org, 2016, pp. 764–773. URL: http://dl.acm.org/citation.cfm?id=3045390.3045472.
[9] M. Bianchi, F. Cesaro, F. Ciceri, M. Dagrada, A. Gasparin, D. Grattarola, I. Inajjar, A. M. Metelli, L. Cella, Content-based approaches for cold-start job recommendations, in: Proceedings of the Recommender Systems Challenge 2017, RecSys Challenge '17, ACM, New York, NY, USA, 2017, pp. 6:1–6:5. URL: http://doi.acm.org/10.1145/3124791.3124793. doi:10.1145/3124791.3124793.
[10] A. I. Schein, A. Popescul, L. H. Ungar, D. M. Pennock, Methods and metrics for cold-start recommendations, in: SIGIR '02, 2002.
[11] A. Arampatzis, G. Kalamatianos, Suggesting points-of-interest via content-based, collaborative, and hybrid fusion methods in mobile devices, ACM Trans. Inf. Syst. 36 (2017) 23:1–23:28. URL: http://doi.acm.org/10.1145/3125620. doi:10.1145/3125620.
[12] Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, L. Schmidt-Thieme, Learning attribute-to-feature mappings for cold-start recommendations, in: 2010 IEEE International Conference on Data Mining, 2010, pp. 176–185. doi:10.1109/ICDM.2010.129.
[13] A. v. d. Oord, S. Dieleman, B. Schrauwen, Deep content-based music recommendation, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS '13, Curran Associates Inc., USA, 2013, pp. 2643–2651. URL: http://dl.acm.org/citation.cfm?id=2999792.2999907.
[14] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA, 2016.
[15] T. T. Nguyen, H. W. Lauw, Collaborative topic regression with denoising autoencoder for content and community co-representation, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, ACM, New York, NY, USA, 2017, pp. 2231–2234. URL: http://doi.acm.org/10.1145/3132847.3133128. doi:10.1145/3132847.3133128.
[16] H. Wang, N. Wang, D.-Y. Yeung, Collaborative deep learning for recommender systems, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, ACM, New York, NY, USA, 2015, pp. 1235–1244. URL: http://doi.acm.org/10.1145/2783258.2783273. doi:10.1145/2783258.2783273.
[17] G. Sottocornola, F. Stella, M. Zanker, F. Canonaco, Towards a deep learning model for hybrid recommendation, in: Proceedings of the International Conference on Web Intelligence, WI '17, ACM, New York, NY, USA, 2017, pp. 1260–1264. URL: http://doi.acm.org/10.1145/3106426.3110321. doi:10.1145/3106426.3110321.
[18] W. Niu, J. Caverlee, H. Lu, Neural personalized ranking for image recommendation, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 423–431. URL: https://doi.org/10.1145/3159652.3159728. doi:10.1145/3159652.3159728.
[19] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, Association for Computing Machinery, New York, NY, USA, 2016, pp. 7–10. URL: https://doi.org/10.1145/2988450.2988454. doi:10.1145/2988450.2988454.
[20] Y. Zhu, J. Lin, S. He, B. Wang, Z. Guan, H. Liu, D. Cai, Addressing the item cold-start problem by attribute-driven active learning, IEEE Transactions on Knowledge and Data Engineering 32 (2020) 631–644.
[21] M. Yan, J. Sang, T. Mei, C. Xu, Friend transfer: Cold-start friend recommendation with cross-platform transfer learning of social knowledge, in: 2013 IEEE International Conference on Multimedia and Expo (ICME), 2013, pp. 1–6.
[22] M. Volkovs, G. Yu, T. Poutanen, DropoutNet: Addressing cold start in recommender systems, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 4957–4966.
[23] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in: Proceedings of the 22nd International Conference on Machine Learning, ICML '05, ACM, New York, NY, USA, 2005, pp. 89–96. URL: http://doi.acm.org/10.1145/1102351.1102363. doi:10.1145/1102351.1102363.
[24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014). URL: http://dblp.uni-trier.de/db/journals/corr/corr1412.html#KingmaB14.
[25] L. Torrey, J. Shavlik, Transfer learning, 2009.
[26] A. Quattoni, Transfer learning algorithms for image classification, Ph.D. thesis, Citeseer, 2009.
[27] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for NLP, arXiv preprint arXiv:1902.00751 (2019).
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958. URL: http://dl.acm.org/citation.cfm?id=2627435.2670313.
[29] C. M. Bishop, Training with noise is equivalent to Tikhonov regularization, Neural Computation 7 (1995) 108–116.
[30] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015) 19:1–19:19. URL: http://doi.acm.org/10.1145/2827872. doi:10.1145/2827872.
[31] P. McJones, EachMovie collaborative filtering dataset, DEC Systems Research Center, http://www.research.compaq.com/src/eachmovie/, 1997.
[32] M. Wan, J. J. McAuley, Item recommendation on monotonic behavior chains, in: S. Pera, M. D. Ekstrand, X. Amatriain, J. O'Donovan (Eds.), Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, ACM, 2018, pp. 86–94. URL: https://doi.org/10.1145/3240323.3240369. doi:10.1145/3240323.3240369.
[33] R. Otunba, R. A. Rufai, J. Lin, MPR: Multi-objective pairwise ranking, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 170–178. URL: https://doi.org/10.1145/3109859.3109903. doi:10.1145/3109859.3109903.
[34] R. Otunba, R. A. Rufai, J. Lin, Deep stacked ensemble recommender, in: Proceedings of the 31st International Conference on Scientific and Statistical Database Management, SSDBM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 197–201. URL: https://doi.org/10.1145/3335783.3335809. doi:10.1145/3335783.3335809.