NeuTraL: Neural Transfer Learning for Personalized Ranking

Rasaq Otunba
4400 University Drive, Fairfax, Virginia 22030
rotunba@gmu.edu (R. Otunba)

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Personalized ranking continues to be an important aspect of many information and personalization systems. Neural networks and deep learning continue to gain popularity because of their success in different fields of artificial intelligence such as computer vision and natural language processing. Recently, researchers began to apply deep learning to personalized ranking with success. Most personalization systems exploit historical preference data for users and items in the warm-start scenario. A major challenge in personalized ranking occurs in the cold-start scenario, which arises when there is little to no historical preference information. Content information is sometimes available and can be used to alleviate the cold-start problem. We propose a solution that involves transfer learning from a deep model to a shallow model for both warm-start and cold-start personalized ranking. We corroborate our proposal with experiments on publicly available datasets in comparison with other baseline and state-of-the-art techniques.

Keywords
neural networks; deep learning; recommendations; personalization; cold-start; ranking

1. Introduction

Personalized ranking with adequate historical preference is referred to as warm-start, while recommendation with inadequate historical preference is referred to as cold-start. We subsequently refer to personalized ranking as ranking except where otherwise clearly stated. We propose a machine learning solution called Neural Transfer Learning for warm-start personalized ranking, otherwise referred to as NeuTraL. We then propose a cold-start version of NeuTraL referred to as NeuTraL-C. NeuTraL and NeuTraL-C use neural networks and transfer learning for warm-start and cold-start item ranking respectively. Item cold-start personalized ranking involves ranking cold-start items, while user cold-start personalized ranking involves ranking cold-start users. There is also the full cold-start entity personalized ranking problem, where both the user and item entities have no historical preference information. Although we focus on cold-start item personalized ranking in this work, we believe the concept is extensible to both the user cold-start and full cold-start personalized ranking problems. Entity content information is sometimes used to compensate for the lack of historical preference information by learning from content information and existing preference information. Ranking can be done for implicit or explicit feedback [1]. We focus on implicit feedback in this work due to its more prevalent nature. The contributions made in this work include:

• We propose a unique approach to extracting pre-trained user latent factors from a state-of-the-art (SOTA) personalization model.
• We transfer the pre-trained user latent factors to a well-established personalization model for warm-start and cold-start ranking respectively.
• We provide a thorough evaluation, conducting experiments that compare our proposed solutions with other SOTA and baseline techniques.

The remainder of this paper is organized as follows: in Section 2, we highlight related work. We provide pertinent background and notations for the rest of this work in Section 3. We describe our approach in Sections 4 and 5. In Section 6, we describe our experiments and discuss the results. We conclude with potential directions for future work in Section 7.
2. Related Work

Personalized ranking techniques typically belong in one of the following categories: collaborative filtering (CF), content-based, or a hybrid of the aforementioned techniques. Different CF techniques ranging from matrix factorization (MF) [2, 3] to k-Nearest Neighbor (kNN) [4] have seen success in personalization systems research. In recent years, deep learning has also been successfully applied for personalization. He et al. replaced the typical dot product of user and item latent features with a deep learning model in their technique referred to as neural collaborative filtering (NCF) [5]. NCF performs better than vanilla MF because the non-linearity of the deep learning model captures complex interactions between users and items better. Deep representation models such as autoencoders and restricted Boltzmann machines (RBM) have also been used for personalization [6, 7, 8]. These techniques have been successfully applied and demonstrated on a variety of real-world data, but they are known to suffer from the cold-start problem. Content-based techniques are typically used to tackle the cold-start problem by incorporating entity attributes [9, 10]. Entity attributes are sometimes combined with CF to compensate for the weakness of CF in the cold-start scenario [11, 12]. To alleviate the cold-start problem, some deep learning techniques have been developed that use content information, e.g., the deep content-based music recommendation work proposed by Oord et al. [13]. Most of the deep learning personalization systems proposed for cold start are hybrid in that they combine historical preference and content information [14, 15, 16, 17, 18, 19]. Some cold-start personalization systems [20] adopt active learning; however, there are situations where active feedback from users for cold-start items is unavailable. Transfer learning has also been used in personalization systems research [21, 22].
3. Background & Notations

The set of users and the set of items are denoted by U and I, respectively. A measure of preference is recorded as a positive feedback value from some set P or as a negative feedback value recorded as 0. When explicitly provided, P could be a set of values, e.g., {1, 2, ..., 5}. When implicitly provided, typically P = {0, 1}. The matrix of user-item interactions is denoted by

Y ∈ ({0} ∪ P)^(|U|×|I|),    (1)

where an interaction refers to an observable action by a user, e.g., the purchase of an item. The user vector for user u in Y is denoted as y_u. Conversely, the item vector for item i in Y is denoted as y_i^T. The implicit feedback for a user u ∈ U on an item i ∈ I is:

y_ui = 1 if u interacted with i; 0 otherwise.    (2)

I_u^+ = {set of items interacted with by user u},    (3)
I_u^- = I − I_u^+.    (4)

U^+, U^-, and U are user sets analogous to the definitions in Equations 3-4. A^U and A^I represent the m-dimensional user-attribute and n-dimensional item-attribute matrices, respectively:

A^U ∈ R^(|U|×m),    (5)
A^I ∈ R^(|I|×n).    (6)

Let a^U_u be the vector of user attributes 1...m for user u, and a^I_i be the vector of item attributes 1...n for item i, so that a^I_ik is the k-th item attribute value and a^U_uk is the k-th user attribute value. a^I_ik = 0 when the attribute is unavailable. The sets U and I are represented by latent feature matrices U and I respectively, where

U ∈ R^(|U|×r),    (7)
I ∈ R^(|I|×r),    (8)

and r is the number of latent features. User u and item i are represented by the latent vectors u and i, respectively. Content data sometimes contains only user attributes, only item attributes, or both. User attributes include demographic information such as age, gender, and education level. Social network data can also be mined for user attributes. Item attributes include physical attributes, time of production, location, etc.

The task of item ranking is to estimate the relative ranking of the items for each user. We denote the predicted ranking of item i for user u as ŷ_ui, obtained from an inference function f:

ŷ_ui = f(u, a^U_u, i, a^I_i, θ),    (9)

where θ denotes the model parameters learned during training. Equation 9 shows that ŷ_ui is a function of the input and the learned model parameters. Model parameters are typically learned via optimization such that an objective loss function is minimized or a utility function is maximized. Objective loss function minimization is expressed as:

θ_E = arg min_θ L(θ; Y),    (10)

where θ is learned from the observation matrix Y to optimize the estimator θ_E that predicts ŷ_ui. Learning is usually done with machine learning techniques such as gradient descent (GD) [23] or its variants, e.g., Adaptive Moment Estimation (Adam) [24], on carefully sampled user-item pairs.
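To make the notation concrete, the following minimal numpy sketch (ours, not the authors' code; the variable names are illustrative) builds the item sets of Equations 3-4 from a binary interaction matrix Y and draws one (u, i, j) training triple of the kind used by the pairwise learners described later:

```python
import numpy as np

rng = np.random.default_rng(0)

def item_sets(Y, u):
    """I_u^+ and I_u^- from Eqs. (3)-(4) for user u, given a binary Y of shape |U| x |I|."""
    interacted = np.flatnonzero(Y[u])            # I_u^+
    not_interacted = np.flatnonzero(Y[u] == 0)   # I_u^- = I - I_u^+
    return interacted, not_interacted

def sample_triple(Y):
    """Uniformly draw a user u, a positive item i and a negative item j.
    Assumes every user has at least one positive and one negative item."""
    u = int(rng.integers(Y.shape[0]))
    pos, neg = item_sets(Y, u)
    return u, int(rng.choice(pos)), int(rng.choice(neg))
```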
4. NeuTraL: Neural Transfer Learning for Personalized Ranking

We provide further background on pertinent information that will aid the understanding of NeuTraL.

Figure 1: NeuTraL. The left side shows the pre-trained Auto-Encoder, whose hidden layer is transferred to MPR on the right after pre-training.

4.1. MPR: Multi-Objective Pairwise Ranking

MPR belongs to the pairwise ranking family, where the optimization task is with respect to the actual and predicted values for a pair of items by a user. For item ranking, the pairwise prediction function for a user u, a preferred item i and a less preferred item j is expressed as

ŷ_u(i,j) = ŷ_ui − ŷ_uj,    (11)

while the actual value is

y_u(i,j) = y_ui − y_uj.    (12)

Conversely, for user ranking, the pairwise prediction function for an item f preferred by user v but not preferred by user w is expressed as

ŷ_f(v,w) = ŷ_fv − ŷ_fw,    (13)

while the actual value is

y_f(v,w) = y_fv − y_fw.    (14)

MPR combines item ranking and user ranking. The optimization function is expressed as:

Σ_{u∈U} Σ_{i∈I_u^+} Σ_{j∈I_u^-} L(ŷ_u(i,j)) + L(ŷ_f(v,w)),    (15)

where the objective function L is the log-sigmoid function:

L(x) = ln σ(x),    (16)

and

σ(x) = 1 / (1 + e^(−x)).    (17)

ŷ_ui is estimated from an MF model learned with GD. ŷ_ui is the dot product of the user latent vector u and the item latent vector i:

ŷ_ui = u^T · i.    (18)

Assume

u = {u_1, u_2, ..., u_k}    (19)

and

i = {i_1, i_2, ..., i_k}.    (20)

Component u_k of u represents user u's affinity for an item factor k. Component i_k of i represents the concentration of factor k in item i. Then

u^T · i = u_1 * i_1 + u_2 * i_2 + ... + u_k * i_k.    (21)

Each component product u_k * i_k represents user u's affinity for factor k in item i. We subsequently refer to this component product as the latent vector product (LVP) for ease of reference.
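For illustration only, the sketch below (ours; it assumes the user and item factor matrices are plain numpy arrays named U and V) computes the pairwise scores of Equations 11, 13 and 18 and the corresponding negative log-sigmoid terms of Equations 16-17 for one item-ranking triple and one user-ranking triple:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mpr_pair_loss(U, V, u, i, j, f, v, w):
    """Negative log-sigmoid loss for one item-ranking triple (u, i, j) and one
    user-ranking triple (f, v, w); U is |U| x r user factors, V is |I| x r item factors."""
    x_uij = U[u] @ V[i] - U[u] @ V[j]   # Eq. (11) with the dot product of Eq. (18)
    x_fvw = U[v] @ V[f] - U[w] @ V[f]   # Eq. (13): item f should score user v above user w
    # maximizing ln(sigma(x)) is equivalent to minimizing -ln(sigma(x))
    return -(np.log(sigmoid(x_uij)) + np.log(sigmoid(x_fvw)))
```

In practice these per-triple terms are accumulated over sampled triples, mirroring the sums in Equation 15.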
4.2. Transfer Learning

Transfer learning [25] is premised on the idea that a related pre-trained model can serve as an initializer for a main model. This initialization can be beneficial by speeding up learning and/or improving accuracy on the main task, as seen in Figure 3. Transfer learning is similar to multi-task learning (MTL), the main difference being the sequential versus simultaneous nature of the two techniques, respectively. Transfer learning has been successful in image processing [26] and natural language processing [27], among other areas of machine learning.

4.3. Auto-Encoders & Personalization

Auto-encoders have been successfully applied in personalization systems [7, 6]. Auto-encoders derive their name from their ability to encode input data with unsupervised learning. The utility of auto-encoders includes dimensionality reduction of the input while optimally ignoring noise in the input. For the purpose of personalization, entity vector data with missing entries is passed as input. The goal is to recover the original input, including the missing entries, in the output. To the best of our knowledge, the pioneering research work in this area is AutoRec [7]. User vectors y_u or item vectors y_i can serve as input, where each vector component is the actual preference value or a missing entry. However, the authors of AutoRec stated that user vector inputs performed better than item vector inputs, and we observed the same in our experiments. Perhaps this is due to the peculiar characteristics of the datasets used, e.g., the number of users and items, ratings per item and ratings per user. Wu et al. presented a more sophisticated auto-encoder personalization technique, Collaborative Denoising Auto-Encoders (CDAE) [6], which incorporates denoising with dropout [28] and an extra identifier input. Dropout can be seen as a form of noise introduction [29].

Deep learning techniques have the advantage of being able to model linear and non-linear complex interactions between users and items. The auto-encoder for personalization is depicted in Figure 1. We denote the nodes in the input layer as ŷ^u_0, the hidden layer as ŷ^u_1 and the output layer as ŷ^u_2, where

ŷ^u_0 = f_0(y_u, u),    (22)

and f_0 is a concatenation function. The vector of nodes in the hidden layer is:

ŷ^u_1 = f_1(W_1^T · ŷ^u_0 + b_1).    (23)

W_1 is the g × h weight matrix between the input and hidden layers, where g and h are the number of nodes in the input and hidden layers respectively. b_1 is the bias for the hidden layer and f_1 is an activation function.

ŷ^u_2 = f_2(W_2^T · ŷ^u_1).    (24)

W_2 is the h × g weight matrix between the hidden and output layers and f_2 is an activation function. We use sigmoid activation functions since they produced optimal results. W_1, W_2 and b_1 are model parameters. There are also hyper-parameters such as the learning rate, batch size and objective function that should be tuned during training with validation. We use the binary cross-entropy cost function

−y_u(i,j) ln ŷ_u(i,j) − (1 − y_u(i,j)) ln(1 − ŷ_u(i,j)),    (25)

and backpropagation to update the model parameters.
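A simplified sketch of the forward pass in Equations 22-24 is given below (ours, under the assumption of sigmoid activations and a one-hot user identifier as the extra CDAE-style input; it is not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencoder_forward(y_u, u_onehot, W1, b1, W2):
    """Forward pass of the personalization auto-encoder in Eqs. (22)-(24).
    y_u: preference vector of user u; u_onehot: extra identifier input (our assumption);
    W1 is g x h, b1 has h entries, W2 is h x g."""
    y0 = np.concatenate([y_u, u_onehot])   # Eq. (22): f0 is concatenation
    y1 = sigmoid(W1.T @ y0 + b1)           # Eq. (23): hidden layer (later transferred)
    y2 = sigmoid(W2.T @ y1)                # Eq. (24): reconstructed input/ratings
    return y1, y2
```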
4.4. NeuTraL Algorithm

The development of NeuTraL, as depicted in Figure 1, begins with the supposition that a more representative user embedding could improve the performance of the MF model for personalized ranking. A pre-trained neural network model may be appropriate, since we are aware of the success of deep learning models in personalization systems. It has also been shown that neural networks are better at modelling complex non-linearity in user-item interactions than MF models [5]. We chose CDAE as our pre-training model based on its proven improvement over AutoRec. User latent features in MF can be considered a form of dimensionality reduction for the user preference vectors in Y. A close look at both CDAE and MF reveals that the hidden layer nodes of CDAE are analogous to user latent features, as smaller-dimension versions of the original user vectors in Y. This analogy implies we can use a pre-trained |U| × k matrix C of hidden layer node values as the user latent feature matrix, which forms the basis for our contribution. We subsequently refer to C as the transfer matrix. In other words, we transfer the user vector c_u from C as the latent vector for user u. We leave out the algorithm for NeuTraL since it is essentially the same as the MPR algorithm with the use of the pre-trained user embedding from CDAE.
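The transfer step can be sketched as follows (our illustration, reusing the hypothetical auto-encoder weights W1 and b1 from the earlier sketch): the hidden-layer activations of every user are stacked into the |U| × k transfer matrix C, which then takes the place of the randomly initialized user factor matrix used by MPR.

```python
import numpy as np

def extract_transfer_matrix(Y, user_onehots, W1, b1):
    """Build the transfer matrix C: row u holds user u's hidden-layer
    activations (Eq. (23)) from the pre-trained auto-encoder."""
    rows = []
    for y_u, id_u in zip(Y, user_onehots):
        y0 = np.concatenate([y_u, id_u])
        rows.append(1.0 / (1.0 + np.exp(-(W1.T @ y0 + b1))))
    return np.vstack(rows)

# NeuTraL then uses C in place of randomly initialized user latent factors:
# U_factors = extract_transfer_matrix(Y, ids, W1, b1)
```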
Figure 2: NeuTraL-C. The left side shows the pre-trained Auto-Encoder, whose hidden layer is transferred to ATM-MPR on the right after pre-training.

5. NeuTraL-C: Neural Transfer Learning for Cold-Start Personalized Ranking

We provide further background on pertinent information that will aid the understanding of NeuTraL-C, which is depicted in Figure 2.

5.1. Item Attribute-to-Feature Mappings

Cold-start items have little to no historical preference information to exploit for personalized ranking; hence recommending cold-start items poses a different challenge. However, both warm-start and cold-start items have item attributes that can be exploited for recommendations. An item Attribute-to-Feature Mapping (ATM) is a framework capable of providing item latent features from item attributes, i.e., a function that accepts item attributes as input and produces item latent features as output. The output can then be used in conjunction with user latent features for prediction. We consider the ATM technique presented by Gantner et al. [12], referred to as ATM-BPR in this work. ATM-MPR is an extension of the ATM-BPR technique for cold-start personalization.

5.1.1. ATM-MPR

ATM-MPR adds cold-start capability to MPR by learning a shallow linear model of latent features and attributes. The main difference between MPR and ATM-MPR is the derivation of the item latent vector i, where

i = M(a^I_i),    (26)

and M is a mapping function:

M(a^I_i) = M · a^I_i,    (27)

where M is a mapper matrix to be learned, similar to how U and I are learned in MPR with GD. ATM-MPR optimizes the NeuTraL-C optimization criterion, which is the same as the NeuTraL (MPR) criterion in Equation 15. However, the respective prediction functions for user ranking and item ranking in NeuTraL-C are different. We subsequently describe the item ranking prediction function; the user ranking prediction function is analogous. The item ranking prediction function is expressed as:

ŷ_u(i,j) = (u^T · M · a^I_i) − (u^T · M · a^I_j).    (28)

With transfer learning, the prediction function becomes:

ŷ_u(i,j) = (c_u^T · M · a^I_i) − (c_u^T · M · a^I_j).    (29)

The partial derivative of the prediction with respect to M is

∂ŷ_u(i,j)/∂M = c_u^T (a^I_i − a^I_j).    (30)

Hence, M is updated in GD with the following expression:

M = M + α (Neu-C-OPT_M),    (31)

M = M + α ( ∂L(ŷ_u(i,j))/∂ŷ_u(i,j) · ∂ŷ_u(i,j)/∂M − λ_M · M ),    (32)

where α is the learning rate and λ_M is a regularization hyper-parameter.

5.2. NeuTraL-C Algorithm

The NeuTraL-C algorithm is listed in Algorithm 1, where Neu-C-OPT_x denotes the gradient of the NeuTraL-C optimization criterion with respect to x.

Algorithm 1: NeuTraL-C(U, M, A)
1: Output: Optimized matrices U and M
2: initialize U with the extracted hidden layer matrix C from CDAE
3: initialize α, η and M
4: repeat
5:   draw u, i, j from U, I_u^+, I_u^- uniformly
6:   u ← u − η * Neu-C-OPT_u
     M ← M − η * Neu-C-OPT_M with respect to a^I_i and a^I_j
7:   draw f, v, w from I, U_f^+, U_f^- uniformly
8:   M ← M − η * Neu-C-OPT_M with respect to v and w
     v ← v − η * Neu-C-OPT_v
     w ← w − η * Neu-C-OPT_w
9: until convergence or maximum number of iterations
10: return U, M
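To illustrate the item-ranking step of Algorithm 1 and the update in Equations 29-32, the sketch below (ours) performs one gradient step on the mapper matrix M; the derivative of the log-sigmoid objective with respect to the pairwise score is 1 − σ(x), and the array shapes follow Section 5.1.1 (C: |U| × k transferred user factors, M: k × t mapper, A: |I| × t item attributes).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def item_ranking_step(C, M, A, u, i, j, alpha, lam):
    """One item-ranking update of the mapper matrix M, per Eqs. (29)-(32)."""
    diff = A[i] - A[j]                                   # a_i - a_j
    x = C[u] @ M @ diff                                  # Eq. (29): pairwise prediction
    grad_M = (1.0 - sigmoid(x)) * np.outer(C[u], diff)   # chain rule with Eq. (30)
    return M + alpha * (grad_M - lam * M)                # Eq. (32) with regularization
```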
6. Experiments

We proceed to address the following research questions:

• How does NeuTraL compare with other SOTA warm-start item personalization systems?
• How does NeuTraL-C compare with other SOTA cold-start item personalization systems?

We begin by describing our experiment setup. We subsequently describe our experiments on warm-start personalized ranking, followed by cold-start.

6.1. Experimental Repeatability

Experiment artifacts (software, datasets, etc.) for this work are available on demand. These artifacts will be made publicly available with publication. All of the techniques use GD and/or Adam for training, as is the case in NeuTraL, where we use Adam for pre-training CDAE but use GD for the actual training in the ATM-BPR framework. The benchmarks converge differently during training depending on hyperparameters, but one factor that affects the space and time requirements during each epoch is the size of the model parameters. Avoidance of bias forms the basis for model design and other hyperparameter selections throughout our experiments. We use one hidden layer in the deep models. We use 100 factors in the MF models, and we also set the number of nodes in the deep learning models to 100. We used the tower architecture for the deep learning models. We used learning rates between 0.00001 and 0.01 and batch sizes of 10000. We tuned model hyperparameters and stopped training early with validation.

6.2. Evaluation metrics

Evaluation is done with 5-fold cross validation. We use 3 popular information retrieval metrics: MRR (mean reciprocal rank), NDCG (normalized discounted cumulative gain) and AUC (area under the ROC curve). We evaluate the techniques on their ability to rank items relative to 9 and 99 other items. The ranking metrics relative to 9 other items are denoted as @10; e.g., MRR@10 measures the MRR score for a technique when ranking 1 of 10 items for a user.
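As an illustration of the @10 protocol, the sketch below (ours) computes MRR@10 and NDCG@10 for a single user when one held-out positive item is scored against 9 sampled negatives; AUC can be derived from the same scores analogously.

```python
import numpy as np

def mrr_ndcg_at_k(scores, pos_index, k=10):
    """Rank the positive item among the candidates by score and return
    (MRR@k, NDCG@k) for one user; both are 0 if the item falls outside the top k."""
    rank = int(np.sum(scores > scores[pos_index])) + 1  # 1-based rank of the positive item
    if rank > k:
        return 0.0, 0.0
    return 1.0 / rank, 1.0 / np.log2(rank + 1)

# Example: one held-out positive (index 0) against 9 sampled negatives (the @10 setting)
scores = np.random.rand(10)
mrr10, ndcg10 = mrr_ndcg_at_k(scores, pos_index=0, k=10)
```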
6.3. Experiments for warm-start ranking

6.3.1. Datasets

We performed experiments on four publicly available datasets. A summary of these datasets is provided in Table 1. The datasets contain explicit ratings by users on items, but we convert the ratings to implicit feedback by treating ratings greater than 0 as positive feedback (a small illustrative sketch of this conversion follows the dataset list below). Our focus in this work is implicit feedback, but we believe NeuTraL is also applicable to explicit feedback.

Table 1: Datasets

Dataset        #Users   #Items   #Ratings
Movielens 1M    6,040    3,706   1,000,209
Eachmovie      72,916    1,628   2,811,983
Pinterest      55,187    9,916   1,500,809
Goodreads      10,000    5,000     647,458

• Movielens 1M: The Movielens datasets [30] are made publicly available by the GroupLens Research lab at the University of Minnesota. We use the Movielens 1M dataset. The data is extracted from the Movielens website, a free website that provides personalized movie recommendations to users.
• Eachmovie dataset: This dataset [31] is made available by the Digital Equipment Corporation (DEC) Systems Research Center at Compaq. The research center ran a CF service for experimental purposes and made the data available for research.
• Goodreads dataset: This dataset [32] was collected from goodreads.com, a book social network and recommendation website.
• Pinterest dataset: This is a dataset of implicit feedback representing whether a user pinned an image on their board on the Pinterest platform at https://www.pinterest.com.
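The conversion mentioned above is straightforward; a minimal sketch (ours, assuming the ratings are available as (user, item, rating) triples with integer indices) is:

```python
import numpy as np

def to_implicit(ratings, n_users, n_items):
    """Build the binary implicit matrix Y of Eq. (2): any explicit rating
    greater than 0 is treated as positive feedback."""
    Y = np.zeros((n_users, n_items), dtype=np.int8)
    for u, i, r in ratings:
        if r > 0:
            Y[u, i] = 1
    return Y
```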
6.3.2. Benchmarks

We compare our NeuTraL technique with 3 SOTA warm-start personalization systems and a baseline item popularity (IPop) technique. IPop recommends items based on popularity. As noted in Section 6.1, we select model parameters to avoid bias throughout our experiments. The SOTA benchmarks used are described below:

• BPR: Bayesian Personalized Ranking [3], the pairwise MF ranking technique cited in Section 2.
• Multi-objective pairwise ranking (MPR) [33]: MPR is an MTL technique that combines the item ranking and user ranking tasks. It learns from historical preference data from both the item and user ranking perspectives, and was demonstrated to improve item ranking accuracy by learning from both perspectives.
• Neural Collaborative Filtering (NCF) [5]: NCF is an ensemble recommender that combines MF and deep learning. NCF was demonstrated to achieve superior performance compared to other SOTA techniques.

6.3.3. Results

We record the best average results observed during experiments for each dataset and depict them in Tables 2-5. NeuTraL significantly out-performs the other techniques based on a Wilcoxon signed-rank test with a p-value < 0.01. The winning algorithm per metric is emboldened in each row of all tables. We assume a margin of error of 0.005; hence the winning algorithm has to be greater than the next winner by at least a margin of 0.005. All techniques are emboldened in the case of a tie on a metric. Techniques within the margin of error of the highest score are also emboldened.

Table 2: Movielens results on warm-start items

Metric     IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.246   0.409   0.400   0.421   0.437
NDCG@10    0.310   0.485   0.480   0.497   0.515
MRR        0.270   0.424   0.415   0.435   0.451
NDCG       0.417   0.548   0.542   0.557   0.570
AUC        0.853   0.921   0.923   0.924   0.929

Table 3: Pinterest results on warm-start items

Metric     IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.111   0.475   0.465   0.487   0.492
NDCG@10    0.151   0.566   0.559   0.578   0.584
MRR        0.138   0.483   0.475   0.496   0.501
NDCG       0.298   0.600   0.595   0.611   0.615
AUC        0.724   0.947   0.955   0.958   0.960

Table 4: Goodreads (books) results on warm-start items

Metric     IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.087   0.170   0.167   0.239   0.245
NDCG@10    0.114   0.224   0.217   0.302   0.309
MRR        0.112   0.197   0.193   0.262   0.268
NDCG       0.266   0.353   0.348   0.410   0.415
AUC        0.590   0.793   0.770   0.829   0.834

Table 5: Eachmovie results on warm-start items

Metric     IPop    NCF     BPR     MPR     NeuTraL
MRR@10     0.123   0.284   0.261   0.275   0.293
NDCG@10    0.159   0.357   0.329   0.349   0.368
MRR        0.149   0.305   0.284   0.296   0.313
NDCG       0.303   0.449   0.430   0.442   0.456
AUC        0.646   0.861   0.841   0.857   0.862

6.4. Experiments for cold-start ranking

6.5. Datasets

We performed experiments on 3 of the 4 publicly available datasets used for the warm-start experiments in Section 6.3.1. We used the datasets with item attributes, hence their suitability for our experiments. A summary of these datasets is provided in Table 1. The 3 datasets used for the cold-start personalization experiments are highlighted below:

• Movielens 1M: Item attributes in the dataset include release year and genre. The genre attribute is one-hot encoded into 18 dimensions because we have 18 possible genres. The year is an additional dimension.
• Eachmovie dataset: The items/movies in this dataset are a subset of the items in the Movielens dataset; hence we are able to use the same attribute feature engineering as described for Movielens.
• Goodreads dataset: We use the genres as book attributes for cold-start personalization. The genre attribute is one-hot encoded into 10 dimensions, one per possible genre.

6.5.1. Benchmarks

We compare our NeuTraL-C technique with 4 state-of-the-art cold-start personalization systems. NeuTraL-C, DropoutNet and ATM-BPR require pre-training. The benchmarks used are described below:

• Multi-layer perceptron (MLP): The MLP baseline used here predicts output from interactions between user embeddings and item attributes with deep learning. The first hidden layer is the input combination layer that combines the user embedding input and the item attributes. The combination model is the piece-wise (element-wise) product, since this has been demonstrated to outperform concatenation or a dot product [34]. The dot product also does not allow us to assign different weights to the combined nodes. The output from this combination layer is propagated through extra hidden layers. More hidden layers can be added as needed before the final output. A small sketch of this baseline follows this list.
• ATM-BPR: The ATM-BPR technique used as a baseline here is as described in Section 5.1, except that the pre-trained user embedding is extracted from BPR instead of from a CDAE recommender, which is what NeuTraL-C uses.
• DropoutNet: DropoutNet [22] (Addressing Cold Start in Recommender Systems) is a state-of-the-art deep learning based personalization system. DropoutNet is analogous to NeuTraL and ATM-BPR, but it adopts a different transfer learning procedure compared to NeuTraL: DropoutNet transfers a pre-trained shallow model to a deep model, while NeuTraL transfers a pre-trained deep model to a shallow model. We use the MLP model described here as the deep learning model. DropoutNet allows the use of different pre-trained models, but we use pre-trained user latent features from CDAE similar to NeuTraL-C, i.e., the DropoutNet implementation used here is a combination of the extracted user latent factors from CDAE and the MLP. Although DropoutNet is primarily a cold-start recommender, it is expected to perform relatively well on warm-start recommendations with the appropriate dropout rate. We use a maximum input dropout rate of 1.00 for our experiments with DropoutNet to maximize performance on cold start, because that is the focus of this research work. DropoutNet also allows an inference transform, but we do not apply it in our experiments because we do not consider the case of incremental item preference data collection as described in their work. We refer to DropoutNet as D-Net to conserve space in the results tables.
• W&D: Wide & Deep learning for recommender systems (W&D) [19] combines the generalization and memorization capabilities of recommender systems for more robust personalization. The authors used deep learning for its demonstrated superior generalization capability; however, deep learning tends to over-generalize when the input is too sparse and high-rank. On the other hand, generalized linear models are highly capable of memorizing feature interactions through cross-product feature transformations. Hence the combination of a deep learning model and a cross-product (wide) model in W&D for personalization.
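A minimal sketch of the MLP baseline's combination layer is shown below (ours; since the element-wise product requires both inputs to have the same size, we assume the item attributes are first linearly projected to the embedding dimension, which is our assumption rather than a detail stated above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_baseline_score(c_u, a_i, W_in, hidden_Ws, w_out):
    """MLP cold-start baseline: combine the user embedding and the projected item
    attributes with an element-wise product, then pass through extra hidden layers."""
    h = c_u * (W_in @ a_i)        # combination layer: element-wise (piece-wise) product
    for W in hidden_Ws:
        h = sigmoid(W @ h)        # extra hidden layers
    return float(w_out @ h)       # predicted preference score
```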
6.5.2. Evaluation metrics for cold-start

We measured how well a recommender system is able to rank a preferred cold-start item relative to other items. The evaluation is similar to the evaluation for warm-start items. The main difference is the absence of the test items in the training dataset for cold-start personalized ranking.

6.5.3. Results

We record the best results observed during experiments for each dataset and depict them in Tables 6-8. NeuTraL-C performs best overall, and we subsequently discuss the results further. The winning algorithm per metric is emboldened in each row of all tables. We assume a margin of error of 0.005; hence the winning algorithm has to be greater than the next winner by at least a margin of 0.005. All techniques are emboldened in the case of a tie on a metric. Techniques within the margin of error of the highest score are also emboldened.

Table 6: Movielens results on cold-start items

Metric     W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.043   0.050   0.070     0.053   0.083
NDCG@10    0.053   0.059   0.100     0.063   0.117
MRR        0.083   0.089   0.097     0.093   0.109
NDCG       0.244   0.249   0.257     0.252   0.269
AUC        0.604   0.610   0.629     0.617   0.656

Table 7: Goodreads results on cold-start items

Metric     W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.030   0.036   0.057     0.054   0.077
NDCG@10    0.037   0.045   0.088     0.067   0.114
MRR        0.067   0.076   0.083     0.101   0.107
NDCG       0.228   0.238   0.245     0.264   0.271
AUC        0.570   0.603   0.588     0.672   0.689

Table 8: Eachmovie results on cold-start items

Metric     W&D     MLP     ATM-BPR   D-Net   NeuTraL-C
MRR@10     0.031   0.032   0.052     0.032   0.055
NDCG@10    0.037   0.038   0.072     0.038   0.068
MRR        0.065   0.065   0.076     0.065   0.075
NDCG       0.221   0.222   0.232     0.221   0.237
AUC        0.490   0.492   0.507     0.481   0.525

6.6. Discussion

We begin our discussion with the results of the warm-start experiments. We stated that NeuTraL performed best overall because of its highest number of wins, which corresponds to the number of times a technique has the highest score per dataset. We also validated this observation with a significance test. IPop has the worst performance overall. This is not surprising since it is merely a baseline technique that ranks items based on popularity. The ranking produced by IPop is not personalized, as it does not take personal attributes, context or historical preference into account. We expect a decent personalized ranking technique to out-perform IPop. This is the case: the least performing personalized ranking technique is BPR, yet it outperforms IPop. NCF performs better than BPR. This was already demonstrated by the creators of NCF in their research work [5]. NCF combines both deep learning (MLP) and a piecewise product of interactions between user and item embeddings in a generalized matrix factorization (GMF). BPR uses a dot product of user and item embeddings to represent the interactions. The dot product assigns equal weights to the LVPs, as described in Section 4.1, while the GMF component of NCF learns different weights for the LVPs with a neural network. The MLP component of NCF also learns different weights for user and item embedding combinations. This results in a more complex representation of the interactions between users and items and better performance. MPR out-performs NCF; the MTL nature of MPR gives it an advantage. NeuTraL's superior performance buttresses the effectiveness of transfer learning, since it is essentially MPR combined with transfer learning yet it outperforms MPR. We surmise that transfer learning improved the performance of NeuTraL. We also believe that the type of pre-trained model that is transferred is significant. Our experiments here reveal that the extraction mechanism from an autoencoder-based model like CDAE is effective.

We subsequently discuss the results of our experiments on cold-start personalization. We stated that NeuTraL-C performed best overall because of its highest number of wins, which corresponds to the number of times a technique has the highest score per dataset. We also validated this observation with a significance test. ATM-BPR is the next best performing technique. Both ATM-BPR and NeuTraL-C adopt transfer learning. However, NeuTraL-C uses a different pre-trained model: NeuTraL-C uses a pre-trained model extracted from CDAE as described in Section 4.4, while ATM-BPR uses a pre-trained user embedding from BPR. This shows that it is not enough to just apply transfer learning; the meticulousness of the implementation is as important, and the type of pre-trained model is pertinent in such a design. NeuTraL-C and ATM-BPR also differ in how they learn the "mapping function": NeuTraL-C uses MPR while ATM-BPR uses BPR. DropoutNet performs next best to ATM-BPR. DropoutNet also uses transfer learning; we used user embeddings from CDAE in DropoutNet. However, it uses deep learning to learn the interaction between the transferred embedding and the item attributes. The complex nature of DropoutNet deteriorated performance somewhat. For instance, the transferred user embedding is propagated through hidden layers before combination with the item attributes. The output of the hidden layers is a tainted version of the user embedding, and the mapping learned by DropoutNet is between this tainted version and the item attributes. We believe this is the reason for its poorer performance compared to ATM-BPR and NeuTraL-C. It is not too surprising that MLP performed worse than DropoutNet, since it is DropoutNet without transfer learning. Once again, this shows the effectiveness of transfer learning. W&D performed the worst of all the cold-start personalization systems. It does not use transfer learning, and we believe the complexity of deep learning in W&D deteriorated performance due to overfitting.

A common theme throughout our experiments is the benefit of our neural transfer learning approach. We believe that the transferred user embedding is more representative of the users as latent factors compared to the user embedding in the other models. We show a chart of loss minimization in NeuTraL with and without transfer learning on the Movielens data in Figure 3. Figure 3 shows the speed-up achieved with transfer learning in the form of a lower initial loss; it also shows the overall lower loss during training. We know that ATM-BPR and DropoutNet adopt transfer learning as well but are outperformed by NeuTraL. As stated earlier in Section 4.3, dropout is a vital component of CDAE; hence we investigated the effect of dropout during pre-training on the final results. The results show that dropout slightly enhances the effect of the transferred user embedding in NeuTraL.

Figure 3: Effect of transfer learning with NeuTraL-C on the Movielens dataset (training loss, in units of 10^4, per epoch, with and without transfer learning).

7. Conclusion

We presented a novel personalization system based on transfer learning from a state-of-the-art deep personalization system to a linear cold-start personalization model. This system is applicable to warm-start and cold-start items and users. The results of our experiments show the effectiveness of our proposed method, and we discussed the results. Although the results are promising, there is room for future work and improvements. Potential future research work includes the extension of our techniques to user cold-start, full cold-start and warm-start ranking. Other potential future work includes the investigation of additional attributes and the optimal fusion strategy for those attributes. We believe experimentation with more datasets and context attributes such as time and location would also be worthwhile.

References

[1] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 263–272. doi:10.1109/ICDM.2008.22.
[2] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30–37. doi:10.1109/MC.2009.263.
[3] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, AUAI Press, Arlington, Virginia, United States, 2009, pp. 452–461.
[4] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, ACM, New York, NY, USA, 2008, pp. 426–434. doi:10.1145/1401890.1401944.
[5] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, International World Wide Web Conferences Steering Committee, 2017, pp. 173–182. doi:10.1145/3038912.3052569.
[6] Y. Wu, C. DuBois, A. X. Zheng, M. Ester, Collaborative denoising auto-encoders for top-n recommender systems, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM '16, ACM, New York, NY, USA, 2016, pp. 153–162. doi:10.1145/2835776.2835837.
[7] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, AutoRec: Autoencoders meet collaborative filtering, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, ACM, New York, NY, USA, 2015, pp. 111–112. doi:10.1145/2740908.2742726.
[8] Y. Zheng, B. Tang, W. Ding, H. Zhou, A neural autoregressive approach to collaborative filtering, in: Proceedings of the 33rd International Conference on Machine Learning, ICML '16, JMLR.org, 2016, pp. 764–773.
[9] M. Bianchi, F. Cesaro, F. Ciceri, M. Dagrada, A. Gasparin, D. Grattarola, I. Inajjar, A. M. Metelli, L. Cella, Content-based approaches for cold-start job recommendations, in: Proceedings of the Recommender Systems Challenge 2017, RecSys Challenge '17, ACM, New York, NY, USA, 2017, pp. 6:1–6:5. doi:10.1145/3124791.3124793.
[10] A. I. Schein, A. Popescul, L. H. Ungar, D. M. Pennock, Methods and metrics for cold-start recommendations, in: SIGIR '02, 2002.
[11] A. Arampatzis, G. Kalamatianos, Suggesting points-of-interest via content-based, collaborative, and hybrid fusion methods in mobile devices, ACM Trans. Inf. Syst. 36 (2017) 23:1–23:28. doi:10.1145/3125620.
[12] Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, L. Schmidt-Thieme, Learning attribute-to-feature mappings for cold-start recommendations, in: 2010 IEEE International Conference on Data Mining, 2010, pp. 176–185. doi:10.1109/ICDM.2010.129.
[13] A. van den Oord, S. Dieleman, B. Schrauwen, Deep content-based music recommendation, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS '13, Curran Associates Inc., USA, 2013, pp. 2643–2651.
[14] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA, 2016.
[15] T. T. Nguyen, H. W. Lauw, Collaborative topic regression with denoising autoencoder for content and community co-representation, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, ACM, New York, NY, USA, 2017, pp. 2231–2234. doi:10.1145/3132847.3133128.
[16] H. Wang, N. Wang, D.-Y. Yeung, Collaborative deep learning for recommender systems, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, ACM, New York, NY, USA, 2015, pp. 1235–1244. doi:10.1145/2783258.2783273.
[17] G. Sottocornola, F. Stella, M. Zanker, F. Canonaco, Towards a deep learning model for hybrid recommendation, in: Proceedings of the International Conference on Web Intelligence, WI '17, ACM, New York, NY, USA, 2017, pp. 1260–1264. doi:10.1145/3106426.3110321.
[18] W. Niu, J. Caverlee, H. Lu, Neural personalized ranking for image recommendation, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 423–431. doi:10.1145/3159652.3159728.
[19] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, Association for Computing Machinery, New York, NY, USA, 2016, pp. 7–10. doi:10.1145/2988450.2988454.
[20] Y. Zhu, J. Lin, S. He, B. Wang, Z. Guan, H. Liu, D. Cai, Addressing the item cold-start problem by attribute-driven active learning, IEEE Transactions on Knowledge and Data Engineering 32 (2020) 631–644.
[21] M. Yan, J. Sang, T. Mei, C. Xu, Friend transfer: Cold-start friend recommendation with cross-platform transfer learning of social knowledge, in: 2013 IEEE International Conference on Multimedia and Expo (ICME), 2013, pp. 1–6.
[22] M. Volkovs, G. Yu, T. Poutanen, DropoutNet: Addressing cold start in recommender systems, in: Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 4957–4966.
[23] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in: Proceedings of the 22nd International Conference on Machine Learning, ICML '05, ACM, New York, NY, USA, 2005, pp. 89–96. doi:10.1145/1102351.1102363.
[24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014).
[25] L. Torrey, J. Shavlik, Transfer learning, 2009.
[26] A. Quattoni, Transfer learning algorithms for image classification, Ph.D. thesis, 2009.
[27] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for NLP, arXiv preprint arXiv:1902.00751 (2019).
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[29] C. M. Bishop, Training with noise is equivalent to Tikhonov regularization, Neural Computation 7 (1995) 108–116.
[30] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015) 19:1–19:19. doi:10.1145/2827872.
[31] P. McJones, EachMovie collaborative filtering dataset, DEC Systems Research Center, http://www.research.compaq.com/src/eachmovie/, 1997.
[32] M. Wan, J. J. McAuley, Item recommendation on monotonic behavior chains, in: Proceedings of the 12th ACM Conference on Recommender Systems, RecSys '18, ACM, 2018, pp. 86–94. doi:10.1145/3240323.3240369.
[33] R. Otunba, R. A. Rufai, J. Lin, MPR: Multi-objective pairwise ranking, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 170–178. doi:10.1145/3109859.3109903.
[34] R. Otunba, R. A. Rufai, J. Lin, Deep stacked ensemble recommender, in: Proceedings of the 31st International Conference on Scientific and Statistical Database Management, SSDBM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 197–201. doi:10.1145/3335783.3335809.