On the Transferability of Deep Neural Networks
          for Recommender System

    Duc Nguyen1 , Hao Niu1 , Kei Yonekawa1 , Mori Kurokawa1 , Chihiro Ono1 ,
          Daichi Amagata2 , Takuya Maekawa2 , and Takahiro Hara2
                           1 KDDI Research Inc., Japan
      {du-nguyen, ha-niu, ke-yonekawa, mo-kurokawa, ono}@kddi-research.jp
                            2 Osaka University, Japan
               {amagata.daichi, maekawa, hara}@ist.osaka-u.ac.jp



        Abstract. Recommender systems are an essential component in many
        practical applications and services. Recently, significant progress has
        been made in improving the performance of recommender systems using
        deep learning. However, current recommender systems suffer from the long-
        standing data sparsity problem, especially in domains with little data.
        With its ability to transfer knowledge across domains, transfer learn-
        ing is a potential approach to deal with the data sparsity problem in
        recommender systems. In this paper, we carry out an investigation on
        the transferability of deep neural networks for recommender systems. We
        show that network-based transfer learning can improve recommendation
        performance on target domains by up to 20%. In addition, our investiga-
        tion reveals that transferring the layers close to the output leads to better
        transfer performance. The transfer performance is also found to de-
        pend on the similarities between the data distributions of the source and
        target domains. Meanwhile, target domain characteristics such as size
        and sparsity have little impact on the transfer performance.


Keywords: Transfer Learning, Recommender System, Neural Networks

© 2020 for this paper by its authors. Use permitted under CC BY 4.0.


1     Introduction
With the explosive growth of information available on the Internet, it is challeng-
ing for users to find their desired products/services. Thus, recommender systems
(RSs) play a central role in enhancing user experience, especially in online news
services, E-commerce websites, and online advertising [24]. The main task of RSs
is to provide suggestions for items (e.g., news, books, movies, event tickets, etc.)
to individual users. RSs enable the so-called personalized experience, which is the
key to the success of many Internet companies such as Amazon [28] and Netflix [8].
    Starting with the Netflix Prize [3], significant progress has been made in
recommender system research [33]. The past few years have also witnessed the
great success of deep learning in many application domains, especially in com-
puter vision and natural language processing [15]. Following this trend, deep
learning has been studied extensively for recommender systems as well, such


as in [1, 4, 10, 13, 25, 26, 34]. Although these deep learning-based methods are
effective in improving the performance of recommender systems, they are mostly
based on information (e.g., ratings, reviews) in a single domain. As a result,
these methods inevitably suffer from the data sparsity problem because each
item is usually rated or reviewed by only a few users [24]. Moreover, current applica-
tions should be able to react quickly to new situations such as new products or
new users. Therefore, techniques to reuse knowledge across time, domains, and
tasks are highly desirable.
    Transfer learning is a machine learning technique capable of transferring
knowledge learned in a domain (source domain) to another related domain (tar-
get domain) [22]. Thus, it can be used to deal with the data sparsity problem in
recommender systems as well as to increase the system's ability to adapt to new situ-
ations. Existing works on transfer learning for recommender systems apply either
instance-based [5, 7, 16, 23] or feature-based [19, 35] approaches, in which data
samples/features from one or more source domains are transferred to a target
domain. One of the main problems of instance-based and feature-based transfers
is that they require access to data of other source domains. In other words, data
sharing between domains is necessary. Nevertheless, inter-domain data sharing
has become more and more difficult nowadays due to data regulations such as
GDPR [29], especially if the shared data contains user-relevant information.
    To improve the performance of recommender systems, it is still desirable to
be able to transfer knowledge across domains even if shared data is not avail-
able. In such circumstances, network-based transfer learning, which transfers
features of a model (e.g., parameters, structure, etc.) learned on a source domain,
is a potential approach. Although network-based transfer has been studied ex-
tensively in the literature, previous works mainly focus either on computer vi-
sion [9, 14, 21, 27, 30, 32] or natural language processing [6, 11, 12, 18, 31]. In the context
of recommender systems, despite the fact that deep neural network-based models
have shown their superiority, there is still no existing work on the transferability
of those deep neural networks.
    In this paper, we focus on answering the following three questions in order
to understand the transferability of neural networks for recommender systems.

 – Q1: Does network-based transfer learning lead to better recommendation
   performance on the target domain?
 – Q2: How to transfer a neural network for the best transfer performance?
 – Q3: What are the factors affecting the transfer performance?

Although network-based transfer learning has been found to be effective in many
computer vision and natural language processing (NLP) tasks, there is still a lack
of understanding on the transferability of neural networks for recommender tasks
(i.e., Q1). In computer vision and NLP tasks, those layers close to the input
are found to be highly transferable, whereas those close to the output are task-
specific [21]. Yet, it is still unknown which layers can be effectively transferred in
recommendation tasks (i.e., Q2). It is also important to understand how different
factors affect the transfer performance (i.e., Q3).

    In this paper, we investigate the transferability of deep neural networks for
recommender systems, focusing on the top-N item recommendation task. For that
purpose, a recommender system built on a Multi-layer Perceptron (MLP) neural
network is used as the base network. The base network consists of an embedding
layer and an interaction function consisting of multiple fully connected layers.
Then, we examine various options to transfer the knowledge of the base net-
work to a target domain. Extensive evaluation with eighteen real-world datasets
demonstrates that transferring the interaction function layers can improve rec-
ommendation performance on the target domain by up to 20%. Notably, our
evaluation reveals that, unlike in deep neural networks for computer vision and
NLP tasks, the layers close to the output are more transferable than those
close to the input in deep neural networks for recommender systems. To the best
of our knowledge, this is the first work on the transferability of deep neural networks
for recommender systems.
    The remainder of the paper is organized as follows. Section 2 surveys related
work. The base network and transfer options are described in Section 3. The
evaluation is given in Section 4. Finally, the paper is concluded in Section 5.


2   Related Work

In recent years, deep learning-based methods have been studied extensively for
recommender systems. These methods mainly focus on replacing one or more
components in conventional methods by deep neural networks. For instance,
in [10], instead of using the dot product as in traditional matrix factorization [2],
the interaction function is learned by an MLP network. In [25, 36], an Autoencoder
is utilized to learn the user/item embeddings. In [1], a Gated Recurrent Unit is
used to exploit the order of words in sentences, which is shown to outperform a
simple average of word embeddings for text recommendation. Other deep neural
network architectures such as the Generative Adversarial Network (GAN) [4] and
the Attention Model [26] have also been used in recommender systems. A comprehen-
sive survey of deep learning-based methods can be found in [33]. In this paper,
we adopt the MLP as the base network due to its simplicity. Other deep neural
networks will be studied in our future work.
    In the literature, transfer learning has been used to tackle the data sparsity
problem in recommender systems. Most transfer learning methods in previous
studies are either instance-based or feature-based. In [5], training samples of a
source domain are directly used to train the recommendation model at the target
domain. In [16], users/items in a source domain are clustered to construct a
codebook, which is then transferred to a target domain. In [19,35], the user/item
feature vectors learned on a source domain are transferred to the target domain
by means of a mapping function. Some other studies leverage multi-task learning
to enable dual knowledge transfer across domains such as [13, 34]. However,
instance-/feature-based transfers and multi-task learning require sharing data
between domains. In contrast, our work focuses on network-based transfer, and
thus does not require data sharing across domains. Such a property is especially




      Fig. 1: Base network architecture for top-N item recommendation task.


important considering the fact that more data regulations are being imposed on user
data [29].
    Network-based transfer learning has been studied in contexts of computer
vision and natural language processing research. In [32], it is found that the first
layers of a Convolutional Neural Network (CNN) are highly transferable. Follow-up
works led to the development of various transfer techniques for the image
classification task. In [21], the output layer of a pre-trained CNN is replaced
by an adaptation layer, while the remaining layers are transferred to the tar-
get domain. In [9], only the convolutional layers are transferred, while all the
fully-connected layers of CNN networks are fine-tuned with learning rate deter-
mined by Bayesian Optimization. A recent evaluation [14] found that there is
a strong correlation between ImageNet accuracy and transfer accuracy among
popular image classification networks. To improve the performance of factoid
question answering (QA) tasks on small datasets, the model parameters trained
on a large dataset are used to initialize the target model’s weights, with a mod-
ified loss function to avoid catastrophic forgetting [31]. In [12], universal
language model fine-tuning (ULMFiT) is presented, featuring discriminative
fine-tuning, slanted triangular learning rates, and gradual unfreezing. In [11],
adapter-based parameter-efficient transfer learning for NLP is proposed.


3     Network-based Transfer Learning
In this section, the top-N item recommendation task is defined and a neural
network-based approach is introduced. Then, we describe how to transfer a
pre-trained network from a source domain to a target domain.

3.1    Top-N item recommendation task
Along with rating prediction [3], top-N item recommendation is one of the most
important tasks in recommender systems. Suppose that we need to recommend
N items to individual users of a particular domain (e.g., online book stores,

e-commerce websites). Let U and I respectively denote the sets of users and
items. We define the variables {Rui } to represent user-item interactions as fol-
lows.

        R_{ui} = \begin{cases} 1 & \text{if user } u \text{ has interacted with item } i \\ 0 & \text{otherwise} \end{cases}

Here, an interaction can be a purchase or rating of the item, a click on the item’s
advertisement, or a visit to the item’s website. In this paper, we assume that
only implicit feedback is available. Thus, user-item interactions are represented
by binary values. The set of items that a user u has interacted with in the past is
denoted by Iu , i.e., Iu = {i|Rui = 1}. The top-N item recommendation problem
can be formulated as follows.
    For a user u ∈ U, determine the N items {i1 , i2 , ..., iN } ∈ I \ Iu that the
user u is most likely to interact with.
    Existing methods for the top-N item recommendation task can be classified into
two main groups, namely content-based and collaborative filtering (CF). Content-
based methods simply calculate the similarity between candidate items and the
items the user has interacted with, then select the top-N items with the highest
similarity scores. On the other hand, collaborative filtering predicts the interaction
score by using preferences from many users. In this paper, we focus on model-
based CF to predict the value Rui for every item i ∈ I \ Iu . The model is built
on top of a neural network and will be described in the next section.

3.2   Neural Network Model
In this paper, we follow the NeuMF framework proposed in [10] to build the
base network as follows. Each user/item is characterized by a latent vector or
embedding. The user-item interactions are modeled by an interaction function.
Similar to [10], the interaction function is a Multi-layer Perceptron network,
which is learned during training.
   Figure 1 shows the architecture of the base network used in this paper.
As aforementioned, each user u ∈ U is characterized by an embedding vector
pu ∈ Rdu , where du is the user embedding size. Similarly, each item i ∈ I is
mapped to an item embedding vector qi ∈ Rdi where di is the length of the item
embedding vectors. In this paper, we assume that the user and item embeddings
have the same size, i.e., du = di . Given an interaction between user u and item i,
the corresponding user and item embeddings are aggregated by the aggregator,
forming Xui , which is the input of the interaction function. In this paper, the
aggregator simply concatenates the user and item embedding vectors as follows.
                                  Xui = [pu , qi ]                             (1)
The interaction function consists of K fully connected layers FC-k (1 ≤ k ≤ K).
Let sk denote the size of layer FC-k. The output yk ∈ Rsk of layer FC-k (1 ≤
k ≤ K) is given by,
        y_k = \begin{cases} f_k(X_{ui} W_k + b_k) & \text{if } k = 1 \\ f_k(y_{k-1} W_k + b_k) & \text{if } k > 1 \end{cases}    (2)

where fk , Wk ∈ Rsk−1 ×sk , and bk ∈ Rsk respectively denote the activation
function, weight, and bias of layer FC-k. The outermost FC layer (i.e., FC-K) is
also referred to as output layer. The base network parameter set θ includes the
user and item embeddings and the layers’ weights and biases.

                    θ = {{pu }u∈U , {qi }i∈I , {Wk , bk }1≤k≤K }              (3)

The parameter set θ is learned so as to minimize a loss L, which is a function of
the predicted interactions and the actual ones.

        \min_{\theta} \frac{1}{|U| \times |I|} \sum_{(u,i)} L(R_{ui}, \tilde{R}_{ui})                      (4)

where \tilde{R}_{ui} is the predicted interaction of user u and item i. In this
paper, since the interaction values are binary, we adopt the binary cross-entropy
loss function.
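
To make the base network concrete, the following is a minimal sketch of the architecture described above. The framework (PyTorch), the sigmoid prediction head, and all names are our illustrative assumptions; the original work does not specify an implementation.

    import torch
    import torch.nn as nn

    class BaseMLPRecommender(nn.Module):
        """Sketch of the base network: embeddings + MLP interaction function."""

        def __init__(self, n_users, n_items, emb_size=32, fc_sizes=(64, 32, 16, 8)):
            super().__init__()
            # User/item embeddings p_u and q_i (domain-specific, non-transferable).
            self.user_emb = nn.Embedding(n_users, emb_size)
            self.item_emb = nn.Embedding(n_items, emb_size)
            # Interaction function: fully connected layers FC-1 .. FC-K (Eq. (2)).
            layers, in_size = [], 2 * emb_size  # aggregator concatenates p_u and q_i
            for out_size in fc_sizes:
                layers += [nn.Linear(in_size, out_size), nn.ReLU()]
                in_size = out_size
            self.interaction = nn.Sequential(*layers)
            # Scalar prediction head with sigmoid (assumed; the paper only states
            # that a binary cross-entropy loss is applied to the prediction).
            self.predict = nn.Linear(in_size, 1)

        def forward(self, user_ids, item_ids):
            x_ui = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
            return torch.sigmoid(self.predict(self.interaction(x_ui))).squeeze(-1)

    # Training objective as in Eq. (4), with binary cross-entropy:
    # loss = nn.BCELoss()(model(user_batch, item_batch), labels.float())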

3.3   Network-based Transfer Learning Mechanism
In this paper, we are interested in the transferability of deep neural networks
learned on a source domain to improve performance on a target domain. As
aforementioned, we assume that data sharing is not available, so instance-
based transfer is not applicable because it requires transferring data instances
from the source domain to the target domain. Feature-based transfer (e.g., [19])
requires prior knowledge of shared users/items, which is unknown in this case,
and so cannot be applied. Thus, a network-based transfer approach [23] is used.
Given a target domain DT and a learning task T , the goal here is to improve
performance on DT by transferring knowledge of the pre-trained network learned
on a source domain DS .
    The key assumption of network-based transfer approach is that the neural
networks of the source and target domains should share some parameters. Let θS
and θT respectively denote the parameter sets of the source and target networks.
Then, each parameter set can be decomposed into two sub-sets, one containing
shared parameters (i.e., θ0 ) and another containing domain-specific parameters
(i.e., vS and vT ) as follows.

                                  θ S = θ 0 ∪ vS                              (5)

                                  θ T = θ 0 ∪ vT                              (6)
The common parameters θ0 are learned on the source domain and then trans-
ferred to the target domain. During training at the target domain, the common
(transferred) parameters are frozen, whereas domain-specific parameters (vT )
are learned.
    Since user/item linkages are not allowed in our problem setting, the user
and item embedding vectors are non-transferable, and so they are in the domain-
specific parameter set vT . Transferable parameters consist of the weights and


      Table 1: Transfer settings of fully-connected layers of the base network.
                          Setting Layers to transfer
                          Config-1 FC-1, FC-2, FC-3, FC-4
                          Config-2   FC-1, FC-2, FC-3
                          Config-3       FC-1, FC-2
                          Config-4         FC-1
                          Config-5   FC-2, FC-3, FC-4
                          Config-6       FC-3, FC-4
                          Config-7         FC-4



biases of individual fully-connected (FC) layers of the interaction function. In
this paper, we perform transfer on a layer basis, in which all parameters of a given
layer are transferred as a whole. More fine-grained transfer options are reserved for
our future work. We consider different transfer configurations as will be described
in the next section.
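
As an illustration of this layer-wise transfer, the sketch below copies the chosen FC layers from a pre-trained source network into a target network and freezes them, using Config-6 (FC-3 and FC-4, see Table 1) as an example. It assumes the BaseMLPRecommender sketch from Section 3.2 and is not the authors' implementation.

    def transfer_layers(source_model, target_model, layers_to_transfer=(3, 4)):
        """Copy and freeze the shared parameters theta_0 of selected FC layers."""
        src_params = dict(source_model.interaction.named_parameters())
        for k in layers_to_transfer:
            # In the sketch above, nn.Sequential stores FC-k at index 2*(k-1)
            # because every Linear layer is followed by a ReLU.
            idx = 2 * (k - 1)
            layer = target_model.interaction[idx]
            layer.weight.data.copy_(src_params[f"{idx}.weight"].data)
            layer.bias.data.copy_(src_params[f"{idx}.bias"].data)
            # Freeze the transferred layer; only domain-specific parameters v_T
            # (embeddings and the remaining layers) are trained on the target domain.
            layer.weight.requires_grad = False
            layer.bias.requires_grad = False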


4     Evaluation

4.1     Experiment Setup

Base Neural Network Parameters The user and item embedding sizes are
both set to 32. The interaction function consists of K = 4 fully connected layers
with the sizes of 64, 32, 16, and 8. It should be noted that the size of the first
hidden layer of the interaction function network is equal to the sum of the user
and item embedding sizes. We compare performance in terms of Hit Ratio (HR)
with a baseline in which the base network is trained from scratch using only
data in the target domain. For both the transfer options and the baseline, the Adam
optimizer is used. The learning rate is set to 0.001. The batch size is 256. The
number of epochs is 100. For each method/option, we run the experiment ten
times and report the average values.
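
For reference, an illustrative optimizer setup matching these hyper-parameters might look as follows (PyTorch and all names assumed, reusing the sketch from Section 3.2); after transfer, only the non-frozen, domain-specific parameters are optimized.

    import torch

    # Illustrative target-domain model and optimizer (hyper-parameters from the text;
    # the user/item counts are placeholders).
    model = BaseMLPRecommender(n_users=1000, n_items=5000)
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad),  # skip frozen layers
        lr=0.001)
    BATCH_SIZE, EPOCHS = 256, 100
    criterion = torch.nn.BCELoss()  # binary cross-entropy, as in Eq. (4)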


Transfer Configurations To investigate the transfer learning performance,
we consider seven transfer configurations of the base neural network as shown
in Table 1. The configurations differ based on which fully-connected layers are
being transferred. It should be noted that the user/item embeddings are not
transferable.


Evaluation Protocol To evaluate the proposed method, we follow the leave-
one-out evaluation protocol [10]. Specifically, for each user, a test item is randomly
chosen among the items that the user has interacted with. In addition, 99 neg-
ative items, which the user has not interacted with, are randomly selected.
The predicted scores for the test and negative items are calculated. Then, the
test item is ranked against the negative ones based on the predicted scores. The

performance metric of hit ratio (HR) is computed as follows. Let hu denote the
hit position (rank) of the test item of user u against the negative items. HR@N
is defined as:
        HR@N = \frac{1}{|U|} \sum_{u \in U} \max(0, 1 - \lfloor h_u / (N + 1) \rfloor)              (7)

Here, ⌊·⌋ is the floor function. The HR lies in the range [0, 1], where a higher value
indicates better performance. In this paper, we use HR@10 as the performance
metric.
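
A small sketch of this metric computation is given below (names and data layout are illustrative assumptions): for each user, the held-out test item is ranked against its 99 sampled negatives, and a hit is counted if it falls within the top N.

    import numpy as np

    def hit_ratio_at_n(test_scores, negative_scores, n=10):
        """HR@N as in Eq. (7): fraction of users whose test item ranks in the top N."""
        hits = []
        for s_test, s_negs in zip(test_scores, negative_scores):
            # Hit position h_u of the test item among the 100 candidates (1 = best).
            h_u = 1 + int(np.sum(np.asarray(s_negs) > s_test))
            # Equivalent to max(0, 1 - floor(h_u / (N + 1))).
            hits.append(1.0 if h_u <= n else 0.0)
        return float(np.mean(hits))

    # Example: one user whose test item is outranked by 2 of its 99 negatives.
    # hit_ratio_at_n([0.8], [[0.9, 0.85] + [0.1] * 97]) -> 1.0 (rank 3 <= 10)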


Datasets In our evaluation, eighteen real-world datasets from the Amazon Review
database [20] are used. The original datasets are preprocessed by removing users
and items with fewer than 20 interactions. Statistics of all datasets are shown in
Table 2. We train the base network from scratch by randomly initializing its
weights and evaluate the performance on each dataset (i.e., the baseline). As can
be seen in Table 2, the three datasets of Book, Movie, and Kindle have the
highest recommendation performance. Thus, those datasets are chosen as the
source domains. The remaining fifteen datasets are taken as target domains.



Table 2: Statistics and baseline (non-transfer) performance of eighteen datasets
used in our experiment. The three datasets of Book, Movie, Kindle are taken as
source domains.
      ID   Dataset    #users #items #ratings sparsity (%) baseline (HR@10)
       1    Book      46276 148785 2453521        99.96           0.66
       2    Movie      8396 25839 449685          99.79           0.64
       3    Kindle    13742 21883 566622          99.81           0.61
       4    Sport      2826 14016 109229          99.72           0.25
       5  Clothing 11975 69009 743040             99.91           0.18
       6      CD       5019 12847 193170          99.70           0.49
       7      Pet      2153 7591    95753         99.41           0.26
       8 DigitalMusic 230    2116    7146         98.53           0.20
       9    Home       2002 9962    73672         99.63           0.13
      10     Toy       2181 9577    66321         99.68           0.34
      11 Videogame 514       1996   16496         98.39           0.31
      12      Art       531  3492   17804         99.04           0.27
      13 Automotive 2196 13435 87418              99.70           0.15
      14 Cellphone      146  1726    3813         98.49           0.21
      15     Food      1451 6134    59899         99.33           0.21
      16 Instrument 173      1271    7185         96.73           0.17
      17    Office      543  2768   19309         98.72           0.29
      18   Garden       213  1877    5957         98.51           0.20




Table 3: Performance gain (%) of different transfer configurations compared to
the baseline (non-transfer) of individual target domains when Book is the source
domain. A positive (negative) value means positive (negative) transfer. The last
column shows the best configuration and the corresponding HR@10.
Target domain config-1 config-2 config-3 config-4 config-5 config-6 config-7 best config (HR@10)
    Sport       -11.40   -32.20   -30.68   -13.15    2.64    3.17     1.44   config-6 (0.255)
   Clothing     19.90    -14.81   -19.21    -9.99    4.59     4.26    0.61   config-1 (0.213)
      CD         -1.78   -26.51   -28.66   -11.96    -2.15   0.14    -1.03   config-6 (0.489)
      Pet        -4.41   -26.28   -24.14    -9.26    0.02    -0.65   -0.30   config-5 (0.255)
 DigitalMusic   -10.74   -11.92   -11.49    -6.17   -0.43    -6.06   -9.53   config-5 (0.200)
    Home          1.80   -11.67   -12.68    -4.10   10.89     8.59    3.81   config-5 (0.143)
     Toy         -8.31   -25.06   -23.68    -9.18    -0.88   -1.47    0.38   config-7 (0.342)
  Videogame      -0.43   -28.77   -27.38   -11.52    0.97     1.07    1.58   config-7 (0.312)
      Art         2.80   -20.51   -26.08    -4.78    6.47    9.05     8.48   config-6 (0.291)
 Automotive      -4.59   -17.01   -19.35    -5.25    1.85    2.74     0.24   config-6 (0.158)
  Cellphone     -19.85   -10.03   -21.30   -17.75   -12.58   -7.75   -9.36   config-6 (0.196)
     Food         0.35   -26.52   -22.33    -9.76    4.51     0.30   -1.15   config-5 (0.218)
 Instrument     -13.61    -7.38   -10.14    0.66     -3.24   -4.63   -7.38   config-4 (0.168)
    Office      -0.52    -13.22   -16.68    -5.52    -2.98   -0.80   -2.47   config-1 (0.268)
   Garden        -8.29   -14.23   -16.00    -6.65    -4.51   -7.52   -0.80   config-7 (0.201)




Table 4: Performance gain (%) of different transfer configurations compared to
the baseline (non-transfer) of individual target domains when Movie is the source
domain.
Target domain config-1 config-2 config-3 config-4 config-5 config-6 config-7 best config (HR@10)
    Sport        0.58    -17.34   -14.62   -12.76   3.10      2.35    0.71   config-5 (0.254)
   Clothing     15.42    -15.49    -9.38   -12.76   7.90      3.29   -1.35   config-1 (0.205)
      CD        -0.32    -16.68   -12.33   -10.37   -1.25    1.65    -0.15   config-6 (0.497)
      Pet        1.90    -12.69   -10.25   -10.65   -0.02    -0.72   -1.80   config-1 (0.260)
 DigitalMusic   -0.86     -3.46    -3.81    -4.73   -2.92    -8.66   -8.66   config-1 (0.199)
    Home        17.58     -3.11    -2.68    -1.79   14.24     4.90   -0.04   config-1 (0.151)
     Toy         0.89    -14.11    -8.66    -9.88   -1.41    -2.18   -0.74   config-1 (0.344)
  Videogame      8.18     -9.63    -9.19   -10.58   1.07      3.61    2.28   config-1 (0.332)
      Art        5.95     -9.08    -8.66    -8.30   4.39      5.24   6.22    config-7 (0.284)
 Automotive      8.24    -14.03    -8.84    -7.96   4.82      2.98    0.57   config-1 (0.166)
  Cellphone      6.72     -3.55     4.52    -0.06   -9.38    -5.48   -5.49   config-1 (0.227)
     Food       10.70    -10.62   -12.57    -4.30   7.91      3.10   -0.96   config-1 (0.231)
 Instrument      7.41     -1.24     1.15    -1.45   2.53     -7.38   -8.77   config-1 (0.180)
    Office       5.15     -4.77    -6.75    -1.83   -0.17    -1.25   -1.76   config-1 (0.303)
   Garden        4.74     -1.81    -4.28    -7.05   -3.43    4.75     0.81   config-6 (0.213)


Table 5: Performance gain (%) of different transfer configurations compared
to the baseline (non-transfer) of individual target domains when Kindle is the
source domain.
Target domain config-1 config-2 config-3 config-4 config-5 config-6 config-7 best config (HR@10)
    Sport        -6.04   -16.91   -22.22   -10.57    3.89    2.25    1.35    config-5 (0.256)
   Clothing      6.22    -18.22   -16.46   -10.86    2.22    2.09    0.50    config-1 (0.189)
      CD         -5.47   -15.68   -19.36    -9.99    1.56    0.96    0.15    config-5 (0.496)
      Pet        -3.88   -14.75   -15.53   -10.37    2.73    0.05   -0.96    config-5 (0.262)
 DigitalMusic   -11.27    -7.58    -7.80    -7.58   -1.07   -8.88   -5.62    config-5 (0.198)
    Home          2.30    -3.97    -4.59    -2.72   15.29    3.70    1.05    config-5 (0.148)
     Toy         -9.69   -13.63   -16.36    -8.00   -1.40   -0.55   -0.73    config-6 (0.339)
  Videogame       0.56    -7.08   -17.78    -9.38    6.15    5.20    1.07    config-5 (0.326)
      Art         5.54    -6.89   -15.36    -2.22   11.49    7.14    6.08    config-5 (0.298)
 Automotive      -0.64    -8.28   -10.85    -4.91    7.49    0.42    0.79    config-5 (0.165)
  Cellphone      -6.11    -2.01    -8.87   -12.09   -6.45   1.29    -4.84    config-6 (0.215)
     Food         2.43   -16.48   -13.35    -6.42    8.67    3.53    0.40    config-5 (0.227)
 Instrument       1.07    -4.11    -6.26    -3.66    1.53   -2.20   -3.58    config-5 (0.170)
    Office       3.76     -4.48    -9.84    -5.21    1.11   -1.57   -0.43    config-1 (0.299)
   Garden       -10.70   -12.43    -7.52    -8.38   -6.94   -0.35   0.35     config-7 (0.204)



4.2   Evaluation Results
In the first part of our experiment, we aim to answer the first and second ques-
tions regarding the transferability of the base network, namely Q1: Does transfer
learning lead to better recommendation on the target domain? and Q2: How to
transfer a neural network for the best transfer performance? Tables 3, 4, and
5 show the gains of the seven transfer configurations compared to the baseline
(non-transfer method) of individual target domains when the source domain is
Book, Movie, and Kindle, respectively. A positive (negative) value indicates pos-
itive (negative) transfer. The last column of each table shows the configuration
with the highest gain and the corresponding HR@10.
    It can be seen that, for all three source domains, transferring the neural net-
work can improve the performance on most target domains. Among the fifteen
target domains, fourteen benefit from transferring from at least one source
domain. In particular, the number of target domains with positive transfer is
11, 14, and 13 when the source domain is Book, Movie, and Kindle, respectively.
Transferring can improve the Hit Ratio on the target domain by up to 20%
from the Book domain, up to 17% from the Movie domain, and up to 15% from
the Kindle domain. There are 10 target domains in which positive transfer
occurs with all three source domains, namely Automotive, Home, Food, Art,
Clothing, CD, Pet, Sport, Videogame, and Instrument. For the domains where
negative transfer occurs, the Hit Ratio is reduced by 1-8% (Book), 1-4% (Movie),
and 1-5% (Kindle) compared to the baseline method. For the DigitalMusic do-
main, transfer learning always causes performance degradation compared to the
baseline for all three source domains.




  Fig. 2: Importance of individual layers of the source domain model (Book).




    It can also be noted that the best transfer configuration varies across target
domains and source domains. When Book is the source domain, the best trans-
fer configurations for the Clothing and Instrument domains are Config-1 and
Config-4, respectively. For the four domains of Sport, CD, Art, and Automotive,
Config-6 yields the highest gains. Notably, Config-2 and Config-3 are in no case
the best. Config-4 leads to negative transfer with all target domains except for
Instrument. Generally, the performance of those three configurations is 10-30%
lower than that of the baseline.
    When transferring from the Movie domain (i.e., Table 4), Config-1 is the
best configuration for ten target domains. The Config-5, Config-6, and Config-7
configurations are each the best for only one target domain.
Again, it can be seen that the Config-2, Config-3, and Config-4 configurations
result in negative transfer for all target domains. As can be seen in Table 5,
Config-5 is the best configuration for most target domains when Kindle is the
source domain. For the two domains of Clothing and Office, Config-1 achieves the
highest gains. Again, it can be seen that the Config-2, Config-3, and Config-4
configurations cause negative transfer in all target domains.
    To understand the importance of individual layers, we follow the method
proposed in [17] to calculate the importance of individual fully connected (FC)
layers. Specifically, to evaluate the importance of a neuron, the log-likelihoods of
the correct label with and without the presence of the neuron are compared, and
the importance is calculated. Fig. 2 shows the importance values of individual
neurons of different FC layers. It can be seen that the layers close to the output
are generally more important than those close to the input. This result may be
a hint to explain why the three configurations of Config-2, Config-3, and Config-
4, in which FC-4 is not transferred, perform worse than the other configurations.
This issue will be studied further in our future work.
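
The sketch below illustrates this erasure-based importance measure in the spirit of [17] (the hook-based masking and all names are our assumptions, not the original procedure): the importance of a hidden unit is the drop in the average log-likelihood of the observed interactions when its activation is zeroed out.

    import torch

    def unit_importance(model, layer, users, items, labels):
        """Importance of each unit of `layer` via activation erasure."""
        def avg_log_likelihood(masked_unit=None):
            handle = None
            if masked_unit is not None:
                def zero_unit(_module, _inputs, output):
                    output[:, masked_unit] = 0.0  # erase one hidden unit
                    return output
                handle = layer.register_forward_hook(zero_unit)
            with torch.no_grad():
                p = model(users, items).clamp(1e-6, 1 - 1e-6)
                ll = (labels * p.log() + (1 - labels) * (1 - p).log()).mean().item()
            if handle is not None:
                handle.remove()
            return ll

        base = avg_log_likelihood()
        return [base - avg_log_likelihood(j) for j in range(layer.out_features)]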
    From the above results, we make the following remarks regarding the
transferability of neural networks for recommender systems.



                                (a) Source: Book




                                (b) Source: Movie




                               (c) Source: Kindle

Fig. 3: Relationship between KLD values and the gains of the Config-1, Config-
5, Config-6, and Config-7 configurations for different source domain a) Book, b)
Movie, and c) Kindle.


 – Transferring the pre-trained network from three source domains of Book,
   Movie, and Kindle can improve the recommendation performance on most
   of the target domains.
 – For a given source domain, different target domains require different transfer
   configurations. Especially, Config-1 is preferable when Movie is the source
    domain, whereas Config-5 achieves the highest gains for the largest number of
   target domains when Kindle is the source domain.
 – Config-2, Config-3, and Config-4 always lead to negative transfer. This in-
    dicates that the transferred part of the source model should contain the layers
    close to the output, as in the cases of Config-1, Config-5, Config-6, and Config-7.
    In the second part of our experiment, we investigate how different factors
affect the transfer performance, i.e., Question Q3. It is well known that transfer

learning is more effective if the source and target domains are related [32]. Thus,
we first examine the impact of the relatedness between a source domain and a
target domain on the transfer performance. In this paper, we use the similarity
between data distributions of the source and target domains to measure the
relatedness. For that purpose, we first calculate the histogram of the number of
purchases per user Hu of individual domains. Then, we use the KL-Divergence
(KLD) to measure the relatedness R(DS , DT ) between a source domain DS and
a target domain DT as follows.

                          R(DS , DT ) = KLD(Hu (DS ), Hu (DT ))                                   (8)
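
A possible implementation of this relatedness measure is sketched below (the bin edges, smoothing constant, and names are illustrative assumptions): the per-user interaction counts of the two domains are binned into normalized histograms, and their KL divergence is computed.

    import numpy as np
    from scipy.stats import entropy

    def relatedness(purchases_per_user_src, purchases_per_user_tgt, bins=50):
        """KLD(H_u(D_S) || H_u(D_T)) over per-user purchase-count histograms (Eq. (8))."""
        edges = np.histogram_bin_edges(
            np.concatenate([purchases_per_user_src, purchases_per_user_tgt]), bins=bins)
        h_src, _ = np.histogram(purchases_per_user_src, bins=edges, density=True)
        h_tgt, _ = np.histogram(purchases_per_user_tgt, bins=edges, density=True)
        eps = 1e-12  # avoid zero bins in the divergence
        return float(entropy(h_src + eps, h_tgt + eps))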

Figure 3 shows the relationship between the KLD values and the gains of the Config-
1, Config-5, Config-6, and Config-7 configurations for the three source domains. The
line in each figure shows the linear regression fit of the data with a 95% confi-
dence interval. The Pearson correlation coefficients (PCC) and p-values are also
shown. Because Config-2, Config-3, and Config-4 result in negative transfer in
most cases, they are excluded from this part. As can be seen in Fig. 3a, when
Book is the source domain, the transfer gain correlates with the KLD values,
with higher KLD values tending to lead to lower transfer learning performance.
Especially, this trend is clearly shown in the cases of Config-5 and Config-6 where
|PCC| > 0.7. In the case of Movie as the source domain (i.e., Fig. 3b), only the gain
of Config-5 shows a correlation with the KLD values, whereas the correlations
for the three configurations of Config-1, Config-6, and Config-7 are not sta-
tistically significant, i.e., p-value > 0.05. As for the Kindle domain (i.e., Fig. 3c),
the correlation between transfer gain and KLD can be observed for Config-5 and
Config-7, but not for Config-1 and Config-6.



Table 6: Pearson correlation coefficients (p-values) between the transfer perfor-
mance of transfer configurations and a) the target domain sizes and b) target
domain sparsity.

            (a) Target domain dataset size
   Config       Book          Movie         Kindle
  Config-1  0.74 (0.002)  0.39 (0.151)  0.35 (0.198)
  Config-5  0.28 (0.318)  0.35 (0.205)  0.03 (0.917)
  Config-6  0.32 (0.249)  0.26 (0.357)  0.13 (0.640)
  Config-7  0.17 (0.533)  0.07 (0.805)  0.14 (0.606)

            (b) Target domain sparsity
   Config       Book          Movie         Kindle
  Config-1  0.48 (0.069)  0.07 (0.801)  0.03 (0.913)
  Config-5  0.47 (0.074)  0.35 (0.198)  0.32 (0.252)
  Config-6  0.55 (0.032)  0.56 (0.030)  0.31 (0.260)
  Config-7  0.54 (0.037)  0.53 (0.040)  0.45 (0.089)




    Next, we investigate how characteristics of the target domain affect the trans-
fer performance. Specifically, we consider two key characteristics of the target
domain, namely dataset size and sparsity. Table 6 shows the correlation coef-
ficients (p-values) between the transfer performance and a) the target domain
dataset size and b) the target domain sparsity. It can be seen in Table 6a that the
correlation between the transfer performance and the target domain's dataset size

is low (|PCC| < 0.4) and statistically insignificant (p-value > 0.05), except for
the Config-1 configuration with Book as the source domain. As shown in Table 6b,
the target domain's sparsity has a higher correlation with the transfer performance
than the dataset size for Config-5, Config-6, and Config-7. However, the PCC
values are generally low (|PCC| < 0.6). Especially, there is almost no correlation
between the transfer performance of Config-1 and the sparsity when Movie and
Kindle are the source domains.


5   Conclusions
In this paper, we investigate the transferability of deep neural networks for the top-
N item recommendation task in recommender systems. Specifically, we adopt
MLP as the base network, and investigate seven transfer configurations using
eighteen real-world datasets. Experimental results show that transferring layers
of the interaction function enhances performance on most of the target domains
by up to 20% in terms of Hit Ratio. Especially, in contrast to neural networks
for computer vision and NLP tasks, the layers close to the output are more
transferable than those close to the input. We also found that the best transfer
configuration highly depends on the source and target domains. Hence, different
from other tasks such as image classification, the transfer configuration should
be carefully chosen to achieve good performance in embedding-based recommen-
dation. Further investigation reveals that the relatedness between the source and
target domains, measured in terms of KL-Divergence, affects the transfer perfor-
mance, whereas the size and sparsity of target domains have little impact on
the transfer performance. In future work, we will focus on developing transfer
techniques to dynamically decide the optimal transfer configuration.


Acknowledgment
This research was partially supported by JST CREST Grant Number J181401085,
Japan.


References
 1. Bansal, T., Belanger, D., McCallum, A.: Ask the gru: Multi-task learning for deep
    text recommendations. In: Proceedings of the 10th ACM Conference on Recom-
    mender Systems. pp. 107–114. Boston, Massachusetts, USA (Sep 2016)
 2. Bell, R., Koren, Y., Volinsky, C.: Matrix factorization techniques for recommender
    systems. Computer 42(08), 30–37 (Aug 2009)
 3. Bennett, J., Lanning, S., et al.: The netflix prize. In: Proceedings of KDD cup and
    workshop. vol. 2007, p. 35. New York, NY, USA. (2007)
 4. Chae, D.K., Kang, J.S., Kim, S.W., Lee, J.T.: Cfgan: A generic collaborative fil-
    tering framework based on generative adversarial networks. In: Proceedings of the
    27th ACM International Conference on Information and Knowledge Management.
    pp. 137–146. Torino, Italy (Oct 2018)

 5. Dai, W., Yang, Q., Xue, G.R., Yu, Y.: Boosting for transfer learning. In: Pro-
    ceedings of the 24th International Conference on Machine Learning. pp. 193–200.
    Corvalis, Oregon, USA (Jun 2007)
 6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
    tional transformers for language understanding. arXiv preprint arXiv:1810.04805
    (2018)
 7. Gao, S., Luo, H., Chen, D., Li, S., Gallinari, P., Guo, J.: Cross-domain recommen-
    dation via cluster-level latent factor model. In: Joint European conference on ma-
    chine learning and knowledge discovery in databases. pp. 161–176. Prague, Czech
    (Sep 2013)
 8. Gomez-Uribe, C.A., Hunt, N.: The netflix recommender system: Algorithms, busi-
    ness value, and innovation. ACM Trans. Manage. Inf. Syst. 6(4), 13:1–13:19 (Dec
    2015)
 9. Han, D., Liu, Q., Fan, W.: A new image classification method using cnn transfer
    learning and web data augmentation. Expert Systems with Applications 95, 43 –
    56 (2018)
10. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.S.: Neural collaborative
    filtering. In: Proceedings of the 26th International Conference on World Wide
    Web. pp. 173–182. Perth, Australia (2017)
11. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Ges-
    mundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp
    (2019)
12. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification
    (2018)
13. Hu, G., Zhang, Y., Yang, Q.: Conet: Collaborative cross networks for cross-domain
    recommendation. In: Proceedings of the 27th ACM International Conference on
    Information and Knowledge Management. pp. 667–676. Torino, Italy (Oct 2018)
14. Kornblith, S., Shlens, J., Le, Q.V.: Do better imagenet models transfer better? In:
    The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp.
    2661–2671. CA, US (2019)
15. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. nature 521(7553), 436–444
    (2015)
16. Li, B., Yang, Q., Xue, X.: Can movies and books collaborate? cross-domain collab-
    orative filtering for sparsity reduction. In: Twenty-First International Joint Con-
    ference on Artificial Intelligence. pp. 2052–2057. CA, USA (Jul 2009)
17. Li, J., Monroe, W., Jurafsky, D.: Understanding neural networks through repre-
    sentation erasure (2016)
18. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with
    deep adaptation networks. In: Proceedings of the 32nd International Conference on
    International Conference on Machine Learning - Volume 37. p. 97–105. ICML’15,
    JMLR.org (2015)
19. Man, T., Shen, H., Jin, X., Cheng, X.: Cross-domain recommendation: An embed-
    ding and mapping approach. In: Proceedings of the Twenty-Sixth International
    Joint Conference on Artificial Intelligence, IJCAI-17. pp. 2464–2470. Melbourne,
    Australia (Aug 2017)
20. Ni, J., Li, J., McAuley, J.: Justifying recommendations using distantly-labeled re-
    views and fine-grained aspects. In: Proceedings of the 2019 Conference on Empirical
    Methods in Natural Language Processing and the 9th International Joint Con-
    ference on Natural Language Processing (EMNLP-IJCNLP). pp. 188–197. Hong
    Kong, China (Nov 2019)

21. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level im-
    age representations using convolutional neural networks. In: The IEEE Conference
    on Computer Vision and Pattern Recognition (CVPR) (June 2014)
22. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowl-
    edge and Data Engineering 22(10), 1345–1359 (Oct 2010)
23. Pan, W., Xiang, E.W., Liu, N.N., Yang, Q.: Transfer learning in collaborative
    filtering for sparsity reduction. In: Twenty-fourth AAAI conference on artificial
    intelligence. pp. 230–235. Georgia, USA (Jul 2010)
24. Ricci, F., Rokach, L., Shapira, B.: Introduction to recommender systems handbook.
    In: Recommender systems handbook, pp. 1–35. Springer (2011)
25. Sedhain, S., Menon, A.K., Sanner, S., Xie, L.: Autorec: Autoencoders meet col-
    laborative filtering. In: Proceedings of the 24th International Conference on World
    Wide Web. pp. 111–112. Florence, Italy (May 2015)
26. Seo, S., Huang, J., Yang, H., Liu, Y.: Interpretable convolutional neural networks
    with dual local and global attention for review rating prediction. In: Proceedings
    of the Eleventh ACM Conference on Recommender Systems. pp. 297–305. Como,
    Italy (Aug 2017)
27. Shi, Q., Du, B., Zhang, L.: Domain adaptation for remote sensing image classifica-
    tion: A low-rank reconstruction and instance weighting label propagation inspired
    algorithm. IEEE Transactions on Geoscience and Remote Sensing 53(10), 5677–
    5689 (2015)
28. Smith, B., Linden, G.: Two decades of recommender systems at amazon.com. IEEE
    Internet Computing 21(3), 12–18 (May 2017)
29. Voigt, P., Von dem Bussche, A.: The eu general data protection regulation (gdpr).
    A Practical Guide, 1st Ed., Cham: Springer International Publishing (2017)
30. Wang, Z., Du, B., Guo, Y.: Domain adaptation with neural embedding matching.
    IEEE Transactions on Neural Networks and Learning Systems pp. 1–11 (2019).
    https://doi.org/10.1109/TNNLS.2019.2935608
31. Wiese, G., Weissenborn, D., Neves, M.: Neural domain adaptation for biomedi-
    cal question answering. In: Proceedings of the 21st Conference on Computational
    Natural Language Learning (CoNLL 2017). pp. 281–289. Vancouver, Canada (Aug
    2017)
32. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in
    deep neural networks? In: Advances in neural information processing systems. pp.
    3320–3328 (2014)
33. Zhang, S., Yao, L., Sun, A., Tay, Y.: Deep learning based recommender system: A
    survey and new perspectives. ACM Comput. Surv. 52(1), 5:1–5:38 (Feb 2019)
34. Zhu, F., Chen, C., Wang, Y., Liu, G., Zheng, X.: Dtcdr: A framework for dual-
    target cross-domain recommendation. In: Proceedings of the 28th ACM Interna-
    tional Conference on Information and Knowledge Management (Dec 2019)
35. Zhu, F., Wang, Y., Chen, C., Liu, G., Orgun, M., Wu, J.: A deep framework for
    cross-domain and cross-system recommendations. In: IJCAI International Joint
    Conference on Artificial Intelligence. pp. 3711–3717. Stockholm, Sweden (Jul 2018)
36. Zhu, Z., Wang, J., Caverlee, J.: Improving top-k recommendation via joint collab-
    orative autoencoders. In: The World Wide Web Conference. WWW '19 (May 2019)