<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Knowledge Discovery and User Modelling for Smart Cities
August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CoNet: Collaborative Cross Networks for Cross-Domain Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guangneng Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Zhang</string-name>
          <email>yu.zhang.ust@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiang Yang</string-name>
          <email>qyang@cse.ust.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Hong Kong University of Science and Technology</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>20</volume>
      <issue>2018</issue>
      <abstract>
        <p>Cross-domain recommendation is an effective way of alleviating the data sparsity issue in recommender systems by leveraging knowledge from relevant domains. Transfer learning is the class of algorithms underlying these techniques. In this paper, we propose a novel transfer learning approach for cross-domain recommendation that uses neural networks as the base model. In contrast to matrix factorization based cross-domain techniques, our method is a deep transfer learning approach, which can learn complex user-item interaction relationships. We assume that hidden layers in two base networks are connected by cross mappings, leading to the collaborative cross networks (CoNet). CoNet enables dual knowledge transfer across domains and is realized in multi-layer feedforward networks which can be trained efficiently by back-propagation. The proposed model is thoroughly evaluated on two large real-world datasets. It outperforms baselines by a relative improvement of 7.84% in NDCG. We demonstrate the necessity of adaptively selecting representations to transfer. Compared with non-transfer methods, our model can save tens of thousands of training examples without performance degradation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Collaborative filtering (CF) approaches, which model the preference of users on items based on their
past interactions such as product ratings, are the cornerstone of recommender systems. Matrix
factorization (MF) is a class of CF methods which learn user latent factors and item latent factors
by factorizing their interaction matrix [
        <xref ref-type="bibr" rid="ref19 ref28">28,19</xref>
        ]. Neural collaborative filtering is another class of CF
methods which use neural networks to learn the complex user-item interaction function [
        <xref ref-type="bibr" rid="ref13 ref4 ref9">9,4,13</xref>
        ]. Neural
networks have the ability to learn highly nonlinear functions, which makes them suitable for learning the complex
user-item interaction. Both traditional MF and neural CF, however, suffer from the cold-start and
data sparsity issues.
      </p>
      <p>
        One effective solution is to transfer the knowledge from relevant domains and cross-domain
recommendation techniques address such problems [
        <xref ref-type="bibr" rid="ref2 ref21 ref3 ref33">2,21,33,3</xref>
        ]. In real life, a user typically participates in
several systems to acquire different information services. For example, a user installs applications
in an app store as well as reads news from a website. It brings us an opportunity to improve the
recommendation performance in the target service (and all services) by learning across domains.
Following the above example, we can represent the app installation feedback using a binary matrix
where the entries indicate whether a user has installed an app. Similarly, we use another binary matrix
to indicate whether a user has read a news article. Typically these two matrices are highly sparse,
and it is beneficial to learn them simultaneously. This idea is sharpened into the collective matrix
factorization (CMF) [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] approach which jointly factorizes these two matrices by sharing the user
latent factors. It combines CF on a target domain and another CF on an auxiliary domain, enabling
knowledge transfer [
        <xref ref-type="bibr" rid="ref31 ref45">31,45</xref>
        ]. CMF, however, is a shallow model and has difficulty in learning the
complex user-item interaction function [
        <xref ref-type="bibr" rid="ref13 ref9">9,13</xref>
        ]. Its knowledge sharing is limited to the lower level
of user latent factors.
      </p>
      <p>
        Motivated by the benefits of both knowledge transfer and learning the interaction function, we
propose a novel deep transfer learning approach for cross-domain recommendation using neural networks
as the base model. Although neural CF approaches have been proposed for single-domain recommendation [
        <xref ref-type="bibr" rid="ref13 ref40">13,40</xref>
        ],
there are few related works to study knowledge transfer learning for cross-domain recommendation
using neural networks. Instead, neural networks have been used as the base model in natural language
processing [
        <xref ref-type="bibr" rid="ref43 ref5">5,43</xref>
        ] and computer vision [
        <xref ref-type="bibr" rid="ref27 ref44 ref8">44,27,8</xref>
        ]. We explore how to use a neural network as the base
model for each domain and enable the knowledge transfer on the entire network across domains. Then
a few questions and challenges are raised: 1) What to transfer/share between these individual networks
for each domain? 2) How to transfer/share during the learning of these individual networks for each
domain? and 3) How is the performance compared with single domain neural learning and shallow
cross-domain models?
      </p>
      <p>
        This paper aims to propose a novel deep transfer learning approach by answering these questions
in the cross-domain recommendation scenario. The usual transfer learning approach is to train a base
network and then copy its first several layers to the corresponding first layers of a target network, with
either fine-tuning or frozen parameters [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]. This way of transferring has possibly two weak points. Firstly,
the shared-layer assumption is strong in practice as we find that it does not work well on real-world
cross-domain datasets. Secondly, the knowledge transfer happens in one direction, i.e., only from source
to target. Instead, we assume that hidden layers in two base networks are connected by dual mappings,
which do not require them to be identical. We enable dual knowledge transfer across domains by
introducing cross connections from one base network to another and vice versa, letting them benefit
from each other. These ideas are sharpened into the proposed collaborative cross networks (CoNet).
CoNet is achieved in simple multi-layer feedforward networks by using dual shortcut connections and
joint loss functions, which can be trained efficiently by back-propagation.
      </p>
      <p>The paper is organized as follows. We firstly introduce the preliminaries in Section 2, including
notations and the base network. In Section 3, we then present an intuitive model to realize the
cross-domain recommendation and point out several intrinsic weaknesses which limit its use. We
propose a novel deep transfer learning approach for cross-domain recommendation, named collaborative
cross networks (CoNet) in Section 4. The core component is the cross connection units which enable
knowledge transfer between source and target networks (Sec. 4.1). Its adaptive variant enforces the sparse
structure which adaptively controls when to transfer (Sec. 4.3). In Section 5, we experimentally show
the benefits of both transfer learning and deep learning for improving the recommendation performance
in terms of ranking metrics (Sec. 5.2). We show the necessity of adaptively selecting representations to
transfer (Sec. 5.3). We save tens of thousands of training examples without performance degradation
compared with non-transfer models (Sec. 5.4), which can be used to reduce the cost/labor of labelling
data. We review related works in Section 6 and conclude the paper in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminary</title>
      <sec id="sec-2-1">
        <title>Notation</title>
        <p>We first give notations and describe the problem setting (Sec. 2.1). We then review a multi-layer neural
network as the base network for collaborative filtering (Sec. 2.2).</p>
        <p>
          We are given two domains, a source domain S (e.g. the news recommendation) and a target domain
T (e.g. the app recommendation). As a running example, we let the app recommendation be the
target domain and the news recommendation be the source domain. The set of users is shared by the two domains
and denoted by $\mathcal{U}$ (of size $m = |\mathcal{U}|$). Denote the sets of items in $S$ and $T$ by $\mathcal{I}_S$ and $\mathcal{I}_T$ (of
sizes $n_S = |\mathcal{I}_S|$ and $n_T = |\mathcal{I}_T|$), respectively. Each domain is a problem of collaborative filtering for
implicit feedback [
          <xref ref-type="bibr" rid="ref16 ref31">31,16</xref>
          ]. For the target domain, let a binary matrix $R_T \in \mathbb{R}^{m \times n_T}$ describe user-app
installing interactions, where an entry $r_{ui} \in \{0, 1\}$ is 1 (observed) if user $u$ has an interaction
with app $i$ and 0 (unobserved) otherwise. Similarly, for the source domain, let another binary matrix
$R_S \in \mathbb{R}^{m \times n_S}$ describe user-news reading interactions, where the entry $r_{uj} \in \{0, 1\}$ is 1 if user $u$ has
an interaction with news $j$ and 0 otherwise. Usually the interaction matrix is very sparse since a user
consumes only a very small subset of all items.
        </p>
        <p>For the task of item recommendation, each user is only interested in identifying top-N items. The
items are ranked by their predicted scores $\hat{r}_{ui} = f(u, i \mid \Theta)$, where $f$ is the interaction function and
$\Theta$ are the model parameters. For matrix factorization (MF) techniques, the match function is the fixed
dot product $\hat{r}_{ui} = P_u^T Q_i$, and the parameters are the latent vectors of users and items, $\Theta = \{P, Q\}$, where
$P \in \mathbb{R}^{m \times d}$, $Q \in \mathbb{R}^{n \times d}$ and $d$ is the dimension. For neural CF approaches, neural networks are used to
parameterize the function $f$ and learn it from interactions:
          <disp-formula><label>(1)</label><tex-math>f(x_{ui} \mid P, Q, \theta_f) = \phi_o(\phi_L(\ldots(\phi_1(x_{ui}))\ldots)),</tex-math></disp-formula>
where the input $x_{ui} = [P^T x_u, Q^T x_i]$ is merged from projections of the user and the item, and the
projections are based on their one-hot encodings $x_u \in \{0, 1\}^m$, $x_i \in \{0, 1\}^n$ and embedding matrices
$P \in \mathbb{R}^{m \times d}$, $Q \in \mathbb{R}^{n \times d}$. The output and hidden layers are computed by $\phi_o$ and $\{\phi_l\}$ in a multilayer
feedforward neural network (FFNN), and the connection weight matrices and biases are denoted by $\theta_f$.</p>
        <p>In our transfer/multitask learning approach for cross-domain recommendation, each domain is
modelled by a neural network and these networks are jointly learned to improve the performance
through mutual knowledge transfer. We review the base network in the following subsection before
introducing the proposed model.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Base Network</title>
        <p>
          We adopt an FFNN as the base network to parameterize the interaction function (see Eq.(1)). The
base network is similar to the Deep model in [
          <xref ref-type="bibr" rid="ref4 ref6">6,4</xref>
          ] and the MLP model in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The base network, as
shown in Figure 2 (the gray part or the blue part), consists of four modules with the information
flow from the input $(u, i)$ to the output $\hat{r}_{ui}$ as follows. Input: $(u, i) \rightarrow x_u, x_i$. This module encodes
user-item interaction indices. We adopt the one-hot encoding. It takes user $u$ and item $i$, and maps
them into one-hot encodings $x_u \in \{0, 1\}^m$ and $x_i \in \{0, 1\}^n$ where only the element corresponding
to that index is 1 and all others are 0. Embedding: $x_u, x_i \rightarrow x_{ui}$. This module embeds one-hot
encodings into continuous representations via two embedding matrices and then merges them as
$x_{ui} = [P^T x_u, Q^T x_i]$ to be the input of successive hidden layers. Hidden layers: $x_{ui} \rightsquigarrow z_{ui}$. This
module takes the continuous representations from the embedding module and then transforms them, through
multiple hops (say $L$), to a final latent representation $z_{ui} = \phi_L(\ldots(\phi_1(x_{ui}))\ldots)$. This module consists of
multiple hidden layers to learn nonlinear interaction between users and items. Output: $z_{ui} \rightarrow \hat{r}_{ui}$.
This module predicts the score $\hat{r}_{ui}$ for the given user-item pair based on the representation $z_{ui}$ from
the last layer of the multi-hop module. Since we focus on one-class collaborative filtering, the output is
the probability that the input pair is a positive interaction. This can be achieved by a logistic (sigmoid) output layer:
$\hat{r}_{ui} = \phi_o(z_{ui}) = 1 / (1 + \exp(-h^T z_{ui}))$, where $h$ is the parameter vector.
        </p>
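        <p>The four modules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the helper names (matvec, relu, init, base_network_score) and the omission of biases are choices made here for brevity.</p>
        <preformat><![CDATA[
```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    """Elementwise rectified linear unit."""
    return [max(0.0, a) for a in v]


def init(rows, cols, rng, std=0.01):
    """Gaussian initialisation with a small standard deviation."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def base_network_score(u, i, P, Q, weights, h):
    """Score one (user, item) pair with the four modules of the base network."""
    # Input + embedding: multiplying a one-hot vector by an embedding matrix
    # selects one row, so a row lookup plays the role of P^T x_u and Q^T x_i.
    x_ui = P[u] + Q[i]  # list concatenation, i.e. [P^T x_u, Q^T x_i]
    # Hidden layers: z_ui = phi_L(...phi_1(x_ui)...), biases omitted (assumption).
    z = x_ui
    for W in weights:
        z = relu(matvec(W, z))
    # Output: logistic function of h^T z_ui.
    logit = sum(a * b for a, b in zip(h, z))
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
m, n, d = 5, 7, 4                    # toy numbers of users, items, factors
P, Q = init(m, d, rng), init(n, d, rng)
sizes = [2 * d, 8, 4]                # toy layer sizes: input 8, hidden 8 and 4
weights = [init(sizes[k + 1], sizes[k], rng) for k in range(len(sizes) - 1)]
h = [rng.gauss(0.0, 0.01) for _ in range(sizes[-1])]
score = base_network_score(2, 3, P, Q, weights, h)
print(score)
```
]]></preformat>
        <p>With small random weights the untrained score stays close to 0.5, which is what one would expect from a logistic output before any learning has taken place.</p>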
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Cross-stitch Networks</title>
      <p>We first introduce an intuitive model to realize cross-domain recommendation using neural networks,
and point out several intrinsic strong assumptions limiting its use, which inspire the design of our
model in the next section.</p>
      <p>
        Intuitively, we can use an MLP model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] on the target domain and use another MLP model on
the source domain. To enable knowledge transfer between two domains, we need some “cross” mapping
from the source to the target (and vice versa). We adapt cross-stitch units/networks (CSN) [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] for
cross-domain recommendation, which were originally proposed for visual recognition tasks (see Fig. 1a).
      </p>
      <p>Given two activation maps $a_A$ and $a_B$ from the $l$-th layer for two tasks $A$ and $B$, CSN learn
linear combinations $\tilde{a}_A$, $\tilde{a}_B$ of both the input activations and feed these combinations as input to the
successive layers’ filters:
        <disp-formula><label>(2)</label><tex-math>\tilde{a}_A^{ij} = \alpha_S a_A^{ij} + \alpha_D a_B^{ij}, \qquad \tilde{a}_B^{ij} = \alpha_S a_B^{ij} + \alpha_D a_A^{ij},</tex-math></disp-formula>
where the shared parameter $\alpha_D$ controls information transferred from the other network, $\alpha_S$ controls
information from the task-specific network, and $(i, j)$ is the location in the activation map.</p>
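      <p>As a small sketch of Eq. (2), the unit can be written for two flat activation vectors. The function name cross_stitch and the toy values are assumptions for illustration; the explicit length check only makes visible that both inputs and both outputs live in the same space.</p>
      <preformat><![CDATA[
```python
def cross_stitch(a_A, a_B, alpha_S, alpha_D):
    """Eq. (2) applied elementwise to two equal-length activation vectors."""
    if len(a_A) != len(a_B):
        raise ValueError("cross-stitch units need activations of equal size")
    # Each output mixes the task's own activation (alpha_S) with the
    # activation of the other task at the same location (alpha_D).
    new_A = [alpha_S * x + alpha_D * y for x, y in zip(a_A, a_B)]
    new_B = [alpha_S * y + alpha_D * x for x, y in zip(a_A, a_B)]
    return new_A, new_B


new_A, new_B = cross_stitch([1.0, 2.0], [3.0, 0.0], alpha_S=0.75, alpha_D=0.25)
print(new_A, new_B)  # [1.5, 1.5] [2.5, 0.5]
```
]]></preformat>
      <p>Because two scalars are shared by every location, the output has exactly the size of the input and every transferred entry receives the same weight.</p>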
      <p>
        Although the cross-stitch unit indeed incorporates knowledge from the source domain (and target
domain vice versa), this simple stitch unit has several limitations. Firstly, cross-stitch networks
cannot handle the case where the dimensions of contiguous layers are different. In other words, they
assume that the activations in successive layers are in the same vector space. This is not an issue in
convolutional networks for computer vision since the activation maps of contiguous layers are in the
same space [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. For collaborative filtering, however, it is not the case in typical multi-layer FFNNs
where the architecture follows a tower pattern: the lower layers are wider and higher layers have
smaller number of neurons [
        <xref ref-type="bibr" rid="ref13 ref6">6,13</xref>
        ]. Secondly, they assume that the representations from other networks
are equally important, with weights all being the same scalar $\alpha_D$. Some features, however, are more
useful and predictive, and their importance should be learned attentively from data [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. Thirdly, they assume that the
representations from other networks are all useful since they transfer activations from every location in
a dense way. The sparse structure, however, plays a key role in the general learning paradigm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Instead,
our model can be extended to learn the sparse structure on the task relationship matrices which
are defined in Eq. (5), with the help of the existing sparsity-induced regularization. As we will see
in the experiments (see Table 2 and Figure 3), the sparse structure is necessary for generalization
performance.
      </p>
      <p>Figure 1. (a) Cross-stitch unit between Task A and Task B. (b) Cross connection unit between the source network and the target network.</p>
      <p>
since the matrix $H$ can be used to match their dimension. For example, if the $l$-th layer ($a_{app}$ and
$a_{news}$) has dimension 128, and the $(l+1)$-th layer ($\tilde{a}_{app}$ and $\tilde{a}_{news}$) has dimension 64, then the matrix
$H \in \mathbb{R}^{64 \times 128}$. Secondly, the entries of $H$ are learned from data. They are likely not to be all the same,
showing that the importance of transferred representations differs for each neuron/position.
Thirdly, we can enforce some prior on the matrix $H$ to exploit the structure of the neural architecture.
The sparse structure can be enforced to adaptively select useful representations to transfer. Based on
the cross connection units, we propose the CoNet models in the following sections, including a basic
model (Sec. 4.2) and an adaptive variant (Sec. 4.3).</p>
      <sec id="sec-3-1">
        <title>Basic Model</title>
        <p>We propose the collaborative cross network (CoNet) model by adding cross connection units (see
Sec. 4.1) and the joint loss (see Sec. 4.4) to the entire FFNN, as shown in Figure 2. We firstly describe
a basic model in this section and then present an adaptive variant in the next section.</p>
        <p>
          We decompose the model parameters into two parts, task-shared and task-specific:
$\Theta_{app} = \{P, (H^l)_{l=1}^{L}\} \cup \{Q_{app}, \theta_f^{app}\}$ and
$\Theta_{news} = \{P, (H^l)_{l=1}^{L}\} \cup \{Q_{news}, \theta_f^{news}\}$, where $P$ is the user
embedding matrix and $Q$ are the item embedding matrices with the subscript specifying the corresponding
domain. We stack the cross connection units on top of the shared user embeddings, enabling deep
knowledge transfer. Denote by $W^l$ the weight matrix connecting the $l$-th to the $(l+1)$-th layer
(we ignore biases for simplicity), and by $H^l$ the linear projection underlying the corresponding cross
connections. Then the two base networks are coupled by cross connections:
          <disp-formula><label>(4a)</label><tex-math>a_{app}^{l+1} = \sigma(W_{app}^{l} a_{app}^{l} + H^{l} a_{news}^{l}),</tex-math></disp-formula>
          <disp-formula><label>(4b)</label><tex-math>a_{news}^{l+1} = \sigma(W_{news}^{l} a_{news}^{l} + H^{l} a_{app}^{l}),</tex-math></disp-formula>
where the function $\sigma(\cdot)$ is the widely used rectified linear unit (ReLU) [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. We can see that
$a_{app}^{l+1}$ receives two information flows: one from the transform gate controlled by $W_{app}^{l}$ and one
from the transfer gate controlled by $H^l$ (similarly for $a_{news}^{l+1}$ in the source network). We call $H^l$ the
relationship/transfer matrix since it learns to control how much sharing is needed. To reduce model
parameters and make the model compact, we use the same linear transformation $H^l$ for the two directions,
similar to the cross-stitch networks. In fact, using different matrices for the two directions did not improve
results on the evaluated datasets.
        </p>
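        <p>One cross connection unit, Eq. (4a) and Eq. (4b), can be sketched as follows. The helper names and the toy matrices are assumptions chosen so that the arithmetic is easy to follow by hand; biases are left out, as in the equations.</p>
        <preformat><![CDATA[
```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    """Elementwise rectified linear unit."""
    return [max(0.0, a) for a in v]


def cross_unit(a_app, a_news, W_app, W_news, H):
    """One cross connection unit; the same H is used in both directions."""
    next_app = relu([s + t for s, t in
                     zip(matvec(W_app, a_app), matvec(H, a_news))])
    next_news = relu([s + t for s, t in
                      zip(matvec(W_news, a_news), matvec(H, a_app))])
    return next_app, next_news


# Toy layer: 4 units in layer l and 2 units in layer l+1, so W and H are 2 x 4.
W_app = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_news = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
H = [[0.5, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0]]
a_app = [1.0, 2.0, 3.0, 4.0]
a_news = [2.0, 5.0, 1.0, 1.0]
next_app, next_news = cross_unit(a_app, a_news, W_app, W_news, H)
print(next_app, next_news)  # [2.0, 0.0] [1.5, 0.0]
```
]]></preformat>
        <p>Unlike the scalar weights of a cross-stitch unit, the matrix H here maps a 4-dimensional layer to a 2-dimensional one, so the layer sizes are free to shrink from one layer to the next.</p>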
        <p>Figure 2. The CoNet model, with layers labelled Input, Embedding, 1st layer, 2nd layer, 3rd layer, and Output; the inputs are user u, app i, and news j (the cross unit is illustrated in the dotted rectangle box, see Fig. ??).</p>
        <sec id="sec-3-1-7">
          <title>Adaptive Variant</title>
          <p>
            As we can see, the task relationship matrices $\{H^l\}$ are crucial to the proposed CoNet model. They
control the representation transfer from another domain. We can enforce these matrices to have some
structure. The assumption is that not all representations from another network are useful. We may
expect that the representations coming from other domains are sparse and selective. The selective
mechanism can help transfer general/useful representations while ignoring the specific/noisy ones.
Sparse representations are widely adopted in multitask/transfer learning [
            <xref ref-type="bibr" rid="ref1 ref42 ref8">1,8,42</xref>
            ].
          </p>
          <p>This corresponds to enforcing a sparse prior on the structure and can be achieved by penalizing
the task relationship matrices $\{H^l\}$ via some regularization. It may help the individual network to learn
intrinsic representations for itself and other tasks. In other words, $\{H^l\}$ adaptively controls when to
transfer.</p>
          <p>
            We adopt the widely used sparsity-induced regularization—least absolute shrinkage and selection
operator (lasso) [
            <xref ref-type="bibr" rid="ref38">38</xref>
            ]. In detail, denote by $r \times p$ the size of matrix $H^l$ (usually $r = p/2$). That is, $H^l$
linearly transforms the representations $a_{news}^{l} \in \mathbb{R}^{p}$ in the news network, and the result is part of the
input to the next layer $\tilde{a}_{app}^{l+1} \in \mathbb{R}^{r}$ in the app network (see Eq.(4) and Eq.(3)). Denote by $h_{ij}$ the $(i, j)$
entry of $H^l$. To induce overall sparsity, we impose the $\ell_1$-norm penalty on the entries $\{h_{ij}\}$ of $H^l$:
            <disp-formula><label>(5)</label><tex-math>\Omega(H^l) = \lambda \sum_{i=1}^{r} \sum_{j=1}^{p} |h_{ij}|,</tex-math></disp-formula>
where the hyperparameter $\lambda$ controls the degree of sparsity. This corresponds to the lasso regularization.
We call this sparse variant the SCoNet model.
          </p>
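          <p>The penalty in Eq. (5) is a scaled sum of absolute values, as the short sketch below shows. The helper zero_fraction is a hypothetical convenience for inspecting how sparse a matrix is; it is not part of the model, and the toy matrix is made up for illustration.</p>
          <preformat><![CDATA[
```python
def lasso_penalty(H, lam):
    """Eq. (5): lam times the sum of absolute values of all entries of H."""
    return lam * sum(abs(h) for row in H for h in row)


def zero_fraction(H, tol=1e-8):
    """Share of entries with magnitude at most tol (hypothetical helper)."""
    entries = [h for row in H for h in row]
    return sum(1 for h in entries if abs(h) <= tol) / len(entries)


# Toy 2 x 4 transfer matrix, so r = p / 2 as in the text.
H = [[0.5, 0.0, -1.5, 0.0],
     [0.0, 2.0, 0.0, 0.0]]
penalty = lasso_penalty(H, lam=0.5)
sparsity = zero_fraction(H)
print(penalty, sparsity)  # 2.0 0.625
```
]]></preformat>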
          <p>Other priors like low-rank factorization are alternatives to the sparse structure, and lasso variants
like group lasso and sparse group lasso are also possible. We adopt the general sparse prior and the
widely used lasso regularization.</p>
          <p>
            The model parameters include $\{P, (H^l)_{l=1}^{L}\} \cup \{Q_{app}, (W_{app}^{l}, b_{app}^{l})_{l=1}^{L}, h_{app}\} \cup \{Q_{news}, (W_{news}^{l}, b_{news}^{l})_{l=1}^{L}, h_{news}\}$,
where the embedding matrices $P$, $Q_{app}$ and $Q_{news}$ contain a large number of parameters since
they depend on the input size of users and items. Typically, the number of neurons in a hidden layer is
about one hundred. That is, the size of connection weight matrices and task relationship matrices is
hundreds by hundreds. In total, the size of model parameters is linear with the input size and is close
to the size of typical latent factors models [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] and neural CF approaches [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Model Learning</title>
        <p>
          Due to the nature of implicit feedback and the task of item recommendation, the squared loss is
not suitable since it is usually used for rating regression/prediction. Instead, we adopt the cross-entropy loss:
          <disp-formula><label>(6)</label><tex-math>\mathcal{L}_0 = -\sum_{(u,i) \in R^{+} \cup R^{-}} \left[ r_{ui} \log \hat{r}_{ui} + (1 - r_{ui}) \log(1 - \hat{r}_{ui}) \right],</tex-math></disp-formula>
where $R^{+}$ and $R^{-}$ are the observed interaction matrix and randomly sampled negative examples [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ],
respectively. This objective function has a probabilistic interpretation and is the negative log-likelihood
of the following likelihood function: $L(\Theta \mid R^{+} \cup R^{-}) = \prod_{(u,i) \in R^{+}} \hat{r}_{ui} \prod_{(u,i) \in R^{-}} (1 - \hat{r}_{ui})$,
where $\Theta$ are the model parameters.
        </p>
        <p>Now we define the joint loss function, leading to the proposed CoNet model which can be trained
efficiently by back-propagation. Instantiating the base loss ($\mathcal{L}_0$) described in Eq. (6) by the loss of app
($\mathcal{L}_{app}$) and the loss of news ($\mathcal{L}_{news}$) recommendation, the objective function for the CoNet model is their
joint loss: $\mathcal{L}(\Theta) = \mathcal{L}_{app}(\Theta_{app}) + \mathcal{L}_{news}(\Theta_{news})$, where the model parameters are $\Theta = \Theta_{app} \cup \Theta_{news}$. Note
that $\Theta_{app}$ and $\Theta_{news}$ share the user embeddings and transfer matrices $\{P, (H^l)_{l=1}^{L}\}$. For the sparse variant
(SCoNet), the term $\Omega(H^l)$ in Eq.(5) is added to the objective function.</p>
        <p>
          The objective function can be optimized by stochastic gradient descent and its variants like the adaptive
moment method (Adam) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The update equations are $\Theta^{new} \leftarrow \Theta^{old} - \eta \, \partial \mathcal{L}(\Theta) / \partial \Theta$, where $\eta$ is the
learning rate. Typical deep learning libraries such as TensorFlow (https://www.tensorflow.org) provide
automatic differentiation and hence we omit the gradient equations $\partial \mathcal{L}(\Theta) / \partial \Theta$, which can be computed
by the chain rule in back-propagation.
        </p>
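        <p>The cross-entropy loss of Eq. (6) and the joint objective can be illustrated with a few lines of plain Python. This is a sketch under assumptions: predictions are passed in as ready-made probabilities, the clamping constant eps is added here only for numerical safety, and the function names are illustrative.</p>
        <preformat><![CDATA[
```python
import math


def cross_entropy(labels, preds, eps=1e-12):
    """Eq. (6): negative log-likelihood over positives and sampled negatives."""
    total = 0.0
    for r, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0) (assumption)
        total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
    return total


def joint_loss(app_batch, news_batch, H_list=(), lam=0.0):
    """Sum of the two domain losses, plus the lasso term for the sparse variant."""
    loss = cross_entropy(*app_batch) + cross_entropy(*news_batch)
    loss += lam * sum(abs(h) for H in H_list for row in H for h in row)
    return loss


# One positive and one sampled negative per domain (negative ratio of 1).
app_batch = ([1, 0], [0.5, 0.5])
news_batch = ([1, 0], [0.5, 0.5])
plain = joint_loss(app_batch, news_batch)
sparse = joint_loss(app_batch, news_batch, H_list=[[[1.0, -1.0]]], lam=0.5)
print(plain, sparse)
```
]]></preformat>
        <p>With all predictions at 0.5, each example contributes log 2, so the plain joint loss is 4 log 2 and the sparse variant adds 0.5 times the absolute sum of the entries of H.</p>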
        <p>
          During training, we update the target network using the target domain data and update the
source network using the source domain data. The learning procedure is similar to the cross-stitch
networks [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. And the cost of learning each base network is approximately equal to that of running a
typical neural CF approach [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In total, the entire network can be efficiently trained by back-propagation (BP) using
mini-batch stochastic optimization.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <p>We conduct thorough experiments to evaluate the proposed models. We show their superior performance
over the state-of-the-art recommendation algorithms in a wide range of baselines (Sec. 5.2) and
demonstrate the effectiveness of the sparse variant to select representations (Sec. 5.3). We quantify
the benefit of knowledge transfer by reducing training examples (Sec. 5.4). Furthermore, we conduct
analyses on the sensitivity and sparsity (Sec. 5.5).</p>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>We begin the experiments by introducing the datasets, evaluation protocol, baselines, and
implementation details.</p>
        <p>
          Dataset. We evaluate on two real-world cross-domain datasets. The first dataset, Mobile, is provided
by a large internet company, i.e., Cheetah Mobile (http://www.cmcm.com/en-us/). The information
contains logs of user reading news, the history of app installation, and some metadata such as news
publisher and user gender collected in one month in the US. The dataset we used contains 1,164,394
user-app installations and 617,146 user-news reading records. There are 23,111 shared users, 14,348 apps,
and 29,921 news articles. We aim to improve the app recommendation by transferring knowledge from
relevant news reading domain. The data sparsity is over 99.6%. The second dataset is a public Amazon
dataset (http://jmcauley.ucsd.edu/data/amazon/), which has been widely used to evaluate the
performance of collaborative filtering approaches [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We use the two largest categories, Books and
Movies &amp; TV, as the two domains. We treat ratings of 4-5 as positive samples. The dataset we
used contains 1,323,101 user-book ratings and 963,373 user-movie ratings. There are 80,763 shared
users, 93,799 books, and 35,896 movies. We aim to improve book recommendation by transferring
knowledge from the relevant movie watching domain. The data sparsity is over 99.9%. The statistics are
summarized in Table 1. As we can see, both datasets are very sparse and hence we hope to improve
performance by transferring knowledge from auxiliary domains.
        </p>
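        <p>The sparsity figures quoted above follow directly from the reported counts for the two target domains. The short check below uses only numbers stated in this section; the function name is illustrative.</p>
        <preformat><![CDATA[
```python
def sparsity(num_interactions, num_users, num_items):
    """Fraction of user-item pairs without an observed interaction."""
    return 1.0 - num_interactions / (num_users * num_items)


# Target domains: apps on Mobile, books on Amazon (counts from the text).
mobile = sparsity(1_164_394, 23_111, 14_348)
amazon = sparsity(1_323_101, 80_763, 93_799)
print(f"Mobile (apps): {mobile:.4%}")
print(f"Amazon (books): {amazon:.4%}")
```
]]></preformat>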
        <p>
          Evaluation Protocol. For the item recommendation task, the leave-one-out (LOO) evaluation is widely
used and we follow the protocol in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. That is, we reserve one interaction as the test item for each
user. We determine hyper-parameters by randomly sampling another interaction per user as the
validation/development set. We follow the common strategy which randomly samples 99 (negative)
items that the user has not interacted with and then evaluates how well the recommender can rank the test
item against these negative ones. Since we aim at top-N item recommendation, the typical evaluation
metrics are hit ratio (HR), normalized discounted cumulative gain (NDCG), and mean reciprocal rank
(MRR), where the ranked list is cut off at $topN = 10$. HR intuitively measures whether the reserved test
item is present on the top-N list, defined as
          <disp-formula><tex-math>\mathrm{HR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \delta(p_u \le topN),</tex-math></disp-formula>
where $p_u$ is the hit position for the test item of user $u$, and $\delta(\cdot)$ is the indicator function. NDCG and MRR also account for the
rank of the hit position, respectively defined as
          <disp-formula><tex-math>\mathrm{NDCG} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\log 2}{\log(p_u + 1)}, \qquad \mathrm{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{p_u}.</tex-math></disp-formula>
Baseline. We compare with various baselines:
        </p>
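        <p>The three metrics can be computed from the per-user hit positions as sketched below. Two details are assumptions made here rather than statements from the text: ties between the test item and a negative item are resolved in favour of the test item, and a test item ranked beyond the cutoff contributes zero to all three metrics.</p>
        <preformat><![CDATA[
```python
import math


def rank_of_test_item(test_score, negative_scores):
    """1-based position of the held-out item among itself and the negatives."""
    return 1 + sum(1 for s in negative_scores if s > test_score)


def evaluate(hit_positions, top_n=10):
    """Average HR, NDCG and MRR over users for a list cut off at top_n."""
    hr = ndcg = mrr = 0.0
    for p in hit_positions:
        if p <= top_n:
            hr += 1.0
            ndcg += math.log(2) / math.log(p + 1)
            mrr += 1.0 / p
    n = len(hit_positions)
    return hr / n, ndcg / n, mrr / n


# Three toy users whose test items are ranked 1st, 3rd and 11th.
hr, ndcg, mrr = evaluate([1, 3, 11], top_n=10)
print(hr, ndcg, mrr)
```
]]></preformat>
        <p>In the protocol described above, negative_scores would hold the predicted scores of the 99 sampled items for one user, and rank_of_test_item would supply that user's hit position.</p>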
        <p>Table 1. Dataset statistics.</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Baselines</th><th>Shallow method</th><th>Deep method</th></tr>
            </thead>
            <tbody>
              <tr><td>Single-domain</td><td>BPRMF [<xref ref-type="bibr" rid="ref36">36</xref>]</td><td>MLP [<xref ref-type="bibr" rid="ref13">13</xref>]</td></tr>
              <tr><td>Cross-domain</td><td>CDCF [<xref ref-type="bibr" rid="ref24">24</xref>], CMF [<xref ref-type="bibr" rid="ref37">37</xref>]</td><td>MLP++, CSN [<xref ref-type="bibr" rid="ref27">27</xref>]</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
BPRMF: Bayesian personalized ranking [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] is a typical latent factors CF approach which learns the
user and item factors via MF and pairwise rank loss. It is a shallow model and learns on the target
domain only. MLP: Multilayer perceptron [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is a typical neural CF approach which learns user-item
interaction function using neural networks. MLP corresponds to the base network as described in
Section 2.2. It is a deep model and learns on the target domain only. MLP++: We combine two MLPs
by sharing the user embedding matrix only. This is a degenerate CoNet which has no cross connection
units. It is a simple/shallow knowledge transfer approach applied to two domains. CDCF:
Cross-domain CF with factorization machines (FM) [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] is a state-of-the-art cross-domain recommendation method
which extends FM [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. It is a context-aware approach which applies factorization on the merged
domains (aligned by the shared users). That is, the auxiliary domain is used as context. On the Mobile
dataset, the context for a user in the target app domain is her history of reading news in the source
news domain. Similarly, the context for a user in the target book domain is her history of watching
movies in the source movie domain on the Amazon dataset. The feature vector for the input is a sparse
vector $x \in \mathbb{R}^{m + n_T + n_S}$ where the non-zero entries are as follows: 1) the index for user id, 2) the index
for item id (target domain), and 3) all indices for her reading articles/watching movies (source domain).
Since FM can mimic MF, CDCF can be thought of as a single-domain factorization method applied
on the merged source-target rating matrix. It showed better performance than other cross-domain
methods like triadic (tensor) factorization [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It is a shallow cross-domain model. CMF: Collective
MF [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] is a multi-relation learning approach which jointly factorizes matrices of individual domains.
Here, the relation is user-item interaction. On Mobile, the two matrices are A = “user by app” and
B = “user by news”, respectively. The shared user factors P enable knowledge transfer between the two
domains. CMF then factorizes matrices A and B simultaneously by sharing the user latent factors:
$A \approx P^T Q_A$ and $B \approx P^T Q_B$. It is a shallow model and jointly learns on two domains. This can be
thought of as a non-deep transfer/multitask learning approach for cross-domain recommendation. CSN:
The cross-stitch network [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], described in Section 3, is a strong competitor. It is a deep multitask learning
model which jointly learns two base networks. It enables knowledge transfer via a linear combination
of activation maps from two networks via a shared coefficient, i.e., D in Eq.(2). This is a deep
transfer/multitask learning approach for cross-domain recommendation.
        </p>
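        <p>As a rough illustration of the CDCF input described above, the sketch below lists the non-zero positions of the feature vector for one (user, target item) pair. The function names, the block layout (users first, then target items, then source items), and the use of binary values are assumptions made for this example; they are not taken from the libFM implementation.</p>
        <preformat>
```python
def build_cdcf_indices(user, target_item, source_history, m, n_t, n_s):
    """Return the sorted non-zero positions of x in R^(m + n_t + n_s).

    Assumed layout: positions [0, m) index users, the next n_t positions
    index target-domain items, and the last n_s positions index
    source-domain items.
    """
    if user not in range(m) or target_item not in range(n_t):
        raise ValueError("user or target item index out of range")
    indices = [user, m + target_item]
    # One non-zero entry per distinct source-domain item in the user's history.
    for j in sorted(set(source_history)):
        if j not in range(n_s):
            raise ValueError("source item index out of range")
        indices.append(m + n_t + j)
    return indices


def to_dense(indices, dim):
    """Expand the index list into a dense 0/1 vector (binary values assumed)."""
    x = [0.0] * dim
    for i in indices:
        x[i] = 1.0
    return x


# Toy sizes (hypothetical): 3 users, 4 target items, 5 source items.
m, n_t, n_s = 3, 4, 5
idx = build_cdcf_indices(user=1, target_item=2, source_history=[3, 0],
                         m=m, n_t=n_t, n_s=n_s)
x = to_dense(idx, m + n_t + n_s)
print(idx)  # [1, 5, 7, 10]
```
        </preformat>
        <p>In this toy case the user contributes one entry, the target item one entry, and the two source-domain items one entry each, so the vector of length 12 has four non-zero positions.</p>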
        <p>Implementation. For BPRMF, we use the implementation in LightFM (https://github.com/lyst/lightfm), which is a popular CF library.
For CDCF, we adapt the official libFM implementation (http://www.libfm.org). For MLP, we use the code released by its
authors (https://github.com/hexiangnan/neuralcollaborativefiltering). For CMF, we use a Python version based on the original Matlab code (http://www.cs.cmu.edu/ajit/cmf/). Our methods are
implemented using TensorFlow. Parameters are randomly initialized from the Gaussian distribution $\mathcal{N}(0, 0.01^2)$. The
optimizer is Adam with an initial learning rate of 0.001. The mini-batch size is 128. The negative
sampling ratio is 1. For the network structure, we adopt a tower pattern, halving the layer size
for each successive higher layer. Specifically, the configuration of hidden layers in the base network is
[64 → 32 → 16 → 8]. This is also the network configuration of the MLP model. CSN requires
that the number of neurons in each hidden layer be the same. The configuration notation [64] * 4 equals
[64 → 64 → 64 → 64]. We investigate several typical configurations.</p>
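        <p>The layer configurations and training settings stated above can be written down compactly. The following is a minimal sketch; the helper names and the dictionary keys are hypothetical and are not taken from the released code.</p>
        <preformat>
```python
def tower_layers(first_size, depth):
    """Halve the layer size at each successive higher layer."""
    sizes = []
    size = first_size
    for _ in range(depth):
        if size == 0:
            raise ValueError("layer size reached zero; reduce depth")
        sizes.append(size)
        size //= 2
    return sizes


def constant_layers(width, depth):
    """Equal-width hidden layers, as CSN requires."""
    return [width] * depth


# Settings quoted in the text; the key names are illustrative only.
TRAINING_CONFIG = {
    "init_std": 0.01,            # Gaussian initialization N(0, 0.01^2)
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 128,
    "negative_sampling_ratio": 1,
}

base_layers = tower_layers(64, 4)    # [64, 32, 16, 8]
csn_layers = constant_layers(64, 4)  # [64, 64, 64, 64]
print(base_layers, csn_layers)
```
        </preformat>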
      </sec>
      <sec id="sec-4-2">
        <title>5.2 Comparing Different Approaches</title>
        <p>In this section, we report the recommendation performance of different methods and discuss the
findings. Table 2 shows the results of different models on the two datasets under three ranking metrics.
The last two columns give the relative improvement of our model over the best baseline and the corresponding
paired t-test. We can see that our proposed neural models are better than the base network (MLP), the
shallow cross-domain models (CMF and CDCF) learned using information from both domains, and the deep
cross-domain models (MLP++ and CSN) on both datasets.</p>
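        <p>For readers who want to reproduce the two summary columns, the sketch below computes a relative improvement and a paired t statistic. The metric values are toy numbers chosen for illustration, not results from this paper, and pairing the scores over repeated runs is an assumption, since the text does not state how the pairs are formed.</p>
        <preformat>
```python
import math
import statistics


def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


def paired_t_statistic(ours, baseline):
    """t statistic of the paired differences (ours minus baseline)."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    if 2 > n:
        raise ValueError("need at least two paired observations")
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(n))


# Toy scores (hypothetical), e.g. one metric value per repeated run.
ours_scores = [0.52, 0.55, 0.51, 0.56]
base_scores = [0.50, 0.52, 0.50, 0.53]

gain = relative_improvement(0.55, 0.50)
t_stat = paired_t_statistic(ours_scores, base_scores)
print(round(gain, 2), round(t_stat, 2))
```
        </preformat>
        <p>The t statistic would then be compared against a t distribution with n - 1 degrees of freedom to obtain a significance level.</p>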
        <p>On Mobile, our model achieves a 4.28% improvement in MRR over the
non-transfer MLP, showing the benefit of knowledge transfer. Note that pre-training an MLP
on the source domain and then transferring the user embeddings to the target domain as a warm-up did not achieve
much improvement; in fact, the improvement is small enough to be ignored. This shows the necessity
of dual knowledge transfer in a deep way. Our model improves MRR by more than 20%
over CDCF and CMF, showing the effectiveness of deep neural approaches. Together, our
neural models consistently give better performance than the other existing methods. Within our models
(SCoNet vs. CoNet), enforcing a sparse structure on the task relationship matrices is useful. Note that
the dropout technique and the $\ell_2$-norm penalty did not achieve these improvements, and they may harm the
performance in some cases. This shows the necessity of selecting representations.</p>
        <p>On Amazon, our model achieves a 7.84% improvement in NDCG over the best
baseline (MLP++), showing the benefit of knowledge transfer. The inferior
performance of CMF and CDCF compared with BPRMF shows the difficulty of transferring knowledge between Amazon Books
and Movies, yet our models still achieve good results. Comparing MLP++ and MLP, sharing the user
embeddings is slightly better than the base network owing to shallow knowledge transfer. Within our
models, enforcing a sparse structure on the task relationship matrices is also useful.</p>
        <p>CSN is inferior to the proposed CoNet models on both datasets. Moreover, it is surprising that
CSN has some difficulty in benefiting from knowledge transfer on the Amazon dataset, since it is
inferior to the non-transfer base network MLP. A possible reason is that the assumptions of CSN are
not appropriate, namely that all representations from the auxiliary domain are equally important and that all of them are useful.
By using a matrix H rather than a scalar D, we relax the first assumption, and by enforcing a
sparse structure on the matrix, we also relax the second.</p>
        <p>Note that the relative improvement of the proposed model over the best baseline is larger
on the Amazon dataset than on the Mobile dataset, even though the Amazon dataset is much sparser than
the Mobile dataset (see Table 1). One explanation is that the relatedness between the book and movie domains is
much larger than that between the app and news domains. This benefits all cross-domain methods,
including CMF, CDCF, and CSN, since they exploit information from both domains. Another
possibility is that the noise from the auxiliary domain poses a challenge for transferring knowledge.
This suggests that the proposed model is more effective because it can select useful representations from the
source network and ignore the noisy ones. In the next section, we take a closer look at the impact of
the sparse structure.</p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3 Impact of Selecting Representations</title>
        <p>The results on both real-world datasets show the usefulness of enforcing a sparse structure on the task
relationship matrices H. We now quantify the contribution of the sparsity to CoNet. We isolate
the impact of the sparsity by controlling for the architectural difference between CSN and CoNet; that
is, we give them the same architecture configuration. As a consequence, the performance difference in this ablation
comes from the different means of knowledge transfer: the scalar D used in CSN and the sparse matrix H used
in SCoNet.</p>
        <p>Figure 3 shows the results on the Mobile and Amazon datasets under several typical architectures.
We can see that the sparsity contributes to performance improvements and that it is necessary to introduce
the sparsity in general settings. On the Mobile data, introducing the sparsity improves NDCG
by a relative 2.29%. On the Amazon data, introducing the sparsity improves NDCG by a relative
4.21%. These results show that it is beneficial to introduce the sparsity and to select representations to
transfer on both datasets.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.4 Benefit of Transferring Knowledge</title>
        <p>
          Transfer learning can reduce the labor and cost of labelling data instances. In this section, we quantify
the benefit of knowledge transfer by comparing with non-transfer methods. We do not compare with
the cross-domain baselines like CSN in this case because our aim is to investigate the benefit of transfer
learning approaches, not the effectiveness of the proposed model, which has been demonstrated in
Sec. 5.2 above. Specifically, we gradually reduce the number of training examples in the target domain until
the performance of the proposed model is inferior to that of the non-transfer MLP model. The more training
examples we can remove, the more benefit we get from transferring knowledge. Note that this is
similar to the setting of varying the cold-start profile size when evaluating new users [
          <xref ref-type="bibr" rid="ref11 ref18">18,11</xref>
          ].
        </p>
        <p>Referring to Table 1, there are about 50 examples per user on the Mobile dataset. We
remove one and then two training examples per user to investigate the benefit of knowledge
transfer. To be fair, we ensure that every user has at least one training example, since the non-transfer
MLP cannot deal with the cold-start user issue. The results are shown in Table 3, where the rows
corresponding to a reduction percentage of 0% are copied from Table 2 for clarity. The reduction amount is
the number of training examples that we remove, and the reduction percentage is the ratio of the reduction amount
to the original total number of training examples. The results show that we can save the cost of labelling about
30,000 training examples by transferring knowledge from the news domain and still have performance comparable
to that of the MLP model, a non-transfer baseline.</p>
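        <p>The reduction procedure and its two summary quantities can be sketched as follows. This is an illustrative reading of the setup: the function name is hypothetical, the toy data are made up, and dropping the last examples of each user is an assumption, since the text does not say which examples are removed.</p>
        <preformat>
```python
def reduce_training_examples(examples_per_user, k):
    """Remove up to k examples per user, always keeping at least one.

    Returns the kept examples, the reduction amount, and the reduction
    percentage relative to the original number of training examples.
    """
    kept = {}
    removed = 0
    for user, items in examples_per_user.items():
        n_remove = min(k, max(len(items) - 1, 0))
        kept[user] = items[: len(items) - n_remove]
        removed += n_remove
    total = sum(len(items) for items in examples_per_user.values())
    percent = 100.0 * removed / total if total else 0.0
    return kept, removed, percent


# Toy interaction lists (hypothetical user and item ids).
train = {"u1": [1, 2, 3, 4], "u2": [5], "u3": [6, 7]}
kept, amount, percent = reduce_training_examples(train, k=2)
print(amount, round(percent, 2))  # 3 42.86
```
        </preformat>
        <p>Here user u2 keeps her single example, so the reduction amount is 3 out of 7 original examples.</p>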
        <p>According to Table 1, there are about 16 examples per user on the Amazon dataset. With a setting similar
to that of the Mobile dataset, the results in Table 3 indicate that we can save the cost of
labelling about 20,000 training examples by transferring knowledge from the movie domain. Note that the
Amazon dataset is extremely sparse (the density is only 0.017%), implying that it is difficult to
acquire many training examples. Under this scenario, our transfer models are an effective way of
alleviating the issue of data sparsity and the cost of collecting data.</p>
        <p>We analyze the sensitivity to the penalty in Eq.(5), which controls the sparsity. Results are shown on
the Mobile data only due to the space limit, and we state the corresponding conclusions for the Amazon
data. Figure 4 (left) shows how the performance varies with the sparsity penalty enforced on the task
relationship matrices H. On the Mobile data, the performance is good at 0.1 (default)
and 5.0; on the Amazon data, it is good at 0.1 (default) and 1.0 (not shown).</p>
        <p>[Figure 4 (image not reproduced). Legend entries: HR, NDCG, MRR; Fitted, data. Axis label: Ratio of zeros.]</p>
        <p>Since the sparsity of the transfer matrices $(H^l)_{l=1}^{L}$ is crucial for selecting representations to transfer,
we show the change in the number of zero entries over training epochs. For clarity and due to the space limit, we only
show the results for the first transfer matrix $H^1$, which connects the first and the second hidden layers.
Figure 4 (right) shows the results, where we use a fourth-order polynomial to robustly fit the data. We can
see that the matrix becomes sparser over the first 25 iterations, and the general trend is toward sparsification.
The average percentage of zero entries in $H^1$ is 6.5%. For the second and third transfer matrices, the
percentage is 6.0% and 6.3%, respectively. In summary, sparse transfer matrices are learned,
and they can adaptively select partial representations to transfer across domains. It may be better to
transfer many instead of all representations.</p>
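        <p>The zero-entry statistic reported above can be computed from snapshots of a transfer matrix. The sketch below is one plausible way to do it; the tolerance used to decide that an entry counts as zero, the function names, and the toy matrices are assumptions for illustration.</p>
        <preformat>
```python
def zero_ratio(matrix, tol=1e-8):
    """Fraction of entries whose magnitude is at most `tol`."""
    total = 0
    zeros = 0
    for row in matrix:
        for value in row:
            total += 1
            if tol >= abs(value):
                zeros += 1
    if total == 0:
        raise ValueError("matrix has no entries")
    return zeros / total


def average_zero_ratio(snapshots, tol=1e-8):
    """Mean zero ratio over a list of per-epoch matrix snapshots."""
    ratios = [zero_ratio(h, tol) for h in snapshots]
    return sum(ratios) / len(ratios)


# Toy snapshots of a 2 x 4 transfer matrix at two epochs (made-up values).
h_epoch_1 = [[0.0, 0.3, 0.0, -0.1], [0.2, 0.0, 0.5, 0.0]]
h_epoch_2 = [[0.0, 0.3, 0.0, 0.0], [0.0, 0.0, 0.5, 0.0]]

print(zero_ratio(h_epoch_1))                       # 0.5
print(average_zero_ratio([h_epoch_1, h_epoch_2]))  # 0.625
```
        </preformat>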
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Related Works</title>
      <p>Our work is related to research fields of (cross-domain) recommender systems and (deep) transfer
learning.</p>
      <p>
        Recommender systems. Recommender systems aim to learn user preferences on unknown items
from their past history. Content-based recommendations are based on the matching between user
profiles and item descriptions [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. It is difficult to build a profile for each user when there is little or no
content. Collaborative filtering (CF) alleviates this issue by predicting user preferences based on
the user-item interaction behavior, agnostic to the content [
        <xref ref-type="bibr" rid="ref46 ref7">7,46</xref>
        ]. Latent factor models are typical
CF methods which learn feature vectors for users and items mainly based on matrix factorization
(MF) techniques [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. MF has probabilistic interpretations [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and flexible extensions to integrate
social relations [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and item content [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and both [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], leading to hybrid methods. Recently,
neural networks have been proposed to push the learning of feature vectors towards (highly) non-linear
representations, learning the user-item interaction function from data rather than using the fixed
inner product used in MF [
        <xref ref-type="bibr" rid="ref13 ref6 ref9">9,6,13</xref>
        ]. Both MF and neural CF models, however, suffer from the data sparsity
issue.
      </p>
      <p>
        Cross-domain recommendation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is an effective technique for alleviating the sparsity issue. One class of
methods is based on MF applied to each domain, including collective MF [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ], factorization of both
real-valued and binary matrices [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], and cluster-level pattern transfer [
        <xref ref-type="bibr" rid="ref21 ref22">21,22</xref>
        ].
Some other shallow methods exploit the bandit learning [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and the graph-based approach [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ]. The
deep multi-view neural approach [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] models the shared users as the pivot view and the source/target
items as other views by neural networks. We follow this deep learning research thread by using deep
networks to learn the interaction function through highly nonlinear transformations.
      </p>
      <p>
        Transfer and multitask learning. Transfer learning (TL) aims at improving the performance of
the target domain by exploiting knowledge from source domains [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], which matches the core idea of
cross-domain recommendation techniques. The typical TL technique in neural networks is to initialize
a target network with transferred features from a pre-trained source network [
        <xref ref-type="bibr" rid="ref30 ref44">30,44</xref>
        ]. Different from
this approach, we transfer knowledge in a deep way such that the two base networks benefit from
each other during the learning procedures, motivated by the cross-stitch network [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] which enables
information sharing between the two base networks. We generalize it by relaxing the underlying
assumption, especially via the idea of selecting representations to transfer.
      </p>
    </sec>
    <sec id="sec-6">
      <title>7 Conclusions</title>
      <p>We proposed a novel deep transfer learning approach for cross-domain recommendation. The sparse target
user-item interaction matrix can be reconstructed with knowledge guidance from the source
domain. We demonstrated the necessity of selecting representations to transfer, since transferring
all of them with equal importance may harm the performance. We found that naive deep transfer
models may be inferior to the shallow/neural non-transfer methods in some cases. Compared with non-transfer methods, our transfer models
can reduce the number of training examples by tens of thousands without
performance degradation. Experiments validated their effectiveness through comparison with shallow/deep,
single/cross-domain baselines.</p>
      <p>The evaluated Mobile dataset is collected from mobile smart devices in different states of the U.S.
We found that some hot states like TX, FL, CA, IL and NY have many records while other states
have few, which poses a challenge for reliably learning personalization models in these sparse states. We
hypothesize that we can transfer knowledge from hot states to sparse states, i.e., enable knowledge
transfer between two states. A possible solution is to exploit shared items (apps/news) as a bridge
between states, treating users in two states as similar if they install/read the same apps/news.
The proposed models are then applicable.</p>
      <p>Acknowledgment The work is supported by HKPFS PF15-16701.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Argyriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Evgeniou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pontil</surname>
          </string-name>
          .
          <article-title>Multi-task feature learning</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Berkovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuflik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <article-title>Cross-domain mediation in collaborative filtering</article-title>
          .
          <source>In UMAP</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>I.</given-names>
            <surname>Cantador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fernández-Tobías</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Berkovsky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <article-title>Cross-domain recommender systems</article-title>
          .
          <source>In Recommender Systems Handbook</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
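          4.
          <string-name>
            <given-names>H.-T.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Koc</surname>
          </string-name>
          ,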
          <string-name>
            <given-names>J.</given-names>
            <surname>Harmsen</surname>
          </string-name>
          , et al.
          <article-title>Wide &amp; deep learning for recommender systems</article-title>
          .
          <source>In Workshop on DL for RecSys</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <article-title>A unified architecture for natural language processing: Deep neural networks with multitask learning</article-title>
          .
          <source>In ICML</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P.</given-names>
            <surname>Covington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adams</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Sargin</surname>
          </string-name>
          .
          <article-title>Deep neural networks for youtube recommendations</article-title>
          .
          <source>In ACM RecSys</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <article-title>Item-based top-n recommendation algorithms</article-title>
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>C.</given-names>
            <surname>Doersch</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Multi-task self-supervised visual learning</article-title>
          .
          <source>In ICCV</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>Dziugaite</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          .
          <article-title>Neural network matrix factorization</article-title>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Elkahky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>A multi-view deep learning approach for cross domain user modeling in recommendation systems</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>I.</given-names>
            <surname>Fernández-Tobías</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Braunhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Cantador</surname>
          </string-name>
          .
          <article-title>Alleviating the new user problem in collaborative filtering by exploiting personality information</article-title>
          .
          <source>UMUAI</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <article-title>Vbpr: visual bayesian personalized ranking from implicit feedback</article-title>
          .
          <source>In AAAI</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.-S.</given-names>
            <surname>Chua</surname>
          </string-name>
          .
          <article-title>Neural collaborative filtering</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>G.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>A synthetic approach for recommendation: Combining ratings, social relations, and reviews</article-title>
          .
          <source>In IJCAI</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>L.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Personalized recommendation via cross-domain triadic factorization</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <article-title>Collaborative filtering for implicit feedback datasets</article-title>
          .
          <source>In IEEE ICDM</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>D.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>D.</given-names>
            <surname>Kluver</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Konstan</surname>
          </string-name>
          .
          <article-title>Evaluating recommender behavior for new users</article-title>
          .
          <source>In ACM RecSys</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <article-title>Matrix factorization techniques for recommender systems</article-title>
          .
          <source>Computer</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2012</year>
          . 3
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          .
          <article-title>Can movies and books collaborate?: cross-domain collaborative filtering for sparsity reduction</article-title>
          .
          <source>In IJCAI</source>
          ,
          <year>2009</year>
          . 1, 6
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Cross-domain collaborative filtering over time</article-title>
          .
          <source>In IJCAI</source>
          ,
          <year>2011</year>
          . 6
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Transferable contextual bandit for cross-domain recommendation</article-title>
          .
          <source>In AAAI</source>
          ,
          <year>2018</year>
          . 6
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>B.</given-names>
            <surname>Loni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          .
          <article-title>Cross-domain collaborative filtering with factorization machines</article-title>
          .
          <source>In ECIR</source>
          ,
          <year>2014</year>
          . 5.1, 5.1
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lyu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>King</surname>
          </string-name>
          .
          <article-title>SoRec: social recommendation using probabilistic matrix factorization</article-title>
          .
          <source>In CIKM</source>
          ,
          <year>2008</year>
          . 6
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Hidden factors and hidden topics: understanding rating dimensions with review text</article-title>
          .
          <source>In ACM RecSys</source>
          ,
          <year>2013</year>
          . 6
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hebert</surname>
          </string-name>
          .
          <article-title>Cross-stitch networks for multi-task learning</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2016</year>
          . 1, 3, 1, 4.1, 4.5, 5.1, 5.1, 6
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>A.</given-names>
            <surname>Mnih</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Probabilistic matrix factorization</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2008</year>
          . 1, 6
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <given-names>V.</given-names>
            <surname>Nair</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In ICML</source>
          ,
          <year>2010</year>
          . 4.2
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Laptev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          .
          <article-title>Learning and transferring mid-level image representations using convolutional neural networks</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2014</year>
          . 6
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>R.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lukose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scholz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>One-class collaborative filtering</article-title>
          .
          <source>In IEEE ICDM</source>
          ,
          <year>2008</year>
          . 1, 2.1, 4.4
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>A survey on transfer learning</article-title>
          .
          <source>IEEE Transactions on knowledge and data engineering</source>
          ,
          <year>2010</year>
          . 6
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <given-names>W.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Transfer learning to predict missing ratings via heterogeneous user feedbacks</article-title>
          .
          <source>In IJCAI</source>
          ,
          <year>2011</year>
          . 1, 6
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <given-names>M.</given-names>
            <surname>Pazzani</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Billsus</surname>
          </string-name>
          .
          <article-title>Content-based recommendation systems</article-title>
          .
          <source>In The adaptive web</source>
          .
          <year>2007</year>
          . 6
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          .
          <article-title>Factorization machines with libFM</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <year>2012</year>
          . 5.1
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Freudenthaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <article-title>BPR: Bayesian personalized ranking from implicit feedback</article-title>
          .
          <source>In UAI</source>
          ,
          <year>2009</year>
          . 5.1, 5.1
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gordon</surname>
          </string-name>
          .
          <article-title>Relational learning via collective matrix factorization</article-title>
          .
          <source>In SIGKDD</source>
          ,
          <year>2008</year>
          . 1, 5.1, 5.1, 6
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          .
          <article-title>Regression shrinkage and selection via the lasso</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series B</source>
          ,
          <year>1996</year>
          . 4.3
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          .
          <source>In ICML</source>
          ,
          <year>2015</year>
          . 3
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          .
          <article-title>Bridging collaborative filtering and semi-supervised learning: A neural approach for POI recommendation</article-title>
          .
          <source>In SIGKDD</source>
          ,
          <year>2017</year>
          . 1
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>A graph-based recommendation across heterogeneous domains</article-title>
          .
          <source>In CIKM</source>
          ,
          <year>2015</year>
          . 6
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          42.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          .
          <article-title>GLoMo: Unsupervisedly learned relational graphs as transferable representations</article-title>
          .
          <year>2018</year>
          . 4.3
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          43.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>Transfer learning for sequence tagging with hierarchical recurrent networks</article-title>
          .
          <year>2017</year>
          . 1
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          44.
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          .
          <article-title>How transferable are features in deep neural networks?</article-title>
          <source>In NIPS</source>
          ,
          <year>2014</year>
          . 1, 4.1, 6
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          45.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>A survey on multi-task learning</article-title>
          .
          <year>2017</year>
          . 1
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          46.
          <string-name>
            <given-names>Z.-D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.-S.</given-names>
            <surname>Shang</surname>
          </string-name>
          .
          <article-title>User-based collaborative-filtering recommendation algorithms on Hadoop</article-title>
          .
          <source>In International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2010</year>
          . 6
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>