=Paper=
{{Paper
|id=Vol-1887/paper2
|storemode=property
|title=Cross-Domain Recommendation for Large-Scale Data
|pdfUrl=https://ceur-ws.org/Vol-1887/paper2.pdf
|volume=Vol-1887
|authors=Shaghayegh Sahebi,Peter Brusilovsky,Vladimir Bobrokov
|dblpUrl=https://dblp.org/rec/conf/recsys/SahebiBB17
}}
==Cross-Domain Recommendation for Large-Scale Data==
Cross-Domain Recommendation for Large-Scale Data

Shaghayegh Sahebi, Department of Computer Science, University at Albany – SUNY, Albany, NY 12222 (ssahebi@albany.edu)
Peter Brusilovsky, School of Information Sciences, University of Pittsburgh, Pittsburgh, PA 15260 (peterb@pitt.edu)
Vladimir Bobrokov, Rostelecom, 10 building 2, Bahrushina st, Moscow, Russia 115184 (vcomzzz@gmail.com)

ABSTRACT

Cross-domain algorithms have been introduced to help improve recommendations and to alleviate the cold-start problem, especially in small and sparse datasets. These algorithms work by transferring information from source domain(s) to a target domain. In this paper, we study whether such algorithms can be helpful for large-scale datasets. We introduce a large-scale cross-domain recommender algorithm derived from canonical correlation analysis and analyze its performance in comparison with single-domain and cross-domain baseline algorithms. Our experiments in both cold-start and hot-start situations show the effectiveness of the proposed approach.

KEYWORDS

Cross-Domain, Domain Selection

ACM Reference format: Shaghayegh Sahebi, Peter Brusilovsky, and Vladimir Bobrokov. 2017. Cross-Domain Recommendation for Large-Scale Data. In Proceedings of RecSysKTL Workshop @ ACM RecSys '17, August 27, 2017, Como, Italy, 7 pages. DOI: N/A

1 INTRODUCTION

Cross-domain recommendation systems are gradually becoming more attractive as a practical approach to improving the quality of recommendations. The number of social systems that collect user interactions and preferences in different domains is constantly increasing. Accordingly, using the information contributed by users in one system to help generate better recommendations in another system in a related domain has become more and more valuable. Especially important in this context is the ability of cross-domain collaborative filtering to soften the cold-start situation by offering meaningful suggestions at the very start of user interaction with a new domain. Starting with a few proof-of-concept studies [1, 2, 6, 10, 19], cross-domain recommenders emerged as a sizable stream of research in the recommender systems field.

Yet, in some sense, the work is still in its early stages. While many different models have been proposed and explored, the dominant approach to exploring new cross-domain recommendation ideas is to use public datasets that are relatively small in comparison with the full scale of data (items and users) in real-life recommender systems. Full-scale cross-domain datasets are hard to find, so authors frequently use simulated cross-domain datasets. For example, Iwata and Takeuchi propose a matrix factorization based approach in [8] where neither users nor items are shared between domains. Although they used a large-scale dataset (combining EachMovie, Netflix, and MovieLens), their large-scale dataset is not from a cross-domain system. Rather, this movie rating dataset is divided into random user and item splits. A similar splitting in a large-scale movie domain can be seen in [15]. Moreover, the rare large-scale cross-domain experiment reports in the literature focus mostly on content-based cross-domain recommenders [4, 13, 18]. In [12], Loni et al. use factorization machines for domains in a large-scale Amazon dataset. In their experiments, better use of within-domain information generated better results compared to using cross-domain information. While the current literature shows the importance of cross-domain recommender systems, the limitations reviewed above do not allow us to see how cross-domain recommender algorithms scale up.

This paper attempts to fill the gap in the design and evaluation of large-scale cross-domain recommenders by proposing a cross-domain collaborative filtering algorithm and evaluating it using a dataset collected from a multi-domain recommender system, Imhonet. The proposed algorithm, CD-LCCA, is specifically designed for scalability.

The proposed approach relies on canonical correlation analysis (CCA) [7] for transferring information from the source domain to the target domain. CCA has been used in context-aware single-domain recommendation [5], content-based cross-domain recommendation [4], and medium-scale cross-domain collaborative filtering [17]. However, it has not been scaled for large-scale cross-domain collaborative filtering. In this paper, we use a computationally efficient implementation of CCA to model cross-domain recommendations in a large-scale dataset. We present our model in Section 2. We compare the performance of our model with cross-domain and single-domain baselines in Section 3, and analyze its cold-start behavior in Section 4. Finally, we present a time performance analysis of the algorithm in Section 5.
How- to help generate better recommendations in another system in a ever, it has not been scaled for large-scale cross-domain collabo- related domain has become more and more valuable. Especially rative filtering. In this paper, we use a computationally efficient important in this context is the ability of cross-domain collaborative implementation of CCA to model cross-domain recommendations filtering to soften the cold-start situation by offering meaningful in a large-scale dataset. We present our model in Section 2. We com- suggestions at the very start of user interaction with a new domain. pare the performance of our model with cross-domain and single Starting with a few proof-of-concept studies [1, 2, 6, 10, 19], cross- domain baselines in Section 3, and analyze its cold-start behavior domain recommenders emerged in a sizable stream of research in in Section 4. Finally, we present a time performance analysis of the the recommender systems field. algorithm in Section 5. Yet, in some sense, the work is still in early stages. While many different models have been proposed and explored, the dominating 2 LARGE-SCALE CCA-BASED approach to exploring new cross-domain recommendation ideas is to use public datasets that are relatively small in comparison with CROSS-DOMAIN ALGORITHM (CD-LCCA) the full scale of data (items and users) in real-life recommender 2.1 Background CCA is a multivariate statistical model that studies the interrela- RecSysKTL Workshop @ ACM RecSys ’17, August 27, 2017, Como, Italy tionships among sets of multiple dependent variables and multiple © 2017 Copyright is held by the author(s). independent variables[7]. Calculating CCA can be very resource- consuming especially in traditional approaches that should calcu- late QR-decomposition or singular value decomposition of large data matrices. To avoid this problem, Lu and Foster developed an iterative algorithm that can approximate CCA on very large 9 datasets[14]. This approach relies on LING, a gradient-based least Combining Equations 1, 2, and 3, we can now map between the squares algorithm that can work on large-scale matrices. To com- original source and target rating matrices as presented in Equation pute CCA in L-CCA in [14], first a projection of one of the data 4 and have an estimation of user ratings in the target domain (Ŷ ). matrices on a randomly-generated small matrix is produced. Then, a QR-decomposition of this smaller matrix is calculated. After that, Ŷ = X MWx c PWy−1 c N −1 (4) CCA is calculated iteratively, by applying LING on the reduced- When the rating matrix sizes are too large, calculating the multi- sized QR-decompositions of the original data matrices, in each plications in 4 can be resource-consuming. To resolve this, we take iteration. Every time after running LING, a QR-decomposition is advantage of the fact that [A|B]−1C = [A−1C |B −1C], and separate calculated for numerical stability. Here, we build our large-scale the source matrix into multiple smaller matrices, using column- cross-domain recommender algorithm based on L-CCA proposed wise partitioning. Then, we perform the multiplication on each of by Lu and Foster. these matrices and eventually join the results together. Equation 4 gives us the opportunity to relate the source and 2.2 Model target domain rating matrices. Based on that, we can estimate the Large scale CCA finds a lower-dimensional representation of each ratings in target domain Y based on ratings in source domain X . 
2.2 Model

Large-scale CCA finds a lower-dimensional representation of each of the input matrices and then calculates the canonical correlation analysis between these two smaller matrices. To base our cross-domain recommender algorithm on L-CCA, suppose that we have an n × m source domain rating matrix X and an n × p target domain rating matrix Y. Here, n represents the number of users shared between the source and target domains; m is the number of items in the source domain; and p is the number of items in the target domain. The goal of our model is to estimate user ratings in the target domain (the Y_{ij}s), given user ratings in the source domain (the X_{ij}s). We find the mapping between these two domains using L-CCA as explained in the following.

Suppose that X_c (n × x_c) is a lower-dimensional matrix that represents the source domain rating matrix X, and Y_c (n × y_c) is a lower-dimensional matrix that represents the target rating matrix Y in the L-CCA algorithm. Calculating the canonical correlations between X_c and Y_c leads us to two canonical variates (X_c W_{x_c} of size n × k_{cca} and Y_c W_{y_c} of size n × k_{cca}) and a diagonal matrix P (k_{cca} × k_{cca}) that contains the canonical correlations between these variates. Using these canonical correlations and variates, we can map X_c to Y_c (and vice versa). For example, Y_c can be obtained using Equation 1:

Y_c = X_c W_{x_c} P W_{y_c}^T    (1)

Although Equation 1 relates the lower-dimensional representations of the original source and target domains (X_c and Y_c), we need to map the original source and target matrices (X and Y) to estimate user ratings in them. To build a relationship between the original source and target domain matrices, we first look at the relationship between each domain matrix and its lower-dimensional representation. Without loss of generality, we consider the source domain relationships. X_c is built in the first step of L-CCA by solving an iterative least squares problem, with a QR-decomposition in each iteration. Although we lose the mapping information between X and X_c in this iterative process, having both X and the final X_c matrices, we can restore their mapping. We can write the relationship between X and X_c as X_c = X M. Here, M is an m × x_c mapping that projects X into X_c; thus:

M = X^{-1} X_c    (2)

The same can be applied to find the mapping N between the target rating matrix Y and its lower-dimensional representation Y_c:

N = Y^{-1} Y_c    (3)

Combining Equations 1, 2, and 3, we can now map between the original source and target rating matrices as presented in Equation 4, and obtain an estimate of the user ratings in the target domain (Ŷ):

\hat{Y} = X M W_{x_c} P W_{y_c}^{-1} N^{-1}    (4)

When the rating matrices are too large, calculating the multiplications in Equation 4 can be resource-consuming. To resolve this, we take advantage of the fact that [A|B]^{-1} C = [A^{-1} C | B^{-1} C], and separate the source matrix into multiple smaller matrices using column-wise partitioning. Then, we perform the multiplication on each of these matrices and eventually join the results together.

Equation 4 gives us the opportunity to relate the source and target domain rating matrices. Based on it, we can estimate the ratings in the target domain Y from the ratings in the source domain X. In other words, we can estimate user i's rating on item j in the target domain, given user i's ratings in the source domain, using Equation 5:

\hat{y}_{i,j} = \sum_{q=1}^{m} X_{i,q} \sum_{o=1}^{x_c} M_{q,o} \sum_{l=1}^{k_{cca}} (W_{x_c})_{o,l} P_{l,l} \sum_{r=1}^{y_c} (W_{y_c}^{-1})_{l,r} (N^{-1})_{r,j}    (5)

Thus, our cross-domain recommender system can suggest the most relevant items to users in the target domain, given the users' ratings in the source domain (a sketch of this mapping, with the column-wise partitioning, appears at the end of this section). In the following sections, we evaluate the proposed model in both the cold-start and hot-start settings, using a large-scale dataset.

Note that the focus of our proposed model is on cross-domain recommenders with shared sets of users across domains. Although some of the research in the area of cross-domain recommender systems is focused on domains with non-overlapping data [8, 11, 20, 21], the problem of lacking shared users has been a matter of debate [3]. Some approaches have tried to address this problem by sharing a subset of users between domains [9, 22]. We leave this expansion of the proposed model for future work.
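The mapping of Equation 4, together with the column-wise partitioning trick, can be sketched as follows. This is a minimal dense-NumPy illustration under toy-sized assumptions, not the authors' Matlab implementation: least-squares solves and pseudo-inverses stand in for the matrix inverses in Equations 2–4, and all names are illustrative.

```python
import numpy as np

def estimate_target_ratings(X, Y, Xc, Yc, Wx, Wy, rho, n_blocks=4):
    """Sketch of Equation 4: Yhat = X M Wxc P Wyc^{-1} N^{-1}."""
    # Eq. 2: M maps X onto its low-dimensional representation Xc.
    M = np.linalg.lstsq(X, Xc, rcond=None)[0]            # m x xc
    # Eq. 3: N maps Y onto Yc.
    N = np.linalg.lstsq(Y, Yc, rcond=None)[0]            # p x yc
    # Core m x p mapping M Wxc P Wyc^{-1} N^{-1}; pseudo-inverses are
    # used because Wy and N are generally not square.
    core = M @ Wx @ np.diag(rho) @ np.linalg.pinv(Wy) @ np.linalg.pinv(N)
    # Column-wise partitioning: accumulate X @ core block by block so
    # that only one column block of X is multiplied at a time.
    Yhat = np.zeros((X.shape[0], Y.shape[1]))
    for cols in np.array_split(np.arange(X.shape[1]), n_blocks):
        Yhat += X[:, cols] @ core[cols, :]
    return Yhat
```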
3 DO LARGE-SCALE CROSS-DOMAIN ALGORITHMS HELP?

In our first set of experiments, we study whether the proposed cross-domain recommender system is useful on large-scale datasets. In other words, by comparing the cross-domain and single-domain recommendation results, we explore whether target domain user data alone can be enough for achieving good recommendations in large-scale datasets, or whether auxiliary information can be helpful.

3.1 Dataset

We use the Imhonet dataset to carry out the experiments in this paper. This is an anonymized dataset obtained from an online Russian recommender service, Imhonet.ru. It allows users to rate and review a range of items from various domains, from books and movies to mobile phones and architectural monuments. Imhonet is a true multi-domain system: while it supported different domains, each domain was treated almost as an independent sub-site with separate within-domain recommendations. The system also contains many aspects of a social network, including friendship links, blogs, and comments. The combination of explicit user feedback (ratings) and diverse domains makes Imhonet very unique and valuable for cross-domain recommendation. We use a dataset that includes Imhonet's four large domains: books, movies, games, and perfumes. It contains a full set of user ratings (at the time of collection) across the four domains. Each rating record in the dataset includes a user ID, an item ID, and a rating value between zero (not rated) and ten. The same user ID indicates the same user across the sets of ratings. Some basic statistics of this dataset are shown in Table 1.

Table 1: Basic Statistics for the Imhonet Dataset.

                              Book        Game        Movie      Perfume
  user size                 362448       72307       426897       19717
  item size                 167384       12768        90793        3640
  density                  0.00022     0.00140      0.00073      0.00350
  # records              13438520     1324945     28281946       253948
  avg. # ratings per user  37.0771     18.2339        66.30      12.8796
  avg. # ratings per item  80.2856    103.7708     311.4992      69.7659

To pre-process this dataset, we find the shared users across each category pair (sketched below).
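This shared-user pre-processing can be sketched as follows. This is a hypothetical illustration: the column names (`user_id`, `item_id`, `rating`) are assumptions about the dataset layout, and a dense pivot is only feasible on small samples of the data.

```python
import pandas as pd

def shared_user_matrices(src_ratings: pd.DataFrame, tgt_ratings: pd.DataFrame):
    """Keep only users who rated items in both domains and return
    aligned user-by-item rating matrices X (source) and Y (target)."""
    shared = set(src_ratings.user_id) & set(tgt_ratings.user_id)
    src = src_ratings[src_ratings.user_id.isin(shared)]
    tgt = tgt_ratings[tgt_ratings.user_id.isin(shared)]
    # Zero stands for "not rated", matching the dataset's rating scale.
    X = src.pivot_table(index="user_id", columns="item_id",
                        values="rating", fill_value=0)
    Y = tgt.pivot_table(index="user_id", columns="item_id",
                        values="rating", fill_value=0)
    return X.sort_index(), Y.sort_index()   # identical user row order
```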
3.2 Experiment Setup

To run the experiments, we used a user-stratified 5-fold cross-validation setting: 20% of the users are selected as test users and the rest (80%) are used as training users. We recommend items to the test users given the training data and 20% of their ratings. Some of the algorithms have parameters that should be selected by cross-validation. To find the best set of parameters for each algorithm, we select 15% of the users as "validation" users and remove 80% of their ratings from the training set. We repeat the experiments 5 times and report the average performance of the algorithms.

To measure the performance of the algorithms, we use Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Although there are other measures, such as rank-based ones, to evaluate recommender systems, we choose these two error measures because of the goals of the proposed and baseline algorithms: they try to estimate user ratings, instead of optimizing recommendation rankings. Rank-based measures, such as precision, recall, and nDCG, would not be appropriate for, or representative of, these recommenders' performance.

For the single-domain algorithm, we use only the target domain dataset. However, for cross-domain algorithms, we have both source and target datasets. To be able to compare single-domain and cross-domain algorithms, we remove the same set of ratings for all of the algorithms.
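A minimal sketch of this protocol, assuming per-user rating arrays; the fold logic and function names are illustrative, not the paper's exact code:

```python
import numpy as np

def user_folds(user_ids, n_folds=5, seed=0):
    """Split users (not individual ratings) into folds; each fold in
    turn serves as the 20% test-user set."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(np.unique(user_ids)), n_folds)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```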
3.3 Results

There are four domains in the dataset: books, movies, perfumes, and games. This results in 12 domain pairs to study. Some statistics of the domain pairs are presented in Table 3.

Table 3: Domain and domain-pair data size statistics for the Imhonet dataset.

  source    target    user size   source item size   target item size   source density   target density
  book      game         41756         125688             11407             0.0007           0.0020
  book      movie       186877         155765             85892             0.0003           0.0014
  book      perfume      16750         105805              3545             0.0011           0.0037
  game      book         41756          11407            125688             0.0020           0.0007
  game      movie        49784          11715             75599             0.0019           0.0028
  game      perfume       6297           6854              3232             0.0030           0.0041
  movie     book        186877          85892            155765             0.0014           0.0003
  movie     game         49784          75599             11715             0.0028           0.0019
  movie     perfume      17882          63708              3565             0.0041           0.0037
  perfume   book         16750           3545            105805             0.0037           0.0011
  perfume   game          6297           3232              6854             0.0041           0.0030
  perfume   movie        17882           3565             63708             0.0037           0.0041

We can see that the "books" and "movies" domain pairs have the largest numbers of shared users, while the "games" and "perfumes" domains have the fewest common users. The "books" domain has the largest and the "perfumes" domain the smallest number of items. Also, the "books" domain is among the most sparse domains, while the "perfumes" domain is the least sparse one.

We run the proposed and baseline algorithms on each of these domain pairs. Figures 1 and 2 show the RMSE and MAE of the algorithms on the 12 domain pairs of Imhonet. The reported error bars represent a 95% confidence interval for the errors.

[Figure 1: RMSE of the algorithms (CD-LCCA, CD-SVD, SD-SVD) on the 12 Imhonet domain pairs, sorted by CD-LCCA RMSE.]

[Figure 2: MAE of the algorithms on the 12 Imhonet domain pairs, ordered by the MAE of CD-LCCA.]

As we can see in these figures, the use of cross-domain data with a competitive algorithm originally designed for a single domain does not really help: the single-domain algorithm (SD-SVD) performs better than, or similarly to, the cross-domain baseline (CD-SVD) in many domains. Only in the "movie → book" and "game → movie" domain pairs is CD-SVD significantly better than SD-SVD. The domains in these two pairs are semantically closer than those in the other domain pairs. However, CD-LCCA performs significantly better than both CD-SVD and SD-SVD in all of the domain pairs. Thus, CD-LCCA is able to see beyond the semantic relationships between domains and capture latent similarities that may not seem intuitive. Also, we can see that the confidence intervals in most of the domain pairs (except for "game → perfume" and "perfume → book") are small.

To understand whether the average errors of the algorithms are related to each other across domain pairs, we look at the RMSE correlations between the algorithms, reported in Table 2.

Table 2: Correlation of the algorithms' RMSE with each other. *: significant with p-value < 0.01.

              CD-LCCA     CD-SVD     SD-SVD
  CD-LCCA         1       0.1993    -0.1909
  CD-SVD       0.1993         1      0.7416*
  SD-SVD      -0.1909     0.7416*        1

Here, we see that the RMSE of the CD-SVD and SD-SVD algorithms is highly correlated. However, CD-LCCA's RMSE does not have any significant correlation with the two baseline algorithms' performance.

Altogether, we conclude that CD-LCCA is helpful in estimating user preferences using auxiliary domain information in large-scale datasets; that the baseline cross-domain algorithm not designed for this purpose (CD-SVD) may harm the recommendation results rather than help; that the errors of the baseline recommender algorithms are correlated with each other; and that CD-LCCA can discover unintuitive, but useful, similarities between domain pairs that are not discovered by CD-SVD.

4 DO LARGE-SCALE CROSS-DOMAIN ALGORITHMS ALLEVIATE COLD-START?

One of the major problems in the recommender systems literature is the cold-start problem [16]. Cross-domain recommenders aim to alleviate this problem by transferring the target user's source profile information for recommendation in the target domain. In CD-LCCA, this transfer happens by mapping the source and target domains using canonical variates and correlations, as in Equation 4. In this section, we investigate the success of such transfer by comparing the performance of CD-LCCA, CD-SVD, and SD-SVD in the cold-start setting. To understand how each of these algorithms performs in the cold-start setting, we group the test users of each dataset based on their target domain profile size. Then, we calculate the error for each group of users under each of the algorithms (sketched below). Figure 3 shows the number of test users vs. target domain profile size across all of the domain pairs. We can see that most of the test users have a small profile (fewer than 10 items) in the target domain. A few users have 100 or more items in their target profile; to keep the plot readable, we do not show these users. Also, there is a concave shape at small (fewer than 10 items) target domain profile sizes. This happens because Imhonet asked some users to rate at least 20 items before providing recommendations to them. Since we only use 20% of the test users' ratings in their target profiles, this increase in profile size appears for profiles that have fewer than 10 items.

[Figure 3: Target profile sizes of users in the Imhonet dataset.]
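The per-group error computation referenced above can be sketched as follows; the per-user error arrays and the cap at 100 items are assumptions mirroring the figures:

```python
import numpy as np

def rmse_by_profile_size(profile_sizes, user_mse, max_size=100):
    """Group users by target-domain profile size and return the
    user-based RMSE for each size, as plotted in Figures 4-7."""
    profile_sizes = np.asarray(profile_sizes)
    user_mse = np.asarray(user_mse)   # mean squared error per user
    out = {}
    for s in range(1, max_size + 1):
        mask = profile_sizes == s
        if mask.any():
            out[s] = float(np.sqrt(user_mse[mask].mean()))
    return out
```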
Figures 4 and 5 show the RMSE and MAE of the algorithms in the cold-start setting based on target user profile size, averaged over all of the domain pairs.

[Figure 4: User-based RMSE of the algorithms in the Imhonet dataset, averaged over all domain pairs and sorted by the users' target domain profile size.]

[Figure 5: User-based MAE of the algorithms in the Imhonet dataset, averaged over all domain pairs and sorted by the users' target domain profile size.]

As we can see in these figures, on average over all domain pairs, CD-LCCA performs significantly better than both of the baselines. Also, the single-domain baseline (SD-SVD) on average performs better than the cross-domain baseline (CD-SVD). At smaller profile sizes, SD-SVD's error is significantly lower than CD-SVD's error. As the target domain profile size grows, the errors of the two baseline algorithms show no significant differences.

To better understand the cold-start situation in each of the domain pairs, we look at the results of the domain-pair combinations separately. Figures 6 and 7 show each algorithm's cold-start RMSE and MAE in each of the domain pairs. Note that we have plotted the errors for target profile sizes ranging from one to 100 items; in some domain pairs (e.g., "game → perfume"), the maximum user profile size is less than 100 and thus the plot is discontinued.

[Figure 6: User-based RMSE of the algorithms in the Imhonet dataset, averaged on each domain pair and sorted by the users' target domain profile size.]

As we can see, for small profile sizes, CD-LCCA performs significantly better than the baseline algorithms in all domain pairs except "game → perfume". This shows that CD-LCCA can successfully transfer useful information from most source domains to the target domain, especially in the cold-start situation. For "book" and "movie" target domains, the superior performance of CD-LCCA continues at large profile sizes. But in "game" and "perfume" target domains, the performance difference between the algorithms becomes insignificant after users have enough items in their target profile (between 25 and 45 items, depending on the domain pair). There are fewer users with larger profile sizes in these domains; thus, we have lower confidence in the algorithms' performance and wider confidence intervals, leading to insignificant differences.

Comparing CD-SVD and SD-SVD, we can see that they mostly have similar results. In all experiments with the "movie" domain as the source domain, SD-SVD performs significantly better than CD-SVD from the beginning. But in "game → movie" and "perfume → movie", CD-SVD can be significantly better than SD-SVD, especially at larger profile sizes.
Accordingly, at smaller target profile sizes, not only does CD-SVD not help, but it can even harm the recommendation results. This shows that while CD-LCCA can efficiently use the extra source domain information, CD-SVD cannot handle this information effectively.

[Figure 7: User-based MAE of the algorithms in the Imhonet dataset, averaged on each domain pair and sorted by the users' target domain profile size.]

Looking at the error trends, for some domain pairs (e.g., "movie → book" and "game → movie"), we see an initial error increase as the target profile size grows. Although we would expect to see smaller errors as we have more information from users in the target domain, the observed trend runs against this expectation. This trend appears in all algorithms, including the single-domain baseline (SD-SVD). Thus, such behavior cannot be attributed to the use of extra information in the cross-domain algorithms.

Altogether, we can conclude that not only can CD-LCCA efficiently handle extra information from semantically-related domains, but it can also understand the relationships between source and target domains that appear to be unrelated.

5 PERFORMANCE ANALYSIS

In CD-LCCA, calculating the large-scale CCA costs O(Nnp(N_2 + k_pc) + Nnk^2), in which N is the number of iterations for least squares; n is the number of data points (users); p is the number of items in the target domain; N_2 is the number of iterations to compute Y_r using gradient descent; k_pc is the number of top singular vectors used in LING; and k is the number of components. The multiplications in CD-LCCA depend on the number of nonzero elements in the matrices. In the worst case of multiplying dense matrices, the multiplications cost O(npk + nk^2). Thus, as a whole, CD-LCCA costs O(Nnp(N_2 + k_pc) + Nnk^2 + npk). Since k_pc ≤ k and k ≤ p, CD-LCCA costs less than O(Nnp(N_2 + k)).
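The last simplification can be made explicit; a short derivation, assuming only the stated bounds k_pc ≤ k and k ≤ p (and N ≥ 1):

```latex
\begin{align*}
Nnp(N_2 + k_{pc}) + Nnk^2 + npk
  &\le Nnp(N_2 + k) + Nnpk + Nnpk
     && \text{(since } k_{pc} \le k,\; k \le p,\; 1 \le N\text{)}\\
  &= Nnp(N_2 + 3k) \;=\; O\big(Nnp(N_2 + k)\big).
\end{align*}
```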
In our experiments, we ran all of the algorithms on two similar machines: a MacOS machine with 64GB RAM and two 4-core 2.26GHz Intel Xeon CPUs, and a Linux (CentOS) machine with 64GB RAM and two 4-core 2.40GHz Intel Xeon CPUs. On average, running CD-LCCA in Matlab on each domain pair took 21210 seconds (close to 6 hours), while running CD-SVD with GraphChi took almost 4 hours.

6 CONCLUSIONS

This work presented a large-scale cross-domain collaborative filtering approach, CD-LCCA. Our experiments on a large-scale user-item rating dataset with 12 domain pairs showed that cross-domain collaborative filtering can be helpful even in large-scale target domains. We saw that CD-LCCA improves recommendation results in both hot-start and cold-start settings in all domain pairs, while the baseline cross-domain algorithm helped only in domain pairs with higher semantic similarity. In some cases, adding auxiliary information to the baseline cross-domain algorithm harmed the results. Thus, we concluded that CD-LCCA is able to capture unintuitive relationships between different domains that are not understood by the baseline algorithms. Our cold-start analysis showed that the proposed model is especially helpful in the cold-start setting. CD-LCCA focuses on domains with shared users. As a follow-up to this work, we will expand CD-LCCA to perform cross-domain recommendation in domains with partly-shared, and partly exclusive, users.

REFERENCES

[1] Shlomo Berkovsky, Tsvi Kuflik, and Francesco Ricci. 2007. Cross-Domain Mediation in Collaborative Filtering. In Proceedings of the 11th International Conference on User Modeling (UM '07). Springer-Verlag, Berlin, Heidelberg.
[2] Shlomo Berkovsky, Tsvi Kuflik, and Francesco Ricci. 2008. Mediation of user models for enhanced personalization in recommender systems. User Modeling and User-Adapted Interaction 18, 3 (Aug. 2008).
[3] Paolo Cremonesi and Massimo Quadrana. 2014. Cross-domain Recommendations Without Overlapping Data: Myth or Reality?. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys '14). ACM, New York, NY, USA, 297–300.
[4] Ali Elkahky, Yang Song, and Xiaodong He. 2015. A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems. In Proceedings of the 24th International Conference on World Wide Web (WWW '15). International World Wide Web Conference Committee (IW3C2).
[5] Siamak Faridani. 2011. Using canonical correlation analysis for generalized sentiment analysis, product recommendation and search. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). ACM, New York, NY, USA.
[6] Ignacio Fernández-Tobías, Iván Cantador, Marius Kaminskas, and Francesco Ricci. 2012. Cross-domain recommender systems: A survey of the state of the art. In Spanish Conference on Information Retrieval.
[7] Harold Hotelling. 1936. Relations Between Two Sets of Variates. Biometrika 28, 3/4 (1936).
[8] Tomoharu Iwata and Koh Takeuchi. 2015. Cross-domain recommendation without shared users or items by sharing latent vector distributions. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. 379–387.
[9] Arto Klami, Guillaume Bouchard, Abhishek Tripathi, et al. 2014. Group-sparse Embeddings in Collective Matrix Factorization. In Proceedings of the International Conference on Learning Representations (ICLR) 2014.
[10] Bin Li. 2011. Cross-domain collaborative filtering: A brief survey. In Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on. IEEE, 1085–1086.
[11] Bin Li, Qiang Yang, and Xiangyang Xue. 2009. Transfer learning for collaborative filtering via a rating-matrix generative model. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 617–624.
[12] Babak Loni, Alan Said, Martha Larson, and Alan Hanjalic. 2014. 'Free lunch' enhancement for collaborative filtering with factorization machines. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 281–284.
[13] Yucheng Low, Deepak Agarwal, and Alexander J. Smola. 2011. Multiple domain user personalization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 123–131.
[14] Yichao Lu and Dean P. Foster. 2014. Large scale canonical correlation analysis with iterative least squares. In Advances in Neural Information Processing Systems. 91–99.
[15] Weike Pan, Evan Wei Xiang, and Qiang Yang. 2012. Transfer Learning in Collaborative Filtering with Uncertain Ratings. In AAAI.
[16] Denis Parra and Shaghayegh Sahebi. 2013. Recommender systems: Sources of knowledge and evaluation metrics. In Advanced Techniques in Web Intelligence-2. Springer, 149–175.
[17] Shaghayegh Sahebi and Peter Brusilovsky. 2015. It Takes Two to Tango: An Exploration of Domain Pairs for Cross-Domain Collaborative Filtering. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, 131–138.
[18] Weiqing Wang, Zhenyu Chen, Jia Liu, Qi Qi, and Zhihong Zhao. 2012. User-based collaborative filtering on cross domain by tag transfer learning. In Proceedings of the 1st International Workshop on Cross Domain Knowledge Discovery in Web and Social Network Mining. ACM, 10–17.
[19] Pinata Winoto and Tiffany Tang. 2008. If You Like the Devil Wears Prada the Book, Will You also Enjoy the Devil Wears Prada the Movie? A Study of Cross-Domain Recommendations. New Generation Computing 26 (2008).
[20] Lei Wu, Wensheng Zhang, and Jue Wang. 2014. Fusion Hidden Markov Model with Latent Dirichlet Allocation Model in Heterogeneous Domains. In Proceedings of the International Conference on Internet Multimedia Computing and Service. ACM, 261.
[21] Yu Zhang, Bin Cao, and Dit-Yan Yeung. 2010. Multi-domain collaborative filtering. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 725–732.
[22] Lili Zhao, Sinno Jialin Pan, Evan Wei Xiang, Erheng Zhong, Zhongqi Lu, and Qiang Yang. 2013. Active transfer learning for cross-system recommendation. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI 2013), Bellevue, Washington, USA. 1205.