KB4Rec: A Dataset for Linking Knowledge Bases with
                 Recommender Systems

    Wayne Xin Zhao, Gaole He, Hongjian Dou, Jin Huang, Siqi Ouyang and Ji-Rong Wen
 {batmanfly,ouyangsiqi0726}@gmail.com, {hegaole, hongjiandou, jin.huang, jrwen}@ruc.edu.cn
                     School of Information, Renmin University of China


                         Abstract

To develop a knowledge-aware recommender system, a key data problem is how to obtain rich and structured knowledge information for recommender system (RS) items. Existing datasets or methods either use side information from the original recommender systems (containing very few kinds of useful information) or utilize a private knowledge base (KB). In this paper, we present a public linked KB dataset for recommender systems, named KB4Rec v1.0, which links three widely used RS datasets with the popular KB Freebase. Based on our linked dataset, we perform some qualitative analysis experiments, in which we discuss the effect of two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity.

1    Introduction

With the rapid development of Web techniques, various kinds of side information have become available in recommender systems (RS). In the early stage, such context information was usually unstructured, and its availability was limited to specific data domains or platforms [1, 2, 3]. Recently, more and more efforts have been made by both the research and industry communities to structure world knowledge or domain facts in a variety of data domains. One of the most typical organization forms is the knowledge base (KB) [4]. KBs provide a general and unified way to organize and relate information entities, and have been shown to be useful in many applications. In particular, KBs have been used in RSs [5, 6], which is usually called knowledge-aware recommendation.

To develop a knowledge-aware recommender system, a key data problem is how to obtain rich and structured knowledge information for RS items. Overall, there are two main solutions in existing studies. First, side information is collected from the RS platform [1, 2, 3], and several studies further construct tiny and simple KB-like knowledge structures [7]. The number of attributes or relations is usually limited, and much useful knowledge information is not considered. Second, several works propose to link RSs with private KBs [5]. The linkage results are not publicly available.

To address the need for a linked dataset of RSs and KBs, we present a public linked KB dataset for recommender systems, named KB4Rec v1.0, freely available at https://github.com/RUCDM/KB4Rec. Our basic idea is to heuristically link items from RSs with entities from a public large-scale KB¹. On the RS side, we select three widely used datasets (i.e., MovieLens [1], LFM-1b [2] and Amazon book [3]) covering three different data domains, namely movie, music and book; on the KB side, we select the well-known Freebase [8]. We try to maximize the applicability of our linked dataset by selecting very popular RS datasets and KBs. We are also aware of some closely related studies [9, 10], which also aim to link RS items with KB entities. In contrast, our focus is on Freebase, which is now widely used in many NLP or related domains [4].

In our KB4Rec v1.0 dataset, we organize the linkage results as linked ID pairs, each of which consists of an RS item ID and a KB entity ID. We do not share the original datasets, since they are maintained by the original researchers or publishers. All the IDs are internal values from the original datasets. Once such a linkage has been accomplished, one can reuse existing
1 We use the terms “items” and “entities” for RSs and KBs, respectively.

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).
large-scale KB data for RSs. For example, the movie “Avatar” from the MovieLens dataset [1] has a corresponding entity entry in Freebase, and we are able to obtain its attribute information by reading out all its associated relation triples in Freebase. Based on the linked dataset, we first perform some qualitative analysis experiments, in which we discuss the effect of two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity. Finally, we present a comparison of several knowledge-aware recommendation algorithms on our linked dataset.

2   Existing Datasets and Methods

In this section, we briefly review the related datasets and methods.

Early knowledge-aware recommendation algorithms are also called context-aware recommendation algorithms, in which the side information from the original RS platform is considered as context data. For example, social network information of the Epinions dataset is utilized in [11, 12], POI property information of the Yelp dataset is utilized in [13], movie attribute information of the MovieLens dataset is utilized in [7], and user profile information of a microblogging dataset is utilized in [14]. These datasets usually contain very few kinds of side information, and the relations between different kinds of side information are ignored.

To make such side information more structured, Heterogeneous Information Networks (HIN) have been proposed as a general technique for modeling information networks [15]. In HINs, we can effectively learn underlying relation patterns (called meta-paths) and organize side information via meta-path-based representations. For example, PER [7] and MCRec [16] are HIN-based recommendation methods. HIN-based algorithms usually rely on graph search algorithms, which have difficulty dealing with large-scale relation pattern finding.

More recently, KBs have become a popular kind of data resource for storing and organizing world knowledge or domain facts. Many studies have been proposed [4] for the construction, inference and applications of KBs. In particular, several pioneering studies try to leverage existing KB information to improve recommendation performance [17, 5, 18]. They apply a heuristic method for linking RS items with KB entities. These studies use a private KB for linkage, which cannot be obtained publicly.

We are also aware of some closely related studies, including [9, 10], which also aim to link RS items with KB entities. In contrast, our focus is on Freebase, which is now widely used in many NLP or related domains [4].

3      Linked Dataset Construction

In our work, we need to prepare two kinds of datasets, namely RS and KB data. Next, we first give detailed descriptions of the original datasets, and then discuss the linkage method.

RS Datasets. We consider three popular RS datasets for linkage, namely MovieLens, LFM-1b and Amazon book, which cover the three domains of movie, music and book, respectively.

(1) MovieLens dataset [1] describes users’ preferences on movies. A preference record takes the form ⟨user, item, rating, timestamp⟩, indicating the rating score of a user for a movie at some time. There have been four MovieLens datasets released, known as 100K, 1M, 10M, and 20M, reflecting the approximate number of ratings in each dataset. We select the largest one, MovieLens 20M, for linkage.

(2) LFM-1b dataset [2] describes users’ interaction records on music. It provides information including artists, albums, tracks, and users, as well as individual listening events. It records the listening count of a song by a user, but does not contain rating information.

(3) Amazon book dataset [3] describes users’ preferences on book products in the form ⟨user, item, rating, timestamp⟩. The dataset is very sparse, containing 22 million ratings from 8 million users across about 2.3 million items (see Table 1).

In the three RS datasets, we have several kinds of side information such as item titles (all), IMDB ID (movie), writer (book) and artist (music). We utilize such side information for subsequent KB linkage.

KB Dataset. We adopt the large-scale public KB Freebase. Freebase [8] is a KB announced by Metaweb Technologies, Inc. in 2007; Metaweb was acquired by Google Inc. on July 16, 2010. Freebase stores facts as triples of the form ⟨head, relation, tail⟩. Since Freebase shut down its services on August 31, 2016, we use the version of March 2015, which is its latest public version. We select Freebase because it has been widely applied in the research communities [4].

Table 1: Statistics of the linkage results. The three domains correspond to the RS datasets of MovieLens 20M, LFM-1b and Amazon book, respectively.

    Datasets   #Items      #Linked-Items   #Users      #Interactions
    Movie      27,279      25,982          138,493     20,000,263
    Music      6,479,700   1,254,923       120,317     1,021,931,544
    Book       2,330,066   109,671         3,468,412   22,507,155

RS to KB Linkage. With an offline Freebase search API, we retrieve KB entities with item titles as queries. If no KB entity with the same title is returned, we say the RS item is rejected in the linkage process. If at least one KB entity with the same title is returned, we further incorporate one kind of side information as a refined constraint for accurate linkage: IMDB ID,
artist name and writer name are used for the three domains of movie, music and book, respectively. We find that only a small number (about one thousand for each domain) of RS items can be neither accurately linked nor rejected via the above procedure, and we simply discard them. During the linkage process, we deal with several problems that affect the results of string matching algorithms, e.g., letter case, abbreviations, and the order of family/given names. Since the LFM-1b dataset is extremely large, we remove all the music items with fewer than ten listening events. Even after filtering, it still contains about 6.5 million music items.

Basic Statistics. We summarize the basic statistics of the three linked datasets in Table 1. It can be observed that for the MovieLens 20M dataset, we have a very high linkage ratio: about 95.2% of the items can be accurately linked to a KB entity. For the LFM-1b dataset, the linkage ratio is 19.4%. However, the linkage ratio for the book domain is very low, about 4.7%. A possible explanation is that the MovieLens 20M dataset contains fewer items than the other two datasets, and these items have already been refined by the original releasers. Besides, we speculate that there may exist domain bias in the construction of Freebase. Although the linkage ratios for the latter two datasets are not high, the absolute numbers of linked items are large. Such a linked dataset is feasible for research-purpose studies.

Shared Datasets. We name the above linked KB dataset for recommender systems KB4Rec v1.0, freely available at https://github.com/RUCDM/KB4Rec. In our KB4Rec v1.0 dataset, we organize the linkage results as linked ID pairs, each of which consists of an RS item ID and a KB entity ID. All the IDs are internal values from the original datasets. We have 25,982, 1,254,923, and 109,671 linked ID pairs for MovieLens 20M, LFM-1b and Amazon book, respectively.

4   Linkage Analysis

Previously, we have shown the linkage ratios for different datasets. We find that a considerable number of RS items cannot be linked to KB entities. It is interesting to study what factors affect the linkage ratio. We consider two kinds of factors for analysis.

Effect of Popularity on Linkage. Intuitively, a popular RS item should be more likely to be included in a KB than an unpopular item, since it is reasonable to incorporate more “important” RS items, as judged by the RS users, into KBs. The construction of a KB itself usually involves manual efforts, which makes it difficult to avoid the bias of human attention. To measure the popularity of an RS item, we adopt a simple frequency-based method by counting the number of users who have interacted with the item. This measure characterizes the attractiveness of an item to the users in an RS. First, we sort the items in ascending order of their popularity values. Then, we equally divide all the items into five ordered bins with the same number of items. Hence, an item with a larger bin number is more popular than one with a smaller bin number. We then compute the linkage ratio for each bin, and the results are reported in Fig. 1(a) (the three subfigures on the left). It can be observed that a bin with a larger number has a higher linkage ratio than the ones with a smaller number. The results indicate that popularity is likely to have a positive effect on linkage.

Effect of Recency on Linkage. The second factor we consider is recency, i.e., the time when an RS item was created. Our assumption is that if an RS item was created or released at an earlier time, it would be more likely to be included in KBs. Since human attention aggregation is a gradually growing process, an RS item usually requires a considerable amount of time to become popular. To check this assumption, we need to obtain the release dates of RS items. However, only the MovieLens 20M dataset contains such attribute information, so we only report the analysis result on this dataset. We first sort the items according to their release dates in ascending order, and then equally divide all the items into ten ordered bins, following the procedure of the above popularity analysis. Finally, we compute the linkage ratio for each bin. The results are reported in Fig. 1(b). We can see that the linkage ratios gradually decrease over time. The results indicate that recency is likely to have a negative effect on linkage, i.e., an older RS item seems more likely to be included in a KB than a more recent one. In particular, the last bin shows a dramatic drop. A possible reason is that our dump of Freebase was released in March 2015, and many new items had not been included in Freebase.

5    Experiment

In this section, we present the comparison of some existing recommendation algorithms using our linked datasets.

Experimental Setup. Since our linked datasets are very large, we first generate a small test set for evaluation. We take the subset from the last year for the LFM-1b dataset and the subset from year 2005 to 2015 for the MovieLens 20M dataset. We also perform 3-core filtering for the Amazon book dataset and 10-core filtering for the other datasets. We consider the last-item recommendation task for evaluation. Since enumerating all the items as candidates is time-consuming, we pair each ground-truth item with 100 negative items to form a ran-
[Figure 1: four panels, each plotting the linkage ratio (y-axis) against ordered bins (x-axis). (a) Popularity: popularity bins A to E in MovieLens 20M, LFM-1b and Amazon book. (b) Recency: time bins A to J in MovieLens 20M.]

Figure 1: Examining the effect of two factors on the linkage results. We use A, B, · · · to indicate the bin number in an ordered way. The first three subfigures correspond to the popularity analysis, and the last one corresponds to the recency analysis.
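
As an illustration of the binning procedure behind Fig. 1, a minimal Python sketch is given below. It is not the code used to produce the figure: the function names `popularity` and `linkage_ratio_by_bin` and the toy data are hypothetical, ties are broken by item ID, and bin sizes may differ by one when the number of items is not divisible by the number of bins.

```python
from collections import defaultdict


def popularity(interactions):
    """Popularity of an item = number of distinct users who interacted with it."""
    users = defaultdict(set)
    for user, item in interactions:
        users[item].add(user)
    return {item: len(u) for item, u in users.items()}


def linkage_ratio_by_bin(scores, linked_items, num_bins=5):
    """Sort items by ascending score, split them into ordered bins of
    (almost) equal size, and return the fraction of linked items per bin."""
    ordered = sorted(scores, key=lambda item: (scores[item], str(item)))
    n = len(ordered)
    ratios = []
    for b in range(num_bins):
        bin_items = ordered[b * n // num_bins:(b + 1) * n // num_bins]
        if not bin_items:  # more bins than items
            ratios.append(float("nan"))
            continue
        linked = sum(1 for item in bin_items if item in linked_items)
        ratios.append(linked / len(bin_items))
    return ratios


# Toy data (assumption): item "i<k>" has k + 1 distinct users.
toy = [(f"u{u}", f"i{k}") for k in range(10) for u in range(k + 1)]
toy_linked = {"i3", "i5", "i6", "i7", "i8", "i9"}
print(linkage_ratio_by_bin(popularity(toy), toy_linked, num_bins=5))
```

The same function covers the recency analysis if release dates (for example, years) are passed as `scores` with `num_bins=10`.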

Table 2: Performance comparison of different methods on the task of last-item recommendation. We report the detailed statistics of the evaluation set in the first column.

    Datasets                        Methods      MRR     Hit@10   NDCG@10
    MovieLens 20M                   BPR          0.128   0.276    0.144
      #users:            61,583     SVDFeature   0.204   0.448    0.243
      #items:            19,533     mCKE         0.178   0.382    0.209
      #interactions:  5,868,015     KSR          0.294   0.571    0.344
    LFM-1b                          BPR          0.227   0.458    0.265
      #users:             7,694     SVDFeature   0.337   0.544    0.373
      #items:            30,658     mCKE         0.371   0.541    0.399
      #interactions:    203,975     KSR          0.427   0.607    0.460
    Amazon book                     BPR          0.222   0.505    0.272
      #users:            65,125     SVDFeature   0.264   0.544    0.315
      #items:            69,975     mCKE         0.248   0.494    0.291
      #interactions:    828,560     KSR          0.353   0.653    0.413
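
The three metrics in Table 2 can be computed from the rank of the ground-truth item within its candidate list. The sketch below assumes the usual single-relevant-item definitions (reciprocal rank, a hit indicator, and 1/log2(rank + 1) for NDCG); the paper does not spell these out, nor whether MRR is truncated at a cutoff, so MRR is left untruncated here. The function names are hypothetical.

```python
import math


def rank_of_target(scores, target_index=0):
    """1-based rank of the ground-truth item among the candidates
    (higher score is better). Ties are resolved pessimistically:
    a candidate with an equal score is ranked ahead of the target."""
    target = scores[target_index]
    ahead = sum(1 for i, s in enumerate(scores)
                if i != target_index and s >= target)
    return ahead + 1


def evaluate(ranks, k=10):
    """Average MRR, Hit@k and NDCG@k over test cases that each have
    exactly one relevant item (so the ideal DCG is 1)."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hit = sum(1.0 for r in ranks if r <= k) / n
    ndcg = sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n
    return {"MRR": mrr, f"Hit@{k}": hit, f"NDCG@{k}": ndcg}


# Toy usage (assumption): three test cases whose targets rank 1st, 3rd and 20th.
print(evaluate([1, 3, 20], k=10))
```

In the setup described here, each `scores` list would hold 101 model scores: one for the ground-truth item and 100 for the sampled negative items.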
dom candidate list. We adopt MRR, HR (hit ratio, reported as Hit@10) and NDCG as the evaluation metrics. To use the rich KB information, we embed KB data into low-dimensional vectors using extended KB subgraphs as in [6]. We consider four methods for comparison, including BPR [19], SVDFeature [20], mCKE [5] and KSR [6]. BPR does not use KB information, SVDFeature and mCKE utilize KB information using shallow models, while KSR utilizes KB information using sequential neural networks and memory networks.

Results and Analysis. Table 2 presents the results of different methods for last-item recommendation. First, among all the methods, BPR performs worst on the three datasets in almost all cases (the only exception is Hit@10 on Amazon book, where mCKE is slightly lower), since it does not use KB information. Second, SVDFeature is better than BPR. It is implemented with a pairwise ranking loss function, and can be roughly understood as an enhanced BPR model with the incorporation of the learned KB embeddings. Finally, we analyze the performance of the knowledge-aware recommendation methods, namely mCKE and KSR. Overall, mCKE does not work as well as expected: it only beats SVDFeature on the LFM-1b dataset. A possible reason is that our implementation of mCKE fixes the learned KB embeddings, while the original CKE model adaptively updates KB embeddings. In comparison, the recently proposed KSR method, which integrates Recurrent Neural Networks with knowledge-enhanced Memory Networks for recommendation, consistently works best on the three datasets. Readers can refer to [6] for more detailed results and analysis.

6 Conclusion

This paper introduced a public dataset for linking RSs with a KB, namely KB4Rec v1.0. Our dataset covers three domains and consists of a large number of linked ID pairs. As future work, we will consider linking more RS datasets with Freebase. We will also consider adopting other KB data for linkage, e.g., YAGO and DBpedia.

References

 [1] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets. TiiS, 5(4):1–19, 2016.
 [2] Markus Schedl. The LFM-1b dataset for music retrieval and recommendation. In ICMR, 2016.
 [3] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, 2016.
 [4] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE TKDE, 2017.
 [5] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Collaborative knowledge base embedding for recommender systems. In SIGKDD, pages 353–362, 2016.
 [6] Jin Huang, Wayne Xin Zhao, Hong-Jian Dou, Ji-Rong Wen, and Edward Y. Chang. Improving sequential recommendation with knowledge-enhanced memory networks. In SIGIR, 2018.
 [7] Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. Personalized entity recommendation: a heterogeneous information network approach. In WSDM, pages 283–292, 2014.
 [8] Google. Freebase data dumps. https://developers.google.com/freebase/data, 2016.
 [9] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. DBpedia: A nucleus for a web of open data. In ISWC 2007 + ASWC 2007, 2007.
[10] Tommaso Di Noia, Vito Claudio Ostuni, Paolo Tomeo, and Eugenio Di Sciascio. SPrank: Semantic path-based ranking for top-N recommendations using linked open data. ACM TIST, 8(1):9:1–9:34, 2016.
[11] Mohsen Jamali and Martin Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In RecSys, pages 135–142, 2010.
[12] Hao Ma, Irwin King, and Michael R. Lyu. Learning to recommend with social trust ensemble. In SIGIR, pages 203–210, 2009.
[13] Huiji Gao, Jiliang Tang, Xia Hu, and Huan Liu. Content-aware point of interest recommendation on location-based social networks. In AAAI, pages 1721–1727, 2015.
[14] Wayne Xin Zhao, Yanwei Guo, Yulan He, Han Jiang, Yuexin Wu, and Xiaoming Li. We know what you want to buy: a demographic-based system for product recommendation on microblogs. In KDD, 2014.
[15] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: a structural analysis approach. SIGKDD Explorations, 14(2):20–28, 2012.
[16] Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S. Yu. Leveraging meta-path based context for top-N recommendation with a neural co-attention model. In SIGKDD, pages 1531–1540, 2018.
[17] Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. DKN: deep knowledge-aware network for news recommendation. In WWW, pages 1835–1844, 2018.
[18] Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. Ripple network: Propagating user preferences on the knowledge graph for recommender systems. 2018.
[19] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 2009.
[20] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. SVDFeature: a toolkit for feature-based collaborative filtering. Journal of Machine Learning Research, 2012.