KB4Rec: A Dataset for Linking Knowledge Bases with Recommender Systems

Wayne Xin Zhao, Gaole He, Hongjian Dou, Jin Huang, Siqi Ouyang and Ji-Rong Wen
{batmanfly,ouyangsiqi0726}@gmail.com, {hegaole, hongjiandou, jin.huang, jrwen}@ruc.edu.cn
School of Information, Renmin University of China

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

To develop a knowledge-aware recommender system, a key data problem is how to obtain rich and structured knowledge information for recommender system (RS) items. Existing datasets or methods either use side information from the original recommender systems (containing very few kinds of useful information) or utilize a private knowledge base (KB). In this paper, we present a public linked KB dataset for recommender systems, named KB4Rec v1.0, which links three widely used RS datasets with the popular KB Freebase. Based on our linked dataset, we perform some interesting qualitative analysis experiments, in which we discuss the effect of two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity.

1 Introduction

With the rapid development of Web techniques, various kinds of side information have become available in recommender systems (RS). In an early stage, such context information was usually unstructured, and its availability was limited to specific data domains or platforms [1, 2, 3]. Recently, more and more efforts have been made by both research and industry communities for structuring world knowledge or domain facts in a variety of data domains. One of the most typical organization forms is the knowledge base (KB) [4]. KBs provide a general and unified way to organize and relate information entities, and they have been shown to be useful in many applications. In particular, KBs have been used in RSs [5, 6], which is usually called knowledge-aware recommendation.

To develop a knowledge-aware recommender system, a key data problem is how to obtain rich and structured knowledge information for RS items. Overall, there are two main solutions in existing studies. First, side information is collected from the RS platform [1, 2, 3], and several studies further construct tiny and simple KB-like knowledge structures [7]. The number of attributes or relations is usually limited, and much useful knowledge information has not been considered. Second, several works propose to link RSs with private KBs [5]. The linkage results are not publicly available.

To address the need for a linked dataset of RSs and KBs, we present a public linked KB dataset for recommender systems, named KB4Rec v1.0, freely available at https://github.com/RUCDM/KB4Rec. Our basic idea is to heuristically link items from RSs with entities from a public large-scale KB¹. On the RS side, we select three widely used datasets (i.e., MovieLens [1], LFM-1b [2] and Amazon book [3]) covering three different data domains, namely movie, music and book; on the KB side, we select the well-known Freebase [8]. We try to maximize the applicability of our linked dataset by selecting very popular RS datasets and KBs. We are also aware of some closely related studies [9, 10], which also aim to link RS items with KB entities. In contrast, our focus is on Freebase, which is now widely used in many NLP or related domains [4].

¹ We use the terms "items" and "entities" for RSs and KBs, respectively.

In our KB4Rec v1.0 dataset, we organize the linkage results as linked ID pairs, each of which consists of an RS item ID and a KB entity ID. We do not share the original datasets, since they are maintained by the original researchers or publishers. All the IDs are inner values from the original datasets. Once such a linkage has been accomplished, existing large-scale KB data can be reused for RSs. For example, the movie "Avatar" from the MovieLens dataset [1] has a corresponding entity entry in Freebase, and we are able to obtain its attribute information by reading out all its associated relation triples in Freebase. Based on the linked dataset, we first perform some interesting qualitative analysis experiments, in which we discuss the effect of two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity. Finally, we present a comparison of several knowledge-aware recommendation algorithms on our linked dataset.

2 Existing Datasets and Methods

In this section, we briefly review the related datasets and methods.

Early knowledge-aware recommendation algorithms are also called context-aware recommendation algorithms, in which the side information from the original RS platform is considered as context data. For example, social network information of the Epinions dataset is utilized in [11, 12], POI property information of the Yelp dataset is utilized in [13], movie attribute information of the MovieLens dataset is utilized in [7], and user profile information of a microblogging dataset is utilized in [14]. These datasets usually contain very few kinds of side information, and the relation between different kinds of side information is ignored.

To make such side information more structured, Heterogeneous Information Networks (HIN) have been proposed as a general technique for modeling information networks [15]. In HINs, we can effectively learn underlying relation patterns (called meta-paths) and organize side information via meta-path-based representations. For example, HIN-based recommendation methods include PER [7] and MCRec [16]. HIN-based algorithms usually rely on graph search algorithms, which have difficulty dealing with large-scale relation pattern finding.

More recently, KBs have become a popular kind of data resource to store and organize world knowledge or domain facts. Many studies have been proposed [4] for the construction, inference and applications of KBs. In particular, several pioneering studies try to leverage existing KB information for improving the recommendation performance [17, 5, 18]. They apply a heuristic method for linking RS items with KB entities. These studies use a private KB for linkage, which cannot be obtained publicly.

We are also aware of some closely related studies, including [9, 10], which also aim to link RS items with KB entities. In contrast, our focus is on Freebase, which is now widely used in many NLP or related domains [4].

3 Linked Dataset Construction

In our work, we need to prepare two kinds of datasets, namely RS and KB data. Next, we first give detailed descriptions of the original datasets, and then discuss the linkage method.

RS Datasets. We consider three popular RS datasets for linkage, namely MovieLens, LFM-1b and Amazon book, which cover the three domains of movie, music and book respectively.

(1) MovieLens dataset [1] describes users' preferences on movies. A preference record takes the form ⟨user, item, rating, timestamp⟩, indicating the rating score of a user for a movie at some time. There have been four MovieLens datasets released, known as 100K, 1M, 10M, and 20M, reflecting the approximate number of ratings in each dataset. We select the largest, MovieLens 20M, for linkage.

(2) LFM-1b dataset [2] describes users' interaction records on music. It provides information including artists, albums, tracks, and users, as well as individual listening events. It records the listening count of a song by a user, but does not contain rating information.

(3) Amazon book dataset [3] describes users' preferences on book products with the data form ⟨user, item, rating, timestamp⟩. The dataset is very sparse, containing 22 million ratings from 8 million users across about 2.3 million items.

In the three RS datasets, we have several kinds of side information such as item titles (all), IMDB ID (movie), writer (book) and artist (music). We utilize such side information for subsequent KB linkage.

KB Dataset. We adopt the large-scale public KB Freebase. Freebase [8] is a KB announced by Metaweb Technologies, Inc. in 2007, and it was acquired by Google Inc. on July 16, 2010. Freebase stores facts as triples of the form ⟨head, relation, tail⟩. Since Freebase shut down its services on August 31, 2016, we use the version of March 2015, which is its latest public version. We select Freebase because it has been widely applied in the research communities [4].

RS to KB Linkage. With an offline Freebase search API, we retrieve KB entities with item titles as queries. If no KB entity with the same title is returned, we say the RS item is rejected in the linkage process. If at least one KB entity with the same title is returned, we further incorporate one kind of side information as a refined constraint for accurate linkage: IMDB ID, artist name and writer name are used for the three domains of movie, music and book respectively. We find that only a small number of RS items (about one thousand for each domain) can be neither accurately linked nor rejected via the above procedure, and we simply discard them. During the linkage process, we deal with several problems that affect the results of string matching algorithms, e.g., letter case, abbreviations, and the order of family/given names. Since the LFM-1b dataset is extremely large, we remove all the songs with fewer than ten listening events. Even after filtering, it still contains about 6.5 million songs.

Table 1: Statistics of the linkage results. The three domains correspond to the RS datasets of MovieLens 20M, LFM-1b and Amazon book, respectively.

Datasets   #Items      #Linked-Items   #Users      #Interactions
Movie      27,279      25,982          138,493     20,000,263
Music      6,479,700   1,254,923       120,317     1,021,931,544
Book       2,330,066   109,671         3,468,412   22,507,155

Basic Statistics. We summarize the basic statistics of the three linked datasets in Table 1. It can be observed that for the MovieLens 20M dataset, we have a very high linkage ratio: about 95.2% of the items can be accurately linked to a KB entity. For the LFM-1b dataset, the linkage ratio is 19.4%. The linkage ratio for the book domain is very low, about 4.7%. A possible explanation is that the MovieLens 20M dataset contains fewer items than the other two datasets, and these items have already been refined by the original releasers. Besides, we speculate that there may exist domain bias in the construction of Freebase. Although the linkage ratios for the latter two datasets are not high, the absolute numbers of linked items are large. Such a linked dataset is feasible for research-purpose studies.

Shared Datasets. We name the above linked KB dataset for recommender systems KB4Rec v1.0, freely available at https://github.com/RUCDM/KB4Rec. In our KB4Rec v1.0 dataset, we organize the linkage results as linked ID pairs, each of which consists of an RS item ID and a KB entity ID. All the IDs are inner values from the original datasets. We have 25,982, 1,254,923, and 109,671 linked ID pairs for MovieLens 20M, LFM-1b and Amazon book respectively.

4 Linkage Analysis

Previously, we have shown the linkage ratios for different datasets. We find that a considerable number of RS items cannot be linked to KB entities. It is interesting to study what factors affect the linkage ratio. We consider two kinds of factors for analysis.

Effect of Popularity on Linkage. Intuitively, a popular RS item should be more likely to be included in a KB than an unpopular item, since it is reasonable to incorporate into KBs the RS items that are judged more "important" by the RS users. The construction of a KB itself usually involves manual efforts, which makes it difficult to avoid the bias of human attention. To measure the popularity of an RS item, we adopt a simple frequency-based method by counting the number of users who have interacted with the item. This measure characterizes the attractiveness of an item to the users in an RS. First, we sort the items in ascending order of popularity. Then, we equally divide all the items into five ordered bins with the same number of items. Hence, an item with a larger bin number is more popular than one with a smaller bin number. We then compute the linkage ratio for each bin, and the results are reported in Fig. 1(a) (the three subfigures on the left). It can be observed that a bin with a larger number has a higher linkage ratio than the ones with a smaller number. The results indicate that popularity is likely to have a positive effect on linkage.

Effect of Recency on Linkage. The second factor we consider is recency, i.e., the time when an RS item was created. Our assumption is that if an RS item was created or released at an earlier time, it is more likely to be included in KBs. Since human attention aggregation is a gradually growing process, an RS item usually requires a considerable amount of time to become popular. To check this assumption, we need to obtain the release dates of RS items. However, since only the MovieLens 20M dataset contains such attribute information, we only report the analysis result on this dataset. We first sort the items in ascending order of release date, and then equally divide all the items into ten ordered bins following the procedure of the above popularity analysis. Finally, we compute the linkage ratio for each bin. The results are reported in Fig. 1(b). We can see that the linkage ratios gradually decrease over time. The results indicate that recency is likely to have a negative effect on linkage, i.e., an older RS item seems more likely to be included in a KB than a more recent one. In particular, the last bin shows a dramatic drop. A possible reason is that our dump of Freebase was released in March 2015, and many new items have not been included in Freebase.

[Figure 1 (plots omitted). Panel (a) Popularity: linkage ratio over popularity bins A to E in MovieLens 20M, LFM-1b and Amazon book. Panel (b) Recency: linkage ratio over time bins A to J in MovieLens 20M.]

Figure 1: Examining the effect of two factors on the linkage results. We use A, B, ... to indicate the bin number in an ordered way. The first three subfigures correspond to the popularity analysis, and the last one corresponds to the recency analysis.

5 Experiment

In this section, we present a comparison of some existing recommendation algorithms using our linked datasets.

Experimental Setup. Since our linked datasets are very large, we first generate a small test set for evaluation. We take the subset from the last year for the LFM-1b dataset and the subset from year 2005 to 2015 for the MovieLens 20M dataset. We also perform 3-core filtering for the Amazon book dataset and 10-core filtering for the other datasets. We consider the last-item recommendation task for evaluation. Since enumerating all the items as candidates is time-consuming, we pair each ground-truth item with 100 negative items to form a random candidate list. We adopt MRR, HR (reported as Hit@10) and NDCG as the evaluation metrics. To use the rich KB information, we embed KB data into low-dimensional vectors using extended KB subgraphs as in [6]. We consider four methods for comparison, including BPR [19], SVDFeature [20], mCKE [5] and KSR [6]. BPR does not use KB information, SVDFeature and mCKE utilize KB information using shallow models, while KSR utilizes KB information using sequential neural networks and memory networks.

Table 2: Performance comparison of different methods on the task of last-item recommendation. We report the detailed statistics of the evaluation set in the first column.

Datasets                    Methods      MRR     Hit@10   NDCG@10
MovieLens 20M               BPR          0.128   0.276    0.144
  #users: 61,583            SVDFeature   0.204   0.448    0.243
  #items: 19,533            mCKE         0.178   0.382    0.209
  #interactions: 5,868,015  KSR          0.294   0.571    0.344
LFM-1b                      BPR          0.227   0.458    0.265
  #users: 7,694             SVDFeature   0.337   0.544    0.373
  #items: 30,658            mCKE         0.371   0.541    0.399
  #interactions: 203,975    KSR          0.427   0.607    0.460
Amazon book                 BPR          0.222   0.505    0.272
  #users: 65,125            SVDFeature   0.264   0.544    0.315
  #items: 69,975            mCKE         0.248   0.494    0.291
  #interactions: 828,560    KSR          0.353   0.653    0.413

Results and Analysis. Table 2 presents the results of different methods for last-item recommendation. First, among all the methods, BPR performs worst on the three datasets in nearly all cases (the only exception is Hit@10 on Amazon book, where mCKE is slightly lower), since it does not use KB information. Second, SVDFeature is better than BPR. It is implemented with a pairwise ranking loss function, and can be roughly understood as an enhanced BPR model with the incorporation of the learned KB embeddings. Finally, we analyze the performance of the knowledge-aware recommendation methods, namely mCKE and KSR. Overall, mCKE does not work as well as expected: it only beats SVDFeature on the LFM-1b dataset (in MRR and NDCG@10). A possible reason is that our implementation of mCKE fixes the learned KB embeddings, while the original CKE model adaptively updates KB embeddings. In comparison, the recently proposed KSR method, which integrates Recurrent Neural Networks with knowledge-enhanced Memory Networks for recommendation, consistently works best on the three datasets. The readers can refer to [6] for more detailed results and analysis.

6 Conclusion

This paper introduced a public dataset for linking RSs with a KB, namely KB4Rec v1.0. Our dataset covers three domains and consists of a large number of linked ID pairs. As future work, we will consider linking more RS datasets with Freebase. We will also consider adopting other KB data for linkage, e.g., YAGO and DBpedia.

References

[1] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets. TiiS, 5(4):1–19, 2016.

[2] Markus Schedl. The LFM-1b dataset for music retrieval and recommendation. In ICMR, 2016.

[3] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, 2016.

[4] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE TKDE, 2017.

[5] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Collaborative knowledge base embedding for recommender systems. In SIGKDD, pages 353–362, 2016.

[6] Jin Huang, Wayne Xin Zhao, Hong-Jian Dou, Ji-Rong Wen, and Edward Y. Chang. Improving sequential recommendation with knowledge-enhanced memory networks. In SIGIR, 2018.

[7] Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. Personalized entity recommendation: a heterogeneous information network approach. In WSDM, pages 283–292, 2014.

[8] Google. Freebase data dumps. https://developers.google.com/freebase/data, 2016.

[9] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. DBpedia: A nucleus for a web of open data. In ISWC 2007 + ASWC 2007, 2007.

[10] Tommaso Di Noia, Vito Claudio Ostuni, Paolo Tomeo, and Eugenio Di Sciascio. SPrank: Semantic path-based ranking for top-N recommendations using linked open data. ACM TIST, 8(1):9:1–9:34, 2016.

[11] Mohsen Jamali and Martin Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In RecSys, pages 135–142, 2010.

[12] Hao Ma, Irwin King, and Michael R. Lyu. Learning to recommend with social trust ensemble. In SIGIR, pages 203–210, 2009.

[13] Huiji Gao, Jiliang Tang, Xia Hu, and Huan Liu. Content-aware point of interest recommendation on location-based social networks. In AAAI, pages 1721–1727, 2015.

[14] Wayne Xin Zhao, Yanwei Guo, Yulan He, Han Jiang, Yuexin Wu, and Xiaoming Li. We know what you want to buy: a demographic-based system for product recommendation on microblogs. In KDD, 2014.

[15] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: a structural analysis approach. SIGKDD Explorations, 14(2):20–28, 2012.

[16] Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S. Yu. Leveraging meta-path based context for top-N recommendation with a neural co-attention model. In SIGKDD, pages 1531–1540, 2018.

[17] Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. DKN: deep knowledge-aware network for news recommendation. In WWW, pages 1835–1844, 2018.

[18] Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. Ripple network: Propagating user preferences on the knowledge graph for recommender systems. 2018.

[19] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 2009.

[20] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. SVDFeature: a toolkit for feature-based collaborative filtering. Journal of Machine Learning Research, 2012.
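
As an illustration of the "Avatar" example in Section 1, the following minimal sketch shows how linked ID pairs could be joined with KB triples to read out the attributes of an RS item. The in-memory data layout, the function name attributes_for_items, and all IDs and relation names are hypothetical and chosen for illustration only; the on-disk format of KB4Rec and of the Freebase dump is not assumed here.

```python
from collections import defaultdict


def attributes_for_items(linked_pairs, triples):
    """Map each RS item ID to the (relation, tail) facts of its linked KB entity.

    linked_pairs: iterable of (rs_item_id, kb_entity_id)
    triples:      iterable of (head, relation, tail)
    """
    facts_by_head = defaultdict(list)
    for head, relation, tail in triples:
        facts_by_head[head].append((relation, tail))
    return {
        item_id: facts_by_head.get(entity_id, [])
        for item_id, entity_id in linked_pairs
    }


# Hypothetical IDs and relation names, for illustration only.
linked_pairs = [("movie_1", "m.example1"), ("movie_2", "m.example2")]
triples = [
    ("m.example1", "directed_by", "m.example_director"),
    ("m.example1", "genre", "m.example_genre"),
    ("m.example3", "genre", "m.example_genre"),
]
item_facts = attributes_for_items(linked_pairs, triples)
print(item_facts["movie_1"])
```

Items whose entity has no triple in the supplied set simply receive an empty list, so the result always has one entry per linked pair.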
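
The two-step linkage procedure of Section 3 (exact title match, then one kind of side information as a refined constraint) can be sketched as follows. This is not the authors' implementation: the normalization rules, the dictionary fields "id", "title" and "side", and the rule of discarding ambiguous matches are assumptions made for this sketch, and the candidate list stands in for the results of the offline Freebase search API.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def name_key(name: str) -> tuple:
    """Order-insensitive key, so 'Rowling, J. K.' matches 'J. K. Rowling'."""
    return tuple(sorted(normalize(name).split()))


def link_item(item: dict, kb_entities: list) -> Optional[str]:
    """Return the linked KB entity ID, or None if rejected or ambiguous."""
    title = normalize(item["title"])
    # Step 1: keep KB entities whose normalized title equals the item title.
    candidates = [e for e in kb_entities if normalize(e["title"]) == title]
    if not candidates:
        return None  # rejected: no KB entity with the same title
    # Step 2: refine with side information (e.g., writer or artist name).
    side = name_key(item["side"])
    refined = [e for e in candidates if name_key(e["side"]) == side]
    # Assumption: only an unambiguous single match counts as a link.
    return refined[0]["id"] if len(refined) == 1 else None


# Hypothetical entity IDs, for illustration only.
kb = [
    {"id": "m.hypo1", "title": "Harry Potter and the Chamber of Secrets",
     "side": "J. K. Rowling"},
    {"id": "m.hypo2", "title": "Harry Potter and the Chamber of Secrets",
     "side": "Another Author"},
]
book = {"title": "harry potter and the chamber of secrets", "side": "Rowling, J. K."}
print(link_item(book, kb))
```

Sorting the name tokens is one simple way to handle the family/given name order mentioned in the text; abbreviation handling would need additional rules that are not covered here.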
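
The linkage ratios quoted in Section 3 follow directly from the item counts in Table 1 as #Linked-Items divided by #Items. The short check below recomputes them; the variable names are illustrative only.

```python
# (#Items, #Linked-Items) per domain, copied from Table 1.
table1 = {
    "Movie": (27_279, 25_982),
    "Music": (6_479_700, 1_254_923),
    "Book": (2_330_066, 109_671),
}

linkage_ratio = {
    domain: linked / items for domain, (items, linked) in table1.items()
}

for domain, ratio in linkage_ratio.items():
    print(f"{domain}: {ratio:.1%}")
# Movie: 95.2%
# Music: 19.4%
# Book: 4.7%
```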
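
The popularity analysis of Section 4 (count users per item, sort ascending, split into equal-sized ordered bins, compute the linkage ratio per bin) can be sketched as below. The function name, the tie-breaking rule and the handling of item counts that are not divisible by the number of bins are assumptions; the paper does not specify them. The same routine applies to the recency analysis if items are ranked by release date and n_bins is set to ten.

```python
def linkage_ratio_by_bin(interactions, linked_items, n_bins=5):
    """Return the linkage ratio of each popularity bin, least popular first.

    interactions: iterable of (user_id, item_id)
    linked_items: set of item IDs that were linked to a KB entity
    """
    users_per_item = {}
    for user, item in interactions:
        users_per_item.setdefault(item, set()).add(user)
    # Popularity = number of distinct users; ties broken by item ID (assumption).
    ranked = sorted(users_per_item, key=lambda i: (len(users_per_item[i]), str(i)))
    ratios = []
    for b in range(n_bins):
        start = b * len(ranked) // n_bins
        end = (b + 1) * len(ranked) // n_bins
        bin_items = ranked[start:end]
        linked = sum(1 for i in bin_items if i in linked_items)
        ratios.append(linked / len(bin_items) if bin_items else 0.0)
    return ratios


# Toy data: item "i{k}" has k + 1 distinct users.
toy_interactions = [(f"u{u}", f"i{k}") for k in range(10) for u in range(k + 1)]
toy_linked = {"i3", "i5", "i6", "i7", "i8", "i9"}
toy_ratios = linkage_ratio_by_bin(toy_interactions, toy_linked)
print(toy_ratios)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```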
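
Section 5 mentions 3-core and 10-core filtering without defining it. A common reading, assumed here, is that every remaining user and every remaining item must have at least k interactions, with the filter repeated until nothing more is removed. The sketch below follows that reading and may differ from the procedure actually used for the reported experiments.

```python
from collections import Counter


def k_core(interactions, k):
    """Iteratively drop users and items with fewer than k interactions."""
    current = list(interactions)
    while True:
        user_count = Counter(user for user, _ in current)
        item_count = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_count[user] >= k and item_count[item] >= k
        ]
        if len(kept) == len(current):  # fixed point reached
            return kept
        current = kept


toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a")]
two_core = k_core(toy, 2)
print(two_core)
```

Removing a user can push an item below the threshold (and vice versa), which is why a single pass is generally not enough.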
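
The evaluation protocol of Section 5 ranks each ground-truth item against 100 negative items and reports MRR, Hit@10 and NDCG@10. With a single relevant item per list, these reduce to simple functions of the rank r of the ground-truth item: the reciprocal rank is 1/r, Hit@10 is 1 if r is at most 10, and NDCG@10 is 1/log2(r + 1) if r is at most 10 and 0 otherwise. The sketch below assumes untruncated MRR and pessimistic tie handling, neither of which is stated in the paper.

```python
import math


def rank_of_target(target_score, negative_scores):
    """1-based rank of the ground-truth item; ties count against it."""
    return 1 + sum(1 for s in negative_scores if s >= target_score)


def evaluate(cases, k=10):
    """cases: list of (target_score, negative_scores) pairs, one per test user."""
    mrr = hit = ndcg = 0.0
    for target_score, negative_scores in cases:
        r = rank_of_target(target_score, negative_scores)
        mrr += 1.0 / r
        if r <= k:
            hit += 1.0
            ndcg += 1.0 / math.log2(r + 1)  # ideal DCG is 1 for one relevant item
    n = len(cases)
    return {"MRR": mrr / n, f"Hit@{k}": hit / n, f"NDCG@{k}": ndcg / n}


# Two toy test cases with 100 negatives each: ranks 1 and 3.
toy_cases = [
    (0.9, [0.1] * 100),
    (0.5, [0.6, 0.7] + [0.1] * 98),
]
toy_metrics = evaluate(toy_cases)
print(toy_metrics)
```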