<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Revisiting the performance evaluation of knowledge-aware recommender systems: are we making progress?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marina Ananyeva</string-name>
          <email>m.ananyeva@tinkoff.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleg Lashinin</string-name>
          <email>o.a.lashinin@tinkoff.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Kuznetsova</string-name>
          <email>ext.mekuznetsova@tinkoff.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>20 Myasnitskaya St, Moscow, 101000, Russian Federation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tinkoff</institution>
          ,
          <addr-line>38A Khutorskaya St, Moscow, 127287, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge-aware recommender systems incorporate side information to improve recommendation performance. The authors of new algorithms usually focus on developing the ideas behind the proposed methods and comparing their models with existing knowledge-aware recommender models. Meanwhile, some commonly used state-of-the-art general top-n recommender models are ignored as potential baselines. In this study, we compare previously proposed knowledge-based recommender systems with simple and computationally effective recommender models (EASE and ItemKNN) that do not use any additional information about users and items. Our results on three datasets show that the claimed effect of using side information in recommender systems is still questionable.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>knowledge-based models</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, knowledge-aware recommender systems have gained great popularity and development. Most of the approaches are based on knowledge graphs (KG-based); other types of knowledge-aware recommender systems exist as well. A knowledge graph contains two subgraphs. The first one is the user-item graph, in which vertices correspond to users and items. They are connected by edges that reflect interactions between users and items. The second subgraph contains additional information about users and items. It has intersecting nodes with the first graph and may have other additional vertices. Edges of the second subgraph represent side information that connects some nodes of the second subgraph. As a result, KG-based models not only use interactions between users and items but also enrich the models with various types of side information. Such algorithms rely on collaborative signals and additionally model other hidden causes of interactions. For instance, a movie recommender model may recommend film X because the target user prefers films of a similar genre or likes other films from the same director. In this example, the model cannot retrieve such causes purely from interactions. Therefore, the use of knowledge graphs can supply information that interactions alone do not provide.</p>
      <p>When the authors propose a new KG-based model, they often compare it only with other KG-based models, or with weak conventional top-n recommender baselines. We have found that some of the matrix factorization-based approaches are often used to demonstrate the superiority of using knowledge graphs. However, according to [2], well-tuned baselines such as the widely-known ItemKNN [3], graph-based models without additional information, and even linear models can outperform deep learning approaches on the same task. Additionally, according to benchmarks [4], simple baselines are computationally effective and often more scalable to real-world applications. Thus, the benefit of using knowledge graphs in recommender systems should be further studied.</p>
      <p>In this study, we compare highly cited KG-based models with computationally effective collaborative filtering (CF-based) models, namely ItemKNN and EASE [5]. The contributions of this paper can be summarized as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>We aggregated information about datasets, ranking metrics, and evaluation strategies. For experiments, we included baselines for 12 graph-based recommender models (8 of them use knowledge graphs) published from 2016 to 2020. Also, we discuss some methodological issues in the evaluation of KG-based models.</p>
        </list-item>
        <list-item>
          <p>We conducted experiments on three real-world datasets with hyperparameter tuning both for baselines and KG-based models. Our study sheds light on the questionable recommendation quality of KG-based models.</p>
        </list-item>
      </list>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
    </sec>
    <sec id="sec-1-2">
      <title>2. Related work</title>
      <sec id="sec-1-2-1">
        <title>2.1. Graph-based models</title>
        <p>There are approaches that use the graph structure of interactions without additional information. The idea of exploiting only user-item graphs for recommendation has been researched for a long time. Users and items are represented as nodes and interactions between them as edges. Early approaches like ItemRank [6], P3 [7], and RP3β [8] utilize random walks to propagate user preferences. Later, graph neural networks were applied to retrieve high-order connections between nodes and capture non-trivial collaborative signals. To do this, such models as PinSage [9], GC-MC [10], and NGCF [11] were developed based on the graph convolution operation. Recently, simplified versions of Graph Convolution Networks (GCN) were proposed. For example, the authors of LightGCN [12] removed nonlinearities, collapsed multiple weight matrices, and outperformed the previous state-of-the-art NGCF [11]. Lately, researchers presented SGL [13] with a self-supervised task and additional data augmentation.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. KG-based models</title>
        <p>Models with external knowledge can be divided into 3 main groups: embedding-based (KGE), path-based, and hybrid methods [14].</p>
        <p>Embedding-based methods use information from the knowledge graph to enrich the representation of users and items. As a rule, the main idea of such models is based on the integration of standard collaborative filtering and embeddings from the knowledge graph. For example, CKE [15] combines various types of additional data about items: structural, textual, and visual information. All representations are then aggregated together to calculate final recommendations. In KTUP [16] the researchers solve a graph completion task. The MKR [17] model connects two tasks, constructing a recommender system and a vector representation of a KG, which allows solving them simultaneously in a multi-task learning mode.</p>
        <p>Path-based approaches exploit the idea that similar entities in a graph are connected by close relationships. For example, PER [18] treats the KG as a heterogeneous information network and extracts latent features from meta-paths of varying length and nature to represent heterogeneous relationships between users and items. HeteRec [19] leverages the meta-path similarities to enrich the user-item interaction matrix. However, the first path-based approaches involved the manual formation of meta-paths, which led to sub-optimal results and limited their use for different recommendation scenarios.</p>
        <p>In [20], the authors proposed combining KGE and path-based approaches into the RippleNet model. The general idea of the method is that the user's embedding is formed from the embeddings of items with which the user interacted in the past, as well as from their neighbors on a graph up to a given depth. Other examples of hybrid models are KGCN [21], where the representation of an object is formed by aggregating the embeddings of its neighbors. In KGNN-LS [22] the authors prove that label smoothness regularization is equivalent to the label propagation problem and use a leave-one-out loss to evaluate the importance of each relationship type to the user. The KGAT [23] model uses the attention mechanism to distinguish the importance of the influence of different neighbors.</p>
      </sec>
      <sec id="sec-1-2-3">
        <title>2.3. Surveys of KG-based RecSys models</title>
        <p>There are a few comprehensive surveys about graph-based recommender models. In paper [24], researchers provide a comprehensive overview of graph-based models for recommendations. Recent work [25] analyzes existing work in KG-based recommendations and outlines future directions. But this paper does not contain independent experiments with well-tuned collaborative filtering baselines. Both papers do not include extra experiments with tuning hyperparameters of baselines. Although there are a lot of evaluation research studies with a comparison of conventional top-N recommendation models [2, 26, 27], there are no such articles on KG-based models. To the best of our knowledge, we are the first to provide experiments with a wide range of KG-based models and compare them with strong CF-based baselines.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
      <p>A typical knowledge graph can be divided into two subgraphs. The first one is a user-item graph gathered from interaction data, and the second subgraph represents additional knowledge. To study the performance of using additional information in recommender models, we compare models that exploit the full knowledge graph with models that use only the user-item subgraph.</p>
      <p>Table 1. Summary of the surveyed graph-based recommender models: the datasets reported in each paper (MovieLens-1M, MovieLens-20M, Book-Crossing, Last.FM, Yelp2018, Amazon-Books, Gowalla, Bing-News, Alibaba-iFashion, Dianping-Food, and others); the compared baselines, grouped into models with a bipartite user-item graph (e.g., MF, NeuMF, GC-MC, NGCF, LightGCN, Mult-VAE), KGE models (e.g., CKE, DKN, PER, LibFM + TransR, RippleNet), and hybrid graph-based models (e.g., FM, CFKG, MCRec); the graph type; and each model's motivation (e.g., high-order embedding propagation, self-supervision, multi-task learning, preference propagation, label smoothing). CF-based model baselines are underlined. A bipartite graph G1 is defined as a set of triplets {(u, e, i) : u ∈ U, i ∈ I}, where U and I are the sets of users and objects, respectively, represented as vertices, and connections e represent interactions. In the User-Item graph, in addition to interactions, vertices with external features of the item are added.</p>
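      <p>The triplet definition in the table note can be made concrete: interactions are stored as (user, relation, item) triplets, and the bipartite user-item graph is simply their adjacency structure. A minimal illustrative sketch (function and relation names are ours):</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict

def build_bipartite_graph(triplets):
    # triplets: iterable of (user, relation, item) tuples, where the
    # relation marks an observed interaction (e.g. "rated").
    user_to_items = defaultdict(set)
    item_to_users = defaultdict(set)
    for user, relation, item in triplets:
        user_to_items[user].add(item)
        item_to_users[item].add(user)
    return user_to_items, item_to_users

triplets = [("u1", "rated", "i1"), ("u1", "rated", "i2"), ("u2", "rated", "i1")]
u2i, i2u = build_bipartite_graph(triplets)
```
      </preformat>
      <p>The second subgraph of a knowledge graph extends this structure with additional entity vertices connected to the item nodes.</p>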
      <sec id="sec-2-1">
        <title>3.1. Datasets</title>
        <p>We consider three real-world datasets from different areas of application and varying sparsity degrees.</p>
        <p>MovieLens-1M (https://grouplens.org/datasets/movielens/1m/): This dataset contains ratings for movies.</p>
        <p>Amazon-Books (http://jmcauley.ucsd.edu/data/amazon/): This dataset contains information about the ratings of books purchased by Amazon users. 30-core filtering is used to ensure the dataset's quality and match the available computing resources.</p>
        <p>TTRS [30] is a real dataset of financial transactions of 50,000 randomly selected clients of Tinkoff bank, covering spending in all areas of life, from buying food and household appliances to paying for various forms of leisure. Users with fewer than 10 interactions were excluded.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Experimental Settings</title>
        <p>Evaluation metrics. We use the standard metrics
NDCG@k and Recall@k averaged over the entire sample
to assess the recommendations’ quality. Due to paper
space limitations, we report metrics only with k = 10.</p>
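        <p>Concretely, the two metrics can be computed per user and then averaged. A minimal sketch with binary relevance (function names are ours, not RecBole's):</p>
        <preformat preformat-type="code">
```python
import math

def recall_at_k(recommended, relevant, k=10):
    # Fraction of the user's relevant items that appear in the top-k list.
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]).intersection(relevant))
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k=10):
    # Binary-relevance NDCG: DCG of the ranked list divided by the DCG
    # of an ideal ranking that places all relevant items first.
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k])
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```
        </preformat>
        <p>Averaging these per-user values over the entire sample yields the reported NDCG@10 and Recall@10.</p>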
        <p>Evaluation protocol. For each dataset, 80% of the
interactions for each user were randomly selected as the
training sample. We also leave 10% of interactions for
validation and 10% for the test set. One negative example
is randomly selected from a uniform distribution for each
interaction during the training phase. At the stage of
validation and testing, we rank positive examples relative
to all items in the dataset.</p>
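        <p>The protocol above can be sketched in code. This is a simplified illustration assuming interactions are stored as a user-to-items mapping; it is not the framework's exact sampler:</p>
        <preformat preformat-type="code">
```python
import random

def split_user_interactions(user_items, seed=42):
    # Randomly split each user's interactions into 80% train,
    # 10% validation, and 10% test.
    rng = random.Random(seed)
    train, valid, test = {}, {}, {}
    for user, items in user_items.items():
        items = list(items)
        rng.shuffle(items)
        n_train = int(0.8 * len(items))
        n_valid = int(0.1 * len(items))
        train[user] = items[:n_train]
        valid[user] = items[n_train:n_train + n_valid]
        test[user] = items[n_train + n_valid:]
    return train, valid, test

def sample_negative(all_items, positives, rng):
    # One negative example drawn uniformly for an observed interaction:
    # resample until the candidate is not a known positive.
    while True:
        candidate = rng.choice(all_items)
        if candidate not in positives:
            return candidate
```
        </preformat>
        <p>At evaluation time no sampling is applied: positive examples are ranked against all items in the catalog.</p>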
        <p>Baselines. Three popular non-graph models based on statistics were chosen as baselines:</p>
        <list list-type="order">
          <list-item>
            <p>Pop is a recommendation of the most popular items in the dataset. Popularity is calculated by the frequency of the object's occurrence among all interactions;</p>
          </list-item>
          <list-item>
            <p>ItemKNN [3] is an item-based approach that recommends objects that are close in the cosine similarity measure;</p>
          </list-item>
          <list-item>
            <p>EASE [5] is a state-of-the-art linear model based on the idea of an autoencoder. It has an analytical solution for a convex loss function.</p>
          </list-item>
        </list>
        <p>Hyperparameter Fitting. To review the current state of KG-based models, we took the popular open-source library for building recommender systems RecBole [4] (https://github.com/RUCAIBox/RecBole). We took all the graph models implemented in RecBole version 1.0.0 and analyzed the relevant articles for the datasets, metrics, baselines, data preprocessing, and pipeline for building recommender systems. Table 1 presents the summary statistics.</p>
        <p>We found a few issues with evaluating KG-based models and, consequently, with the claimed value of additional knowledge for recommendation quality.</p>
        <p>Firstly, the authors include matrix factorization (BPR, SVD) and factorization machines (FM, LibFM) as collaborative filtering baselines. Meanwhile, KNN-based, VAE-based, and item-based models are ignored. Moreover, MKR [17] and RippleNet [20] are not even compared with CF baselines. Therefore, we cannot conclude that KG-based models outperform algorithms based only on user-item interactions.</p>
        <p>Secondly, researchers in most papers do not properly tune the hyperparameters of CF-based models. According to [2], this leads to weak performance of baselines and possible superiority of the proposed KG-based methods. To fill this gap, we conduct comprehensive experiments with tuning both baselines and KG-based models.</p>
        <p>Implementation details. All calculations were performed on a Tesla V100 GPU; experiments were carried out on the basis of the RecBole framework. For all models, we tune hyperparameters with random greedy search, optimizing NDCG@10 within 24 hours for each model.</p>
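        <p>Part of EASE's appeal as a baseline is that fitting it requires no iterative training: the item-item weight matrix has a closed-form solution [5]. A dense NumPy sketch of that solution (variable names are ours):</p>
        <preformat preformat-type="code">
```python
import numpy as np

def fit_ease(X, reg_lambda=100.0):
    # X: dense user-item interaction matrix of shape (n_users, n_items).
    # Closed-form solution of EASE (Steck, 2019):
    #   P = inverse(X^T X + lambda I)
    #   B_ij = -P_ij / P_jj for i != j, with a zero diagonal.
    G = X.T @ X + reg_lambda * np.eye(X.shape[1])
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))
    np.fill_diagonal(B, 0.0)
    return B

def recommend_scores(X, B):
    # Item scores for every user: interactions propagated through
    # the learned item-item weight matrix.
    return X @ B
```
        </preformat>
        <p>For large catalogs the inversion of the item-item Gram matrix dominates the cost, but it is a single linear-algebra step rather than an epoch-based training loop.</p>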
        <sec id="sec-2-2-1">
          <title>3.3. Results</title>
          <p>Table 2 presents the NDCG@10 and Recall@10 values in our experiments.</p>
          <p>RQ1: Graph models on all datasets outperform the Pop baseline but demonstrate poor performance compared to the state-of-the-art non-graph model EASE [5], and to ItemKNN [3] on Amazon-Books. Although the EASE model was published in 2019 and could not be used as a baseline in most models, the authors of DGCF [28], LightGCN [12], and SGL [13] should have considered including it, as well as ItemKNN [3], in their analysis. Even though EASE and ItemKNN use less information, being based on interaction data alone, they are often able to model a behavioral signal better than knowledge graph models and are easier to implement.</p>
          <p>RQ2: Approaches with a bipartite user-item graph that are not enhanced with additional connections with item characteristics demonstrated the next best quality after EASE. Despite the fact that both models with a bipartite user-item graph and a knowledge graph incorporate […]</p>
          <p>The code for our experiments is available at https://github.com/kg-based-recsys-eval/kg_based_recsys_eval.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>In this study, we examined the quality and required resources for training the developed KG-based recommendation models from different perspectives. Furthermore, we uncovered a few methodological issues that contribute to the stated effectiveness of KG-based models, which may be addressed by incorporating simple linear baselines and adjusting their hyperparameters in the experimental set-up. Our comprehensive experiments on three real-world datasets reveal that a basic linear model, EASE, outperforms any KG-based approach. Moreover, the early-adopted ItemKNN approach shows better quality in making recommendations than some graph models on several datasets. Our findings indicate that researchers should include and tune such simple baselines in their experiments and compare new algorithms to them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>1</label><mixed-citation>Z. Meng, R. McCreadie, C. Macdonald, I. Ounis, Exploring data splitting strategies for the evaluation of recommendation models, in: Fourteenth ACM conference on recommender systems, 2020, pp. 681–686.</mixed-citation></ref>
      <ref id="ref2"><label>2</label><mixed-citation>M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 101–109. doi:10.1145/3298689.3347058.</mixed-citation></ref>
      <ref id="ref3"><label>3</label><mixed-citation>B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th international conference on World Wide Web, 2001, pp. 285–295.</mixed-citation></ref>
      <ref id="ref4"><label>4</label><mixed-citation>W. X. Zhao, S. Mu, Y. Hou, Z. Lin, Y. Chen, X. Pan, K. Li, Y. Lu, H. Wang, C. Tian, et al., RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms, in: Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management, 2021, pp. 4653–4664.</mixed-citation></ref>
      <ref id="ref5"><label>5</label><mixed-citation>H. Steck, Embarrassingly shallow autoencoders for sparse data, in: The World Wide Web Conference, 2019, pp. 3251–3257.</mixed-citation></ref>
      <ref id="ref6"><label>6</label><mixed-citation>M. Gori, A. Pucci, V. Roma, I. Siena, ItemRank: A random-walk based scoring algorithm for recommender engines, in: IJCAI, volume 7, 2007, pp. 2766–2771.</mixed-citation></ref>
      <ref id="ref7"><label>7</label><mixed-citation>C. Cooper, S. H. Lee, T. Radzik, Y. Siantos, Random walks in recommender systems: exact computation and simulations, in: Proceedings of the 23rd international conference on World Wide Web, 2014, pp. 811–816.</mixed-citation></ref>
      <ref id="ref8"><label>8</label><mixed-citation>B. Paudel, F. Christoffel, C. Newell, A. Bernstein, Updatable, accurate, diverse, and scalable recommendations for interactive applications, ACM Transactions on Interactive Intelligent Systems (TiiS) 7 (2016) 1–34.</mixed-citation></ref>
      <ref id="ref9"><label>9</label><mixed-citation>R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2018, pp. 974–983.</mixed-citation></ref>
      <ref id="ref10"><label>10</label><mixed-citation>R. v. d. Berg, T. N. Kipf, M. Welling, Graph convolutional matrix completion, arXiv preprint arXiv:1706.02263 (2017).</mixed-citation></ref>
      <ref id="ref11"><label>11</label><mixed-citation>X. Wang, X. He, M. Wang, F. Feng, T.-S. Chua, Neural graph collaborative filtering, in: Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval, 2019, pp. 165–174.</mixed-citation></ref>
      <ref id="ref12"><label>12</label><mixed-citation>X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020, pp. 639–648.</mixed-citation></ref>
      <ref id="ref13"><label>13</label><mixed-citation>J. Wu, X. Wang, F. Feng, X. He, L. Chen, J. Lian, X. Xie, Self-supervised graph learning for recommendation, Association for Computing Machinery, New York, NY, USA, 2021, pp. 726–735. URL: https://doi.org/10.1145/3404835.3462862.</mixed-citation></ref>
      <ref id="ref14"><label>14</label><mixed-citation>Q. Guo, F. Zhuang, C. Qin, H. Zhu, X. Xie, H. Xiong, Q. He, A survey on knowledge graph-based recommender systems, IEEE Transactions on Knowledge and Data Engineering (2020).</mixed-citation></ref>
      <ref id="ref15"><label>15</label><mixed-citation>F. Zhang, N. J. Yuan, D. Lian, X. Xie, W.-Y. Ma, Collaborative knowledge base embedding for recommender systems, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 353–362.</mixed-citation></ref>
      <ref id="ref16"><label>16</label><mixed-citation>Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: The World Wide Web conference, 2019, pp. 151–161.</mixed-citation></ref>
      <ref id="ref17"><label>17</label><mixed-citation>H. Wang, F. Zhang, M. Zhao, W. Li, X. Xie, M. Guo, Multi-task feature learning for knowledge graph enhanced recommendation, in: The World Wide Web conference, 2019, pp. 2000–2010.</mixed-citation></ref>
      <ref id="ref18"><label>18</label><mixed-citation>X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, J. Han, Personalized entity recommendation: A heterogeneous information network approach, in: Proceedings of the 7th ACM international conference on Web search and data mining, 2014, pp. 283–292.</mixed-citation></ref>
      <ref id="ref19"><label>19</label><mixed-citation>X. Yu, X. Ren, Y. Sun, B. Sturt, U. Khandelwal, Q. Gu, B. Norick, J. Han, Recommendation in heterogeneous information networks with implicit user feedback, in: Proceedings of the 7th ACM conference on Recommender systems, 2013, pp. 347–350.</mixed-citation></ref>
      <ref id="ref20"><label>20</label><mixed-citation>H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, M. Guo, RippleNet: Propagating user preferences on the knowledge graph for recommender systems, in: Proceedings of the 27th ACM international conference on information and knowledge management, 2018, pp. 417–426.</mixed-citation></ref>
      <ref id="ref21"><label>21</label><mixed-citation>H. Wang, M. Zhao, X. Xie, W. Li, M. Guo, Knowledge graph convolutional networks for recommender systems, in: The World Wide Web conference, 2019, pp. 3307–3313.</mixed-citation></ref>
      <ref id="ref22"><label>22</label><mixed-citation>H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, Z. Wang, Knowledge-aware graph neural networks with label smoothness regularization for recommender systems, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2019, pp. 968–977.</mixed-citation></ref>
      <ref id="ref23"><label>23</label><mixed-citation>X. Wang, X. He, Y. Cao, M. Liu, T.-S. Chua, KGAT: Knowledge graph attention network for recommendation, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2019, pp. 950–958.</mixed-citation></ref>
      <ref id="ref24"><label>24</label><mixed-citation>S. Wu, F. Sun, W. Zhang, X. Xie, B. Cui, Graph neural networks in recommender systems: a survey, ACM Computing Surveys (CSUR) (2020).</mixed-citation></ref>
      <ref id="ref25"><label>25</label><mixed-citation>J. Chicaiza, P. Valdiviezo-Diaz, A comprehensive survey of knowledge graph-based recommender systems: Technologies, development, and contributions, Information 12 (2021) 232.</mixed-citation></ref>
      <ref id="ref26"><label>26</label><mixed-citation>V. W. Anelli, A. Bellogín, T. Di Noia, D. Jannach, C. Pomo, Top-N recommendation algorithms: A quest for the state-of-the-art, arXiv preprint arXiv:2203.01155 (2022).</mixed-citation></ref>
      <ref id="ref27"><label>27</label><mixed-citation>S. Rendle, W. Krichene, L. Zhang, Y. Koren, Revisiting the performance of iALS on item recommendation benchmarks, arXiv preprint arXiv:2110.14037 (2021).</mixed-citation></ref>
      <ref id="ref28"><label>28</label><mixed-citation>X. Wang, H. Jin, A. Zhang, X. He, T. Xu, T.-S. Chua, Disentangled graph collaborative filtering, in: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, 2020, pp. 1001–1010.</mixed-citation></ref>
      <ref id="ref29"><label>29</label><mixed-citation>Y. Zhang, Q. Ai, X. Chen, P. Wang, Learning over knowledge-base embeddings for recommendation, arXiv preprint arXiv:1803.06540 (2018).</mixed-citation></ref>
      <ref id="ref30"><label>30</label><mixed-citation>S. Kolesnikov, O. Lashinin, M. Pechatov, A. Kosov, TTRS: Tinkoff transactions recommender system benchmark, arXiv preprint arXiv:2110.05589 (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>