<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Revisiting the performance evaluation of knowledge-aware recommender systems: are we making progress?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marina Ananyeva</string-name>
          <email>m.ananyeva@tinkoff.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleg Lashinin</string-name>
          <email>o.a.lashinin@tinkoff.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Kuznetsova</string-name>
          <email>ext.mekuznetsova@tinkoff.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>20 Myasnitskaya St, Moscow, 101000, Russian Federation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tinkoff</institution>
          ,
          <addr-line>38A Khutorskaya St, Moscow, 127287, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge-aware recommender systems incorporate side information to improve recommendation performance. The authors of new algorithms usually focus on developing the ideas behind the proposed methods and comparing their models with existing knowledge-aware recommender models. Meanwhile, some commonly used state-of-the-art general top-n recommender models are ignored as potential baselines. In this study, we compare previously proposed knowledge-based recommender systems with simple and computationally effective recommender models (EASE and ItemKNN) that do not use any additional information about users and items. Our results on three datasets show that the claimed effect of using side information in recommender systems is still questionable.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>knowledge-based models</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, knowledge-aware recommender systems have gained great popularity and development. Most of the approaches are based on knowledge graphs (KG-based); other types of knowledge-aware recommender systems exist as well. A knowledge graph contains two subgraphs. The first one is the user-item graph, in which vertices correspond to users and items. They are connected by edges that reflect interactions between users and items. The second subgraph contains additional information about users and items. It has intersecting nodes with the first graph and may have other additional vertices. Edges of the second subgraph represent side information that connects some nodes of the second subgraph. As a result, KG-based models not only use interactions between users and items but also enrich the models with various types of side information. Such algorithms rely on collaborative signals and additionally model other hidden causes of interactions. For instance, a movie recommender model may recommend film X because the target user prefers films of a similar genre or likes other films from the same director. In this example, the model cannot retrieve such causes purely from interactions. Therefore, the use of knowledge graphs can supply information that interactions alone do not provide.</p>
      <p>When the authors propose a new KG-based model, they often compare it only with other KG-based models, or with weak conventional top-n recommender baselines. We have found that some of the matrix factorization-based approaches are often used to demonstrate the superiority of using knowledge graphs. However, according to [2], well-tuned baselines such as the widely-known ItemKNN [3], graph-based models without additional information, and even linear models can outperform deep learning approaches on the same task. Additionally, according to benchmarks [4], simple baselines are computationally effective and often more scalable to real-world applications. Thus, the benefit of using knowledge graphs in recommender systems should be further studied.</p>
      <p>In this study, we compare highly cited KG-based models with computationally effective collaborative filtering (CF-based) models, namely ItemKNN and EASE [5]. The contributions of this paper can be summarized as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>We aggregated information about datasets, ranking metrics, and evaluation strategies. For experiments, we included baselines for 12 graph-based recommender models (8 of them use knowledge graphs) published from 2016 to 2020. Also, we discuss some methodological issues in the evaluation of KG-based models.</p>
        </list-item>
        <list-item>
          <p>We conducted experiments on three real-world datasets with hyperparameter tuning both for baselines and KG-based models. Our study sheds light on the questionable recommendation quality of KG-based models.</p>
        </list-item>
      </list>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
    </sec>
    <sec id="sec-1-2">
      <title>2. Related work</title>
      <sec id="sec-1-2-1">
        <title>2.1. Graph-based models</title>
        <p>There are approaches that use the graph structure of interactions without additional information. The idea of exploiting only user-item graphs for recommendation has been researched for a long time. Users and items are represented as nodes and interactions between them as edges. Early approaches like ItemRank [6], P3 [7], and RP3β [8] utilize random walks to propagate user preferences. Later, graph neural networks were applied to retrieve high-order connections between nodes and capture non-trivial collaborative signals. To do this, such models as PinSage [9], GC-MC [10], and NGCF [11] were developed based on the graph convolution operation. Recently, simplified versions of Graph Convolution Networks (GCN) were proposed. For example, the authors of LightGCN [12] removed nonlinearities, collapsed multiple weight matrices, and outperformed the previous state-of-the-art NGCF [11]. Lately, researchers presented SGL [13] with a self-supervised task and additional data augmentation.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. KG-based models</title>
        <p>Models with external knowledge can be divided into 3 main groups: embedding-based (KGE), path-based, and hybrid methods [14].</p>
        <p>Embedding-based methods use information from the knowledge graph to enrich the representation of users and items. As a rule, the main idea of such models is based on the integration of standard collaborative filtering and embeddings from the knowledge graph. For example, CKE [15] combines various types of additional data about items: structural, textual, and visual information. All representations are then aggregated together to calculate final recommendations. In KTUP [16] the researchers solve a graph completion task. The MKR [17] model connects two tasks, constructing a recommender system and a vector representation of a KG, which allows solving them simultaneously in a multi-task learning mode.</p>
        <p>Path-based approaches exploit the idea that similar entities in a graph are connected by close relationships. For example, PER [18] treats the KG as a heterogeneous information network and extracts latent features from meta-paths of varying length and nature to represent heterogeneous relationships between users and items. HeteRec [19] leverages the meta-path similarities to enrich the user-item interaction matrix. However, the first path-based approaches involved the manual formation of meta-paths, which led to sub-optimal results and limited their use for different recommendation scenarios.</p>
        <p>In [20], the authors proposed combining KGE and path-based approaches into the RippleNet model. The general idea of the method is that the user's embedding is formed from the embeddings of items with which the user interacted in the past, as well as from their neighbors on a graph up to a given depth. Other examples of hybrid models are KGCN [21], where the representation of an object is formed by aggregating the embeddings of its neighbors. In KGNN-LS [22] the authors prove that label smoothness regularization is equivalent to the label propagation problem and use a leave-one-out loss to evaluate the importance of each relationship type to the user. The KGAT [23] model uses the attention mechanism to distinguish the importance of the influence of different neighbors.</p>
      </sec>
      <sec id="sec-1-2-3">
        <title>2.3. Surveys of KG-based RecSys models</title>
        <p>There are a few comprehensive surveys about graph-based recommender models. In paper [24], researchers provide a comprehensive overview of graph-based models for recommendations. Recent work [25] analyzes existing work in KG-based recommendations and outlines future directions. But this paper does not contain independent experiments with well-tuned collaborative filtering baselines. Both papers do not include extra experiments with tuning hyperparameters of baselines. Although there are a lot of evaluation research studies with a comparison of conventional top-N recommendation models [2, 26, 27], there are no such articles on KG-based models. To the best of our knowledge, we are the first to provide experiments with a wide range of KG-based models and compare them with strong CF-based baselines.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
      <p>A typical knowledge graph can be divided into two subgraphs. The first one is a user-item graph gathered from interaction data, and the second subgraph represents additional knowledge. To study the performance of using additional information in recommender models, we compare models that exploit the full knowledge graph with models that use only the user-item subgraph.</p>
      <p>Table 1. Summary of the surveyed graph-based recommender models: the datasets reported in each paper (MovieLens-1M, MovieLens-20M, Book-Crossing, Last.FM, Yelp2018, Amazon-Books, Gowalla, Bing-News, Alibaba-iFashion, Dianping-Food, and others); the compared baselines, grouped into models with a bipartite user-item graph (e.g., MF, NeuMF, GC-MC, NGCF, LightGCN, Mult-VAE), KGE models (e.g., CKE, DKN, PER, LibFM + TransR, RippleNet), and hybrid graph-based models (e.g., FM, CFKG, MCRec); the graph type; and each model's motivation (e.g., high-order embedding propagation, self-supervision, multi-task learning, preference propagation, label smoothing). CF-based model baselines are underlined. A bipartite graph G1 is defined as a set of triplets {(u, e, i) : u ∈ U, i ∈ I}, where U and I are the sets of users and objects, respectively, represented as vertices, and connections e represent interactions. In the User-Item graph, in addition to interactions, vertices with external features of the item are added.</p>
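      <p>The triplet definition in the table note can be made concrete: interactions are stored as (user, relation, item) triplets, and the bipartite user-item graph is simply their adjacency structure. A minimal illustrative sketch (function and relation names are ours):</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict

def build_bipartite_graph(triplets):
    # triplets: iterable of (user, relation, item) tuples, where the
    # relation marks an observed interaction (e.g. "rated").
    user_to_items = defaultdict(set)
    item_to_users = defaultdict(set)
    for user, relation, item in triplets:
        user_to_items[user].add(item)
        item_to_users[item].add(user)
    return user_to_items, item_to_users

triplets = [("u1", "rated", "i1"), ("u1", "rated", "i2"), ("u2", "rated", "i1")]
u2i, i2u = build_bipartite_graph(triplets)
```
      </preformat>
      <p>The second subgraph of a knowledge graph extends this structure with additional entity vertices connected to the item nodes.</p>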
      <sec id="sec-2-1">
        <title>3.1. Datasets</title>
        <p>We consider three real-world datasets from different areas of application and varying sparsity degrees.</p>
        <p>MovieLens-1M (https://grouplens.org/datasets/movielens/1m/): This dataset contains ratings for movies.</p>
        <p>Amazon-Books (http://jmcauley.ucsd.edu/data/amazon/): This dataset contains information about the ratings of books purchased by Amazon users. 30-core filtering is used to ensure the dataset's quality and match the available computing resources.</p>
        <p>TTRS [30] is a real dataset of financial transactions of 50,000 randomly selected clients of Tinkoff bank, covering spending in all areas of life, from buying food and household appliances to paying for various forms of leisure. Users with fewer than 10 interactions were excluded.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Experimental Settings</title>
        <p>Evaluation metrics. We use the standard metrics
NDCG@k and Recall@k averaged over the entire sample
to assess the recommendations’ quality. Due to paper
space limitations, we report metrics only with k = 10.</p>
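        <p>Concretely, the two metrics can be computed per user and then averaged. A minimal sketch with binary relevance (function names are ours, not RecBole's):</p>
        <preformat preformat-type="code">
```python
import math

def recall_at_k(recommended, relevant, k=10):
    # Fraction of the user's relevant items that appear in the top-k list.
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]).intersection(relevant))
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k=10):
    # Binary-relevance NDCG: DCG of the ranked list divided by the DCG
    # of an ideal ranking that places all relevant items first.
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k])
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```
        </preformat>
        <p>Averaging these per-user values over the entire sample yields the reported NDCG@10 and Recall@10.</p>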
        <p>Evaluation protocol. For each dataset, 80% of the
interactions for each user were randomly selected as the
training sample. We also leave 10% of interactions for
validation and 10% for the test set. One negative example
is randomly selected from a uniform distribution for each
interaction during the training phase. At the stage of
validation and testing, we rank positive examples relative
to all items in the dataset.</p>
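        <p>The protocol above can be sketched in code. This is a simplified illustration assuming interactions are stored as a user-to-items mapping; it is not the framework's exact sampler:</p>
        <preformat preformat-type="code">
```python
import random

def split_user_interactions(user_items, seed=42):
    # Randomly split each user's interactions into 80% train,
    # 10% validation, and 10% test.
    rng = random.Random(seed)
    train, valid, test = {}, {}, {}
    for user, items in user_items.items():
        items = list(items)
        rng.shuffle(items)
        n_train = int(0.8 * len(items))
        n_valid = int(0.1 * len(items))
        train[user] = items[:n_train]
        valid[user] = items[n_train:n_train + n_valid]
        test[user] = items[n_train + n_valid:]
    return train, valid, test

def sample_negative(all_items, positives, rng):
    # One negative example drawn uniformly for an observed interaction:
    # resample until the candidate is not a known positive.
    while True:
        candidate = rng.choice(all_items)
        if candidate not in positives:
            return candidate
```
        </preformat>
        <p>At evaluation time no sampling is applied: positive examples are ranked against all items in the catalog.</p>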
        <p>Baselines. Three popular non-graph models based on statistics were chosen as baselines:</p>
        <list list-type="order">
          <list-item>
            <p>Pop is a recommendation of the most popular items in the dataset. Popularity is calculated by the frequency of the object's occurrence among all interactions;</p>
          </list-item>
          <list-item>
            <p>ItemKNN [3] is an item-based approach that recommends objects that are close in the cosine similarity measure;</p>
          </list-item>
          <list-item>
            <p>EASE [5] is a state-of-the-art linear model based on the idea of an autoencoder. It has an analytical solution for a convex loss function.</p>
          </list-item>
        </list>
        <p>Hyperparameter Fitting. To review the current state of KG-based models, we took the popular open-source library for building recommender systems RecBole [4] (https://github.com/RUCAIBox/RecBole). We took all the graph models implemented in RecBole version 1.0.0 and analyzed the relevant articles for the datasets, metrics, baselines, data preprocessing, and pipeline for building recommender systems. Table 1 presents the summary statistics.</p>
        <p>We found a few issues with evaluating KG-based models and, consequently, with the claimed value of additional knowledge for recommendation quality.</p>
        <p>Firstly, the authors include matrix factorization (BPR, SVD) and factorization machines (FM, LibFM) as collaborative filtering baselines. Meanwhile, KNN-based, VAE-based, and item-based models are ignored. Moreover, MKR [17] and RippleNet [20] are not even compared with CF baselines. Therefore, we cannot conclude that KG-based models outperform algorithms based only on user-item interactions.</p>
        <p>Secondly, researchers in most papers do not properly tune the hyperparameters of CF-based models. According to [2], this leads to weak performance of baselines and possible superiority of the proposed KG-based methods. To fill this gap, we conduct comprehensive experiments with tuning both baselines and KG-based models.</p>
        <p>Implementation details. All calculations were performed on a Tesla V100 GPU; experiments were carried out on the basis of the RecBole framework. For all models, we tune hyperparameters with random greedy search, optimizing NDCG@10 within 24 hours for each model.</p>
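        <p>Part of EASE's appeal as a baseline is that fitting it requires no iterative training: the item-item weight matrix has a closed-form solution [5]. A dense NumPy sketch of that solution (variable names are ours):</p>
        <preformat preformat-type="code">
```python
import numpy as np

def fit_ease(X, reg_lambda=100.0):
    # X: dense user-item interaction matrix of shape (n_users, n_items).
    # Closed-form solution of EASE (Steck, 2019):
    #   P = inverse(X^T X + lambda I)
    #   B_ij = -P_ij / P_jj for i != j, with a zero diagonal.
    G = X.T @ X + reg_lambda * np.eye(X.shape[1])
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))
    np.fill_diagonal(B, 0.0)
    return B

def recommend_scores(X, B):
    # Item scores for every user: interactions propagated through
    # the learned item-item weight matrix.
    return X @ B
```
        </preformat>
        <p>For large catalogs the inversion of the item-item Gram matrix dominates the cost, but it is a single linear-algebra step rather than an epoch-based training loop.</p>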
        <sec id="sec-2-2-1">
          <title>3.3. Results</title>
          <p>Table 2 presents the NDCG@10 and Recall@10 values in our experiments.</p>
          <p>RQ1: Graph models on all datasets outperform the Pop baseline but demonstrate poor performance compared to the state-of-the-art non-graph model EASE [5], and to ItemKNN [3] on Amazon-Books. Although the EASE model was published in 2019 and could not be used as a baseline in most models, the authors of DGCF [28], LightGCN [12], and SGL [13] should have considered including it, as well as ItemKNN [3], in their analysis. Even though EASE and ItemKNN use less information, being based on interaction data alone, they are often able to model a behavioral signal better than knowledge graph models and are easier to implement.</p>
          <p>RQ2: Approaches with a bipartite user-item graph that are not enhanced with additional connections with item characteristics demonstrated the next best quality after EASE. Despite the fact that both models with a bipartite user-item graph and a knowledge graph incorporate […]</p>
          <p>The code for our experiments is available at https://github.com/kg-based-recsys-eval/kg_based_recsys_eval.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>In this study, we examined the quality and required resources for training the developed KG-based recommendation models from different perspectives. Furthermore, we uncovered a few methodological issues that contribute to the stated effectiveness of KG-based models, which may be addressed by incorporating simple linear baselines and adjusting their hyperparameters in the experimental set-up. Our comprehensive experiments on three real-world datasets reveal that a basic linear model, EASE, outperforms any KG-based approach. Moreover, the early-adopted ItemKNN approach shows better quality in making recommendations than some graph models on several datasets. Our findings indicate that researchers should include and tune such simple baselines in their experiments and compare new algorithms to them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>1</label><mixed-citation>Z. Meng, R. McCreadie, C. Macdonald, I. Ounis, Exploring data splitting strategies for the evaluation of recommendation models, in: Fourteenth ACM conference on recommender systems, 2020, pp. 681–686.</mixed-citation></ref>
      <ref id="ref2"><label>2</label><mixed-citation>M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 101–109. doi:10.1145/3298689.3347058.</mixed-citation></ref>
      <ref id="ref3"><label>3</label><mixed-citation>B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th international conference on World Wide Web, 2001, pp. 285–295.</mixed-citation></ref>
      <ref id="ref4"><label>4</label><mixed-citation>W. X. Zhao, S. Mu, Y. Hou, Z. Lin, Y. Chen, X. Pan, K. Li, Y. Lu, H. Wang, C. Tian, et al., RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms, in: Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management, 2021, pp. 4653–4664.</mixed-citation></ref>
      <ref id="ref5"><label>5</label><mixed-citation>H. Steck, Embarrassingly shallow autoencoders for sparse data, in: The World Wide Web Conference, 2019, pp. 3251–3257.</mixed-citation></ref>
      <ref id="ref6"><label>6</label><mixed-citation>M. Gori, A. Pucci, V. Roma, I. Siena, ItemRank: A random-walk based scoring algorithm for recommender engines, in: IJCAI, volume 7, 2007, pp. 2766–2771.</mixed-citation></ref>
      <ref id="ref7"><label>7</label><mixed-citation>C. Cooper, S. H. Lee, T. Radzik, Y. Siantos, Random walks in recommender systems: exact computation and simulations, in: Proceedings of the 23rd international conference on World Wide Web, 2014, pp. 811–816.</mixed-citation></ref>
      <ref id="ref8"><label>8</label><mixed-citation>B. Paudel, F. Christoffel, C. Newell, A. Bernstein, Updatable, accurate, diverse, and scalable recommendations for interactive applications, ACM Transactions on Interactive Intelligent Systems (TiiS) 7 (2016) 1–34.</mixed-citation></ref>
      <ref id="ref9"><label>9</label><mixed-citation>R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2018, pp. 974–983.</mixed-citation></ref>
      <ref id="ref10"><label>10</label><mixed-citation>R. v. d. Berg, T. N. Kipf, M. Welling, Graph convolutional matrix completion, arXiv preprint arXiv:1706.02263 (2017).</mixed-citation></ref>
      <ref id="ref11"><label>11</label><mixed-citation>X. Wang, X. He, M. Wang, F. Feng, T.-S. Chua, Neural graph collaborative filtering, in: Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval, 2019, pp. 165–174.</mixed-citation></ref>
      <ref id="ref12"><label>12</label><mixed-citation>X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020, pp. 639–648.</mixed-citation></ref>
      <ref id="ref13"><label>13</label><mixed-citation>J. Wu, X. Wang, F. Feng, X. He, L. Chen, J. Lian, X. Xie, Self-supervised graph learning for recommendation, Association for Computing Machinery, New York, NY, USA, 2021, pp. 726–735. URL: https://doi.org/10.1145/3404835.3462862.</mixed-citation></ref>
      <ref id="ref14"><label>14</label><mixed-citation>Q. Guo, F. Zhuang, C. Qin, H. Zhu, X. Xie, H. Xiong, Q. He, A survey on knowledge graph-based recommender systems, IEEE Transactions on Knowledge and Data Engineering (2020).</mixed-citation></ref>
      <ref id="ref15"><label>15</label><mixed-citation>F. Zhang, N. J. Yuan, D. Lian, X. Xie, W.-Y. Ma, Collaborative knowledge base embedding for recommender systems, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 353–362.</mixed-citation></ref>
      <ref id="ref16"><label>16</label><mixed-citation>Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: The World Wide Web conference, 2019, pp. 151–161.</mixed-citation></ref>
      <ref id="ref17"><label>17</label><mixed-citation>H. Wang, F. Zhang, M. Zhao, W. Li, X. Xie, M. Guo, Multi-task feature learning for knowledge graph enhanced recommendation, in: The World Wide Web conference, 2019, pp. 2000–2010.</mixed-citation></ref>
      <ref id="ref18"><label>18</label><mixed-citation>X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, J. Han, Personalized entity recommendation: A heterogeneous information network approach, in: Proceedings of the 7th ACM international conference on Web search and data mining, 2014, pp. 283–292.</mixed-citation></ref>
      <ref id="ref19"><label>19</label><mixed-citation>X. Yu, X. Ren, Y. Sun, B. Sturt, U. Khandelwal, Q. Gu, B. Norick, J. Han, Recommendation in heterogeneous information networks with implicit user feedback, in: Proceedings of the 7th ACM conference on Recommender systems, 2013, pp. 347–350.</mixed-citation></ref>
      <ref id="ref20"><label>20</label><mixed-citation>H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, M. Guo, RippleNet: Propagating user preferences on the knowledge graph for recommender systems, in: Proceedings of the 27th ACM international conference on information and knowledge management, 2018, pp. 417–426.</mixed-citation></ref>
      <ref id="ref21"><label>21</label><mixed-citation>H. Wang, M. Zhao, X. Xie, W. Li, M. Guo, Knowledge graph convolutional networks for recommender systems, in: The World Wide Web conference, 2019, pp. 3307–3313.</mixed-citation></ref>
      <ref id="ref22"><label>22</label><mixed-citation>H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, Z. Wang, Knowledge-aware graph neural networks with label smoothness regularization for recommender systems, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2019, pp. 968–977.</mixed-citation></ref>
      <ref id="ref23"><label>23</label><mixed-citation>X. Wang, X. He, Y. Cao, M. Liu, T.-S. Chua, KGAT: Knowledge graph attention network for recommendation, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2019, pp. 950–958.</mixed-citation></ref>
      <ref id="ref24"><label>24</label><mixed-citation>S. Wu, F. Sun, W. Zhang, X. Xie, B. Cui, Graph neural networks in recommender systems: a survey, ACM Computing Surveys (CSUR) (2020).</mixed-citation></ref>
      <ref id="ref25"><label>25</label><mixed-citation>J. Chicaiza, P. Valdiviezo-Diaz, A comprehensive survey of knowledge graph-based recommender systems: Technologies, development, and contributions, Information 12 (2021) 232.</mixed-citation></ref>
      <ref id="ref26"><label>26</label><mixed-citation>V. W. Anelli, A. Bellogín, T. Di Noia, D. Jannach, C. Pomo, Top-N recommendation algorithms: A quest for the state-of-the-art, arXiv preprint arXiv:2203.01155 (2022).</mixed-citation></ref>
      <ref id="ref27"><label>27</label><mixed-citation>S. Rendle, W. Krichene, L. Zhang, Y. Koren, Revisiting the performance of iALS on item recommendation benchmarks, arXiv preprint arXiv:2110.14037 (2021).</mixed-citation></ref>
      <ref id="ref28"><label>28</label><mixed-citation>X. Wang, H. Jin, A. Zhang, X. He, T. Xu, T.-S. Chua, Disentangled graph collaborative filtering, in: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, 2020, pp. 1001–1010.</mixed-citation></ref>
      <ref id="ref29"><label>29</label><mixed-citation>Y. Zhang, Q. Ai, X. Chen, P. Wang, Learning over knowledge-base embeddings for recommendation, arXiv preprint arXiv:1803.06540 (2018).</mixed-citation></ref>
      <ref id="ref30"><label>30</label><mixed-citation>S. Kolesnikov, O. Lashinin, M. Pechatov, A. Kosov, TTRS: Tinkoff transactions recommender system benchmark, arXiv preprint arXiv:2110.05589 (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>