<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GInRec: A Gated Architecture for Inductive Recommendation using Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Theis E. Jendal</string-name>
          <email>tjendal@cs.aau.dk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Lissandrini</string-name>
          <email>matteo@cs.aau.dk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Dolog</string-name>
          <email>dolog@cs.aau.dk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katja Hose</string-name>
          <email>katja.hose@tuwien.ac.at</email>
          <email>khose@cs.aau.dk</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We have witnessed increasing interest in exploiting KGs to integrate contextual knowledge in recommender systems in addition to user-item interactions, e.g., ratings. Yet, most methods are transductive, i.e., they represent instances seen during training as low-dimensionality vectors but cannot do so for unseen instances. Hence, they require heavy retraining every time new items or users are added. Conversely, inductive methods promise to solve these issues. KGs enhance inductive recommendation by offering information on item-entity relationships, whereas existing inductive methods rely purely on interactions, which makes recommendations for users with few interactions sub-optimal and even impossible for new items. In this work, we investigate the actual ability of inductive methods to exploit both the structure and the data represented by KGs. Hence, we propose GInRec, a state-of-the-art method that uses a graph neural network with relation-specific gates and a KG to provide better recommendations for new users and items than related inductive methods. As a result, we re-evaluate state-of-the-art methods, identify better evaluation protocols, highlight unwarranted conclusions from previous proposals, and showcase a novel, stronger architecture for this task. The source code is available at: https://github.com/theisjendal/kars2023-recommendation-framework.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In Recommender Systems (RSs), an item is recommended
to a user based on their preferences. Usually, these
preferences are extracted from a user’s historic interactions
with items, such as clicks or purchases. An RS can either
recommend based on user-item interactions, based on
descriptive features of the items, or both. In the first
case, for example in the movie domain, the system would
assume that users who watch the same movies are likely
to do so also in the future; this approach is commonly
referred to as Collaborative Filtering (CF) [
        <xref ref-type="bibr" rid="ref22">1, 2, 3, 4, 5</xref>
        ]. In
the second case, instead, the system would assume that
the user is likely to watch movies with genres and plots
similar to those of movies they watched in the past, i.e., a
content-based method. Challenges arise for the former approach
when, for a given user, only very few interactions are
known; a similar challenge arises when the information
describing items is scarce. The idea is then to combine both
kinds of information. In this regard, recently, RSs have
been proposed to model knowledge about items derived
from a KG [6, 7, 8, 9, 10, 11, 12]. A KG represents entities
and their attributes as nodes and edges within a graph
model, e.g., taxonomies, item descriptions, or categories
attached to items. These models further integrate
user-item interactions into the graph, obtaining in this way a
Collaborative KG (CKG), as in Figure 1.A, allowing for
recommendations for users that have only a few ratings
or with newly added items that have none at all.
      </p>
      <p>[Figure 1: A Collaborative Knowledge Graph connecting users (Alex, Aiden, Max) to movies (Inception, Don Jon, The Prestige, American Hustle) and to descriptive entities such as genres (Heist, Sci-Fi, Crime Fiction, Action, Fiction, Tragedy, Drama); GInRec applies a GNN to the user and item representations and predicts whether the user likes the item.]</p>
      <sec id="sec-1-1">
        <title>For example, in Figure 1.B, we are making predictions</title>
        <p>for a user for whom we do not have any information
(an empty embedding vector) except for a few rated
items. The KG connects directors, genres, and actors to
the rated movies, where some of these entities are
described by textual information, e.g., bios and synopses.</p>
        <p>We can use the connections and data to infer user
preferences beyond the collaborative signals.</p>
        <p>
          Many existing methods only work in a transductive
setting; that is, it is assumed that all users and items have
been seen during training [13, 14], meaning transductive
models require retraining whenever new users or items
are introduced. Instead, some models try to offer inductive
capabilities [13, 15, 14]. In an inductive setting, users
and items exist that are not in the training set; therefore,
inductive methods extract information from the data to
incorporate local structures and obtain an inductive bias.
Nonetheless, existing methods usually model only user-item
interactions, ignoring the KG [
          <xref ref-type="bibr" rid="ref30">16, 13, 14, 17</xref>
          ]. Our analysis of existing
works (see Section 3) identified four important limitations:
(i) the reliance on user metadata [
          <xref ref-type="bibr" rid="ref26 ref30">17, 18</xref>
          ], (ii) the
tendency to rely exclusively on collaborative information
and to bias preference over popular items [13, 15, 14],
(iii) the poor scalability of methods that create user-item
subgraphs for every rating [19, 15], and (iv) the missed
opportunity to exploit item metadata and KG structure.
        </p>
        <p>
          Hence, we first propose a new architecture for inductive
recommendation using KGs. In our design, we strive
for simplicity by adopting the efficiency and expressivity
of Graph Neural Networks (GNNs) to aggregate structural
information of each node’s neighborhood, but going
beyond trivial extensions of the GraphSAGE architecture
as well as any other existing inductive method [16, 19, 20,
17, 15, 14, 21], due to a gated architecture that more
effectively extrapolates inductive biases from the semantic
and structural information encoded in the CKG. Furthermore,
by reviewing the experimental evaluation of existing
works, we have identified problematic methodologies,
on which we report here together with our results. Thus,
we propose Gated Inductive Recommendation (GInRec),
a new architecture that fully exploits the semantic
information of real-world KGs for inductive predictions
in a scalable way.
        </p>
        <sec id="sec-1-2">
          <title>2. Problem Formulation</title>
          <p>Formally, we define a KG as a directed labeled multigraph
identified by the triple 𝒢 = ⟨ℰ, ℛ, ℒ⟩, where ℰ is the set of
entities (nodes) in the graph, ℒ, with ℒ∩ℰ = ∅, is the set of
labels for the relations, and the edges between entities
are represented as ℛ ⊆ ℰ×ℒ×ℰ. Consider, for instance,
the top-right portion of the example in Figure 1. Here,
nodes represent movies, actors, directors, and a
taxonomy of genres, while edges represent how nodes are
connected, e.g., in the triple (Inception, hasGenre, Heist).</p>
          <p>As common in the literature [22], we split entities into
two sets: the set of recommendable entities ℐ ⊂ ℰ, being
the entities that the system can recommend to a user (e.g.,
movies); and the set of descriptive entities ℰ_D ⊂ ℰ (e.g.,
actors, genres, classes), such that ℰ = ℐ ∪ ℰ_D.</p>
          <p>Furthermore, we adopt the concept of a CKG [6], i.e., a
KG augmented with users’ interactions with items, also
shown in Figure 1. Formally, given the set of users 𝒰,
the interaction matrix I ∈ {0, 1}^(|𝒰|×|ℐ|) is a matrix of size
|𝒰|×|ℐ|, having I_u,i = 1 if user u ∈ 𝒰 has liked the item
i ∈ ℐ, and I_u,i = 0 if we do not have any information about
the specific pair, e.g., if the user has never interacted
with the item. Then, given the matrix I and the KG 𝒢,
the CKG 𝒢_c = ⟨ℰ_c, ℛ_c, ℒ_c⟩ is an extension of 𝒢,
having ℰ_c = ℰ ∪ 𝒰, ℒ_c = ℒ ∪ {Likes}, and
ℛ_c = ℛ ∪ {(u, Likes, i) | ∀u ∈ 𝒰, i ∈ ℐ s.t. I_u,i = 1}.</p>
          <p>Finally, every node n ∈ ℰ_c is associated with a set of
node features, assuming a function f: ℰ_c ↦ ℝ^d exists,
called the feature function, assigning to each node a
feature vector of dimension d. Typically, this vector provides
a d-dimensional encoding of the node’s contents; e.g., in
this and other works [13], the word embeddings of the
textual descriptions obtained from literal values attached
to the nodes are used. However, since we do not want to
use, or have, any user information, with the exception of
their ratings, we have ∀u ∈ 𝒰. f(u) = 0⃗. Therefore, given
an interaction matrix I, a KG 𝒢 with feature function f,
a user u, and an item i such that I_u,i = 0, we model the
recommendation problem as the problem of predicting
the likelihood of I_u,i = 1 if we present the item i to the
user u. In practice, we model our task as a top-k
recommendation problem. Thus, we aim at learning a model Θ to
parametrize a transformation function ℱ: 𝒰×ℐ ↦ [0, 1],
such that ℱ_Θ(u, i) ≥ ℱ_Θ(u, j) imposes a partial order on
ℐ for every user in 𝒰 if it is more likely that the user u
would like i over j than vice versa.</p>
          <p>Finally, in the recommendation setting, we define two
types of users: those for which preferences across some
items in ℐ were known when learning Θ, i.e., at training
time, and those for which no item rating was known
during training, but for which some rating is known at
inference time. We refer to the former as the warm-start
users 𝒰_w and to the latter as the cold-start users 𝒰_c, with
𝒰 = 𝒰_w ∪ 𝒰_c. Transductive methods can only recommend
for users in the warm-start set, while inductive methods
can recommend for users in both sets. Typically, when
a new user joins a platform, it is common practice to
present them with an initial set of items to be rated. Thus,
we consider cold-start users for which, at inference time,
we have some ratings, even though those ratings are
usually few and sparse [
          <xref ref-type="bibr" rid="ref26">18, 23, 20</xref>
          ].</p>
        </sec>
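        <p>The CKG construction above can be illustrated directly: every observed interaction I_u,i = 1 becomes a (u, Likes, i) edge added to the KG triples. The following is our own minimal Python sketch (the function name build_ckg and the data layout are hypothetical, not taken from the paper's code):</p>
        <preformat>
```python
# Our own sketch of the CKG construction from Section 2 (hypothetical
# names, not the authors' implementation): add a "Likes" edge for every
# observed user-item interaction.

def build_ckg(kg_edges, interactions):
    """kg_edges: set of (head, label, tail) triples over entities.
    interactions: dict mapping each user to the set of items with I[u, i] = 1."""
    ckg_edges = set(kg_edges)
    for user, items in interactions.items():
        for item in items:
            ckg_edges.add((user, "Likes", item))
    return ckg_edges

kg = {("Inception", "hasGenre", "Heist"), ("Inception", "hasGenre", "Sci-Fi")}
ratings = {"Alex": {"Inception"}, "Max": {"Inception", "The Prestige"}}
ckg = build_ckg(kg, ratings)
print(len(ckg))  # 2 KG triples + 3 Likes triples = 5
```
        </preformat>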
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Related Work</title>
      <p>Most existing recommendation methods either only use
bipartite graphs of user interactions with items [3, 5, 16]
or are transductive [6, 7]. Instead, inductive learning
models generate predictions for unseen nodes by directly
reasoning over the features that describe them, but
existing methods do not exploit KG data. Here, we provide an
overview of inductive methods, detailing their limitations
compared to our proposal (as summarized in Table 1) and
describing the advantages of relational gates.</p>
      <sec id="sec-2-1">
        <title>Inductiveness. GraphSAGE [13] was the first inductive</title>
        <p>GNN capable of efficiently generating embeddings for
unseen nodes by leveraging pretrained node features for
node classification. It was later expanded to a scalable
item-item recommendation method, meaning no explicit
modeling of user-item ratings [16].</p>
        <p>Other methods have been proposed for inductive matrix
completion by extracting subgraphs around each
user-item pair to obtain the necessary representations [24,
25, 19, 15]. These approaches are designed for the single
rating prediction task and not for the ranking task.
Generating these subgraphs is prohibitively space- and
time-consuming. Thus, they cannot efficiently produce
user-item rankings, since a subgraph is generated for
all user-item pairs [20]. Furthermore, these methods
do not use KG information; thus, they cannot provide
predictions for new items with no interactions. Therefore,
instead of constructing subgraphs, GInRec employs
subsampling of neighboring nodes to obtain a scalable
prediction mechanism [16, 13] and uses KGs to gain
information about items with few user interactions. Several
methods exploit user metadata, e.g., gender and age
information [
          <xref ref-type="bibr" rid="ref26 ref30">17, 18</xref>
          ]. Yet, this information is rarely available,
making it impossible to use these methods in practice.</p>
        <table-wrap id="tbl-1">
          <label>Table 1</label>
          <caption><p>Related methods, whether they use User Metadata, whether they handle Relational information (i.e., KG), the Task they support among (C) Node Classification, (R) Ranking, and (P) Rating Prediction, and whether the method constructs a Subgraph from user-item pairs.</p></caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Inductive User</th><th>Inductive Item</th><th>User Metadata</th><th>Relational</th><th>Task</th><th>Subgraph</th></tr>
            </thead>
            <tbody>
              <tr><td>NGCF [5]</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>R</td><td>✗</td></tr>
              <tr><td>KGAT [6]</td><td>✗</td><td>✗</td><td>✗</td><td>✔</td><td>R</td><td>✗</td></tr>
              <tr><td>KPRN [7]</td><td>✗</td><td>✗</td><td>✗</td><td>✔</td><td>R</td><td>✗</td></tr>
              <tr><td>KGCN-LS [8]</td><td>✗</td><td>✗</td><td>✗</td><td>✔</td><td>R</td><td>✗</td></tr>
              <tr><td>MeLU [<xref ref-type="bibr" rid="ref26">18</xref>]</td><td>✗</td><td>✗</td><td>✔</td><td>✗</td><td>R</td><td>✗</td></tr>
              <tr><td>RuleRec [28]</td><td>✗</td><td>✗</td><td>✗</td><td>✔</td><td>R</td><td>✗</td></tr>
              <tr><td>LGCN [3]</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>R</td><td>✗</td></tr>
              <tr><td>MGAT [9]</td><td>✗</td><td>✗</td><td>✗</td><td>✔</td><td>R</td><td>✗</td></tr>
              <tr><td>GraphSAGE [13]</td><td>(✔)</td><td>✔</td><td>✗</td><td>✗</td><td>C</td><td>✗</td></tr>
              <tr><td>PinSAGE [16]</td><td>(✔)</td><td>✔</td><td>✗</td><td>✗</td><td>R</td><td>✗</td></tr>
              <tr><td>BERT4Rec [26]</td><td>(✔)</td><td>✗</td><td>✗</td><td>✗</td><td>R</td><td>✗</td></tr>
              <tr><td>IGMC [19]</td><td>✔</td><td>✔</td><td>✗</td><td>✗</td><td>P</td><td>✔</td></tr>
              <tr><td>IDCF [20]</td><td>✔</td><td>✗</td><td>✗</td><td>✗</td><td>P</td><td>✗</td></tr>
              <tr><td>ICP [14]</td><td>✗</td><td>✔</td><td>✗</td><td>✗</td><td>R</td><td>✗</td></tr>
              <tr><td>PGD [17]</td><td>✔</td><td>✔</td><td>✔</td><td>✗</td><td>R</td><td>✗</td></tr>
              <tr><td>GIMC [15]</td><td>✔</td><td>✔</td><td>✗</td><td>✗</td><td>P</td><td>✔</td></tr>
              <tr><td>ReBKC [27]</td><td>✔</td><td>✗</td><td>✗</td><td>✔</td><td>P</td><td>✗</td></tr>
              <tr><td>GInRec</td><td>✔</td><td>✔</td><td>✗</td><td>✔</td><td>R</td><td>✗</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Hence, in our method, we assume no user metadata,
learning instead how to aggregate information. Some
methods are made for sequential recommendations [26],
or cannot recommend for new users or for new items [14,
27], and in general cannot capture high-order connectivities
between users, making them less relevant for our study.</p>
        <p>Additionally, some methods are quasi-inductive, since
they consist of two parts: (1) a transductive part to obtain
some initial embeddings and (2) an inductive part where
the method learns to generate embeddings for new users
or items [21, 20]. GInRec is fully inductive since it uses
the extracted node textual features instead and thus does
not need to learn the initial embeddings. Finally, many
methods target the prediction of a user rating, which
underperforms in the ranking task, even compared to
non-personalized methods [
          <xref ref-type="bibr" rid="ref22">2</xref>
          ]. Thus, existing works (Table 1) either: (i) create
subgraphs, which do not scale in the ranking task;
(ii) use personal user data, which is almost never
available; or (iii) solve a rating-prediction task that offers
sub-optimal performance in practice. Therefore, we select
GraphSAGE [13] and PinSAGE [16] as the only inductive
recommenders fitting our recommendation setting and
select IDCF [20] as a representative baseline for
quasi-inductive methods.</p>
        <p>Gates. Gates were originally used in Recurrent Neural
Networks (RNNs) to learn long-term dependencies
in time series [29, 30]. A gate limits the amount of
information passed by learning a scalar in [0, 1] for each
dimension in a vector. On the contrary, the attention
mechanism, which is often used in GNN aggregators,
learns a single scalar for the entire vector [6, 8, 16]. Hence,
the gates allow for differentiation at the dimension level
rather than the vector level. For KGs, gates have been
used to capture long-term path relations [7] and for
aggregating neighbors in multi-modal graphs [9]. Multi-modal
information and relations in KGs differ in both semantic
meaning and practical application, requiring different
aggregation techniques. Therefore, GInRec adopts new
relation-specific gates as an addition to the neighborhood
aggregation.</p>
        <p>Thus, GInRec proposes a scalable inductive method for
user-personalized recommendation that learns to extract
knowledge from a KG using relational gates without
requiring any user metadata.</p>
        <sec id="sec-2-1-2">
          <title>4. Methodology</title>
          <p>We now present our model, Gated Inductive
Recommendation (GInRec). The model consists of three components
(as shown in Figure 2): (i) the embedding layer, where we
compress node feature information to create node
embeddings (Figure 2 A); (ii) the gated propagation layer, which
chooses which information to propagate from the
embeddings of neighboring nodes in a CKG to produce a
high-order representation of each node and its neighbors
(Figure 2 C and E); and (iii) the prediction layer, creating a
user and an item embedding given all propagation layers
and outputting a ranking score (Figure 2 D). Hence, our
architecture learns to recommend for users based on their
interactions alone, introducing a gating mechanism to
adaptively select information during aggregation along
with an autoencoder regularization measure.</p>
        </sec>
        <sec id="sec-2-1-1">
          <title>4.1. Embedding Layer</title>
          <p>GInRec is designed as an inductive relational graph
neural network. Therefore, given a target node v ∈ ℰ_c, we
need an initial feature vector describing its content, e.g.,
the movie plot or the biography of the actor. Similar to
seminal works [31], we use Sentence-BERT [32] to process
the textual description of each entity and produce
sentence embeddings such that sentences of similar semantic
meaning are close to each other in a vector space. For
multi-sentence descriptions, the average sentence
embedding is used. When textual descriptions for descriptive
entities are not available, we use ComplEx [33] to train
entity embeddings for the descriptive entities in the KG,
since these are static or very slowly changing. The initial
vector is a concatenation of the textual embedding and
the structural data, e.g., node degree. We standardize
the features by removing the mean and scaling to unit
variance as in other works [13]. Our approach can be
extended to include additional features, such as item
pictures for multi-modal descriptions, but we leave their
study as future work. Thus, we define the initial feature
matrix as X ∈ ℝ^(|ℰ′|×d), where the i’th entity has its
embedding in the i’th row and users are initialized as zero
vectors. As such, we only require a few interactions
to represent users and textual descriptions for items.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>The size of the initial feature vectors is usually large</title>
        <p>(in our model, we have d &gt; 756), making subsequent
computations infeasible. Therefore, we introduce an
AutoEncoder (AE) layer to reduce the dimensionality [34]
(shown in Figure 2 A). The loss of the AE is defined as:
L_AE = MSE(X, AE_de(AE_en(X))),   (1)
where AE_en: ℝ^(|ℰ′|×d) ↦ ℝ^(|ℰ′|×d′), with d′ ≪ d, is the
encoding function mapping the initial feature vector of each
node to a lower-dimensionality vector through
multiple fully connected layers with the Leaky ReLU
activation [35]. Analogously, AE_de: ℝ^(|ℰ′|×d′) ↦ ℝ^(|ℰ′|×d) is a
decoding function mapping the lower-dimension
embeddings back to the original vectors. Therefore, we produce
a matrix X′ = AE_en(X) of low-dimensionality embeddings
for the initial nodes in ℰ′. Moreover, in our
architecture, the AE is jointly learned with the final ranking loss (as
described later in subsection 4.4). Thus, X′ provides a
fine-tuned, compressed representation of the extracted
features. Since the initial embedding has no range limits,
no activation is used for the final decode layer.</p>
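        <p>As a minimal single-layer sketch of the autoencoder described above (our own numpy illustration with assumed sizes; the paper's implementation uses multiple fully connected layers trained jointly with the ranking loss):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

d, d_prime, n = 768, 64, 10            # assumed sizes; d' is much smaller than d
W_en = rng.normal(0.0, 0.1, (d, d_prime))
W_de = rng.normal(0.0, 0.1, (d_prime, d))

def ae_encode(X):                       # AE_en: encode to the lower dimensionality
    return leaky_relu(X @ W_en)

def ae_decode(X_prime):                 # AE_de: no activation on the final layer
    return X_prime @ W_de

X = rng.normal(size=(n, d))             # extracted node features
X_prime = ae_encode(X)
L_ae = np.mean((X - ae_decode(X_prime)) ** 2)   # MSE reconstruction loss
print(X_prime.shape)  # (10, 64)
```
        </preformat>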
        <sec id="sec-2-2-1">
          <title>4.2. Gated Propagation Layer</title>
          <p>The core of GNNs is the ability to aggregate information
from a node's neighborhood. Relation types could allow the
model to differentiate between the relation interactions
and aggregate information dependent upon the
combinations of edges in the CKG. Thus, we explore the effect of
gates in our GNN’s architecture and extend these with
relation-specific weights [9, 36]. In the following, we
first describe the individual parts for a single step, i.e.,
relation-specific gating, information propagation, and
aggregation, and then how to generalize the process to
high-order propagations.</p>
          <p>Relation-specific gates: We design two relation-specific
gates that control the information flow during message
passing: (i) Inner Product and (ii) Concatenation. The
Inner Product gate uses the inner product of the h and
t entities as the gate, making the gate dependent on
the affinity between the two. We take into account
different relations (similar to TransR [37]) by first
transforming the entities’ embeddings into a relation-specific
vector space before finding the affinities:
g_IP(h, r, t) = σ((W_r e_h)^⊤ W_r e_t),
where σ is the sigmoid activation function, W_r is a
relation-specific transformation matrix, and
g_IP(h, r, t) ∈ [0, 1]. The Concatenation gate works as the
original reset and update gate mechanisms used by GRU [30].</p>
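          <p>The two gates can be sketched as follows (our own numpy illustration of the equations above; W_r denotes the relation-specific matrices and σ the sigmoid):</p>
          <preformat>
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 8
W_r = rng.normal(0.0, 0.1, (d, d))          # relation-specific transform
W_r_cat = rng.normal(0.0, 0.1, (d, 2 * d))  # relation-specific concat transform

def gate_inner_product(e_h, e_t):
    # sigma((W_r e_h)^T (W_r e_t)): affinity in a relation-specific space
    return sigmoid((W_r @ e_h) @ (W_r @ e_t))

def gate_concatenation(e_h, e_t):
    # sigma(W_r [e_h || e_t]): a per-dimension, GRU-style gate
    return sigmoid(W_r_cat @ np.concatenate([e_h, e_t]))

e_h, e_t = rng.normal(size=d), rng.normal(size=d)
g_ip = gate_inner_product(e_h, e_t)   # scalar gate in (0, 1)
g_c = gate_concatenation(e_h, e_t)    # vector gate in (0, 1)^d
```
          </preformat>
          <p>The concatenation gate outputs one value per dimension, which is what enables the dimension-level differentiation discussed in Section 3.</p>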
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Concatenation gate</title>
        <p>Here, we utilize a relation-specific linear
transformation, which learns which parts of the tail entity’s
embedding are important in the aggregation step, given
both the head and the tail, with ‖ being concatenation,
as: g_C(h, r, t) = σ(W_r (e_h ‖ e_t)).</p>
      </sec>
      <sec id="sec-2-4">
        <title>Information Propagation</title>
        <p>Given the direct neighbors of entity h as
𝒩_h = {(h, r, t) | (h, r, t) ∈ ℛ_c}, also called its
ego-network [38], we can define the neighborhood
aggregation vector of h as:
e_𝒩h = (1/|𝒩_h|) ∑_((h,r,t)∈𝒩_h) g(h, r, t) e_t.
In contrast to other gated networks [36, 9], our model’s
gates are relation-specific, allowing it to propagate
different information from different parts of an entity’s
embedding based on the relation to it. This fine-grained
information propagation is vital for GInRec’s performance,
especially when not relying on the initial user features.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Aggregation</title>
        <p>The final part combines an entity’s current
embedding e_h with the aggregated ego embedding e_𝒩h,
formally defined as e′_h = f_a(e_h, e_𝒩h), where f_a is an
aggregator function. We identify four common aggregators
used in other architectures, namely: the Bi-interaction
aggregator [6], the GCN aggregator [39], the GraphSAGE
aggregator [13], and the LightGCN aggregator [3], finding the
LightGCN aggregator to be the best performing through
hyperparameter tuning. The LightGCN aggregator can
be defined as f_a(e_h, e_𝒩h) = e_𝒩h, not having any
transformations or non-linear activations.</p>
      </sec>
      <sec id="sec-2-6">
        <title>High-order propagation</title>
        <p>To propagate information from n-hop neighbors and
utilize high-order connectivity information, we stack the
model in layers [6, 9, 13]. As illustrated by the arrow from
A to B in Figure 2, we use the output of the embedding
layer X′ ∈ ℝ^(|ℰ|×d′) as the initial embedding in the
propagation layers. We thus define the next representation
layer l+1, recursively using the previous layer l and the
neighborhood representation, as:
e_h^(l+1) = f_a(e_h^(l), e_𝒩h^(l)).
The weight matrices in the gates are
W^(l) ∈ ℝ^(d_(l−1)×d_(l−1)) or W^(l) ∈ ℝ^(d_(l−1)×2d_(l−1)),
depending on whether they refer to the Inner Product or
the Concatenation gate, respectively.</p>
      </sec>
      <sec id="sec-2-7">
        <title>4.3. Prediction</title>
        <p>At each layer, information from increasingly distant
entities is aggregated, and we, therefore, have multiple
representations of the entities after L layers of propagation
as they are passed to the prediction step. Similar to previous
approaches [39, 6, 9], we concatenate the output after
each layer for a user u and an item i as:
e*_u = e_u^(1) ‖ … ‖ e_u^(L),   e*_i = e_i^(1) ‖ … ‖ e_i^(L).
This approach is able to retain the information of the
different representations at all steps. For the final
prediction, learned, non-linear similarity measures are usually
outperformed by a simple inner product [40] that also
reduces complexity. Hence, our prediction is computed
as follows: ŷ_u,i = e*_u^⊤ e*_i.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4.4. Optimization</title>
        <p>We use Bayesian Personalized Ranking (BPR) as the
collaborative loss function, assuming previously interacted
items should be ranked higher than others [9, 6], as:
L_BPR = ∑_((u,i,j)∈ℬ) −ln σ(ŷ_u,i − ŷ_u,j),   (5)
where ℬ = {(u, i, j) | I_u,i = 1 and I_u,j ≠ 1} is a set
of training triples with item i being rated higher than
item j, and σ is the sigmoid function. The final loss
function is a combination of the autoencoder loss in Equation 1
and the BPR loss in Equation 5, so as to learn an encoded
embedding suitable for recommendation while
containing enough information to reconstruct the features,
computed as:
L = L_BPR + λ_AE L_AE + λ ‖Θ‖₂²,
where Θ = {W^(l)_IP, W^(l)_C | ∀l ∈ {1, ..., L}} ∪
{W^(l′)_en, W^(l′)_de | ∀l′ ∈ {1, ..., L_AE}} is the set of learnable
parameters, λ is a parameter for tuning the L2 regularization,
and λ_AE is a parameter to tune the autoencoder loss. The
autoencoder loss also works as a regularizer while also
being recommender-specific. Generating embeddings for
all nodes, with MovieLens Subsampled (ML-S) shown in
Table 2, took 0.48s, and ranking items for all users took
0.078s, compared to PinSAGE’s 0.446s and 0.966s, with an
RTX 2070 Super and an Intel i9-9900, averaged over 5 runs.</p>
      </sec>
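      <p>The prediction and the BPR objective above can be sketched for a single (u, i, j) triple as follows (our own numpy illustration with toy embeddings, not the authors' training code):</p>
      <preformat>
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
L, d = 3, 4
# Per-layer representations are concatenated: e* = e^(1) || ... || e^(L)
e_user = rng.normal(size=L * d)
e_pos = rng.normal(size=L * d)    # item i the user interacted with
e_neg = rng.normal(size=L * d)    # item j without an interaction

y_pos = e_user @ e_pos            # prediction: a simple inner product
y_neg = e_user @ e_neg

# BPR term: -ln sigma(y_ui - y_uj); always positive, and it shrinks
# as the positive item is scored further above the negative one.
l_bpr = -np.log(sigmoid(y_pos - y_neg))
```
      </preformat>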
      <sec id="sec-2-9">
        <title>Training and Scalability</title>
        <p>Training: We use mini-batch training sampled from ℬ
with batch size 1024, limiting the computation graph by
using a fixed-size ego-network of 10 and starting
construction from the last layer [13]. The entities used in the
first layer of the gated propagation are used for the
autoencoder loss, such that we learn to represent not only
users and items but also entities like genres and actors.</p>
        <p>Scalability: In our embedding approach, both the
calculation of the aggregation (e_h^(l+1)) and of the prediction
(ŷ) are bounded by the number of nodes in the graph,
while the calculation of the ego-network (e_𝒩h^(l+1)) is
bounded by the number of edges. As these steps are applied
sequentially and the number of nodes is much smaller
than the number of edges, we know that the complexity of
our method is bounded by the ego-network aggregation
complexity, more specifically, the linear transformation
of the gate calculation. When naïvely applying the gates
over all edges, the complexity is 𝒪(|ℛ_c| d²), where d is the
largest dimension utilized during the graph convolutions
– we note that |ℛ_c| is bounded by 𝒪(|ℰ_c|²|ℒ_c|). Yet, as
W(e_h ‖ e_t) is equivalent to W₁e_h + W₂e_t, we only need to
compute the transformation for each unique (h, r) and
(r, t) pair instead of each unique (h, r, t) triple. Therefore,
we can apply a MapReduce computation [16] to have
at most 2|ℰ_c||ℒ_c| calculations, leading to the complexity
𝒪(|ℰ_c||ℒ_c| d²) ≪ 𝒪(|ℛ_c| d²). Finally, our prediction is a dot
product after the graph convolutions; hence, our method can
predict in 𝒪(|e*_u| ⋅ |ℐ|) for a single user, as the vector dot
product complexity is 𝒪(|e*_u|), which we compute |ℐ|
times, which is less than existing architectures with
comparable approaches, e.g., PinSAGE.</p>
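        <p>The factorization used in the scalability argument, W(e_h ‖ e_t) = W₁e_h + W₂e_t, can be checked numerically (our own numpy sketch):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
W = rng.normal(size=(d, 2 * d))
W1, W2 = W[:, :d], W[:, d:]       # split the concatenation transform

e_h, e_t = rng.normal(size=d), rng.normal(size=d)
full = W @ np.concatenate([e_h, e_t])   # one transform per (h, r, t) triple
split = W1 @ e_h + W2 @ e_t             # reusable per (h, r) and (r, t) pair
print(np.allclose(full, split))  # True
```
        </preformat>
        <p>Precomputing W₁e_h and W₂e_t once per unique pair and summing them per edge is what reduces the number of matrix-vector transformations from one per triple to at most two per entity-relation pair.</p>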
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <p>Inductive approaches are designed to provide
recommendations in a cold-start setting, where ratings for new
users are only known at inference time. Yet, as we will
show, these baselines do not perform well in this setting due
to poor selection of learning metrics, evaluation
methodologies, or other complexities. In the following, we aim at
answering the questions: RQ1) Which design decisions
affect the prediction performance compared to the state of
the art? RQ2) What is the effect of the negative sampling
strategy in the evaluation? RQ3) How do relational
gates affect performance? And finally, RQ4) How do the
structure and data of the KG affect performance?</p>
      <p>Datasets. We adopt two real-world datasets: (i) MovieLens-20m
(ML-20m) [41], a dataset with ratings on movies, and
(ii) Amazon-Book (2014) (AB) [42], a dataset with reviews
for books. Neither dataset has an associated KG. We
therefore use the MindReader KG [22] for the ML-20m dataset,
and for the AB dataset the KG constructed to evaluate
KGAT [6]. These two graphs link reviewed items to
nodes in popular open-domain KGs such as DBpedia and
WikiData. In both cases, we keep only items mapped to
the KG, leading to the statistics shown in Table 2. We
adopt splitting ratios 0.8:0.1:0.1 for train, validation, and
test sets, respectively. We note that different versions
of the AB dataset exist, and results cannot necessarily
be directly compared between related works and our
dataset [20, 6, 3, 5, 43]. In our cold-start experiments, as
defined in Section 4, we sampled 12,500 users from
ML-20m and 60,000 users from AB for training, named ML-S
and Amazon-Book Subsampled (AB-S), respectively. We
sample more users for AB-S due to few ratings per user.
We then created two cold-start scenarios on ML-S: one
adding 10% new users (i.e., 1,250); and one where we treat
all users not in ML-S as cold-start users, being ∼90% of
the users in the original ML-20m dataset, allowing us to
test the scalability of the inductive methods. For AB-S,
we create one scenario adding the remaining users from
the original dataset, corresponding to an additional ∼15%
of the total number of users.</p>
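      <p>The 0.8:0.1:0.1 splitting procedure above can be sketched as follows (our own illustration; the actual sampling scripts ship with the paper's source code):</p>
      <preformat>
```python
import random

def split_ratings(ratings, seed=42):
    """Shuffle and split a list of ratings 0.8:0.1:0.1 into train/val/test."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_ratings(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```
      </preformat>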
      <p>
        Methods. We compare to five methods: TopPop [
        <xref ref-type="bibr" rid="ref22">2</xref>
        ],
a non-personalized common baseline [10] that
recommends the most popular items; GraphSAGE [13],
modified to recommend using cosine similarity between a
user’s rated items and new items; PinSAGE [16], with a
semi-supervised objective, i.e., items co-rated should be
similar, analogous to the pin/board setting; IDCF [20], a
two-step learning method, using key user embeddings to
initialize new users; and BPR-MF [4], which we report as
a reference transductive method retrained on the dataset
including also the cold-start users, since it is fast to train,
is competitive with state-of-the-art methods without
requiring sequential data, and has been shown to outperform
the standard kNN method.
      </p>
      <p>All models are implemented in PyTorch and optimized
using the Adam optimizer. We save the best-performing
state based on the validation set and stop after 50
successive epochs without improvement. For hyperparameter
tuning of all models, we employ Asynchronous
Successive Halving (ASHA) [44]; all hyperparameter options
and ASHA parameters are available in our source code. We
note that, compared to PinSAGE, GInRec has only one
extra hyperparameter, in the form of λ_AE, which we
tune as described in subsection 4.4.</p>
      <sec id="sec-3-1">
        <title>Evaluation metrics. Following other evaluation meth</title>
        <p>
          ods [6], for each user in the test set, we rank all items not
interacted with in the train and validation sets, only
treating ratings in the test set as positive items. We measure
Datasets. We adopt two real-world datasets: (i) MovieLens- NDCG, recall, precision, and coverage at 20 for each user,
20m (ML-20m) [41], a dataset with ratings on movies and reporting the average performance over all users. Let ℐ
(ii) Amazon-Book (2014) (AB) [42], a dataset with reviews to be the top-k items recommended to a user  , then we
foonrebouoskest.hNeeMitihnedrRdeaatdaesretKhGas[2a2n]afsosrotchiaeteMdLK-2G0.mWdeatthaesreet-, cHaenndceefin,ae ncoaviveerargeecoams: me@nd=er as T|⋃op∈Popℐis|/ex|ℐpe|.cted to
and for the AB dataset the KG constructed to evaluate perform poorly due to recommending the same set of
KGAT [6]. These two graphs link reviewed items to items each time [45]. We further include I-NDCG [20]
nodes in popular open-domain KGs such as DBpedia and which is the metric used to evaluate IDCF in the original
WikiData. In both cases, we keep only items mapped to work, where X negative items (X=5 in IDCF) are sampled
the KG leading to the statistics shown in Table 2. We per positive item in the test set instead of all possible
adopt splitting ratios 0.8∶0.1∶0.1 for train, validation, and negative items as for NDCG. For all metrics, we remove
test sets, respectively. We note that diferent versions items seen by the user during training from the set of
of the AB dataset exist, and results cannot necessarily negative samples. We use ‘*’ to represent a statistically
be directly compared between related works and our significant increase in performance using student t-test.
dataset [20, 6, 3, 5, 43]. In our cold-start experiments, as
</p>
        <p>RQ1. As Table 3 shows, GInRec is able to outperform all methods on all metrics with statistical significance. We also see contrasting results w.r.t. the original IDCF evaluation. IDCF was originally evaluated in a ranking setting; however, the learned embeddings of IDCF are learned towards Cross-Entropy, a non-ranking, pointwise learning objective. Such learning methodologies have been shown to perform poorly, with a similar or worse ranking than TopPop [<xref ref-type="bibr" rid="ref22">2, 46</xref>]. Yet, IDCF uses the Cross-Entropy loss [47]. In the original work, IDCF outperforms PinSAGE by a small margin, yet we observe the opposite in our evaluation. PinSAGE’s increased performance in our evaluation is due to (i) a more appropriate early stopping based on the evaluation metric instead of the loss function, and (ii) our evaluation adopting a better learning objective for PinSAGE.</p>
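        <p>To make the contrast between the two objectives concrete, here is a minimal sketch (hypothetical scores; this is not IDCF’s actual implementation) of a pointwise cross-entropy loss versus a pairwise ranking loss in the style of BPR:

```python
import math

def pointwise_ce(score_pos, score_neg):
    # Cross-entropy: each user-item pair is an independent binary
    # classification, so absolute score values matter.
    p_pos = 1 / (1 + math.exp(-score_pos))
    p_neg = 1 / (1 + math.exp(-score_neg))
    return -(math.log(p_pos) + math.log(1 - p_neg))

def pairwise_bpr(score_pos, score_neg):
    # Pairwise ranking loss: only the score *difference* matters,
    # i.e. the relative order of positive vs. negative.
    return -math.log(1 / (1 + math.exp(-(score_pos - score_neg))))

# Shifting both scores by a constant changes the pointwise loss
# but leaves the pairwise (ranking) loss untouched.
print(pairwise_bpr(2.0, 1.0) == pairwise_bpr(7.0, 6.0))  # True
print(pointwise_ce(2.0, 1.0) == pointwise_ce(7.0, 6.0))  # False
```

Only the pairwise loss is invariant to rescaling the score range, which is one reason pointwise-trained scores need not translate into good rankings.</p>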
        <p>Figure 3 also shows GInRec outperforms all models in different user splits; we leave out the result on AB-S for brevity, noting we get similar results. We demonstrate that using KG information and relational gates provides superior predictive power in all cases, given the improved performance over all popularity and sparsity groups.</p>
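        <p>The relational gating mentioned above can be pictured with a small sketch (illustrative only; the weights, dimensions, and function names are hypothetical, and this is not the exact GInRec architecture): each relation type owns a gate that decides, per embedding dimension, how much of a neighbor’s embedding may pass during aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_relations = 4, 2

# One (hypothetical) gate weight matrix per relation type.
gate_w = {r: rng.normal(size=(dim, dim)) for r in range(n_relations)}

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gated_aggregate(neighbors):
    # neighbors: list of (relation_type, embedding) pairs.
    # The gate emits a value in (0, 1) per dimension, so the flow
    # from each neighbor can be throttled dimension-wise.
    out = np.zeros(dim)
    for rel, emb in neighbors:
        gate = sigmoid(gate_w[rel] @ emb)
        out += gate * emb          # element-wise: selects parts of emb
    return out / len(neighbors)

neighbors = [(0, rng.normal(size=dim)), (1, rng.normal(size=dim))]
print(gated_aggregate(neighbors).shape)  # (4,)
```

A scalar (inner-product-style) gate would multiply each neighbor embedding by a single number instead, which can throttle a neighbor but cannot select individual dimensions.</p>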
        <p>To study the method’s ability to make personalized recommendations, we utilize coverage [45]. We note that we are not able to perform statistical significance testing with coverage as we only generate a single score per dataset instead of one score per user. The metric is not useful by itself; a random model would have close to 1 in coverage. Having higher coverage but a far lower ranking indicates more random recommendations, while having low coverage but a high ranking score means low personalization and a popularity-biased dataset. In Table 4, we can see a clear improvement over all other methods. Only GraphSAGE achieves comparably high coverage; yet, it is unable to make high-quality recommendations, so its performance is more random. IDCF performs poorly on all datasets w.r.t. coverage; this is correlated with why it performs better on I-NDCG compared to NDCG, as we will discuss later. We also test with the Gini Coefficient on the ML-S+1250 dataset, where 0 would be an equal (uniform) distribution and 1 is unequal. Here GInRec gets 0.959, BPR-MF 0.989, PinSAGE 0.991, TopPop 0.993, and random 0.241. Thus, using this metric, GInRec gets a ≥3% more diverse distribution than PinSAGE and BPR-MF. While BPR-MF performs better than TopPop on both the ML-S dataset with 1250 new users and the AB-S dataset, its performance decreases when adding a large number of users to the ML-S dataset. We find our method to have a similar increase in performance over all k’s in the set {1, 5, 10, 20, 50}, but these results are not included here due to space constraints.</p>
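        <p>The Gini Coefficient used above can be computed from the number of times each item is recommended; a small sketch with hypothetical exposure counts:

```python
def gini(counts):
    # Gini coefficient of a non-negative distribution:
    # 0 = perfectly uniform exposure, values near 1 = highly unequal.
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    # Standard formula via the rank-weighted sum of the sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

print(gini([5, 5, 5, 5]))             # 0.0  (every item shown equally often)
print(round(gini([0, 0, 0, 20]), 3))  # 0.75 (one item gets all exposure)
```
</p>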
        <p>Table 4: Coverage at 20 for all datasets.
                   ML-S + 1250   ML-S + 90%   AB-S + 10%
GraphSAGE          0.06307       0.31471      0.14243
PinSAGE            0.02857       0.04169      0.12439
IDCF               0.01566       0.02201      0.04348
GInRec             0.18540       0.42646      0.21460
(The TopPop and BPR-MF rows are not cleanly recoverable.)</p>
        <p>RQ2. When evaluating ranking performance, NDCG is the metric commonly adopted, but there are two alternatives on which set of items to rank: either rank all items in the dataset or just a subset. In other evaluations [<xref ref-type="bibr" rid="ref22">20, 1, 2</xref>], instead of ranking all items, only a few negative items are randomly sampled per each positive. While this would aim at making it equally hard to rank positive items for all test users, it has been proven to produce unreliable comparisons of performances across methods [47]. This has also been witnessed when other works re-evaluated BERT4Rec [26], which also utilized negative subsampling, though using popularity-biased sampling instead of uniform sampling. Also in this version, negative sampling leads to unreliable results [48], finding even simple baselines outperforming this state-of-the-art method. Yet, in the original IDCF evaluation, this faulty method is adopted (here labeled I-NDCG). Thus, hard-to-rank items are often missing from the evaluation when subsampling negative items, and thus I-NDCG does not test the actual performance of the method as if all possible negative items were available. This presents an issue when considering the experimental evaluation of previous works. Therefore, here we once more compare the two evaluation techniques: (a) ranking all items, and (b) subsampling negative items, and verify once more that the latter methodology should be avoided since it produces biased results. In Table 3, we see GraphSAGE outperforms IDCF on I-NDCG in ML-S+1250, though clearly performing worse in the appropriate NDCG@20. Hence, this negative subsampling (I-NDCG) unfairly favors TopPop even above IDCF, and IDCF over other methods. Instead, when appropriately considering all items (NDCG) as recommended [47], GInRec and other methods perform up to 3x better than TopPop.
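</p>
        <p>The difference between the two protocols can be sketched as follows (toy scores; the helper name is hypothetical). Because the sampled candidate set is a subset of all items, the positive’s rank can only improve, so sampled scores are optimistic:

```python
import math, random

def ndcg_single_pos(scores, pos, candidates):
    # Rank the positive among the given candidates; the DCG of a single
    # relevant item is 1 / log2(rank + 1), and the ideal DCG is 1.
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    rank = ranked.index(pos) + 1
    return 1 / math.log2(rank + 1)

random.seed(0)
scores = {i: i / 100 for i in range(100)}  # item 99 scores highest
pos = 50                                   # the positive ranks 50th overall

full = ndcg_single_pos(scores, pos, list(range(100)))
sampled_cands = [pos] + random.sample([i for i in range(100) if i != pos], 5)
sampled = ndcg_single_pos(scores, pos, sampled_cands)
print(sampled >= full)  # True: subsampling never hurts the positive's rank
```

Hard negatives (the 49 items scoring above the positive here) are mostly absent from the sample, which is exactly the bias discussed above.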
</p>
        <p>RQ3 &amp; RQ4. The method’s results with different gating mechanisms can be seen in Table 5. In the table, ‘w/o relation’ is the gating mechanism without relation type, i.e., effectively ignoring edge types, and ‘w/o gates’ is the method without the gating mechanism. Overall, the gating mechanism improves performance, as it adaptively selects information from neighboring nodes, and it outperforms the two other models in all metrics. Disregarding relation types leads to worse performance on all datasets, and completely removing the gates leads to dramatically lower performance. Thus, it is vital to design models that can exploit the semantic information modeled by KGs. The Inner Product gate scales its neighbors’ embeddings instead of selecting different parts as the Concatenate gate does, and thus limits the flow from certain nodes. Yet, it is not able to select which part of the neighbor’s embeddings to propagate, thereby achieving a similar performance to the ‘w/o gates’ method. Having the ability to limit flow for each dimension of the neighbor’s embedding is therefore crucial for our method. GInRec without gates is worse than PinSAGE, though still better than IDCF. Hence, only using the user’s interactions without a gating mechanism is still better than the reconstruction used in IDCF. Even without relation types, we see a large and statistically significant increase in performance. When looking at Figure 3, we see in all cases that GInRec performs better than the bipartite version. In the first bin of all plots (i.e., the bins with fewer or less popular ratings), we see a large performance increase when using the semantic information carried by the KG, both over the bipartite model and over related works. Summarized, our gated aggregators can exploit the relational information, as either removing the KG or the relational information leads to a decrease in performance. When adding many users, we even see a large decrease in performance for the bipartite method, illustrating the scalability of our method and gated aggregation.</p>
      </sec>
      <sec>
        <title>6. Conclusion and future work</title>
        <p>In this work, we devise a scalable gated GNN architecture to perform inductive recommendation, with the ability to utilize high-order connectivities in CKGs. We show that our method outperforms existing approaches. Further, we showcase methodological limitations in previous evaluations. We conclude that this kind of architecture deserves further study, especially given its ability to: (1) scale to large graphs and large numbers of users (easily extensible to distributed frameworks), and (2) maintain good prediction with new users and items despite its lightweight inference methodology.</p>
      </sec>
      <sec>
        <title>Acknowledgments</title>
        <p>Matteo Lissandrini is supported by the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement no. 838216. Katja Hose and Theis Jendal are supported by the Poul Due Jensen Foundation and the Independent Research Fund Denmark (DFF) under grant agreement no. DFF-8048-00051B.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[16] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: SIGKDD'18, 2018.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[1] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: TheWebConf'17, 2017.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[17] S. Wang, K. Zhang, L. Wu, H. Ma, R. Hong, M. Wang, Privileged graph distillation for cold start recommendation, in: SIGIR'21, 2021.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[2] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n recommendation tasks, in: RecSys'10, 2010.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[18] H. Lee, J. Im, S. Jang, H. Cho, S. Chung, MeLU: Meta-learned user preference estimator for cold-start recommendation, in: SIGKDD'19, 2019.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[19] M. Zhang, Y. Chen, Inductive matrix completion based on graph neural networks, in: ICLR'19, 2019.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[3] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: SIGIR'20, 2020.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[20] Q. Wu, H. Zhang, X. Gao, J. Yan, H. Zha, Towards open-world recommendation: An inductive model-based collaborative filtering approach, in: ICML'21, 2021.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[4] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: UAI'09, 2009.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[5] X. Wang, X. He, M. Wang, F. Feng, T.-S. Chua, Neural graph collaborative filtering, in: SIGIR'19, 2019.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[21] Y. Wu, Q. Cao, H. Shen, S. Tao, X. Cheng, INMO: A model-agnostic and scalable module for inductive collaborative filtering, in: SIGIR'22, 2022.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[6] X. Wang, X. He, Y. Cao, M. Liu, T.-S. Chua, KGAT: Knowledge graph attention network for recommendation, in: SIGKDD'19, 2019.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[22] A. H. Brams, A. L. Jakobsen, T. E. Jendal, M. Lissandrini, P. Dolog, K. Hose, MindReader: Recommendation over knowledge graph entities with explicit user ratings, in: CIKM'20, 2020.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[7] X. Wang, D. Wang, C. Xu, X. He, Y. Cao, T.-S. Chua, Explainable reasoning over knowledge graphs for recommendation, in: AAAI'19, 2019.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[23] K. Zhou, S.-H. Yang, H. Zha, Functional matrix factorizations for cold-start recommendation, in: SIGIR'11, 2011.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[8] H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, Z. Wang, Knowledge-aware graph neural networks with label smoothness regularization for recommender systems, in: SIGKDD'19, 2019.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[24] M. Xu, R. Jin, Z.-H. Zhou, Speedup matrix completion with side information: Application to multi-label learning, in: NeurIPS'13, 2013.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[9] Z. Tao, Y. Wei, X. Wang, X. He, X. Huang, T.-S. Chua, MGAT: Multimodal graph attention network for recommendation, Information Processing &amp; Management (2020).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[25] P. Jain, I. S. Dhillon, Provable inductive matrix completion, arXiv preprint arXiv:1306.0626 (2013).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[26] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: CIKM'19, 2019.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[10] E. Palumbo, D. Monti, G. Rizzo, R. Troncy, E. Baralis, entity2rec: Property-specific knowledge graph embeddings for item recommendation, Expert Systems with Applications (2020).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[27] B. Hui, L. Zhang, X. Zhou, X. Wen, Y. Nian, Personalized recommendation system based on knowledge … (2022).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[11] H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, M. Guo, RippleNet: Propagating user preferences on the knowledge graph for recommender systems, in: CIKM'18, 2018.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[28] W. Ma, M. Zhang, Y. Cao, W. Jin, C. Wang, Y. Liu, S. Ma, X. Ren, Jointly learning explainable rules for recommendation with knowledge graph, in: TheWebConf'19, 2019.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[12] Z. Yang, S. Dong, HAGERec: Hierarchical attention graph convolutional network incorporating knowledge graph for explainable recommendation, Knowledge-Based Systems (2020).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[29] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation (1997).</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[13] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: NeurIPS'17, 2017.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[30] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in: EMNLP'14, 2014.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[14] C. Zhang, H. Yao, L. Yu, C. Huang, D. Song, H. Chen, Inductive contextual relation learning for personalization, ACM Transactions on Information Systems (2021).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[31] S. Liu, I. Ounis, C. Macdonald, Z. Meng, A heterogeneous graph neural model for cold-start recommendation, in: SIGIR'20, 2020.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Yao</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Song</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Chen</surname></string-name>,
          <article-title>Inductive contextual relation learning for personalization</article-title>,
          <source>ACM Transactions on Information Systems</source>
          (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Ounis</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Macdonald</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Meng</surname></string-name>,
          <article-title>A heterogeneous graph neural model for cold-start recommendation</article-title>, in:
          <source>SIGIR'20</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gao</surname></string-name>,
          <article-title>Geometric inductive matrix completion: A hyperbolic approach with unified message passing</article-title>, in:
          <source>WSDM'22</source>,
          <year>2022</year>.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name><given-names>N.</given-names> <surname>Reimers</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>,
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-networks</article-title>, in:
          <source>EMNLP-IJCNLP'19</source>,
          <year>2019</year>.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name><given-names>R.</given-names> <surname>Ying</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Eksombatchai</surname></string-name>,
          <string-name><given-names>W. L.</given-names> <surname>Hamilton</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Leskovec</surname></string-name>,
          <article-title>Graph convolutional neural networks for web-scale recommender systems</article-title>, in:
          <source>SIGKDD'18</source>,
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Trouillon</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Welbl</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Riedel</surname></string-name>,
          <string-name><given-names>É.</given-names> <surname>Gaussier</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Bouchard</surname></string-name>,
          <article-title>Complex embeddings for simple link prediction</article-title>, in:
          <source>ICML'16</source>,
          <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name><given-names>M. A.</given-names> <surname>Kramer</surname></string-name>,
          <article-title>Nonlinear principal component analysis using autoassociative neural networks</article-title>,
          <source>AIChE Journal</source>
          (<year>1991</year>).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name><given-names>R. H.</given-names> <surname>Hahnloser</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Sarpeshkar</surname></string-name>,
          <string-name><given-names>M. A.</given-names> <surname>Mahowald</surname></string-name>,
          <string-name><given-names>R. J.</given-names> <surname>Douglas</surname></string-name>,
          <string-name><given-names>H. S.</given-names> <surname>Seung</surname></string-name>,
          <article-title>Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit</article-title>,
          <source>Nature</source>
          <volume>405</volume>
          (<year>2000</year>).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Tarlow</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Brockschmidt</surname></string-name>,
          <string-name><given-names>R. S.</given-names> <surname>Zemel</surname></string-name>,
          <article-title>Gated graph sequence neural networks</article-title>, in:
          <source>ICLR'16</source>,
          <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name><given-names>Y.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>Learning entity and relation embeddings for knowledge graph completion</article-title>, in:
          <source>AAAI'15</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Qiu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tang</surname></string-name>,
          <article-title>DeepInf: Social influence prediction with deep learning</article-title>, in:
          <source>SIGKDD'18</source>,
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name><given-names>T. N.</given-names> <surname>Kipf</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Welling</surname></string-name>,
          <article-title>Semi-supervised classification with graph convolutional networks</article-title>
          (<year>2017</year>).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Rendle</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Krichene</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Anderson</surname></string-name>,
          <article-title>Neural collaborative filtering vs. matrix factorization revisited</article-title>, in:
          <source>RecSys'20</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name><given-names>F. M.</given-names> <surname>Harper</surname></string-name>,
          <string-name><given-names>J. A.</given-names> <surname>Konstan</surname></string-name>,
          <article-title>The MovieLens datasets: History and context</article-title>,
          <source>ACM Transactions on Interactive Intelligent Systems (TiiS)</source>
          (<year>2015</year>).
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Ni</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>McAuley</surname></string-name>,
          <article-title>Justifying recommendations using distantly-labeled reviews and fine-grained aspects</article-title>, in:
          <source>EMNLP-IJCNLP'19</source>,
          <year>2019</year>.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Dai</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Su</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Cai</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Xiao</surname></string-name>,
          <article-title>BARS: Towards open benchmarking for recommender systems</article-title>, in:
          <source>SIGIR'22</source>,
          <year>2022</year>.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Jamieson</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rostamizadeh</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Gonina</surname></string-name>,
          <article-title>A system for massively parallel hyperparameter tuning</article-title>
          (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name><given-names>G.</given-names> <surname>Adomavicius</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Kwon</surname></string-name>,
          <article-title>Improving aggregate recommendation diversity using ranking-based techniques</article-title>,
          <source>IEEE Trans. Knowl. Data Eng.</source>
          (<year>2012</year>).
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Qiu</surname></string-name>,
          <article-title>On the effectiveness of sampled softmax loss for item recommendation</article-title>,
          <source>arXiv:2201.02327</source>
          (<year>2022</year>).
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name><given-names>W.</given-names> <surname>Krichene</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Rendle</surname></string-name>,
          <article-title>On sampled metrics for item recommendation</article-title>, in:
          <source>SIGKDD'20</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Latifi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Jannach</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ferraro</surname></string-name>,
          <article-title>Sequential recommendation: A study on transformers, nearest neighbors and sampled metrics</article-title>,
          <source>Inf. Sci.</source>
          <volume>609</volume>
          (<year>2022</year>).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>