<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transformer-Empowered Content-Aware Collaborative Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daxin Jiang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weizhe Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Linjun Shou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming Gong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pei Jian</string-name>
          <email>jpei@cs.sfu.ca</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhilin Wang</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bill Byrne</string-name>
          <email>bill.byrne@eng.cam.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering, University of Cambridge</institution>
          ,
          <addr-line>Cambridge</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft STCA</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Simon Fraser University</institution>
          ,
          <addr-line>British Columbia</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Washington</institution>
          ,
          <addr-line>Seattle</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Knowledge graph (KG) based Collaborative Filtering (CF) is an effective approach to personalize recommender systems for relatively static domains such as movies and books, by leveraging structured information from KG to enrich both item and user representations. This paper investigates the complementary power of unstructured content information (e.g. rich summary texts of items) in KG-based CF recommender systems. We introduce Content-aware KG-enhanced Meta-preference Networks that enhance CF recommendation based on both structured information from KG and unstructured content features based on Transformer-empowered content-based filtering (CBF). Within this modeling framework, we demonstrate a powerful KG-based CF model and a CBF model (a variant of the well-known NRMS system) and employ a novel training scheme, Cross-System Contrastive Learning, to address the inconsistency of the two very different systems in fusing information. We present experimental results showing that enhancing collaborative filtering with Transformer-based features derived from content-based filtering offers new improvements relative to strong baseline systems, improving the ability of KG-based CF systems to exploit item content information.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge graph</kwd>
        <kwd>recommender systems</kwd>
        <kwd>collaborative filtering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Collaborative Filtering (CF) and Content-based Filtering (CBF) are two widely used approaches to recommendation. CF systems study users’ interactions in order to leverage inter-item, inter-user, or user-item dependencies in making recommendations. The underlying notion is that users who interact with similar sets of items are likely to share preferences for other items. CBF models leverage descriptive attributes of items (e.g. item description and category) and users (e.g. age and gender). Users are characterized by the content information available in their browsing histories [2]. CBF is particularly well-suited to news recommendations, where millions of new items are produced every day. In contrast, CF systems are better suited to scenarios where the inventory of items grows slowly and where abundant user-item interactions are available. Movie and book recommender systems are typical examples.</p>
      <p>Knowledge graphs (KGs) store structured information as triplets such as [Christopher Nolan] - [director] - [Dunkirk (movie)]. KG-based CF models are particularly good at linking items to other related knowledge graph entities that serve as “item properties”. This approach leverages the structured content information from KGs (e.g. movie genre and actors) to complement CF features.</p>
      <p>†This work was done during Weizhe Lin’s internship at Microsoft STCA.</p>
<p>While KGs can readily incorporate structured content information, unstructured content, such as item descriptions, is largely unexploited.</p>
<p>Recent Transformer-based models, such as BERT [6] and GPT-2 [7], have shown great power in modeling descriptive content from natural language, which offers new opportunities to enrich item/user representations with more expressive CBF features derived from Transformers. For example, two movies may have a very similar set of structured properties, including genre, writer, and director, while their descriptions provide more fine-grained discriminative information, making it clear that one is about physics and the universe and the other is about adventures and dreams.</p>
      <p>Therefore, in this work, we offer insights into the complementary power of unstructured CBF features derived from Transformers (e.g. summary texts of books and movies). We investigate how these content-aware CBF features can be effectively fused to complement CF learning, and how much value they can add to standard large-scale KG-based CF recommender systems.</p>
      <p>However, computationally efficient approaches to enrich KG-based CF models with unstructured CBF features derived from Transformers are not yet well addressed in the literature. The challenge mainly stems from the need to capture the co-occurrence of graph node features by graph convolution operations. This operation requires representations of graph nodes to be back-propagated and updated after each forward pass, and thus it is prohibitively costly for large graphs where millions of item/user nodes require Transformer-generated embeddings. Therefore, using pre-extracted features from trained CBF systems is the most promising option. However, conventional fusion schemes (such as Mixture of Experts and early/late fusion) are shown to be vulnerable in our experiments (see Sec. 4.4). We address this problem by introducing Cross-System Contrastive Learning, which brings together the benefits of both structured and unstructured item properties. In this paper:</p>
      <p>1. We introduce a powerful KG-based CF model (KMPN) that outperforms strong baselines, and demonstrate the improvement brought by each system component. We also introduce a Transformer-empowered CBF model (NRMS-BERT) that achieves good recommendation performance with only the summary texts of books and movies.</p>
      <p>2. We propose to merge unstructured content-based features into KG-based CF through a simple but effective fusion framework based on Cross-System Contrastive Learning.</p>
      <p>3. Based on two realistic recommendation datasets, we present extensive experiments showing the value of incorporating unstructured CBF features derived from Transformers.</p>
    </sec>
    <sec id="sec-related-work">
      <title>2. Related Work</title>
      <p>Collaborative Filtering. Traditional CF models rely on Matrix Factorization (MF) [8, 9, 10] and Factorization Machine (FM) [11, 12, 13] approaches in learning user-item representations. Nearest-neighbour approaches are also predominant in CF, where the user-item ratings are interpolated from the ratings of similar items and users [14, 15, 16]. Recent models incorporate Deep Neural Networks (DNN) in learning [17, 12, 18, 19, 20, 21]. Building upon graph-based CF models [22, 23], KG-based CF models fuse external knowledge from auxiliary KGs to improve both the accuracy and explainability of recommendation [5, 24]. Items in interaction graphs are associated with auxiliary KG entities with respect to their attributes (e.g. movie directors). To exploit the KGs, Embedding-based Methods employ KG embedding methods (e.g. TransE [25], TransH [26] and TransR [27]) in order to enhance item representations with KG-aware entity embeddings [28, 29, 30]. For example, KTUP [30] trains item representations and TransH-powered KG completion simultaneously. Path-based Methods follow the meta-paths manually designed by domain experts to make KG-path-aware recommendations [31, 32, 33, 34], which is, however, not feasible for larger KGs with their enormous entity and path diversity. Convolution Methods [35, 36, 32, 37, 38] design convolution mechanisms, mostly variants of Graph Neural Networks [39, 40] (GNNs), to enhance item/user representations with features aggregated from distant entities. KGIN [41] further embeds KG-relational embeddings in inter-node feature passing to achieve path-aware graph convolution.</p>
      <p>Content-based Filtering. CBF models match items to a user by considering the metadata (content-based information) of items with which the user has interacted [42, 43, 44, 45, 46]. Most research in KG-based CBF, a recently popular topic, focuses on enhancing the item representations with KG embeddings by mapping relevant KG entities to the content of items, e.g., by entity linking [47, 48]. However, these methods heavily rely on word-level entity mapping with KG entities, which is prohibited for movies/books since their descriptions mostly consist of imaginary content, such as character names and fictional stories.</p>
      <p>Fusing CF and CBF. Hybrid CF-CBF systems are often achieved by weighting/combining [49, 50] or switching [51, 52, 53] between the ranking outputs of the two systems. They can also pass a relatively coarser ranking list produced by one system into the other for refinement [54, 55]. The features derived from one system can also be used to complement the other system by fusing with the output features (late fusion) [56] or augmenting the user/item input features (early fusion) [57, 58]. For example, CKE [29] produces augmented item representations by obtaining fixed textual features from unsupervised denoising auto-encoders. In contrast, we introduce NRMS-BERT to obtain more expressive textual item representations with supervised training and larger language models. Furthermore, these conventional fusing approaches (including late/early fusion and mixture of experts) fail to perform well in our experiments (Sec. 4.4). We address this by proposing a novel training scheme based on contrastive learning that complements a KG-based CF model with content-aware features.</p>
<p>[Figure 1: System overview. (1) KG feature passing among users, items, and non-item KG entities; (2) preference modelling; (3) NRMS-BERT, with a BERT item encoder, attention pooling, and a user encoder over user-browsed items; (4) Cross-System Contrastive Learning; (5) prediction.]</p>
    </sec>
    <sec id="sec-2">
<title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>3.1. Data Notation</title>
<p>There are N_u users {u} and N_i items {i}. O+ = {(u, i) | u ∈ U, i ∈ I} is the set of user interactions; each (u, i) pair indicates that user u interacted with item i. Each item i ∈ I carries unstructured data x_i, e.g. a text description of the item.</p>
        <p>The KG contains structured information that describes relations between real-world entities. The KG is represented as a weighted heterogeneous graph G = (V, E) with a node set V consisting of N_v nodes {v} and an edge set E containing all edges between nodes. The graph is also associated with a relation type mapping function f : E → R that maps each edge to a type in the relation set R consisting of N_r relations. Note that all items are included in the KG: I ⊂ V.</p>
        <p>The edges of the knowledge graph are derived from the triplets T = {(h, r, t) | h, t ∈ V, r ∈ R}, where V is the collection of graph entities/nodes and R is the relation set. Each triplet describes that a head entity h is connected to a tail entity t with the relation r. For example, (Jack Nicholson, film actor, The Shining) specifies that Jack Nicholson is a film actor in the movie “The Shining”. To fully expose the relationships between heads and tails, the relation set is extended with reversed relation types, i.e., for any (h, r, t) triplet we allow the inverse connection (t, r′, h) to be built, where r′ is the reverse of r. The edge set E is derived from these triplets.</p>
      </sec>
      <sec id="sec-2-2">
<title>3.2. KG-enhanced Meta-Preference Network (KMPN)</title>
<p>This section introduces the KG-enhanced Meta-Preference Network (hereafter KMPN), a KG-based CF model that aggregates features of all KG entities to items efficiently by exploiting relationships in the KG, and then links item features to users for recommendations, as shown in Fig. 1 (1), (2), and (5).</p>
        <sec id="sec-2-2-1">
          <title>3.2.1. Gated Path Graph Convolution Networks</title>
<p>Associated with each KG node v is a feature vector e_v(0) ∈ R^h. Each relation type r ∈ R is also associated with a relational embedding e_r. A Gated Path Graph Convolution Network is a cascade of L convolution layers. For each KG node, a convolution layer aggregates features from its neighbours as follows:</p>
          <p>e_v(k+1) = (1 / |N_v|) Σ_{(r, v′) ∈ N_v} η(v, r, v′) e_r ⊙ e_v′(k),</p>
          <p>where N_v = {(r, v′) | (v′, r, v) ∈ T} is the neighbouring set of v and η is a gated function that controls the messages that flow from v′ to v:</p>
          <p>η(v, r, v′) = σ(e_v(k)ᵀ e_r),</p>
          <p>where σ(·) is a sigmoid function that limits the gated value between 0 and 1. As a result, the message passed to a node is weighted by its importance to the receiving node and the relation type. Through stacking multiple layers of convolution, the final embedding at a node depends on the path along which the features are shared, as well as the importance of the messages being transmitted. After L convolutions, the embedding at a KG node is an aggregation of all the intermediate outputs: e_v = Σ_{k′=0}^{L} e_v(k′).</p>
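<p>A minimal NumPy sketch of one such gated convolution layer follows. The gate form σ(e_vᵀ e_r) is our reading of the garbled equation above, and the shapes and data structures are illustrative, not the paper's implementation:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv_layer(e_nodes, e_rel, neighbours):
    """One gated convolution layer: each node averages relation-modulated
    messages from its neighbours, scaled by a sigmoid gate.

    e_nodes:    (N, h) node embeddings at layer k
    e_rel:      dict mapping relation name -> (h,) relational embedding
    neighbours: dict mapping node v -> list of (r, v2) with (v2, r, v) in T
    Returns the (N, h) embeddings at layer k+1.
    """
    out = np.zeros_like(e_nodes)
    for v, nbrs in neighbours.items():
        msgs = np.zeros(e_nodes.shape[1])
        for r, v2 in nbrs:
            gate = sigmoid(e_nodes[v] @ e_rel[r])    # scalar gate in (0, 1)
            msgs += gate * (e_rel[r] * e_nodes[v2])  # relation-modulated message
        if nbrs:
            out[v] = msgs / len(nbrs)                # mean over the neighbourhood
    return out
```

Stacking L such layers and summing the intermediate outputs yields the final node embedding described above.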
        </sec>
        <sec id="sec-2-2-2">
          <title>3.2.2. User Preference Modeling</title>
<p>Inspired by Wang et al. [41], we model users using a combination of preferences. Wang et al. [41] assumed that each user is influenced by multiple intents and that each intent is influenced by multiple movie attributes, such as the combination of two relation types. Based on this assumption, they proposed to aggregate item embeddings to users through “preferences”, where the embedding of each preference is modelled as a weighted combination of the embeddings of all edge relation types.</p>
          <p>We take the view that user preferences are not only limited to relations but can be extended to more general cases. We model each preference p through a combination of a set of meta-preferences M, with in total N_m meta-preferences: each meta-preference m ∈ M is associated with a trainable embedding e_m ∈ R^h, and a preference p is formed by these meta-preferences as follows:</p>
          <p>e_p = Σ_{m ∈ M} β_{mp} e_m,</p>
          <p>where the linear weights {β_{mp} | m ∈ M} are derived from trainable weights {β̂_{mp} | m ∈ M} for each preference p: β_{mp} = exp(β̂_{mp}) / Σ_{m′ ∈ M} exp(β̂_{m′p}).</p>
          <p>As a result, meta-preferences reflect the general interests of all users. A particular user can be profiled by aggregating the embeddings of interacted items through these preferences:</p>
          <p>e_u(k+1) = Σ_{p ∈ P} α_{up} Σ_{(u, i) ∈ O+} e_p ⊙ e_i(k),</p>
          <p>where P is the collection of N_p preferences {p} and α_{up} is an attention mechanism that weights the interest of users over different preferences: α_{up} = exp(e_pᵀ e_u(k)) / Σ_{p′ ∈ P} exp(e_{p′}ᵀ e_u(k)).</p>
          <p>In summary, each preference is formed by general and diverse meta-preferences, and users are further profiled by multiple preferences that focus on different aspects of item features. As with items, the final user embedding is an aggregation of the intermediate outputs: e_u = Σ_{k′=0}^{L} e_u(k′).</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>3.2.3. Soft Distance Correlation</title>
          <p>Having modelled users through preferences, Wang et al. [41] added an additional loss that utilizes Distance Correlation (DCorr) [59, 60] to separate the representations of these learnt preferences as much as possible, in order to obtain diverse proxies bridging users and items. Though the authors demonstrate a considerable improvement over baselines, we take the view that applying constraints to all dimensions of preference embeddings restricts their expressiveness, as they are trained to be very dissimilar and have diverse orientations in latent space.</p>
          <p>We adopt a softer approach: Soft Distance Correlation Loss, which firstly lowers the dimensionality of preference embeddings with Principal Component Analysis (PCA) [61], keeping a ratio ρ of the most differentiable feature dimensions, and then applies distance correlation constraints to encourage diverse expression in the lower dimensions: ê_p = PCA({e_{p′} | p′ ∈ P}) ∈ R^{ρh}.</p>
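<p>A minimal NumPy sketch of this Soft Distance Correlation constraint, assuming a standard empirical distance-correlation estimator and a pairwise sum over preferences (the paper's exact normalization may differ):</p>

```python
import numpy as np
from itertools import combinations

def pca_reduce(E, ratio=0.5):
    """Keep the top ratio*h principal components of the preference matrix E
    (rows are preference embeddings)."""
    Ec = E - E.mean(axis=0)
    _, _, Vt = np.linalg.svd(Ec, full_matrices=False)
    k = min(max(1, int(E.shape[1] * ratio)), Vt.shape[0])
    return Ec @ Vt[:k].T

def dist_corr(x, y, eps=1e-12):
    """Empirical distance correlation between two 1-D vectors, treating
    each coordinate as a sample."""
    def centred(z):
        d = np.abs(z[:, None] - z[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centred(x), centred(y)
    dcov2 = (A * B).mean()
    dvar = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / (dvar + eps))

def soft_dcorr_loss(prefs, ratio=0.5):
    """Soft Distance Correlation: reduce dimensionality first, then
    penalise the pairwise distance correlation between preferences."""
    reduced = pca_reduce(prefs, ratio)
    return sum(dist_corr(reduced[i], reduced[j])
               for i, j in combinations(range(len(prefs)), 2))
```

With ratio = 1 this reduces to the standard DCorr constraint over all dimensions, while smaller ratios relax it, as discussed in Sec. 4.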
</sec>
        <sec id="sec-2-2-4">
          <title>3.2.4. Negative Sampling and Model Training</title>
          <p>The rating of user u for item i is the dot product of their embeddings: ŷ_{ui} = e_uᵀ e_i.</p>
          <p>Negative samples are drawn from the non-interacted pairs O− = {(u, i) | (u, i) ∉ O+}. However, an item is not necessarily “not interesting” to a user if no interaction happens, as not all items have been viewed. We propose to adopt Reciprocal Ratio Negative Sampling (RRNS), where items with more user interactions are considered popular and are sampled less frequently, based on the assumption that popular items are less likely to be hard negative samples for any user. The sampling distribution is given by a normalized reciprocal ratio of item interactions: i− ∼ P(i) ∝ 1 / c(i) for i ∈ I, where c(i) counts the interactions of all users with the item i.</p>
          <p>The training set therefore consists of positive and negative samples: D = {(u, i+, i−) | (u, i+) ∈ O+, (u, i−) ∈ O−}. Pairwise BPR loss [9] is adopted to train the model, which exploits a contrastive learning concept to assign higher scores to users’ browsed items than to those items in which the users are not interested:</p>
          <p>ℒ_BPR = Σ_{(u, i+, i−) ∈ D} − ln(σ(ŷ_{u i+} − ŷ_{u i−})).</p>
          <p>Together with the commonly-used embedding L2 regularization and the Soft Distance Correlation loss, the final loss is given by:</p>
          <p>ℒ_KMPN = ℒ_BPR + λ1 ||Θ||²₂ + λ2 ℒ_SoftDCorr,</p>
          <p>where Θ = {e_u, e_{i+}, e_{i−} | (u, i+, i−) ∈ D}, ||Θ||²₂ is the L2-norm of the user/item embeddings, and λ1 and λ2 are hyperparameters that control the loss weights.</p>
        </sec>
      </sec>
<sec id="sec-2-3">
        <title>3.3. Neural Recommendation with Multi-Head Self-Attention (NRMS-BERT)</title>
        <p>Inspired by NRMS [43], which is powerful in news recommendations, we propose a variant of NRMS, NRMS-BERT, that further utilizes a fine-tuned Transformer (BERT) for extracting contextual information from the descriptions of items, as shown in Fig. 1 (3).</p>
        <p>The rating is the dot product of user and item embeddings: ŷ_{ui} = e_uᵀ e_i. Assuming that the scores of the positive sample and the n negative samples are ŷ+ and ŷ1−, ..., ŷn−, following [43], the loss is the negative log click probability of the positive item:</p>
        <p>ℒ_NRMS = − Σ_{u ∈ U} log( exp(ŷ+) / (exp(ŷ+) + Σ_{j=1,..,n} exp(ŷj−)) ).</p>
        <sec id="sec-2-6-1">
          <title>3.3.1. Item Encoder</title>
          <p>The item encoder encodes the text description string x_i of any item i ∈ I through BERT into an embedding of size h by extracting the embedding of the &lt;CLS&gt; token at the last layer: e_i = BERT(x_i) ∈ R^h. For each user u, the item encoder encodes one positive item e_{i+} and n negative items e_{i1−}, ..., e_{in−}. B items are randomly sampled from the user’s browsed items i_{u,1}, ..., i_{u,B}; these browsed items are encoded and gathered to E_u = [e_{u,1}, ..., e_{u,B}] ∈ R^{B×h}.</p>
        </sec>
<sec id="sec-2-6-2">
          <title>3.3.2. User Encoder</title>
          <p>The user encoder uses the items with which a user interacted to produce a content-aware user representation. The final user representation is a weighted sum of the B browsed items: e_u = Σ_{j=1}^{B} α_j e_{u,j}, where α_j is the attention weight assigned to item j, obtained by passing the features through two linear layers:</p>
          <p>Â = tanh(E_u A_1 + b_1) A_2 + b_2 ∈ R^{B×1}; α_j = exp(Â_j) / Σ_{j′=1,..,B} exp(Â_{j′}),</p>
          <p>where A_1 ∈ R^{h×h/2}, b_1 ∈ R^{h/2}, A_2 ∈ R^{h/2×1}, and b_2 ∈ R^1 are the weights and biases of the two fully-connected layers.</p>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>3.4. Fusing CF and CBF: Content-aware KMPN (CKMPN)</title>
        <p>To fuse the information from a CBF model (NRMS-BERT) into a CF model (KMPN), we must bridge some inconsistencies between the two types of models. CBF models that utilize large transformers cannot be co-optimized with KG-based CF models, as graph convolution requires all embeddings to be present before convolution, and this requires enormous GPU memory for even one single forward pass. As a result, a more efficient solution merges the pre-trained CBF features into the training of the KG-based CF model.</p>
        <p>A Cross-System Contrastive Loss is adopted to encourage the KMPN system to learn to incorporate content-sensitive features from the NRMS-BERT features:</p>
        <p>ℒ_CSC = Σ_{(u, i+, i−) ∈ D} [ − ln(σ(e_uᵀ (e′_{i+} − e′_{i−}))) − ln(σ(e′_uᵀ (e_{i+} − e_{i−}))) ],</p>
        <p>where e_u, e_{i+}, e_{i−} are KMPN embeddings and e′_u, e′_{i+}, e′_{i−} are the pre-extracted NRMS-BERT embeddings.</p>
        <p>This loss encourages KMPN to produce item embeddings that interact not only with KMPN’s own user embeddings, but also with NRMS-BERT’s user embeddings. Similarly, the user embeddings of KMPN are trained to interact with the items of NRMS-BERT. This allows e_u to learn mutual expressiveness with e′_u, but without approaching the two embeddings directly using a similarity measure (e.g. cosine-similarity), which we found not to work well (discussed in Sec. 4.4). In this case, e′_u serves as an ‘anchor’ with which the item embeddings of the two systems learn to share commons and increase their mutuality. This loss encourages e_i and e′_i to lie in the same hidden-space hyperplane, on which features have the same dot-product results with e′_u. This constraint encourages KMPN to grow embeddings in the same region of hidden space, leading to mutual expressiveness across the two systems. Finally, the optimization target is:</p>
        <p>ℒ_CKMPN = ℒ_KMPN + λ3 ℒ_CSC,</p>
        <p>where λ3 controls the weight of the Cross-System Contrastive Loss. This fusion scheme can be applied to any models with similar CF/CBF mechanisms.</p>
        <p>[Table 1: Recall, ndcg, and Hit Ratio on Amazon-Book-Extended (BPRMF, CKE, KGAT, KGIN; KMPN and its ablations w/o Soft DCorr and w/o Soft DCorr and RRNS; NRMS-BERT; CKMPN with λ3 = 0.2 and λ3 = 0.1; improvement of CKMPN vs. the best baselines) and on the Movie-KG-Dataset (BPRMF, CKE, KGAT, KGIN; KMPN (ρ = 0.5, N_m = 64); NRMS-BERT; CKMPN (λ3 = 0.01); CKMPN on the cold-start set). The best performance of the proposed models is marked in bold. The average of 3 runs is reported to mitigate experimental randomness. Metrics with (*) are significantly higher than KMPN (p &lt; 0.05).]</p>
      </sec>
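<p>The Cross-System Contrastive Loss of Sec. 3.4 can be sketched for a single training sample as follows (a minimal NumPy illustration; variable names are ours, and in training the NRMS-BERT embeddings are pre-extracted and frozen):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_system_contrastive_loss(e_u, e_ip, e_in, cbf_u, cbf_ip, cbf_in):
    """Cross-System Contrastive Loss for one (u, i+, i-) sample.

    e_u, e_ip, e_in       : trainable KMPN (CF) user/item embeddings
    cbf_u, cbf_ip, cbf_in : frozen, pre-extracted NRMS-BERT (CBF) embeddings
    Each system's user embedding is scored against the *other* system's
    item embeddings, so the CBF embeddings act as anchors rather than as
    direct similarity targets (unlike the Cos-Sim baseline of Sec. 4.4).
    """
    term_cf_items = -np.log(sigmoid(cbf_u @ (e_ip - e_in)))   # CBF user vs CF items
    term_cbf_items = -np.log(sigmoid(e_u @ (cbf_ip - cbf_in)))  # CF user vs CBF items
    return term_cf_items + term_cbf_items
```

Summed over the training triples and weighted by λ3, this term is added to the KMPN loss to form the CKMPN objective.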
    </sec>
    <sec id="sec-3">
<title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Datasets</title>
        <p>
          We use the two datasets introduced in [62]: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
Amazon-Book-Extended collects book descriptions from multiple
data sources for the popular Amazon-Book dataset. It
contains 70,679 users, 24,915 items along with a KG of
88,572 nodes and 2,557,746 triplets. (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) Movie-KG-Dataset
is a newly collected dataset that contains 125,218 users,
50,000 items with a KG of 250,327 nodes and 12,055,581
triplets. Descriptions of movies are provided to enable
content-based recommendations.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Training Details</title>
        <p>All experiments were run on 8 NVIDIA A100 GPUs
with batch size 8192 × 8 for KMPN/CKMPN and 4 × 8
for NRMS-BERT. Adam [63] is used to optimize models.
KMPN/CKMPN is trained for 2000 epochs with linearly
decayed learning rates from 10−3 to 0 for
Amazon-BookExtended and 5 × 10−4 to 0 for Movie-KG-Dataset.
Training takes 4 hours on Amazon-Book-Extended and 12
hours on Movie-KG-Dataset. NRMS-BERT is trained for
10 epochs at a constant learning rate of 10−4. Training
takes 20 hours on Amazon-Book-Extended and 120 hours
on Movie-KG-Dataset.</p>
<p>Code and pre-trained models will be released at
https://github.com/LinWeizheDragon/Content-AwareKnowledge-Enhanced-Meta-Preference-Networks-forRecommendation.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Evaluation Metrics and Baselines</title>
<p>Following common practice [21, 37, 41, 64], we report three metrics for evaluating model performance: (1) Recall@K: within the top-K recommendations, how well the system recalls the test-set browsed items for each user; (2) ndcg@K (Normalized Discounted Cumulative Gain) [64]: increases when relevant items appear earlier in the recommended list; (3) HitRatio@K: how likely a user is to find at least one interesting item in the recommended top-K items.</p>
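<p>For concreteness, the three metrics can be computed per user as follows (a minimal sketch with standard definitions; the exact evaluation protocol follows the cited works):</p>

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of the user's test-set items retrieved in the top-k."""
    return len(set(ranked[:k]) & relevant) / max(len(relevant), 1)

def hit_ratio_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return float(bool(set(ranked[:k]) & relevant))

def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of relevant items, normalized by the ideal ranking,
    so earlier relevant items contribute more."""
    dcg = sum(1.0 / np.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(pos + 2)
                for pos in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

Per-user values are averaged over all test users to produce the reported numbers.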
<p>We take the performance of several recently published recommender systems as points for comparison¹. We carefully reproduced all these baseline systems from their repositories².</p>
        <p>BPRMF [9]: a strong Matrix Factorization (MF)
method that applies a generic optimization criterion
BPROpt for personalized ranking. Limited by space, other
MF models (e.g. FM [65], NFM [12]) are not presented
since BPRMF outperformed them.</p>
        <p>CKE [29]: a CF model that leverages heterogeneous
information in a knowledge base for recommendation.</p>
        <p>KGAT [37]: Knowledge Graph Attention Network
(KGAT) which explicitly models high-order KG
connectivities in KG. The models’ user/item embeddings were
initialized from the pre-trained BPRMF weights.</p>
        <p>KGIN [41]: a state-of-the-art KG-based CF model that
models users’ latent intents (preferences) as a
combination of KG relations.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Performance on Amazon Dataset</title>
<p>Comparison with baselines. The performance of the models is presented in Table 1. Our proposed KG-based CF model, KMPN, achieved a substantial improvement on all metrics over the performance of the existing state-of-the-art model KGIN; for example, Recall@20 was improved from 0.1654 to 0.1719, Recall@100 from 0.3298 to 0.3405, and ndcg@100 from 0.1267 to 0.1315. All relative improvements mentioned in our discussions are statistically significant (p &lt; 0.05).</p>
<p>NRMS-BERT models user-item preferences using only item summary texts, without external information from a knowledge base. It still achieves 0.1142 Recall@20 and 0.4273 Hit Ratio@100, not far from the KGIN baseline at 0.5040 Hit Ratio@100.</p>
        <p>CKMPN further improves all @60/@100 metrics while keeping the model’s performance at @20. For example, with similar Recall@20, CKMPN (0.3461 Recall@100) outperforms KMPN (0.3405 Recall@100) by 1.6% with statistical significance (p &lt; 0.05). This demonstrates that even though KMPN achieves higher performance relative to NRMS-BERT, gathering the item and user embeddings of one system (KMPN) with those of the other system (NRMS-BERT) through proxies (Cross-System CL) can still encourage KMPN to learn and fuse content-aware information from the learned representations of a CBF model, and presents more relevant items in the top-100 list.</p>
        <p>¹They are also the baseline systems compared in a recent paper [41] (WWW’21). ²As a result, the results reported here may differ from those of the original papers.</p>
<p>Comparison with hybrid methods: Conventional feature fusion methods are popular and convenient options for combining one system into the training of another (as surveyed in Sec. 2). In fusing a pre-trained NRMS-BERT with KMPN, we demonstrate the effectiveness of our proposed fusion framework CKMPN by comparing it with these conventional approaches.</p>
        <p>• Early Fusion: CBF features are concatenated to the trainable user/item embeddings of KMPN before the graph convolution layers.</p>
        <p>• Late Fusion: CBF features are fused to the output user/item embeddings of KMPN after the graph convolution layers. Many feature aggregation methods were tried, and the best of them are reported in Table 2: (1) concat+linear: CF features are concatenated with CBF features and passed through 3 MLP layers into embeddings of size R^{2×h}; (2) MultiHeadAtt: CF and CBF features are passed through 3 Multi-head Self-Attention blocks into embeddings of size R^{2×h}.</p>
        <p>• Cos-Sim: An auxiliary loss grounded on cosine-similarity is incorporated in training to encourage the user/item embeddings of KMPN to approach those of NRMS-BERT.</p>
        <p>• Mixture of Experts (MoE): a hybrid system where the output scores of the two systems, KMPN and NRMS-BERT, pass through 3 layers of a Multi-Layer Perceptron (MLP) to obtain the final item ratings.</p>
        <p>
          It can be concluded that these feature aggregation
approaches do not perform well in fusing pre-trained CBF
features into KG-based CF training. (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) The performance
of Late Fusion shows that when the already-learned
NRMS-BERT item/user embeddings pass through new
layers, these layers undo the learned representations
from NRMS-BERT and lead only to degraded
performance. (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) Cos-Sim shows that the auxiliary loss based
on cosine similarity places a reliance on NRMS-BERT’s
features, which damages KMPN training by limiting
the expressiveness of KMPN to that of NRMS-BERT. As a
result, Recall@60 decreases from 0.2793 (KMPN)
to 0.2436 (Cos-Sim).
        </p>
        <p>Though NRMS-BERT alone achieves much lower
metrics than KMPN (0.1142 vs 0.1719 Recall@20), MoE,
where the scores of the two systems are merged by MLP
layers, achieves 0.1723 Recall@20, showing that the
scoring of the two systems is complementary. However, MoE’s
performance deteriorates at @60/100. A case study is
presented later in Sec. 4.6 to show that the scoring of one
system can be extreme enough to overwhelm the final
rating under the MoE setting. In contrast, our CKMPN
steadily achieves better @60/100 results relative to
KMPN, showing that our method is an in-depth
collaboration of the two systems rather than a simple
aggregation of system outputs as in MoE.
</p>
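<p>The Cross-System CL objective discussed here can be sketched as a generic InfoNCE-style alignment between the two systems’ embeddings of the same item. The temperature, normalization, and function names below are illustrative assumptions, not the paper’s exact formulation.</p>
<p>
```python
import numpy as np

def cross_system_cl_loss(cf_emb, cbf_emb, tau=0.1):
    # Pull the CF (KMPN) embedding of each item toward the CBF (NRMS-BERT)
    # embedding of the same item; push it away from other items' embeddings.
    cf = cf_emb / np.linalg.norm(cf_emb, axis=1, keepdims=True)
    cbf = cbf_emb / np.linalg.norm(cbf_emb, axis=1, keepdims=True)
    logits = cf @ cbf.T / tau                      # (n, n) cross-system sims
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))          # diagonal = matching pairs

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
# perfectly aligned systems give a lower loss than unrelated ones
loss_aligned = cross_system_cl_loss(emb, emb)
loss_random = cross_system_cl_loss(emb, rng.standard_normal((4, 8)))
print(round(float(loss_aligned), 4), round(float(loss_random), 4))
```
</p>
<p>Because the loss only aligns representations rather than mixing scores, each system keeps its own output head, which is what distinguishes this from MoE-style aggregation.</p>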
        <p>In conclusion, Cross-System CL significantly enhances KMPN’s ability to present more relevant items in the top-100 list through the fusion of unstructured content-based features. It remedies the aforementioned shortcomings of conventional fusion methods by merging features without corrupting the already-learned representations and without directly matching the two systems’ outputs.</p>
        <sec>
          <title>4.5. Contributions of Components</title>
          <p>To support the rationale of our designs, ablation studies and hyperparameter evaluations are presented to explore the effects of each proposed component.</p>
          <p>Effects of Meta Preferences. An important research question is how the design of modeling users through meta-preferences improves the model performance. As shown in Fig. 2a, removing meta-preference modeling of users from KMPN (setting the number of meta preferences to 0) dramatically decreases the performance, showing that modeling users’ preferences is necessary. Using 16 meta preferences achieves worse performance than using 32 or more, since a small number of meta preferences limits the model’s capacity for modeling users. The performance on all metrics increases until it peaks at 64 meta preferences, and then starts to decrease at 128 and beyond. This suggests that including too many meta preferences induces overfitting and does not further improve the system. This is a good model property in practice, since a moderate 64 meta preferences is sufficient for achieving the best performance.</p>
          <p>Effects of Soft Distance Correlation Loss. The ratio hyperparameter controls the number of principal components to keep after PCA dimension reduction: the lower the ratio, the more flexibility the preference embeddings recover in dimensions from the standard Distance Correlation (DCorr) constraint. As shown in Fig. 2b, a ratio of 0 (left) removes the DCorr constraint completely, while a ratio of 1 (right) reduces to a standard DCorr loss. As the ratio approaches 0, the DCorr constraint becomes too loose to encourage the diversity of preferences, leading to dramatically decreased performance. The performance peaks at a ratio of 0.5, where half of the ℎ dimensions are relaxed from the standard DCorr constraint while preference embeddings are still able to grow diversely in the remaining half. This suggests that our softer version of the DCorr constraint is beneficial to user modeling.</p>
          <p>Effects of RRNS. As shown in Table 1, without Reciprocal Ratio Negative Sampling, Recall@20 of KMPN (w/o SoftDcorr) decreases from 0.1704 to 0.1690. In line with our intuition, reducing the probability of sampling popular items as negative samples for training yields benefits in model learning. This demonstrates that while viewed-but-not-clicked (hard negative) samples are not available to the model, our proposed sampling strategy enhances the quality of negative samples.</p>
          <p>Effects of Cross-System Contrastive Learning. The top-20 performance does not drop much for loss weights up to 0.2 (Fig. 2c), whereas the top-100 performance increases dramatically in the same range relative to a system without Cross-System CL (a weight of 0) (Fig. 2d). This suggests that by incorporating Cross-System CL in our training with a reasonable loss weight, CKMPN is more capable of finding relevant items for users.</p>
        </sec>
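<p>The Soft Distance Correlation constraint described above can be sketched as distance correlation (Székely et al. [59, 60]) computed in a PCA-reduced space that keeps only a ratio of the ℎ dimensions. The variable names and the exact placement of the PCA step are our reading of the text, not the authors’ code.</p>
<p>
```python
import numpy as np

def _dist_centered(x):
    # pairwise Euclidean distances, double-centered
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(a, b):
    A, B = _dist_centered(a), _dist_centered(b)
    dcov2 = max((A * B).mean(), 0.0)        # guard tiny negative rounding
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

def soft_dcorr(a, b, ratio=0.5):
    # PCA via SVD on the stacked embeddings; keep only the top `ratio`
    # of the h dimensions, relaxing the DCorr constraint on the rest.
    x = np.concatenate([a, b], axis=0)
    _, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
    k = max(1, int(ratio * a.shape[1]))
    proj = vt[:k].T
    return distance_correlation(a @ proj, b @ proj)

rng = np.random.default_rng(0)
p1 = rng.standard_normal((16, 8))   # two preference-embedding matrices
p2 = rng.standard_normal((16, 8))
print(round(float(soft_dcorr(p1, p2, ratio=0.5)), 4))
```
</p>
<p>With ratio = 1 this reduces to the standard DCorr loss over all dimensions; with ratio = 0.5 only half of the dimensions are constrained, matching the peak setting reported above.</p>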
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>As shown in Table 1 (bottom), the same performance
boost is observed for KMPN relative to baselines. For
example, KMPN achieves 0.1434 Recall@20 and 0.1073
ndcg@20, higher than the 0.1403 Recall@20 and
0.1006 ndcg@20 of the baselines. CKMPN again achieves
the best performance by incorporating content-based
features from NRMS-BERT. It outperforms KMPN in all
metrics, with particularly significant improvements in ndcg@100
(from 0.1367 to 0.1482) and Hit Ratio@100 (from 0.3602
to 0.3668). Therefore, we can conclude that
our method is applicable to multiple different datasets.</p>
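<p>The Recall@K and ndcg@K figures above can be made concrete with a minimal per-user computation (a sketch; the ranked list and item ids are hypothetical, and the binary-relevance nDCG form is assumed):</p>
<p>
```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    # fraction of the user's held-out items that appear in the top-k list
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # binary-relevance nDCG: discount each hit by its rank position
    dcg = sum(1.0 / np.log2(i + 2) for i, it in enumerate(ranked[:k]) if it in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

ranked = [3, 1, 7, 5, 2]     # a model's ranked recommendation list
relevant = {1, 2}            # the user's held-out interactions
print(recall_at_k(ranked, relevant, 5))   # 1.0: both relevant items in top-5
print(round(ndcg_at_k(ranked, relevant, 5), 4))
```
</p>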
      <p>We present KMPN, a powerful KG-based CF model that
outperforms strong baseline models. To investigate the
complementary power of unstructured content-based
information, we further propose a novel approach,
Cross-System Contrastive Learning, which combines CF and
CBF, two distinct paradigms, to achieve a substantial
improvement relative to models in the literature. This suggests
that KG-based CF models can benefit from the
incorporation of unstructured content information derived from
Transformers.</p>
      <p>Table 3: Case study for a user who has browsed the movie Tenet
(2020). Source Code (2011) has a similar genre, while Dunkirk
(2017) has the same director. Y/N: whether or not the movie
appears in the top-100 recommendation list of the models.
NRMS: NRMS-BERT; MoE: Mixture of Expert.
Item | KMPN | NRMS | MoE | CKMPN
Source Code (2011) | N | Y | N | Y
Dunkirk (2017) | Y | N | N | Y</p>
      <p>Our proposed CKMPN has thus far achieved
substantial improvements on both datasets, especially on
top-60/100 metrics. Industrial recommender systems
usually follow a two-step pipeline in which a relatively large
number of items (e.g., the top 60 or 100) is first recalled by a Recall
Model, and a Ranking Model is then adopted to refine the
list ranking. This improvement presents more relevant
items in the relatively coarse Recall output, which is
appealing to industrial applications. Also, CKMPN is much
preferred over the Mixture of Expert model in
industrial applications, since it still produces independent
user/item representations. This feature enables the fast
and efficient matching of users and items in hidden space
with low query time complexity [66].</p>
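<p>The serving path enabled by independent user/item representations can be sketched as a single matrix–vector scoring step. This exact search is what an ANN index such as HNSW [66] approximates with much faster queries; the embeddings below are random stand-ins, not learned representations.</p>
<p>
```python
import numpy as np

rng = np.random.default_rng(0)
item_emb = rng.standard_normal((1000, 32))   # precomputed item representations
user_emb = rng.standard_normal(32)           # one user's representation

scores = item_emb @ user_emb                 # one matrix-vector product
top100 = np.argsort(-scores)[:100]           # coarse Recall-stage candidates

print(len(top100))
```
</p>
<p>An MoE-style model cannot be served this way, because its final rating exists only after both systems have scored every candidate pair.</p>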
      <p>An example output of systems is presented in
Table 3. Y/N indicates whether or not the movie appears
in the top-100 recommendation list of the four models
(KMPN/NRMS-BERT/Mixture of Expert (MoE)/CKMPN).</p>
      <p>This user has browsed Tenet (2020), directed by
Christopher Nolan. The movie Source Code (2011) and Tenet
are both about time travel, but they have quite
different film crews. As a result, Source Code was considered
positive by NRMS-BERT, which evaluates the movie
description, but was considered negative by the KG-based
KMPN. Combining the scores of both systems, MoE did
not recommend the movie. However, CKMPN
compensated for the failure of KMPN and gave a high score to
this movie, by learning a content-aware item
representation based on the representation of NRMS-BERT through
Cross-System CL. In contrast, Dunkirk (2017) is about
war and history, which is not the same topic as Tenet.</p>
      <p>However, since they were directed by the same
director, KMPN and CKMPN both recommended this movie,
while MoE’s prediction was negatively affected by
NRMS-BERT. This case study suggests that our Cross-System
CL approach is an effective in-depth collaboration of the two
systems, outperforming the direct mixture of KMPN and
NRMS-BERT.</p>
      <p>We also present the model performance on the
cold-start test set of the Movie-KG-dataset, where users are
completely unseen in training. As shown in the
last section of Table 1 (bottom), our best model CKMPN
still achieved good performance for unseen users on all
metrics, e.g., 0.1024 Recall@20 and 0.3380 Hit
Ratio@100. The performance did not deteriorate much
from the standard test set, showing that our model still
functions in the cold-start setting.</p>
      <p>…resentation learning for knowledge graph, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 4–17.
[28] H. Wang, F. Zhang, M. Zhao, W. Li, X. Xie, M. Guo, Multi-task feature learning for knowledge graph enhanced recommendation, in: The World Wide Web Conference, 2019, pp. 2000–2010.
[29] F. Zhang, N. J. Yuan, D. Lian, X. Xie, W.-Y. Ma, Collaborative knowledge base embedding for recommender systems, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 353–362.
[30] Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: The world wide web conference, 2019, pp. 151–161.
[31] B. Hu, C. Shi, W. X. Zhao, P. S. Yu, Leveraging meta-path based context for top-n recommendation with a neural co-attention model, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2018, pp. 1531–1540.
[32] J. Jin, J. Qin, Y. Fang, K. Du, W. Zhang, Y. Yu, Z. Zhang, A. J. Smola, An efficient neighborhood-based interaction model for recommendation on heterogeneous graph, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 75–84.
[33] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, J. Han, Personalized entity recommendation: A heterogeneous information network approach, in: Proceedings of the 7th ACM international conference on Web search and data mining, 2014, pp. 283–292.
[34] H. Zhao, Q. Yao, J. Li, Y. Song, D. L. Lee, Meta-graph based recommendation fusion over heterogeneous information networks, in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 635–644.
[35] H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, Z. Wang, Knowledge-aware graph neural networks with label smoothness regularization for recommender systems, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2019, pp. 968–977.
[36] H. Wang, M. Zhao, X. Xie, W. Li, M. Guo, Knowledge graph convolutional networks for recommender systems, in: The World Wide Web Conference, WWW ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 3307–3313. URL: https://doi.org/10.1145/3308558.3313417. doi:10.1145/3308558.3313417.
[37] X. Wang, X. He, Y. Cao, M. Liu, T.-S. Chua, Kgat: Knowledge graph attention network for recommendation, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2019, pp. 950–958.
[38] Z. Wang, G. Lin, H. Tan, Q. Chen, X. Liu, Ckan: Collaborative knowledge-aware attentive network for recommender systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 219–228.
[39] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph Attention Networks, International Conference on Learning Representations (2018).
[41] X. Wang, T. Huang, D. Wang, Y. Yuan, Z. Liu, X. He, T.-S. Chua, Learning intents behind interactions with knowledge graph for recommendation, in: Proceedings of the Web Conference 2021, 2021, pp. 878–887.
[42] J. Liu, P. Dolan, E. R. Pedersen, Personalized news recommendation based on click behavior, in: Proceedings of the 15th international conference on Intelligent user interfaces, 2010, pp. 31–40.
[43] C. Wu, F. Wu, S. Ge, T. Qi, Y. Huang, X. Xie, Neural news recommendation with multi-head self-attention, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6389–6394. URL: https://aclanthology.org/D19-1671. doi:10.18653/v1/D19-1671.
[44] S. Okura, Y. Tagami, S. Ono, A. Tajima, Embedding-based news recommendation for millions of users, in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 1933–1942.
[45] J. Lian, F. Zhang, X. Xie, G. Sun, Towards better representation learning for personalized news recommendation: a multi-channel deep fusion approach, in: IJCAI, 2018, pp. 3805–3811.
[46] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, Npa: neural news recommendation with personalized attention, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2019, pp. 2576–2584.
[47] D. Liu, J. Lian, S. Wang, Y. Qiao, J.-H. Chen, G. Sun, X. Xie, Kred: Knowledge-aware document representation for news recommendations, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 200–209.
[48] H. Wang, F. Zhang, X. Xie, M. Guo, Dkn: Deep knowledge-aware network for news recommendation, in: Proceedings of the 2018 world wide web conference, 2018, pp. 1835–1844.
[49] S. H. Choi, Y.-S. Jeong, M. K. Jeong, A hybrid recommendation method with reduced data for large-scale application, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (2010) 557–566.
[50] L. M. De Campos, J. M. Fernández-Luna, J. F. Huete, M. A. Rueda-Morales, Combining content-based and collaborative recommendations: A hybrid approach based on bayesian networks, International journal of approximate reasoning 51 (2010) 785–799.
[51] D. Billsus, M. J. Pazzani, J. Chen, A learning agent for wireless news access, in: Proceedings of the 5th international conference on Intelligent user interfaces, 2000, pp. 33–36.
[52] M. Ghazanfar, A. Prugel-Bennett, Building switching hybrid recommender system using machine learning classifiers and collaborative filtering, IAENG International Journal of Computer Science 37 (2010).
[53] J. M. Noguera, M. J. Barranco, R. J. Segura, L. Martínez, A mobile 3d-gis hybrid recommender system for tourism, Information Sciences 215 (2012) 37–52.
[54] A. S. Lampropoulos, P. S. Lampropoulou, G. A. Tsihrintzis, A cascade-hybrid music recommender system for mobile services based on musical genre classification and personality diagnosis, Multimedia Tools and Applications 59 (2012) 241–258.
[55] I. A. Christensen, S. N. Schiafino, A hybrid approach for group profiling in recommender systems (2014).
[56] P. Bedi, P. Vashisth, P. Khurana, et al., Modeling user preferences in a hybrid recommender system using type-2 fuzzy sets, in: 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2013, pp. 1–8.
[57] R. J. Mooney, L. Roy, Content-based book recommending using learning for text categorization, in: Proceedings of the fifth ACM conference on Digital libraries, 2000, pp. 195–204.
[58] X. Li, T. Murata, Multidimensional clustering based collaborative filtering approach for diversified recommendation, in: 2012 7th International Conference on Computer Science &amp; Education (ICCSE), IEEE, 2012, pp. 905–910.
[59] G. J. Székely, M. L. Rizzo, Brownian distance covariance, The annals of applied statistics 3 (2009) 1236–1265.
[60] G. J. Székely, M. L. Rizzo, N. K. Bakirov, Measuring and testing dependence by correlation of distances, The annals of statistics 35 (2007) 2769–2794.
[61] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of educational psychology 24 (1933) 417.
[62] W. Lin, L. Shou, M. Gong, P. Jian, Z. Wang, B. Byrne, D. Jiang, Combining unstructured content and knowledge graphs into recommendation datasets, in: 4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, 2022.
[63] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.
[64] W. Krichene, S. Rendle, On sampled metrics for item recommendation, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 1748–1757.
[65] S. Rendle, Z. Gantner, C. Freudenthaler, L. Schmidt-Thieme, Fast context-aware recommendations with factorization machines, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, Association for Computing Machinery, New York, NY, USA, 2011, p. 635–644. URL: https://doi.org/10.1145/2009916.2010002. doi:10.1145/2009916.2010002.
[66] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE transactions on pattern analysis and machine intelligence 42 (2018) 824–836.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Takács</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pilászy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Németh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <article-title>Scalable collaborative filtering approaches for large recommender systems</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>10</volume>
          (
          <year>2009</year>
          )
          <fpage>623</fpage>
          -
          <lpage>656</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Thorat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goudar</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Barve,</surname>
          </string-name>
          <article-title>Survey on collaborative filtering, content-based filtering and hybrid recommendation system</article-title>
          ,
          <source>International Journal of Computer Applications</source>
          <volume>110</volume>
          (
          <year>2015</year>
          )
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lanning</surname>
          </string-name>
          , et al.,
          <article-title>The netflix prize</article-title>
          ,
          <source>in: Proceedings of KDD cup and workshop</source>
          , volume
          <volume>2007</volume>
          , Citeseer,
          <year>2007</year>
          , p.
          <fpage>35</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Pilászy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <article-title>Recommending new movies: even a few ratings are more valuable than metadata</article-title>
          ,
          <source>in: Proceedings of the third ACM conference on Recommender systems</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graph-based recommender systems</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
          …Electronic Commerce, 2000, pp. 158–167.
          [17] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: A factorization-machine based neural network for ctr prediction, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 1725–1731. URL: https://doi.org/10.24963/ijcai.2017/239. doi:10.24963/ijcai.2017/239.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
          [18] W. Zhang, T. Du, J. Wang, Deep learning over multi-field categorical data, in: European conference on
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          , Matrix factorization information retrieval, Springer,
          <year>2016</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>57</lpage>
          .
          <article-title>techniques for recommender systems</article-title>
          , Computer [19]
          <string-name>
            <surname>H</surname>
            .-T. Cheng, L. Koc,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Harmsen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Shaked</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          Chan42 (
          <year>2009</year>
          )
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          . dra, H. Aradhye, G. Anderson,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , W. Chai,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, Bpr: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, AUAI Press, Arlington, Virginia, USA, 2009, p. 452–461.
          …M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah, Wide &amp; deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, Association for Computing Machinery, New York, NY, USA, 2016, p.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, Association for Computing Machinery, New York, NY, USA, 2008, p. 426–434. URL: https://doi.org/10.1145/1401890.1401944. doi:10.1145/1401890.1401944.
          …7–10. URL: https://doi.org/10.1145/2988450.2988454. doi:10.1145/2988450.2988454.
          [20] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, J. Wang, Product-based neural networks for user response prediction, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 1149–1154.
          [21] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <article-title>Factorization machines with libfm, ACM Neural collaborative filtering</article-title>
          ,
          <source>in: Proceedings of Transactions on Intelligent Systems and Technol- the 26th international conference on world wide ogy (TIST) 3</source>
          (
          <issue>2012</issue>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          . web,
          <year>2017</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          , T.-S. Chua, Neural factorization ma- [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Eksombatchai</surname>
          </string-name>
          , W. L.
          <article-title>chines for sparse predictive analytics</article-title>
          , in: Pro- Hamilton, J. Leskovec,
          <article-title>Graph convolutional neuceedings of the 40th International ACM SIGIR ral networks for web-scale recommender systems</article-title>
          , Conference on Research and Development in In- in
          <source>: Proceedings of the 24th ACM SIGKDD Internaformation Retrieval</source>
          , SIGIR '17,
          <string-name>
            <surname>Association</surname>
            <given-names>for</given-names>
          </string-name>
          <source>tional Conference on Knowledge Discovery &amp; Data Computing Machinery</source>
          , New York, NY, USA,
          <year>2017</year>
          , Mining,
          <year>2018</year>
          , pp.
          <fpage>974</fpage>
          -
          <lpage>983</lpage>
          . p.
          <fpage>355</fpage>
          -
          <lpage>364</lpage>
          . URL: https://doi.org/10.1145/3077136. [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , M. Wang,
          <volume>3080777</volume>
          . doi:
          <volume>10</volume>
          .1145/3077136.3080777.
          <string-name>
            <surname>Lightgcn</surname>
          </string-name>
          <article-title>: Simplifying and powering graph convolu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Oentaryo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-P.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Fine- tion network for recommendation</article-title>
          , in: Proceedings gold,
          <article-title>Predicting response in mobile advertising of the 43rd International ACM SIGIR conference on with hierarchical importance-aware factorization research and development in Information Retrieval, machine</article-title>
          ,
          <source>in: Proceedings of the 7th ACM interna- 2020</source>
          , pp.
          <fpage>639</fpage>
          -
          <lpage>648</lpage>
          .
          <article-title>tional conference on Web search and data mining</article-title>
          , [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chicaiza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Valdiviezo-Diaz</surname>
          </string-name>
          ,
          <source>A comprehensive</source>
          <year>2014</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>132</lpage>
          .
          <article-title>survey of knowledge graph-based recommender</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Verstrepen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          ,
          <article-title>Unifying nearest neigh- systems: Technologies, development, and contribubors collaborative filtering</article-title>
          ,
          <source>in: Proceedings of the tions, Information</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>232</fpage>
          . 8th ACM Conference on Recommender systems, [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <year>2014</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . O.
          <string-name>
            <surname>Yakhnenko</surname>
          </string-name>
          ,
          <article-title>Translating embeddings for mod-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , G. Karypis,
          <article-title>Item-based top-n rec- eling multi-relational data, Advances in neural ommendation algorithms</article-title>
          ,
          <source>ACM Transactions on information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
          <source>Information Systems - TOIS 22</source>
          (
          <year>2004</year>
          )
          <fpage>143</fpage>
          -
          <lpage>177</lpage>
          . [26]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Knowledge doi:
          <volume>10</volume>
          .1145/963770.963776.
          <article-title>graph embedding by translating on hyperplanes</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sarwar</surname>
          </string-name>
          , G. Karypis,
          <string-name>
            <given-names>J.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          , Analy-
          <source>Proceedings of the AAAI Conference on Artificial sis of recommendation algorithms for e-commerce, Intelligence</source>
          , volume
          <volume>28</volume>
          ,
          <year>2014</year>
          .
          <source>in: Proceedings of the 2nd ACM Conference on</source>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Text-enhanced rep-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>