Transformer-Empowered Content-Aware Collaborative Filtering

Weizhe Lin1,†, Linjun Shou2, Ming Gong2, Pei Jian3, Zhilin Wang4, Bill Byrne1 and Daxin Jiang2

1 Department of Engineering, University of Cambridge, Cambridge, United Kingdom
2 Microsoft STCA, Beijing, China
3 Simon Fraser University, British Columbia, Canada
4 University of Washington, Seattle, United States

4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18–23, 2022, Seattle, WA, USA.
† This work was done during Weizhe Lin's internship at Microsoft STCA.
wl356@cam.ac.uk (W. Lin); lisho@microsoft.com (L. Shou); migon@microsoft.com (M. Gong); jpei@cs.sfu.ca (P. Jian); zhilinw@uw.edu (Z. Wang); bill.byrne@eng.cam.ac.uk (B. Byrne); djiang@microsoft.com (D. Jiang)

Abstract
Knowledge graph (KG) based Collaborative Filtering (CF) is an effective approach to personalizing recommender systems for relatively static domains such as movies and books, leveraging structured information from the KG to enrich both item and user representations. This paper investigates the complementary power of unstructured content information (e.g. rich summary texts of items) in KG-based CF recommender systems. We introduce the Content-aware KG-enhanced Meta-preference Network, which enhances CF recommendation with both structured information from the KG and unstructured content features from Transformer-empowered content-based filtering (CBF). Within this modeling framework, we demonstrate a powerful KG-based CF model and a CBF model (a variant of the well-known NRMS system), and employ a novel training scheme, Cross-System Contrastive Learning, to address the inconsistency of the two very different systems in fusing information. We present experimental results showing that enhancing collaborative filtering with Transformer-based features derived from content-based filtering yields consistent improvements over strong baseline systems, improving the ability of KG-based CF systems to exploit item content information.

Keywords
Knowledge graph, recommender systems, collaborative filtering

1. Introduction

Collaborative Filtering (CF) and Content-based Filtering (CBF) are two leading recommendation techniques [1]. CF systems study users' interactions in order to leverage inter-item, inter-user, or user-item dependencies in making recommendations. The underlying notion is that users who interact with similar sets of items are likely to share preferences for other items. CBF models instead leverage descriptive attributes of items (e.g. item description and category) and users (e.g. age and gender), characterizing users by the content information available in their browsing histories [2]. CBF is particularly well-suited to news recommendation, where millions of new items are produced every day. In contrast, CF systems are better suited to scenarios where the inventory of items grows slowly and abundant user-item interactions are available. Movie and book recommender systems are examples of such scenarios and serve as the focus of this paper.

In the Netflix Prize competition (2006-2009) [3], CF features (ratings and user-item interactions) were shown to be more valuable than CBF features (e.g. movies' metadata) in recommendation [4]. However, recent work has shown that CF systems can benefit from the incorporation of external knowledge graphs (KGs) to enrich user/item representations with structured CBF features [5]. Knowledge graphs consist of knowledge triplets; each triplet has a head entity, a tail entity, and a link that describes their relationship, e.g. [Christopher Nolan] - [director] - [Dunkirk (movie)]. KG-based CF models are particularly good at linking items to related knowledge graph entities that serve as "item properties". This approach leverages the structured content information of KGs (e.g. movie genre and actors) to complement CF features.
While KGs readily incorporate structured content information and external knowledge, unstructured content, such as item descriptions, is largely unexploited. Recent Transformer-based models such as BERT [6] and GPT-2 [7] have shown great power in modeling descriptive natural-language content, which offers new opportunities to enrich item/user representations with more expressive CBF features derived from Transformers. For example, the two movies "Interstellar" and "Inception" have a very similar set of structured properties, including genre, writer, and director, but their descriptions provide more fine-grained discriminative information, making it clear that one is about physics and the universe and the other about adventures and dreams.

Therefore, in this work, we offer insights into the complementary power of unstructured CBF features derived from Transformers (e.g. summary texts of books and movies). We investigate how these content-aware CBF features can be effectively fused to complement CF learning, and how much value they can add to standard large-scale KG-based CF recommender systems.

However, computationally efficient approaches to enriching KG-based CF models with unstructured Transformer-derived CBF features are not yet well addressed in the literature. The challenge mainly stems from the need to capture the co-occurrence of graph node features by graph convolution operations. Graph convolution requires the representations of graph nodes to be back-propagated and updated after each forward pass, which is prohibitively costly for large graphs where millions of item/user nodes would each require Transformer-generated embeddings.
Therefore, using pre-extracted features from trained CBF systems is the most promising option. However, conventional fusion schemes (such as Mixture of Experts and early/late fusion) prove vulnerable in our experiments (see Sec. 4.4). We address this problem by introducing Cross-System Contrastive Learning, which brings together the benefits of both structured and unstructured item properties. In this paper:

1. We introduce a powerful KG-based CF model (KMPN) that outperforms strong baselines, and demonstrate the improvement brought by each system component. We also introduce a Transformer-empowered CBF model (NRMS-BERT) that achieves good recommendation performance with only the summary texts of books and movies.
2. We propose to merge unstructured content-based features into KG-based CF through a simple but effective fusion framework based on Cross-System Contrastive Learning.
3. Based on two realistic recommendation datasets, we present extensive experiments showing the value of incorporating unstructured CBF features derived from Transformers.

2. Related Work

Collaborative Filtering. Traditional CF models rely on Matrix Factorization (MF) [8, 9, 10] and Factorization Machines (FM) [11, 12, 13] to learn user-item representations. Nearest-neighbour approaches are also prominent in CF, where user-item ratings are interpolated from the ratings of similar items and users [14, 15, 16]. Recent models incorporate Deep Neural Networks (DNNs) in learning [17, 12, 18, 19, 20, 21]. Building upon graph-based CF models [22, 23], KG-based CF models fuse external knowledge from auxiliary KGs to improve both the accuracy and explainability of recommendation [5, 24]. Items in interaction graphs are associated with auxiliary KG entities with respect to their attributes (e.g. movie directors).

To exploit the KGs, Embedding-based Methods employ KG embedding methods (e.g. TransE [25], TransH [26] and TransR [27]) to enhance item representations with KG-aware entity embeddings [28, 29, 30]. For example, KTUP [30] trains item representations and TransH-powered KG completion simultaneously. Path-based Methods follow meta-paths manually designed by domain experts to make KG-path-aware recommendations [31, 32, 33, 34], which is, however, not feasible for larger KGs with their enormous entity and path diversity. Convolution Methods [35, 36, 32, 37, 38] design convolution mechanisms, mostly variants of Graph Neural Networks (GNNs) [39, 40], to enhance item/user representations with features aggregated from distant entities. KGIN [41] further embeds KG-relational embeddings in inter-node feature passing to achieve path-aware graph convolution.

Content-based Filtering. CBF models match items to a user by considering the metadata (content-based information) of items with which the user has interacted [42, 43, 44, 45, 46]. Most research in KG-based CBF, a recently popular topic, focuses on enhancing item representations with KG embeddings by mapping relevant KG entities to the content of items, e.g. by entity linking [47, 48]. However, these methods rely heavily on word-level mapping to KG entities, which is not viable for movies/books, since their descriptions mostly consist of imaginary content such as character names and fictional stories.

Fusing CF and CBF. Hybrid CF-CBF systems are often built by weighting/combining [49, 50] or switching [51, 52, 53] between the ranking outputs of the two systems. They can also pass a relatively coarse ranking list produced by one system to the other for refinement [54, 55]. The features derived from one system can also complement the other system by fusing with the output features (late fusion) [56] or augmenting the user/item input features (early fusion) [57, 58]. For example, CKE [29] produces augmented item representations by obtaining fixed textual features from unsupervised denoising auto-encoders. In contrast, we introduce NRMS-BERT to obtain more expressive textual item representations with supervised training and larger language models. Furthermore, these conventional fusion approaches (including late/early fusion and Mixture of Experts) fail to perform well in our experiments (Sec. 4.4). We address this by proposing a novel training scheme based on contrastive learning that complements a KG-based CF model with these Transformer-based representations.
Figure 1: Framework pipeline. (1) KMPN: leverages meta-preferences to model users from knowledge graph entities and interacted items; (2) Soft Distance Correlation: encourages preference embeddings to separate at low dimensions; (3) NRMS-BERT: extracts content-based features; (4) Cross-System Contrastive Learning: encourages user/item embeddings to learn mutual information from content-based representations; (5) Rating: uses the dot product of KMPN user/item features.

3. Methodology

3.1. Data Notation

There are $N_u$ users $\{u | u \in \mathcal{S}_U\}$ and $N_i$ items $\{i | i \in \mathcal{S}_I\}$. $\mathcal{P}^+ = \{(u, i) | u \in \mathcal{S}_U, i \in \mathcal{S}_I\}$ is the set of user interactions, where each $(u, i)$ pair indicates that user $u$ interacted with item $i$. Each item $i \in \mathcal{S}_I$ carries unstructured data $\mathbf{x}_i$, e.g. a text description of the item.

The KG contains structured information that describes relations between real-world entities. The KG is represented as a weighted heterogeneous graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with a node set $\mathcal{V}$ consisting of $N_v$ nodes $\{v\}$ and an edge set $\mathcal{E}$ containing all edges between nodes. The graph is also associated with a relation type mapping function $\phi: \mathcal{E} \rightarrow \mathcal{R}$ that maps each edge to a type in the relation set $\mathcal{R}$ consisting of $N_r$ relations. Note that all items are included in the KG: $\mathcal{S}_I \subset \mathcal{V}$.

The edges of the knowledge graph are triplets $\mathcal{T} = \{(h, r, t) | h, t \in \mathcal{V}, r \in \mathcal{R}\}$, where $\mathcal{V}$ is the collection of graph entities/nodes and $\mathcal{R}$ is the relation set. Each triplet describes a head entity $h$ connected to a tail entity $t$ by the relation $r$. For example, (The Shining, film.film.actor, Jack Nicholson) specifies that Jack Nicholson is a film actor in the movie "The Shining". To fully expose the relationships between heads and tails, the relation set is extended with reversed relation types, i.e., for any $(h, r, t)$ triplet we allow the inverse connection $(t, r', h)$ to be built, where $r'$ is the reverse of $r$. The edge set $\mathcal{E}$ is derived from these triplets.

3.2. KG-Enhanced Meta-Preference Network (KMPN)

This section introduces the KG-enhanced Meta-Preference Network (hereafter KMPN). It is a KG-based CF model that efficiently aggregates the features of all KG entities into items by exploiting relationships in the KG, and then links item features to users for recommendation, as shown in Fig. 1 (1), (2), and (5).

3.2.1. Gated Path Graph Convolution Networks

Associated with each KG node $v_i$ is a feature vector $\mathbf{e}_i^{(0)} \in \mathbb{R}^h$. Each relation type $r \in \mathcal{R}$ is also associated with a relational embedding $\mathbf{e}_r$. A Gated Path Graph Convolution Network is a cascade of $L$ convolution layers. For each KG node, a convolution layer aggregates features from its neighbors as follows:

$$ \mathbf{e}_i^{(l+1)} = \frac{1}{|\mathcal{N}_i|} \sum_{\{v_j | (v_i, r_{ij}, v_j) \in \mathcal{T}\}} \gamma_{ij} \, \mathbf{e}_{r_{ij}} \odot \mathbf{e}_j^{(l)}, \qquad (1) $$

where the neighbouring set of $i$ is $\mathcal{N}_i = \{v_j | (v_i, r_{ij}, v_j) \in \mathcal{T}\}$, $r_{ij}$ is the type of the relation from $v_i$ to $v_j$, and $\gamma_{ij}$ is a gate that controls the messages flowing from $v_j$ to $v_i$:

$$ \gamma_{ij} = \sigma(\mathbf{e}_i^\top \mathbf{e}_{r_{ij}}), \qquad (2) $$

where $\sigma(\cdot)$ is a sigmoid function that limits the gated value to between 0 and 1. As a result, the message passed to a node is weighted by its importance to the receiving node and by the relation type. Through stacking multiple layers of convolution, the final embedding at a node depends on the paths along which features are shared, as well as on the importance of the messages being transmitted. To overcome the over-smoothing issue of graph convolutions, the embedding at a KG node after $l$ convolutions is an aggregation of all the intermediate output embeddings: $\mathbf{e}_i^l = \sum_{l'=0}^{l} \mathbf{e}_i^{(l')}$.
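For concreteness, Eqs. (1) and (2) can be sketched in PyTorch as follows. This is a minimal illustration under our own assumptions (an edge-list layout with index tensors for heads, relations, and tails); it is not the authors' released implementation.

```python
import torch

def gated_path_conv(node_emb, rel_emb, heads, rels, tails):
    """One Gated Path Graph Convolution layer (Eqs. 1-2).

    node_emb: (N, h) node embeddings e^(l); rel_emb: (N_r, h) relation
    embeddings e_r; heads/rels/tails: (E,) long tensors, one entry per
    KG triplet (v_i, r_ij, v_j) with v_i the receiving node.
    """
    e_r = rel_emb[rels]                                   # e_{r_ij} per edge
    # Eq. (2): gate gamma_ij = sigmoid(e_i^T e_{r_ij})
    gamma = torch.sigmoid((node_emb[heads] * e_r).sum(-1, keepdim=True))
    # Eq. (1): gated, relation-modulated message from v_j to v_i
    msg = gamma * e_r * node_emb[tails]
    out = node_emb.new_zeros(node_emb.shape).index_add_(0, heads, msg)
    deg = node_emb.new_zeros(node_emb.size(0), 1)
    deg.index_add_(0, heads, node_emb.new_ones(heads.size(0), 1))
    return out / deg.clamp(min=1.0)                       # mean over |N_i|
```

Stacking $L$ such layers and summing the intermediate outputs then yields the aggregated node embedding $\mathbf{e}_i^l$ described above.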
3.2.2. User Preference Modeling

Inspired by Wang et al. [41], we model users through a combination of preferences. Wang et al. [41] assumed that each user is influenced by multiple intents and that each intent is influenced by multiple item attributes, such as the combination of the two relation types film.film.director and film.film.genre. Based on this assumption, they proposed to aggregate item embeddings to users through "preferences", with the embedding of each preference modelled by all types of edges: $\mathbf{e}_p = \sum_{r \in \mathcal{R}} \alpha_{rp} \mathbf{e}_r$, where $\alpha_{rp}$ is a Softmax-ed trainable weight and $\mathbf{e}_r$ is the embedding of edge relation type $r$.

We take the view that user preferences are not limited to relations but can be extended to more general cases. We model each preference $p$ through a combination of a set $\mathcal{M}$ of $N_m$ meta-preferences: each meta-preference $m \in \mathcal{M}$ is associated with a trainable embedding $\mathbf{e}_m \in \mathbb{R}^h$, and a preference $p$ is formed from these meta-preferences as follows:

$$ \mathbf{e}_p = \sum_{m \in \mathcal{M}} \beta_{pm} \mathbf{e}_m, \qquad (3) $$

where the linear weights $\{\beta_{pm} | m \in \mathcal{M}\}$ are derived from trainable weights $\{\hat{\beta}_{pm} | m \in \mathcal{M}\}$ for each preference $p$:

$$ \beta_{pm} = \frac{\exp(\hat{\beta}_{pm})}{\sum_{m' \in \mathcal{M}} \exp(\hat{\beta}_{pm'})}. \qquad (4) $$

As a result, meta-preferences reflect the general interests of all users. A particular user is profiled by aggregating the embeddings of interacted items through these preferences:

$$ \mathbf{e}_u^{(l+1)} = \sum_{p \in \mathcal{P}} \alpha_p \sum_{(u,i) \in \mathcal{P}^+} \mathbf{e}_i^{(l)} \odot \mathbf{e}_p, \qquad (5) $$

where $\mathcal{P}$ is the collection of $N_p$ preferences $\{p\}$ and $\alpha_p$ is an attention mechanism that weights the user's interest in the different preferences:

$$ \alpha_p = \frac{\exp(\mathbf{e}_p^\top \mathbf{e}_u^{(l)})}{\sum_{p' \in \mathcal{P}} \exp(\mathbf{e}_{p'}^\top \mathbf{e}_u^{(l)})}. \qquad (6) $$

In summary, each preference is formed from general and diverse meta-preferences, and users are further profiled by multiple preferences that focus on different aspects of item features. As with items, the final user embedding is $\mathbf{e}_u^l = \sum_{l'=0}^{l} \mathbf{e}_u^{(l')}$.

3.2.3. Soft Distance Correlation

Having modelled users through preferences, Wang et al. [41] added a loss that uses Distance Correlation (DCorr) [59, 60] to separate the representations of the learnt preferences as much as possible, in order to obtain diverse proxies bridging users and items. Though the authors demonstrated a considerable improvement over baselines, we take the view that applying this constraint to all dimensions of the preference embeddings restricts their expressiveness, as they are trained to be very dissimilar and to have diverse orientations in latent space.

We adopt a softer approach, the Soft Distance Correlation Loss, which first reduces the dimensionality of the preference embeddings with Principal Component Analysis (PCA) [61] while keeping their most differentiable features, and then applies the distance correlation constraint to encourage diverse expression in the lower dimensions:

$$ \hat{\mathbf{e}}_p = \mathrm{PCA}(\{\mathbf{e}_{p'} | p' \in \mathcal{P}\}) \in \mathbb{R}^{h\epsilon}; \qquad (7) $$

$$ \mathcal{L}_{SoftDCorr} = \sum_{p, p' \in \mathcal{P}, p \neq p'} \frac{\mathrm{DCov}(\hat{\mathbf{e}}_p, \hat{\mathbf{e}}_{p'})}{\sqrt{\mathrm{DVar}(\hat{\mathbf{e}}_p) \cdot \mathrm{DVar}(\hat{\mathbf{e}}_{p'})}}, \qquad (8) $$

where $\epsilon$ controls the ratio of principal components kept after PCA, $\mathrm{DCov}(\cdot)$ computes distance covariance, and $\mathrm{DVar}(\cdot)$ measures distance variance [59, 60]. Setting $\epsilon = 1$ recovers the original DCorr loss proposed in [41]. By encouraging diverse expression only at the lower dimensions, the preferences retain flexibility in the higher dimensions.
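The following is a minimal PyTorch sketch of Eqs. (7) and (8). The use of torch.pca_lowrank for the PCA step, the explicit pairwise loop, and all names are our illustrative choices rather than the paper's implementation.

```python
import torch

def dcov(x, y):
    """Sample distance covariance of two (n, 1) variables (terms of Eq. 8)."""
    def centered(z):
        d = torch.cdist(z, z)                               # pairwise distances
        return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()
    return (centered(x) * centered(y)).mean().clamp(min=0).sqrt()

def soft_dcorr_loss(pref_emb, eps=0.5):
    """Soft Distance Correlation loss (Eqs. 7-8).

    pref_emb: (N_p, h) preference embeddings; eps: ratio of kept components.
    """
    n, h = pref_emb.shape
    k = max(1, min(int(eps * h), n))                        # keep h*eps components
    _, _, v = torch.pca_lowrank(pref_emb, q=k)              # Eq. (7): PCA
    z = (pref_emb - pref_emb.mean(0)) @ v                   # (N_p, k) reduced
    loss = pref_emb.new_zeros(())
    for p in range(n):              # each unordered pair once; Eq. (8)'s
        for q in range(p + 1, n):   # ordered sum is the same up to a factor 2
            x, y = z[p].unsqueeze(1), z[q].unsqueeze(1)     # dims as samples
            denom = (dcov(x, x) * dcov(y, y)).sqrt().clamp(min=1e-9)
            loss = loss + dcov(x, y) / denom
    return loss
```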
3.2.4. Model Optimization with Reciprocal Ratio Negative Sampling (RRNS)

Following common practice, the dot product between user and item embeddings is used as the rating: $\hat{y}_{ui} = (\mathbf{e}_u^L)^\top \mathbf{e}_i^L$.

Neither of the datasets we study provides hard negative samples; i.e., we do not have examples of items with which users chose not to interact. A common practice for synthesizing negative examples is to sample randomly from users' unobserved counterparts $\mathcal{P}^- = \{(u, i^-) | (u, i^-) \notin \mathcal{P}^+\}$. However, an item is not necessarily "not interesting" to a user simply because no interaction happened, as not all items have been viewed. We therefore propose Reciprocal Ratio Negative Sampling (RRNS): items with more user interactions are considered popular and are sampled less frequently, on the assumption that popular items are less likely to be hard negative samples for any user. The sampling distribution is a normalized reciprocal ratio of item interactions:

$$ i^- \sim P(i) \propto \frac{1}{c(i)} \quad \text{for } i \in \mathcal{S}_I, \qquad (9) $$

where $c(i)$ counts the interactions of all users with item $i$.

The training set therefore consists of positive and negative samples: $\mathcal{U} = \{(u, i^+, i^-) | (u, i^+) \in \mathcal{P}^+, (u, i^-) \in \mathcal{P}^-\}$. The pairwise BPR loss [9] is adopted to train the model; it exploits a contrastive learning concept to assign higher scores to users' browsed items than to items in which the users are not interested:

$$ \mathcal{L}_{BPR} = \sum_{(u,i^+,i^-) \in \mathcal{U}} -\ln \sigma(\hat{y}_{ui^+} - \hat{y}_{ui^-}). \qquad (10) $$

Together with the commonly-used embedding L2 regularization and the Soft Distance Correlation loss, the final loss is:

$$ \mathcal{L}_{KMPN} = \mathcal{L}_{BPR} + \lambda_1 \frac{1}{2} ||\Theta||_2^2 + \lambda_2 \mathcal{L}_{SoftDCorr}, \qquad (11) $$

where $\Theta = \{\mathbf{e}_u^L, \mathbf{e}_{i^+}^L, \mathbf{e}_{i^-}^L | (u, i^+, i^-) \in \mathcal{U}\}$, $||\Theta||_2^2$ is the L2-norm of the user/item embeddings, and $\lambda_1$ and $\lambda_2$ are hyperparameters controlling the loss weights.
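Eqs. (9) and (10) are straightforward to render in code. The sketch below is a minimal PyTorch version under our own naming; a real training loop would draw one negative per (u, i+) pair from this distribution.

```python
import torch

def rrns_negatives(interaction_counts, num_samples):
    """Reciprocal Ratio Negative Sampling (Eq. 9): P(i) proportional to 1/c(i)."""
    probs = 1.0 / interaction_counts.float().clamp(min=1.0)  # reciprocal of c(i)
    probs = probs / probs.sum()                              # normalise
    return torch.multinomial(probs, num_samples, replacement=True)

def bpr_loss(y_pos, y_neg):
    """Pairwise BPR loss (Eq. 10): browsed items should outscore negatives."""
    return -torch.log(torch.sigmoid(y_pos - y_neg)).sum()
```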
3.3. Neural Recommendation with Multi-Head Self-Attention

Inspired by NRMS [43], which is powerful in news recommendation, we propose a variant, NRMS-BERT, that further utilizes a fine-tuned Transformer (BERT) to extract contextual information from item descriptions, as shown in Fig. 1 (3).

3.3.1. Item Encoder

The item encoder encodes the text description string $\mathbf{x}_i$ of any item $i \in \mathcal{S}_I$ through BERT into an embedding of size $h$ by extracting the last-layer embedding of the [CLS] token:

$$ \mathbf{e}_i = \mathrm{BERT}(\mathbf{x}_i) \in \mathbb{R}^h. \qquad (12) $$

For each user $u$, the item encoder encodes one positive item $\mathbf{e}_{i^+}$ and $K$ negative items $\mathbf{e}_{i_1^-}, ..., \mathbf{e}_{i_K^-}$. $B$ items are randomly sampled from the user's browsed items $i_{u,1}, ..., i_{u,B}$; these browsed items are encoded and gathered into $\mathbf{E}_u = [\mathbf{e}_{i_{u,1}}, ..., \mathbf{e}_{i_{u,B}}] \in \mathbb{R}^{B \times h}$.

3.3.2. User Encoder

The user encoder uses the items with which a user interacted to produce a content-aware user representation. The final user representation is a weighted sum of the $B$ browsed items:

$$ \mathbf{e}_u = \sum_{b=1}^{B} \alpha_b \mathbf{e}_{i_{u,b}}, \qquad (13) $$

where $\alpha_b$ is the attention weight assigned to $i_{u,b}$, obtained by passing the features through two linear layers:

$$ \alpha_b = \frac{\exp(\hat{A}_b)}{\sum_{b'=1,..,B} \exp(\hat{A}_{b'})}; \qquad (14) $$

$$ \hat{\mathbf{A}} = \tanh(\mathbf{E}_u \mathbf{A}_{fc1} + \mathbf{b}_{fc1}) \mathbf{A}_{fc2} + \mathbf{b}_{fc2} \in \mathbb{R}^{B \times 1}, \qquad (15) $$

where $\mathbf{A}_{fc1} \in \mathbb{R}^{h \times \frac{1}{2}h}$, $\mathbf{b}_{fc1} \in \mathbb{R}^{\frac{1}{2}h}$, $\mathbf{A}_{fc2} \in \mathbb{R}^{\frac{1}{2}h \times 1}$, and $\mathbf{b}_{fc2} \in \mathbb{R}^{1}$ are the weights and biases of the two fully-connected layers, respectively.

3.3.3. Model Optimization

The rating is the dot product of user and item embeddings: $\hat{y}_{ui} = \mathbf{e}_u^\top \mathbf{e}_i$. Assuming the scores of the positive sample and the $K$ negative samples are $\hat{y}^+$ and $\hat{y}_1^-, ..., \hat{y}_K^-$, following [43] the loss is the negative log click probability of item $i$:

$$ \mathcal{L}_{NRMS} = -\sum_{i \in \mathcal{S}_I} \log \left( \frac{\exp(\hat{y}^+)}{\exp(\hat{y}^+) + \sum_{k=1,..,K} \exp(\hat{y}_k^-)} \right). \qquad (16) $$

3.4. Fusing CF and CBF: Content-aware KMPN (CKMPN)

To fuse the information of a CBF model (NRMS-BERT) into a CF model (KMPN), we must bridge some inconsistencies between the two types of models. CBF models that utilize large Transformers cannot be co-optimized with KG-based CF models: graph convolution requires all embeddings to be present before convolution, which would demand enormous GPU memory for even a single forward pass. A more efficient solution is therefore to merge pre-trained CBF features into the training of the KG-based CF component, enriching the learned representations.

In line with our aim of using a CF model for movie and book recommendation, we present a novel and efficient approach for training a better KMPN: Cross-System Contrastive Learning, as shown in Fig. 1 (4). KMPN is still used as the backbone, and it is trained with the aid of a pre-trained NRMS-BERT, requiring no more parameters than KMPN.

In KMPN training, for the users and items in $(u, i^+, i^-) \in \mathcal{U}$, embeddings are generated from NRMS-BERT ($\mathbf{e}_u^{NRMS}$, $\mathbf{e}_{i^+}^{NRMS}$, $\mathbf{e}_{i^-}^{NRMS}$) and from KMPN ($\mathbf{e}_u^{KMPN}$, $\mathbf{e}_{i^+}^{KMPN}$, $\mathbf{e}_{i^-}^{KMPN}$). A Cross-System Contrastive Loss encourages KMPN to incorporate content-sensitive features from the NRMS-BERT features:

$$ \mathcal{L}_{CS} = \sum_{(u,i^+,i^-) \in \mathcal{U}} -\ln \sigma\big((\mathbf{e}_u^{KMPN})^\top (\mathbf{e}_{i^+}^{NRMS} - \mathbf{e}_{i^-}^{NRMS})\big) - \ln \sigma\big((\mathbf{e}_u^{NRMS})^\top (\mathbf{e}_{i^+}^{KMPN} - \mathbf{e}_{i^-}^{KMPN})\big). \qquad (17) $$

This loss encourages KMPN to produce item embeddings that interact not only with KMPN's own user embeddings but also with NRMS-BERT's user embeddings. Similarly, the user embeddings of KMPN are trained to interact with the item embeddings of NRMS-BERT. This allows $\mathbf{e}_i^{KMPN}$ to learn mutual expressiveness with $\mathbf{e}_i^{NRMS}$ without pulling the two embeddings together directly with a similarity measure (e.g. cosine similarity), which we found not to work well (discussed in Sec. 4.4). Here, $\mathbf{e}_u^{NRMS}$ serves as an "anchor" with which the item embeddings of the two systems learn to share common structure and increase their mutuality: the loss encourages $\mathbf{e}_i^{KMPN}$ and $\mathbf{e}_i^{NRMS}$ to lie on the same hidden-space hyperplane, on which features yield the same dot product with $\mathbf{e}_u^{NRMS}$. This constraint encourages KMPN to grow embeddings in the same region of hidden space, leading to mutual expressiveness across the two systems. Finally, the optimization target is:

$$ \mathcal{L}_{CKMPN} = \mathcal{L}_{KMPN} + \lambda_{CS} \mathcal{L}_{CS}, \qquad (18) $$

where $\lambda_{CS}$ controls the weight of the Cross-System Contrastive Loss. This fusion scheme can be applied to any models with similar CF/CBF mechanisms.
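Once both systems' embeddings for a batch of (u, i+, i-) triplets are available, Eq. (17) reduces to a few tensor operations. The following minimal PyTorch sketch assumes the NRMS-BERT embeddings are pre-computed and frozen, as described above; the function and argument names are ours.

```python
import torch

def cross_system_loss(u_kmpn, ip_kmpn, in_kmpn, u_nrms, ip_nrms, in_nrms):
    """Cross-System Contrastive Loss (Eq. 17).

    Each argument is a (B, h) batch of embeddings for the triplets
    (u, i+, i-): users u, positive items i+ and negative items i-,
    taken from KMPN and from the frozen, pre-trained NRMS-BERT.
    """
    # KMPN users rate NRMS-BERT items: positives should beat negatives ...
    a = torch.sigmoid((u_kmpn * (ip_nrms - in_nrms)).sum(-1))
    # ... and NRMS-BERT users rate KMPN items, symmetrically.
    b = torch.sigmoid((u_nrms * (ip_kmpn - in_kmpn)).sum(-1))
    return -(torch.log(a) + torch.log(b)).sum()
```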
Table 1: Model performance on Amazon-Book-Extended (top) and Movie-KG-Dataset (bottom). Underlined numbers (in the original) represent existing state-of-the-art performance, while the best performance of the proposed models is marked in bold. The average of 3 runs is reported to mitigate experimental randomness. Metrics with (*) are significantly higher than KMPN (p < 0.05).

On Amazon-Book-Extended               Recall                   ndcg                     Hit Ratio
                                      @20     @60     @100     @20     @60     @100     @20     @60     @100
BPRMF                                 0.1352  0.2433  0.3088   0.0696  0.0957  0.1089   0.2376  0.3984  0.4816
CKE                                   0.1347  0.2413  0.3070   0.0691  0.0948  0.1081   0.2373  0.3963  0.4800
KGAT                                  0.1527  0.2595  0.3227   0.0807  0.1066  0.1194   0.2602  0.4156  0.4931
KGIN                                  0.1654  0.2691  0.3298   0.0893  0.1145  0.1267   0.2805  0.4289  0.5040
KMPN (ours)                           0.1719  0.2793  0.3405   0.0931  0.1189  0.1315   0.2910  0.4421  0.5166
- w/o Soft DCorr                      0.1704  0.2790  0.3396   0.0924  0.1185  0.1310   0.2881  0.4419  0.5152
- w/o Soft DCorr and RRNS             0.1690  0.2774  0.3391   0.0913  0.1177  0.1302   0.2872  0.4414  0.5155
NRMS-BERT (ours)                      0.1142  0.2083  0.2671   0.0592  0.0817  0.0935   0.2057  0.3487  0.4273
CKMPN (λ_CS = 0.2) (ours)             0.1699  0.2812  0.3461   0.0922  0.1190  0.1319   0.2880  0.4460  0.5235
CKMPN (λ_CS = 0.1) (ours)             0.1718  0.2821* 0.3460*  0.0928  0.1197* 0.1326*  0.2908  0.4474* 0.5244*
Improv. (%) CKMPN vs. Best Baselines  3.90    4.82    4.94     4.31    4.55    4.59     3.72    4.33    4.04

On Movie-KG-Dataset                   Recall                   ndcg                     Hit Ratio
                                      @20     @60     @100     @20     @60     @100     @20     @60     @100
BPRMF                                 0.1387  0.1944  0.2206   0.0961  0.1137  0.1192   0.1980  0.2785  0.3236
CKE                                   0.1369  0.1898  0.2150   0.0940  0.1108  0.1160   0.1950  0.2707  0.3155
KGAT                                  0.1403  0.1928  0.2185   0.1006  0.1173  0.1226   0.1997  0.2742  0.3196
KGIN                                  0.1351  0.2119  0.2445   0.0982  0.1254  0.1322   0.2194  0.3081  0.3643
KMPN (ε = 0.5, N_m = 64) (ours)       0.1434  0.2130  0.2427   0.1073  0.1305  0.1367   0.2193  0.3098  0.3602
NRMS-BERT (ours)                      0.1241  0.1669  0.1890   0.1034  0.1213  0.1257   0.1728  0.2369  0.2773
CKMPN (λ_CS = 0.01) (ours)            0.1457  0.2157  0.2462   0.1149  0.1417  0.1482   0.2266  0.3153  0.3668
CKMPN (ours) (on the cold-start set)  0.1024  0.1741  0.2130   0.0570  0.0729  0.0808   0.1812  0.2839  0.3380

4. Experiments

4.1. Datasets

We use the two datasets introduced in [62]: (1) Amazon-Book-Extended augments the popular Amazon-Book dataset with book descriptions collected from multiple data sources. It contains 70,679 users and 24,915 items, along with a KG of 88,572 nodes and 2,557,746 triplets. (2) Movie-KG-Dataset is a newly collected dataset containing 125,218 users and 50,000 items, with a KG of 250,327 nodes and 12,055,581 triplets. Movie descriptions are provided to enable content-based recommendation.

4.2. Training Details

All experiments were run on 8 NVIDIA A100 GPUs with batch size 8192 × 8 for KMPN/CKMPN and 4 × 8 for NRMS-BERT. Adam [63] is used to optimize the models. KMPN/CKMPN is trained for 2000 epochs with learning rates linearly decayed from 10^-3 to 0 for Amazon-Book-Extended and from 5 × 10^-4 to 0 for Movie-KG-Dataset; training takes 4 hours on Amazon-Book-Extended and 12 hours on Movie-KG-Dataset. NRMS-BERT is trained for 10 epochs at a constant learning rate of 10^-4; training takes 20 hours on Amazon-Book-Extended and 120 hours on Movie-KG-Dataset.

Code and pre-trained models will be released at https://github.com/LinWeizheDragon/Content-Aware-Knowledge-Enhanced-Meta-Preference-Networks-for-Recommendation.

4.3. Evaluation Metrics and Baselines

Following common practice [21, 37, 41, 64], we report the following metrics: (1) Recall@K: within the top-K recommendations, how well the system recalls the test-set browsed items for each user; (2) ndcg@K (Normalized Discounted Cumulative Gain) [64], which increases when relevant items appear earlier in the recommended list; (3) HitRatio@K: how likely a user is to find at least one interesting item in the recommended top-K items.
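For clarity, the three metrics can be computed per user as in the sketch below (our own minimal rendering; an actual evaluation pipeline would vectorize this over all users):

```python
import numpy as np

def rank_metrics(ranked_items, relevant, k=20):
    """Recall@K, ndcg@K and HitRatio@K for a single user.

    ranked_items: item ids sorted by predicted rating (best first);
    relevant: set of the user's test-set browsed items.
    """
    hits = [1.0 if item in relevant else 0.0 for item in ranked_items[:k]]
    recall = sum(hits) / max(len(relevant), 1)
    hit_ratio = 1.0 if any(hits) else 0.0
    # DCG discounts a hit at rank r by 1/log2(r+2); IDCG normalises it.
    dcg = sum(h / np.log2(r + 2) for r, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, ndcg, hit_ratio
```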
We take the performance of several recently published recommender systems as points of comparison¹. We carefully reproduced all these baseline systems from their repositories².

BPRMF [9]: a strong Matrix Factorization (MF) method that applies the generic optimization criterion BPR-Opt for personalized ranking. For reasons of space, other MF models (e.g. FM [65], NFM [12]) are not presented, since BPRMF outperformed them.

CKE [29]: a CF model that leverages heterogeneous information in a knowledge base for recommendation.

KGAT [37]: the Knowledge Graph Attention Network, which explicitly models high-order connectivities in the KG. The model's user/item embeddings were initialized from the pre-trained BPRMF weights.

KGIN [41]: a state-of-the-art KG-based CF model that models users' latent intents (preferences) as combinations of KG relations.

¹ They are also the baseline systems compared in a recent paper [41] (WWW'21).
² As a result, the results reported here may differ from those of the original papers.

4.4. Performance on Amazon Dataset

Comparison with baselines. Model performance is presented in Table 1. Our proposed KG-based CF model, KMPN, achieved a substantial improvement on all metrics over the existing state-of-the-art model KGIN; for example, Recall@20 improved from 0.1654 to 0.1719, Recall@100 from 0.3298 to 0.3405, and ndcg@100 from 0.1267 to 0.1315. All relative improvements mentioned in our discussion are statistically significant (p < 0.05).

NRMS-BERT models user-item preferences using only item summary texts, without external information from a knowledge base. It still achieves 0.1142 Recall@20 and 0.4273 Hit Ratio@100, not far from the KGIN baseline at 0.5040 Hit Ratio@100.

CKMPN further improves all @60/@100 metrics while maintaining the model's @20 performance. For example, at a similar Recall@20, CKMPN (0.3461 Recall@100) outperforms KMPN (0.3405 Recall@100) by 1.6% with statistical significance (p < 0.05). This demonstrates that even though KMPN achieves higher performance than NRMS-BERT, gathering the item and user embeddings of one system (KMPN) with those of the other (NRMS-BERT) through proxies (Cross-System CL) can still encourage KMPN to learn and fuse content-aware information from the learned representations of a CBF model, and so to present more relevant items in the top-100 list.

Comparison with hybrid methods. Conventional feature fusion methods are popular and convenient options for combining one system into the training of another (as surveyed in Sec. 2). In fusing a pre-trained NRMS-BERT with KMPN, we demonstrate the effectiveness of our proposed fusion framework CKMPN by comparing it with these conventional approaches (a minimal sketch of the last of them follows this list):

• Early Fusion: CBF features are concatenated to the trainable user/item embeddings of KMPN before the graph convolution layers.
• Late Fusion: CBF features are fused with the output user/item embeddings of KMPN after the graph convolution layers. Many feature aggregation methods were tried; the best of them are reported in Table 2: (1) concat+linear: CF features are concatenated with CBF features and passed through 3 MLP layers into embeddings of size R^(2×h); (2) MultiHeadAtt: CF and CBF features are passed through 3 Multi-head Self-Attention blocks into embeddings of size R^(2×h).
• Cos-Sim: an auxiliary loss based on cosine similarity is added to training to encourage the user/item embeddings of KMPN to approach those of NRMS-BERT.
• Mixture of Experts (MoE): a hybrid system in which the output scores of the two systems, KMPN and NRMS-BERT, are passed through 3 layers of a Multi-Layer Perceptron (MLP) to obtain the final item ratings.
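As a concrete illustration of the MoE baseline, its score fusion can be sketched as follows. Only the 3-layer MLP structure is taken from the description above; the hidden width and all names are our assumptions.

```python
import torch
import torch.nn as nn

class ScoreMoE(nn.Module):
    """Mixture-of-Experts baseline: the two systems' scalar ratings are
    mixed by a 3-layer MLP, as described in the list above."""
    def __init__(self, hidden=32):                 # hidden width is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, score_kmpn, score_nrms):
        # (B,) scores from each system -> (B, 2) -> fused (B,) rating
        pair = torch.stack([score_kmpn, score_nrms], dim=-1)
        return self.mlp(pair).squeeze(-1)
```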
These feature aggregation approaches do not perform well at fusing pre-trained CBF features into KG-based CF training. (1) The performance of Late Fusion shows that when the already-learned NRMS-BERT item/user embeddings pass through new layers, those layers undo the learned representations of NRMS-BERT and only degrade performance. (2) Cos-Sim shows that the auxiliary cosine-similarity loss places a reliance on NRMS-BERT's features, which damages KMPN training by limiting the expressiveness of KMPN to that of NRMS-BERT: performance drops from 0.2793 (KMPN) to 0.2436 (Cos-Sim) Recall@60.

Though NRMS-BERT alone achieves much lower metrics than KMPN (0.1142 vs 0.1719 Recall@20), MoE, in which the scores of the two systems are merged by MLP layers, achieves 0.1723 Recall@20, showing that the scoring of the two systems is complementary. However, MoE's performance deteriorates at @60/@100. A case study presented later in Sec. 4.6 shows that the score of one system can be extreme enough to overwhelm the final rating under the MoE setting. In contrast, our CKMPN steadily achieves better @60/@100 results than KMPN, showing that our method is an in-depth collaboration of two systems rather than a simple aggregation of system outputs as in MoE.

Table 2: Comparison with conventional feature fusion approaches (Amazon-Book-Extended). R: Recall; HR: Hit Ratio.

Fusion Approach              R@20    R@60    ndcg@60  HR@60
Early Fusion (concat)        0.1661  0.2708  0.1148   0.4299
Late Fusion (concat+linear)  0.1679  0.2769  0.1164   0.4381
Late Fusion (MultiHeadAtt)   0.1692  0.2778  0.1175   0.4385
Cos-Sim                      0.1436  0.2436  0.1026   0.4001
Mixture of Experts           0.1723  0.2791  0.1161   0.4425
CKMPN (ours)                 0.1718  0.2821  0.1197   0.4474

Figure 2: Evaluation of model hyperparameters. (a) Model performance against the number of meta-preferences N_m; (b) model performance against the ratio ε of Soft Distance Correlation (DCorr); (c) Recall@20 and ndcg@20 against the loss weight λ_CS; (d) Recall@100 and ndcg@100 against the loss weight λ_CS.

In conclusion, Cross-System CL significantly enhances KMPN's ability to present more relevant items in the top-100 list through the fusion of unstructured content-based features. It remedies the aforementioned shortcomings of conventional fusion methods by merging features without corrupting the already-learned representations and without directly pulling the two systems' outputs together.
4.5. Contributions of Components

To support the rationale of our design, ablation studies and hyperparameter evaluations explore the effect of each proposed component.

Effects of Meta-Preferences. An important research question is how modeling users through meta-preferences improves performance. As shown in Fig. 2a, removing meta-preference modeling from KMPN (N_m = 0) dramatically decreases performance, showing that modeling users' preferences is necessary. N_m = 16 achieves worse performance than N_m ≥ 32, since a small number of meta-preferences limits the model's capacity for modeling users. Performance on all metrics increases until it peaks at N_m = 64, and then starts to decrease at N_m ≥ 128. This suggests that too many meta-preferences induce overfitting and do not further improve the system. This is a good model property in practice, since a moderate N_m = 64 suffices for the best performance.

Effects of Soft Distance Correlation Loss. The hyperparameter ε controls the number of principal components kept after PCA dimensionality reduction. The lower the ratio, the more dimensions the preference embeddings recover from the standard Distance Correlation (DCorr) constraint. As shown in Fig. 2b, ε = 0 (left) removes the DCorr constraint completely, while ε = 1 (right) reduces to the standard DCorr loss. As ε approaches 0, the DCorr constraint becomes too loose to encourage diversity of preferences, leading to dramatically decreased performance. Performance peaks at ε = 0.5, where half of the h dimensions are relaxed from the standard DCorr constraint while the preference embeddings can still grow diversely in the remaining half. This suggests that our softer version of the DCorr constraint benefits user modeling.

Effects of RRNS. As shown in Table 1, without Reciprocal Ratio Negative Sampling, Recall@20 of KMPN (w/o Soft DCorr) decreases from 0.1704 to 0.1690. In line with our intuition, reducing the probability of sampling popular items as negatives benefits model learning. This demonstrates that, while viewed-but-not-clicked (hard negative) samples are not available, our proposed sampling strategy enhances the quality of negative samples.

Effects of Cross-System Contrastive Learning. Top-20 performance does not drop much for λ_CS ≤ 0.2 (Fig. 2c), whereas top-100 performance increases dramatically for λ_CS ≤ 0.2 relative to a system without Cross-System CL (λ_CS = 0) (Fig. 2d). This suggests that incorporating Cross-System CL in training with a reasonable λ_CS makes CKMPN more capable of finding relevant items for users.

4.6. Performance on Movie-KG-Dataset

As shown in Table 1 (bottom), the same performance boost is observed for KMPN relative to the baselines. For example, KMPN achieves 0.1434 Recall@20 and 0.1073 ndcg@20, higher than the best baseline's 0.1403 Recall@20 and 0.1006 ndcg@20. CKMPN again achieves the best performance by incorporating content-based features from NRMS-BERT. It outperforms KMPN on all metrics, with particularly significant improvements in ndcg@100 (from 0.1367 to 0.1482) and Hit Ratio@100 (from 0.3602 to 0.3668). We can therefore conclude that our method is effective across different datasets.
Table 3: Case study for a user who has browsed the movie Tenet (2020). Source Code (2011) has a similar genre, while Dunkirk (2017) has the same director. Y/N: whether the movie appears in the top-100 recommendation list of each model. NRMS: NRMS-BERT; MoE: Mixture of Experts.

Item                 KMPN  NRMS  MoE  CKMPN
Source Code (2011)   N     Y     N    Y
Dunkirk (2017)       Y     N     N    Y

An example system output is presented in Table 3. Y/N indicates whether the movie appears in the top-100 recommendation list of the four models (KMPN / NRMS-BERT / Mixture of Experts (MoE) / CKMPN). This user browsed Tenet (2020), directed by Christopher Nolan. Source Code (2011) and Tenet are both about time travel, but they have quite different film crews. As a result, Source Code was considered positive by NRMS-BERT, which evaluates the movie description, but negative by the KG-based KMPN. Combining the scores of both systems, MoE did not recommend the movie. CKMPN, however, compensated for the failure of KMPN and gave this movie a high score, having learned a content-aware item representation from NRMS-BERT's representation through Cross-System CL. In contrast, Dunkirk (2017) is about war and history, not the same topic as Tenet; but since the two movies share a director, KMPN and CKMPN both recommended it, while MoE's prediction was negatively affected by NRMS-BERT. This case study suggests that our Cross-System CL approach is an effective in-depth collaboration of two systems, outperforming a direct mixture of KMPN and NRMS-BERT.

We also present model performance on the cold-start test set of the Movie-KG-Dataset, where users are completely unseen in training. As shown in the last row of Table 1 (bottom), our best model CKMPN still achieved good performance for unseen users on all metrics, e.g., 0.1024 Recall@20 and 0.3380 Hit Ratio@100. Performance did not deteriorate much from the standard test set, showing that our model still functions in the cold-start setting.

5. Conclusion

We present KMPN, a powerful KG-based CF model that outperforms strong baseline models. To investigate the complementary power of unstructured content-based information, we further propose a novel approach, Cross-System Contrastive Learning, that combines CF and CBF, two distinct paradigms, to achieve a substantial improvement relative to models in the literature. This suggests that KG-based CF models can benefit from the incorporation of unstructured content information derived from Transformers.

Our proposed CKMPN achieves substantial improvements on both datasets, especially on the top-60/100 metrics. Industrial recommender systems usually follow a 2-step pipeline in which a relatively large set of items (K = 60, 100) is first retrieved by a Recall Model and a Ranking Model then refines the list ranking. Our improvement places more relevant items in the relatively coarse Recall output, which is appealing for industrial applications. CKMPN is also much preferable to the Mixture-of-Experts model in industrial applications, since it still produces independent user/item representations. This enables fast and efficient matching of users and items in hidden space with O(log n) query time complexity [66].

References

[1] G. Takács, I. Pilászy, B. Németh, D. Tikk, Scalable collaborative filtering approaches for large recommender systems, The Journal of Machine Learning Research 10 (2009) 623–656.
[2] P. B. Thorat, R. Goudar, S. Barve, Survey on collaborative filtering, content-based filtering and hybrid recommendation system, International Journal of Computer Applications 110 (2015) 31–36.
[3] J. Bennett, S. Lanning, et al., The Netflix Prize, in: Proceedings of KDD Cup and Workshop, volume 2007, Citeseer, 2007, p. 35.
[4] I. Pilászy, D. Tikk, Recommending new movies: even a few ratings are more valuable than metadata, in: Proceedings of the Third ACM Conference on Recommender Systems, 2009, pp. 93–100.
[5] Q. Guo, F. Zhuang, C. Qin, H. Zhu, X. Xie, H. Xiong, Q. He, A survey on knowledge graph-based recommender systems, IEEE Transactions on Knowledge and Data Engineering (2020).
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[8] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30–37.
[9] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, AUAI Press, Arlington, Virginia, USA, 2009, pp. 452–461.
[10] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 426–434. URL: https://doi.org/10.1145/1401890.1401944. doi:10.1145/1401890.1401944.
[11] S. Rendle, Factorization machines with libFM, ACM Transactions on Intelligent Systems and Technology (TIST) 3 (2012) 1–22.
[12] X. He, T.-S. Chua, Neural factorization machines for sparse predictive analytics, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 355–364. URL: https://doi.org/10.1145/3077136.3080777. doi:10.1145/3077136.3080777.
[13] R. J. Oentaryo, E.-P. Lim, J.-W. Low, D. Lo, M. Finegold, Predicting response in mobile advertising with hierarchical importance-aware factorization machine, in: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 2014, pp. 123–132.
[14] K. Verstrepen, B. Goethals, Unifying nearest neighbors collaborative filtering, in: Proceedings of the 8th ACM Conference on Recommender Systems, 2014, pp. 177–184.
[15] M. Deshpande, G. Karypis, Item-based top-N recommendation algorithms, ACM Transactions on Information Systems (TOIS) 22 (2004) 143–177. doi:10.1145/963770.963776.
[16] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Analysis of recommendation algorithms for e-commerce, in: Proceedings of the 2nd ACM Conference on Electronic Commerce, 2000, pp. 158–167.
[17] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, DeepFM: A factorization-machine based neural network for CTR prediction, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 1725–1731. URL: https://doi.org/10.24963/ijcai.2017/239. doi:10.24963/ijcai.2017/239.
[18] W. Zhang, T. Du, J. Wang, Deep learning over multi-field categorical data, in: European Conference on Information Retrieval, Springer, 2016, pp. 45–57.
[19] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah, Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, Association for Computing Machinery, New York, NY, USA, 2016, pp. 7–10. URL: https://doi.org/10.1145/2988450.2988454. doi:10.1145/2988450.2988454.
[20] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, J. Wang, Product-based neural networks for user response prediction, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 1149–1154.
[21] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 173–182.
[22] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 974–983.
[23] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 639–648.
[24] J. Chicaiza, P. Valdiviezo-Diaz, A comprehensive survey of knowledge graph-based recommender systems: Technologies, development, and contributions, Information 12 (2021) 232.
[25] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, Advances in Neural Information Processing Systems 26 (2013).
[26] Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by translating on hyperplanes, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
[27] Z. Wang, J. Li, Z. Liu, J. Tang, Text-enhanced representation learning for knowledge graph, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 4–17.
[28] H. Wang, F. Zhang, M. Zhao, W. Li, X. Xie, M. Guo, Multi-task feature learning for knowledge graph enhanced recommendation, in: The World Wide Web Conference, 2019, pp. 2000–2010.
[29] F. Zhang, N. J. Yuan, D. Lian, X. Xie, W.-Y. Ma, Collaborative knowledge base embedding for recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 353–362.
[30] Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: The World Wide Web Conference, 2019, pp. 151–161.
[31] B. Hu, C. Shi, W. X. Zhao, P. S. Yu, Leveraging meta-path based context for top-N recommendation with a neural co-attention model, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1531–1540.
[32] J. Jin, J. Qin, Y. Fang, K. Du, W. Zhang, Y. Yu, Z. Zhang, A. J. Smola, An efficient neighborhood-based interaction model for recommendation on heterogeneous graph, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 75–84.
[33] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, J. Han, Personalized entity recommendation: A heterogeneous information network approach, in: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 2014, pp. 283–292.
[34] H. Zhao, Q. Yao, J. Li, Y. Song, D. L. Lee, Meta-graph based recommendation fusion over heterogeneous information networks, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 635–644.
[35] H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, Z. Wang, Knowledge-aware graph neural networks with label smoothness regularization for recommender systems, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 968–977.
[36] H. Wang, M. Zhao, X. Xie, W. Li, M. Guo, Knowledge graph convolutional networks for recommender systems, in: The World Wide Web Conference, WWW '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 3307–3313. URL: https://doi.org/10.1145/3308558.3313417. doi:10.1145/3308558.3313417.
[37] X. Wang, X. He, Y. Cao, M. Liu, T.-S. Chua, KGAT: Knowledge graph attention network for recommendation, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 950–958.
[38] Z. Wang, G. Lin, H. Tan, Q. Chen, X. Liu, CKAN: Collaborative knowledge-aware attentive network for recommender systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 219–228.
[39] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, International Conference on Learning Representations (2018).
[41] X. Wang, T. Huang, D. Wang, Y. Yuan, Z. Liu, X. He, T.-S. Chua, Learning intents behind interactions with knowledge graph for recommendation, in: Proceedings of the Web Conference 2021, 2021, pp. 878–887.
[42] J. Liu, P. Dolan, E. R. Pedersen, Personalized news recommendation based on click behavior, in: Proceedings of the 15th International Conference on Intelligent User Interfaces, 2010, pp. 31–40.
[43] C. Wu, F. Wu, S. Ge, T. Qi, Y. Huang, X. Xie, Neural news recommendation with multi-head self-attention, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6389–6394. URL: https://aclanthology.org/D19-1671. doi:10.18653/v1/D19-1671.
[44] S. Okura, Y. Tagami, S. Ono, A. Tajima, Embedding-based news recommendation for millions of users, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1933–1942.
[45] J. Lian, F. Zhang, X. Xie, G. Sun, Towards better representation learning for personalized news recommendation: a multi-channel deep fusion approach, in: IJCAI, 2018, pp. 3805–3811.
[46] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, NPA: neural news recommendation with personalized attention, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2576–2584.
[47] D. Liu, J. Lian, S. Wang, Y. Qiao, J.-H. Chen, G. Sun, X. Xie, KRED: Knowledge-aware document representation for news recommendations, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 200–209.
[48] H. Wang, F. Zhang, X. Xie, M. Guo, DKN: Deep knowledge-aware network for news recommendation, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1835–1844.
[49] S. H. Choi, Y.-S. Jeong, M. K. Jeong, A hybrid recommendation method with reduced data for large-scale application, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (2010) 557–566.
[50] L. M. De Campos, J. M. Fernández-Luna, J. F. Huete, M. A. Rueda-Morales, Combining content-based and collaborative recommendations: A hybrid approach based on Bayesian networks, International Journal of Approximate Reasoning 51 (2010) 785–799.
[51] D. Billsus, M. J. Pazzani, J. Chen, A learning agent for wireless news access, in: Proceedings of the 5th International Conference on Intelligent User Interfaces, 2000, pp. 33–36.
[52] M. Ghazanfar, A. Prugel-Bennett, Building switching hybrid recommender system using machine learning classifiers and collaborative filtering, IAENG International Journal of Computer Science 37 (2010).
[53] J. M. Noguera, M. J. Barranco, R. J. Segura, L. Martínez, A mobile 3D-GIS hybrid recommender system for tourism, Information Sciences 215 (2012) 37–52.
[54] A. S. Lampropoulos, P. S. Lampropoulou, G. A. Tsihrintzis, A cascade-hybrid music recommender system for mobile services based on musical genre classification and personality diagnosis, Multimedia Tools and Applications 59 (2012) 241–258.
[55] I. A. Christensen, S. N. Schiaffino, A hybrid approach for group profiling in recommender systems (2014).
[56] P. Bedi, P. Vashisth, P. Khurana, et al., Modeling user preferences in a hybrid recommender system using type-2 fuzzy sets, in: 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2013, pp. 1–8.
[57] R. J. Mooney, L. Roy, Content-based book recommending using learning for text categorization, in: Proceedings of the Fifth ACM Conference on Digital Libraries, 2000, pp. 195–204.
[58] X. Li, T. Murata, Multidimensional clustering based collaborative filtering approach for diversified recommendation, in: 2012 7th International Conference on Computer Science & Education (ICCSE), IEEE, 2012, pp. 905–910.
[59] G. J. Székely, M. L. Rizzo, Brownian distance covariance, The Annals of Applied Statistics 3 (2009) 1236–1265.
[60] G. J. Székely, M. L. Rizzo, N. K. Bakirov, Measuring and testing dependence by correlation of distances, The Annals of Statistics 35 (2007) 2769–2794.
[61] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24 (1933) 417.
[62] W. Lin, L. Shou, M. Gong, P. Jian, Z. Wang, B. Byrne, D. Jiang, Combining unstructured content and knowledge graphs into recommendation datasets, in: 4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, 2022.
[63] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.
[64] W. Krichene, S. Rendle, On sampled metrics for item recommendation, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1748–1757.
[65] S. Rendle, Z. Gantner, C. Freudenthaler, L. Schmidt-Thieme, Fast context-aware recommendations with factorization machines, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 635–644. URL: https://doi.org/10.1145/2009916.2010002. doi:10.1145/2009916.2010002.
[66] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 824–836.