Transformer-Empowered Content-Aware Collaborative Filtering

Weizhe Lin1,†, Linjun Shou2, Ming Gong2, Pei Jian3, Zhilin Wang4, Bill Byrne1 and Daxin Jiang2

1 Department of Engineering, University of Cambridge, Cambridge, United Kingdom
2 Microsoft STCA, Beijing, China
3 Simon Fraser University, British Columbia, Canada
4 University of Washington, Seattle, United States

4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18–23, 2022, Seattle, WA, USA.
† This work was done during Weizhe Lin's internship at Microsoft STCA.
wl356@cam.ac.uk (W. Lin); lisho@microsoft.com (L. Shou); migon@microsoft.com (M. Gong); jpei@cs.sfu.ca (P. Jian); zhilinw@uw.edu (Z. Wang); bill.byrne@eng.cam.ac.uk (B. Byrne); djiang@microsoft.com (D. Jiang)

Abstract
Knowledge graph (KG) based Collaborative Filtering (CF) is an effective approach to personalizing recommender systems for relatively static domains such as movies and books, leveraging structured information from the KG to enrich both item and user representations. This paper investigates the complementary power of unstructured content information (e.g. rich summary texts of items) in KG-based CF recommender systems. We introduce the Content-aware KG-enhanced Meta-preference Network, which enhances CF recommendation with both structured information from the KG and unstructured content features from Transformer-empowered content-based filtering (CBF). Within this modeling framework, we demonstrate a powerful KG-based CF model and a CBF model (a variant of the well-known NRMS system), and employ a novel training scheme, Cross-System Contrastive Learning, to address the inconsistency of the two very different systems in fusing information. We present experimental results showing that enhancing collaborative filtering with Transformer-based features derived from content-based filtering yields consistent improvements over strong baseline systems, improving the ability of KG-based CF systems to exploit item content information.

Keywords
Knowledge graph, recommender systems, collaborative filtering

1. Introduction

Collaborative Filtering (CF) and Content-based Filtering (CBF) are two leading recommendation techniques [1]. CF systems study users' interactions in order to leverage inter-item, inter-user, or user-item dependencies in making recommendations. The underlying notion is that users who interact with similar sets of items are likely to share preferences for other items. CBF models instead leverage descriptive attributes of items (e.g. item description and category) and users (e.g. age and gender), characterizing users by the content information available in their browsing histories [2]. CBF is particularly well-suited to news recommendation, where millions of new items are produced every day. In contrast, CF systems are better suited to scenarios where the inventory of items grows slowly and abundant user-item interactions are available. Movie and book recommender systems are examples of such scenarios and serve as the focus of this paper.

In the Netflix Prize competition (2006-2009) [3], CF features (ratings and user-item interactions) were shown to be more valuable than CBF features (e.g. movies' metadata) in recommendation [4]. However, recent work has shown that CF systems can benefit from the incorporation of external knowledge graphs (KGs) to enrich user/item representations with structured CBF features [5]. Knowledge graphs consist of knowledge triplets; each triplet has a head entity, a tail entity, and a link that describes their relationship, e.g. [Christopher Nolan] - [director] - [Dunkirk (movie)]. KG-based CF models are particularly good at linking items to related knowledge graph entities that serve as "item properties". This approach leverages the structured content information of KGs (e.g. movie genre and actors) to complement CF features.
While KGs readily incorporate structured content information and external knowledge, unstructured content, such as item descriptions, is largely unexploited. Recent Transformer-based models such as BERT [6] and GPT-2 [7] have shown great power in modeling descriptive natural-language content, which offers new opportunities to enrich item/user representations with more expressive CBF features derived from Transformers. For example, the two movies "Interstellar" and "Inception" have a very similar set of structured properties, including genre, writer, and director, but their descriptions provide more fine-grained discriminative information, making it clear that one is about physics and the universe and the other about adventures and dreams.

Therefore, in this work, we offer insights into the complementary power of unstructured CBF features derived from Transformers (e.g. summary texts of books and movies). We investigate how these content-aware CBF features can be effectively fused to complement CF learning, and how much value they can add to standard large-scale KG-based CF recommender systems.

However, computationally efficient approaches to enriching KG-based CF models with unstructured Transformer-derived CBF features are not yet well addressed in the literature. The challenge mainly stems from the need to capture the co-occurrence of graph node features by graph convolution operations. Graph convolution requires the representations of graph nodes to be back-propagated and updated after each forward pass, which is prohibitively costly for large graphs where millions of item/user nodes would each require Transformer-generated embeddings.
Therefore, using pre-extracted features from trained CBF systems is the most promising option. However, conventional fusion schemes (such as Mixture of Experts and early/late fusion) prove vulnerable in our experiments (see Sec. 4.4). We address this problem by introducing Cross-System Contrastive Learning, which brings together the benefits of both structured and unstructured item properties. In this paper:

1. We introduce a powerful KG-based CF model (KMPN) that outperforms strong baselines, and demonstrate the improvement brought by each system component. We also introduce a Transformer-empowered CBF model (NRMS-BERT) that achieves good recommendation performance with only the summary texts of books and movies.
2. We propose to merge unstructured content-based features into KG-based CF through a simple but effective fusion framework based on Cross-System Contrastive Learning.
3. Based on two realistic recommendation datasets, we present extensive experiments showing the value of incorporating unstructured CBF features derived from Transformers.

2. Related Work

Collaborative Filtering. Traditional CF models rely on Matrix Factorization (MF) [8, 9, 10] and Factorization Machines (FM) [11, 12, 13] to learn user-item representations. Nearest-neighbour approaches are also prominent in CF, where user-item ratings are interpolated from the ratings of similar items and users [14, 15, 16]. Recent models incorporate Deep Neural Networks (DNNs) in learning [17, 12, 18, 19, 20, 21]. Building upon graph-based CF models [22, 23], KG-based CF models fuse external knowledge from auxiliary KGs to improve both the accuracy and explainability of recommendation [5, 24]. Items in interaction graphs are associated with auxiliary KG entities with respect to their attributes (e.g. movie directors).

To exploit the KGs, Embedding-based Methods employ KG embedding methods (e.g. TransE [25], TransH [26] and TransR [27]) to enhance item representations with KG-aware entity embeddings [28, 29, 30]. For example, KTUP [30] trains item representations and TransH-powered KG completion simultaneously. Path-based Methods follow meta-paths manually designed by domain experts to make KG-path-aware recommendations [31, 32, 33, 34], which is, however, not feasible for larger KGs with their enormous entity and path diversity. Convolution Methods [35, 36, 32, 37, 38] design convolution mechanisms, mostly variants of Graph Neural Networks (GNNs) [39, 40], to enhance item/user representations with features aggregated from distant entities. KGIN [41] further embeds KG-relational embeddings in inter-node feature passing to achieve path-aware graph convolution.

Content-based Filtering. CBF models match items to a user by considering the metadata (content-based information) of items with which the user has interacted [42, 43, 44, 45, 46]. Most research in KG-based CBF, a recently popular topic, focuses on enhancing item representations with KG embeddings by mapping relevant KG entities to the content of items, e.g. by entity linking [47, 48]. However, these methods rely heavily on word-level mapping to KG entities, which is not viable for movies/books, since their descriptions mostly consist of imaginary content such as character names and fictional stories.

Fusing CF and CBF. Hybrid CF-CBF systems are often built by weighting/combining [49, 50] or switching [51, 52, 53] between the ranking outputs of the two systems. They can also pass a relatively coarse ranking list produced by one system to the other for refinement [54, 55]. The features derived from one system can also complement the other system by fusing with the output features (late fusion) [56] or augmenting the user/item input features (early fusion) [57, 58]. For example, CKE [29] produces augmented item representations by obtaining fixed textual features from unsupervised denoising auto-encoders. In contrast, we introduce NRMS-BERT to obtain more expressive textual item representations with supervised training and larger language models. Furthermore, these conventional fusion approaches (including late/early fusion and Mixture of Experts) fail to perform well in our experiments (Sec. 4.4). We address this by proposing a novel training scheme based on contrastive learning that complements a KG-based CF model with these Transformer-based representations.
Figure 1: Framework pipeline. (1) KMPN: leverages meta-preferences to model users from knowledge graph entities and interacted items; (2) Soft Distance Correlation: encourages preference embeddings to separate at low dimensions; (3) NRMS-BERT: extracts content-based features; (4) Cross-System Contrastive Learning: encourages user/item embeddings to learn mutual information from content-based representations; (5) Rating: uses the dot product of KMPN user/item features.

3. Methodology

3.1. Data Notation

There are $N_u$ users $\{u | u \in \mathcal{S}_U\}$ and $N_i$ items $\{i | i \in \mathcal{S}_I\}$. $\mathcal{P}^+ = \{(u, i) | u \in \mathcal{S}_U, i \in \mathcal{S}_I\}$ is the set of user interactions, where each $(u, i)$ pair indicates that user $u$ interacted with item $i$. Each item $i \in \mathcal{S}_I$ carries unstructured data $\mathbf{x}_i$, e.g. a text description of the item.

The KG contains structured information that describes relations between real-world entities. The KG is represented as a weighted heterogeneous graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with a node set $\mathcal{V}$ consisting of $N_v$ nodes $\{v\}$ and an edge set $\mathcal{E}$ containing all edges between nodes. The graph is also associated with a relation type mapping function $\phi: \mathcal{E} \rightarrow \mathcal{R}$ that maps each edge to a type in the relation set $\mathcal{R}$ consisting of $N_r$ relations. Note that all items are included in the KG: $\mathcal{S}_I \subset \mathcal{V}$.

The edges of the knowledge graph are triplets $\mathcal{T} = \{(h, r, t) | h, t \in \mathcal{V}, r \in \mathcal{R}\}$, where $\mathcal{V}$ is the collection of graph entities/nodes and $\mathcal{R}$ is the relation set. Each triplet describes a head entity $h$ connected to a tail entity $t$ by the relation $r$. For example, (The Shining, film.film.actor, Jack Nicholson) specifies that Jack Nicholson is a film actor in the movie "The Shining". To fully expose the relationships between heads and tails, the relation set is extended with reversed relation types, i.e., for any $(h, r, t)$ triplet we allow the inverse connection $(t, r', h)$ to be built, where $r'$ is the reverse of $r$. The edge set $\mathcal{E}$ is derived from these triplets.

3.2. KG-Enhanced Meta-Preference Network (KMPN)

This section introduces the KG-enhanced Meta-Preference Network (hereafter KMPN). It is a KG-based CF model that efficiently aggregates the features of all KG entities into items by exploiting relationships in the KG, and then links item features to users for recommendation, as shown in Fig. 1 (1), (2), and (5).

3.2.1. Gated Path Graph Convolution Networks

Associated with each KG node $v_i$ is a feature vector $\mathbf{e}_i^{(0)} \in \mathbb{R}^h$. Each relation type $r \in \mathcal{R}$ is also associated with a relational embedding $\mathbf{e}_r$. A Gated Path Graph Convolution Network is a cascade of $L$ convolution layers. For each KG node, a convolution layer aggregates features from its neighbors as follows:

$$ \mathbf{e}_i^{(l+1)} = \frac{1}{|\mathcal{N}_i|} \sum_{\{v_j | (v_i, r_{ij}, v_j) \in \mathcal{T}\}} \gamma_{ij} \, \mathbf{e}_{r_{ij}} \odot \mathbf{e}_j^{(l)}, \qquad (1) $$

where the neighbouring set of $i$ is $\mathcal{N}_i = \{v_j | (v_i, r_{ij}, v_j) \in \mathcal{T}\}$, $r_{ij}$ is the type of the relation from $v_i$ to $v_j$, and $\gamma_{ij}$ is a gate that controls the messages flowing from $v_j$ to $v_i$:

$$ \gamma_{ij} = \sigma(\mathbf{e}_i^\top \mathbf{e}_{r_{ij}}), \qquad (2) $$

where $\sigma(\cdot)$ is a sigmoid function that limits the gated value to between 0 and 1. As a result, the message passed to a node is weighted by its importance to the receiving node and by the relation type. Through stacking multiple layers of convolution, the final embedding at a node depends on the paths along which features are shared, as well as on the importance of the messages being transmitted. To overcome the over-smoothing issue of graph convolutions, the embedding at a KG node after $l$ convolutions is an aggregation of all the intermediate output embeddings: $\mathbf{e}_i^l = \sum_{l'=0}^{l} \mathbf{e}_i^{(l')}$.
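For concreteness, Eqs. (1) and (2) can be sketched in PyTorch as follows. This is a minimal illustration under our own assumptions (an edge-list layout with index tensors for heads, relations, and tails); it is not the authors' released implementation.

```python
import torch

def gated_path_conv(node_emb, rel_emb, heads, rels, tails):
    """One Gated Path Graph Convolution layer (Eqs. 1-2).

    node_emb: (N, h) node embeddings e^(l); rel_emb: (N_r, h) relation
    embeddings e_r; heads/rels/tails: (E,) long tensors, one entry per
    KG triplet (v_i, r_ij, v_j) with v_i the receiving node.
    """
    e_r = rel_emb[rels]                                   # e_{r_ij} per edge
    # Eq. (2): gate gamma_ij = sigmoid(e_i^T e_{r_ij})
    gamma = torch.sigmoid((node_emb[heads] * e_r).sum(-1, keepdim=True))
    # Eq. (1): gated, relation-modulated message from v_j to v_i
    msg = gamma * e_r * node_emb[tails]
    out = node_emb.new_zeros(node_emb.shape).index_add_(0, heads, msg)
    deg = node_emb.new_zeros(node_emb.size(0), 1)
    deg.index_add_(0, heads, node_emb.new_ones(heads.size(0), 1))
    return out / deg.clamp(min=1.0)                       # mean over |N_i|
```

Stacking $L$ such layers and summing the intermediate outputs then yields the aggregated node embedding $\mathbf{e}_i^l$ described above.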
3.2.2. User Preference Modeling

Inspired by Wang et al. [41], we model users through a combination of preferences. Wang et al. [41] assumed that each user is influenced by multiple intents and that each intent is influenced by multiple item attributes, such as the combination of the two relation types film.film.director and film.film.genre. Based on this assumption, they proposed to aggregate item embeddings to users through "preferences", with the embedding of each preference modelled by all types of edges: $\mathbf{e}_p = \sum_{r \in \mathcal{R}} \alpha_{rp} \mathbf{e}_r$, where $\alpha_{rp}$ is a Softmax-ed trainable weight and $\mathbf{e}_r$ is the embedding of edge relation type $r$.

We take the view that user preferences are not limited to relations but can be extended to more general cases. We model each preference $p$ through a combination of a set $\mathcal{M}$ of $N_m$ meta-preferences: each meta-preference $m \in \mathcal{M}$ is associated with a trainable embedding $\mathbf{e}_m \in \mathbb{R}^h$, and a preference $p$ is formed from these meta-preferences as follows:

$$ \mathbf{e}_p = \sum_{m \in \mathcal{M}} \beta_{pm} \mathbf{e}_m, \qquad (3) $$

where the linear weights $\{\beta_{pm} | m \in \mathcal{M}\}$ are derived from trainable weights $\{\hat{\beta}_{pm} | m \in \mathcal{M}\}$ for each preference $p$:

$$ \beta_{pm} = \frac{\exp(\hat{\beta}_{pm})}{\sum_{m' \in \mathcal{M}} \exp(\hat{\beta}_{pm'})}. \qquad (4) $$

As a result, meta-preferences reflect the general interests of all users. A particular user is profiled by aggregating the embeddings of interacted items through these preferences:

$$ \mathbf{e}_u^{(l+1)} = \sum_{p \in \mathcal{P}} \alpha_p \sum_{(u,i) \in \mathcal{P}^+} \mathbf{e}_i^{(l)} \odot \mathbf{e}_p, \qquad (5) $$

where $\mathcal{P}$ is the collection of $N_p$ preferences $\{p\}$ and $\alpha_p$ is an attention mechanism that weights the user's interest in the different preferences:

$$ \alpha_p = \frac{\exp(\mathbf{e}_p^\top \mathbf{e}_u^{(l)})}{\sum_{p' \in \mathcal{P}} \exp(\mathbf{e}_{p'}^\top \mathbf{e}_u^{(l)})}. \qquad (6) $$

In summary, each preference is formed from general and diverse meta-preferences, and users are further profiled by multiple preferences that focus on different aspects of item features. As with items, the final user embedding is $\mathbf{e}_u^l = \sum_{l'=0}^{l} \mathbf{e}_u^{(l')}$.

3.2.3. Soft Distance Correlation

Having modelled users through preferences, Wang et al. [41] added a loss that uses Distance Correlation (DCorr) [59, 60] to separate the representations of the learnt preferences as much as possible, in order to obtain diverse proxies bridging users and items. Though the authors demonstrated a considerable improvement over baselines, we take the view that applying this constraint to all dimensions of the preference embeddings restricts their expressiveness, as they are trained to be very dissimilar and to have diverse orientations in latent space.

We adopt a softer approach, the Soft Distance Correlation Loss, which first reduces the dimensionality of the preference embeddings with Principal Component Analysis (PCA) [61] while keeping their most differentiable features, and then applies the distance correlation constraint to encourage diverse expression in the lower dimensions:

$$ \hat{\mathbf{e}}_p = \mathrm{PCA}(\{\mathbf{e}_{p'} | p' \in \mathcal{P}\}) \in \mathbb{R}^{h\epsilon}; \qquad (7) $$

$$ \mathcal{L}_{SoftDCorr} = \sum_{p, p' \in \mathcal{P}, p \neq p'} \frac{\mathrm{DCov}(\hat{\mathbf{e}}_p, \hat{\mathbf{e}}_{p'})}{\sqrt{\mathrm{DVar}(\hat{\mathbf{e}}_p) \cdot \mathrm{DVar}(\hat{\mathbf{e}}_{p'})}}, \qquad (8) $$

where $\epsilon$ controls the ratio of principal components kept after PCA, $\mathrm{DCov}(\cdot)$ computes distance covariance, and $\mathrm{DVar}(\cdot)$ measures distance variance [59, 60]. Setting $\epsilon = 1$ recovers the original DCorr loss proposed in [41]. By encouraging diverse expression only at the lower dimensions, the preferences retain flexibility in the higher dimensions.
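The following is a minimal PyTorch sketch of Eqs. (7) and (8). The use of torch.pca_lowrank for the PCA step, the explicit pairwise loop, and all names are our illustrative choices rather than the paper's implementation.

```python
import torch

def dcov(x, y):
    """Sample distance covariance of two (n, 1) variables (terms of Eq. 8)."""
    def centered(z):
        d = torch.cdist(z, z)                               # pairwise distances
        return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()
    return (centered(x) * centered(y)).mean().clamp(min=0).sqrt()

def soft_dcorr_loss(pref_emb, eps=0.5):
    """Soft Distance Correlation loss (Eqs. 7-8).

    pref_emb: (N_p, h) preference embeddings; eps: ratio of kept components.
    """
    n, h = pref_emb.shape
    k = max(1, min(int(eps * h), n))                        # keep h*eps components
    _, _, v = torch.pca_lowrank(pref_emb, q=k)              # Eq. (7): PCA
    z = (pref_emb - pref_emb.mean(0)) @ v                   # (N_p, k) reduced
    loss = pref_emb.new_zeros(())
    for p in range(n):              # each unordered pair once; Eq. (8)'s
        for q in range(p + 1, n):   # ordered sum is the same up to a factor 2
            x, y = z[p].unsqueeze(1), z[q].unsqueeze(1)     # dims as samples
            denom = (dcov(x, x) * dcov(y, y)).sqrt().clamp(min=1e-9)
            loss = loss + dcov(x, y) / denom
    return loss
```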
3.2.4. Model Optimization with Reciprocal Ratio Negative Sampling (RRNS)

Following common practice, the dot product between user and item embeddings is used as the rating: $\hat{y}_{ui} = (\mathbf{e}_u^L)^\top \mathbf{e}_i^L$.

Neither of the datasets we study provides hard negative samples; i.e., we do not have examples of items with which users chose not to interact. A common practice for synthesizing negative examples is to sample randomly from users' unobserved counterparts $\mathcal{P}^- = \{(u, i^-) | (u, i^-) \notin \mathcal{P}^+\}$. However, an item is not necessarily "not interesting" to a user simply because no interaction happened, as not all items have been viewed. We therefore propose Reciprocal Ratio Negative Sampling (RRNS): items with more user interactions are considered popular and are sampled less frequently, on the assumption that popular items are less likely to be hard negative samples for any user. The sampling distribution is a normalized reciprocal ratio of item interactions:

$$ i^- \sim P(i) \propto \frac{1}{c(i)} \quad \text{for } i \in \mathcal{S}_I, \qquad (9) $$

where $c(i)$ counts the interactions of all users with item $i$.

The training set therefore consists of positive and negative samples: $\mathcal{U} = \{(u, i^+, i^-) | (u, i^+) \in \mathcal{P}^+, (u, i^-) \in \mathcal{P}^-\}$. The pairwise BPR loss [9] is adopted to train the model; it exploits a contrastive learning concept to assign higher scores to users' browsed items than to items in which the users are not interested:

$$ \mathcal{L}_{BPR} = \sum_{(u,i^+,i^-) \in \mathcal{U}} -\ln \sigma(\hat{y}_{ui^+} - \hat{y}_{ui^-}). \qquad (10) $$

Together with the commonly-used embedding L2 regularization and the Soft Distance Correlation loss, the final loss is:

$$ \mathcal{L}_{KMPN} = \mathcal{L}_{BPR} + \lambda_1 \frac{1}{2} ||\Theta||_2^2 + \lambda_2 \mathcal{L}_{SoftDCorr}, \qquad (11) $$

where $\Theta = \{\mathbf{e}_u^L, \mathbf{e}_{i^+}^L, \mathbf{e}_{i^-}^L | (u, i^+, i^-) \in \mathcal{U}\}$, $||\Theta||_2^2$ is the L2-norm of the user/item embeddings, and $\lambda_1$ and $\lambda_2$ are hyperparameters controlling the loss weights.
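Eqs. (9) and (10) are straightforward to render in code. The sketch below is a minimal PyTorch version under our own naming; a real training loop would draw one negative per (u, i+) pair from this distribution.

```python
import torch

def rrns_negatives(interaction_counts, num_samples):
    """Reciprocal Ratio Negative Sampling (Eq. 9): P(i) proportional to 1/c(i)."""
    probs = 1.0 / interaction_counts.float().clamp(min=1.0)  # reciprocal of c(i)
    probs = probs / probs.sum()                              # normalise
    return torch.multinomial(probs, num_samples, replacement=True)

def bpr_loss(y_pos, y_neg):
    """Pairwise BPR loss (Eq. 10): browsed items should outscore negatives."""
    return -torch.log(torch.sigmoid(y_pos - y_neg)).sum()
```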
3.3. Neural Recommendation with Multi-Head Self-Attention

Inspired by NRMS [43], which is powerful in news recommendation, we propose a variant, NRMS-BERT, that further utilizes a fine-tuned Transformer (BERT) to extract contextual information from item descriptions, as shown in Fig. 1 (3).

3.3.1. Item Encoder

The item encoder encodes the text description string $\mathbf{x}_i$ of any item $i \in \mathcal{S}_I$ through BERT into an embedding of size $h$ by extracting the last-layer embedding of the [CLS] token:

$$ \mathbf{e}_i = \mathrm{BERT}(\mathbf{x}_i) \in \mathbb{R}^h. \qquad (12) $$

For each user $u$, the item encoder encodes one positive item $\mathbf{e}_{i^+}$ and $K$ negative items $\mathbf{e}_{i_1^-}, ..., \mathbf{e}_{i_K^-}$. $B$ items are randomly sampled from the user's browsed items $i_{u,1}, ..., i_{u,B}$; these browsed items are encoded and gathered into $\mathbf{E}_u = [\mathbf{e}_{i_{u,1}}, ..., \mathbf{e}_{i_{u,B}}] \in \mathbb{R}^{B \times h}$.

3.3.2. User Encoder

The user encoder uses the items with which a user interacted to produce a content-aware user representation. The final user representation is a weighted sum of the $B$ browsed items:

$$ \mathbf{e}_u = \sum_{b=1}^{B} \alpha_b \mathbf{e}_{i_{u,b}}, \qquad (13) $$

where $\alpha_b$ is the attention weight assigned to $i_{u,b}$, obtained by passing the features through two linear layers:

$$ \alpha_b = \frac{\exp(\hat{A}_b)}{\sum_{b'=1,..,B} \exp(\hat{A}_{b'})}; \qquad (14) $$

$$ \hat{\mathbf{A}} = \tanh(\mathbf{E}_u \mathbf{A}_{fc1} + \mathbf{b}_{fc1}) \mathbf{A}_{fc2} + \mathbf{b}_{fc2} \in \mathbb{R}^{B \times 1}, \qquad (15) $$

where $\mathbf{A}_{fc1} \in \mathbb{R}^{h \times \frac{1}{2}h}$, $\mathbf{b}_{fc1} \in \mathbb{R}^{\frac{1}{2}h}$, $\mathbf{A}_{fc2} \in \mathbb{R}^{\frac{1}{2}h \times 1}$, and $\mathbf{b}_{fc2} \in \mathbb{R}^{1}$ are the weights and biases of the two fully-connected layers, respectively.

3.3.3. Model Optimization

The rating is the dot product of user and item embeddings: $\hat{y}_{ui} = \mathbf{e}_u^\top \mathbf{e}_i$. Assuming the scores of the positive sample and the $K$ negative samples are $\hat{y}^+$ and $\hat{y}_1^-, ..., \hat{y}_K^-$, following [43] the loss is the negative log click probability of item $i$:

$$ \mathcal{L}_{NRMS} = -\sum_{i \in \mathcal{S}_I} \log \left( \frac{\exp(\hat{y}^+)}{\exp(\hat{y}^+) + \sum_{k=1,..,K} \exp(\hat{y}_k^-)} \right). \qquad (16) $$

3.4. Fusing CF and CBF: Content-aware KMPN (CKMPN)

To fuse the information of a CBF model (NRMS-BERT) into a CF model (KMPN), we must bridge some inconsistencies between the two types of models. CBF models that utilize large Transformers cannot be co-optimized with KG-based CF models: graph convolution requires all embeddings to be present before convolution, which would demand enormous GPU memory for even a single forward pass. A more efficient solution is therefore to merge pre-trained CBF features into the training of the KG-based CF component, enriching the learned representations.

In line with our aim of using a CF model for movie and book recommendation, we present a novel and efficient approach for training a better KMPN: Cross-System Contrastive Learning, as shown in Fig. 1 (4). KMPN is still used as the backbone, and it is trained with the aid of a pre-trained NRMS-BERT, requiring no more parameters than KMPN.

In KMPN training, for the users and items in $(u, i^+, i^-) \in \mathcal{U}$, embeddings are generated from NRMS-BERT ($\mathbf{e}_u^{NRMS}$, $\mathbf{e}_{i^+}^{NRMS}$, $\mathbf{e}_{i^-}^{NRMS}$) and from KMPN ($\mathbf{e}_u^{KMPN}$, $\mathbf{e}_{i^+}^{KMPN}$, $\mathbf{e}_{i^-}^{KMPN}$). A Cross-System Contrastive Loss encourages KMPN to incorporate content-sensitive features from the NRMS-BERT features:

$$ \mathcal{L}_{CS} = \sum_{(u,i^+,i^-) \in \mathcal{U}} -\ln \sigma\big((\mathbf{e}_u^{KMPN})^\top (\mathbf{e}_{i^+}^{NRMS} - \mathbf{e}_{i^-}^{NRMS})\big) - \ln \sigma\big((\mathbf{e}_u^{NRMS})^\top (\mathbf{e}_{i^+}^{KMPN} - \mathbf{e}_{i^-}^{KMPN})\big). \qquad (17) $$

This loss encourages KMPN to produce item embeddings that interact not only with KMPN's own user embeddings but also with NRMS-BERT's user embeddings. Similarly, the user embeddings of KMPN are trained to interact with the item embeddings of NRMS-BERT. This allows $\mathbf{e}_i^{KMPN}$ to learn mutual expressiveness with $\mathbf{e}_i^{NRMS}$ without pulling the two embeddings together directly with a similarity measure (e.g. cosine similarity), which we found not to work well (discussed in Sec. 4.4). Here, $\mathbf{e}_u^{NRMS}$ serves as an "anchor" with which the item embeddings of the two systems learn to share common structure and increase their mutuality: the loss encourages $\mathbf{e}_i^{KMPN}$ and $\mathbf{e}_i^{NRMS}$ to lie on the same hidden-space hyperplane, on which features yield the same dot product with $\mathbf{e}_u^{NRMS}$. This constraint encourages KMPN to grow embeddings in the same region of hidden space, leading to mutual expressiveness across the two systems. Finally, the optimization target is:

$$ \mathcal{L}_{CKMPN} = \mathcal{L}_{KMPN} + \lambda_{CS} \mathcal{L}_{CS}, \qquad (18) $$

where $\lambda_{CS}$ controls the weight of the Cross-System Contrastive Loss. This fusion scheme can be applied to any models with similar CF/CBF mechanisms.
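Once both systems' embeddings for a batch of (u, i+, i-) triplets are available, Eq. (17) reduces to a few tensor operations. The following minimal PyTorch sketch assumes the NRMS-BERT embeddings are pre-computed and frozen, as described above; the function and argument names are ours.

```python
import torch

def cross_system_loss(u_kmpn, ip_kmpn, in_kmpn, u_nrms, ip_nrms, in_nrms):
    """Cross-System Contrastive Loss (Eq. 17).

    Each argument is a (B, h) batch of embeddings for the triplets
    (u, i+, i-): users u, positive items i+ and negative items i-,
    taken from KMPN and from the frozen, pre-trained NRMS-BERT.
    """
    # KMPN users rate NRMS-BERT items: positives should beat negatives ...
    a = torch.sigmoid((u_kmpn * (ip_nrms - in_nrms)).sum(-1))
    # ... and NRMS-BERT users rate KMPN items, symmetrically.
    b = torch.sigmoid((u_nrms * (ip_kmpn - in_kmpn)).sum(-1))
    return -(torch.log(a) + torch.log(b)).sum()
```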
Table 1: Model performance on Amazon-Book-Extended (top) and Movie-KG-Dataset (bottom). Underlined numbers (in the original) represent existing state-of-the-art performance, while the best performance of the proposed models is marked in bold. The average of 3 runs is reported to mitigate experimental randomness. Metrics with (*) are significantly higher than KMPN (p < 0.05).

On Amazon-Book-Extended               Recall                   ndcg                     Hit Ratio
                                      @20     @60     @100     @20     @60     @100     @20     @60     @100
BPRMF                                 0.1352  0.2433  0.3088   0.0696  0.0957  0.1089   0.2376  0.3984  0.4816
CKE                                   0.1347  0.2413  0.3070   0.0691  0.0948  0.1081   0.2373  0.3963  0.4800
KGAT                                  0.1527  0.2595  0.3227   0.0807  0.1066  0.1194   0.2602  0.4156  0.4931
KGIN                                  0.1654  0.2691  0.3298   0.0893  0.1145  0.1267   0.2805  0.4289  0.5040
KMPN (ours)                           0.1719  0.2793  0.3405   0.0931  0.1189  0.1315   0.2910  0.4421  0.5166
- w/o Soft DCorr                      0.1704  0.2790  0.3396   0.0924  0.1185  0.1310   0.2881  0.4419  0.5152
- w/o Soft DCorr and RRNS             0.1690  0.2774  0.3391   0.0913  0.1177  0.1302   0.2872  0.4414  0.5155
NRMS-BERT (ours)                      0.1142  0.2083  0.2671   0.0592  0.0817  0.0935   0.2057  0.3487  0.4273
CKMPN (λ_CS = 0.2) (ours)             0.1699  0.2812  0.3461   0.0922  0.1190  0.1319   0.2880  0.4460  0.5235
CKMPN (λ_CS = 0.1) (ours)             0.1718  0.2821* 0.3460*  0.0928  0.1197* 0.1326*  0.2908  0.4474* 0.5244*
Improv. (%) CKMPN vs. Best Baselines  3.90    4.82    4.94     4.31    4.55    4.59     3.72    4.33    4.04

On Movie-KG-Dataset                   Recall                   ndcg                     Hit Ratio
                                      @20     @60     @100     @20     @60     @100     @20     @60     @100
BPRMF                                 0.1387  0.1944  0.2206   0.0961  0.1137  0.1192   0.1980  0.2785  0.3236
CKE                                   0.1369  0.1898  0.2150   0.0940  0.1108  0.1160   0.1950  0.2707  0.3155
KGAT                                  0.1403  0.1928  0.2185   0.1006  0.1173  0.1226   0.1997  0.2742  0.3196
KGIN                                  0.1351  0.2119  0.2445   0.0982  0.1254  0.1322   0.2194  0.3081  0.3643
KMPN (ε = 0.5, N_m = 64) (ours)       0.1434  0.2130  0.2427   0.1073  0.1305  0.1367   0.2193  0.3098  0.3602
NRMS-BERT (ours)                      0.1241  0.1669  0.1890   0.1034  0.1213  0.1257   0.1728  0.2369  0.2773
CKMPN (λ_CS = 0.01) (ours)            0.1457  0.2157  0.2462   0.1149  0.1417  0.1482   0.2266  0.3153  0.3668
CKMPN (ours) (on the cold-start set)  0.1024  0.1741  0.2130   0.0570  0.0729  0.0808   0.1812  0.2839  0.3380

4. Experiments

4.1. Datasets

We use the two datasets introduced in [62]: (1) Amazon-Book-Extended augments the popular Amazon-Book dataset with book descriptions collected from multiple data sources. It contains 70,679 users and 24,915 items, along with a KG of 88,572 nodes and 2,557,746 triplets. (2) Movie-KG-Dataset is a newly collected dataset containing 125,218 users and 50,000 items, with a KG of 250,327 nodes and 12,055,581 triplets. Movie descriptions are provided to enable content-based recommendation.

4.2. Training Details

All experiments were run on 8 NVIDIA A100 GPUs with batch size 8192 × 8 for KMPN/CKMPN and 4 × 8 for NRMS-BERT. Adam [63] is used to optimize the models. KMPN/CKMPN is trained for 2000 epochs with learning rates linearly decayed from 10^-3 to 0 for Amazon-Book-Extended and from 5 × 10^-4 to 0 for Movie-KG-Dataset; training takes 4 hours on Amazon-Book-Extended and 12 hours on Movie-KG-Dataset. NRMS-BERT is trained for 10 epochs at a constant learning rate of 10^-4; training takes 20 hours on Amazon-Book-Extended and 120 hours on Movie-KG-Dataset.

Code and pre-trained models will be released at https://github.com/LinWeizheDragon/Content-Aware-Knowledge-Enhanced-Meta-Preference-Networks-for-Recommendation.

4.3. Evaluation Metrics and Baselines

Following common practice [21, 37, 41, 64], we report the following metrics: (1) Recall@K: within the top-K recommendations, how well the system recalls the test-set browsed items for each user; (2) ndcg@K (Normalized Discounted Cumulative Gain) [64], which increases when relevant items appear earlier in the recommended list; (3) HitRatio@K: how likely a user is to find at least one interesting item in the recommended top-K items.
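For clarity, the three metrics can be computed per user as in the sketch below (our own minimal rendering; an actual evaluation pipeline would vectorize this over all users):

```python
import numpy as np

def rank_metrics(ranked_items, relevant, k=20):
    """Recall@K, ndcg@K and HitRatio@K for a single user.

    ranked_items: item ids sorted by predicted rating (best first);
    relevant: set of the user's test-set browsed items.
    """
    hits = [1.0 if item in relevant else 0.0 for item in ranked_items[:k]]
    recall = sum(hits) / max(len(relevant), 1)
    hit_ratio = 1.0 if any(hits) else 0.0
    # DCG discounts a hit at rank r by 1/log2(r+2); IDCG normalises it.
    dcg = sum(h / np.log2(r + 2) for r, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, ndcg, hit_ratio
```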
We take the performance of several recently published recommender systems as points of comparison¹. We carefully reproduced all these baseline systems from their repositories².

BPRMF [9]: a strong Matrix Factorization (MF) method that applies the generic optimization criterion BPR-Opt for personalized ranking. For reasons of space, other MF models (e.g. FM [65], NFM [12]) are not presented, since BPRMF outperformed them.

CKE [29]: a CF model that leverages heterogeneous information in a knowledge base for recommendation.

KGAT [37]: the Knowledge Graph Attention Network, which explicitly models high-order connectivities in the KG. The model's user/item embeddings were initialized from the pre-trained BPRMF weights.

KGIN [41]: a state-of-the-art KG-based CF model that models users' latent intents (preferences) as combinations of KG relations.

¹ They are also the baseline systems compared in a recent paper [41] (WWW'21).
² As a result, the results reported here may differ from those of the original papers.

4.4. Performance on Amazon Dataset

Comparison with baselines. Model performance is presented in Table 1. Our proposed KG-based CF model, KMPN, achieved a substantial improvement on all metrics over the existing state-of-the-art model KGIN; for example, Recall@20 improved from 0.1654 to 0.1719, Recall@100 from 0.3298 to 0.3405, and ndcg@100 from 0.1267 to 0.1315. All relative improvements mentioned in our discussion are statistically significant (p < 0.05).

NRMS-BERT models user-item preferences using only item summary texts, without external information from a knowledge base. It still achieves 0.1142 Recall@20 and 0.4273 Hit Ratio@100, not far from the KGIN baseline at 0.5040 Hit Ratio@100.

CKMPN further improves all @60/@100 metrics while maintaining the model's @20 performance. For example, at a similar Recall@20, CKMPN (0.3461 Recall@100) outperforms KMPN (0.3405 Recall@100) by 1.6% with statistical significance (p < 0.05). This demonstrates that even though KMPN achieves higher performance than NRMS-BERT, gathering the item and user embeddings of one system (KMPN) with those of the other (NRMS-BERT) through proxies (Cross-System CL) can still encourage KMPN to learn and fuse content-aware information from the learned representations of a CBF model, and so to present more relevant items in the top-100 list.

Comparison with hybrid methods. Conventional feature fusion methods are popular and convenient options for combining one system into the training of another (as surveyed in Sec. 2). In fusing a pre-trained NRMS-BERT with KMPN, we demonstrate the effectiveness of our proposed fusion framework CKMPN by comparing it with these conventional approaches (a minimal sketch of the last of them follows this list):

• Early Fusion: CBF features are concatenated to the trainable user/item embeddings of KMPN before the graph convolution layers.
• Late Fusion: CBF features are fused with the output user/item embeddings of KMPN after the graph convolution layers. Many feature aggregation methods were tried; the best of them are reported in Table 2: (1) concat+linear: CF features are concatenated with CBF features and passed through 3 MLP layers into embeddings of size R^(2×h); (2) MultiHeadAtt: CF and CBF features are passed through 3 Multi-head Self-Attention blocks into embeddings of size R^(2×h).
• Cos-Sim: an auxiliary loss based on cosine similarity is added to training to encourage the user/item embeddings of KMPN to approach those of NRMS-BERT.
• Mixture of Experts (MoE): a hybrid system in which the output scores of the two systems, KMPN and NRMS-BERT, are passed through 3 layers of a Multi-Layer Perceptron (MLP) to obtain the final item ratings.
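As a concrete illustration of the MoE baseline, its score fusion can be sketched as follows. Only the 3-layer MLP structure is taken from the description above; the hidden width and all names are our assumptions.

```python
import torch
import torch.nn as nn

class ScoreMoE(nn.Module):
    """Mixture-of-Experts baseline: the two systems' scalar ratings are
    mixed by a 3-layer MLP, as described in the list above."""
    def __init__(self, hidden=32):                 # hidden width is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, score_kmpn, score_nrms):
        # (B,) scores from each system -> (B, 2) -> fused (B,) rating
        pair = torch.stack([score_kmpn, score_nrms], dim=-1)
        return self.mlp(pair).squeeze(-1)
```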
These feature aggregation approaches do not perform well at fusing pre-trained CBF features into KG-based CF training. (1) The performance of Late Fusion shows that when the already-learned NRMS-BERT item/user embeddings pass through new layers, those layers undo the learned representations of NRMS-BERT and only degrade performance. (2) Cos-Sim shows that the auxiliary cosine-similarity loss places a reliance on NRMS-BERT's features, which damages KMPN training by limiting the expressiveness of KMPN to that of NRMS-BERT: performance drops from 0.2793 (KMPN) to 0.2436 (Cos-Sim) Recall@60.

Though NRMS-BERT alone achieves much lower metrics than KMPN (0.1142 vs 0.1719 Recall@20), MoE, in which the scores of the two systems are merged by MLP layers, achieves 0.1723 Recall@20, showing that the scoring of the two systems is complementary. However, MoE's performance deteriorates at @60/@100. A case study presented later in Sec. 4.6 shows that the score of one system can be extreme enough to overwhelm the final rating under the MoE setting. In contrast, our CKMPN steadily achieves better @60/@100 results than KMPN, showing that our method is an in-depth collaboration of two systems rather than a simple aggregation of system outputs as in MoE.

Table 2: Comparison with conventional feature fusion approaches (Amazon-Book-Extended). R: Recall; HR: Hit Ratio.

Fusion Approach              R@20    R@60    ndcg@60  HR@60
Early Fusion (concat)        0.1661  0.2708  0.1148   0.4299
Late Fusion (concat+linear)  0.1679  0.2769  0.1164   0.4381
Late Fusion (MultiHeadAtt)   0.1692  0.2778  0.1175   0.4385
Cos-Sim                      0.1436  0.2436  0.1026   0.4001
Mixture of Experts           0.1723  0.2791  0.1161   0.4425
CKMPN (ours)                 0.1718  0.2821  0.1197   0.4474

Figure 2: Evaluation of model hyperparameters. (a) Model performance against the number of meta-preferences N_m; (b) model performance against the ratio ε of Soft Distance Correlation (DCorr); (c) Recall@20 and ndcg@20 against the loss weight λ_CS; (d) Recall@100 and ndcg@100 against the loss weight λ_CS.

In conclusion, Cross-System CL significantly enhances KMPN's ability to present more relevant items in the top-100 list through the fusion of unstructured content-based features. It remedies the aforementioned shortcomings of conventional fusion methods by merging features without corrupting the already-learned representations and without directly pulling the two systems' outputs together.
4.5. Contributions of Components

To support the rationale of our design, ablation studies and hyperparameter evaluations explore the effect of each proposed component.

Effects of Meta-Preferences. An important research question is how modeling users through meta-preferences improves performance. As shown in Fig. 2a, removing meta-preference modeling from KMPN (N_m = 0) dramatically decreases performance, showing that modeling users' preferences is necessary. N_m = 16 achieves worse performance than N_m ≥ 32, since a small number of meta-preferences limits the model's capacity for modeling users. Performance on all metrics increases until it peaks at N_m = 64, and then starts to decrease at N_m ≥ 128. This suggests that too many meta-preferences induce overfitting and do not further improve the system. This is a good model property in practice, since a moderate N_m = 64 suffices for the best performance.

Effects of Soft Distance Correlation Loss. The hyperparameter ε controls the number of principal components kept after PCA dimensionality reduction. The lower the ratio, the more dimensions the preference embeddings recover from the standard Distance Correlation (DCorr) constraint. As shown in Fig. 2b, ε = 0 (left) removes the DCorr constraint completely, while ε = 1 (right) reduces to the standard DCorr loss. As ε approaches 0, the DCorr constraint becomes too loose to encourage diversity of preferences, leading to dramatically decreased performance. Performance peaks at ε = 0.5, where half of the h dimensions are relaxed from the standard DCorr constraint while the preference embeddings can still grow diversely in the remaining half. This suggests that our softer version of the DCorr constraint benefits user modeling.

Effects of RRNS. As shown in Table 1, without Reciprocal Ratio Negative Sampling, Recall@20 of KMPN (w/o Soft DCorr) decreases from 0.1704 to 0.1690. In line with our intuition, reducing the probability of sampling popular items as negatives benefits model learning. This demonstrates that, while viewed-but-not-clicked (hard negative) samples are not available, our proposed sampling strategy enhances the quality of negative samples.

Effects of Cross-System Contrastive Learning. Top-20 performance does not drop much for λ_CS ≤ 0.2 (Fig. 2c), whereas top-100 performance increases dramatically for λ_CS ≤ 0.2 relative to a system without Cross-System CL (λ_CS = 0) (Fig. 2d). This suggests that incorporating Cross-System CL in training with a reasonable λ_CS makes CKMPN more capable of finding relevant items for users.

4.6. Performance on Movie-KG-Dataset

As shown in Table 1 (bottom), the same performance boost is observed for KMPN relative to the baselines. For example, KMPN achieves 0.1434 Recall@20 and 0.1073 ndcg@20, higher than the best baseline's 0.1403 Recall@20 and 0.1006 ndcg@20. CKMPN again achieves the best performance by incorporating content-based features from NRMS-BERT. It outperforms KMPN on all metrics, with particularly significant improvements in ndcg@100 (from 0.1367 to 0.1482) and Hit Ratio@100 (from 0.3602 to 0.3668). We can therefore conclude that our method is effective across different datasets.
Table 3: Case study for a user who has browsed the movie Tenet (2020). Source Code (2011) has a similar genre, while Dunkirk (2017) has the same director. Y/N: whether the movie appears in the top-100 recommendation list of each model. NRMS: NRMS-BERT; MoE: Mixture of Experts.

Item                 KMPN  NRMS  MoE  CKMPN
Source Code (2011)   N     Y     N    Y
Dunkirk (2017)       Y     N     N    Y

An example system output is presented in Table 3. Y/N indicates whether the movie appears in the top-100 recommendation list of the four models (KMPN / NRMS-BERT / Mixture of Experts (MoE) / CKMPN). This user browsed Tenet (2020), directed by Christopher Nolan. Source Code (2011) and Tenet are both about time travel, but they have quite different film crews. As a result, Source Code was considered positive by NRMS-BERT, which evaluates the movie description, but negative by the KG-based KMPN. Combining the scores of both systems, MoE did not recommend the movie. CKMPN, however, compensated for the failure of KMPN and gave this movie a high score, having learned a content-aware item representation from NRMS-BERT's representation through Cross-System CL. In contrast, Dunkirk (2017) is about war and history, not the same topic as Tenet; but since the two movies share a director, KMPN and CKMPN both recommended it, while MoE's prediction was negatively affected by NRMS-BERT. This case study suggests that our Cross-System CL approach is an effective in-depth collaboration of two systems, outperforming a direct mixture of KMPN and NRMS-BERT.

We also present model performance on the cold-start test set of the Movie-KG-Dataset, where users are completely unseen in training. As shown in the last row of Table 1 (bottom), our best model CKMPN still achieved good performance for unseen users on all metrics, e.g., 0.1024 Recall@20 and 0.3380 Hit Ratio@100. Performance did not deteriorate much from the standard test set, showing that our model still functions in the cold-start setting.

5. Conclusion

We present KMPN, a powerful KG-based CF model that outperforms strong baseline models. To investigate the complementary power of unstructured content-based information, we further propose a novel approach, Cross-System Contrastive Learning, that combines CF and CBF, two distinct paradigms, to achieve a substantial improvement relative to models in the literature. This suggests that KG-based CF models can benefit from the incorporation of unstructured content information derived from Transformers.

Our proposed CKMPN achieves substantial improvements on both datasets, especially on the top-60/100 metrics. Industrial recommender systems usually follow a 2-step pipeline in which a relatively large set of items (K = 60, 100) is first retrieved by a Recall Model and a Ranking Model then refines the list ranking. Our improvement places more relevant items in the relatively coarse Recall output, which is appealing for industrial applications. CKMPN is also much preferable to the Mixture-of-Experts model in industrial applications, since it still produces independent user/item representations. This enables fast and efficient matching of users and items in hidden space with O(log n) query time complexity [66].

References

[1] G. Takács, I. Pilászy, B. Németh, D. Tikk, Scalable collaborative filtering approaches for large recommender systems, The Journal of Machine Learning Research 10 (2009) 623–656.
[2] P. B. Thorat, R. Goudar, S. Barve, Survey on collaborative filtering, content-based filtering and hybrid recommendation system, International Journal of Computer Applications 110 (2015) 31–36.
[3] J. Bennett, S. Lanning, et al., The Netflix Prize, in: Proceedings of KDD Cup and Workshop, volume 2007, Citeseer, 2007, p. 35.
[4] I. Pilászy, D. Tikk, Recommending new movies: even a few ratings are more valuable than metadata, in: Proceedings of the Third ACM Conference on Recommender Systems, 2009, pp. 93–100.
[5] Q. Guo, F. Zhuang, C. Qin, H. Zhu, X. Xie, H. Xiong, Q. He, A survey on knowledge graph-based recommender systems, IEEE Transactions on Knowledge and Data Engineering (2020).
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[8] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30–37.
[9] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, AUAI Press, Arlington, Virginia, USA, 2009, pp. 452–461.
[10] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 426–434. URL: https://doi.org/10.1145/1401890.1401944. doi:10.1145/1401890.1401944.
[11] S. Rendle, Factorization machines with libFM, ACM Transactions on Intelligent Systems and Technology (TIST) 3 (2012) 1–22.
[12] X. He, T.-S. Chua, Neural factorization machines for sparse predictive analytics, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 355–364. URL: https://doi.org/10.1145/3077136.3080777. doi:10.1145/3077136.3080777.
[13] R. J. Oentaryo, E.-P. Lim, J.-W. Low, D. Lo, M. Finegold, Predicting response in mobile advertising with hierarchical importance-aware factorization machine, in: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 2014, pp. 123–132.
[14] K. Verstrepen, B. Goethals, Unifying nearest neighbors collaborative filtering, in: Proceedings of the 8th ACM Conference on Recommender Systems, 2014, pp. 177–184.
[15] M. Deshpande, G. Karypis, Item-based top-N recommendation algorithms, ACM Transactions on Information Systems (TOIS) 22 (2004) 143–177. doi:10.1145/963770.963776.
[16] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Analysis of recommendation algorithms for e-commerce, in: Proceedings of the 2nd ACM Conference on Electronic Commerce, 2000, pp. 158–167.
[17] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, DeepFM: A factorization-machine based neural network for CTR prediction, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 1725–1731. URL: https://doi.org/10.24963/ijcai.2017/239. doi:10.24963/ijcai.2017/239.
[18] W. Zhang, T. Du, J. Wang, Deep learning over multi-field categorical data, in: European Conference on Information Retrieval, Springer, 2016, pp. 45–57.
[19] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah, Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, Association for Computing Machinery, New York, NY, USA, 2016, pp. 7–10. URL: https://doi.org/10.1145/2988450.2988454. doi:10.1145/2988450.2988454.
[20] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, J. Wang, Product-based neural networks for user response prediction, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 1149–1154.
[21] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 173–182.
[22] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 974–983.
[23] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 639–648.
[24] J. Chicaiza, P. Valdiviezo-Diaz, A comprehensive survey of knowledge graph-based recommender systems: Technologies, development, and contributions, Information 12 (2021) 232.
[25] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, Advances in Neural Information Processing Systems 26 (2013).
[26] Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by translating on hyperplanes, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
[27] Z. Wang, J. Li, Z. Liu, J. Tang, Text-enhanced representation learning for knowledge graph, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 4–17.
[28] H. Wang, F. Zhang, M. Zhao, W. Li, X. Xie, M. Guo, Multi-task feature learning for knowledge graph enhanced recommendation, in: The World Wide Web Conference, 2019, pp. 2000–2010.
[29] F. Zhang, N. J. Yuan, D. Lian, X. Xie, W.-Y. Ma, Collaborative knowledge base embedding for recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 353–362.
[30] Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: The World Wide Web Conference, 2019, pp. 151–161.
[31] B. Hu, C. Shi, W. X. Zhao, P. S. Yu, Leveraging meta-path based context for top-N recommendation with a neural co-attention model, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1531–1540.
[32] J. Jin, J. Qin, Y. Fang, K. Du, W. Zhang, Y. Yu, Z. Zhang, A. J. Smola, An efficient neighborhood-based interaction model for recommendation on heterogeneous graph, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 75–84.
[33] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, J. Han, Personalized entity recommendation: A heterogeneous information network approach, in: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 2014, pp. 283–292.
[34] H. Zhao, Q. Yao, J. Li, Y. Song, D. L. Lee, Meta-graph based recommendation fusion over heterogeneous information networks, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 635–644.
[35] H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, Z. Wang, Knowledge-aware graph neural networks with label smoothness regularization for recommender systems, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 968–977.
[36] H. Wang, M. Zhao, X. Xie, W. Li, M. Guo, Knowledge graph convolutional networks for recommender systems, in: The World Wide Web Conference, WWW '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 3307–3313. URL: https://doi.org/10.1145/3308558.3313417. doi:10.1145/3308558.3313417.
[37] X. Wang, X. He, Y. Cao, M. Liu, T.-S. Chua, KGAT: Knowledge graph attention network for recommendation, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 950–958.
[38] Z. Wang, G. Lin, H. Tan, Q. Chen, X. Liu, CKAN: Collaborative knowledge-aware attentive network for recommender systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 219–228.
[39] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, International Conference on Learning Representations (2018).
[41] X. Wang, T. Huang, D. Wang, Y. Yuan, Z. Liu, X. He, T.-S. Chua, Learning intents behind interactions with knowledge graph for recommendation, in: Proceedings of the Web Conference 2021, 2021, pp. 878–887.
[42] J. Liu, P. Dolan, E. R. Pedersen, Personalized news recommendation based on click behavior, in: Proceedings of the 15th International Conference on Intelligent User Interfaces, 2010, pp. 31–40.
[43] C. Wu, F. Wu, S. Ge, T. Qi, Y. Huang, X. Xie, Neural news recommendation with multi-head self-attention, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6389–6394. URL: https://aclanthology.org/D19-1671. doi:10.18653/v1/D19-1671.
[44] S. Okura, Y. Tagami, S. Ono, A. Tajima, Embedding-based news recommendation for millions of users, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1933–1942.
[45] J. Lian, F. Zhang, X. Xie, G. Sun, Towards better representation learning for personalized news recommendation: a multi-channel deep fusion approach, in: IJCAI, 2018, pp. 3805–3811.
[46] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, NPA: neural news recommendation with personalized attention, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2576–2584.
[47] D. Liu, J. Lian, S. Wang, Y. Qiao, J.-H. Chen, G. Sun, X. Xie, KRED: Knowledge-aware document representation for news recommendations, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 200–209.
[48] H. Wang, F. Zhang, X. Xie, M. Guo, DKN: Deep knowledge-aware network for news recommendation, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1835–1844.
[49] S. H. Choi, Y.-S. Jeong, M. K. Jeong, A hybrid recommendation method with reduced data for large-scale application, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (2010) 557–566.
[50] L. M. De Campos, J. M. Fernández-Luna, J. F. Huete, M. A. Rueda-Morales, Combining content-based and collaborative recommendations: A hybrid approach based on Bayesian networks, International Journal of Approximate Reasoning 51 (2010) 785–799.
[51] D. Billsus, M. J. Pazzani, J. Chen, A learning agent for wireless news access, in: Proceedings of the 5th International Conference on Intelligent User Interfaces, 2000, pp. 33–36.
[52] M. Ghazanfar, A. Prugel-Bennett, Building switching hybrid recommender system using machine learning classifiers and collaborative filtering, IAENG International Journal of Computer Science 37 (2010).
[53] J. M. Noguera, M. J. Barranco, R. J. Segura, L. Martínez, A mobile 3D-GIS hybrid recommender system for tourism, Information Sciences 215 (2012) 37–52.
[54] A. S. Lampropoulos, P. S. Lampropoulou, G. A. Tsihrintzis, A cascade-hybrid music recommender system for mobile services based on musical genre classification and personality diagnosis, Multimedia Tools and Applications 59 (2012) 241–258.
[55] I. A. Christensen, S. N. Schiaffino, A hybrid approach for group profiling in recommender systems (2014).
[56] P. Bedi, P. Vashisth, P. Khurana, et al., Modeling user preferences in a hybrid recommender system using type-2 fuzzy sets, in: 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2013, pp. 1–8.
[57] R. J. Mooney, L. Roy, Content-based book recommending using learning for text categorization, in: Proceedings of the Fifth ACM Conference on Digital Libraries, 2000, pp. 195–204.
[58] X. Li, T. Murata, Multidimensional clustering based collaborative filtering approach for diversified recommendation, in: 2012 7th International Conference on Computer Science & Education (ICCSE), IEEE, 2012, pp. 905–910.
[59] G. J. Székely, M. L. Rizzo, Brownian distance covariance, The Annals of Applied Statistics 3 (2009) 1236–1265.
[60] G. J. Székely, M. L. Rizzo, N. K. Bakirov, Measuring and testing dependence by correlation of distances, The Annals of Statistics 35 (2007) 2769–2794.
[61] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24 (1933) 417.
[62] W. Lin, L. Shou, M. Gong, P. Jian, Z. Wang, B. Byrne, D. Jiang, Combining unstructured content and knowledge graphs into recommendation datasets, in: 4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, 2022.
[63] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.
[64] W. Krichene, S. Rendle, On sampled metrics for item recommendation, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1748–1757.
[65] S. Rendle, Z. Gantner, C. Freudenthaler, L. Schmidt-Thieme, Fast context-aware recommendations with factorization machines, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 635–644. URL: https://doi.org/10.1145/2009916.2010002. doi:10.1145/2009916.2010002.
[66] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 824–836.