<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transformer-Empowered Content-Aware Collaborative Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daxin Jiang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weizhe Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Linjun Shou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming Gong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pei Jian</string-name>
          <email>jpei@cs.sfu.ca</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhilin Wang</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bill Byrne</string-name>
          <email>bill.byrne@eng.cam.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering, University of Cambridge</institution>
          ,
          <addr-line>Cambridge</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft STCA</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Simon Fraser University</institution>
          ,
          <addr-line>British Columbia</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Washington</institution>
          ,
          <addr-line>Seattle</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Knowledge graph (KG) based Collaborative Filtering (CF) is an effective approach to personalize recommender systems for relatively static domains such as movies and books, by leveraging structured information from KG to enrich both item and user representations. This paper investigates the complementary power of unstructured content information (e.g. rich summary texts of items) in KG-based CF recommender systems. We introduce Content-aware KG-enhanced Meta-preference Networks that enhance CF recommendation based on both structured information from KG and unstructured content features based on Transformer-empowered content-based filtering (CBF). Within this modeling framework, we demonstrate a powerful KG-based CF model and a CBF model (a variant of the well-known NRMS system) and employ a novel training scheme, Cross-System Contrastive Learning, to address the inconsistency of the two very different systems in fusing information. We present experimental results showing that enhancing collaborative filtering with Transformer-based features derived from content-based filtering offers new improvements relative to strong baseline systems, improving the ability of KG-based CF systems to exploit item content information.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge graph</kwd>
        <kwd>recommender systems</kwd>
        <kwd>collaborative filtering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Collaborative Filtering (CF) and Content-based Filtering (CBF) are two widely used approaches to recommendation. CF systems study users’ interactions in order to leverage inter-item, inter-user, or user-item dependencies in making recommendations. The underlying notion is that users who interact with similar sets of items are likely to share preferences for other items. CBF models leverage descriptive attributes of items (e.g. item description and category) and users (e.g. age and gender). Users are characterized by the content information available in their browsing histories [2]. CBF is particularly well-suited to news recommendations, where millions of new items are produced every day. In contrast, CF systems are better suited to scenarios where the inventory of items grows slowly and where abundant user-item interactions are available. Movie and book recommender systems are typical examples.</p>
      <p>Knowledge graphs (KGs) store structured information as triplets such as [Christopher Nolan] - [director] - [Dunkirk (movie)]. KG-based CF models are particularly good at linking items to other related knowledge graph entities that serve as “item properties”. This approach leverages the structured content information from KGs (e.g. movie genre and actors) to complement CF features.</p>
      <p>†This work was done during Weizhe Lin’s internship at Microsoft STCA.</p>
<p>While KGs can readily incorporate structured content information, unstructured content, such as item descriptions, is largely unexploited.</p>
<p>Recent Transformer-based models, such as BERT [6] and GPT-2 [7], have shown great power in modeling descriptive content from natural language, which offers new opportunities to enrich item/user representations with more expressive CBF features derived from Transformers. For example, two movies may have a very similar set of structured properties, including genre, writer, and director, while their descriptions provide more fine-grained discriminative information, making it clear that one is about physics and the universe and the other is about adventures and dreams.</p>
      <p>Therefore, in this work, we offer insights into the complementary power of unstructured CBF features derived from Transformers (e.g. summary texts of books and movies). We investigate how these content-aware CBF features can be effectively fused to complement CF learning, and how much value they can add to standard large-scale KG-based CF recommender systems.</p>
      <p>However, computationally efficient approaches to enrich KG-based CF models with unstructured CBF features derived from Transformers are not yet well addressed in the literature. The challenge mainly stems from the need to capture the co-occurrence of graph node features by graph convolution operations. This operation requires representations of graph nodes to be back-propagated and updated after each forward pass, and thus it is prohibitively costly for large graphs where millions of item/user nodes require Transformer-generated embeddings. Therefore, using pre-extracted features from trained CBF systems is the most promising option. However, conventional fusion schemes (such as Mixture of Experts and early/late fusion) are shown to be vulnerable in our experiments (see Sec. 4.4). We address this problem by introducing Cross-System Contrastive Learning, which brings together the benefits of both structured and unstructured item properties. In this paper:</p>
      <p>1. We introduce a powerful KG-based CF model (KMPN) that outperforms strong baselines, and demonstrate the improvement brought by each system component. We also introduce a Transformer-empowered CBF model (NRMS-BERT) that achieves good recommendation performance with only the summary texts of books and movies.</p>
      <p>2. We propose to merge unstructured content-based features into KG-based CF through a simple but effective fusion framework based on Cross-System Contrastive Learning.</p>
      <p>3. Based on two realistic recommendation datasets, we present extensive experiments showing the value of incorporating unstructured CBF features derived from Transformers.</p>
    </sec>
    <sec id="sec-related-work">
      <title>2. Related Work</title>
      <p>Collaborative Filtering. Traditional CF models rely on Matrix Factorization (MF) [8, 9, 10] and Factorization Machine (FM) [11, 12, 13] approaches in learning user-item representations. Nearest-neighbour approaches are also predominant in CF, where the user-item ratings are interpolated from the ratings of similar items and users [14, 15, 16]. Recent models incorporate Deep Neural Networks (DNN) in learning [17, 12, 18, 19, 20, 21]. Building upon graph-based CF models [22, 23], KG-based CF models fuse external knowledge from auxiliary KGs to improve both the accuracy and explainability of recommendation [5, 24]. Items in interaction graphs are associated with auxiliary KG entities with respect to their attributes (e.g. movie directors). To exploit the KGs, Embedding-based Methods employ KG embedding methods (e.g. TransE [25], TransH [26] and TransR [27]) in order to enhance item representations with KG-aware entity embeddings [28, 29, 30]. For example, KTUP [30] trains item representations and TransH-powered KG completion simultaneously. Path-based Methods follow the meta-paths manually designed by domain experts to make KG-path-aware recommendations [31, 32, 33, 34], which is, however, not feasible for larger KGs with their enormous entity and path diversity. Convolution Methods [35, 36, 32, 37, 38] design convolution mechanisms, mostly variants of Graph Neural Networks [39, 40] (GNNs), to enhance item/user representations with features aggregated from distant entities. KGIN [41] further embeds KG-relational embeddings in inter-node feature passing to achieve path-aware graph convolution.</p>
      <p>Content-based Filtering. CBF models match items to a user by considering the metadata (content-based information) of items with which the user has interacted [42, 43, 44, 45, 46]. Most research in KG-based CBF, a recently popular topic, focuses on enhancing the item representations with KG embeddings by mapping relevant KG entities to the content of items, e.g., by entity linking [47, 48]. However, these methods heavily rely on word-level entity mapping with KG entities, which is prohibited for movies/books since their descriptions mostly consist of imaginary content, such as character names and fictional stories.</p>
      <p>Fusing CF and CBF. Hybrid CF-CBF systems are often achieved by weighting/combining [49, 50] or switching [51, 52, 53] between the ranking outputs of the two systems. They can also pass a relatively coarser ranking list produced by one system into the other for refinement [54, 55]. The features derived from one system can also be used to complement the other system by fusing with the output features (late fusion) [56] or augmenting the user/item input features (early fusion) [57, 58]. For example, CKE [29] produces augmented item representations by obtaining fixed textual features from unsupervised denoising auto-encoders. In contrast, we introduce NRMS-BERT to obtain more expressive textual item representations with supervised training and larger language models. Furthermore, these conventional fusing approaches (including late/early fusion and mixture of experts) fail to perform well in our experiments (Sec. 4.4). We address this by proposing a novel training scheme based on contrastive learning that complements a KG-based CF model with content-aware features.</p>
<p>[Figure 1: System overview. (1) KG feature passing among users, items, and non-item KG entities; (2) preference modelling; (3) NRMS-BERT, with a BERT item encoder, attention pooling, and a user encoder over user-browsed items; (4) Cross-System Contrastive Learning; (5) prediction.]</p>
    </sec>
    <sec id="sec-2">
<title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>3.1. Data Notation</title>
<p>There are N_u users {u} and N_i items {i}. O+ = {(u, i) | u ∈ U, i ∈ I} is the set of user interactions; each (u, i) pair indicates that user u interacted with item i. Each item i ∈ I carries unstructured data x_i, e.g. a text description of the item.</p>
        <p>The KG contains structured information that describes relations between real-world entities. The KG is represented as a weighted heterogeneous graph G = (V, E) with a node set V consisting of N_v nodes {v} and an edge set E containing all edges between nodes. The graph is also associated with a relation type mapping function f : E → R that maps each edge to a type in the relation set R consisting of N_r relations. Note that all items are included in the KG: I ⊂ V.</p>
        <p>The edges of the knowledge graph are derived from the triplets T = {(h, r, t) | h, t ∈ V, r ∈ R}, where V is the collection of graph entities/nodes and R is the relation set. Each triplet describes that a head entity h is connected to a tail entity t with the relation r. For example, (Jack Nicholson, film actor, The Shining) specifies that Jack Nicholson is a film actor in the movie “The Shining”. To fully expose the relationships between heads and tails, the relation set is extended with reversed relation types, i.e., for any (h, r, t) triplet we allow the inverse connection (t, r′, h) to be built, where r′ is the reverse of r. The edge set E is derived from these triplets.</p>
      </sec>
      <sec id="sec-2-2">
<title>3.2. KG-enhanced Meta-Preference Network (KMPN)</title>
<p>This section introduces the KG-enhanced Meta-Preference Network (hereafter KMPN), a KG-based CF model that aggregates features of all KG entities to items efficiently by exploiting relationships in the KG, and then links item features to users for recommendations, as shown in Fig. 1 (1), (2), and (5).</p>
        <sec id="sec-2-2-1">
          <title>3.2.1. Gated Path Graph Convolution Networks</title>
<p>Associated with each KG node v is a feature vector e_v(0) ∈ R^h. Each relation type r ∈ R is also associated with a relational embedding e_r. A Gated Path Graph Convolution Network is a cascade of L convolution layers. For each KG node, a convolution layer aggregates features from its neighbours as follows:</p>
          <p>e_v(k+1) = (1 / |N_v|) Σ_{(r, v′) ∈ N_v} η(v, r, v′) e_r ⊙ e_v′(k),</p>
          <p>where N_v = {(r, v′) | (v′, r, v) ∈ T} is the neighbouring set of v and η is a gated function that controls the messages that flow from v′ to v:</p>
          <p>η(v, r, v′) = σ(e_v(k)ᵀ e_r),</p>
          <p>where σ(·) is a sigmoid function that limits the gated value between 0 and 1. As a result, the message passed to a node is weighted by its importance to the receiving node and the relation type. Through stacking multiple layers of convolution, the final embedding at a node depends on the path along which the features are shared, as well as the importance of the messages being transmitted. After L convolutions, the embedding at a KG node is an aggregation of all the intermediate outputs: e_v = Σ_{k′=0}^{L} e_v(k′).</p>
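<p>A minimal NumPy sketch of one such gated convolution layer follows. The gate form σ(e_vᵀ e_r) is our reading of the garbled equation above, and the shapes and data structures are illustrative, not the paper's implementation:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv_layer(e_nodes, e_rel, neighbours):
    """One gated convolution layer: each node averages relation-modulated
    messages from its neighbours, scaled by a sigmoid gate.

    e_nodes:    (N, h) node embeddings at layer k
    e_rel:      dict mapping relation name -> (h,) relational embedding
    neighbours: dict mapping node v -> list of (r, v2) with (v2, r, v) in T
    Returns the (N, h) embeddings at layer k+1.
    """
    out = np.zeros_like(e_nodes)
    for v, nbrs in neighbours.items():
        msgs = np.zeros(e_nodes.shape[1])
        for r, v2 in nbrs:
            gate = sigmoid(e_nodes[v] @ e_rel[r])    # scalar gate in (0, 1)
            msgs += gate * (e_rel[r] * e_nodes[v2])  # relation-modulated message
        if nbrs:
            out[v] = msgs / len(nbrs)                # mean over the neighbourhood
    return out
```

Stacking L such layers and summing the intermediate outputs yields the final node embedding described above.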
        </sec>
        <sec id="sec-2-2-2">
          <title>3.2.2. User Preference Modeling</title>
<p>Inspired by Wang et al. [41], we model users using a combination of preferences. Wang et al. [41] assumed that each user is influenced by multiple intents and that each intent is influenced by multiple movie attributes, such as the combination of two relation types. Based on this assumption, they proposed to aggregate item embeddings to users through “preferences”, where the embedding of each preference is modelled as a weighted combination of the embeddings of all edge relation types.</p>
          <p>We take the view that user preferences are not only limited to relations but can be extended to more general cases. We model each preference p through a combination of a set of meta-preferences M, with in total N_m meta-preferences: each meta-preference m ∈ M is associated with a trainable embedding e_m ∈ R^h, and a preference p is formed by these meta-preferences as follows:</p>
          <p>e_p = Σ_{m ∈ M} β_{mp} e_m,</p>
          <p>where the linear weights {β_{mp} | m ∈ M} are derived from trainable weights {β̂_{mp} | m ∈ M} for each preference p: β_{mp} = exp(β̂_{mp}) / Σ_{m′ ∈ M} exp(β̂_{m′p}).</p>
          <p>As a result, meta-preferences reflect the general interests of all users. A particular user can be profiled by aggregating the embeddings of interacted items through these preferences:</p>
          <p>e_u(k+1) = Σ_{p ∈ P} α_{up} Σ_{(u, i) ∈ O+} e_p ⊙ e_i(k),</p>
          <p>where P is the collection of N_p preferences {p} and α_{up} is an attention mechanism that weights the interest of users over different preferences: α_{up} = exp(e_pᵀ e_u(k)) / Σ_{p′ ∈ P} exp(e_{p′}ᵀ e_u(k)).</p>
          <p>In summary, each preference is formed by general and diverse meta-preferences, and users are further profiled by multiple preferences that focus on different aspects of item features. As with items, the final user embedding is an aggregation of the intermediate outputs: e_u = Σ_{k′=0}^{L} e_u(k′).</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>3.2.3. Soft Distance Correlation</title>
          <p>Having modelled users through preferences, Wang et al. [41] added an additional loss that utilizes Distance Correlation (DCorr) [59, 60] to separate the representations of these learnt preferences as much as possible, in order to obtain diverse proxies bridging users and items. Though the authors demonstrate a considerable improvement over baselines, we take the view that applying constraints to all dimensions of preference embeddings restricts their expressiveness, as they are trained to be very dissimilar and have diverse orientations in latent space.</p>
          <p>We adopt a softer approach: Soft Distance Correlation Loss, which firstly lowers the dimensionality of preference embeddings with Principal Component Analysis (PCA) [61], keeping a ratio ρ of the most differentiable feature dimensions, and then applies distance correlation constraints to encourage diverse expression in the lower dimensions: ê_p = PCA({e_{p′} | p′ ∈ P}) ∈ R^{ρh}.</p>
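<p>A minimal NumPy sketch of this Soft Distance Correlation constraint, assuming a standard empirical distance-correlation estimator and a pairwise sum over preferences (the paper's exact normalization may differ):</p>

```python
import numpy as np
from itertools import combinations

def pca_reduce(E, ratio=0.5):
    """Keep the top ratio*h principal components of the preference matrix E
    (rows are preference embeddings)."""
    Ec = E - E.mean(axis=0)
    _, _, Vt = np.linalg.svd(Ec, full_matrices=False)
    k = min(max(1, int(E.shape[1] * ratio)), Vt.shape[0])
    return Ec @ Vt[:k].T

def dist_corr(x, y, eps=1e-12):
    """Empirical distance correlation between two 1-D vectors, treating
    each coordinate as a sample."""
    def centred(z):
        d = np.abs(z[:, None] - z[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centred(x), centred(y)
    dcov2 = (A * B).mean()
    dvar = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / (dvar + eps))

def soft_dcorr_loss(prefs, ratio=0.5):
    """Soft Distance Correlation: reduce dimensionality first, then
    penalise the pairwise distance correlation between preferences."""
    reduced = pca_reduce(prefs, ratio)
    return sum(dist_corr(reduced[i], reduced[j])
               for i, j in combinations(range(len(prefs)), 2))
```

With ratio = 1 this reduces to the standard DCorr constraint over all dimensions, while smaller ratios relax it, as discussed in Sec. 4.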
</sec>
        <sec id="sec-2-2-4">
          <title>3.2.4. Negative Sampling and Model Training</title>
          <p>The rating of user u for item i is the dot product of their embeddings: ŷ_{ui} = e_uᵀ e_i.</p>
          <p>Negative samples are drawn from the non-interacted pairs O− = {(u, i) | (u, i) ∉ O+}. However, an item is not necessarily “not interesting” to a user if no interaction happens, as not all items have been viewed. We propose to adopt Reciprocal Ratio Negative Sampling (RRNS), where items with more user interactions are considered popular and are sampled less frequently, based on the assumption that popular items are less likely to be hard negative samples for any user. The sampling distribution is given by a normalized reciprocal ratio of item interactions: i− ∼ P(i) ∝ 1 / c(i) for i ∈ I, where c(i) counts the interactions of all users with the item i.</p>
          <p>The training set therefore consists of positive and negative samples: D = {(u, i+, i−) | (u, i+) ∈ O+, (u, i−) ∈ O−}. Pairwise BPR loss [9] is adopted to train the model, which exploits a contrastive learning concept to assign higher scores to users’ browsed items than to those items in which the users are not interested:</p>
          <p>ℒ_BPR = Σ_{(u, i+, i−) ∈ D} − ln(σ(ŷ_{u i+} − ŷ_{u i−})).</p>
          <p>Together with the commonly-used embedding L2 regularization and the Soft Distance Correlation loss, the final loss is given by:</p>
          <p>ℒ_KMPN = ℒ_BPR + λ1 ||Θ||²₂ + λ2 ℒ_SoftDCorr,</p>
          <p>where Θ = {e_u, e_{i+}, e_{i−} | (u, i+, i−) ∈ D}, ||Θ||²₂ is the L2-norm of the user/item embeddings, and λ1 and λ2 are hyperparameters that control the loss weights.</p>
        </sec>
      </sec>
<sec id="sec-2-3">
        <title>3.3. Neural Recommendation with Multi-Head Self-Attention (NRMS-BERT)</title>
        <p>Inspired by NRMS [43], which is powerful in news recommendations, we propose a variant of NRMS, NRMS-BERT, that further utilizes a fine-tuned Transformer (BERT) for extracting contextual information from the descriptions of items, as shown in Fig. 1 (3).</p>
        <p>The rating is the dot product of user and item embeddings: ŷ_{ui} = e_uᵀ e_i. Assuming that the scores of the positive sample and the n negative samples are ŷ+ and ŷ1−, ..., ŷn−, following [43], the loss is the negative log click probability of the positive item:</p>
        <p>ℒ_NRMS = − Σ_{u ∈ U} log( exp(ŷ+) / (exp(ŷ+) + Σ_{j=1,..,n} exp(ŷj−)) ).</p>
        <sec id="sec-2-6-1">
          <title>3.3.1. Item Encoder</title>
          <p>The item encoder encodes the text description string x_i of any item i ∈ I through BERT into an embedding of size h by extracting the embedding of the &lt;CLS&gt; token at the last layer: e_i = BERT(x_i) ∈ R^h. For each user u, the item encoder encodes one positive item e_{i+} and n negative items e_{i1−}, ..., e_{in−}. B items are randomly sampled from the user’s browsed items i_{u,1}, ..., i_{u,B}; these browsed items are encoded and gathered to E_u = [e_{u,1}, ..., e_{u,B}] ∈ R^{B×h}.</p>
        </sec>
<sec id="sec-2-6-2">
          <title>3.3.2. User Encoder</title>
          <p>The user encoder uses the items with which a user interacted to produce a content-aware user representation. The final user representation is a weighted sum of the B browsed items: e_u = Σ_{j=1}^{B} α_j e_{u,j}, where α_j is the attention weight assigned to item j, obtained by passing the features through two linear layers:</p>
          <p>Â = tanh(E_u A_1 + b_1) A_2 + b_2 ∈ R^{B×1}; α_j = exp(Â_j) / Σ_{j′=1,..,B} exp(Â_{j′}),</p>
          <p>where A_1 ∈ R^{h×h/2}, b_1 ∈ R^{h/2}, A_2 ∈ R^{h/2×1}, and b_2 ∈ R^1 are the weights and biases of the two fully-connected layers.</p>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>3.4. Fusing CF and CBF: Content-aware KMPN (CKMPN)</title>
        <p>To fuse the information from a CBF model (NRMS-BERT) into a CF model (KMPN), we must bridge some inconsistencies between the two types of models. CBF models that utilize large transformers cannot be co-optimized with KG-based CF models, as graph convolution requires all embeddings to be present before convolution, and this requires enormous GPU memory for even one single forward pass. As a result, a more efficient solution merges the pre-trained CBF features into the training of the KG-based CF model.</p>
        <p>A Cross-System Contrastive Loss is adopted to encourage the KMPN system to learn to incorporate content-sensitive features from the NRMS-BERT features:</p>
        <p>ℒ_CSC = Σ_{(u, i+, i−) ∈ D} [ − ln(σ(e_uᵀ (e′_{i+} − e′_{i−}))) − ln(σ(e′_uᵀ (e_{i+} − e_{i−}))) ],</p>
        <p>where e_u, e_{i+}, e_{i−} are KMPN embeddings and e′_u, e′_{i+}, e′_{i−} are the pre-extracted NRMS-BERT embeddings.</p>
        <p>This loss encourages KMPN to produce item embeddings that interact not only with KMPN’s own user embeddings, but also with NRMS-BERT’s user embeddings. Similarly, the user embeddings of KMPN are trained to interact with the items of NRMS-BERT. This allows e_u to learn mutual expressiveness with e′_u, but without approaching the two embeddings directly using a similarity measure (e.g. cosine-similarity), which we found not to work well (discussed in Sec. 4.4). In this case, e′_u serves as an ‘anchor’ with which the item embeddings of the two systems learn to share commons and increase their mutuality. This loss encourages e_i and e′_i to lie in the same hidden-space hyperplane, on which features have the same dot-product results with e′_u. This constraint encourages KMPN to grow embeddings in the same region of hidden space, leading to mutual expressiveness across the two systems. Finally, the optimization target is:</p>
        <p>ℒ_CKMPN = ℒ_KMPN + λ3 ℒ_CSC,</p>
        <p>where λ3 controls the weight of the Cross-System Contrastive Loss. This fusion scheme can be applied to any models with similar CF/CBF mechanisms.</p>
        <p>[Table 1: Recall, ndcg, and Hit Ratio on Amazon-Book-Extended (BPRMF, CKE, KGAT, KGIN; KMPN and its ablations w/o Soft DCorr and w/o Soft DCorr and RRNS; NRMS-BERT; CKMPN with λ3 = 0.2 and λ3 = 0.1; improvement of CKMPN vs. the best baselines) and on the Movie-KG-Dataset (BPRMF, CKE, KGAT, KGIN; KMPN (ρ = 0.5, N_m = 64); NRMS-BERT; CKMPN (λ3 = 0.01); CKMPN on the cold-start set). The best performance of the proposed models is marked in bold. The average of 3 runs is reported to mitigate experimental randomness. Metrics with (*) are significantly higher than KMPN (p &lt; 0.05).]</p>
      </sec>
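<p>The Cross-System Contrastive Loss of Sec. 3.4 can be sketched for a single training sample as follows (a minimal NumPy illustration; variable names are ours, and in training the NRMS-BERT embeddings are pre-extracted and frozen):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_system_contrastive_loss(e_u, e_ip, e_in, cbf_u, cbf_ip, cbf_in):
    """Cross-System Contrastive Loss for one (u, i+, i-) sample.

    e_u, e_ip, e_in       : trainable KMPN (CF) user/item embeddings
    cbf_u, cbf_ip, cbf_in : frozen, pre-extracted NRMS-BERT (CBF) embeddings
    Each system's user embedding is scored against the *other* system's
    item embeddings, so the CBF embeddings act as anchors rather than as
    direct similarity targets (unlike the Cos-Sim baseline of Sec. 4.4).
    """
    term_cf_items = -np.log(sigmoid(cbf_u @ (e_ip - e_in)))   # CBF user vs CF items
    term_cbf_items = -np.log(sigmoid(e_u @ (cbf_ip - cbf_in)))  # CF user vs CBF items
    return term_cf_items + term_cbf_items
```

Summed over the training triples and weighted by λ3, this term is added to the KMPN loss to form the CKMPN objective.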
    </sec>
    <sec id="sec-3">
<title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Datasets</title>
        <p>
          We use the two datasets introduced in [62]: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
Amazon-Book-Extended collects book descriptions from multiple
data sources for the popular Amazon-Book dataset. It
contains 70,679 users, 24,915 items along with a KG of
88,572 nodes and 2,557,746 triplets. (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) Movie-KG-Dataset
is a newly collected dataset that contains 125,218 users,
50,000 items with a KG of 250,327 nodes and 12,055,581
triplets. Descriptions of movies are provided to enable
content-based recommendations.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Training Details</title>
        <p>All experiments were run on 8 NVIDIA A100 GPUs
with batch size 8192 × 8 for KMPN/CKMPN and 4 × 8
for NRMS-BERT. Adam [63] is used to optimize models.
KMPN/CKMPN is trained for 2000 epochs with linearly
decayed learning rates from 10−3 to 0 for
Amazon-BookExtended and 5 × 10−4 to 0 for Movie-KG-Dataset.
Training takes 4 hours on Amazon-Book-Extended and 12
hours on Movie-KG-Dataset. NRMS-BERT is trained for
10 epochs at a constant learning rate of 10−4. Training
takes 20 hours on Amazon-Book-Extended and 120 hours
on Movie-KG-Dataset.</p>
<p>Code and pre-trained models will be released at
https://github.com/LinWeizheDragon/Content-AwareKnowledge-Enhanced-Meta-Preference-Networks-forRecommendation.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Evaluation Metrics and Baselines</title>
<p>Following common practice [21, 37, 41, 64], we report three metrics for evaluating model performance: (1) Recall@K: within the top-K recommendations, how well the system recalls the test-set browsed items for each user; (2) ndcg@K (Normalized Discounted Cumulative Gain) [64]: increases when relevant items appear earlier in the recommended list; (3) HitRatio@K: how likely a user is to find at least one interesting item in the recommended top-K items.</p>
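<p>For concreteness, the three metrics can be computed per user as follows (a minimal sketch with standard definitions; the exact evaluation protocol follows the cited works):</p>

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of the user's test-set items retrieved in the top-k."""
    return len(set(ranked[:k]) & relevant) / max(len(relevant), 1)

def hit_ratio_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return float(bool(set(ranked[:k]) & relevant))

def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of relevant items, normalized by the ideal ranking,
    so earlier relevant items contribute more."""
    dcg = sum(1.0 / np.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(pos + 2)
                for pos in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

Per-user values are averaged over all test users to produce the reported numbers.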
<p>We take the performance of several recently published recommender systems as points for comparison¹. We carefully reproduced all these baseline systems from their repositories².</p>
        <p>BPRMF [9]: a strong Matrix Factorization (MF)
method that applies a generic optimization criterion
BPROpt for personalized ranking. Limited by space, other
MF models (e.g. FM [65], NFM [12]) are not presented
since BPRMF outperformed them.</p>
        <p>CKE [29]: a CF model that leverages heterogeneous
information in a knowledge base for recommendation.</p>
        <p>KGAT [37]: Knowledge Graph Attention Network
(KGAT) which explicitly models high-order KG
connectivities in KG. The models’ user/item embeddings were
initialized from the pre-trained BPRMF weights.</p>
        <p>KGIN [41]: a state-of-the-art KG-based CF model that
models users’ latent intents (preferences) as a
combination of KG relations.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Performance on Amazon Dataset</title>
<p>Comparison with baselines. The performance of the models is presented in Table 1. Our proposed KG-based CF model, KMPN, achieved a substantial improvement on all metrics over the performance of the existing state-of-the-art model KGIN; for example, Recall@20 was improved from 0.1654 to 0.1719, Recall@100 from 0.3298 to 0.3405, and ndcg@100 from 0.1267 to 0.1315. All relative improvements mentioned in our discussions are statistically significant (p &lt; 0.05).</p>
<p>NRMS-BERT models user-item preferences using only item summary texts, without external information from a knowledge base. It still achieves 0.1142 Recall@20 and 0.4273 Hit Ratio@100, not far from the KGIN baseline at 0.5040 Hit Ratio@100.</p>
        <p>CKMPN further improves all @60/@100 metrics while keeping the model’s performance at @20. For example, with similar Recall@20, CKMPN (0.3461 Recall@100) outperforms KMPN (0.3405 Recall@100) by 1.6% with statistical significance (p &lt; 0.05). This demonstrates that even though KMPN achieves higher performance relative to NRMS-BERT, gathering the item and user embeddings of one system (KMPN) with those of the other system (NRMS-BERT) through proxies (Cross-System CL) can still encourage KMPN to learn and fuse content-aware information from the learned representations of a CBF model, and presents more relevant items in the top-100 list.</p>
        <p>¹They are also the baseline systems compared in a recent paper [41] (WWW’21). ²As a result, the results reported here may differ from those of the original papers.</p>
<p>Comparison with hybrid methods: Conventional feature fusion methods are popular and convenient options for combining one system into the training of another (as surveyed in Sec. 2). In fusing a pre-trained NRMS-BERT with KMPN, we demonstrate the effectiveness of our proposed fusion framework CKMPN by comparing it with these conventional approaches.</p>
        <p>• Early Fusion: CBF features are concatenated to the trainable user/item embeddings of KMPN before the graph convolution layers.</p>
        <p>• Late Fusion: CBF features are fused to the output user/item embeddings of KMPN after the graph convolution layers. Many feature aggregation methods were tried, and the best of them are reported in Table 2: (1) concat+linear: CF features are concatenated with CBF features and passed through 3 MLP layers into embeddings of size R^{2×h}; (2) MultiHeadAtt: CF and CBF features are passed through 3 Multi-head Self-Attention blocks into embeddings of size R^{2×h}.</p>
        <p>• Cos-Sim: An auxiliary loss grounded on cosine-similarity is incorporated in training to encourage the user/item embeddings of KMPN to approach those of NRMS-BERT.</p>
        <p>• Mixture of Experts (MoE): a hybrid system where the output scores of the two systems, KMPN and NRMS-BERT, pass through 3 layers of a Multi-Layer Perceptron (MLP) to obtain the final item ratings.</p>
        <p>
          It can be concluded that these feature aggregation
approaches do not perform well in fusing pre-trained CBF
features into KG-based CF training. (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) The performance
of Late Fusion shows that when the already-learned
NRMS-BERT item/user embeddings pass through new
layers, these layers undo the learned representations
from NRMS-BERT and lead only to degraded
performance. (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) Cos-Sim shows that the auxiliary loss based
on cosine similarity places a reliance on NRMS-BERT’s
features, which damages KMPN training by limiting
the expressiveness of KMPN to that of NRMS-BERT. As a
result, Recall@60 decreases from 0.2793 (KMPN)
to 0.2436 (Cos-Sim).
        </p>
        <p>Though NRMS-BERT alone achieves much lower
metrics than KMPN (0.1142 vs 0.1719 Recall@20), MoE,
where the scores of the two systems are merged by MLP
layers, achieves 0.1723 Recall@20, showing that the
scoring of the two systems is complementary. However, MoE’s
performance deteriorates at @60/100. A case study is
presented later in Sec. 4.6 to show that the scoring of one
system can be extreme enough to overwhelm the final
rating under the MoE setting. In contrast, our CKMPN
steadily achieves better @60/100 results relative to
KMPN, showing that our method is an in-depth
collaboration of the two systems rather than a simple
aggregation of system outputs as in MoE.
</p>
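<p>The Cross-System CL objective discussed here can be sketched as a generic InfoNCE-style alignment between the two systems’ embeddings of the same item. The temperature, normalization, and function names below are illustrative assumptions, not the paper’s exact formulation.</p>
<p>
```python
import numpy as np

def cross_system_cl_loss(cf_emb, cbf_emb, tau=0.1):
    # Pull the CF (KMPN) embedding of each item toward the CBF (NRMS-BERT)
    # embedding of the same item; push it away from other items' embeddings.
    cf = cf_emb / np.linalg.norm(cf_emb, axis=1, keepdims=True)
    cbf = cbf_emb / np.linalg.norm(cbf_emb, axis=1, keepdims=True)
    logits = cf @ cbf.T / tau                      # (n, n) cross-system sims
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))          # diagonal = matching pairs

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
# perfectly aligned systems give a lower loss than unrelated ones
loss_aligned = cross_system_cl_loss(emb, emb)
loss_random = cross_system_cl_loss(emb, rng.standard_normal((4, 8)))
print(round(float(loss_aligned), 4), round(float(loss_random), 4))
```
</p>
<p>Because the loss only aligns representations rather than mixing scores, each system keeps its own output head, which is what distinguishes this from MoE-style aggregation.</p>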
        <p>In conclusion, Cross-System CL significantly enhances KMPN’s ability to present more relevant items in the top-100 list through the fusion of unstructured content-based features. It remedies the aforementioned shortcomings of conventional fusion methods by merging features without corrupting the already-learned representations and without directly matching the two systems’ outputs.</p>
        <sec>
          <title>4.5. Contributions of Components</title>
          <p>To support the rationale of our designs, ablation studies and hyperparameter evaluations are presented to explore the effects of each proposed component.</p>
          <p>Effects of Meta Preferences. An important research question is how the design of modeling users through meta-preferences improves the model performance. As shown in Fig. 2a, removing meta-preference modeling of users from KMPN (setting the number of meta preferences to 0) dramatically decreases the performance, showing that modeling users’ preferences is necessary. Using 16 meta preferences achieves worse performance than using 32 or more, since a small number of meta preferences limits the model’s capacity for modeling users. The performance on all metrics increases until it peaks at 64 meta preferences, and then starts to decrease at 128 and beyond. This suggests that including too many meta preferences induces overfitting and does not further improve the system. This is a good model property in practice, since a moderate 64 meta preferences is sufficient for achieving the best performance.</p>
          <p>Effects of Soft Distance Correlation Loss. The ratio hyperparameter controls the number of principal components to keep after PCA dimension reduction: the lower the ratio, the more flexibility the preference embeddings recover in dimensions from the standard Distance Correlation (DCorr) constraint. As shown in Fig. 2b, a ratio of 0 (left) removes the DCorr constraint completely, while a ratio of 1 (right) reduces to a standard DCorr loss. As the ratio approaches 0, the DCorr constraint becomes too loose to encourage the diversity of preferences, leading to dramatically decreased performance. The performance peaks at a ratio of 0.5, where half of the ℎ dimensions are relaxed from the standard DCorr constraint while preference embeddings are still able to grow diversely in the remaining half. This suggests that our softer version of the DCorr constraint is beneficial to user modeling.</p>
          <p>Effects of RRNS. As shown in Table 1, without Reciprocal Ratio Negative Sampling, Recall@20 of KMPN (w/o SoftDcorr) decreases from 0.1704 to 0.1690. In line with our intuition, reducing the probability of sampling popular items as negative samples for training yields benefits in model learning. This demonstrates that while viewed-but-not-clicked (hard negative) samples are not available to the model, our proposed sampling strategy enhances the quality of negative samples.</p>
          <p>Effects of Cross-System Contrastive Learning. The top-20 performance does not drop much for loss weights up to 0.2 (Fig. 2c), whereas the top-100 performance increases dramatically in the same range relative to a system without Cross-System CL (a weight of 0) (Fig. 2d). This suggests that by incorporating Cross-System CL in our training with a reasonable loss weight, CKMPN is more capable of finding relevant items for users.</p>
        </sec>
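<p>The Soft Distance Correlation constraint described above can be sketched as distance correlation (Székely et al. [59, 60]) computed in a PCA-reduced space that keeps only a ratio of the ℎ dimensions. The variable names and the exact placement of the PCA step are our reading of the text, not the authors’ code.</p>
<p>
```python
import numpy as np

def _dist_centered(x):
    # pairwise Euclidean distances, double-centered
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(a, b):
    A, B = _dist_centered(a), _dist_centered(b)
    dcov2 = max((A * B).mean(), 0.0)        # guard tiny negative rounding
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

def soft_dcorr(a, b, ratio=0.5):
    # PCA via SVD on the stacked embeddings; keep only the top `ratio`
    # of the h dimensions, relaxing the DCorr constraint on the rest.
    x = np.concatenate([a, b], axis=0)
    _, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
    k = max(1, int(ratio * a.shape[1]))
    proj = vt[:k].T
    return distance_correlation(a @ proj, b @ proj)

rng = np.random.default_rng(0)
p1 = rng.standard_normal((16, 8))   # two preference-embedding matrices
p2 = rng.standard_normal((16, 8))
print(round(float(soft_dcorr(p1, p2, ratio=0.5)), 4))
```
</p>
<p>With ratio = 1 this reduces to the standard DCorr loss over all dimensions; with ratio = 0.5 only half of the dimensions are constrained, matching the peak setting reported above.</p>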
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>As shown in Table 1 (bottom), the same performance
boost is observed for KMPN relative to baselines. For
example, KMPN achieves 0.1434 Recall@20 and 0.1073
ndcg@20, higher than the 0.1403 Recall@20 and
0.1006 ndcg@20 of the baselines. CKMPN again achieves
the best performance by incorporating content-based
features from NRMS-BERT. It outperforms KMPN in all
metrics, with particularly significant improvements in ndcg@100
(from 0.1367 to 0.1482) and Hit Ratio@100 (from 0.3602
to 0.3668). Therefore, we can conclude that
our method is applicable to multiple different datasets.</p>
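<p>The Recall@K and ndcg@K figures above can be made concrete with a minimal per-user computation (a sketch; the ranked list and item ids are hypothetical, and the binary-relevance nDCG form is assumed):</p>
<p>
```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    # fraction of the user's held-out items that appear in the top-k list
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # binary-relevance nDCG: discount each hit by its rank position
    dcg = sum(1.0 / np.log2(i + 2) for i, it in enumerate(ranked[:k]) if it in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

ranked = [3, 1, 7, 5, 2]     # a model's ranked recommendation list
relevant = {1, 2}            # the user's held-out interactions
print(recall_at_k(ranked, relevant, 5))   # 1.0: both relevant items in top-5
print(round(ndcg_at_k(ranked, relevant, 5), 4))
```
</p>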
      <p>We present KMPN, a powerful KG-based CF model that
outperforms strong baseline models. To investigate the
complementary power of unstructured content-based
information, we further propose a novel approach,
Cross-System Contrastive Learning, which combines CF and
CBF, two distinct paradigms, to achieve a substantial
improvement relative to models in the literature. This suggests
that KG-based CF models can benefit from the
incorporation of unstructured content information derived from
Transformers.</p>
      <p>Table 3: Case study for a user who has browsed the movie Tenet
(2020). Source Code (2011) has a similar genre, while Dunkirk
(2017) has the same director. Y/N: whether or not the movie
appears in the top-100 recommendation list of the models.
NRMS: NRMS-BERT; MoE: Mixture of Expert.
Item | KMPN | NRMS | MoE | CKMPN
Source Code (2011) | N | Y | N | Y
Dunkirk (2017) | Y | N | N | Y</p>
      <p>Our proposed CKMPN has thus far achieved
substantial improvements on both datasets, especially on
top-60/100 metrics. Industrial recommender systems
usually follow a two-step pipeline in which a relatively large
number of items (e.g., the top 60 or 100) is first recalled by a Recall
Model, and a Ranking Model is then adopted to refine the
list ranking. This improvement presents more relevant
items in the relatively coarse Recall output, which is
appealing to industrial applications. Also, CKMPN is much
preferred over the Mixture of Expert model in
industrial applications, since it still produces independent
user/item representations. This feature enables the fast
and efficient matching of users and items in hidden space
with low query time complexity [66].</p>
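<p>The serving path enabled by independent user/item representations can be sketched as a single matrix–vector scoring step. This exact search is what an ANN index such as HNSW [66] approximates with much faster queries; the embeddings below are random stand-ins, not learned representations.</p>
<p>
```python
import numpy as np

rng = np.random.default_rng(0)
item_emb = rng.standard_normal((1000, 32))   # precomputed item representations
user_emb = rng.standard_normal(32)           # one user's representation

scores = item_emb @ user_emb                 # one matrix-vector product
top100 = np.argsort(-scores)[:100]           # coarse Recall-stage candidates

print(len(top100))
```
</p>
<p>An MoE-style model cannot be served this way, because its final rating exists only after both systems have scored every candidate pair.</p>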
      <p>An example output of systems is presented in
Table 3. Y/N indicates whether or not the movie appears
in the top-100 recommendation list of the four models
(KMPN/NRMS-BERT/Mixture of Expert (MoE)/CKMPN).</p>
      <p>This user has browsed Tenet (2020), directed by
Christopher Nolan. The movie Source Code (2011) and Tenet
are both about time travel, but they have quite
different film crews. As a result, Source Code was considered
positive by NRMS-BERT, which evaluates the movie
description, but was considered negative by the KG-based
KMPN. Combining the scores of both systems, MoE did
not recommend the movie. However, CKMPN
compensated for the failure of KMPN and gave a high score to
this movie, by learning a content-aware item
representation based on the representation of NRMS-BERT through
Cross-System CL. In contrast, Dunkirk (2017) is about
war and history, which is not the same topic as Tenet.</p>
      <p>However, since they were directed by the same
director, KMPN and CKMPN both recommended this movie,
while MoE’s prediction was negatively affected by
NRMS-BERT. This case study suggests that our Cross-System
CL approach is an effective in-depth collaboration of the two
systems, outperforming the direct mixture of KMPN and
NRMS-BERT.</p>
      <p>We also present the model performance on the
cold-start test set of the Movie-KG-dataset, where users are
completely unseen in training. As shown in the
last section of Table 1 (bottom), our best model CKMPN
still achieved good performance for unseen users on all
metrics, e.g., 0.1024 Recall@20 and 0.3380 Hit
Ratio@100. The performance did not deteriorate much
from the standard test set, showing that our model still
functions in the cold-start setting.</p>
      <p>…resentation learning for knowledge graph, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 4–17.
[28] H. Wang, F. Zhang, M. Zhao, W. Li, X. Xie, M. Guo, Multi-task feature learning for knowledge graph enhanced recommendation, in: The World Wide Web Conference, 2019, pp. 2000–2010.
[29] F. Zhang, N. J. Yuan, D. Lian, X. Xie, W.-Y. Ma, Collaborative knowledge base embedding for recommender systems, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 353–362.
[30] Y. Cao, X. Wang, X. He, Z. Hu, T.-S. Chua, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences, in: The world wide web conference, 2019, pp. 151–161.
[31] B. Hu, C. Shi, W. X. Zhao, P. S. Yu, Leveraging meta-path based context for top-n recommendation with a neural co-attention model, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2018, pp. 1531–1540.
[32] J. Jin, J. Qin, Y. Fang, K. Du, W. Zhang, Y. Yu, Z. Zhang, A. J. Smola, An efficient neighborhood-based interaction model for recommendation on heterogeneous graph, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 75–84.
[33] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, J. Han, Personalized entity recommendation: A heterogeneous information network approach, in: Proceedings of the 7th ACM international conference on Web search and data mining, 2014, pp. 283–292.
[34] H. Zhao, Q. Yao, J. Li, Y. Song, D. L. Lee, Meta-graph based recommendation fusion over heterogeneous information networks, in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 635–644.
[35] H. Wang, F. Zhang, M. Zhang, J. Leskovec, M. Zhao, W. Li, Z. Wang, Knowledge-aware graph neural networks with label smoothness regularization for recommender systems, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2019, pp. 968–977.
[36] H. Wang, M. Zhao, X. Xie, W. Li, M. Guo, Knowledge graph convolutional networks for recommender systems, in: The World Wide Web Conference, WWW ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 3307–3313. URL: https://doi.org/10.1145/3308558.3313417. doi:10.1145/3308558.3313417.
[37] X. Wang, X. He, Y. Cao, M. Liu, T.-S. Chua, Kgat: Knowledge graph attention network for recommendation, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2019, pp. 950–958.
[38] Z. Wang, G. Lin, H. Tan, Q. Chen, X. Liu, Ckan: Collaborative knowledge-aware attentive network for recommender systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 219–228.
[39] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
[40] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph Attention Networks, International Conference on Learning Representations (2018).
[41] X. Wang, T. Huang, D. Wang, Y. Yuan, Z. Liu, X. He, T.-S. Chua, Learning intents behind interactions with knowledge graph for recommendation, in: Proceedings of the Web Conference 2021, 2021, pp. 878–887.
[42] J. Liu, P. Dolan, E. R. Pedersen, Personalized news recommendation based on click behavior, in: Proceedings of the 15th international conference on Intelligent user interfaces, 2010, pp. 31–40.
[43] C. Wu, F. Wu, S. Ge, T. Qi, Y. Huang, X. Xie, Neural news recommendation with multi-head self-attention, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6389–6394. URL: https://aclanthology.org/D19-1671. doi:10.18653/v1/D19-1671.
[44] S. Okura, Y. Tagami, S. Ono, A. Tajima, Embedding-based news recommendation for millions of users, in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 1933–1942.
[45] J. Lian, F. Zhang, X. Xie, G. Sun, Towards better representation learning for personalized news recommendation: a multi-channel deep fusion approach, in: IJCAI, 2018, pp. 3805–3811.
[46] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, Npa: neural news recommendation with personalized attention, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2019, pp. 2576–2584.
[47] D. Liu, J. Lian, S. Wang, Y. Qiao, J.-H. Chen, G. Sun, X. Xie, Kred: Knowledge-aware document representation for news recommendations, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 200–209.
[48] H. Wang, F. Zhang, X. Xie, M. Guo, Dkn: Deep knowledge-aware network for news recommendation, in: Proceedings of the 2018 world wide web conference, 2018, pp. 1835–1844.
[49] S. H. Choi, Y.-S. Jeong, M. K. Jeong, A hybrid recommendation method with reduced data for large-scale application, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (2010) 557–566.
[50] L. M. De Campos, J. M. Fernández-Luna, J. F. Huete, M. A. Rueda-Morales, Combining content-based and collaborative recommendations: A hybrid approach based on bayesian networks, International journal of approximate reasoning 51 (2010) 785–799.
[51] D. Billsus, M. J. Pazzani, J. Chen, A learning agent for wireless news access, in: Proceedings of the 5th international conference on Intelligent user interfaces, 2000, pp. 33–36.
[52] M. Ghazanfar, A. Prugel-Bennett, Building switching hybrid recommender system using machine learning classifiers and collaborative filtering, IAENG International Journal of Computer Science 37 (2010).
[53] J. M. Noguera, M. J. Barranco, R. J. Segura, L. Martínez, A mobile 3d-gis hybrid recommender system for tourism, Information Sciences 215 (2012) 37–52.
[54] A. S. Lampropoulos, P. S. Lampropoulou, G. A. Tsihrintzis, A cascade-hybrid music recommender system for mobile services based on musical genre classification and personality diagnosis, Multimedia Tools and Applications 59 (2012) 241–258.
[55] I. A. Christensen, S. N. Schiafino, A hybrid approach for group profiling in recommender systems (2014).
[56] P. Bedi, P. Vashisth, P. Khurana, et al., Modeling user preferences in a hybrid recommender system using type-2 fuzzy sets, in: 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2013, pp. 1–8.
[57] R. J. Mooney, L. Roy, Content-based book recommending using learning for text categorization, in: Proceedings of the fifth ACM conference on Digital libraries, 2000, pp. 195–204.
[58] X. Li, T. Murata, Multidimensional clustering based collaborative filtering approach for diversified recommendation, in: 2012 7th International Conference on Computer Science &amp; Education (ICCSE), IEEE, 2012, pp. 905–910.
[59] G. J. Székely, M. L. Rizzo, Brownian distance covariance, The annals of applied statistics 3 (2009) 1236–1265.
[60] G. J. Székely, M. L. Rizzo, N. K. Bakirov, Measuring and testing dependence by correlation of distances, The annals of statistics 35 (2007) 2769–2794.
[61] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of educational psychology 24 (1933) 417.
[62] W. Lin, L. Shou, M. Gong, P. Jian, Z. Wang, B. Byrne, D. Jiang, Combining unstructured content and knowledge graphs into recommendation datasets, in: 4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, 2022.
[63] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.
[64] W. Krichene, S. Rendle, On sampled metrics for item recommendation, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 1748–1757.
[65] S. Rendle, Z. Gantner, C. Freudenthaler, L. Schmidt-Thieme, Fast context-aware recommendations with factorization machines, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, Association for Computing Machinery, New York, NY, USA, 2011, p. 635–644. URL: https://doi.org/10.1145/2009916.2010002. doi:10.1145/2009916.2010002.
[66] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE transactions on pattern analysis and machine intelligence 42 (2018) 824–836.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Takács</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pilászy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Németh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <article-title>Scalable collaborative filtering approaches for large recommender systems</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>10</volume>
          (
          <year>2009</year>
          )
          <fpage>623</fpage>
          -
          <lpage>656</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Thorat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goudar</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Barve,</surname>
          </string-name>
          <article-title>Survey on collaborative filtering, content-based filtering and hybrid recommendation system</article-title>
          ,
          <source>International Journal of Computer Applications</source>
          <volume>110</volume>
          (
          <year>2015</year>
          )
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lanning</surname>
          </string-name>
          , et al.,
          <article-title>The netflix prize</article-title>
          ,
          <source>in: Proceedings of KDD cup and workshop</source>
          , volume
          <volume>2007</volume>
          , Citeseer,
          <year>2007</year>
          , p.
          <fpage>35</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Pilászy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <article-title>Recommending new movies: even a few ratings are more valuable than metadata</article-title>
          ,
          <source>in: Proceedings of the third ACM conference on Recommender systems</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graph-based recommender systems</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
          …Electronic Commerce, 2000, pp. 158–167.
          [17] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: A factorization-machine based neural network for ctr prediction, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 1725–1731. URL: https://doi.org/10.24963/ijcai.2017/239. doi:10.24963/ijcai.2017/239.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
          [18] W. Zhang, T. Du, J. Wang, Deep learning over multi-field categorical data, in: European conference on
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          , Matrix factorization information retrieval, Springer,
          <year>2016</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>57</lpage>
          .
          <article-title>techniques for recommender systems</article-title>
          , Computer [19]
          <string-name>
            <surname>H</surname>
            .-T. Cheng, L. Koc,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Harmsen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Shaked</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          Chan42 (
          <year>2009</year>
          )
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          . dra, H. Aradhye, G. Anderson,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , W. Chai,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, Bpr: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, AUAI Press, Arlington, Virginia, USA, 2009, p. 452–461.
          …M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah, Wide &amp; deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, Association for Computing Machinery, New York, NY, USA, 2016, p.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, Association for Computing Machinery, New York, NY, USA, 2008, p. 426–434. URL: https://doi.org/10.1145/1401890.1401944. doi:10.1145/1401890.1401944.
          …7–10. URL: https://doi.org/10.1145/2988450.2988454. doi:10.1145/2988450.2988454.
          [20] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, J. Wang, Product-based neural networks for user response prediction, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 1149–1154.
          [21] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <article-title>Factorization machines with libfm, ACM Neural collaborative filtering</article-title>
          ,
          <source>in: Proceedings of Transactions on Intelligent Systems and Technol- the 26th international conference on world wide ogy (TIST) 3</source>
          (
          <issue>2012</issue>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          . web,
          <year>2017</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          , T.-S. Chua, Neural factorization ma- [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Eksombatchai</surname>
          </string-name>
          , W. L.
          <article-title>chines for sparse predictive analytics</article-title>
          , in: Pro- Hamilton, J. Leskovec,
          <article-title>Graph convolutional neuceedings of the 40th International ACM SIGIR ral networks for web-scale recommender systems</article-title>
          , Conference on Research and Development in In- in
          <source>: Proceedings of the 24th ACM SIGKDD Internaformation Retrieval</source>
          , SIGIR '17,
          <string-name>
            <surname>Association</surname>
            <given-names>for</given-names>
          </string-name>
          <source>tional Conference on Knowledge Discovery &amp; Data Computing Machinery</source>
          , New York, NY, USA,
          <year>2017</year>
          , Mining,
          <year>2018</year>
          , pp.
          <fpage>974</fpage>
          -
          <lpage>983</lpage>
          . p.
          <fpage>355</fpage>
          -
          <lpage>364</lpage>
          . URL: https://doi.org/10.1145/3077136. [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , M. Wang,
          <volume>3080777</volume>
          . doi:
          <volume>10</volume>
          .1145/3077136.3080777.
          <string-name>
            <surname>Lightgcn</surname>
          </string-name>
          <article-title>: Simplifying and powering graph convolu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Oentaryo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-P.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Fine- tion network for recommendation</article-title>
          , in: Proceedings gold,
          <article-title>Predicting response in mobile advertising of the 43rd International ACM SIGIR conference on with hierarchical importance-aware factorization research and development in Information Retrieval, machine</article-title>
          ,
          <source>in: Proceedings of the 7th ACM interna- 2020</source>
          , pp.
          <fpage>639</fpage>
          -
          <lpage>648</lpage>
          .
          <article-title>tional conference on Web search and data mining</article-title>
          , [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chicaiza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Valdiviezo-Diaz</surname>
          </string-name>
          ,
          <source>A comprehensive</source>
          <year>2014</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>132</lpage>
          .
          <article-title>survey of knowledge graph-based recommender</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Verstrepen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          ,
          <article-title>Unifying nearest neigh- systems: Technologies, development, and contribubors collaborative filtering</article-title>
          ,
          <source>in: Proceedings of the tions, Information</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>232</fpage>
          . 8th ACM Conference on Recommender systems, [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <year>2014</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . O.
          <string-name>
            <surname>Yakhnenko</surname>
          </string-name>
          ,
          <article-title>Translating embeddings for mod-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , G. Karypis,
          <article-title>Item-based top-n rec- eling multi-relational data, Advances in neural ommendation algorithms</article-title>
          ,
          <source>ACM Transactions on information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
          <source>Information Systems - TOIS 22</source>
          (
          <year>2004</year>
          )
          <fpage>143</fpage>
          -
          <lpage>177</lpage>
          . [26]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Knowledge doi:
          <volume>10</volume>
          .1145/963770.963776.
          <article-title>graph embedding by translating on hyperplanes</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sarwar</surname>
          </string-name>
          , G. Karypis,
          <string-name>
            <given-names>J.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          , Analy-
          <source>Proceedings of the AAAI Conference on Artificial sis of recommendation algorithms for e-commerce, Intelligence</source>
          , volume
          <volume>28</volume>
          ,
          <year>2014</year>
          .
          <source>in: Proceedings of the 2nd ACM Conference on</source>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Text-enhanced rep-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>