=Paper=
{{Paper
|id=Vol-2960/paper16
|storemode=property
|title=Improving Media Content Recommendation with Automatic Annotations (Long paper)
|pdfUrl=https://ceur-ws.org/Vol-2960/paper16.pdf
|volume=Vol-2960
|authors=Ismail Harrando,Raphael Troncy
|dblpUrl=https://dblp.org/rec/conf/recsys/HarrandoT21
}}
==Improving Media Content Recommendation with Automatic Annotations (Long paper)==
Ismail Harrando, Raphaël Troncy (EURECOM, France)
ismail.harrando@eurecom.fr, raphael.troncy@eurecom.fr
ORCID: 0000-0002-3593-4490 (I. Harrando), 0000-0003-0457-1436 (R. Troncy)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, 27 September – 1 October 2021, Amsterdam, Netherlands

'''Abstract.''' With the immense growth of media content production on the internet and increasing wariness about privacy, content-based recommendation systems offer the possibility of promoting media to users (e.g. posts, videos, podcasts) based solely on a representation of the content, i.e. without using any user-related data such as views and, more generally, interactions between users and items. In this work, we study the potential of using off-the-shelf automatic annotation tools from the Information Extraction literature to improve recommendation performance without any extra cost of training, data collection or annotation. We experiment with how these annotations can improve recommendations on two tasks: the traditional user history-based recommendation, as well as a purely content-based recommendation evaluation. We pair these automatic annotations with the manually created metadata and we show that Knowledge Graphs, through their embeddings, constitute a great modality to seamlessly integrate this extracted knowledge and provide better recommendations. The evaluation code, as well as the enrichment generation, is available at https://github.com/D2KLab/ka-recsys.

'''Keywords.''' Recommender Systems, Content-based Recommendation, Knowledge Graph, Automatic Annotation

===1. Introduction===

As user engagement with content online – i.e. retaining a user's interest in the provided content and maximizing their time watching, reading or listening to it – has become a crucial element of most if not all content-providing multimedia platforms, the role of recommender systems in shaping and improving the user experience cannot be overstated: they help funnel the usually overwhelming amount of data into a condensed, targeted and interesting selection of items that the user is most likely to find enjoyable.

Traditionally, recommendation systems either use collaborative filtering, i.e. leveraging user statistics and their implicit/explicit feedback (views, likes, watch time) to find items to recommend (the underlying assumption being that people who have similar interests interact with the same items), or provide content-based recommendations, which rely on the content of the item itself to find similar content without any input from the user. Content-based recommendations are particularly interesting in the case of the cold start problem, where there is no feedback from users (no interactions to base the recommendations on), and in cases where it is hard to collect such feedback (anonymity, privacy).

In this paper, we are interested in the second kind of recommendations, which are based solely on the content of the media to recommend. The "content" in content-based can refer to a variety of potential formats: text, image, video, metadata (e.g. tags and keywords) and so on. Typically, a representation of such content is extracted or learned, and the task of recommendation is then cast as a content similarity/retrieval task: given the representation of an item of interest (e.g. the video the user is currently watching) and the representations of all items already existing in the catalog, we want to find the items which have the highest similarity to the item of interest. Many varieties of this approach exist (ones that target other metrics such as serendipity [1], diversity [2] and explainability [3]) and may formulate the problem differently, but at its core the task can be framed as finding the best content representation that allows uncovering a meaningful measure of similarity.
We posit in this paper that the use of Knowledge Graphs (KGs), both created using item metadata and automatically generated from the given content, can improve the task of media recommendation. Instead of relying only on the content, we leverage several Information Extraction techniques to extract high-level descriptors that allow the automatic creation of metadata, which can then be used to generate a KG connecting all content in the media catalog. Given the versatility of Knowledge Graphs, they allow us to combine these automatic annotations with already existing metadata seamlessly. To validate this approach, we focus on studying the TED dataset [4], an open-sourced multimedia dataset that offers the unique possibility of evaluating recommendations based on both the content only ("related videos", as curated by human editors) and the user preferences based on their interactions history. We demonstrate that our approach improves the recommendation performance on both tasks, and that KGs are a reliable framework to integrate external knowledge into the task of recommendation.

===2. Related Work===

'''The TED Dataset.''' The TED dataset [4] is a multimodal dataset which contains the audiovisual recordings of the TED talks downloaded from the official website (https://www.ted.com), which sums up to 1149 talks, alongside metadata fields and user profiles with rating and commenting interactions. The metadata fields are as follows: identifier, title, description, speaker name, TED event at which the talk is given, transcript, publication date, filming date, and number of views.
For nearly every video, the dataset contains a list of user interactions (marked by the action of "Adding to favorites"), as well as up to three "related videos", which are picked by the editorial staff to be recommended to the user to watch next. What is unique about this dataset is that it provides two sorts of ground truths for the recommender system use-case, which we can formulate as these two tasks:

* Task 1 - Personalized (user-specific) recommendations: based on a user's list of favorite talks, the task is to predict what they would watch next. An evaluation set can thus be created using a "leave one out" protocol, i.e. removing one interaction from the user's list of favorites and measuring how successful a method is in predicting the omitted item. Most recommender system datasets contain similar information, i.e. what items a user has actually interacted with in reality, based on their viewing/interaction history. This task is usually handled with collaborative filtering methods (e.g. [5]), but is still interesting for content-based recommendation in the case of the cold start problem: when a new talk is added to the platform, how can we recommend it to other users? The most common approach is to use its content to recommend it to users who previously liked similar content.

* Task 2 - General (content-based) recommendations: to the best of our knowledge, this is the only dataset which offers ground truth for multimedia recommendations based on content only, which are referred to as "related videos", manually annotated by TED editorial staff. These are supposed to reflect subjective topical relatedness between talks in the corpus. Performance on this task reflects the model's ability to recommend content to either users without an interactions history (new users, visitors without accounts) or new videos (that have not yet received any interactions). We note that in the ground truth, some talks are associated with three related talks, some with two, and some with only one. We account for this in the evaluation metrics.

Previous works have studied specific aspects of this dataset such as sentiment analysis [6], estimating trust from comments polarity and ratings to improve recommendation [7], or studying hybrid recommender systems [8]. In this work, we focus our interest on this dataset as it offers a unique possibility of evaluating content-based recommendation using both real user feedback and hand-picked recommendations, as the latter has not been considered in any of the published works on this dataset to the best of our knowledge.

We also note that, while the dataset is multimodal (TED Talks videos are also available), our work does not tackle visual information extraction, mainly because TED Talks are not visually diverse (mostly speakers and audience wide shots). This is however a promising direction of work that has been tackled previously [9].

'''Graph-based Recommender Systems.''' Given the recent growing interest in Knowledge Graphs and their applications, there is a growing literature on the techniques and models that can be leveraged to build "knowledge-aware" recommender systems. [10] present such an approach to bring external knowledge to content-based recommendation, identifying two main families of what they call "Semantics-aware Recommender Systems" to tackle traditional problems of content-based recommender systems: Top-down Approaches, which incorporate knowledge from ontological resources such as WordNet [11] and encyclopedic knowledge sources such as Wikipedia (https://en.wikipedia.org/wiki/Main_Page) to enrich the item representations with external world and linguistic knowledge, and Bottom-up Approaches, which use linguistic resources such as what we commonly refer to as distributional word representations, e.g. using pretrained word embeddings to avoid the issue of exact matching in traditional content-based systems. They also raise the potential use of a graph structure to discover latent connections among items, which we study in our experiments. [12] offers an extensive survey of Knowledge Graph-based Recommender System approaches, proposing a high-level taxonomy of methods that either use graph embeddings, connectivity patterns (common paths mining), or combine the two. In this paper, we only focus on embedding-based methods to study the effect of automatic annotations on the performance of recommender systems. Additionally, unlike some previous works, our work does not tackle the two tasks jointly as a learning problem [13], but attempts to show how the same approach can improve the performance on both at the same time.
Figure 1 (illustration): High-level illustration of the approach: we start by extracting annotations from the video transcript using off-the-shelf Information Extraction tools, which we combine with manual annotations to create a Knowledge Graph, where the talks and the annotations are nodes, connected with the corresponding semantic relation. Using this graph structure, we can generate continuous fixed-dimensional representations using a Graph Embedding technique, which we can later use to measure content similarity for recommendation.

===3. Approach===

The proposed approach builds on using several Information Extraction techniques, namely Topic Modeling (3.1), Named Entity Recognition (3.2), and Keyword Extraction (3.3), to generate high-level descriptors – annotations – of the content of each video in the dataset. Once the annotations are generated for each video, we use them to build a Knowledge Graph connecting the talks by their annotations. This approach also allows us to integrate external metadata if such metadata is available (for our dataset, metadata such as "Tags" and "Themes" are available and will be used). Once the KG is generated, we can use a graph embedding method [14] to generate a fixed-dimensional embedding for each video in the dataset, such that videos having similar annotations are represented in proximity in the embedding space. As a result, we can measure the (cosine) similarity between any two videos' embeddings as a proxy for their relatedness. The approach is illustrated in Figure 1. We present a selection of automatic annotation techniques and how they are used in our approach in the following subsections.
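To make the last step concrete, the following is a minimal Python sketch (not the authors' exact implementation) of how recommendations can be produced once an embedding has been computed for every talk, whatever its origin: normalise the vectors, compute cosine similarities, and return the most similar talks. The names <code>talk_ids</code>, <code>embeddings</code> and <code>query_idx</code> are illustrative.

<pre>
import numpy as np

def recommend(talk_ids, embeddings, query_idx, k=10):
    """Return the ids of the k talks most similar to the query talk.

    talk_ids   : list of talk identifiers
    embeddings : array of shape (n_talks, dim), one vector per talk
    query_idx  : index of the talk we want recommendations for
    """
    # L2-normalise so that a dot product equals the cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    scores = unit @ unit[query_idx]   # cosine similarity to every talk
    scores[query_idx] = -np.inf       # never recommend the query itself

    top = np.argsort(-scores)[:k]     # indices of the k highest scores
    return [talk_ids[i] for i in top]
</pre>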
====3.1. Topic Modeling====

Topic modeling is a ubiquitously used Information Extraction technique which attempts to find the latent topics in a text corpus. A topic can be roughly defined as a coherent set of vocabulary words that tend to co-appear with high probability in the same documents. When applied to documents of natural language, topic models have the ability to find the underlying "themes" in the document collection, such as sport, technology, etc.

The literature on topic modeling is rich and diverse, ranging from approaches relying solely on word counts, such as the commonly used LDA [15], to approaches using state-of-the-art representations to place documents in more meaningful representational spaces [16, 17]. Topics are usually represented by their "top N words" (the N words most likely to appear given a topic). In our dataset, we find topics such as:

* Technology: network, online, computers, digital, google
* Environment: waste, plants, electrical, plastic, battery
* Gaming: games, online, virtual, gamers, penalty
* Health: aids, malaria, drugs, mortality, vaccine

For our experiments, we use LDA as it is still commonly used and offers simple yet competitive performance [18]. We test two aspects of topic modeling that can influence the structure of the graph (the number of nodes and relations added): the number of topics (i.e. the number of topic nodes in the final KG), and the cutoff threshold reflecting the topic model's confidence in assigning a given topic to a given talk (which affects the number of relations to topic nodes). We report the results in Section 4. For a better performance of the topic modeling task, we preprocess our dataset as follows (a code sketch of this step is given after the list):

# Lowercase all words
# Remove short words (less than 3 characters)
# Remove punctuation
# Remove the most frequent words (top 1%)
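The sketch below shows one plausible way to produce such topic annotations with scikit-learn's LDA implementation. It is an illustration under assumptions rather than the paper's exact pipeline: the authors do not name their LDA library, and <code>max_df</code> is only a rough stand-in for the "remove the top 1% most frequent words" step.

<pre>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def extract_topic_annotations(transcripts, n_topics=40, threshold=0.3):
    """Return (topic -> top words, talk index -> assigned topics)."""
    # Lowercasing, punctuation removal and the 3-character minimum are handled
    # by the token pattern; max_df drops the most document-frequent terms.
    vectorizer = CountVectorizer(lowercase=True,
                                 token_pattern=r"(?u)\b[a-z]{3,}\b",
                                 max_df=0.99)
    counts = vectorizer.fit_transform(transcripts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)      # shape: (n_talks, n_topics)

    vocab = vectorizer.get_feature_names_out()
    top_words = {t: [vocab[i] for i in comp.argsort()[-5:][::-1]]
                 for t, comp in enumerate(lda.components_)}

    # Only keep topic assignments above the confidence cutoff
    talk_topics = {d: [t for t in range(n_topics) if doc_topics[d, t] > threshold]
                   for d in range(len(transcripts))}
    return top_words, talk_topics
</pre>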
====3.2. Named Entity Recognition====

Named Entity Recognition is the task of extracting, from unstructured text, terms or phrases that refer to named entities, i.e. real-world objects that have proper names and belong to one of several classes: persons, places, organizations, etc. Once extracted, these Named Entities can be used as high-level descriptors for a text content. For example, if two talks mention "Einstein" and "Newton", they may have a similar topic. While this task used to rely on grammatical and hand-crafted features to designate what constitutes a Named Entity (e.g. starts with a capital letter), modern systems do without such hand-crafted features [19, 20] and instead combine the learning power of neural networks with annotated corpora of Named Entities.

In our experiments, we use spaCy's [21] NER model, which uses an architecture that combines a word embedding strategy using sub-word features with a deep convolutional neural network with residual connections, and is "designed to give a good balance of efficiency, accuracy and adaptability" (https://spacy.io/universe/project/video-spacys-ner-model). For our experiments, we keep the Named Entities belonging to the following classes: 'PERSON', 'LOC' (location), 'ORG' (organization), 'GPE' (geopolitical entity), 'FAC' (facility), 'PRODUCT', and 'WORK_OF_ART'. We also experiment with the impact of keeping all extracted Named Entities or filtering some out based on frequency, thus altering the number of nodes added to the graph and their relations to the existing talks. We report the results in Section 4.
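A minimal sketch of this annotation step with spaCy is shown below. The paper confirms the use of spaCy's NER and the kept entity classes; the specific pipeline name (<code>en_core_web_sm</code>) and the helper structure are assumptions made for illustration.

<pre>
from collections import Counter
import spacy

KEPT_LABELS = {"PERSON", "LOC", "ORG", "GPE", "FAC", "PRODUCT", "WORK_OF_ART"}

def extract_entity_annotations(transcripts, min_mentions=10):
    """Extract Named Entities per transcript, keeping only frequent ones."""
    # Any pretrained English pipeline with an NER component would do here
    nlp = spacy.load("en_core_web_sm")

    per_talk, counts = [], Counter()
    for doc in nlp.pipe(transcripts):
        ents = [ent.text for ent in doc.ents if ent.label_ in KEPT_LABELS]
        per_talk.append(set(ents))
        counts.update(ents)            # raw mention counts over the whole corpus

    # Drop rarely mentioned entities (often erroneous or superfluous mentions)
    frequent = {e for e, c in counts.items() if c > min_mentions}
    return [ents & frequent for ents in per_talk]
</pre>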
====3.3. Keyword Extraction====

Similarly to the two previous tasks, Keyword Extraction is the process of extracting terms or phrases that summarize, on a high level, the core themes of a textual document. Generally, the keywords (sometimes called tags) are the terms or phrases that are explicitly mentioned in the text with a high frequency or are otherwise relevant to a big portion of it.

For our experiments, we use KeyBERT [22], an off-the-shelf keyword extractor that is based on BERT [20], which extracts keywords by first finding the frequent n-grams and then measuring the similarity between their embedding and the embedding of the whole document. We experiment with keeping all keywords or filtering out rare ones and report the results in Section 4.
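The following sketch shows how such keyword annotations could be produced with KeyBERT. The confidence threshold of 0.3 comes from the experiments reported later; the n-gram range and <code>top_n</code> value are illustrative assumptions.

<pre>
from keybert import KeyBERT

def extract_keyword_annotations(transcripts, min_score=0.3, top_n=10):
    """Extract keyword annotations per transcript, keeping confident ones only."""
    kw_model = KeyBERT()   # uses a default sentence-transformers model under the hood

    annotations = []
    for text in transcripts:
        # Each result is a (keyword, similarity-to-document score) pair
        keywords = kw_model.extract_keywords(text,
                                             keyphrase_ngram_range=(1, 2),
                                             stop_words="english",
                                             top_n=top_n)
        annotations.append([kw for kw, score in keywords if score > min_score])
    return annotations
</pre>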
===4. Experiments and Results===

In this section, we explain the experimental protocol and describe the results for the different experiments done to study the impact of using automatic annotations on recommendation performance. We first reintroduce the dataset and how it is going to be used in the rest of this section. Then, we define the metrics we use to measure this performance (Hit Rate and Mean Reciprocal Rank) and the embedding method to use for the rest of the experiments. For each automatic annotation considered (i.e. Topics, Named Entities and Keywords), we consider several configurations, with and without the addition of the original metadata from the dataset. Finally, we observe the potential of combining the resulting automatically generated graph embeddings with the textual embeddings of the content, and show how the two complement each other to push the performance even higher.

====4.1. Dataset====

As mentioned previously, the TED Talks dataset has two versions of ground truths (or prediction tasks) for recommendation, namely:

* User-specific recommendations that are based on actual users' interactions history (henceforth referred to as T1)
* Content-based recommendations, which are hand-picked by editors for each talk (henceforth referred to as T2)

For our evaluation purposes, and to unify the evaluation for both tasks, we proceed as follows:

* For T1, we create a test split using the leave-one-out protocol that is commonly used in the literature [23], thus having a "training" set which contains all but one talk that the user interacted with (a user has to have at least two interactions, otherwise they are dropped). We create a user embedding by averaging the computed embeddings of all talks in the training set. The top recommendations are then generated by taking the talks which have the highest similarity score (in the same KG embedding space) to the user embedding. We note that no actual training takes place, but this method allows us to leverage actual "historical" user behavior to evaluate purely content-based recommendation.
* For T2, we consider all "related videos" as a test set. In other words, for each talk, we compute its similarity to all other talks in the dataset, and we recommend the talks which score the highest.

====4.2. Metrics====

To evaluate the performance of our method, we use two commonly used metrics in the recommender systems literature. In the following paragraphs, T is the number of talks in the dataset, U is the number of users with at least 2 interactions in their history, K is the number of (ordered) model recommendations to consider (we picked K = 10 in our results), t is a talk ID (which maps to its embedding), u is a user ID (which maps to its embedding, i.e. the average of the embeddings of all talks in the user's history), rec_i(x) is the i-th recommendation by our model (x being a user ID for T1 and a talk ID for T2), hit(x, j) = 1 if the talk j is indeed in the ground truth for x and 0 otherwise, related(t) is the number of related talks in T2 (which can be 1, 2 or 3), and rank(x, j) is the rank of talk j in the suggested recommendations for talk/user x, by descending similarity score.

'''Hit Rate (HR@K):''' a simple metric quantifying the probability of an item in the ground truth being among the top-K suggestions produced by the system. For T1, this means that the left-out item from the user history must be among the K most similar talks to the user embedding (as defined above). For T2, this means that the talk that was manually picked by editors is among the K most similar talks in the embedding space. For T1 we get the formula:

:<math>HR@K = \frac{1}{U} \sum_{u=1}^{U} \sum_{i=1}^{K} \mathit{hit}(u, \mathit{rec}_i(u))</math>

For T2, we normalize the counting of hits to account for the varying number of talks in the ground truth, so that the Hit Rate is 1 at best (i.e. when all related talks in the ground truth are included in the system's recommendations):

:<math>HR@K = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{\mathit{related}(t)} \sum_{i=1}^{K} \mathit{hit}(t, \mathit{rec}_i(t))</math>

'''Mean Reciprocal Rank (MRR@K):''' similarly to HR@K, this metric also measures the probability of having ground truth recommendations among the system's predictions, but it also accounts for the rank (order) of the prediction: the closer it is to the top of the predictions, the better. For T1 we get the formula:

:<math>MRR@K = \frac{1}{U} \sum_{u=1}^{U} \sum_{i=1}^{K} \frac{\mathit{hit}(u, \mathit{rec}_i(u))}{\mathit{rank}(u, \mathit{rec}_i(u))}</math>

For T2, and again to account for the varying number of talks in the ground truth, we slightly alter the previous formula so that it is equal to 1 if all related talks occupy the top spots in the system's predictions:

:<math>MRR@K = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{\sum_{c=1}^{\mathit{related}(t)} 1/c} \sum_{i=1}^{K} \frac{\mathit{hit}(t, \mathit{rec}_i(t))}{\mathit{rank}(t, \mathit{rec}_i(t))}</math>
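Putting the T1 protocol and the two metrics together, the sketch below evaluates HR@K and MRR@K under stated assumptions: talk embeddings are already L2-normalised, the held-out interaction is the last one in each user's history, and <code>histories</code> is an illustrative data structure rather than the paper's actual one.

<pre>
import numpy as np

def evaluate_t1(histories, emb, k=10):
    """Leave-one-out HR@K and MRR@K for the user-based task (T1).

    histories : dict user_id -> list of talk indices the user favourited (>= 2)
    emb       : array (n_talks, dim) of L2-normalised talk embeddings
    """
    hits, rrs = [], []
    for user, talks in histories.items():
        held_out, train = talks[-1], talks[:-1]   # leave one interaction out

        user_vec = emb[train].mean(axis=0)        # user embedding = mean of history
        scores = emb @ user_vec
        scores[train] = -np.inf                   # do not re-recommend seen talks

        top_k = np.argsort(-scores)[:k]
        if held_out in top_k:
            rank = int(np.where(top_k == held_out)[0][0]) + 1
            hits.append(1.0)
            rrs.append(1.0 / rank)
        else:
            hits.append(0.0)
            rrs.append(0.0)
    return np.mean(hits), np.mean(rrs)            # HR@K, MRR@K
</pre>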
====4.3. Evaluation Protocol====

The protocol is summarized in Figure 1. For each of the studied automatic annotations, we start by running our automatic annotation model (as described in Section 3). We then create a Knowledge Graph using, on one hand, the metadata provided in the dataset (each talk is labeled with a "tag" and a "theme") and, on the other hand, our automatically extracted descriptors. Once we connect all the talks using these annotations, we run a Graph Embedding method (see Section 4.4) to generate an embedding for each talk in the dataset. These embeddings then serve as representations that we can use to measure similarities for both T1 and T2.

====4.4. Choice of embeddings====

Throughout the experiments section, we generate a graph connecting the talks and their annotations, and then compute node embeddings for each talk in our dataset. While this choice is important for the overall performance of the final recommendation system, our focus in this paper is to demonstrate the utility of automatic annotations for improving content recommendation.

To bypass the need to select a proper graph embedding technique, and the expensive hyperparameter finetuning that goes with it for each experiment, we simulate an ideal scenario where we start from the KG containing the talks and their manually annotated metadata from the original TED dataset, i.e. tags and themes. This allows us to create a Knowledge Graph that does not contain any noisy or extraneous annotations. We compute the node embeddings for each talk using a selection of embedding algorithms contained in the Pykg2vec package [24] (https://github.com/Sujit-O/pykg2vec), a Python library for learning representations of entities and relations in Knowledge Graphs using state-of-the-art models. We finetune each representation using a small grid-search optimization over learning rate, embedding size and number of training epochs. We also add the One-hot encoding of each talk (each talk is represented by a binary vector which encodes the presence or absence of each tag and theme in the metadata) to see if there is an advantage to using graph embeddings over a simple flat representation of the nodes, i.e. whether the graph embeddings encode some semantics between the annotations that a simple binary representation cannot pick up on (e.g. the presence of one tag may be related to some other tag/theme; in other words, the annotations are not mutually orthogonal). We report the results in Tables 1 and 2, for T1 and T2, respectively.

Table 1: The best performance of different embedding methods on T1
 Embedding method | HIT@10 | MRR@10
 ConvE    | 0.0183 | 0.0062
 DistMult | 0.0088 | 0.0030
 NTN      | 0.0533 | 0.0192
 Rescal   | 0.0112 | 0.0031
 TransD   | 0.0765 | 0.0315
 TransE   | 0.0663 | 0.0258
 TransH   | 0.0678 | 0.0251
 TransM   | 0.0691 | 0.0268
 TransR   | 0.0641 | 0.0234
 One-hot  | 0.0661 | 0.0256

Table 2: The best performance of different embedding methods on T2
 Embedding method | HIT@10 | MRR@10
 ConvE    | 0.0163 | 0.0094
 DistMult | 0.0176 | 0.0099
 NTN      | 0.1244 | 0.0720
 Rescal   | 0.0143 | 0.0083
 TransD   | 0.2403 | 0.1542
 TransE   | 0.2270 | 0.1352
 TransH   | 0.2182 | 0.1309
 TransM   | 0.2219 | 0.1316
 TransR   | 0.1910 | 0.1123
 One-hot  | 0.2215 | 0.1293

From these tables of results, we make the following observations:

* Over the studied configurations of hyperparameters, models generally have the same ranking in performance whether used on T1 or T2, i.e. models which perform well on one task tend to perform well on the other. This means that whatever properties an embedding method has, they seem to translate similarly to both tasks. The poor performance of some methods may be due to their high sensitivity to hyperparameter finetuning.
* Over the studied configurations of hyperparameters, translation-based methods perform the best empirically, with TransD [25] performing the best (by quite a margin) in both sets of experiments. While further experiments may be needed to determine how much this performance is due to the nature of the dataset (size, sparsity, etc.) and the task itself, for our experiments we take this model as our embedding method of choice (with a learning rate of 0.001, embedding and hidden size of 300, trained for 1000 epochs; the other hyperparameters are left at their default values).
* One-hot node embeddings perform well on both tasks, which shows that, on clean, controlled, human-annotated metadata, simple exact matching of metadata is good enough to produce good results. The fact that TransD outperforms One-hot embeddings even in this setting shows that the graph embeddings capture some semantics beyond exact matching, i.e. they learn latent relations between the tags and themes, which ultimately justifies the use of graph embeddings (a minimal sketch of the one-hot baseline is given after this list for comparison).
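As referenced above, this is a minimal sketch of the flat One-hot baseline, assuming each talk comes with a list of its tag and theme labels; <code>MultiLabelBinarizer</code> and the label format are illustrative choices, not the authors' stated implementation.

<pre>
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

def one_hot_baseline(talk_annotations):
    """Flat binary representation of each talk built from its tags and themes.

    talk_annotations : list of label lists, e.g. [["tag:technology", "theme:AI"], ...]
    """
    mlb = MultiLabelBinarizer()
    one_hot = mlb.fit_transform(talk_annotations)  # (n_talks, n_distinct_labels)

    # Talks sharing more tags/themes get a higher cosine similarity; unlike the
    # graph embeddings, no latent relation between different labels is captured.
    return cosine_similarity(one_hot)
</pre>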
====4.5. Automatic annotations====

In this section, we observe the performance gain of the different automatic enrichment methods we introduced in Section 3.

=====4.5.1. Topic Modeling=====

In Table 3, we report the results of adding the output of the topic modeling annotations to the KG. We evaluate the results as we vary two parameters: the number of topics, and the cutoff threshold (the confidence score above which we assign a talk to a given topic).

Table 3: Results of enriching the metadata KG with Topic nodes, varying the number of topics and the cutoff threshold
 T1
 # topics | Threshold | HIT@10 | MRR@10
 No topics added |      | 0.0765 | 0.0315
 10  | 0.03 | 0.0612 | 0.0246
 10  | 0.3  | 0.0629 | 0.0262
 40  | 0.03 | 0.0769 | 0.0317
 40  | 0.3  | 0.0782 | 0.0326
 100 | 0.03 | 0.0562 | 0.0220
 100 | 0.3  | 0.0606 | 0.0230
 T2
 # topics | Threshold | HIT@10 | MRR@10
 No topics added |      | 0.2403 | 0.1542
 10  | 0.03 | 0.2096 | 0.033
 10  | 0.3  | 0.2135 | 0.1294
 40  | 0.03 | 0.2365 | 0.1623
 40  | 0.3  | 0.2475 | 0.1716
 100 | 0.03 | 0.1921 | 0.1196
 100 | 0.3  | 0.2074 | 0.1226

From this small sample of hyperparameter values, we see that both the number of topics and the cutoff threshold impact the performance of the recommendation on both tasks. Performance improves when raising the cutoff threshold, which implies that only assigning topics to talks when the topic model is highly confident decreases the noisy relations in the graph and decreases the risk of accidentally connecting nodes that are not really topically similar. We also note that under the right configuration, we improve the performance on both metrics for both tasks, whereas in most other configurations the performance suffers. As for the number of topics, one should find a value that befits the studied corpus, as the value 40 (inspired by the ground truth number of themes in the dataset) seems to give the best results.

Topic modeling is a task that is generally very sensitive to the initial hyperparameters and subject to inherent stochasticity, which means that, with enough experiments, it is likely that one can find a configuration of hyperparameters (not only the number of topics and the cutoff threshold, but also model-specific hyperparameters such as LDA's alpha and beta) that yields an even better improvement over the reported results.

=====4.5.2. Named Entity Recognition=====

In Table 4, we report the results of adding the output of the Named Entity Recognition annotations to the KG. We evaluate the results as we switch between keeping all entities we extracted in the KG and keeping only those that appear with a high enough frequency: in our case, we only add nodes for entities that are mentioned more than 10 times in the corpus.

Table 4: Results of enriching the metadata KG with Named Entity nodes, varying the number of filtered entities
 T1
 # mentions | HIT@10 | MRR@10
 No NEs added          | 0.0765 | 0.0315
 All NEs added         | 0.0776 | 0.0304
 More than 10 mentions | 0.0808 | 0.0314
 T2
 # mentions | HIT@10 | MRR@10
 No NEs added          | 0.2403 | 0.1542
 All NEs added         | 0.2435 | 0.1548
 More than 10 mentions | 0.2575 | 0.1908

From these results, we see that adding NEs improves the results of the recommender system, especially after removing rarely appearing Named Entities (either erroneous or superfluous mentions). We also notice that MRR increases significantly with this addition for T2, suggesting that the Named Entities are strong indicators of content relatedness.

=====4.5.3. Keywords Extraction=====

In Table 5, we report the results of adding the output of the Keyword Extraction to the KG. We evaluate the results as we add either all extracted keywords or only the ones to which the keyword extraction model assigned a high enough confidence score. In our experiment, a confidence threshold of 0.3 has been chosen.

Table 5: Results of enriching the metadata KG with Keyword nodes, varying the confidence threshold
 T1
 Confidence | HIT@10 | MRR@10
 No KWs added         | 0.0765 | 0.0315
 All KWs added        | 0.0732 | 0.0295
 Only with conf > 0.3 | 0.0772 | 0.0322
 T2
 Confidence | HIT@10 | MRR@10
 No KWs added         | 0.2403 | 0.1542
 All KWs added        | 0.2398 | 0.1523
 Only with conf > 0.3 | 0.2494 | 0.1593

=====4.5.4. Combining annotations=====

In Table 6, we summarize the results from the previous experiments, and we see that adding the best configuration from each experimental setting into one KG further improves the results.

Table 6: Results on both recommendation tasks with all the different annotations added to the KG
 T1
 Annotation | HIT@10 | MRR@10
 No annotations added | 0.0765 | 0.0315
 Topics               | 0.0782 | 0.0326
 Named Entities       | 0.0808 | 0.0314
 Keywords             | 0.0772 | 0.0322
 All                  | 0.0854 | 0.0355
 T2
 Annotation | HIT@10 | MRR@10
 No annotations added | 0.2403 | 0.1542
 Topics               | 0.2475 | 0.1716
 Named Entities       | 0.2575 | 0.1908
 Keywords             | 0.2494 | 0.1593
 All                  | 0.2613 | 0.1584

We observe that the automatic annotations overall improve the performance on purely content-based recommendations (T2) but, surprisingly, they do so even for user preference-based ones (T1), although the overall performance on T1 remains significantly lower. One could argue that this is because users are usually interested in content similar to what they watched previously (in other words, all recommendation tasks are partially content-based). There is a possibility, however, that users are likely to click on the suggested videos in the "related" section, which creates a dependence between the two tasks that is impossible to untangle. This is beyond the scope of this paper, but it would be interesting to study the feedback loop of recommendation in such a setting. Finally, the results suggest that Named Entity Recognition contributes the most to the overall performance improvement of the system, as its scores are the closest to those of the full combination and it even gives a better absolute MRR score on T2.

===5. Conclusion and future work===

In this work, we showed how combining the knowledge extracted automatically using Information Extraction techniques with the representational power of KGs and their embeddings can improve the performance of content-based media Recommender Systems without requiring any supervision or external data collection, as we demonstrated a clear performance improvement as measured on two tasks: making recommendations based on manually curated recommendations, and based on actual users' interaction history. Our results are reproducible using the code published at https://github.com/D2KLab/ka-recsys.

With these promising results showing actual improvement over relying only on human annotation, there are multiple paths for further exploration. First, other techniques from the Information Extraction literature can be investigated, such as entity linking, aspect extraction and concept mining, with more exploration to be done on the techniques already presented (i.e. experimenting with other approaches for Topic Modeling, Named Entity Recognition and Keyword Extraction). What's more, as shown experimentally, the results can vary depending on how these automatic annotations are processed and filtered (thus changing the structure of the generated KG), which calls for further study of how to balance the quantity of automatic annotations against the noise that necessarily comes with them. Another direction of work is to further explore models that go beyond simple graph embeddings. We should also consider combining the results of such annotations with the original textual content, as our early experiments suggest that combining both the low-level features (text embeddings) and the high-level ones (graph embeddings) improves further upon the performance. Furthermore, as these extracted annotations live on a KG, multiple methods in the direction of Explainable Recommendations can be explored in tandem. Finally, we would like to test this approach on other datasets to see if it can be as successful on other content-centric recommendation problems.

===Acknowledgment===

This work has been partially supported by the French National Research Agency (ANR) within the ANTRACT project (ANR-17-CE38-0010), and by the European Union's Horizon 2020 research and innovation program within the MeMAD project (GA 780069).

===References===

[1] D. Kotkov, S. Wang, J. Veijalainen, A survey of serendipity in recommender systems, Knowledge-Based Systems 111 (2016) 180–192. URL: https://www.sciencedirect.com/science/article/pii/S0950705116302763.

[2] M. Kunaver, T. Požrl, Diversity in recommender systems – a survey, Knowledge-Based Systems 123 (2017) 154–162. URL: https://www.sciencedirect.com/science/article/pii/S0950705117300680.

[3] Y. Zhang, X. Chen, Explainable recommendation: A survey and new perspectives, Found. Trends Inf. Retr. 14 (2020) 1–101.

[4] N. Pappas, A. Popescu-Belis, Combining content with user preferences for TED lecture recommendation, in: 11th International Workshop on Content-Based Multimedia Indexing (CBMI), 2013, pp. 47–52.

[5] J. B. Schafer, D. Frankowski, J. Herlocker, S. Sen, Collaborative Filtering Recommender Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 291–324.

[6] N. Pappas, A. Popescu-Belis, Sentiment analysis of user comments for one-class collaborative filtering over TED talks, in: 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2013, pp. 773–776.
[7] A. Merchant, N. Singh, Hybrid trust-aware model for personalized top-N recommendation, in: Fourth ACM IKDD Conference on Data Sciences, Association for Computing Machinery, 2017.

[8] N. Pappas, A. Popescu-Belis, Combining content with user preferences for non-fiction multimedia recommendation: a study on TED lectures, Multimedia Tools and Applications 74 (2013) 1175–1197.

[9] R. Sun, X. Cao, Y. Zhao, J. Wan, K. Zhou, F. Zhang, Z. Wang, K. Zheng, Multi-Modal Knowledge Graphs for Recommender Systems, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1405–1414. URL: https://doi.org/10.1145/3340531.3411947.

[10] M. de Gemmis, P. Lops, C. Musto, F. Narducci, G. Semeraro, Semantics-Aware Content-Based Recommender Systems, Springer US, Boston, MA, 2015, pp. 119–159.

[11] G. A. Miller, WordNet: A lexical database for English, Commun. ACM 38 (1995) 39–41. URL: https://doi.org/10.1145/219717.219748.

[12] Q. Guo, F. Zhuang, C. Qin, H. Zhu, X. Xie, H. Xiong, Q. He, A survey on knowledge graph-based recommender systems, 2020. URL: https://arxiv.org/abs/2003.00911.

[13] Y. Cao, X. Wang, X. He, Z. Hu, C. Tat-seng, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preference, in: WWW, 2019. URL: https://arxiv.org/abs/1906.04239.

[14] H. Cai, V. Zheng, K. Chang, A comprehensive survey of graph embedding: Problems, techniques, and applications, IEEE Transactions on Knowledge and Data Engineering 30 (2018) 1616–1637.

[15] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[16] F. Bianchi, S. Terragni, D. Hovy, Pre-training is a hot topic: Contextualized document embeddings improve topic coherence, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online, 2021, pp. 759–766. URL: https://aclanthology.org/2021.acl-short.96.

[17] T. Tian, Z. F. Fang, Attention-based autoencoder topic model for short texts, Procedia Computer Science 151 (2019) 1134–1139. URL: https://www.sciencedirect.com/science/article/pii/S1877050919306283. doi:10.1016/j.procs.2019.04.161.

[18] I. Harrando, P. Lisena, R. Troncy, Apples to apples: A systematic evaluation of topic models, in: RANLP, volume 260, 2021, pp. 488–498.

[19] I. Yamada, A. Asai, H. Shindo, H. Takeda, Y. Matsumoto, LUKE: Deep contextualized entity representations with entity-aware self-attention, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6442–6454. URL: https://aclanthology.org/2020.emnlp-main.523.

[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423.

[21] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python, 2020. URL: https://doi.org/10.5281/zenodo.1212303.

[22] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT, 2020. URL: https://doi.org/10.5281/zenodo.4461265.

[23] S. Rendle, Factorization machines, in: IEEE International Conference on Data Mining, 2010, pp. 995–1000.

[24] S. Y. Yu, S. Rokka Chhetri, A. Canedo, P. Goyal, M. A. Al Faruque, Pykg2vec: A Python library for knowledge graph embedding, 2019.

[25] G. Ji, S. He, L. Xu, K. Liu, J. Zhao, Knowledge graph embedding via dynamic mapping matrix, in: ACL, 2015.