=Paper=
{{Paper
|id=Vol-2960/paper16
|storemode=property
|title=Improving Media Content Recommendation with Automatic Annotations (Long paper)
|pdfUrl=https://ceur-ws.org/Vol-2960/paper16.pdf
|volume=Vol-2960
|authors=Ismail Harrando,Raphael Troncy
|dblpUrl=https://dblp.org/rec/conf/recsys/HarrandoT21
}}
==Improving Media Content Recommendation with Automatic Annotations (Long paper)==
Ismail Harrando, Raphaël Troncy (EURECOM, France)
ismail.harrando@eurecom.fr, raphael.troncy@eurecom.fr
ORCID: 0000-0002-3593-4490 (I. Harrando), 0000-0003-0457-1436 (R. Troncy)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, 27 September – 1 October 2021, Amsterdam, Netherlands

'''Abstract.''' With the immense growth of media content production on the internet and increasing wariness about privacy, content-based recommendation systems offer the possibility of promoting media to users (e.g. posts, videos, podcasts) based solely on a representation of the content, i.e. without using any user-related data such as views and, more generally, interactions between users and items. In this work, we study the potential of using off-the-shelf automatic annotation tools from the Information Extraction literature to improve recommendation performance without any extra cost of training, data collection or annotation. We experiment with how these annotations can improve recommendations on two tasks: the traditional user history-based recommendation, as well as a purely content-based recommendation evaluation. We pair these automatic annotations with the manually created metadata and we show that Knowledge Graphs, through their embeddings, constitute a great modality to seamlessly integrate this extracted knowledge and provide better recommendations. The evaluation code, as well as the enrichment generation, is available at https://github.com/D2KLab/ka-recsys.

'''Keywords.''' Recommender Systems, Content-based Recommendation, Knowledge Graph, Automatic Annotation

===1. Introduction===

As user engagement with content online – i.e. retaining a user's interest in the provided content and maximizing their time watching, reading or listening to it – has become a crucial element of most if not all content-providing multimedia platforms, the role of recommender systems in shaping and improving the user experience cannot be overstated: they help funnel the usually overwhelming amount of data into a condensed, targeted and interesting selection of items that the user is most likely to find enjoyable.

Traditionally, recommendation systems either use collaborative filtering, i.e. leveraging user statistics and their implicit/explicit feedback (views, likes, watch time) to find items to recommend (the underlying assumption being that people who have similar interests interact with the same items), or provide content-based recommendations, which rely on the content of the item itself to find similar content without any input from the user. Content-based recommendations are particularly interesting in the case of the cold start problem, where there is no feedback from users (no interactions to base the recommendations on), and in cases where it is hard to collect such feedback (anonymity, privacy).

In this paper, we are interested in the second kind of recommendations, which are based solely on the content of the media to recommend. The "content" in content-based can refer to a variety of potential formats: text, image, video, metadata (e.g. tags and keywords) and so on. Typically, a representation of such content is extracted or learned, and the task of recommendation is then cast as a content similarity/retrieval task: given the representation of an item of interest (e.g. the video the user is currently watching) and the representations of all items already existing in the catalog, we want to find the items which have the highest similarity to the item of interest. Many varieties of this approach exist (ones that target other metrics such as serendipity [1], diversity [2] and explainability [3]) and may formulate the problem differently, but at its core the task can be framed as finding the best content representation that allows uncovering a meaningful measure of similarity.
We posit in this paper that the use of Knowledge Graphs (KGs), both created using item metadata and automatically generated from the given content, can improve the task of media recommendation. Instead of relying only on the content, we leverage several Information Extraction techniques to extract high-level descriptors that allow the automatic creation of metadata, which can then be used to generate a KG connecting all content in the media catalog. Given the versatility of Knowledge Graphs, they allow us to combine these automatic annotations with already existing metadata seamlessly. To validate this approach, we focus on studying the TED dataset [4], an open-sourced multimedia dataset that offers the unique possibility of evaluating recommendations based on both the content only ("related videos", as curated by human editors) and the user preferences based on their interactions history. We demonstrate that our approach improves the recommendation performance on both tasks, and that KGs are a reliable framework to integrate external knowledge into the task of recommendation.

===2. Related Work===

'''The TED Dataset.''' The TED dataset [4] is a multimodal dataset which contains the audiovisual recordings of the TED talks downloaded from the official website (https://www.ted.com), which sums up to 1149 talks, alongside metadata fields and user profiles with rating and commenting interactions. The metadata fields are as follows: identifier, title, description, speaker name, TED event at which the talk is given, transcript, publication date, filming date, and number of views.
For nearly every video, the dataset contains a list of user interactions (marked by the action of "Adding to favorites"), as well as up to three "related videos", which are picked by the editorial staff to be recommended to the user to watch next. What is unique about this dataset is that it provides two sorts of ground truths for the recommender system use-case, which we can formulate as these two tasks:

* Task 1 - Personalized (user-specific) recommendations: based on a user's list of favorite talks, the task is to predict what they would watch next. An evaluation set can thus be created using a "leave one out" protocol, i.e. removing one interaction from the user's list of favorites and measuring how successful a method is in predicting the omitted item. Most recommender system datasets contain similar information, i.e. what items a user has actually interacted with in reality, based on their viewing/interaction history. This task is usually handled with collaborative filtering methods (e.g. [5]), but is still interesting for content-based recommendation in the case of the cold start problem: when a new talk is added to the platform, how can we recommend it to other users? The most common approach is to use its content to recommend it to users who previously liked similar content.

* Task 2 - General (content-based) recommendations: to the best of our knowledge, this is the only dataset which offers ground truth for multimedia recommendations based on content only, which are referred to as "related videos", manually annotated by TED editorial staff. These are supposed to reflect subjective topical relatedness between talks in the corpus. Performance on this task reflects the model's ability to recommend content to either users without an interactions history (new users, visitors without accounts) or new videos (that have not yet received any interactions). We note that in the ground truth, some talks are associated with three related talks, some with two, and some with only one. We account for this in the evaluation metrics.

Previous works have studied specific aspects of this dataset such as sentiment analysis [6], estimating trust from comments polarity and ratings to improve recommendation [7], or studying hybrid recommender systems [8]. In this work, we focus our interest on this dataset as it offers a unique possibility of evaluating content-based recommendation using both real user feedback and hand-picked recommendations, as the latter has not been considered in any of the published works on this dataset to the best of our knowledge.

We also note that, while the dataset is multimodal (TED Talks videos are also available), our work does not tackle visual information extraction, mainly because TED Talks are not visually diverse (mostly speakers and audience wide shots). This is however a promising direction of work that has been tackled previously [9].

'''Graph-based Recommender Systems.''' Given the recent growing interest in Knowledge Graphs and their applications, there is a growing literature on the techniques and models that can be leveraged to build "knowledge-aware" recommender systems. [10] present such an approach to bring external knowledge to content-based recommendation, identifying two main families of what they call "Semantics-aware Recommender Systems" to tackle traditional problems of content-based recommender systems: Top-down Approaches, which incorporate knowledge from ontological resources such as WordNet [11] and encyclopedic knowledge sources such as Wikipedia (https://en.wikipedia.org/wiki/Main_Page) to enrich the item representations with external world and linguistic knowledge, and Bottom-up Approaches, which use linguistic resources such as what we commonly refer to as distributional word representations, e.g. using pretrained word embeddings to avoid the issue of exact matching in traditional content-based systems. They also raise the potential use of a graph structure to discover latent connections among items, which we study in our experiments. [12] offers an extensive survey of Knowledge Graph-based Recommender System approaches, proposing a high-level taxonomy of methods that either use graph embeddings, connectivity patterns (common paths mining), or combine the two. In this paper, we only focus on embedding-based methods to study the effect of automatic annotations on the performance of recommender systems. Additionally, unlike some previous works, our work does not tackle the two tasks jointly as a learning problem [13], but attempts to show how the same approach can improve the performance on both at the same time.
Figure 1 (illustration): High-level illustration of the approach: we start by extracting annotations from the video transcript using off-the-shelf Information Extraction tools, which we combine with manual annotations to create a Knowledge Graph, where the talks and the annotations are nodes, connected with the corresponding semantic relation. Using this graph structure, we can generate continuous fixed-dimensional representations using a Graph Embedding technique, which we can later use to measure content similarity for recommendation.

===3. Approach===

The proposed approach builds on using several Information Extraction techniques, namely Topic Modeling (3.1), Named Entity Recognition (3.2), and Keyword Extraction (3.3), to generate high-level descriptors – annotations – of the content of each video in the dataset. Once the annotations are generated for each video, we use them to build a Knowledge Graph connecting the talks by their annotations. This approach also allows us to integrate external metadata if such metadata is available (for our dataset, metadata such as "Tags" and "Themes" are available and will be used). Once the KG is generated, we can use a graph embedding method [14] to generate a fixed-dimensional embedding for each video in the dataset, such that videos having similar annotations are represented in proximity in the embedding space. As a result, we can measure the (cosine) similarity between any two videos' embeddings as a proxy for their relatedness. The approach is illustrated in Figure 1. We present a selection of automatic annotation techniques and how they are used in our approach in the following subsections.
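To make the last step concrete, the following is a minimal Python sketch (not the authors' exact implementation) of how recommendations can be produced once an embedding has been computed for every talk, whatever its origin: normalise the vectors, compute cosine similarities, and return the most similar talks. The names <code>talk_ids</code>, <code>embeddings</code> and <code>query_idx</code> are illustrative.

<pre>
import numpy as np

def recommend(talk_ids, embeddings, query_idx, k=10):
    """Return the ids of the k talks most similar to the query talk.

    talk_ids   : list of talk identifiers
    embeddings : array of shape (n_talks, dim), one vector per talk
    query_idx  : index of the talk we want recommendations for
    """
    # L2-normalise so that a dot product equals the cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    scores = unit @ unit[query_idx]   # cosine similarity to every talk
    scores[query_idx] = -np.inf       # never recommend the query itself

    top = np.argsort(-scores)[:k]     # indices of the k highest scores
    return [talk_ids[i] for i in top]
</pre>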
====3.1. Topic Modeling====

Topic modeling is a ubiquitously used Information Extraction technique which attempts to find the latent topics in a text corpus. A topic can be roughly defined as a coherent set of vocabulary words that tend to co-appear with high probability in the same documents. When applied to documents of natural language, topic models have the ability to find the underlying "themes" in the document collection, such as sport, technology, etc.

The literature on topic modeling is rich and diverse, ranging from approaches relying solely on word counts, such as the commonly used LDA [15], to approaches using state-of-the-art representations to place documents in more meaningful representational spaces [16, 17]. Topics are usually represented by their "top N words" (the N words most likely to appear given a topic). In our dataset, we find topics such as:

* Technology: network, online, computers, digital, google
* Environment: waste, plants, electrical, plastic, battery
* Gaming: games, online, virtual, gamers, penalty
* Health: aids, malaria, drugs, mortality, vaccine

For our experiments, we use LDA as it is still commonly used and offers simple yet competitive performance [18]. We test two aspects of topic modeling that can influence the structure of the graph (the number of nodes and relations added): the number of topics (i.e. the number of topic nodes in the final KG), and the cutoff threshold reflecting the topic model's confidence in assigning a given topic to a given talk (which affects the number of relations to topic nodes). We report the results in Section 4. For a better performance of the topic modeling task, we preprocess our dataset as follows (a code sketch of this step is given after the list):

# Lowercase all words
# Remove short words (less than 3 characters)
# Remove punctuation
# Remove the most frequent words (top 1%)
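The sketch below shows one plausible way to produce such topic annotations with scikit-learn's LDA implementation. It is an illustration under assumptions rather than the paper's exact pipeline: the authors do not name their LDA library, and <code>max_df</code> is only a rough stand-in for the "remove the top 1% most frequent words" step.

<pre>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def extract_topic_annotations(transcripts, n_topics=40, threshold=0.3):
    """Return (topic -> top words, talk index -> assigned topics)."""
    # Lowercasing, punctuation removal and the 3-character minimum are handled
    # by the token pattern; max_df drops the most document-frequent terms.
    vectorizer = CountVectorizer(lowercase=True,
                                 token_pattern=r"(?u)\b[a-z]{3,}\b",
                                 max_df=0.99)
    counts = vectorizer.fit_transform(transcripts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)      # shape: (n_talks, n_topics)

    vocab = vectorizer.get_feature_names_out()
    top_words = {t: [vocab[i] for i in comp.argsort()[-5:][::-1]]
                 for t, comp in enumerate(lda.components_)}

    # Only keep topic assignments above the confidence cutoff
    talk_topics = {d: [t for t in range(n_topics) if doc_topics[d, t] > threshold]
                   for d in range(len(transcripts))}
    return top_words, talk_topics
</pre>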
====3.2. Named Entity Recognition====

Named Entity Recognition is the task of extracting, from unstructured text, terms or phrases that refer to named entities, i.e. real-world objects that have proper names and belong to one of several classes: persons, places, organizations, etc. Once extracted, these Named Entities can be used as high-level descriptors for a text content. For example, if two talks mention "Einstein" and "Newton", they may have a similar topic. While this task used to rely on grammatical and hand-crafted features to designate what constitutes a Named Entity (e.g. starts with a capital letter), modern systems do without such hand-crafted features [19, 20] and instead combine the learning power of neural networks with annotated corpora of Named Entities.

In our experiments, we use spaCy's [21] NER model, which uses an architecture that combines a word embedding strategy using sub-word features with a deep convolutional neural network with residual connections, and is "designed to give a good balance of efficiency, accuracy and adaptability" (https://spacy.io/universe/project/video-spacys-ner-model). For our experiments, we keep the Named Entities belonging to the following classes: 'PERSON', 'LOC' (location), 'ORG' (organization), 'GPE' (geopolitical entity), 'FAC' (facility), 'PRODUCT', and 'WORK_OF_ART'. We also experiment with the impact of keeping all extracted Named Entities or filtering some out based on frequency, thus altering the number of nodes added to the graph and their relations to the existing talks. We report the results in Section 4.
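A minimal sketch of this annotation step with spaCy is shown below. The paper confirms the use of spaCy's NER and the kept entity classes; the specific pipeline name (<code>en_core_web_sm</code>) and the helper structure are assumptions made for illustration.

<pre>
from collections import Counter
import spacy

KEPT_LABELS = {"PERSON", "LOC", "ORG", "GPE", "FAC", "PRODUCT", "WORK_OF_ART"}

def extract_entity_annotations(transcripts, min_mentions=10):
    """Extract Named Entities per transcript, keeping only frequent ones."""
    # Any pretrained English pipeline with an NER component would do here
    nlp = spacy.load("en_core_web_sm")

    per_talk, counts = [], Counter()
    for doc in nlp.pipe(transcripts):
        ents = [ent.text for ent in doc.ents if ent.label_ in KEPT_LABELS]
        per_talk.append(set(ents))
        counts.update(ents)            # raw mention counts over the whole corpus

    # Drop rarely mentioned entities (often erroneous or superfluous mentions)
    frequent = {e for e, c in counts.items() if c > min_mentions}
    return [ents & frequent for ents in per_talk]
</pre>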
====3.3. Keyword Extraction====

Similarly to the two previous tasks, Keyword Extraction is the process of extracting terms or phrases that summarize, on a high level, the core themes of a textual document. Generally, the keywords (sometimes called tags) are the terms or phrases that are explicitly mentioned in the text with a high frequency or are otherwise relevant to a big portion of it.

For our experiments, we use KeyBERT [22], an off-the-shelf keyword extractor that is based on BERT [20], which extracts keywords by first finding the frequent n-grams and then measuring the similarity between their embedding and the embedding of the whole document. We experiment with keeping all keywords or filtering out rare ones and report the results in Section 4.
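The following sketch shows how such keyword annotations could be produced with KeyBERT. The confidence threshold of 0.3 comes from the experiments reported later; the n-gram range and <code>top_n</code> value are illustrative assumptions.

<pre>
from keybert import KeyBERT

def extract_keyword_annotations(transcripts, min_score=0.3, top_n=10):
    """Extract keyword annotations per transcript, keeping confident ones only."""
    kw_model = KeyBERT()   # uses a default sentence-transformers model under the hood

    annotations = []
    for text in transcripts:
        # Each result is a (keyword, similarity-to-document score) pair
        keywords = kw_model.extract_keywords(text,
                                             keyphrase_ngram_range=(1, 2),
                                             stop_words="english",
                                             top_n=top_n)
        annotations.append([kw for kw, score in keywords if score > min_score])
    return annotations
</pre>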
===4. Experiments and Results===

In this section, we explain the experimental protocol and describe the results for the different experiments done to study the impact of using automatic annotations on recommendation performance. We first reintroduce the dataset and how it is going to be used in the rest of this section. Then, we define the metrics we use to measure this performance (Hit Rate and Mean Reciprocal Rank) and the embedding method to use for the rest of the experiments. For each automatic annotation considered (i.e. Topics, Named Entities and Keywords), we consider several configurations, with and without the addition of the original metadata from the dataset. Finally, we observe the potential of combining the resulting automatically generated graph embeddings with the textual embeddings of the content, and show how the two complement each other to push the performance even higher.

====4.1. Dataset====

As mentioned previously, the TED Talks dataset has two versions of ground truths (or prediction tasks) for recommendation, namely:

* User-specific recommendations that are based on actual users' interactions history (henceforth referred to as T1)
* Content-based recommendations, which are hand-picked by editors for each talk (henceforth referred to as T2)

For our evaluation purposes, and to unify the evaluation for both tasks, we proceed as follows:

* For T1, we create a test split using the leave-one-out protocol that is commonly used in the literature [23], thus having a "training" set which contains all but one talk that the user interacted with (a user has to have at least two interactions, otherwise they are dropped). We create a user embedding by averaging the computed embeddings of all talks in the training set. The top recommendations are then generated by taking the talks which have the highest similarity score (in the same KG embedding space) to the user embedding. We note that no actual training takes place, but this method allows us to leverage actual "historical" user behavior to evaluate purely content-based recommendation.
* For T2, we consider all "related videos" as a test set. In other words, for each talk, we compute its similarity to all other talks in the dataset, and we recommend the talks which score the highest.

====4.2. Metrics====

To evaluate the performance of our method, we use two commonly used metrics in the recommender systems literature. In the following paragraphs, T is the number of talks in the dataset, U is the number of users with at least 2 interactions in their history, K is the number of (ordered) model recommendations to consider (we picked K = 10 in our results), t is a talk ID (which maps to its embedding), u is a user ID (which maps to its embedding, i.e. the average of the embeddings of all talks in the user's history), rec_i(x) is the i-th recommendation by our model (x being a user ID for T1 and a talk ID for T2), hit(x, j) = 1 if the talk j is indeed in the ground truth for x and 0 otherwise, related(t) is the number of related talks in T2 (which can be 1, 2 or 3), and rank(x, j) is the rank of talk j in the suggested recommendations for talk/user x, by descending similarity score.

'''Hit Rate (HR@K):''' a simple metric quantifying the probability of an item in the ground truth being among the top-K suggestions produced by the system. For T1, this means that the left-out item from the user history must be among the K most similar talks to the user embedding (as defined above). For T2, this means that the talk that was manually picked by editors is among the K most similar talks in the embedding space. For T1 we get the formula:

:<math>HR@K = \frac{1}{U} \sum_{u=1}^{U} \sum_{i=1}^{K} \mathit{hit}(u, \mathit{rec}_i(u))</math>

For T2, we normalize the counting of hits to account for the varying number of talks in the ground truth, so that the Hit Rate is 1 at best (i.e. when all related talks in the ground truth are included in the system's recommendations):

:<math>HR@K = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{\mathit{related}(t)} \sum_{i=1}^{K} \mathit{hit}(t, \mathit{rec}_i(t))</math>

'''Mean Reciprocal Rank (MRR@K):''' similarly to HR@K, this metric also measures the probability of having ground truth recommendations among the system's predictions, but it also accounts for the rank (order) of the prediction: the closer it is to the top of the predictions, the better. For T1 we get the formula:

:<math>MRR@K = \frac{1}{U} \sum_{u=1}^{U} \sum_{i=1}^{K} \frac{\mathit{hit}(u, \mathit{rec}_i(u))}{\mathit{rank}(u, \mathit{rec}_i(u))}</math>

For T2, and again to account for the varying number of talks in the ground truth, we slightly alter the previous formula so that it is equal to 1 if all related talks occupy the top spots in the system's predictions:

:<math>MRR@K = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{\sum_{c=1}^{\mathit{related}(t)} 1/c} \sum_{i=1}^{K} \frac{\mathit{hit}(t, \mathit{rec}_i(t))}{\mathit{rank}(t, \mathit{rec}_i(t))}</math>
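Putting the T1 protocol and the two metrics together, the sketch below evaluates HR@K and MRR@K under stated assumptions: talk embeddings are already L2-normalised, the held-out interaction is the last one in each user's history, and <code>histories</code> is an illustrative data structure rather than the paper's actual one.

<pre>
import numpy as np

def evaluate_t1(histories, emb, k=10):
    """Leave-one-out HR@K and MRR@K for the user-based task (T1).

    histories : dict user_id -> list of talk indices the user favourited (>= 2)
    emb       : array (n_talks, dim) of L2-normalised talk embeddings
    """
    hits, rrs = [], []
    for user, talks in histories.items():
        held_out, train = talks[-1], talks[:-1]   # leave one interaction out

        user_vec = emb[train].mean(axis=0)        # user embedding = mean of history
        scores = emb @ user_vec
        scores[train] = -np.inf                   # do not re-recommend seen talks

        top_k = np.argsort(-scores)[:k]
        if held_out in top_k:
            rank = int(np.where(top_k == held_out)[0][0]) + 1
            hits.append(1.0)
            rrs.append(1.0 / rank)
        else:
            hits.append(0.0)
            rrs.append(0.0)
    return np.mean(hits), np.mean(rrs)            # HR@K, MRR@K
</pre>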
====4.3. Evaluation Protocol====

The protocol is summarized in Figure 1. For each of the studied automatic annotations, we start by running our automatic annotation model (as described in Section 3). We then create a Knowledge Graph using, on one hand, the metadata provided in the dataset (each talk is labeled with a "tag" and a "theme") and, on the other hand, our automatically extracted descriptors. Once we connect all the talks using these annotations, we run a Graph Embedding method (see Section 4.4) to generate an embedding for each talk in the dataset. These embeddings then serve as representations that we can use to measure similarities for both T1 and T2.

====4.4. Choice of embeddings====

Throughout the experiments section, we generate a graph connecting the talks and their annotations, and then compute node embeddings for each talk in our dataset. While this choice is important for the overall performance of the final recommendation system, our focus in this paper is to demonstrate the utility of automatic annotations for improving content recommendation.

To bypass the need to select a proper graph embedding technique, and the expensive hyperparameter finetuning that goes with it for each experiment, we simulate an ideal scenario where we start from the KG containing the talks and their manually annotated metadata from the original TED dataset, i.e. tags and themes. This allows us to create a Knowledge Graph that does not contain any noisy or extraneous annotations. We compute the node embeddings for each talk using a selection of embedding algorithms contained in the Pykg2vec package [24] (https://github.com/Sujit-O/pykg2vec), a Python library for learning representations of entities and relations in Knowledge Graphs using state-of-the-art models. We finetune each representation using a small grid-search optimization over learning rate, embedding size and number of training epochs. We also add the One-hot encoding of each talk (each talk is represented by a binary vector which encodes the presence or absence of each tag and theme in the metadata) to see if there is an advantage to using graph embeddings over a simple flat representation of the nodes, i.e. whether the graph embeddings encode some semantics between the annotations that a simple binary representation cannot pick up on (e.g. the presence of one tag may be related to some other tag/theme; in other words, the annotations are not mutually orthogonal). We report the results in Tables 1 and 2, for T1 and T2, respectively.

Table 1: The best performance of different embedding methods on T1
 Embedding method | HIT@10 | MRR@10
 ConvE    | 0.0183 | 0.0062
 DistMult | 0.0088 | 0.0030
 NTN      | 0.0533 | 0.0192
 Rescal   | 0.0112 | 0.0031
 TransD   | 0.0765 | 0.0315
 TransE   | 0.0663 | 0.0258
 TransH   | 0.0678 | 0.0251
 TransM   | 0.0691 | 0.0268
 TransR   | 0.0641 | 0.0234
 One-hot  | 0.0661 | 0.0256

Table 2: The best performance of different embedding methods on T2
 Embedding method | HIT@10 | MRR@10
 ConvE    | 0.0163 | 0.0094
 DistMult | 0.0176 | 0.0099
 NTN      | 0.1244 | 0.0720
 Rescal   | 0.0143 | 0.0083
 TransD   | 0.2403 | 0.1542
 TransE   | 0.2270 | 0.1352
 TransH   | 0.2182 | 0.1309
 TransM   | 0.2219 | 0.1316
 TransR   | 0.1910 | 0.1123
 One-hot  | 0.2215 | 0.1293

From these tables of results, we make the following observations:

* Over the studied configurations of hyperparameters, models generally have the same ranking in performance whether used on T1 or T2, i.e. models which perform well on one task tend to perform well on the other. This means that whatever properties an embedding method has, they seem to translate similarly to both tasks. The poor performance of some methods may be due to their high sensitivity to hyperparameter finetuning.
* Over the studied configurations of hyperparameters, translation-based methods perform the best empirically, with TransD [25] performing the best (by quite a margin) in both sets of experiments. While further experiments may be needed to determine how much this performance is due to the nature of the dataset (size, sparsity, etc.) and the task itself, for our experiments we take this model as our embedding method of choice (with a learning rate of 0.001, embedding and hidden size of 300, trained for 1000 epochs; the other hyperparameters are left at their default values).
* One-hot node embeddings perform well on both tasks, which shows that, on clean, controlled, human-annotated metadata, simple exact matching of metadata is good enough to produce good results. The fact that TransD outperforms One-hot embeddings even in this setting shows that the graph embeddings capture some semantics beyond exact matching, i.e. they learn latent relations between the tags and themes, which ultimately justifies the use of graph embeddings (a minimal sketch of the one-hot baseline is given after this list for comparison).
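As referenced above, this is a minimal sketch of the flat One-hot baseline, assuming each talk comes with a list of its tag and theme labels; <code>MultiLabelBinarizer</code> and the label format are illustrative choices, not the authors' stated implementation.

<pre>
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

def one_hot_baseline(talk_annotations):
    """Flat binary representation of each talk built from its tags and themes.

    talk_annotations : list of label lists, e.g. [["tag:technology", "theme:AI"], ...]
    """
    mlb = MultiLabelBinarizer()
    one_hot = mlb.fit_transform(talk_annotations)  # (n_talks, n_distinct_labels)

    # Talks sharing more tags/themes get a higher cosine similarity; unlike the
    # graph embeddings, no latent relation between different labels is captured.
    return cosine_similarity(one_hot)
</pre>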
====4.5. Automatic annotations====

In this section, we observe the performance gain of the different automatic enrichment methods we introduced in Section 3.

=====4.5.1. Topic Modeling=====

In Table 3, we report the results of adding the output of the topic modeling annotations to the KG. We evaluate the results as we vary two parameters: the number of topics, and the cutoff threshold (the confidence score above which we assign a talk to a given topic).

Table 3: Results of enriching the metadata KG with Topic nodes, varying the number of topics and the cutoff threshold
 T1
 # topics | Threshold | HIT@10 | MRR@10
 No topics added |      | 0.0765 | 0.0315
 10  | 0.03 | 0.0612 | 0.0246
 10  | 0.3  | 0.0629 | 0.0262
 40  | 0.03 | 0.0769 | 0.0317
 40  | 0.3  | 0.0782 | 0.0326
 100 | 0.03 | 0.0562 | 0.0220
 100 | 0.3  | 0.0606 | 0.0230
 T2
 # topics | Threshold | HIT@10 | MRR@10
 No topics added |      | 0.2403 | 0.1542
 10  | 0.03 | 0.2096 | 0.033
 10  | 0.3  | 0.2135 | 0.1294
 40  | 0.03 | 0.2365 | 0.1623
 40  | 0.3  | 0.2475 | 0.1716
 100 | 0.03 | 0.1921 | 0.1196
 100 | 0.3  | 0.2074 | 0.1226

From this small sample of hyperparameter values, we see that both the number of topics and the cutoff threshold impact the performance of the recommendation on both tasks. Performance improves when raising the cutoff threshold, which implies that only assigning topics to talks when the topic model is highly confident decreases the noisy relations in the graph and decreases the risk of accidentally connecting nodes that are not really topically similar. We also note that under the right configuration, we improve the performance on both metrics for both tasks, whereas in most other configurations the performance suffers. As for the number of topics, one should find a value that befits the studied corpus, as the value 40 (inspired by the ground truth number of themes in the dataset) seems to give the best results.

Topic modeling is a task that is generally very sensitive to the initial hyperparameters and subject to inherent stochasticity, which means that, with enough experiments, it is likely that one can find a configuration of hyperparameters (not only the number of topics and the cutoff threshold, but also model-specific hyperparameters such as LDA's alpha and beta) that yields an even better improvement over the reported results.

=====4.5.2. Named Entity Recognition=====

In Table 4, we report the results of adding the output of the Named Entity Recognition annotations to the KG. We evaluate the results as we switch between keeping all entities we extracted in the KG and keeping only those that appear with a high enough frequency: in our case, we only add nodes for entities that are mentioned more than 10 times in the corpus.

Table 4: Results of enriching the metadata KG with Named Entity nodes, varying the number of filtered entities
 T1
 # mentions | HIT@10 | MRR@10
 No NEs added          | 0.0765 | 0.0315
 All NEs added         | 0.0776 | 0.0304
 More than 10 mentions | 0.0808 | 0.0314
 T2
 # mentions | HIT@10 | MRR@10
 No NEs added          | 0.2403 | 0.1542
 All NEs added         | 0.2435 | 0.1548
 More than 10 mentions | 0.2575 | 0.1908

From these results, we see that adding NEs improves the results of the recommender system, especially after removing rarely appearing Named Entities (either erroneous or superfluous mentions). We also notice that MRR increases significantly with this addition for T2, suggesting that the Named Entities are strong indicators of content relatedness.

=====4.5.3. Keywords Extraction=====

In Table 5, we report the results of adding the output of the Keyword Extraction to the KG. We evaluate the results as we add either all extracted keywords or only the ones to which the keyword extraction model assigned a high enough confidence score. In our experiment, a confidence threshold of 0.3 has been chosen.

Table 5: Results of enriching the metadata KG with Keyword nodes, varying the confidence threshold
 T1
 Confidence | HIT@10 | MRR@10
 No KWs added         | 0.0765 | 0.0315
 All KWs added        | 0.0732 | 0.0295
 Only with conf > 0.3 | 0.0772 | 0.0322
 T2
 Confidence | HIT@10 | MRR@10
 No KWs added         | 0.2403 | 0.1542
 All KWs added        | 0.2398 | 0.1523
 Only with conf > 0.3 | 0.2494 | 0.1593

=====4.5.4. Combining annotations=====

In Table 6, we summarize the results from the previous experiments, and we see that adding the best configuration from each experimental setting into one KG further improves the results.

Table 6: Results on both recommendation tasks with all the different annotations added to the KG
 T1
 Annotation | HIT@10 | MRR@10
 No annotations added | 0.0765 | 0.0315
 Topics               | 0.0782 | 0.0326
 Named Entities       | 0.0808 | 0.0314
 Keywords             | 0.0772 | 0.0322
 All                  | 0.0854 | 0.0355
 T2
 Annotation | HIT@10 | MRR@10
 No annotations added | 0.2403 | 0.1542
 Topics               | 0.2475 | 0.1716
 Named Entities       | 0.2575 | 0.1908
 Keywords             | 0.2494 | 0.1593
 All                  | 0.2613 | 0.1584

We observe that the automatic annotations overall improve the performance on purely content-based recommendations (T2) but, surprisingly, they do so even for user preference-based ones (T1), although the overall performance on T1 remains significantly lower. One could argue that this is because users are usually interested in content similar to what they watched previously (in other words, all recommendation tasks are partially content-based). There is a possibility, however, that users are likely to click on the suggested videos in the "related" section, which creates a dependence between the two tasks that is impossible to untangle. This is beyond the scope of this paper, but it would be interesting to study the feedback loop of recommendation in such a setting. Finally, the results suggest that Named Entity Recognition contributes the most to the overall performance improvement of the system, as its scores are the closest to those of the full combination and it even gives a better absolute MRR score on T2.

===5. Conclusion and future work===

In this work, we showed how combining the knowledge extracted automatically using Information Extraction techniques with the representational power of KGs and their embeddings can improve the performance of content-based media Recommender Systems without requiring any supervision or external data collection, as we demonstrated a clear performance improvement as measured on two tasks: making recommendations based on manually curated recommendations, and based on actual users' interaction history. Our results are reproducible using the code published at https://github.com/D2KLab/ka-recsys.

With these promising results showing actual improvement over relying only on human annotation, there are multiple paths for further exploration. First, other techniques from the Information Extraction literature can be investigated, such as entity linking, aspect extraction and concept mining, with more exploration to be done on the techniques already presented (i.e. experimenting with other approaches for Topic Modeling, Named Entity Recognition and Keyword Extraction). What's more, as shown experimentally, the results can vary depending on how these automatic annotations are processed and filtered (thus changing the structure of the generated KG), which calls for further study of how to balance the quantity of automatic annotations against the noise that necessarily comes with them. Another direction of work is to further explore models that go beyond simple graph embeddings. We should also consider combining the results of such annotations with the original textual content, as our early experiments suggest that combining both the low-level features (text embeddings) and the high-level ones (graph embeddings) improves further upon the performance. Furthermore, as these extracted annotations live on a KG, multiple methods in the direction of Explainable Recommendations can be explored in tandem. Finally, we would like to test this approach on other datasets to see if it can be as successful on other content-centric recommendation problems.

===Acknowledgment===

This work has been partially supported by the French National Research Agency (ANR) within the ANTRACT project (ANR-17-CE38-0010), and by the European Union's Horizon 2020 research and innovation program within the MeMAD project (GA 780069).

===References===

[1] D. Kotkov, S. Wang, J. Veijalainen, A survey of serendipity in recommender systems, Knowledge-Based Systems 111 (2016) 180–192. URL: https://www.sciencedirect.com/science/article/pii/S0950705116302763.

[2] M. Kunaver, T. Požrl, Diversity in recommender systems – a survey, Knowledge-Based Systems 123 (2017) 154–162. URL: https://www.sciencedirect.com/science/article/pii/S0950705117300680.

[3] Y. Zhang, X. Chen, Explainable recommendation: A survey and new perspectives, Found. Trends Inf. Retr. 14 (2020) 1–101.

[4] N. Pappas, A. Popescu-Belis, Combining content with user preferences for TED lecture recommendation, in: 11th International Workshop on Content-Based Multimedia Indexing (CBMI), 2013, pp. 47–52.

[5] J. B. Schafer, D. Frankowski, J. Herlocker, S. Sen, Collaborative Filtering Recommender Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 291–324.

[6] N. Pappas, A. Popescu-Belis, Sentiment analysis of user comments for one-class collaborative filtering over TED talks, in: 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2013, pp. 773–776.
[7] A. Merchant, N. Singh, Hybrid trust-aware model for personalized top-N recommendation, in: Fourth ACM IKDD Conference on Data Sciences, Association for Computing Machinery, 2017.

[8] N. Pappas, A. Popescu-Belis, Combining content with user preferences for non-fiction multimedia recommendation: a study on TED lectures, Multimedia Tools and Applications 74 (2013) 1175–1197.

[9] R. Sun, X. Cao, Y. Zhao, J. Wan, K. Zhou, F. Zhang, Z. Wang, K. Zheng, Multi-Modal Knowledge Graphs for Recommender Systems, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1405–1414. URL: https://doi.org/10.1145/3340531.3411947.

[10] M. de Gemmis, P. Lops, C. Musto, F. Narducci, G. Semeraro, Semantics-Aware Content-Based Recommender Systems, Springer US, Boston, MA, 2015, pp. 119–159.

[11] G. A. Miller, WordNet: A lexical database for English, Commun. ACM 38 (1995) 39–41. URL: https://doi.org/10.1145/219717.219748.

[12] Q. Guo, F. Zhuang, C. Qin, H. Zhu, X. Xie, H. Xiong, Q. He, A survey on knowledge graph-based recommender systems, 2020. URL: https://arxiv.org/abs/2003.00911.

[13] Y. Cao, X. Wang, X. He, Z. Hu, C. Tat-seng, Unifying knowledge graph learning and recommendation: Towards a better understanding of user preference, in: WWW, 2019. URL: https://arxiv.org/abs/1906.04239.

[14] H. Cai, V. Zheng, K. Chang, A comprehensive survey of graph embedding: Problems, techniques, and applications, IEEE Transactions on Knowledge and Data Engineering 30 (2018) 1616–1637.

[15] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[16] F. Bianchi, S. Terragni, D. Hovy, Pre-training is a hot topic: Contextualized document embeddings improve topic coherence, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online, 2021, pp. 759–766. URL: https://aclanthology.org/2021.acl-short.96.

[17] T. Tian, Z. F. Fang, Attention-based autoencoder topic model for short texts, Procedia Computer Science 151 (2019) 1134–1139. URL: https://www.sciencedirect.com/science/article/pii/S1877050919306283. doi:10.1016/j.procs.2019.04.161.

[18] I. Harrando, P. Lisena, R. Troncy, Apples to apples: A systematic evaluation of topic models, in: RANLP, volume 260, 2021, pp. 488–498.

[19] I. Yamada, A. Asai, H. Shindo, H. Takeda, Y. Matsumoto, LUKE: Deep contextualized entity representations with entity-aware self-attention, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6442–6454. URL: https://aclanthology.org/2020.emnlp-main.523.

[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423.

[21] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python, 2020. URL: https://doi.org/10.5281/zenodo.1212303.

[22] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT, 2020. URL: https://doi.org/10.5281/zenodo.4461265.

[23] S. Rendle, Factorization machines, in: IEEE International Conference on Data Mining, 2010, pp. 995–1000.

[24] S. Y. Yu, S. Rokka Chhetri, A. Canedo, P. Goyal, M. A. Al Faruque, Pykg2vec: A Python library for knowledge graph embedding, 2019.

[25] G. Ji, S. He, L. Xu, K. Liu, J. Zhao, Knowledge graph embedding via dynamic mapping matrix, in: ACL, 2015.