=Paper=
{{Paper
|id=Vol-2941/paper4
|storemode=property
|title=Knowledge Graph Embeddings for News Article Tag Recommendation
|pdfUrl=https://ceur-ws.org/Vol-2941/paper4.pdf
|volume=Vol-2941
|authors=Nora Engleitner,Werner Kreiner,Nicole Schwarz,Theodorich Kopetzky,Lisa Ehrlinger
|dblpUrl=https://dblp.org/rec/conf/i-semantics/EngleitnerKSKE21
}}
==Knowledge Graph Embeddings for News Article Tag Recommendation==
<pdf width="1500px">https://ceur-ws.org/Vol-2941/paper4.pdf</pdf>
<pre>
Knowledge Graph Embeddings for News Article
          Tag Recommendation∗†

Nora Engleitner (1), Werner Kreiner (2), Nicole Schwarz (2), Theodorich Kopetzky (2),
                         and Lisa Ehrlinger (2,3)

                       (1) Newsadoo GmbH, Austria
                            nora@newsadoo.com
            (2) Software Competence Center Hagenberg GmbH, Austria
                        firstname.lastname@scch.at
                 (3) Johannes Kepler University Linz, Austria
                          lisa.ehrlinger@jku.at


         Abstract. Newsadoo is a media startup that provides news articles from
         different sources on a single platform. Users can create individual time-
         lines, where they follow the latest development of a specific topic. To
         support the topic creation process, we developed an algorithm that
         automatically suggests tags related to a given set of reference tags. In this
         paper, we first introduce the Newsadoo tag recommendation system,
         which consists of three components: (1) item-based similarity, (2) knowl-
         edge graph similarity, and (3) actuality. We describe the knowledge graph
         component in more detail and analyze the suitability of different knowl-
         edge graphs and embedding techniques to enhance the quality of the
         overall Newsadoo tag recommendation. The paper concludes with a list
         of lessons learned and interesting future work.

         Keywords: Knowledge Graph Embeddings · Tag Recommendation.


1       Introduction

Newsadoo4 is a European media startup that provides articles from various
regional, national, and international newspapers as well as magazines on a single
platform. The aim is to keep users broadly and well informed while offering
a certain degree of personalization to facilitate news consumption at the same
time. In particular, users can select sources they trust and prefer to read, thereby
influencing the news presented in their personalized timeline. Newsadoo further
offers users the possibility to create individual timelines (so-called “topics”) for
their areas of interest, thereby staying up-to-date with the latest developments
concerning a specific topic. These personal timelines can be generated either with
custom search terms or by selecting tags that are provided within Newsadoo.
Tags represent keywords for an article and are extracted automatically by using a
combination of named entity recognition (for detecting the keywords) and entity
linking with Wikipedia and Wikidata (for obtaining uniform and unique tags).

    ∗ The research reported in this paper has been funded by BMK, BMDW, and the
      Province of Upper Austria in the frame of the COMET Programme managed by FFG
      and the FFG General Programme project “TIDE”, GA no. 880693.
    † Copyright © 2021 for this paper by its authors. Use permitted under Creative
      Commons License Attribution 4.0 International (CC BY 4.0).
    4 https://newsadoo.com (July 2021)

2         Engleitner et al.

    To support the user in the topic creation process, we developed an algorithm
that suggests tags related to a given set of reference tags. We obtain these tag
recommendations by analyzing common tag occurrences in Newsadoo articles on
the one hand, and by incorporating information from a public knowledge base on
the other hand. Section 2 describes the tag recommendation system. In Section 3,
we evaluate three existing knowledge graph (KG) embeddings as well as self-
trained embeddings to increase the quality of automated tag recommendation.


2      The Newsadoo Tag Recommendation


                          Fig. 1. Newsadoo tag recommendation


    Fig. 1 depicts a schematic representation of the Newsadoo tag recommendation
system. As input, a set of reference tags (e.g., “COVID-19”, “AstraZeneca”,
“European Union”) is provided and as output, we obtain a ranked list of tags
that are related to the combination of the input tags. The recommender itself
is based on an ensemble algorithm, which uses the following three components:

    – The item-based similarity (IBS) component evaluates which tags appear
      most often together with the reference tags in Newsadoo articles.
    – The knowledge graph similarity (KGS) employs KG embeddings (containing
      the Newsadoo tags) to determine the most similar entities for the reference
      tags by computing the cosine similarity between the reference tags and tags
      in the KG. The development of the KGS is discussed in detail in Section 3.
    – The actuality component increases the rating of tags that appear more
      frequently in recent articles. Thus, the recommendation can be influenced
      by recent events, which is an important factor for a news platform.
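
    The KGS lookup described in the list above can be sketched as follows. This
is a minimal illustration under stated assumptions, not the Newsadoo code: the
toy vectors, the function names, and the choice to average the reference-tag
vectors into one query vector are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(reference_tags, embeddings, top_k=10):
    """Rank all non-reference tags by cosine similarity to the mean reference vector."""
    refs = [embeddings[t] for t in reference_tags if t in embeddings]
    if not refs:
        return []
    # Assumption: several reference tags are combined by averaging their vectors.
    query = [sum(column) / len(refs) for column in zip(*refs)]
    scored = [
        (tag, cosine(query, vec))
        for tag, vec in embeddings.items()
        if tag not in reference_tags
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional toy embeddings (real embeddings have far more dimensions).
TOY_EMBEDDINGS = {
    "BMW": [0.9, 0.1, 0.0],
    "Audi": [0.8, 0.2, 0.0],
    "Netflix": [0.0, 0.1, 0.9],
    "Hulu": [0.1, 0.0, 0.8],
}
```

Calling most_similar(["BMW"], TOY_EMBEDDINGS) ranks "Audi" first, because its
toy vector points in nearly the same direction as the one for "BMW".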

   The final tag recommendation result is obtained by merging the related tags
provided by the IBS and KGS components, computing a combined similarity
score for these tags and sorting the result accordingly.
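
    The merging step can be sketched as shown below. The paper does not state
how the combined similarity score is computed, so the linear weighting, the
default weights, and the multiplicative actuality boost are assumptions chosen
only for illustration; the tag names and scores are made up.

```python
def combine_scores(ibs, kgs, actuality, w_ibs=0.5, w_kgs=0.5):
    """Merge IBS and KGS candidates and return (tag, score) pairs, best first.

    ibs, kgs:   dict mapping tag -> similarity in [0, 1]
    actuality:  dict mapping tag -> boost factor (>= 1.0 for tags that are
                frequent in recent articles); missing tags get no boost
    """
    merged = {}
    for tag in set(ibs) | set(kgs):
        # Assumption: a weighted sum; a tag missing from one component scores 0 there.
        base = w_ibs * ibs.get(tag, 0.0) + w_kgs * kgs.get(tag, 0.0)
        merged[tag] = base * actuality.get(tag, 1.0)
    # Sort by descending score, breaking ties alphabetically for a stable result.
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical component outputs for some set of reference tags.
IBS_SCORES = {"Pfizer": 0.8, "Vaccine": 0.6}
KGS_SCORES = {"Vaccine": 0.7, "Moderna": 0.5}
ACTUALITY_BOOST = {"Moderna": 1.5}

RANKED = combine_scores(IBS_SCORES, KGS_SCORES, ACTUALITY_BOOST)
```

In this toy example, a tag suggested by both components ("Vaccine") ends up
ahead of tags suggested by only one of them.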
                                   KG Embeddings for Tag Recommendation           3

3       A Comparison of KG Embedding Techniques

To select the most suitable KG embedding technique for the tag recommendation,
we evaluated three existing KG embeddings: KGvec2go [5], Wembedder [4],
and pre-trained embeddings from PyTorch-BigGraph [2]. Further, we trained
our own embeddings using pyRDF2Vec [6] (based on Wikidata and DBpedia)
and Wikipedia2Vec [8] (based on the German and English Wikipedia).


Pre-trained embeddings. We found that the existing embeddings from KGvec2go
and Wembedder were not suitable for our application since the results were
outdated or largely unrelated to the input tags. The pre-trained embeddings from
PyTorch-BigGraph generally performed well, with the exception of location
tags, where the results were often not relevant enough. Therefore, we decided
to try another method and compute a self-trained KG embedding, which allows
use-case-specific optimization for our application.


Self-trained embeddings. For building our own embeddings, we experimented
with Wikidata5, DBpedia6, and DBpedia Live7. All three KGs performed well
for a small number of items, but were not suited for practical application in
Newsadoo. As there are tools to build dumps for Wikidata, and since DBpedia
and DBpedia Live are language-specific, we focused on the language-independent
Wikidata as Newsadoo offers news articles in different languages. With Wikidata,
the major challenge was to identify a suitable approach for creating embeddings
for the vast number of entities provided in this KG.
    Available dumps could not be processed directly due to memory limitations.
Accessing the online SPARQL endpoint5 during training would have led to an
evaluation time of several weeks. The effort to host an endpoint locally was
considered disproportionately high. Therefore, we built our own local Wikidata
dump to train the embeddings locally. This subgraph was obtained by querying
the SPARQL endpoint for each of the 400,000 items and restricting the result to
triples containing a Wikidata item as object. We optimized our walking strategies
and parameters according to the findings from [1] and [7]. The best results for a
runtime of one day were achieved with the Weisfeiler-Lehman strategy, max. 100
walks per item, a walking depth of 4, and a vector size of 100.
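
    The two steps described above, restricting an item's triples to those with a
Wikidata item as object and extracting bounded walks from the resulting
subgraph, can be sketched as follows. This is a simplified illustration and
not the code used for the paper: the query text, the function names, and the
toy triples are assumptions, and the walker enumerates plain walks rather than
applying the Weisfeiler-Lehman relabeling. In RDF2Vec-style training, such
walks then serve as the "sentences" for a word2vec-style model.

```python
from collections import defaultdict

# Wikidata items have IRIs of the form http://www.wikidata.org/entity/Q<number>.
WIKIDATA_ITEM_PREFIX = "http://www.wikidata.org/entity/Q"


def build_item_query(qid):
    """SPARQL query for the triples of `qid` whose object is a Wikidata item.

    Relies on the `wd:` prefix that the Wikidata query service predefines.
    """
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  wd:{qid} ?p ?o .\n"
        f'  FILTER(STRSTARTS(STR(?o), "{WIKIDATA_ITEM_PREFIX}"))\n'
        "}"
    )


def extract_walks(triples, start, max_depth=4, max_walks=100):
    """Enumerate walks of up to `max_depth` hops from `start`.

    A walk is a tuple (entity, predicate, entity, ...). It ends when it reaches
    `max_depth` hops or a node without outgoing edges. Enumeration stops once
    `max_walks` walks have been collected.
    """
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))

    walks = []
    stack = [(start, (start,), 0)]  # (current node, path so far, hops taken)
    while stack and len(walks) < max_walks:
        node, path, depth = stack.pop()
        successors = graph.get(node, [])
        if depth == max_depth or not successors:
            if depth > 0:  # skip the trivial walk that never leaves `start`
                walks.append(path)
            continue
        for predicate, obj in successors:
            stack.append((obj, path + (predicate, obj), depth + 1))
    return walks


# Hypothetical toy subgraph with abstract node and predicate names.
TOY_TRIPLES = [("A", "p", "B"), ("B", "q", "C"), ("A", "r", "D")]
TOY_WALKS = extract_walks(TOY_TRIPLES, "A", max_depth=4, max_walks=100)
```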
    These embeddings yielded generally good results for our application with
a few exceptions: In some cases, the resulting items were too similar to each
other, e.g., for a car manufacturer as reference tag we obtained a list of different
car models from this manufacturer. This might be acceptable or even desirable
for other applications, but in our case we require a certain diversity within the
results. Additionally, we observed examples where the result contained elements
that would be considered irrelevant for tag recommendation, e.g.,
“Austrian Sign Language” for the reference tag “Austria”.
    5 https://query.wikidata.org/sparql (July 2021)
    6 https://dbpedia.org/sparql (July 2021)
    7 https://live.dbpedia.org/sparql (July 2021)

Wikipedia2Vec. Due to the drawbacks mentioned above, we considered a third
approach and created embeddings via Wikipedia2Vec. Strictly speaking, this
model is not a KG embedding, but rather an embedding of regular vocabulary
and Wikipedia entities into the same vector space via skip-gram-based models [3].
More precisely, the Wikipedia2Vec model is trained by jointly optimizing three
different models: The first model utilizes the Wikipedia link graph and learns
to predict neighboring entities in this graph. The second model is a conventional
skip-gram model applied to the text on a Wikipedia page. The third model learns
to predict neighboring words of a target entity and thereby places similar words
and entities close to each other in the vector space.
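
    To make the three jointly trained models more concrete, the sketch below
generates the (target, context) training pairs that each of them would learn
from. It only illustrates the idea and is not the Wikipedia2Vec
implementation: the function names, the "ENTITY/" naming scheme, the window
sizes, and the simplification that every anchor covers exactly one token are
assumptions.

```python
def link_graph_pairs(links):
    """Link-graph model: (entity, entity it links to) pairs."""
    return [(src, dst) for src, targets in links.items() for dst in targets]


def word_pairs(tokens, window=2):
    """Conventional skip-gram model: (target word, neighboring word) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def anchor_pairs(tokens, anchors, window=2):
    """Anchor-context model: (linked entity, word near the anchor) pairs.

    `anchors` maps a token position to the entity that the anchor text at this
    position links to (assumption: single-token anchors).
    """
    pairs = []
    for pos, entity in anchors.items():
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        pairs.extend((entity, tokens[j]) for j in range(lo, hi) if j != pos)
    return pairs


# Hypothetical one-sentence "page" with two anchors and one outgoing link.
TOKENS = ["vienna", "is", "the", "capital", "of", "austria"]
ANCHORS = {0: "ENTITY/Vienna", 5: "ENTITY/Austria"}
LINKS = {"ENTITY/Vienna": ["ENTITY/Austria"]}
```

Because words and entities appear as targets and contexts in pairs drawn from
the same text, training on all three pair types places both in one vector space.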
    Since there are currently English and German tags available in Newsadoo,
we require embeddings for both languages and combine their results to
obtain language-independent recommendations. Furthermore, we incorporate
the frequency of an item, which is also computed while training the
embeddings, into the similarity score to filter out less relevant entities.
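
    One possible way to combine the two language models and to fold the item
frequency into the score is sketched below. The paper does not give the exact
formula, so the log-scaled frequency weight, the use of the maximum score
across languages, and all names and numbers are assumptions. The sketch also
assumes that English and German results have already been mapped to shared tag
identifiers.

```python
import math


def frequency_weight(count, max_count):
    """Map an occurrence count to [0, 1]; rare entities get a smaller weight."""
    if max_count <= 0:
        return 1.0  # no frequency information available: leave scores unchanged
    return math.log1p(count) / math.log1p(max_count)


def combine_languages(results_en, results_de, counts, top_k=10):
    """Merge per-language results (tag -> cosine similarity) into one ranking."""
    max_count = max(counts.values(), default=0)
    combined = {}
    for results in (results_en, results_de):
        for tag, similarity in results.items():
            score = similarity * frequency_weight(counts.get(tag, 0), max_count)
            # Assumption: keep the better of the two language-specific scores.
            combined[tag] = max(combined.get(tag, 0.0), score)
    ranked = sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_k]


# Hypothetical similarities and occurrence counts for the reference tag "Austria".
RESULTS_EN = {"Germany": 0.8, "Klafferkessel": 0.9}
RESULTS_DE = {"Germany": 0.7, "Vienna": 0.75}
COUNTS = {"Germany": 1000, "Vienna": 1000, "Klafferkessel": 1}

COMBINED = combine_languages(RESULTS_EN, RESULTS_DE, COUNTS)
```

With these toy numbers, the rarely occurring "Klafferkessel" drops to the end
of the list even though its raw similarity is the highest.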

Final decision. In real-world applications, it is generally challenging to deter-
mine the quality of the results, since typically, no annotated data is available.
In addition, for our tag recommendation system, the quality of a result is highly
subjective and dependent on the expectations of the user. Since user feedback
was not available at the development stage, we decided to rely on the domain
knowledge of experts for evaluating the quality of the results for this specific
application. Therefore, we defined a representative set of reference tags and per-
formed a qualitative evaluation of the top 10 recommended results for different
embeddings. Table 1 shows a subset of this evaluation. Note that the set of
feasible results is restricted to the set of available tags in Newsadoo.
    Eventually, we decided to use Wikipedia2Vec as the most suitable embedding
for our application for the following reasons: First, this approach provides
consistently good results without any completely irrelevant tags, as opposed to
the other models. Second, we found that Wikipedia2Vec yields a higher diversity
than pure KG embeddings, as discussed in the car manufacturer example above.


4   Conclusion and Research Outlook

In this paper, we introduced the Newsadoo tag recommendation system, which
provides tags related to a given set of reference tags (with tags being special
keywords extracted from a news article). KG embeddings are one crucial
component of this system, and we investigated and evaluated them in greater
detail with respect to tag recommendation.
    We found that Wikipedia2Vec delivered the best results (in terms of suit-
ability and diversity) for our application based on a qualitative evaluation with
domain experts. Preparing data for training was challenging due to (1) per-
formance issues using online SPARQL endpoints within the training process,
(2) memory limitations for available dumps, and (3) maintenance overhead with
a locally hosted endpoint. For future work, we plan to extend the current
solution with more research on the tuning of the subgraphs and an approach for
evaluating the quality of the tag recommendation in greater detail, e.g., with an
information-retrieval-style relevancy evaluation. We also plan to investigate the
suitability of even more recent approaches, e.g., graph neural networks.


Table 1. Comparison of different KG embedding techniques for tag recommendation.

PBG (PyTorch-BigGraph)                Wikipedia2Vec (en+de)   pyRDF2Vec – Wikidata
----------------------------------------------------------------------------------
Reference tag: Austria
Maissauer (noble family)              Germany                 Vienna
State Gallery of Lower Austria        Switzerland             Switzerland
State Gallery of Lower Austria        Vienna                  Municipality (Austria)
Klafferkessel                         Tyrol (state)           Italy
Langschwarza                          Styria                  Hungary
EU-protected-area March-Thaya-Auen    France                  Austrian Sign Language
----------------------------------------------------------------------------------
Reference tag: Netflix
Facebook Watch                        Prime Video             Ask the StoryBots
Amazon Studios                        Hulu                    Amazon Web Services
Red Bull TV                           Video on demand         Amazon (company)
YouTube Premium                       Crunchyroll             Alliance for Open Media
Hulu                                  HBO                     Big Mouth (TV series)
Set-top box                           Streaming media         Hulu
----------------------------------------------------------------------------------
Reference tag: BMW
Volkswagen Group                      Mercedes-Benz           BMW 6 Series
Daimler-Benz                          Audi                    BMW 1 Series
BMW Motorrad                          Porsche                 BMW Z
Chrysler LHS                          BMW Motorrad            BMW GS
Cadillac                              Volkswagen Group        BMW 320
Mercedes-Benz Cars                    Volvo                   BMW X1


References
1. Iana, A., Paulheim, H.: More is not always better: The negative impact of a-box
   materialization on rdf2vec knowledge graph embeddings. arXiv:2009.00318 (2020)
2. Lerer, A., Wu, L., Shen, J., Lacroix, T., Wehrstedt, L., Bose, A., Peysakhovich, A.:
   Pytorch-biggraph: A large-scale graph embedding system. arXiv:1903.12287 (2019)
3. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word represen-
   tations in vector space. arXiv:1301.3781 (2013)
4. Nielsen, F.: Wembedder: Wikidata entity embedding web service. arXiv:1710.04099
   (2017)
5. Portisch, J., Hladik, M., Paulheim, H.: KGvec2go – knowledge graph embeddings as
   a service. In: Proceedings of the 12th Language Resources and Evaluation Confer-
   ence. pp. 5641–5647. European Language Resources Association, Marseille (2020)
6. Vandewiele, G., Steenwinckel, B., Agozzino, T., Weyns, M., Bonte, P., Ongenae, F.,
   Turck, F.D.: pyRDF2Vec: Python implementation and extension of rdf2vec (2020),
   https://github.com/IBCNServices/pyRDF2Vec (July 2021)
7. Vandewiele, G., Steenwinckel, B., Bonte, P., Weyns, M., Paulheim, H., Ristoski,
   P., Turck, F.D., Ongenae, F.: Walk extraction strategies for node embeddings with
   rdf2vec in knowledge graphs. arXiv:2009.04404 (2020)
8. Yamada, I., Asai, A., Sakuma, J., Shindo, H., Takeda, H., Takefuji, Y., Matsumoto,
   Y.: Wikipedia2vec: An efficient toolkit for learning and visualizing the embeddings
   of words and entities from wikipedia. In: Proceedings of the 2020 Conference on
   Empirical Methods in Natural Language Processing: System Demonstrations. pp.
   23–30. Association for Computational Linguistics (2020)

</pre>