<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge Graph Embeddings for News Article Tag Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nora Engleitner</string-name>
          <email>nora@newsadoo.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Werner Kreiner</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicole Schwarz</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Theodorich Kopetzky</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lisa Ehrlinger</string-name>
          <email>lisa.ehrlinger@jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Newsadoo GmbH</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Software Competence Center Hagenberg GmbH</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Newsadoo is a media startup that provides news articles from different sources on a single platform. Users can create individual timelines, where they follow the latest developments on a specific topic. To support the topic creation process, we developed an algorithm that automatically suggests related tags for a set of given reference tags. In this paper, we first introduce the Newsadoo tag recommendation system, which consists of three components: (1) item-based similarity, (2) knowledge graph similarity, and (3) actuality. We describe the knowledge graph component in more detail and analyze the suitability of different knowledge graphs and embedding techniques to enhance the quality of the overall Newsadoo tag recommendation. The paper concludes with a list of lessons learned and interesting future work.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph Embeddings</kwd>
        <kwd>Tag Recommendation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Newsadoo<sup>4</sup> is a European media startup that provides articles from various regional, national, and international newspapers as well as magazines on a single platform. The aim is to keep users broadly and well informed while at the same time offering a certain degree of personalization to facilitate news consumption. In particular, users can select sources they trust and prefer to read, thereby influencing the news presented in their personalized timeline. Newsadoo further offers users the possibility to create individual timelines (so-called “topics”) for their areas of interest, thereby staying up to date with the latest developments concerning a specific topic. These personal timelines can be generated either with custom search terms or by selecting tags that are provided within Newsadoo. Tags represent keywords for an article and are extracted automatically using a combination of named entity recognition (for detecting the keywords) and entity linking with Wikipedia and Wikidata (for obtaining uniform and unique tags).</p>
      <p>To support the user in the topic creation process, we developed an algorithm that suggests related tags for a set of given reference tags. We obtain these tag recommendations by analyzing common tag occurrences in Newsadoo articles on the one hand, and by incorporating information from a public knowledge base on the other hand. Section 2 details the tag recommendation system. In Section 3, we evaluate three existing knowledge graph (KG) embeddings as well as self-trained embeddings to increase the quality of the automated tag recommendation.</p>
    </sec>
    <sec id="sec-2">
      <title>The Newsadoo Tag Recommendation</title>
      <p>The Newsadoo tag recommendation system consists of three components:</p>
      <list list-type="bullet">
        <list-item><p>The item-based similarity (IBS) component evaluates which tags appear most often together with the reference tags in Newsadoo articles.</p></list-item>
        <list-item><p>The knowledge graph similarity (KGS) component employs KG embeddings (containing the Newsadoo tags) to determine the most similar entities for the reference tags by computing the cosine similarity between the reference tags and the tags in the KG. The development of the KGS is discussed in detail in Section 3.</p></list-item>
        <list-item><p>The actuality component increases the rating of tags that appear more frequently in recent articles. Thus, the recommendation can be influenced by recent events, which is an important factor for a news platform.</p></list-item>
      </list>
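      <p>The KGS lookup can be sketched as a nearest-neighbor search by cosine similarity over an embedding matrix. This is a minimal illustration: the tag names and two-dimensional vectors below are toy stand-ins, not actual Newsadoo embeddings.</p>
      <preformat>
```python
import numpy as np

def most_similar(reference_vec, tag_names, tag_matrix, k=3):
    """Return the k tags whose embeddings are most cosine-similar
    to the reference tag's embedding."""
    # Normalize by both vector lengths so the dot product becomes cosine similarity.
    norms = np.linalg.norm(tag_matrix, axis=1) * np.linalg.norm(reference_vec)
    sims = tag_matrix @ reference_vec / norms
    order = np.argsort(-sims)[:k]
    return [(tag_names[i], float(sims[i])) for i in order]

# Toy embedding space: "Vienna" is deliberately placed close to "Austria".
names = ["Vienna", "Netflix", "Germany"]
matrix = np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.3]])
austria = np.array([1.0, 0.2])
print(most_similar(austria, names, matrix, k=2))
```
      </preformat>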
      <p>The final tag recommendation result is obtained by merging the related tags provided by the IBS and KGS components, computing a combined similarity score for these tags, and sorting the result accordingly.</p>
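      <p>The merge-and-sort step can be sketched as follows. Note that the component weights and the way the actuality boost enters the score are illustrative assumptions; the exact combination formula used by Newsadoo is not specified here.</p>
      <preformat>
```python
def combined_score(tag, ibs, kgs, recent_freq, w_ibs=0.5, w_kgs=0.4, w_act=0.1):
    # Weighted sum of the three components; the weights are
    # illustrative assumptions, not the values used in production.
    return (w_ibs * ibs.get(tag, 0.0)
            + w_kgs * kgs.get(tag, 0.0)
            + w_act * recent_freq.get(tag, 0.0))

def recommend(ibs, kgs, recent_freq, k=3):
    """Merge the candidate tags from both components, score, and sort."""
    candidates = set(ibs) | set(kgs)
    ranked = sorted(candidates,
                    key=lambda t: combined_score(t, ibs, kgs, recent_freq),
                    reverse=True)
    return ranked[:k]

# Toy component outputs for the reference tag "Austria".
ibs = {"Vienna": 0.9, "Ski jumping": 0.6}
kgs = {"Vienna": 0.8, "Germany": 0.7}
recent = {"Ski jumping": 1.0}   # boosted by recent articles
print(recommend(ibs, kgs, recent))
```
      </preformat>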
      <p>
        <bold>A Comparison of KG Embedding Techniques</bold>
      </p>
      <p>
        To select the most suitable KG embedding technique for the tag recommendation, we evaluated three existing KG embeddings: KGvec2go [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Wembedder [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and pre-trained embeddings from PyTorch-BigGraph [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Further, we trained our own embeddings using pyRdf2Vec [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (based on Wikidata and DBpedia) and Wikipedia2Vec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (based on the German and English Wikipedia).
      </p>
      <p>
        <italic>Pre-trained embeddings.</italic> We found that the existing embeddings from KGvec2go and Wembedder were not suitable for our application, since the results were outdated or largely unrelated to the input tags. The pre-trained embeddings from PyTorch-BigGraph generally performed well, with the exception of location tags, where the results were often not relevant enough. Therefore, we decided to try another method and compute a self-trained KG embedding, which allows use-case-specific optimization for our application.
      </p>
      <p><italic>Self-trained embeddings.</italic> For building our own embeddings, we experimented with Wikidata<sup>5</sup>, DBpedia<sup>6</sup>, and DBpedia Live<sup>7</sup>. All three KGs performed well for a small number of items, but were not suited for practical application in Newsadoo. As there are tools to build dumps for Wikidata, and since DBpedia and DBpedia Live are language-specific, we focused on the language-independent Wikidata, because Newsadoo offers news articles in different languages. With Wikidata, the major challenge was to identify a suitable approach for creating embeddings for the vast number of entities provided in this KG.</p>
      <p>
        Available dumps could not be processed directly due to memory limitations. Accessing the online SPARQL endpoint<sup>5</sup> during training would have led to an evaluation time of several weeks. The effort to host an endpoint locally was considered disproportionately high. Therefore, we built our own local Wikidata dump to train the embeddings locally. This subgraph was obtained by querying the SPARQL endpoint for each of the 400,000 items and restricting the result to triples containing a Wikidata item as object. We optimized our walking strategies and parameters according to the findings from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The best results for a runtime of one day were achieved with the Weisfeiler-Lehman strategy, max. 100 walks per item, a walking depth of 4, and a vector size of 100.
      </p>
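      <p>The per-item subgraph extraction can be sketched as follows: the query keeps only triples whose object is itself a Wikidata item, mirroring the restriction described above. The helper names are our own, and Q40/P36/Q1741 are merely example identifiers (Austria, its "capital" property, and Vienna).</p>
      <preformat>
```python
WDQS = "https://query.wikidata.org/sparql"   # public endpoint the dump was pulled from
WD = "http://www.wikidata.org/entity/"

def item_subgraph_query(qid):
    """SPARQL query for one item's outgoing triples, restricted to
    objects that are themselves Wikidata items (entity IRIs)."""
    return (
        f"SELECT ?p ?o WHERE {{ <{WD}{qid}> ?p ?o . "
        f"FILTER(STRSTARTS(STR(?o), \"{WD}Q\")) }}"
    )

def to_ntriples(qid, bindings):
    """Convert SPARQL JSON result bindings into N-Triples lines for the local dump."""
    subject = f"<{WD}{qid}>"
    return [
        f"{subject} <{b['p']['value']}> <{b['o']['value']}> ."
        for b in bindings
    ]

q = item_subgraph_query("Q40")  # Q40 = Austria
rows = [{"p": {"value": "http://www.wikidata.org/prop/direct/P36"},
         "o": {"value": WD + "Q1741"}}]  # P36 (capital) -> Q1741 (Vienna)
print(to_ntriples("Q40", rows)[0])
```
      </preformat>
      <p>Running one such query per item and concatenating the N-Triples lines yields a local subgraph that can be fed to a walk-based embedding tool without touching the online endpoint during training.</p>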
      <p>These embeddings yielded generally good results for our application, with a few exceptions: In some cases, the resulting items were too similar to each other, e.g., for a car manufacturer as reference tag we obtained a list of different car models from this manufacturer. This might be acceptable or even desirable for other applications, but in our case we require a certain diversity within the results. Additionally, we observed examples where the result contained elements that would be considered irrelevant for tag recommendation, e.g., “Austrian Sign Language” for the reference tag “Austria”.</p>
      <p><sup>5</sup> https://query.wikidata.org/sparql (July 2021)
<sup>6</sup> https://dbpedia.org/sparql (July 2021)
<sup>7</sup> https://live.dbpedia.org/sparql (July 2021)</p>
      <p>
        <italic>Wikipedia2Vec.</italic> Due to the drawbacks mentioned above, we considered a third approach and created embeddings via Wikipedia2Vec. Strictly speaking, this model is not a KG embedding, but rather an embedding of regular vocabulary and Wikipedia entities into the same vector space via skip-gram-based models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. More precisely, the Wikipedia2Vec model is trained by jointly optimizing three different models: the first utilizes the Wikipedia link graph and learns to predict neighboring entities in this graph. The second is a conventional skip-gram model applied to the text of a Wikipedia page. The third learns to predict the neighboring words of a target entity and thereby places similar words and entities near each other in the vector space.
      </p>
      <p>Since there are currently English and German tags available in Newsadoo, we require embeddings for both languages and therefore combine the results to obtain language-independent recommendations. Furthermore, we incorporate the frequency of an item, which is also computed during the embedding algorithm, into the similarity score to filter out less relevant entities.</p>
      <p><italic>Final decision.</italic> In real-world applications, it is generally challenging to determine the quality of the results, since typically no annotated data is available. In addition, for our tag recommendation system, the quality of a result is highly subjective and depends on the expectations of the user. Since user feedback was not available at the development stage, we decided to rely on the domain knowledge of experts to evaluate the quality of the results for this specific application. Therefore, we defined a representative set of reference tags and performed a qualitative evaluation of the top 10 recommended results for different embeddings. Table 1 shows a subset of this evaluation. Note that the set of feasible results is restricted to the set of available tags in Newsadoo.</p>
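      <p>The bilingual combination and frequency filtering described above can be sketched as follows. The averaging scheme, the frequency threshold, and all scores are illustrative assumptions rather than Newsadoo's exact implementation.</p>
      <preformat>
```python
def combine_languages(sims_en, sims_de, freq, min_freq=50, k=3):
    """Average the per-language cosine similarities for each candidate tag
    and drop rare entities. Threshold and averaging are illustrative
    assumptions, not the production formula."""
    merged = {}
    for tag in set(sims_en) | set(sims_de):
        if freq.get(tag, 0) < min_freq:
            continue  # filter out less relevant (infrequent) entities
        scores = [s[tag] for s in (sims_en, sims_de) if tag in s]
        merged[tag] = sum(scores) / len(scores)
    return sorted(merged, key=merged.get, reverse=True)[:k]

# Toy per-language similarity lists for the reference tag "Austria".
sims_en = {"Germany": 0.82, "Vienna": 0.78, "Austrian Sign Language": 0.75}
sims_de = {"Germany": 0.80, "Switzerland": 0.77}
freq = {"Germany": 900, "Vienna": 700, "Switzerland": 650,
        "Austrian Sign Language": 12}  # too rare: filtered out
print(combine_languages(sims_en, sims_de, freq))
```
      </preformat>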
      <p>Eventually, we decided to use Wikipedia2Vec as the most suitable embedding for our application for the following reasons: First, this approach provides consistently good results without any completely irrelevant tags, as opposed to the other models. Second, we found that Wikipedia2Vec yields a higher diversity than pure KG embeddings, as discussed in the car manufacturer example above.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Research Outlook</title>
      <p>In this paper, we introduced the Newsadoo tag recommendation system, which provides related tags for a set of given reference tags (with tags being special keywords extracted from a news article). One crucial component of this system is the KG embeddings, which were investigated and evaluated in greater detail with respect to tag recommendation.</p>
      <p>We found that Wikipedia2Vec delivered the best results (in terms of suitability and diversity) for our application, based on a qualitative evaluation with domain experts. Preparing the data for training was challenging due to (1) performance issues when using online SPARQL endpoints within the training process, (2) memory limitations for the available dumps, and (3) the maintenance overhead of a locally hosted endpoint. For future work, we plan to extend the current solution with more research on the tuning of the subgraphs and an approach for evaluating the quality of the tag recommendation in greater detail, e.g., with an information-retrieval-style relevancy evaluation. We also plan to investigate the suitability of even more recent approaches, e.g., graph neural networks.</p>
      <table-wrap id="table1">
        <label>Table 1</label>
        <caption>
          <p>Top recommended tags for the reference tags Austria, Netflix, and BMW, as produced by the pre-trained PyTorch-BigGraph (PBG) embeddings, the self-trained Wikipedia2Vec (en+de) embeddings, and the self-trained pyRdf2Vec embeddings on Wikidata.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Reference tag</th><th>PBG</th><th>Wikipedia2Vec (en+de)</th><th>pyRdf2Vec (Wikidata)</th></tr>
          </thead>
          <tbody>
            <tr><td>Austria</td><td>Maissauer (noble family), State Gallery of Lower Austria, State Gallery of Lower Austria, Klafferkessel, Langschwarza, EU-protected-area March-Thaya-Auen</td><td>Germany, Switzerland, Vienna, Tyrol (state), Styria, France</td><td>Vienna, Switzerland, Municipality (Austria), Italy, Hungary, Austrian Sign Language</td></tr>
            <tr><td>Netflix</td><td>Facebook Watch, Amazon Studios, Red Bull TV, YouTube Premium, Hulu, Set-top box</td><td>Prime Video, Hulu, Video on demand, Crunchyroll, HBO, Streaming media</td><td>Ask the StoryBots, Amazon Web Services, Amazon (company), Alliance for Open Media, Big Mouth (TV series), Hulu</td></tr>
            <tr><td>BMW</td><td>Volkswagen Group, Daimler-Benz, BMW Motorrad, Chrysler LHS, Cadillac, Mercedes-Benz Cars</td><td>Mercedes-Benz, Audi, Porsche, BMW Motorrad, Volkswagen Group, Volvo</td><td>BMW 6 Series, BMW 1 Series, BMW Z, BMW GS, BMW 320, BMW X1</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Iana, A., Paulheim, H.: More is not always better: The negative impact of A-box materialization on RDF2vec knowledge graph embeddings. arXiv:2009.00318 (2020)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Lerer, A., Wu, L., Shen, J., Lacroix, T., Wehrstedt, L., Bose, A., Peysakhovich, A.: PyTorch-BigGraph: A large-scale graph embedding system. arXiv:1903.12287 (2019)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Nielsen, F.: Wembedder: Wikidata entity embedding web service. arXiv:1710.04099 (2017)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Portisch, J., Hladik, M., Paulheim, H.: KGvec2go – knowledge graph embeddings as a service. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 5641–5647. European Language Resources Association, Marseille (2020)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Vandewiele, G., Steenwinckel, B., Agozzino, T., Weyns, M., Bonte, P., Ongenae, F., Turck, F.D.: pyRDF2Vec: Python implementation and extension of RDF2Vec (2020), https://github.com/IBCNServices/pyRDF2Vec (July 2021)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Vandewiele, G., Steenwinckel, B., Bonte, P., Weyns, M., Paulheim, H., Ristoski, P., Turck, F.D., Ongenae, F.: Walk extraction strategies for node embeddings with RDF2Vec in knowledge graphs. arXiv:2009.04404 (2020)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Yamada, I., Asai, A., Sakuma, J., Shindo, H., Takeda, H., Takefuji, Y., Matsumoto, Y.: Wikipedia2Vec: An efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 23–30. Association for Computational Linguistics (2020)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>