<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EL-NEL: Entity Linking for Greek News Articles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Katerina Papantoniou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasilis Efthymiou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgos Flouris</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Crete</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science</institution>
          ,
          <addr-line>FORTH</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Readers, as well as journalists, are overwhelmed with the information available in online news articles, making it very difficult to verify and validate their content. An important tool to support readers in this task is that of named entity linking (NEL), i.e., semantically annotating entities mentioned in text with entities described in knowledge bases (KBs). In this poster, we introduce EL-NEL, the NEL component of DebateLab, a project with the more general objective of representing, mining and reasoning with online arguments. The models trained in EL-NEL are offered as open source.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Linking</kwd>
        <kwd>Entity Disambiguation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The current news ecosystem is characterised by the rapid spread of news and a vast
increase in the quantity of information available. This abundance of information
makes it difficult for readers to identify good-quality journalistic resources, which
are set apart by the use of credible and justified information and opinions, in the
form of well-formulated arguments based on facts. The problem is aggravated by
the fact that the provided information and arguments are often conflicting, and it
is very difficult for a reader to evaluate their credibility without access to the
relevant factual resources. This difficulty often leads to unwanted behaviours, such
as placing blind and unjustified trust in specific opinions or sources,
the formation of echo chambers, confirmation bias, and increased radicalisation
and polarisation.</p>
<p>The DebateLab project is conducting research towards representing, mining
and reasoning with online arguments. The goal of DebateLab is to offer a suite of
tools and services that will assist both the work of the professional journalist in
accomplishing everyday tasks (e.g., writing, archiving, retrieving articles), as well
as the activity of the ordinary Web user (reader) who wishes to be well-informed
about topics or entities of interest (e.g., persons, locations, events).</p>
<p>An essential component of the envisioned DebateLab tools is the task of
identifying sub-strings (a.k.a. entity mentions) that refer to real-world entities within
an unstructured/textual argument, and mapping them to uniquely identified
entity descriptions provided in target Knowledge Bases (KBs), a task known
as named entity linking (NEL). NEL would allow the association of arguments
with factual resources, thereby facilitating the validation of an argument, and
the evaluation of its strength, veracity and applicability.</p>
      <p>
        This work describes EL-NEL, the NEL component of DebateLab. Although
NEL is quite popular in the literature, and various pre-trained models exist,
the vast majority of those works are only applicable to the English language.
Indeed, many multilingual or language-agnostic approaches are usually based
on resources that are of poor quality [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and biased towards the English
language [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. EL-NEL performs NEL for the Greek language, the language
considered by the DebateLab project.
      </p>
      <p>
More specifically, this poster describes a pipeline for non-English NEL, which
employs alternative state-of-the-art tools (e.g., BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], wikipedia2vec [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
fastText [
        <xref ref-type="bibr" rid="ref3">3</xref>
]) for its components. The generated resources are made publicly
available as open source (https://gitlab.isl.ics.forth.gr/papanton/debatelab-nel). The goal of this work is not merely to showcase the
preliminary results for Greek NEL, but rather to present the process followed,
and, hopefully, to help researchers studying NEL for languages other than English.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <sec id="sec-2-1">
        <p>
Named Entity Recognition (NER). In the case of arguments only (and
not tags), we perform NER to extract entity mentions. We evaluated off-the-shelf NER tools (e.g., SpaCy (https://spacy.io/), polyglot (https://github.com/aboSamoor/polyglot)), but we also fine-tuned a Greek BERT
model [
          <xref ref-type="bibr" rid="ref7">7</xref>
]. For the fine-tuning, we used a manually annotated Greek corpus for
NER (elNER [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]) consisting of 21,153 sentences from the news domain. The
training was performed using the NERDA framework (https://github.com/ebanalyse/NERDA).
        </p>
<p>Our preliminary results, presented in Table 1, indicate that the fine-tuned
BERT model (with a micro-averaged F1 score of 91%) yields a negligible
improvement over a SpaCy model (90% micro-averaged F1) that we trained on the same
Greek corpus (elNER), and a significant improvement over the pre-trained Greek
SpaCy model (51% micro-averaged F1).</p>
        <p>
          Candidate Generation. For each entity mention, we generate a list of
possible entities from a multitude of wiki-based KBs (Wikidata, DBpedia, Wikipedia,
YAGO). This component mostly relies on existing lookup services offered by
those KBs. For each entity mention, we perform several lookup queries to those
services and rank the generated candidates based on the employed service and
the pre-processing applied to each query (e.g., lemmatizing using TreeTagger [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
and Stanza [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], stemming using a snowball stemmer [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and expanding
abbreviated first names using a list of common Greek names).
        </p>
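<p>The abbreviation-expansion step, for instance, can be sketched with a small lookup over a list of common Greek first names; the name list and function name below are purely illustrative, not part of the EL-NEL codebase:</p>

```python
# Hypothetical excerpt of a list of common Greek first names.
GREEK_FIRST_NAMES = ["Γιώργος", "Κατερίνα", "Κυριάκος", "Βασίλης"]

def expand_abbreviated_first_name(mention, names=GREEK_FIRST_NAMES):
    """Expand an abbreviated first name (e.g. 'Κ. Μητσοτάκης') into one
    lookup query per matching full name; return the mention unchanged
    when nothing matches."""
    first, _, rest = mention.partition(" ")
    if first.endswith(".") and rest:
        initial = first[:-1]
        expanded = [f"{name} {rest}" for name in names
                    if name.startswith(initial)]
        if expanded:
            return expanded
    return [mention]
```

<p>Each expanded form would then be issued as an additional lookup query, with its rank determined by the pre-processing applied, as described above.</p>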
        <p>Wikidata. We look up the entity mentions in Greek using the wbsearchentities
service and retrieve the top-k results per query (k empirically set to 5). If we get
no lookup results, we try again first with the lemmatized and then the stemmed
version of the original query (i.e., entity mention). The order of executing those
lookup queries is maintained in the output rankings of the candidate entities.</p>
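<p>The fallback order described above can be sketched as follows; `lookup` stands in for a call to Wikidata's wbsearchentities service (stubbed here), and `lemmatize`/`stem` for the TreeTagger/Stanza and snowball-stemmer steps, so all names are illustrative:</p>

```python
def generate_candidates(mention, lookup, lemmatize, stem, k=5):
    """Query the lookup service with progressively normalised forms of the
    entity mention; later forms are only tried when earlier ones return
    nothing, and the query order is preserved in the candidate ranking."""
    for query in (mention, lemmatize(mention), stem(mention)):
        results = lookup(query)[:k]  # keep the top-k results per query
        if results:
            # drop duplicates while preserving rank order
            seen, ranked = set(), []
            for r in results:
                if r not in seen:
                    seen.add(r)
                    ranked.append(r)
            return ranked
    return []
```

<p>In the actual pipeline, `lookup` would issue an HTTP request to the wbsearchentities endpoint with the Greek language parameter; the stub keeps the sketch self-contained.</p>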
      </sec>
      <sec id="sec-2-2">
<p>Wikipedia and DBpedia. We extract offline alternative names and redirect
pages from the Greek Wikipedia, which we add to the list of candidate entities
for each entity mention. In addition, the search mechanism of Wikipedia was
also employed, and the highly ranked results were added to the candidate set. For
DBpedia candidates, we simply use the Wikipedia suffixes.</p>
        <p>
Wikifier. We get additional Wikipedia candidates by calling the Wikifier [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
lookup service for each entity mention, and retrieving the top candidate. This
tool enables multilingual semantic annotation of text with links to Wikipedia
articles through a PageRank-based approach for disambiguation.
        </p>
<p>Disambiguation. We distinguish between two cases for entity
disambiguation: (i) NER has detected a single entity mention, in which case we return the results
obtained from a single ranked list of candidates, and (ii) more than one entity
mention is detected, in which case we consider the cohesion of the candidates, returning
those whose entity embeddings are most similar to each other.</p>
        <p>
We examine three different types of entity embeddings, two context-independent
(fastText [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], wikipedia2vec [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]) and one context-dependent (BERT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]) based
on transformers. For each embedding method, we compute the cosine similarities
of the candidates in a k-partite graph, where k is the number of entity mentions,
and candidates for the same entity mention belong to the same partition.
        </p>
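<p>The cohesion-based selection over the k-partite candidate graph can be sketched as a brute-force search (illustrative code; the exhaustive enumeration is exponential in the number of mentions, which is why an approximation is needed in practice):</p>

```python
from itertools import product
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def disambiguate(candidate_lists):
    """Pick one candidate per mention so that the chosen entity embeddings
    are maximally coherent (highest sum of pairwise cosine similarities).
    candidate_lists: one list of (entity_id, embedding) pairs per mention,
    i.e. one partition of the k-partite graph per entity mention."""
    best, best_score = None, -math.inf
    for combo in product(*candidate_lists):
        score = sum(cosine(a[1], b[1])
                    for i, a in enumerate(combo)
                    for b in combo[i + 1:])
        if score > best_score:
            best, best_score = [entity for entity, _ in combo], score
    return best
```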
<p>In our preliminary experimental results, we observe that BERT performs
best for the case of single entity mentions, since it is the only context-dependent
method, while fastText performs better when more than one entity mention
appears in an argument, since in that case we can also consider the other entity
mentions as context.</p>
        <p>
Extending this approach to other languages. The proposed pipeline
can be extended to other languages, as long as medium-resource coverage
(i.e., NER corpora and/or pre-trained NER models, POS taggers, a relevant
sizeable Wikipedia corpus) is available in the target language. The approach is
also applicable to low-resource languages, where it can benefit
from cross-lingual transfer learning approaches [
<xref ref-type="bibr" rid="ref1">1</xref>
] or unsupervised learning
approaches. However, in all these cases, extra provision must be made for the
idiosyncrasies of each language, such as morphology and syntax. Finally, we note
that, unlike other target KBs, Wikidata uses the same entity identifiers across
all available languages.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future work</title>
      <p>
In this poster, we introduced EL-NEL, a Greek named entity linking
component which is employed in the DebateLab project. We have briefly described
the modular architecture and alternative options for each component of
EL-NEL, as well as some preliminary findings. In our ongoing and future work, we
plan to explore KGs beyond the Wiki-based ones. We also plan to incorporate
graph embeddings (e.g., ED-GNN [
<xref ref-type="bibr" rid="ref11">11</xref>
]) for better capturing the context of
entity mentions, and to find a suitable approximation to the NP-hard problem of
disambiguating multiple entity mentions in the same argument.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>This project has received funding from the Hellenic Foundation for Research and
Innovation (HFRI) and the General Secretariat for Research and Technology
(GSRT), under grant agreement No 4195.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
1.
<string-name><surname>Alyafeai</surname>, <given-names>Z.</given-names></string-name>,
<string-name><surname>AlShaibani</surname>, <given-names>M.S.</given-names></string-name>,
<string-name><surname>Ahmad</surname>, <given-names>I.</given-names></string-name>:
<article-title>A survey on transfer learning in natural language processing</article-title>.
<source>CoRR abs/2007.04239</source> (<year>2020</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bartziokas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mavropoulos</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotropoulos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
<article-title>Datasets and performance metrics for Greek named entity recognition</article-title>
          .
          <source>In: SETN</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Trans. Assoc. Comput. Linguistics</source>
<volume>5</volume>, <fpage>135</fpage>–<lpage>146</lpage> (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Brank</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leban</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobelnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
<article-title>Annotating documents with relevant Wikipedia concepts</article-title>
          .
          <source>In: SiKDD</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khandelwal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudhary</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wenzek</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guzman</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          .
          <source>In: ACL</source>
. pp. <fpage>8440</fpage>–<lpage>8451</lpage> (<year>2020</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: NAACL-HLT</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Koutsikakis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chalkidis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malakasiotis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
<article-title>GREEK-BERT: The Greeks visiting Sesame Street</article-title>
          .
          <source>In: SETN</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ntais</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saroukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berki</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
<string-name><surname>Dalianis</surname>, <given-names>H.</given-names></string-name>:
          <article-title>Development and enhancement of a stemmer for the greek language</article-title>
          .
          <source>In: PCI</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
<string-name><surname>Qi</surname>, <given-names>P.</given-names></string-name>,
<string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>,
<string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>,
<string-name><surname>Bolton</surname>, <given-names>J.</given-names></string-name>,
<string-name><surname>Manning</surname>, <given-names>C.D.</given-names></string-name>:
          <article-title>Stanza: A Python natural language processing toolkit for many human languages</article-title>
          .
          <source>In: ACL</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
<string-name><surname>Schmid</surname>, <given-names>H.</given-names></string-name>:
          <article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
          .
          <source>In: ICNMLP</source>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vretinaris</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lei</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efthymiou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
<string-name><surname>Özcan</surname>, <given-names>F.</given-names></string-name>:
          <article-title>Medical entity disambiguation using graph neural networks</article-title>
          .
          <source>In: SIGMOD</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wali</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahoney</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Middleton</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Babaeianjelodar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Njie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
<string-name><surname>Matthews</surname>, <given-names>J.N.</given-names></string-name>:
<article-title>Is machine learning speaking my language? A critical look at the NLP pipeline across 8 human languages</article-title>.
<source>CoRR abs/2007.05872</source> (<year>2020</year>)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yamada</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakuma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shindo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeda</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takefuji</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
<article-title>Wikipedia2Vec: An efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>