<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building and Evaluating Universal Named-Entity Recognition English corpus</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Humanities and Social Sciences, University of Zagreb</institution>
          ,
          <addr-line>Zagreb 10000</addr-line>
          ,
          <country country="HR">Croatia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This article presents the application of the Universal Named Entity framework to the generation of automatically annotated corpora. Using a workflow that extracts Wikipedia data and metadata together with DBpedia information, we generated an English dataset, which is described and evaluated. Furthermore, we conducted a set of experiments to improve the annotations in terms of precision, recall, and F1-measure. The final dataset is available, and the established workflow can be applied to any language with an existing Wikipedia and DBpedia. As part of future research, we intend to continue improving the annotation process and extend it to other languages.</p>
      </abstract>
      <kwd-group>
        <kwd>named entity recognition</kwd>
        <kwd>data extraction</kwd>
        <kwd>multilingual nlp</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
Named entity recognition and classification (NERC) is an important field within
Natural Language Processing (NLP), being a crucial task of information
extraction from texts. It was first defined in 1995 at the 6th Message Understanding
Conference (MUC-6)[
        <xref ref-type="bibr" rid="ref5">5</xref>
] and has since been used in multiple NLP
applications such as event and relation extraction, question answering systems, and
entity-oriented search.
      </p>
      <p>
        As shown by Alves et al.[
        <xref ref-type="bibr" rid="ref1">1</xref>
], NERC corpora and tools present an immense
variety in terms of annotation hierarchies and formats. The NERC hierarchy
structure is usually locally defined according to the final NLP application in which it
will be used. While certain types such as "Person", "Location", and "Organization"
are present in almost every NERC system, some corpora are composed of more
complex annotation types. This is the case, for example, of the Portuguese Second
HAREM[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Czech Named Entity Corpus 2.0[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and the Romanian RONEC[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Multilingual alternatives also exist, such as the spaCy software[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which proposes two
different single-level hierarchies composed of either 18 or 4 NERC types, following
OntoNotes 5.0[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and Wikipedia[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] respectively.
      </p>
      <p>Unlike Part-of-Speech tagging and Dependency Parsing, which have Universal
Dependencies1, there is no universal alternative for NERC in terms of an annotation
framework and a multilingual repository following the same standards.</p>
      <p>
        Hence, we use the Universal Named Entity (UNER) framework, which is
composed of a complex NERC hierarchy inspired by Sekine's work[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and propose a
process which parses data from Wikipedia2, extracts named entities through
hyperlinks, aligns them with DBpedia3[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] entity classes, and translates them into
UNER types and subtypes. This process can be applied to any language present
in both Wikipedia and DBpedia, thus generating multilingual NERC
corpora following the same procedure and hierarchy.
      </p>
      <p>Thus, UNER is useful for multilingual NLP tasks which need recognition and
classification of named entities beyond the classical NERC hierarchy involving
only a few types. UNER data can be used in its totality or can easily be adapted
to specific needs. For example, considering the classic "Location" type usually
present in NERC corpora, UNER can be used to obtain more detailed information:
whether an entity is, more specifically, a country, mountain, island, etc.</p>
      <p>This paper presents the UNER hierarchy and its workflow for data
extraction and annotation. It details the application of the proposed process to the English
language, with qualitative and quantitative evaluation of the automatically
annotated data. It also presents the evaluation of the different alternatives implemented
for the improvement of the generated dataset.</p>
      <p>This article is organized as follows: In Section 2, we present the
state of the art concerning NERC automatic data generation workflows; in Section 3, we
describe the UNER framework and hierarchy, and in Section 4 the details of the data
extraction and annotation workflow together with the evaluation of the generated dataset. In
Section 5, we report the experiments that were conducted to improve
dataset quality in terms of precision and recall. Section 6 is dedicated to the
discussion of the results, and in Section 7, we present our conclusions and possible
future directions for research.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The UNER framework was first introduced by us in a previous article [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where we
defined its hierarchy. In this article, the framework is revised, and a workflow for
automatic text annotation is developed and applied to generate an annotated
corpus in English, with the respective evaluation.
      </p>
      <p>
        Deep learning has been employed in NERC systems in recent years, improving
state-of-the-art performance and, therefore, increasing the need for quality
annotated datasets, as stated by Yadav and Bethard[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and Li et al.[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. These
authors have provided a broad overview of existing techniques for extracting and
classifying named entities using machine and deep learning methods.
      </p>
      <sec id="sec-2-1">
        <title>1 https://universaldependencies.org</title>
      </sec>
      <sec id="sec-2-2">
        <title>2 https://www.wikipedia.org/</title>
      </sec>
      <sec id="sec-2-3">
        <title>3 https://www.dbpedia.org/</title>
        <p>
          The problem of composing new NERC datasets has been the object of the
study proposed by Lawson et al.[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Manual annotation of large corpora is
usually very costly; the authors therefore propose the usage of Amazon Mechanical
Turk as a low-cost alternative. However, this method still depends on a specific
budget for the task and can be very time-consuming. It also depends on the
availability of annotators for each specific language, which may be problematic
if the aim is to generate large multilingual corpora.
        </p>
        <p>
          A generic method for extracting entities from Wikipedia articles was
proposed by Bekavac and Tadic[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and includes multi-word extraction of named
entities using local regular grammars. Therefore, for each targeted language, a
new set of rules must be defined. Other automatic multilingual solutions have
been proposed by Ni et al.[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and Kim et al.[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] using either annotation
projection on comparable corpora or Wikipedia metadata on parallel datasets. Both
methods, however, still require manual annotations that are language-dependent
and cannot be applied universally. Furthermore, Weber &amp; Vieira[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] use a
process similar to the one presented in this article for annotating Wikipedia texts
using DBpedia information. However, their focus is on Portuguese only, with a
very simple NERC hierarchy.
        </p>
        <p>
          The idea of using Wikipedia metadata to annotate multilingual corpora has
also been proposed by Nothman et al.[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] for English, German, Spanish, Dutch,
and Russian. Despite the multilingual approach, it also requires manually
annotated text.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>UNER Dataframe Description</title>
      <p>
        As mentioned in the previous section, the UNER hierarchy was introduced
by Alves et al.[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It was built upon the 4-level NERC hierarchy proposed by
Sekine[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which was chosen as it presents a high conceptual hierarchy. The
changes between the two structures have been detailed by the authors. The
proposed UNER hierarchy is also composed of 4 levels, level 0 being the root node
from which all the other levels derive. Level 1 consists of three main classes:
Name, Time Expression, and Numerical Expression. Level 2 is composed of 29
named-entity categories, which can be detailed in a third level with 95 types.
Additionally, level 4 contains 129 subtypes (Alves et al.[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>This first version of the UNER hierarchy therefore encompasses 215 labels,
which can contain up to four levels of granularity, depending on how detailed
the named-entity type is. UNER labels are composed of tags from each level
separated by a hyphen "-". As level 0 is the root and common to all entities, it
is not present in the label. For example:
- the UNER label Name-Event-Natural Phenomenon-Earthquake is composed of
level 1 Name, level 2 Event, level 3 Natural Phenomenon, and level 4
Earthquake.</p>
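      <p>The label structure above can be split programmatically. A minimal sketch (the helper name parse_uner_label is ours, not part of the UNER tooling):</p>

```python
def parse_uner_label(label):
    """Split a hyphen-separated UNER label into its hierarchy levels.

    Level 0 (the root) is implicit and never appears in the label itself.
    """
    parts = label.split("-")
    # Levels start at 1; a label carries up to four levels of granularity.
    return {f"level {i}": part for i, part in enumerate(parts, start=1)}

print(parse_uner_label("Name-Event-Natural Phenomenon-Earthquake"))
# {'level 1': 'Name', 'level 2': 'Event',
#  'level 3': 'Natural Phenomenon', 'level 4': 'Earthquake'}
```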
      <p>The idea of using both Wikipedia data and metadata associated with
DBpedia information to generate UNER annotated datasets compelled us to revise
the first proposed UNER hierarchy. The main reason is that the automatic
annotation process is based on a list of equivalences between UNER labels and
DBpedia classes.</p>
      <p>While generating the list of equivalences, it became noticeable that not all UNER
labels have an equivalent DBpedia class. This is the case for the majority of Time
and Numerical expressions. These cases will have to be handled by other
automatic methods in future work.</p>
      <p>Therefore, for this article, we consider version 2 of UNER, presented on the
GitHub webpage of the project4. It is composed of 124 labels, and its hierarchy
is detailed in Table 1.</p>
      <p>
        In the annotation process, we have decided to use the IOB format[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] as it
is widely used by many NERC systems, as shown by Alves et al.[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore,
each annotated entity token also receives, at the beginning of its UNER label,
the letter "B" if the token is the first of the entity or "I" if it is inside it. Non-entity
tokens receive only the tag "O".
      </p>
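      <p>A minimal illustration of the IOB scheme applied to UNER labels (the sentence, its tags, and the helper extract_entities are a toy example of ours):</p>

```python
tokens = ["The", "2015", "European", "Games", "opened", "in", "Baku", "."]
# "B-" marks the first token of an entity, "I-" the following tokens of the
# same entity, and "O" marks non-entity tokens.
tags = [
    "O",
    "B-Name-Event-Occasion-Game",
    "I-Name-Event-Occasion-Game",
    "I-Name-Event-Occasion-Game",
    "O",
    "O",
    "B-Name-Location-GPE-City",
    "O",
]

def extract_entities(tokens, tags):
    """Group IOB-tagged tokens back into (entity text, UNER label) pairs."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(tokens, tags))
# [('2015 European Games', 'Name-Event-Occasion-Game'),
#  ('Baku', 'Name-Location-GPE-City')]
```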
    </sec>
    <sec id="sec-4">
      <title>Data Extraction and Annotation</title>
      <p>The workflow we have developed allows the extraction of texts and metadata
from Wikipedia (for any language present in this database), followed by the
identification of the DBpedia classes via the hyperlinks associated with certain
tokens (entities) and the translation to UNER types and subtypes (these last
two steps being language independent).</p>
      <p>
        Once the main process of data extraction and annotation is over, the workflow
proposes post-processing steps to improve the tokenization, implement the IOB
format [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and gather statistical information concerning the generated corpus.
      </p>
      <p>The whole workflow is presented in detail on the project's GitHub webpage,
together with all the scripts that have been used, which can be applied to any
other Wikipedia language.</p>
      <sec id="sec-4-1">
        <title>Process description</title>
        <p>1. UNER/DBpedia Mapping: This is a mapper that connects each
pertinent DBpedia class with a single UNER tag. It was created by the members</p>
        <sec id="sec-4-1-1">
          <title>4 https://github.com/cleopatra-itn/MIDAS/</title>
          <p>of the project, analysing each DBpedia class and associating it with the most
pertinent UNER tag. A single extracted named entity might have more than
one DBpedia class. For example, the entity 2015 European Games has the
following DBpedia classes with the respective UNER equivalences:
- dbo:Event → Name-Event-Historical-Event
- dbo:SoccerTournament → Name-Event-Occasion-Game
- dbo:SocietalEvent → Name-Event-Historical-Event
- dbo:SportsEvent → Name-Event-Occasion-Game
- owl:Thing → NULL
The value on the left represents a DBpedia class, and its UNER equivalent
is on the right side. The mapper covers all the DBpedia classes with UNER
equivalent classes.
2. DBpedia Hierarchy: This mapper assigns priorities to each DBpedia class.</p>
          <p>This is used to select a single DBpedia class from the collection of classes
that are associated with an entity. Following are examples of classes and
their priorities.</p>
          <p>{ dbo:Event { 2
{ dbo:SoccerTournament { 4
{ dbo:SocietalEvent { 2
{ dbo:SportsEvent { 4
{ owl:Thing { 1
For entity 2015 European Games , the DBpedia class
SoccerTournament presides over the other classes as it has a higher priority value. If the
extracted entity has two assigned classes with the same hierarchy value the
rst from the list is chosen as the nal one. All the DBpedia classes were
assigned with a hierarchy value according to DBpedia Ontology5, where classes
are presented in a structural order which allowed us to de ne the hierirchal
levels.
4.2</p>
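          <p>The class-selection rule described above can be sketched as follows. The priority table and equivalences are abbreviated to the classes from the example; the function name resolve_uner_tag is ours:</p>

```python
# Priority (hierarchy) values for DBpedia classes, as in the example above.
DBPEDIA_PRIORITY = {
    "dbo:Event": 2,
    "dbo:SoccerTournament": 4,
    "dbo:SocietalEvent": 2,
    "dbo:SportsEvent": 4,
    "owl:Thing": 1,
}

# UNER equivalences for the same classes (owl:Thing has no equivalent).
DBPEDIA_TO_UNER = {
    "dbo:Event": "Name-Event-Historical-Event",
    "dbo:SoccerTournament": "Name-Event-Occasion-Game",
    "dbo:SocietalEvent": "Name-Event-Historical-Event",
    "dbo:SportsEvent": "Name-Event-Occasion-Game",
    "owl:Thing": None,
}

def resolve_uner_tag(classes):
    """Pick the highest-priority DBpedia class and map it to UNER.

    max() keeps the first maximal element, matching the rule that ties
    are broken by list order.
    """
    best = max(classes, key=lambda c: DBPEDIA_PRIORITY.get(c, 0))
    return DBPEDIA_TO_UNER.get(best)

classes = ["dbo:Event", "dbo:SoccerTournament", "dbo:SocietalEvent",
           "dbo:SportsEvent", "owl:Thing"]
print(resolve_uner_tag(classes))  # Name-Event-Occasion-Game
```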
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Main process</title>
        <p>The main process is schematized in the figure below and is divided into three
sub-processes.
1. Extraction from Wikipedia dumps: For a given language, we obtain its
latest dump from the Wikimedia website6. Next, we perform text extraction,
preserving the hyperlinks in the articles, using WikiExtractor7. These are
hyperlinks to other Wikipedia pages as well as unique identifiers of those
named entities. We extract all the unique hyperlinks and sort them
alphabetically. These hyperlinks will be referred to as named entities henceforth.</p>
        <sec id="sec-4-2-1">
          <title>5 http://mappings.dbpedia.org/server/ontology/classes/</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>6 https://dumps.wikimedia.org/</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>7 https://github.com/attardi/wikiextractor</title>
          <p>2. Wikipedia-DBpedia entity linking: For all the unique named entities
from the dumps, we query the DBpedia endpoint using a SPARQL query
with SPARQLWrapper8 to identify the various classes associated with each
entity. This step produces, for each named entity from step 1, the set of
DBpedia classes it belongs to.
3. Wikipedia-DBpedia-UNER back-mapping: For every extracted named
entity obtained in step 1, we use the set of classes produced in step 2, along
with the UNER/DBpedia mapping schema, to assign a UNER class to each
named entity. For an entity, all the classes obtained from the DBpedia
response are mapped to hierarchy values, the highest-valued class is resolved
and chosen, and it is then mapped to a UNER class. For the construction of the
final annotation dataset, we only select those sentences that contain at least
one named entity. This reduces the sparsity of annotations and thus
reduces the false-negative rate in our test models. This step produces an
initial tagged corpus from the whole Wikipedia dump for a specific language.</p>
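          <p>Step 2 can be sketched as follows. This is an illustration of the general shape of such a query, not necessarily the authors' exact one; the helper build_class_query and the resource-URI convention (spaces become underscores) are our assumptions:</p>

```python
def build_class_query(entity):
    """Build a SPARQL query returning the rdf:type classes of a DBpedia
    resource named after a Wikipedia page title."""
    resource = "http://dbpedia.org/resource/" + entity.replace(" ", "_")
    return (
        "SELECT DISTINCT ?cls WHERE { "
        f"<{resource}> a ?cls . "
        "}"
    )

query = build_class_query("2015 European Games")
print(query)

# Executing the query requires network access, e.g. with SPARQLWrapper:
# from SPARQLWrapper import SPARQLWrapper, JSON
# sparql = SPARQLWrapper("https://dbpedia.org/sparql")
# sparql.setQuery(query)
# sparql.setReturnFormat(JSON)
# classes = [b["cls"]["value"]
#            for b in sparql.query().convert()["results"]["bindings"]]
```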
        </sec>
        <sec id="sec-4-2-4">
          <title>8 https://rdflib.dev/sparqlwrapper/</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Post-processing steps</title>
        <p>
          The post-processing steps correspond to three different scripts that provide:
1. The improvement of the tokenization (using regular expressions) by
isolating punctuation characters that were attached to words. In addition, this
script applies the IOB format[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] to the UNER annotations inside the text.
2. The calculation of the following statistical information concerning the
generated corpus: total number of tokens, number of non-entity tokens (tag
"O"), number of entity tokens (tags "B" or "I"), and number of entities
(tag "B"). The script also provides a list of all UNER tags with the number
of occurrences of each tag inside the corpus.
3. The listing of the entities inside the corpus (tokens and the corresponding UNER
tag). Each identified entity appears once in this list, even if it has multiple
occurrences in the corpus.
        </p>
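        <p>The statistics in step 2 can be computed directly from the IOB tags. A small sketch over a toy tag sequence (the function corpus_statistics is our illustration, not the project's script):</p>

```python
from collections import Counter

def corpus_statistics(tags):
    """Compute the corpus-level counts described above from a flat list
    of IOB tags ("O", or "B-<UNER label>" / "I-<UNER label>")."""
    total = len(tags)
    non_entity = sum(1 for t in tags if t == "O")
    entity_tokens = total - non_entity
    # Each entity starts with exactly one "B-" tag.
    entities = sum(1 for t in tags if t.startswith("B-"))
    # Occurrences per UNER label, counted once per entity.
    per_label = Counter(t[2:] for t in tags if t.startswith("B-"))
    return total, non_entity, entity_tokens, entities, per_label

tags = ["O", "B-Name-Location-GPE-City", "O",
        "B-Name-Person-Name", "I-Name-Person-Name", "O"]
print(corpus_statistics(tags))
```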
        <p>The whole process and the post-processing steps were applied to the English
language, generating the UNER English corpus, which is described and evaluated
in the following section. This baseline corpus is the basis for the improvement
experiments presented in later sections.</p>
      </sec>
      <sec id="sec-4-4">
        <title>UNER English Corpus (Baseline)</title>
        <p>
          General Information The English Wikipedia [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] is composed of 6,188,204
articles. After applying the main process of the proposed workflow, we obtained
annotated text files divided into folders. The size of the English UNER corpus
is presented in the following table.
        </p>
        <p>Statistical information concerning the corpus is obtained by applying the
post-processing steps previously described. Table 3 presents the main statistics
about the number of tokens and entities. Inside the UNER English Corpus, 8.9% of
the tokens are entities.</p>
        <p>As presented in Section 3, the UNER hierarchy used for annotating the
English Wikipedia texts is composed of 124 different multi-level labels with
equivalences to DBpedia classes. However, the baseline UNER English corpus contains
only 99 different UNER tags (80%).</p>
        <p>As explained previously, the UNER hierarchy is composed of categories,
types, and subtypes. UNER includes the most common classes used in NERC
(Person, Location, Organization), while being more detailed (subtypes):</p>
        <p>- Person: corresponds to the UNER label Name-Person-Name
- Location: corresponds to all subtypes inside the UNER type Name-Location
- Organization: corresponds to all subtypes inside the UNER type Name-Organization</p>
        <p>Therefore, it is possible to analyse the generated corpora in terms of these
more generic classes.</p>
        <p>These main classes correspond to 68.2% of the NEs in the generated corpus.
Qualitative evaluation The proposed process requires the identification of the
DBpedia classes associated with the respective tokens (via hyperlinks) and the
translation to UNER using the UNER/DBpedia equivalences.</p>
        <p>An analysis of 943 entities randomly selected from the UNER English Corpus
was performed to evaluate this step of the workflow. For each one, we
checked the associated DBpedia classes and the final chosen UNER tag. Table 5
presents the results of this evaluation.</p>
        <p>In the selected sample, 91% of the entities are correctly tagged with UNER
tags. Nevertheless, 6% are associated with the correct UNER type but with a too
generic subtype. For example, Bengkulu should be tagged as
Name-Location-GPE-City but received the tag Name-Location-GPE-GPE Other. Errors may
come from mistakes in the DBpedia classes associated with the tokens or from
the prioritization rules and equivalences defined between DBpedia and UNER:
- Buddhism is associated only with the DBpedia class EthnicGroup and,
therefore, is wrongly tagged as Name-Organization-Ethnic Group other, while it
should be associated with the UNER tag Name-Product-Doctrine Method-Religion.
- Brit Awards, due to the prioritization of the DBpedia class hierarchy in the
choice of UNER tags, is wrongly tagged as
Name-Organization-Corporation-Company, while it should receive the tag Name-Product-Award.</p>
        <p>
          UNER English Golden dataset Besides the statistical information presented
above, a sample from the generated corpus was selected and corrected using
WebAnno[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] by one annotator. The sample corresponds to one entire file from
the output folder and contains 519 sentences and 105 different UNER labels (out
of the 124 from the list of UNER-DBpedia equivalences). The annotations were done
by a non-native English speaker who is a member of the project. He followed
objective guidelines, and for some specific entities, research using Wikipedia
was done. In cases of multiple possible assignments, a final choice was made by the
annotator so that each entity would have only one label in the golden set.
        </p>
        <p>Table 6 presents the evaluation results of the baseline annotations of the file
used to create the Golden dataset in terms of precision, recall, and F1-measure,
considering the mean value over all 105 labels for each metric.</p>
        <p>As explained previously, the annotation of a certain named entity depends
on the existence of hyperlinks. However, these links are not always associated
with the tokens if the entity is mentioned repeatedly in the article. This may be
one of the main reasons for the low value obtained for recall.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Dataset Improvement</title>
      <p>Evaluation of the baseline annotated file using the Golden UNER English
Corpus shows that the automatic annotation workflow has room for improvement,
especially in terms of reducing the number of false negatives. Strategies for
completing the annotation using dictionaries and a knowledge graph were applied to
the English Corpus. The ensemble of experiments and their evaluation is presented
in the subsections below.</p>
      <sec id="sec-5-1">
        <title>Experiment Design</title>
        <p>Seven different experiments were conducted:
1. Global Dictionary: From the whole UNER English Corpus, we
established a dictionary of entities and their respective UNER labels. As the same
entity may appear in the corpus with different UNER tags (due to the
associated DBpedia classes), we selected for each entity the label with the
highest number of occurrences. This dictionary is then used to complete the
annotations of the corpus. Only entities longer than 2 characters
were considered, and numerical entities were excluded from the dictionary.</p>
        <p>The final size of the global dictionary is 826,371 entities.
2. Global Dictionary with multi-token entities only: Similar to the previous
experiment, but in this case only entities with more than one token were
considered. In total, the global dictionary is composed of 665,081 multi-token
entities.
3. Local Dictionaries: In this setup, we processed every Wikipedia dump file as
a single article. Every entity in the article that is linked to UNER is cached
in a local lookup dictionary, with its text as the key and its UNER class as
the value. Every subsequent occurrence of that text in the given article is
annotated with the corresponding UNER class. We performed this step
under the assumption that entities are more likely to reappear within a single
article than in a completely unrelated one. For example, Barack Obama
as a person is more likely to appear in an article describing him as president
than as a fictional character in fictional content about him.
4. Global OEKG Dictionary: The Open Event Knowledge Graph (OEKG)9 is a
multilingual event-centric resource. Its instances have specific DBpedia classes;
therefore, we intersected all the entries from the global dictionary with
elements from the OEKG. For each entity, its associated DBpedia class from the OEKG
was then mapped to UNER. The global OEKG dictionary contains 128,813
entries.
5. Global OEKG Dictionary with multi-token entities only: Similar to
experiment 4, but only entities with more than one token were
considered (110,226 entities in total).
6. Local Dictionaries followed by Global OEKG Dictionary: Combination of
experiment 3 with completion of the annotations using the dictionary established
for experiment 4.
7. Local Dictionaries followed by OEKG Dictionary with multi-token
entities only: The corpus from experiment 3 is completed using the dictionary from
experiment 5.</p>
        <p>In all experiments, the dictionaries were ordered from the longest entities to the
shortest ones to guarantee that multi-token entities were preferentially annotated
over mono-token ones.</p>
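        <p>The longest-first completion pass can be sketched as a simplified token-level matcher (the function complete_annotations and the toy dictionary are our illustration, under the assumption of exact token matches):</p>

```python
def complete_annotations(tokens, tags, dictionary):
    """Fill untagged ("O") spans with dictionary entries, longest first.

    Ordering entries from the most tokens to the fewest guarantees that
    multi-token entities are preferentially annotated over mono-token ones.
    """
    entries = sorted(dictionary.items(), key=lambda kv: -len(kv[0].split()))
    for entity, label in entries:
        ent_toks = entity.split()
        n = len(ent_toks)
        for i in range(len(tokens) - n + 1):
            # Only annotate spans whose tokens are all still untagged.
            if tokens[i:i + n] == ent_toks and all(
                    t == "O" for t in tags[i:i + n]):
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
    return tags

tokens = ["Barack", "Obama", "visited", "Zagreb", "."]
tags = ["O", "O", "O", "O", "O"]
dictionary = {"Barack Obama": "Name-Person-Name",
              "Zagreb": "Name-Location-GPE-City"}
print(complete_annotations(tokens, tags, dictionary))
# ['B-Name-Person-Name', 'I-Name-Person-Name', 'O',
#  'B-Name-Location-GPE-City', 'O']
```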
        <sec id="sec-5-1-1">
          <title>9 http://cleopatra-project.eu/index.php/open-event-knowledge-graph/</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation</title>
        <p>The evaluation was conducted using the Golden Corpus presented previously.
The baseline is the corresponding file with the automatic annotations resulting from
the workflow described in Section 4.</p>
        <p>
          The Golden Corpus has 105 different UNER labels; however, the baseline
annotated file has only 62. For each possible label, we calculated precision, recall, and
F1-measure. The IOB format[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] was applied; therefore, each UNER label can
start either with "B" or "I", and non-entity tokens were tagged with "O".
        </p>
        <p>Of the 62 labels of the baseline, only 45 presented results different from 0.
Therefore, the values in the following table consider only these tags and
represent the mean value over all the tags taken into account. Table 7 presents the
metrics obtained for the baseline and each of the experiments described in
the previous subsection.</p>
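        <p>The per-label metrics and their mean can be computed as below. This is a generic token-level sketch of ours (libraries such as seqeval implement the entity-level variant):</p>

```python
from collections import defaultdict

def macro_prf(gold, pred):
    """Per-label precision/recall/F1 from two aligned tag sequences,
    macro-averaged over the labels occurring in either sequence
    (the "O" tag is excluded)."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p and g != "O":
            tp[g] += 1
        elif g != p:
            if p != "O":
                fp[p] += 1
            if g != "O":
                fn[g] += 1
    labels = set(tp) | set(fp) | set(fn)
    precisions, recalls, f1s = [], [], []
    for lbl in labels:
        prec = tp[lbl] / (tp[lbl] + fp[lbl]) if tp[lbl] + fp[lbl] else 0.0
        rec = tp[lbl] / (tp[lbl] + fn[lbl]) if tp[lbl] + fn[lbl] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels) or 1
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

gold = ["O", "B-Person", "I-Person", "O", "B-City"]
pred = ["O", "B-Person", "O", "O", "B-City"]
print(macro_prf(gold, pred))
```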
        <p>Using the global dictionary (experiment 1) provides the highest value of
recall (+3.7 compared to the baseline), but precision is considerably lower (-40.8).
The situation is similar when the global dictionary is used only with multi-token entities
(experiment 2). The other experiments do not decrease precision as drastically, and
in some cases this metric is even increased. Recall is increased, compared to the
baseline, for all experiments except 3, 6, and 7. The usage of local dictionaries
was not an effective solution for improving this evaluation metric.</p>
        <p>The best option, considering the F1-measure, is the usage of the dictionary verified
with the OEKG (experiment 4). Precision is slightly lower than the baseline (-1.8),
while recall and F1-measure are higher (+1.9 and +1.6, respectively). If we
consider only level 3 of the UNER hierarchy, the possible tags are: Disease, Event,
Facility, Location, Natural Object, Organization, Person, and Product.</p>
        <p>The evaluation of each experiment considering only this upper level of the
UNER hierarchy is presented in Table 8. The IOB format was also considered;
therefore, UNER labels could be preceded by either "B" or "I", and non-entity
tokens were tagged with "O".</p>
        <p>In this scenario, the highest precision is that of the baseline. The best recall
is obtained when the global dictionary is used (experiment 1) but, as
observed before, in this case precision is heavily impacted compared to the
baseline (-51.0). Experiment 4 is the one with the highest F1-measure, as in the
previous evaluation where all UNER levels were considered.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>In Section 4, we presented the whole process for generating UNER annotated
corpora and the application of this workflow to create a dataset for the English
language. The evaluation of the entity extraction and translation to UNER
(Table 5) showed that 85% of the analysed entities were correctly annotated. However,
errors are introduced into the dataset mainly because of wrong associations between
the tokens inside the Wikipedia text and DBpedia classes. Furthermore, the selection
rule in our process (choosing the DBpedia class with the highest granularity
inside the DBpedia hierarchy) is also a source of mistakes.</p>
      <p>The corpus generated by the proposed workflow was also evaluated using
the manually annotated Golden dataset. It is noticeable that, even with the
reduced version of the UNER hierarchy (UNER v2, 124 labels with DBpedia
equivalences) used for this task, the whole dataset was annotated with only
99 different labels, of which only 62 presented values of precision and recall
different from zero in the baseline sample file when compared to the Golden dataset.
Therefore, further changes in the UNER hierarchy should be implemented; for
example, Time and Numerical Expressions (already reduced in UNER v2) should
be excluded at this step of the annotation, and some detailed labels concerning
Events, Organizations, and Locations can be joined into more generic categories.</p>
      <p>
        As expected, recall is much lower than precision in all conducted evaluations.
This is due to the fact that, inside Wikipedia, not all entities are linked to
DBpedia. This problem was also encountered by Weber &amp; Vieira[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], who used a
similar workflow for a Portuguese named-entity task with a much simpler
hierarchy. The authors did not conduct any intrinsic evaluation of the generated
corpus but, instead, used it to train models and evaluated the results against
existing Golden sets.
      </p>
      <p>Concerning the improvement experiments, the best identified option was to
use a dictionary fine-tuned with the Open Event Knowledge Graph. This graph
allows the identification of a more precise DBpedia class and therefore helps to
improve recall without a considerable loss in precision. In this article, we
presented the use of this method as a post-processing step to complete the initial
annotations of the baseline. However, the OEKG information can also be
used inside the workflow as a way of improving overall precision.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Directions</title>
      <p>In this paper, we described an automatic workflow for generating multilingual
named-entity recognition corpora by using Wikipedia and DBpedia data and
following the UNER hierarchy. The whole process is available and can be
applied to any language having a Wikipedia and a DBpedia. We also presented the
application of the extraction and annotation method used to generate the UNER
English corpus. The generated dataset was described and evaluated against a
manually annotated Golden set.</p>
      <p>Furthermore, an ensemble of experiments was conducted to improve the final
annotated dataset. We identified that the best results were obtained by
using a dictionary of entities with verification of the associated DBpedia class
against the Open Event Knowledge Graph: 76.9 for precision, 31.0 for recall, and 36.0
for F1-measure. Nevertheless, there is still room for improvement in both recall
and F1-measure.</p>
      <p>In our future work, we plan to continue exploring the OEKG by integrating it
into the extraction and annotation workflow and not only as a post-processing
step. We also intend to extend our corpus to other languages, especially
under-resourced ones, while evaluating our workflow's performance across the
languages. Moreover, to complete this intrinsic evaluation of our dataset, we plan
to evaluate it extrinsically by using the generated datasets to train machine
and deep learning models.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>The work presented in this paper has received funding from the European
Union's Horizon 2020 research and innovation programme under the Marie
Skłodowska-Curie grant agreement no. 812997, under the name CLEOPATRA
(Cross-lingual Event-centric Open Analytics Research Academy).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Alves, D., Kuculo, T., Amaral, G., Thakkar, G., Tadić, M.: UNER: Universal Named-Entity Recognition Framework. In: Proceedings of the 1st International Workshop on Cross-lingual Event-centric Open Analytics, pp. 72–79. Association for Computational Linguistics (2020)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Alves, D., Thakkar, G., Tadić, M.: Evaluating language tools for fifteen EU-official under-resourced languages. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 1866–1873. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.230</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Bekavac, B., Tadić, M.: A generic method for multi word extraction from Wikipedia. In: Proceedings of the 30th International Conference on Information Technology Interfaces (2008), https://www.bib.irb.hr/348724</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Eckart de Castilho, R.,
          <string-name>
            <surname>Mujdricza-Maydt</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yimam</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biemann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A web-based tool for the integrated annotation of semantic and syntactic structures</article-title>
          .
          <source>In: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)</source>
          . pp.
          <volume>76</volume>
          {
          <fpage>84</fpage>
          .
          <string-name>
            <surname>The</surname>
            <given-names>COLING</given-names>
          </string-name>
          2016
          <string-name>
            <given-names>Organizing</given-names>
            <surname>Committee</surname>
          </string-name>
          , Osaka,
          <source>Japan (Dec</source>
          <year>2016</year>
          ), https://www.aclweb.org/anthology/W16-4011
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Chinchor, N., Robinson, P.: Appendix E: MUC-7 Named Entity Task Definition (version 3.5). In: Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998 (1998), https://www.aclweb.org/anthology/M98-1028</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Dumitrescu, S.D., Avram, A.: Introducing RONEC - the Romanian named entity corpus. CoRR abs/1909.01247 (2019), http://arxiv.org/abs/1909.01247</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Freitas, C., Carvalho, P., Gonçalo Oliveira, H., Mota, C., Santos, D.: Second HAREM: advancing the state of the art of named entity recognition in Portuguese. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010), Valletta, 17-23 May 2010. European Language Resources Association (2010)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: Industrial-strength Natural Language Processing in Python (2020). https://doi.org/10.5281/zenodo.1212303</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Kim, S., Toutanova, K., Yu, H.: Multilingual named entity recognition using parallel data and metadata from Wikipedia. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 694–702. Association for Computational Linguistics, Jeju Island, Korea (Jul 2012), https://www.aclweb.org/anthology/P12-1073</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Lawson, N., Eustice, K., Perkowitz, M., Yetisgen-Yildiz, M.: Annotating large email datasets for named entity recognition with Mechanical Turk. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 71–79. Association for Computational Linguistics, Los Angeles (Jun 2010), https://www.aclweb.org/anthology/W10-0712</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015). https://doi.org/10.3233/SW-140134, https://madoc.bib.uni-mannheim.de/37476/</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Li, J., Sun, A., Han, R., Li, C.: A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering PP, 1–1 (2020). https://doi.org/10.1109/TKDE.2020.2981314</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Menezes, D.S., Savarese, P., Milidiú, R.L.: Building a massive corpus for named entity recognition using free open data sources (2019), http://arxiv.org/abs/1908.05758</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Ni, J., Dinu, G., Florian, R.: Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. CoRR abs/1707.02483 (2017), http://arxiv.org/abs/1707.02483</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Nothman, J., Ringland, N., Radford, W., Murphy, T., Curran, J.R.: Learning multilingual named entity recognition from Wikipedia. Artif. Intell. 194, 151–175 (2013), http://dblp.uni-trier.de/db/journals/ai/ai194.html#NothmanRRMC13</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Ramshaw, L., Marcus, M.: Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora (1995), https://www.aclweb.org/anthology/W95-0107</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Sekine, S.: The Definition of Sekine's Extended Named Entities. https://nlp.cs.nyu.edu/ene/version7_1_0Beng.html (2007), (Accessed on 28/02/2020)</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Sevc kova, M.,
          <string-name>
            <surname>Zabokrtsky</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kruza</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Named entities in Czech: annotating data and developing NE tagger</article-title>
          . In: International Conference on Text,
          <source>Speech and Dialogue</source>
          . pp.
          <volume>188</volume>
          {
          <fpage>195</fpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., Xue, N., Taylor, A., Kaufman, J., Franchini, M., El-Bachouti, M., Belvin, R., Houston, A.: OntoNotes Release 5.0 (2013). https://doi.org/11272.1/AB2/MKJJ2R, https://hdl.handle.net/11272.1/AB2/MKJJ2R</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Wikipedia: English Wikipedia | Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=English%20Wikipedia&amp;oldid=987449701 (2020), [Online; accessed 14-November-2020]</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. CoRR abs/1910.11470 (2019), http://arxiv.org/abs/1910.11470</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>