<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Simple ways to improve NER in every language using markup</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis Adrian Cabrera-Diego</string-name>
          <email>diego@univ-lr.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose G. Moreno</string-name>
          <email>jose.moreno@irit.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antoine Doucet</string-name>
          <email>antoine.doucet@univ-lr.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>La Rochelle Universite</institution>
          ,
          <addr-line>L3i, La Rochelle, 17031, France</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universite Paul Sabatier</institution>
          ,
          <addr-line>IRIT, Toulouse, 31062</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We explore three different methods for improving Named Entity Recognition (NER) systems based on BERT, each responding to one of three potential issues: the processing of uppercase tokens, the detection of entity boundaries and low generalization. Specifically, we first explore the marking of uppercase tokens for providing extra casing information. We then randomly mask tokens, as in a masked language model, and predict them along with the NER task to improve NER generalization. Finally, we predict entity boundaries to ameliorate named entity detection. The experiments were done over five languages, three of which are low-resourced: Slovene, Croatian, Finnish, English and Spanish. Results show that predicting masked tokens can be beneficial for most languages, while marking uppercase tokens can be a simple method for dealing with uppercase sentences in NER. Furthermore, our methods improved the state of the art for Croatian and Finnish.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>BERT</kwd>
        <kwd>multi-task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Named Entity Recognition (NER) is a fundamental task in the processing of
texts that consists of extracting entities that semantically refer to notions such
as locations, people and organizations [
        <xref ref-type="bibr" rid="ref19 ref32">19,32</xref>
        ]. In 2019, Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
presented the deep neural network model called Bidirectional Encoder
Representations from Transformers (BERT) and demonstrated that pre-trained models
based on BERT can be fine-tuned to achieve high performance in multiple tasks.
As a consequence, multiple BERT-based NER systems have been created in the
last couple of years [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In this work1, we explore three different aspects that
might play a role in the performance of NER systems:
      </p>
      <p>
        Uppercase words: Although it is uncommon to have texts that do not
follow standard casing rules, some NER datasets, such as CoNLL 2003 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and
CoNLL 2002 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] may contain a small percentage of sentences with uppercase words.
Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
1 github.com/EMBEDDIA/NER BERT Multitask
These sentences might be harder to predict by systems based on language
models where a BPE tokenizer is used, such as BERT or RoBERTa [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], because
uppercase versions of a word are not tokenized in the same way as their title
or lowercase versions [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. For instance, the words Italy and ITALY are split
by the BERTBASE tokenizer as [Italy] and [IT, ##AL, ##Y], respectively. If we ask
BERTBASE to predict the masked word in "I live in Rome, [MASK].", the top
prediction is Italy (50.5%), followed by too (8.9%), though (3.6%), Rome (3.0%)
and now (2.4%). Nonetheless, when predicting the masked word for the same phrase
but using uppercase words, i.e. "I LIVE IN ITALY, [MASK].", BERT proposes
the following words: too (4.0%), please (1.6%), then (1.1%), now (1.1%) and Mom
(0.7%). Moreover, if we mask only one subtoken of the word ITALY, BERT
produces top predictions such as ##E (20.7%), IS (18.6%) and AND (15.7%). The
reason that uppercase words are harder to process correctly is that different
BPE tokens have different dense representations and, in consequence, the
language model might not have enough knowledge about them [
        <xref ref-type="bibr" rid="ref22 ref24">24,22</xref>
        ]. Therefore,
it might be necessary to process uppercase words differently in NER systems.
      </p>
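      <p>
        The casing issue above can be reproduced with any cased subword vocabulary. The following sketch uses a small hand-made (hypothetical) vocabulary rather than BERT's real one, and shows how greedy longest-match-first WordPiece tokenization keeps Italy whole but fragments ITALY:
      </p>

```python
# Toy illustration of why cased subword vocabularies split uppercase words:
# a greedy longest-match-first WordPiece tokenizer over a hand-made vocabulary.
# The vocabulary below is hypothetical; the real BERT-base vocabulary behaves
# similarly for "Italy" vs. "ITALY".

def wordpiece(word, vocab):
    """Greedy longest-match-first subword tokenization, as in WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary: the title-case form is a single token,
# but only short uppercase fragments exist.
VOCAB = {"Italy", "IT", "##AL", "##Y"}

print(wordpiece("Italy", VOCAB))  # ['Italy']
print(wordpiece("ITALY", VOCAB))  # ['IT', '##AL', '##Y']
```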
      <p>
        Entity boundaries: Although the prediction of named entity boundaries is
associated with nested named entities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in Li et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors determined
that the prediction of boundaries in flat named entities in English (CoNLL 2003
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]) can be as high as 97.40. Therefore, we ask ourselves whether this
performance is always reached in all languages and datasets, or whether, in some
cases, the correct prediction of boundaries is a bottleneck for improving the
detection of named entities.
      </p>
      <p>
        Low generalization: One of the biggest challenges in NER systems is the
prediction of named entities that were never seen during the training or that
have weak or zero regularity, such as titles of books and movies [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In the last
year, there have been some interesting methods for increasing NER systems'
generalization, such as the manual creation of triggers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and the permutation
of named entities along with the reduction of context as in Lin et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In this
work, we explore another method that might improve generalization while,
at the same time, adapting the language model to the domain of the
dataset analyzed.
      </p>
      <p>We address these cases with three different approaches that could be used to
improve the performance of a NER system. First, we explore whether the
marking of uppercase tokens and the addition of supplementary casings can improve
the detection of named entities. Second, we determine whether training a named
entity boundary detector in a multi-task fashion could improve the
performance of a NER system. Finally, we investigate whether the masking
and prediction of tokens during training could increase the NER system's
generalization.</p>
      <p>
        Therefore, we present our experiments and conclusions on five different
datasets. Two of them are in high-resourced languages: English (CoNLL 2003 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]) and
Spanish (CoNLL 2002 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]), and three are from low-resourced languages: Croatian
(HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]), Slovene (SSJ500k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) and Finnish [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The obtained results
show an improvement over the state of the art for Croatian and Finnish,
while we obtain interesting results for the remaining languages. Notably, we can
observe a benefit both from marking uppercase tokens and from predicting masked
tokens during the training of the models.
      </p>
      <p>The rest of the paper is structured as follows. In Section 2, we present the
most relevant related work regarding NER systems for the languages explored
in this paper. Then, in Section 3, we introduce the methodology explored in this
work. The explored datasets and the experimental setup are described in
Section 4 and Section 5, respectively. The results and their discussion are presented
in Section 6. Finally, the conclusions and future work are presented in Section 7.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Recent multilingual NER systems have opted for BERT-based architectures.
For instance, Luoma et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] presented a new dataset in Finnish based on
the Universal Dependency Finnish corpus and evaluated it using different NER
systems from the state of the art, including FinBERT [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], a Finnish BERT.
      </p>
      <p>
        For Croatian and Slovene, the Janes Project [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed Janes-NER, an
NER system that uses a Conditional Random Fields (CRF) classifier, along with
lexica and Brown clusters; it is based on the work of Ljubesic and Erjavec [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. It
was trained and tested on HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and SSJ500k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] using 5 possible entity
types: Location, Person, Person-Derived, Organization and Miscellaneous. Both
languages have been evaluated2 using the Babushka-Bench3.
      </p>
      <p>
        The work of Ulcar and Robnik-Sikonja [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] presented CroSloEngual (CSE),
a multilingual BERT for Croatian, Slovene and English. The pre-trained model
was evaluated on NER using the datasets of HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and SSJ500k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; only
entities of type Location, Person and Organization were predicted.
      </p>
      <p>
        In Alves et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors evaluated two NER systems from the state
of the art: Polyglot [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the Croatian NERC System (CNERC) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] over the
corpus HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Only the entities of type Location, Person and Organization
were considered.
      </p>
      <p>
        Yu et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] used BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], FastText [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and character embeddings, with a
biaffine model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in a new NER system. Their results improved state-of-the-art
results in multiple datasets including Spanish CoNLL 2002 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        In Li et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors created BdryBot, a tool for detecting named
entity boundaries. It is based on multiple recursive neural networks, a pointer
mechanism and BERT. On English CoNLL 2003 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], they reached an F-score
of 0.974 on the prediction of entity boundaries. Comparing this value with the
current state of the art for the detection of named entities, 0.943 [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], suggests
that the detection of named entity boundaries is easier to achieve than the
prediction of their types.
2 github.com/clarinsi/janes-ner
3 github.com/clarinsi/babushka-bench
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>As indicated previously in Section 1, we present in this work an NER system
that, through three methods, seeks to reduce the effects of three issues:
uppercase words, entity boundaries as a bottleneck and low generalization. The
architecture of the proposed methodology is shown in Figure 1 and is
composed of four key elements: prediction of named entities, prediction of entity
boundaries, prediction of masked tokens and processing of uppercase tokens.
Each of these components is described below.</p>
      <p>
        The prediction of named entities is done through a linear layer that is
connected to the output generated by a BERT model, similar to the work
proposed by Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, to improve the correct annotation of
entities, we also add a CRF layer, as in Ma and Hovy [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>For the prediction of named entity boundaries, we make use of the same
architecture used for the prediction of named entities. However, the linear and CRF
layers focus on a reduced set of labels, which are only related to entity
boundaries. The objective of this component is to determine whether the prediction
of boundaries could improve the prediction of named entities, as the former is an
easier task than the latter.</p>
      <p>
        Regarding the prediction of masked tokens, the architecture follows the same
one proposed by Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for training a masked language model. This
consists of introducing the output of a BERT model into a linear layer, which
has the same size as the pre-trained vocabulary. The linear layer is expected
to predict the masked token. The component's goal is to force BERT to learn
patterns that could detect named entities even when a portion of the information
is hidden. At the same time, it fine-tunes BERT's embeddings to the domain of the
NER dataset.
      </p>
      <p>The prediction components, as shown in Figure 1, are coupled in a multi-task
fashion. This means that each of the previously mentioned components is
associated with a specific loss function, which produces values related to each task.
During training, the losses produced by all the tasks are summed. However,
at prediction time, as we are only interested in the prediction of named
entities, only the NER part is active.</p>
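      <p>
        The multi-task coupling described above can be sketched as follows. A random tensor stands in for BERT's output, plain cross-entropy replaces the Linear+CRF heads for brevity, and all dimensions and names are illustrative, not those of our implementation:
      </p>

```python
# Minimal multi-task sketch: one shared encoder output feeds three heads
# (NER tags, boundary tags, masked-token prediction) and the three task losses
# are summed during training. A random tensor stands in for BERT's last hidden
# states, and plain cross-entropy replaces the CRF layers for brevity.
import torch
import torch.nn as nn

hidden, n_ner, n_bnd, vocab = 32, 9, 5, 100
seq = torch.randn(1, 7, hidden)         # stand-in for BERT's output (batch, seq, dim)

ner_head = nn.Linear(hidden, n_ner)     # Linear (+CRF in the paper) for entities
bnd_head = nn.Linear(hidden, n_bnd)     # Linear (+CRF in the paper) for boundaries
mlm_head = nn.Linear(hidden, vocab)     # linear layer sized to the vocabulary

loss_fn = nn.CrossEntropyLoss()
ner_gold = torch.randint(0, n_ner, (1, 7))
bnd_gold = torch.randint(0, n_bnd, (1, 7))
mlm_gold = torch.randint(0, vocab, (1, 7))

# During training the three losses are summed; at prediction time
# only the NER head would be active.
loss = (loss_fn(ner_head(seq).view(-1, n_ner), ner_gold.view(-1))
        + loss_fn(bnd_head(seq).view(-1, n_bnd), bnd_gold.view(-1))
        + loss_fn(mlm_head(seq).view(-1, vocab), mlm_gold.view(-1)))
loss.backward()
```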
      <p>
        With respect to the tokenization of uppercase words, we decided to follow a
pre-processing method. In this case, we explore whether introducing additional
casings into the input sentence could bring additional information to BERT
regarding the correct context in which a named entity occurs. To do this, we follow
an approach similar to the one of Baldini Soares et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], in which entity
markers are used to focus BERT on specific information. In the context of uppercase
words, we add to BERT's vocabulary two special tokens, [UP] and [up], which mark
the occurrence of an uppercase word. Between these special tokens, we introduce
the tokens produced by the original uppercase word, but also the tokens obtained
from the word in title-case and lowercase. For instance, in Figure 1, we can observe
that two words are in uppercase, i.e. ROME and ITALY, thus we change their
respective tokens to a marked and enriched version.
      </p>
      <p>[Figure 1: Architecture of the proposed method. A pre-trained BERT model feeds three output heads: Linear+CRF layers predicting named entity tags (e.g. S-LOC O S-LOC O B-PER E-PER O O), Linear+CRF layers predicting generic boundary tags (e.g. S-X O O B-X E-X O O) and a linear layer predicting masked tokens (e.g. Minister). The input "ROME , ITALY . Giuseppe Conte , Prime Minister..." is tokenized as [ROM, ##E] [,] [IT, ##AL, ##Y] [.] [Giuseppe] [Con, ##te] , [Prime] [MASK]..., and the uppercase tokens are modified to [UP] [ROM, ##E] [Rome] [r, ##ome] [up] and [UP] [IT, ##AL, ##Y] [Italy] [it, ##aly] [up].]</p>
      <p>
        In other words, the tokens
associated to ROME become "[UP] [ROM, ##E] [Rome] [r, ##ome] [up]".4 It
should be indicated that the prediction of named entities (or boundaries) is
done uniquely over the first token, which corresponds to [UP]. This approach is
similar to the one used by Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for words split into multiple
subtokens by BERT's tokenizer, but also by Baldini Soares et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], regarding the
use of entity markers.
      </p>
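      <p>
        The uppercase marking can be sketched as a small pre-processing function. The tokenizer argument below is a stand-in for BERT's WordPiece tokenizer, and leaving single-letter words unmarked is our assumption for illustration:
      </p>

```python
# Sketch of the uppercase-marking pre-processing described above: each fully
# uppercase word is replaced by [UP] <uppercase pieces> <title-case pieces>
# <lowercase pieces> [up]. The `tokenize` callable stands in for BERT's
# WordPiece tokenizer, which would split each casing variant into sub-tokens.

def mark_uppercase(words, tokenize):
    out = []
    for word in words:
        if word.isupper() and len(word) > 1:  # single letters left unmarked here
            out.append("[UP]")
            out.extend(tokenize(word))          # original uppercase form
            out.extend(tokenize(word.title()))  # title-case variant
            out.extend(tokenize(word.lower()))  # lowercase variant
            out.append("[up]")
        else:
            out.extend(tokenize(word))
    return out

# Identity "tokenizer" for illustration only.
tokens = mark_uppercase(["ITALY", "is", "nice"], tokenize=lambda w: [w])
print(tokens)  # ['[UP]', 'ITALY', 'Italy', 'italy', '[up]', 'is', 'nice']
```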
    </sec>
    <sec id="sec-4">
      <title>Datasets</title>
      <p>
        For English, we use the collection proposed in CoNLL 2003 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and for Spanish
the dataset created in CoNLL 2002 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Both corpora have been annotated using
4 types of named entities: Location, Person, Organization and Miscellaneous.
      </p>
      <p>
        For Croatian and Slovene, we use the corpus HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and SSJ500k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
respectively. According to their respective authors, both corpora have been
annotated with 5 types of named entities: Location, Person, Person-derived,
Organization and Miscellaneous. However, in the case of HR500k, we did not find
entries tagged with the Miscellaneous type, as was also the case in [
        <xref ref-type="bibr" rid="ref2 ref27">2,27</xref>
        ].
Following some previous works [
        <xref ref-type="bibr" rid="ref2 ref27">2,27</xref>
        ], we removed the type Person-derived, as it is
the least frequent type in both corpora. It should be indicated that we use the
training and testing partitions provided by Ulcar and Robnik-Sikonja [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
However, the training partitions were split into 90% training and 10% development
using a stratified strategy in order to use an early stop approach.
      </p>
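      <p>
        The 90/10 stratified split can be sketched as follows. The stratification key we used is not detailed here, so grouping sentences by whether they contain any entity tag is an assumption made for illustration:
      </p>

```python
# Minimal sketch of a 90/10 stratified split of the training sentences.
# Stratifying on "contains an entity or not" is an illustrative assumption.
import random

def stratified_split(sentences, dev_ratio=0.1, seed=0):
    """sentences: list of (tokens, tags) pairs. Returns (train, dev)."""
    rng = random.Random(seed)
    buckets = {}
    for sent in sentences:
        key = any(tag != "O" for tag in sent[1])  # stratum: has entity or not
        buckets.setdefault(key, []).append(sent)
    train, dev = [], []
    for group in buckets.values():
        rng.shuffle(group)
        cut = int(len(group) * dev_ratio)  # same dev ratio within each stratum
        dev.extend(group[:cut])
        train.extend(group[cut:])
    return train, dev

data = [(["a"], ["O"])] * 80 + [(["b"], ["S-LOC"])] * 20
train, dev = stratified_split(data)
print(len(train), len(dev))  # 90 10
```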
      <p>
        Regarding the Finnish language, we have used the corpus proposed by Luoma
et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This corpus has been annotated using 6 different types of named
entities: Location, Date, Person, Event, Organization and Product.
4 Neither the square brackets nor the commas surrounding the non-special tokens are
in the original representation. However, they are shown in our examples to illustrate
the different sub-tokens produced for each word.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Experimental Setup</title>
      <p>
        The NER systems explored in this article are based on BERT, using PyTorch,
HuggingFace's Transformers [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and different pre-trained BERT models: for
English we make use of BERTBASE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; for Finnish, FinBERT [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]; for Slovene
and Croatian, CroSloEngual [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and for Spanish we use BETO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>All the named entity tags are encoded using BIOES (Beginning, Inside,
Outside/Other, End, Single). For the detection of boundaries, we convert the named
entity tags into a generic BIOES encoding; in other words, we use a generic entity
type, e.g. B-X, I-X, E-X.</p>
      <p>
        For each language, we train 12 different models. The first model, i.e. the
baseline, is the implementation that only consists of BERT+Linear+CRF. The remaining
11 models are the different combinations of the approaches described in Section 3
when added to our baseline. Based on the recommendation proposed by Mosbach
et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], every model is trained for up to 20 epochs using an early stop approach
and AdamW with bias correction, along with an epsilon of 1e-8. The early
stop is based on the micro F-score and loss of the development dataset.
      </p>
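      <p>
        The early stop criterion can be sketched as follows. How the micro F-score and the loss are combined is not detailed above, so tracking the development F-score with the loss as a tie-breaker is an assumption:
      </p>

```python
# Sketch of an early-stop criterion over the development set: stop once the
# micro F-score has not improved for `patience` epochs, using the loss as a
# tie-breaker when F-scores are equal (an illustrative assumption).

class EarlyStop:
    def __init__(self, patience=3):
        self.patience = patience
        self.best = (-1.0, float("inf"))  # (best dev micro F-score, its loss)
        self.bad_epochs = 0

    def step(self, f_score, loss):
        """Call once per epoch; returns True when training should stop."""
        if f_score > self.best[0] or (f_score == self.best[0] and loss < self.best[1]):
            self.best = (f_score, loss)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStop(patience=2)
for f, l in [(0.80, 1.0), (0.82, 0.9), (0.81, 0.95), (0.81, 0.97)]:
    if stopper.step(f, l):
        print("stop")  # triggered after two epochs without improvement
```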
      <p>With respect to the masking of tokens, we only affect the sentences in the
training partitions that are longer than three tokens5. At each epoch, we
randomly select 25% of each sentence's tokens and substitute them with [MASK]. If
a token, after being processed by BERT's tokenizer, produces more than one
BERT token, we randomly select one of them to mask. For instance, in the case
of the last name Conte, see Figure 1, one of the sub-tokens would be masked.</p>
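      <p>
        The masking step can be sketched as follows, with each original word represented by its list of BERT sub-tokens (names and the in-place update are illustrative):
      </p>

```python
# Sketch of the masking step: for each sentence longer than three words, 25% of
# the words are chosen at random; if a chosen word was split into several
# sub-tokens, only one randomly chosen sub-token is replaced by [MASK].
import random

def mask_sentence(subtoken_lists, ratio=0.25, rng=random):
    """subtoken_lists: one list of BERT sub-tokens per original word (modified in place)."""
    if len(subtoken_lists) <= 3:
        return subtoken_lists  # short sentences are left untouched
    n = max(1, int(len(subtoken_lists) * ratio))
    for i in rng.sample(range(len(subtoken_lists)), n):
        pieces = subtoken_lists[i]
        pieces[rng.randrange(len(pieces))] = "[MASK]"  # mask one sub-token only
    return subtoken_lists

rng = random.Random(0)
sent = [["Giuseppe"], ["Con", "##te"], [","], ["Prime"], ["Minister"]]
masked = mask_sentence(sent, rng=rng)
```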
      <p>
        In Table 1, we present a summary of the hyperparameters used for training
the NER system. As can be seen, all the parameters, except for the number
of epochs and the optimizer, follow those used in Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        It should be noted that unlike other works, such as [
        <xref ref-type="bibr" rid="ref19 ref28 ref8">8,28,19</xref>
        ], where BERT's
input was enriched either with surrounding sentences or document context, our
models take as input only the sentence that needs to be analyzed. Moreover,
and in contrast with some BERT implementations, inputs surpassing BERT's
token window size are split instead of truncated.6 The splitting consists of
generating a new input sentence with the remaining tokens; during prediction, the
tokens are aligned to match the original input.
      </p>
      <p>
        We evaluate the output of the NER system using Seqeval 7. With respect to
the assessment of boundaries, we use Nervaluate8. This evaluation tool provides
exact, a metric which determines how well the boundaries of the predicted named
entities match those found in the gold standard, regardless of the type.
5 In this case, we refer to the actual definition of tokens, rather than those obtained
by BERT's tokenizer.
6 Some implementations disregard the tokens surpassing the token window or
consider these as the type Other.
7 github.com/chakki-works/seqeval
8 github.com/ivyleavedtoadflax/nervaluate/
      </p>
    </sec>
    <sec id="sec-6">
      <title>Results and Discussion</title>
      <p>
        We present in Table 2 and Table 3 the average and maximum results of five
iterations, in terms of micro and macro F-score, regarding the prediction of
named entities for the different combinations of systems proposed in this work.
We also present results from the state of the art (only a selection of it in the
case of English, where the list could be very long). It should be noted that
the scores from the state of the art presented in Table 2 and Table 3 are not the
product of multiple iterations, except for BERTBASE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As well, the evaluation
of Janes-NER using the Babushka-Bench, see Table 2, does not consider errors in
boundaries, and calculates the macro F-score using the performance of 5 named
entity types plus the score obtained when predicting the Other type. Moreover, for
Croatian and Slovene, Table 2, the number of named entity types predicted by
each system might not be the same. This last issue will be discussed in detail
further ahead.
      </p>
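      <p>
        The exact metric can be sketched as a span-level F-score that ignores entity types. This is our minimal reading of what Nervaluate's exact evaluation computes, not its actual implementation:
      </p>

```python
# Sketch of the "exact" boundary metric: F-score over predicted entity spans,
# where a span counts as correct if its boundaries match a gold span exactly,
# regardless of the entity type.

def exact_f1(gold_spans, pred_spans):
    """Spans are (start, end) pairs; entity types are deliberately ignored."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    if not gold or not pred or tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# One boundary error out of two entities on both sides gives an F-score of 0.5.
print(exact_f1([(0, 2), (5, 6)], [(0, 2), (4, 6)]))  # 0.5
```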
      <p>
        We can observe, in Table 2 and Table 3, that the prediction of masked
tokens can improve, on average, the performance of the explored methods; the
exception is Finnish, where the performance decreases. In fact, we can observe
that for Finnish, the difference between the maximum macro F-score and the
average is larger, meaning that the model becomes less stable when predicting
masked tokens. Nonetheless, we can achieve better micro F-score values for
Finnish. Furthermore, by masking tokens, whether combined with other features or
not, we can improve, on average, the performance on Spanish CoNLL 2002. Nonetheless,
our maximum score does not reach the performance presented by Yu et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ].
Although for English CoNLL 2003 we are still far from the current state of the
art, and slightly worse than BERTBASE, it should be noted that we do not use any
kind of document context as Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] did. This might be a signal that
we are forcing BERT to generalize better and to deal with smaller contexts.
      </p>
      <p>The reasons why the masking of tokens affected the stability of the less
frequent named entities in Finnish are unclear. It could be the case that we did not
mask enough tokens to force BERT to generalize in certain iterations. Or it might
be related to the agglutinative nature of Finnish, in which the masking of
tokens severely affects key elements of the language needed to predict named
entities correctly.</p>
      <p>
        In Table 4, we present the results regarding the exact metric in terms of micro
and macro F-score for each language. This metric evaluates how well a system
predicts the boundaries of named entities regardless of the associated type. For
Croatian, English and Spanish, we can observe in Table 4 that the prediction of
entity boundaries is quite stable in general, both in terms of micro and macro
F-score. It should be indicated that the state-of-the-art micro F-score for English
CoNLL 2003 regarding the detection of boundaries is the following: BdryBot
95.90, BERT 96.90 and BdryBot+BERT 97.40 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Regarding Slovene, we can
notice that the prediction of masked tokens improves the exact metric; however,
training a model where we additionally predict the boundaries seems to have no
effect in general. For Finnish, the exact metric shows a stable performance in
terms of micro F-score; nonetheless, in terms of macro F-score, we can observe a
decrease when we predict masked tokens.
      </p>
      <p>Another element to discuss regarding Table 4 is that for languages such as
English and Spanish, recognizing named entity boundaries can be considered
an easy task. Nevertheless, for Croatian, and to a lesser degree Slovene and
Finnish, it is much more difficult. Moreover, for Finnish the prediction of entity
boundaries is harder to achieve for the less frequent types of named entities, as
shown by the larger difference between the micro and macro F-scores. Furthermore,
the results obtained in Table 4 give us an idea of what could be the maximum
possible score that a NER system could achieve if the task only consisted of
predicting the types of confirmed named entity boundaries. In the same line, the
results show that despite the fact that it is easy to find named entity boundaries
[Table 3 caption: Results over five iterations on Spanish and English; the maximum of
each iteration is between brackets. The best performance is in bold. Boundaries (B.),
Uppercase (U.), Generated Uppercase (G.U.) and Masked (M.)]</p>
      <p>[Table 2/Table 3 row labels (scores not recoverable from the extraction): BERT Base [8], LUKE [31], Seq2seq+BERT [23], NER Dep.Par. [32], Baseline, B., U., G.U., B.+U., B.+G.U., M., M.+B., M.+U., M.+G.U., M.+B.+U., M.+B.+G.U.; Spanish]</p>
      <p>
        [Table 4 caption: English. The exact metric evaluates the correct prediction of entity
boundaries regardless of their type, over five iterations. Boundaries (B.), Uppercase (U.),
Generated Uppercase (G.U.) and Masked (M.)]
in Spanish, predicting their type is much more difficult for that language,
compared to English or Croatian, if we compare the results shown in Table 2 and
Table 4. For Croatian, where the detection of boundaries is more
difficult, the prediction of their types lies in a range of around three points, while
in Spanish it is around seven points. Therefore, we can deduce that, in order to
improve the prediction of named entities in languages such as Croatian, it
is necessary to primarily focus on the correct detection of boundaries. Further,
for Spanish, it is necessary to improve the prediction of types rather than
entity boundaries. However, this last issue could also be a sign of discrepancies in
the annotation, either of the training or the testing dataset, something that is
already known to occur in the English CoNLL 2003 corpus [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>With respect to the marking of uppercase tokens, we can notice in Table 2 and
Table 3 that we can improve the performance mainly in English and Croatian.
(a) English: Soccer - Leeds' Bowyer fined for part in fast-food fracas.
Tokens KENIAN ANGLIKAANISEN KIRKON SIHTEERI JOHN KAGO SANOI
Baseline O O O O B-PER E-PER O
M.+U. O B-ORG I-ORG E-ORG O O O
M.+G.U. S-LOC B-ORG E-ORG O B-PER E-PER O
Gold-Std. B-ORG E-ORG E-ORG O B-PER E-PER O
(b) Finnish: John Kago, secretary of the Anglican Church of Kenya, said Thursday
Tokens PROFESOR NA USLA STEPHEN HUBBELL
Baseline O O O B-PER E-PER
U. O O O B-PER E-PER
G.U. O O S-ORG B-PER E-PER
Gold-Std. O O S-ORG B-PER E-PER
(c) Croatian: Professor of USLA Stephen Hubbell
Tokens EL TRIBUNAL DE DEFENSA DE LA COMPETENCIA DETERMINE
Baseline O O O O O O O O
M.+U. O S-ORG O O O O O O
G.U. O B-ORG I-ORG I-ORG I-ORG I-ORG E-ORG O
Gold-Std. O B-ORG I-ORG I-ORG I-ORG I-ORG E-ORG O
(d) Spanish: The Competition Defense Court determines
However, by generating random uppercase sentences during training, the marking
of uppercase tokens can improve the performance in all the languages. In most
cases, this happens as well when applied along with other methods, such as the
prediction of masked tokens, especially in Slovene and English.</p>
      <p>Although all the datasets contain a variable number of words only in
uppercase, there are two possible reasons why some languages benefited more than
others. First, it can be the case that the number of uppercase tokens was not
large enough to make BERT learn about the marking. Second, it can be related
to the textual information that was used to train each BERT model.
Nonetheless, BERT is indeed capable of learning the meaning and context of uppercase
tokens if enough data has been used during training, as happened when we
artificially generated uppercase sentences.</p>
      <p>In Figure 5, we present four examples regarding the prediction of named
entities in uppercase sentences; the selected models are the best of each type.
As shown in Figure 5b, the prediction of entities does not become perfect when
marking uppercase words, but it can definitely improve their recognition.
As indicated previously, the evaluation of NER systems over the Croatian
and Slovene datasets is not standard across the state-of-the-art systems. The
main reason is that some named entity types are either not found in the corpus
or are disregarded due to their small frequency. Therefore, we present in Table 6
the recalculation of the macro F-scores. These scores are based on the three
common types of named entities used in the different NER systems from the
state of the art.</p>
      <p>
        With respect to Croatian, we can observe in Table 2 and Table 6 that we
can improve the results with respect to CroSloEngual, which is also based on
BERT. Furthermore, our largest improvement, with respect to Janes-NER [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
is for the prediction of named entities of type Location and Organization. For
Slovene, we are not able to surpass the performance shown in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>Although it is common in the state of the art to present results for the Slovene
and Croatian corpora only in terms of macro F-score, the lack of micro F-scores
or detailed results per class makes it difficult to perform a detailed comparison
of the systems. The Croatian and Slovene corpora are not balanced; in
other words, the number of entities per class is not equal. The
macro F-score considers all types of named entities equally important and
disregards their frequency in the dataset. Thus, it is impossible to know whether
systems such as CroSloEngual, Polyglot or Croatian NERC are focusing
on the most frequent classes or on the less frequent ones. For instance, we know
that our Croatian NER system focuses on the less frequent class Location (117
occurrences) rather than on the most frequent ones (Person and Organization,
with 228 and 365 occurrences, respectively). In contrast, our Slovene NER system
focuses on the most frequent classes, Person and Location (with 257
and 210 occurrences, respectively), rather than on the less frequent ones, Organization (112
occurrences) and Miscellaneous (47 occurrences).</p>
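The effect of class imbalance on macro versus micro F-scores can be illustrated with a small, made-up example (the counts below are hypothetical and are not taken from our datasets):

```python
def f1(tp, fp, fn):
    """Standard F1 from true positive, false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts: one frequent class predicted well, one rare class poorly.
counts = {
    "PER": dict(tp=200, fp=20, fn=28),   # frequent, well predicted
    "LOC": dict(tp=10,  fp=30, fn=107),  # rare, poorly predicted
}

# Macro: average of per-class F1 -> every class weighs the same.
macro = sum(f1(**c) for c in counts.values()) / len(counts)

# Micro: F1 over pooled counts -> frequent classes dominate.
micro = f1(sum(c["tp"] for c in counts.values()),
           sum(c["fp"] for c in counts.values()),
           sum(c["fn"] for c in counts.values()))
```

Here the micro F-score stays high because the frequent class is handled well, while the macro F-score is dragged down by the rare class, which is exactly why reporting only one of the two hides where a system focuses.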
      <p>
        Despite not surpassing the current state of the art in Slovene, it should be
noted that we trained a system with four types of entities rather than the three
used in the work of Ulcar and Robnik-Sikonja [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. This can
introduce noise or increase the difficulty of the task, as the left-out named entity
type, Miscellaneous, is the least frequent one. In this case, if we compare with
Janes-NER [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], this system obtains an F-score of 27.00 for Miscellaneous, while
our masked baseline reaches up to 85.42.
      </p>
      <p>
        Finally, with respect to Finnish, we were able to surpass, on average, the
performance of the state of the art in terms of macro F-score, and in some
iterations also in terms of micro F-score. Based on the difference between micro and macro
F-scores, presented in Table 2, we can determine that several of our systems
focused slightly more on the less frequent classes, in comparison to the work of
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This can be observed in detail in Table 7, where we improved the prediction
of entities of type Event and Product, the least frequent classes (7 and
79 instances in the test set, respectively), at the cost of fewer correct predictions for a
more frequent class, i.e. Organization (208). We can also observe that macro
F-scores can vary more than micro F-scores, depending on the seed
used during training. Nevertheless, we can improve
the prediction of entities without having to add supplementary sentences to
increase the context, as Luoma et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] did.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and future work</title>
      <p>
        Named Entity Recognition (NER) is a task that aims to extract and classify
groups of tokens referring to specific types such as locations, persons and
organizations. In the last couple of years, since the creation of BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], multiple NER
systems have made use of its architecture to provide high-performing tools.
Nonetheless, we observed that this kind of system can face some issues, such as
poor prediction on uppercase sentences, wrong detection of entity boundaries
and low generalization.
      </p>
      <p>Therefore, in this work, we presented three different methods that could
alleviate these issues. Experiments were carried out on five languages, three of them
low-resourced. We improved the state of the art with a micro F-score of
up to 89.54 in Croatian by marking uppercase tokens and generating uppercase
sentences during training. By marking uppercase tokens and predicting boundaries
and tokens, we managed to improve the performance of BERTBASE to an
F-score of up to 92.62 in English, while obtaining the second-best performance in
Spanish with an F-score of up to 89.56. In Finnish, we improved, on average, the
prediction of the less frequent named entity types, with a macro F-score of 82.41
versus 81.00 in the state of the art, while reaching a micro F-score of up to 92.09
versus 91.60. We also provide a NER system for Slovene that predicts four
types of named entities, one of which is infrequent, with results comparable to
those of another state-of-the-art tool that only predicts the three most
frequent types.</p>
      <p>Furthermore, we observed that in Croatian, the prediction of named entity
boundaries is a bottleneck for NER systems, while in Spanish it seems
easy to find the boundaries of named entities but much harder to determine their
type. Finally, we proposed a simple method that can improve the prediction of
named entities in sentences written in uppercase.</p>
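As a sketch of what type-agnostic boundary prediction looks like, auxiliary targets can be derived from BIOES entity tags by dropping the entity type; this is an illustration of the idea, and the exact auxiliary labels used in the experiments may differ:

```python
def boundary_labels(tags):
    """Map BIOES entity tags (e.g. 'B-ORG', 'S-PER') to type-agnostic
    boundary tags ('B', 'I', 'E', 'S'), keeping 'O' unchanged. Predicting
    these alongside the full tags lets the model learn where entities start
    and end independently of their type."""
    out = []
    for t in tags:
        if t == "O":
            out.append("O")
        else:
            out.append(t.split("-")[0])  # keep only the positional prefix
    return out
```

In a multi-task setup, a second classification head would be trained on these boundary labels jointly with the main NER head.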
      <p>In the future, we intend to experiment with additional languages. We would
like to assess whether adding some context to the left of the split sentences
could improve NER performance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the European Union's Horizon 2020 research and
innovation program under grants 770299 (NewsEye) and 825153 (Embeddia).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Al-Rfou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perozzi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skiena</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>POLYGLOT-NER: Massive Multilingual Named Entity Recognition</article-title>
          .
          <source>CoRR abs/1410</source>
          .3791 (
          <year>2014</year>
          ), http://arxiv.org/abs/1410.3791
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thakkar</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tadic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Evaluating Language Tools for Fifteen EU-official Under-resourced Languages</article-title>
          .
          <source>In: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          . pp.
          <year>1866</year>
          –
          <year>1873</year>
          . European Language Resources Association, Marseille, France (May
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Baldini</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>FitzGerald</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwiatkowski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Matching the Blanks: Distributional Similarity for Relation Learning</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <volume>2895</volume>
          –
          <fpage>2905</fpage>
          . Association for Computational Linguistics, Florence,
          <source>Italy (Jul</source>
          <year>2019</year>
          ). https://doi.org/10.18653/v1/
          <fpage>P19</fpage>
          -1279
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bekavac</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tadic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Implementation of Croatian NERC System</article-title>
          .
          <source>In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing</source>
          . pp.
          <volume>11</volume>
          –
          <fpage>18</fpage>
          . Association for Computational Linguistics, Prague, Czech Republic (
          <year>Jun 2007</year>
          ), https://www.aclweb.org/anthology/W07-1702
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          .
          <source>Transactions of the Association of Computational Linguistics</source>
          <volume>5</volume>
          ,
          <issue>135</issue>
          –
          <fpage>146</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>R.C.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Incorporating Boundary and Category Feature for Nested Named Entity Recognition</article-title>
          . In: Nah,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.W.</given-names>
            ,
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.X.</given-names>
            ,
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.S.</given-names>
            ,
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.E</surname>
          </string-name>
          . (eds.)
          <article-title>Database Systems for Advanced Applications</article-title>
          . pp.
          <volume>209</volume>
          –
          <fpage>226</fpage>
          . Springer International Publishing,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Cañete, J.,
          <string-name>
            <surname>Chaperon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuentes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>J.: Spanish</given-names>
          </string-name>
          <string-name>
            <surname>Pre-Trained BERT</surname>
          </string-name>
          Model and
          <article-title>Evaluation Data</article-title>
          .
          <source>In: PML4DC at ICLR</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <volume>4171</volume>
          –
          <fpage>4186</fpage>
          . Association for Computational Linguistics, Minneapolis,
          <source>Minnesota (Jun</source>
          <year>2019</year>
          ). https://doi.org/10.18653/v1/
          <fpage>N19</fpage>
          -1423
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Dozat</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Deep Biaffine Attention for Neural Dependency Parsing</article-title>
          .
          <source>CoRR abs/1611</source>
          .01734 (
          <year>2016</year>
          ), http://arxiv.org/abs/1611.01734
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Fiser</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubesic</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erjavec</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The Janes project: language resources and tools for Slovene user generated content</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>54</volume>
          (
          <issue>1</issue>
          ),
          <volume>223</volume>
          –246 (Mar
          <year>2020</year>
          ). https://doi.org/10.1007/s10579-018-9425-z
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Krek</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobrovoljc</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erjavec</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ledinek</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holz</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zupan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gantar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuzman</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cibej</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Arhar</given-names>
            <surname>Holdt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Kavcic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Skrjanec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Marko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Jezersek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Zajc</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <source>Training corpus ssj500k 2</source>
          .
          <issue>2</issue>
          (
          <issue>2019</issue>
          ), http://hdl.handle.net/11356/1210,
          <article-title>Slovenian language resource repository CLARIN</article-title>
          .SI
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            , A., Han,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A Survey on Deep Learning for Named Entity Recognition</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data</source>
          Engineering pp.
          <volume>1</volume>
          –
          <issue>1</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
          </string-name>
          , Y.:
          <article-title>Neural Named Entity Boundary Detection</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data</source>
          Engineering pp.
          <volume>1</volume>
          –
          <issue>1</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>B.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shiralkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <volume>8503</volume>
          –
          <fpage>8511</fpage>
          . Association for Computational Linguistics,
          <source>Online (Jul</source>
          <year>2020</year>
          ). https://doi.org/10.18653/v1/
          <year>2020</year>
          .acl-main.
          <fpage>752</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            , J., Han,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>N.J.:</given-names>
          </string-name>
          <article-title>A Rigorous Study on Named Entity Recognition: Can Fine-tuning Pretrained Model Lead to the Promised Land?</article-title>
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <volume>7291</volume>
          –
          <fpage>7300</fpage>
          . Association for Computational Linguistics,
          <source>Online (Nov</source>
          <year>2020</year>
          ). https://doi.org/10.18653/v1/
          <year>2020</year>
          .emnlp-main.592
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Roberta: A robustly optimized bert pretraining approach (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ljubesic</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erjavec</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Corpus vs</article-title>
          .
          <source>Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          . pp.
          <volume>1527</volume>
          –
          <fpage>1531</fpage>
          .
          <string-name>
            <surname>European Language Resources Association</surname>
          </string-name>
          (ELRA), Portoroz, Slovenia (May
          <year>2016</year>
          ), https://www.aclweb.org/anthology/L16-1242
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ljubesic</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klubicka</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agic</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jazbec</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          :
          <article-title>New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian</article-title>
          .
          <source>In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          . pp.
          <volume>4264</volume>
          –
          <fpage>4270</fpage>
          .
          <string-name>
            <surname>European Language Resources Association</surname>
          </string-name>
          (ELRA), Portoroz, Slovenia (May
          <year>2016</year>
          ), https://www.aclweb.org/anthology/L16-1676
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Luoma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oinonen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Pyykonen,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Laippala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Pyysalo</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>A Broadcoverage Corpus for Finnish Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          . pp.
          <volume>4615</volume>
          –
          <fpage>4624</fpage>
          .
          <string-name>
            <surname>European Language Resources Association</surname>
          </string-name>
          , Marseille, France (May
          <year>2020</year>
          ), https://www.aclweb.org/anthology/2020.lrec-
          <volume>1</volume>
          .
          <fpage>567</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
          </string-name>
          , E.:
          <article-title>End-to-end Sequence Labeling via Bi-directional LSTMCNNs-CRF</article-title>
          .
          <article-title>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          . pp.
          <volume>1064</volume>
          –
          <fpage>1074</fpage>
          . Association for Computational Linguistics, Berlin, Germany (Aug
          <year>2016</year>
          ). https://doi.org/10.18653/v1/
          <fpage>P16</fpage>
          -1101
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Mosbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andriushchenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klakow</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines</article-title>
          (
          <year>2020</year>
          ), https://arxiv.org/abs/2006.04884
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Powalski</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanislawek</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>UniCase &#8211; Rethinking Casing in Language Models</article-title>
          (
          <year>2020</year>
          ), arXiv cs.CL eprint: 2010.11936
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Straková</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajič</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Neural Architectures for Nested NER through Linearization</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>5326</fpage>
          &#8211;
          <lpage>5331</lpage>
          . Association for Computational Linguistics, Florence, Italy (Jul
          <year>2019</year>
          ). https://doi.org/10.18653/v1/P19-1527
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hashimoto</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Adv-BERT: BERT is not robust on misspellings! Generating natural adversarial samples on BERT</article-title>
          . arXiv preprint arXiv:2003.04985 (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Tjong Kim Sang</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          :
          <article-title>Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition</article-title>
          .
          <source>In: COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)</source>
          (
          <year>2002</year>
          ), https://www.aclweb.org/anthology/W02-2024
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Tjong Kim Sang</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meulder</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003</source>
          . pp.
          <fpage>142</fpage>
          &#8211;
          <lpage>147</lpage>
          (
          <year>2003</year>
          ), https://www.aclweb.org/anthology/W03-0419
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Ulčar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robnik-Šikonja</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>FinEst BERT and CroSloEngual BERT</article-title>
          . In:
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kopeček</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pala</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horák</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.) Text, Speech, and Dialogue. pp.
          <fpage>104</fpage>
          &#8211;
          <lpage>111</lpage>
          . Springer International Publishing, Cham (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Virtanen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanerva</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luoma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luotolahti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakoski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyysalo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Multilingual is not enough: BERT for Finnish</article-title>
          (
          <year>2019</year>
          ), https://arxiv.org/abs/1912.07076
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>CrossWeigh: Training Named Entity Tagger from Imperfect Annotations</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          . pp.
          <fpage>5154</fpage>
          &#8211;
          <lpage>5163</lpage>
          . Association for Computational Linguistics, Hong Kong, China (Nov
          <year>2019</year>
          ). https://doi.org/10.18653/v1/D19-1519
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delangue</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cistac</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rault</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Louf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funtowicz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davison</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shleifer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Platen</surname>
            ,
            <given-names>P.v.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jernite</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scao</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gugger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drame</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lhoest</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rush</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          . ArXiv abs/1910.03771 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Yamada</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shindo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeda</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>6442</fpage>
          &#8211;
          <lpage>6454</lpage>
          . Association for Computational Linguistics, Online (Nov
          <year>2020</year>
          ). https://doi.org/10.18653/v1/2020.emnlp-main.523
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bohnet</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poesio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Named Entity Recognition as Dependency Parsing</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>6470</fpage>
          &#8211;
          <lpage>6476</lpage>
          . Association for Computational Linguistics, Online (Jul
          <year>2020</year>
          ). https://doi.org/10.18653/v1/2020.acl-main.577
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>