<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Simple ways to improve NER in every language using markup</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis Adrian Cabrera-Diego</string-name>
          <email>diego@univ-lr.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose G. Moreno</string-name>
          <email>jose.moreno@irit.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antoine Doucet</string-name>
          <email>antoine.doucet@univ-lr.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>La Rochelle Universite</institution>
          ,
          <addr-line>L3i, La Rochelle, 17031, France</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universite Paul Sabatier</institution>
          ,
          <addr-line>IRIT, Toulouse, 31062</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We explore three different methods for improving Named Entity Recognition (NER) systems based on BERT, each responding to one of three potential issues: the processing of uppercase tokens, the detection of entity boundaries and low generalization. Specifically, we first explore the marking of uppercase tokens for providing extra casing information. We then randomly mask tokens, as in a masked language model, and predict them along with the NER task to improve NER generalization. Finally, we predict entity boundaries to ameliorate named entity detection. The experiments were done over five languages, three of which are low-resourced: Slovene, Croatian, Finnish, English and Spanish. Results show that predicting masked tokens can be beneficial for most languages, while marking uppercase tokens can be a simple method for dealing with uppercase sentences in NER. Furthermore, our methods improved the state of the art for Croatian and Finnish.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>BERT</kwd>
        <kwd>multi-task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Named Entity Recognition (NER) is a fundamental task in the processing of
texts that consists of extracting entities that semantically refer to notions such
as locations, people and organizations [
        <xref ref-type="bibr" rid="ref19 ref32">19,32</xref>
        ]. In 2019, Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
presented the deep neural network model called Bidirectional Encoder
Representations from Transformers (BERT) and demonstrated that pre-trained models
based on BERT can be fine-tuned to achieve high performance in multiple tasks.
As a consequence, multiple BERT-based NER systems have been created in the
last couple of years [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In this work1, we explore three different aspects that
might play a role in the performance of NER systems:
      </p>
      <p>
        Uppercase words: Although it is uncommon to have texts that do not
follow standard casing rules, some NER datasets, such as CoNLL 2003 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and
CoNLL 2002 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] may contain a small percentage of sentences with uppercase words.
Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
1 github.com/EMBEDDIA/NER BERT Multitask
These sentences might be harder to predict by systems based on language
models where a BPE tokenizer is used, such as BERT or RoBERTa [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], because
uppercase versions of a word are not tokenized in the same way as their title
or lowercase versions [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. For instance, the words Italy and ITALY are split
by the BERTBASE tokenizer as [Italy] and [IT, ##AL, ##Y], respectively. If we ask
BERTBASE to predict the masked word in "I live in Rome, [MASK].", the top
prediction is Italy (50.5%), followed by too (8.9%), though (3.6%), Rome (3.0%)
and now (2.4%). Nonetheless, when predicting the masked word for the same phrase
but using uppercase words, i.e. "I LIVE IN ITALY, [MASK].", BERT proposes
the following words: too (4.0%), please (1.6%), then (1.1%), now (1.1%) and Mom
(0.7%). Moreover, if we mask only one subtoken of the word ITALY, BERT
produces top predictions such as ##E (20.7%), IS (18.6%) and AND (15.7%). The
reason that uppercase words are harder to process correctly is that different
BPE tokens have different dense representations and, in consequence, the
language model might not have enough knowledge about them [
        <xref ref-type="bibr" rid="ref22 ref24">24,22</xref>
        ]. Therefore,
it might be necessary to process uppercase words differently in NER systems.
      </p>
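      <p>
        The casing issue above can be reproduced with any cased subword vocabulary. The following sketch uses a small hand-made (hypothetical) vocabulary rather than BERT's real one, and shows how greedy longest-match-first WordPiece tokenization keeps Italy whole but fragments ITALY:
      </p>

```python
# Toy illustration of why cased subword vocabularies split uppercase words:
# a greedy longest-match-first WordPiece tokenizer over a hand-made vocabulary.
# The vocabulary below is hypothetical; the real BERT-base vocabulary behaves
# similarly for "Italy" vs. "ITALY".

def wordpiece(word, vocab):
    """Greedy longest-match-first subword tokenization, as in WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary: the title-case form is a single token,
# but only short uppercase fragments exist.
VOCAB = {"Italy", "IT", "##AL", "##Y"}

print(wordpiece("Italy", VOCAB))  # ['Italy']
print(wordpiece("ITALY", VOCAB))  # ['IT', '##AL', '##Y']
```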
      <p>
        Entity boundaries: Although the prediction of named entity boundaries is
associated with nested named entities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in Li et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors determined
that the prediction of boundaries in flat named entities in English (CoNLL 2003
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]) can be as high as 97.40. Therefore, we ask ourselves whether this
performance is always reached in all languages and datasets, or whether, in some
cases, the correct prediction of boundaries is a bottleneck for improving the
detection of named entities.
      </p>
      <p>
        Low generalization: One of the biggest challenges in NER systems is the
prediction of named entities that were never seen during the training or that
have weak or zero regularity, such as titles of books and movies [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In the last
year, there have been some interesting methods for increasing NER systems'
generalization, such as the manual creation of triggers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and the permutation
of named entities along with the reduction of context as in Lin et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In this
work, we explore another method that might improve generalization while,
at the same time, adapting the language model to the domain of the
dataset analyzed.
      </p>
      <p>We address these cases with three different approaches that could be used to
improve the performance of a NER system. First, we explore whether the
marking of uppercase tokens and the addition of supplementary casings can improve
the detection of named entities. Second, we determine whether training a named
entity boundary detector in a multi-task fashion could improve the
performance of a NER system. Finally, we investigate whether the masking
and prediction of tokens during training could increase the NER system's
generalization.</p>
      <p>
        Therefore, we present our experiments and conclusions on five different
datasets. Two of them are in high-resourced languages: English (CoNLL 2003 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]) and
Spanish (CoNLL 2002 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]), and three are from low-resourced languages: Croatian
(HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]), Slovene (SSJ500k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) and Finnish [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The obtained results
show an improvement over the state of the art for Croatian and Finnish,
while we obtain interesting results for the remaining languages. Notably, we can
observe a benefit both from marking uppercase tokens and from predicting masked
tokens during the training of the models.
      </p>
      <p>The rest of the paper is structured as follows. In Section 2, we present the
most relevant related work regarding NER systems for the languages explored
in this paper. Then, in Section 3, we introduce the methodology explored in this
work. The explored datasets and the experimental setup are described in
Section 4 and Section 5, respectively. The results and their discussion are presented
in Section 6. Finally, the conclusions and future work are presented in Section 7.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Recent multilingual NER systems have opted for BERT-based architectures.
For instance, Luoma et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] presented a new dataset in Finnish based on
the Universal Dependency Finnish corpus and evaluated it using different NER
systems from the state of the art, including FinBERT [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], a Finnish BERT.
      </p>
      <p>
        For Croatian and Slovene, the Janes Project [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed Janes-NER, an
NER system that uses a Conditional Random Fields (CRF) classifier, along with
lexica and Brown clusters; it is based on the work of Ljubesic and Erjavec [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. It
was trained and tested on HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and SSJ500k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] using 5 possible entity
types: Location, Person, Person-Derived, Organization and Miscellaneous. Both
languages have been evaluated2 using the Babushka-Bench3.
      </p>
      <p>
        The work of Ulcar and Robnik-Sikonja [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] presented CroSloEngual (CSE),
a multilingual BERT for Croatian, Slovene and English. The pre-trained model
was evaluated on NER using the datasets of HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and SSJ500k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; only
entities of type Location, Person and Organization were predicted.
      </p>
      <p>
        In Alves et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors evaluated two NER systems from the state
of the art: Polyglot [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the Croatian NERC System (CNERC) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] over the
corpus HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Only the entities of type Location, Person and Organization
were considered.
      </p>
      <p>
        Yu et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] used BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], FastText [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and character embeddings, with a
biaffine model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in a new NER system. Their results improved state-of-the-art
results in multiple datasets including Spanish CoNLL 2002 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        In Li et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors created BdryBot, a tool for detecting named
entity boundaries. It is based on multiple recursive neural networks, a pointer
mechanism and BERT. On English CoNLL 2003 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], they reached an F-score
of 0.974 on the prediction of entity boundaries. Comparing this value with the
current state of the art for the detection of named entities, 0.943 [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], suggests
that the detection of named entity boundaries is easier to achieve than the
prediction of their types.
2 github.com/clarinsi/janes-ner
3 github.com/clarinsi/babushka-bench
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>As indicated previously in Section 1, we present in this work an NER system
that, through three methods, seeks to reduce the effects of three issues:
uppercase words, entity boundaries as a bottleneck and low generalization. The
architecture of the proposed methodology is shown in Figure 1 and is
composed of four key elements: prediction of named entities, prediction of entity
boundaries, prediction of masked tokens and processing of uppercase tokens.
Each of these components is described below.</p>
      <p>
        The prediction of named entities is done through a linear layer that is
connected to the output generated by a BERT model, similar to the work
proposed by Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, to improve the correct annotation of
entities, we also add a CRF layer, as in Ma and Hovy [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>For the prediction of named entity boundaries, we make use of the same
architecture used for the prediction of named entities. However, the linear and CRF
layers focus on a reduced set of labels, which are only related to entity
boundaries. The objective of this component is to determine whether the prediction
of boundaries could improve the prediction of named entities, as the former is an
easier task than the latter.</p>
      <p>
        Regarding the prediction of masked tokens, the architecture follows the same
one proposed by Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for training a masked language model. This
consists of introducing the output of a BERT model into a linear layer, which
has the same size as the pre-trained vocabulary. The linear layer is expected
to predict the masked token. The component's goal is to force BERT to learn
patterns that could detect named entities even when a portion of the information
is hidden. At the same time, it fine-tunes BERT's embeddings to the domain of the
NER dataset.
      </p>
      <p>The prediction components, as shown in Figure 1, are coupled in a multi-task
fashion. This means that each of the previously mentioned components is
associated with a specific loss function, which produces values related to each task.
During training, the losses produced by all the tasks are summed. However,
at prediction time, as we are only interested in the prediction of named
entities, only the NER part is active.</p>
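      <p>
        The multi-task coupling described above can be sketched as follows. A random tensor stands in for BERT's output, plain cross-entropy replaces the Linear+CRF heads for brevity, and all dimensions and names are illustrative, not those of our implementation:
      </p>

```python
# Minimal multi-task sketch: one shared encoder output feeds three heads
# (NER tags, boundary tags, masked-token prediction) and the three task losses
# are summed during training. A random tensor stands in for BERT's last hidden
# states, and plain cross-entropy replaces the CRF layers for brevity.
import torch
import torch.nn as nn

hidden, n_ner, n_bnd, vocab = 32, 9, 5, 100
seq = torch.randn(1, 7, hidden)         # stand-in for BERT's output (batch, seq, dim)

ner_head = nn.Linear(hidden, n_ner)     # Linear (+CRF in the paper) for entities
bnd_head = nn.Linear(hidden, n_bnd)     # Linear (+CRF in the paper) for boundaries
mlm_head = nn.Linear(hidden, vocab)     # linear layer sized to the vocabulary

loss_fn = nn.CrossEntropyLoss()
ner_gold = torch.randint(0, n_ner, (1, 7))
bnd_gold = torch.randint(0, n_bnd, (1, 7))
mlm_gold = torch.randint(0, vocab, (1, 7))

# During training the three losses are summed; at prediction time
# only the NER head would be active.
loss = (loss_fn(ner_head(seq).view(-1, n_ner), ner_gold.view(-1))
        + loss_fn(bnd_head(seq).view(-1, n_bnd), bnd_gold.view(-1))
        + loss_fn(mlm_head(seq).view(-1, vocab), mlm_gold.view(-1)))
loss.backward()
```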
      <p>
        With respect to the tokenization of uppercase words, we decided to follow a
pre-processing method. In this case, we explore whether introducing additional
casings into the input sentence could bring additional information to BERT
regarding the correct context in which a named entity occurs. To do this, we follow
an approach similar to the one of Baldini Soares et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], in which entity
markers are used to focus BERT on specific information. In the context of uppercase
words, we add to BERT's vocabulary two special tokens, [UP] and [up], which mark
the occurrence of an uppercase word. Between these special tokens, we introduce
the tokens produced by the original uppercase word, but also the tokens obtained
from the word in title-case and lowercase. For instance, in Figure 1, we can observe
that two words are in uppercase, i.e. ROME and ITALY, thus we change their
respective tokens to a marked and enriched version.
      </p>
      <p>[Figure 1: Architecture of the proposed method. A pre-trained BERT model feeds three output heads: Linear+CRF layers predicting named entity tags (e.g. S-LOC O S-LOC O B-PER E-PER O O), Linear+CRF layers predicting generic boundary tags (e.g. S-X O O B-X E-X O O) and a linear layer predicting masked tokens (e.g. Minister). The input "ROME , ITALY . Giuseppe Conte , Prime Minister..." is tokenized as [ROM, ##E] [,] [IT, ##AL, ##Y] [.] [Giuseppe] [Con, ##te] , [Prime] [MASK]..., and the uppercase tokens are modified to [UP] [ROM, ##E] [Rome] [r, ##ome] [up] and [UP] [IT, ##AL, ##Y] [Italy] [it, ##aly] [up].]</p>
      <p>
        In other words, the tokens
associated to ROME become "[UP] [ROM, ##E] [Rome] [r, ##ome] [up]".4 It
should be indicated that the prediction of named entities (or boundaries) is
done uniquely over the first token, which corresponds to [UP]. This approach is
similar to the one used by Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for words split into multiple
subtokens by BERT's tokenizer, but also by Baldini Soares et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], regarding the
use of entity markers.
      </p>
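      <p>
        The uppercase marking can be sketched as a small pre-processing function. The tokenizer argument below is a stand-in for BERT's WordPiece tokenizer, and leaving single-letter words unmarked is our assumption for illustration:
      </p>

```python
# Sketch of the uppercase-marking pre-processing described above: each fully
# uppercase word is replaced by [UP] <uppercase pieces> <title-case pieces>
# <lowercase pieces> [up]. The `tokenize` callable stands in for BERT's
# WordPiece tokenizer, which would split each casing variant into sub-tokens.

def mark_uppercase(words, tokenize):
    out = []
    for word in words:
        if word.isupper() and len(word) > 1:  # single letters left unmarked here
            out.append("[UP]")
            out.extend(tokenize(word))          # original uppercase form
            out.extend(tokenize(word.title()))  # title-case variant
            out.extend(tokenize(word.lower()))  # lowercase variant
            out.append("[up]")
        else:
            out.extend(tokenize(word))
    return out

# Identity "tokenizer" for illustration only.
tokens = mark_uppercase(["ITALY", "is", "nice"], tokenize=lambda w: [w])
print(tokens)  # ['[UP]', 'ITALY', 'Italy', 'italy', '[up]', 'is', 'nice']
```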
    </sec>
    <sec id="sec-4">
      <title>Datasets</title>
      <p>
        For English, we use the collection proposed in CoNLL 2003 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and for Spanish
the dataset created in CoNLL 2002 [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Both corpora have been annotated using
4 types of named entities: Location, Person, Organization and Miscellaneous.
      </p>
      <p>
        For Croatian and Slovene, we use the corpus HR500k [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and SSJ500k [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
respectively. According to their respective authors, both corpora have been
annotated with 5 types of named entities: Location, Person, Person-derived,
Organization and Miscellaneous. However, in the case of HR500k, we did not find
entries tagged with the Miscellaneous type, as was also the case in [
        <xref ref-type="bibr" rid="ref2 ref27">2,27</xref>
        ].
Following some previous works [
        <xref ref-type="bibr" rid="ref2 ref27">2,27</xref>
        ], we removed the type Person-derived, as it is
the least frequent type in both corpora. It should be indicated that we use the
training and testing partitions provided by Ulcar and Robnik-Sikonja [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
However, the training partitions were split into 90% training and 10% development
using a stratified strategy in order to use an early stop approach.
      </p>
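      <p>
        The 90/10 stratified split can be sketched as follows. The stratification key we used is not detailed here, so grouping sentences by whether they contain any entity tag is an assumption made for illustration:
      </p>

```python
# Minimal sketch of a 90/10 stratified split of the training sentences.
# Stratifying on "contains an entity or not" is an illustrative assumption.
import random

def stratified_split(sentences, dev_ratio=0.1, seed=0):
    """sentences: list of (tokens, tags) pairs. Returns (train, dev)."""
    rng = random.Random(seed)
    buckets = {}
    for sent in sentences:
        key = any(tag != "O" for tag in sent[1])  # stratum: has entity or not
        buckets.setdefault(key, []).append(sent)
    train, dev = [], []
    for group in buckets.values():
        rng.shuffle(group)
        cut = int(len(group) * dev_ratio)  # same dev ratio within each stratum
        dev.extend(group[:cut])
        train.extend(group[cut:])
    return train, dev

data = [(["a"], ["O"])] * 80 + [(["b"], ["S-LOC"])] * 20
train, dev = stratified_split(data)
print(len(train), len(dev))  # 90 10
```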
      <p>
        Regarding the Finnish language, we have used the corpus proposed by Luoma
et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This corpus has been annotated using 6 different types of named
entities: Location, Date, Person, Event, Organization and Product.
4 Neither the square brackets nor the commas surrounding the non-special tokens are
in the original representation. However, they are shown in our examples to illustrate
the different sub-tokens produced for each word.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Experimental Setup</title>
      <p>
        The NER systems explored in this article are based on BERT, using PyTorch,
HuggingFace's Transformers [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and different pre-trained BERT models: for
English we make use of BERTBASE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; for Finnish, FinBERT [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]; for Slovene
and Croatian, CroSloEngual [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and for Spanish we use BETO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>All the named entity tags are encoded using BIOES (Beginning, Inside,
Outside/Other, End, Single). For the detection of boundaries, we convert the named
entity tags into a generic BIOES encoding; in other words, we use a generic entity
type, e.g. B-X, I-X, E-X.</p>
      <p>
        For each language, we train 12 different models. The first model, i.e. the
baseline, is the implementation that only consists of BERT+Linear+CRF. The remaining
11 models are the different combinations of the approaches described in Section 3
when added to our baseline. Based on the recommendation proposed by Mosbach
et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], every model is trained for up to 20 epochs using an early stop approach
and AdamW with bias correction, along with an epsilon of 1e-8. The early
stop is based on the micro F-score and loss of the development dataset.
      </p>
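      <p>
        The early stop criterion can be sketched as follows. How the micro F-score and the loss are combined is not detailed above, so tracking the development F-score with the loss as a tie-breaker is an assumption:
      </p>

```python
# Sketch of an early-stop criterion over the development set: stop once the
# micro F-score has not improved for `patience` epochs, using the loss as a
# tie-breaker when F-scores are equal (an illustrative assumption).

class EarlyStop:
    def __init__(self, patience=3):
        self.patience = patience
        self.best = (-1.0, float("inf"))  # (best dev micro F-score, its loss)
        self.bad_epochs = 0

    def step(self, f_score, loss):
        """Call once per epoch; returns True when training should stop."""
        if f_score > self.best[0] or (f_score == self.best[0] and loss < self.best[1]):
            self.best = (f_score, loss)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStop(patience=2)
for f, l in [(0.80, 1.0), (0.82, 0.9), (0.81, 0.95), (0.81, 0.97)]:
    if stopper.step(f, l):
        print("stop")  # triggered after two epochs without improvement
```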
      <p>With respect to the masking of tokens, we only affect the sentences in the
training partitions that are longer than three tokens5. At each epoch, we
randomly select 25% of each sentence's tokens and substitute them with [MASK]. If
a token, after being processed by BERT's tokenizer, produces more than one
BERT token, we randomly select one of them to mask. For instance, in the case
of the last name Conte, see Figure 1, one of the sub-tokens would be masked.</p>
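      <p>
        The masking step can be sketched as follows, with each original word represented by its list of BERT sub-tokens (names and the in-place update are illustrative):
      </p>

```python
# Sketch of the masking step: for each sentence longer than three words, 25% of
# the words are chosen at random; if a chosen word was split into several
# sub-tokens, only one randomly chosen sub-token is replaced by [MASK].
import random

def mask_sentence(subtoken_lists, ratio=0.25, rng=random):
    """subtoken_lists: one list of BERT sub-tokens per original word (modified in place)."""
    if len(subtoken_lists) <= 3:
        return subtoken_lists  # short sentences are left untouched
    n = max(1, int(len(subtoken_lists) * ratio))
    for i in rng.sample(range(len(subtoken_lists)), n):
        pieces = subtoken_lists[i]
        pieces[rng.randrange(len(pieces))] = "[MASK]"  # mask one sub-token only
    return subtoken_lists

rng = random.Random(0)
sent = [["Giuseppe"], ["Con", "##te"], [","], ["Prime"], ["Minister"]]
masked = mask_sentence(sent, rng=rng)
```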
      <p>
        In Table 1, we present a summary of the hyperparameters used for training
the NER system. As can be seen, all the parameters, except for the number
of epochs and the optimizer, follow those used in Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        It should be noted that unlike other works, such as [
        <xref ref-type="bibr" rid="ref19 ref28 ref8">8,28,19</xref>
        ], where BERT's
input was enriched either with surrounding sentences or document context, our
models take as input only the sentence that needs to be analyzed. Moreover,
and in contrast with some BERT implementations, inputs surpassing BERT's
token window size are split instead of truncated.6 The splitting consists of
generating a new input sentence with the remaining tokens; during prediction, the
tokens are aligned to match the original input.
      </p>
      <p>
        We evaluate the output of the NER system using Seqeval 7. With respect to
the assessment of boundaries, we use Nervaluate8. This evaluation tool provides
exact, a metric which determines how well the boundaries of the predicted named
entities match those found in the gold standard, regardless of the type.
5 In this case, we refer to the actual definition of tokens, rather than those obtained
by BERT's tokenizer.
6 Some implementations disregard the tokens surpassing the token window or
consider these as the type Other.
7 github.com/chakki-works/seqeval
8 github.com/ivyleavedtoadflax/nervaluate/
      </p>
    </sec>
    <sec id="sec-6">
      <title>Results and Discussion</title>
      <p>
        We present in Table 2 and Table 3 the average and maximum results of five
iterations, in terms of micro and macro F-score, regarding the prediction of
named entities for the different combinations of systems proposed in this work.
We also present results from the state of the art (only a selection of it in the
case of English, where the list could be very long). It should be noted that
the scores from the state of the art presented in Table 2 and Table 3 are not the
product of multiple iterations, except for BERTBASE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As well, the evaluation
of Janes-NER using the Babushka-Bench, see Table 2, does not consider errors in
boundaries, and calculates the macro F-score using the performance of 5 named
entity types plus the score obtained when predicting the Other type. Moreover, for
Croatian and Slovene, Table 2, the number of named entity types predicted by
each system might not be the same. This last issue will be discussed in detail
further ahead.
      </p>
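      <p>
        The exact metric can be sketched as a span-level F-score that ignores entity types. This is our minimal reading of what Nervaluate's exact evaluation computes, not its actual implementation:
      </p>

```python
# Sketch of the "exact" boundary metric: F-score over predicted entity spans,
# where a span counts as correct if its boundaries match a gold span exactly,
# regardless of the entity type.

def exact_f1(gold_spans, pred_spans):
    """Spans are (start, end) pairs; entity types are deliberately ignored."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    if not gold or not pred or tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# One boundary error out of two entities on both sides gives an F-score of 0.5.
print(exact_f1([(0, 2), (5, 6)], [(0, 2), (4, 6)]))  # 0.5
```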
      <p>
        We can observe, in Table 2 and Table 3, that the prediction of masked
tokens can improve, on average, the performance of the explored methods; the
exception is Finnish, where the performance decreases. In fact, we can observe
that for Finnish, the difference between the maximum macro F-score and the
average is larger, meaning that the model becomes less stable when predicting
masked tokens. Nonetheless, we can achieve better micro F-score values for
Finnish. Furthermore, by masking tokens, whether combined with other features or
not, we can improve, on average, the performance on Spanish CoNLL 2002. Nonetheless,
our maximum score does not reach the performance presented by Yu et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ].
Although for English CoNLL 2003 we are still far from the current state of the
art, and slightly worse than BERTBASE, it should be noted that we do not use any
kind of document context as Devlin et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] did. This might be a signal that
we are forcing BERT to generalize better and to deal with smaller contexts.
      </p>
      <p>The reasons why the masking of tokens affected the stability of the less
frequent named entities in Finnish are unclear. It could be the case that we did not
mask enough tokens to force BERT to generalize in certain iterations. Or it might
be related to the agglutinative nature of Finnish, in which the masking of
tokens severely affects key elements of the language needed to predict named
entities correctly.</p>
      <p>
        In Table 4, we present the results regarding the exact metric in terms of micro
and macro F-score for each language. This metric evaluates how well a system
predicts the boundaries of named entities regardless of the associated type. For
Croatian, English and Spanish, we can observe in Table 4 that the prediction of
entity boundaries is quite stable in general, both in terms of micro and macro
F-score. It should be indicated that the state-of-the-art micro F-score for English
CoNLL 2003 regarding the detection of boundaries is the following: BdryBot
95.90, BERT 96.90 and BdryBot+BERT 97.40 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Regarding Slovene, we can
notice that the prediction of masked tokens improves the exact metric; however,
training a model where we additionally predict the boundaries seems to have no
effect in general. For Finnish, the exact metric shows a stable performance in
terms of micro F-score; nonetheless, in terms of macro F-score, we can observe a
decrease when we predict masked tokens.
      </p>
      <p>Another element to discuss regarding Table 4 is that for languages such as
English and Spanish, recognizing named entity boundaries can be considered
an easy task. Nevertheless, for Croatian, and to a lesser degree Slovene and
Finnish, it is much more difficult. Moreover, for Finnish the prediction of entity
boundaries is harder to achieve for the less frequent types of named entities, as
shown by the larger difference between the micro and macro F-scores. Furthermore,
the results obtained in Table 4 give us an idea of what could be the maximum
possible score that a NER system could achieve if the task only consisted of
predicting the types of confirmed named entity boundaries. In the same line, the
results show that despite the fact that it is easy to find named entity boundaries
[Table 3 caption: Results over five iterations on Spanish and English; the maximum of
each iteration is between brackets. The best performance is in bold. Boundaries (B.),
Uppercase (U.), Generated Uppercase (G.U.) and Masked (M.)]</p>
      <p>[Table 2/Table 3 row labels (scores not recoverable from the extraction): BERT Base [8], LUKE [31], Seq2seq+BERT [23], NER Dep.Par. [32], Baseline, B., U., G.U., B.+U., B.+G.U., M., M.+B., M.+U., M.+G.U., M.+B.+U., M.+B.+G.U.; Spanish]</p>
      <p>
        [Table 4 caption: English. The exact metric evaluates the correct prediction of entity
boundaries regardless of their type, over five iterations. Boundaries (B.), Uppercase (U.),
Generated Uppercase (G.U.) and Masked (M.)]
in Spanish, predicting their type is much more difficult for that language,
compared to English or Croatian, if we compare the results shown in Table 2 and
Table 4. For Croatian, where the detection of boundaries is more
difficult, the prediction of their types lies in a range of around three points, while
in Spanish it is around seven points. Therefore, we can deduce that, in order to
improve the prediction of named entities in languages such as Croatian, it
is necessary to primarily focus on the correct detection of boundaries. Further,
for Spanish, it is necessary to improve the prediction of types rather than
entity boundaries. However, this last issue could also be a sign of discrepancies in
the annotation, either of the training or the testing dataset, something that is
already known to occur in the English CoNLL 2003 corpus [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>With respect to the marking of uppercase tokens, we can notice in Table 2 and
Table 3 that we can improve the performance mainly in English and Croatian.
(a) English: Soccer - Leeds' Bowyer fined for part in fast-food fracas.
Tokens KENIAN ANGLIKAANISEN KIRKON SIHTEERI JOHN KAGO SANOI
Baseline O O O O B-PER E-PER O
M.+U. O B-ORG I-ORG E-ORG O O O
M.+G.U. S-LOC B-ORG E-ORG O B-PER E-PER O
Gold-Std. B-ORG E-ORG E-ORG O B-PER E-PER O
(b) Finnish: John Kago, secretary of the Anglican Church of Kenya, said Thursday
Tokens PROFESOR NA USLA STEPHEN HUBBELL
Baseline O O O B-PER E-PER
U. O O O B-PER E-PER
G.U. O O S-ORG B-PER E-PER
Gold-Std. O O S-ORG B-PER E-PER
(c) Croatian: Professor of USLA Stephen Hubbell
Tokens EL TRIBUNAL DE DEFENSA DE LA COMPETENCIA DETERMINE
Baseline O O O O O O O O
M.+U. O S-ORG O O O O O O
G.U. O B-ORG I-ORG I-ORG I-ORG I-ORG E-ORG O
Gold-Std. O B-ORG I-ORG I-ORG I-ORG I-ORG E-ORG O
(d) Spanish: The Competition Defense Court determines
However, by generating random uppercase sentences during training, the marking
of uppercase tokens can improve the performance in all the languages. In most
cases, this happens as well when applied along with other methods, such as the
prediction of masked tokens, especially in Slovene and English.</p>
      <p>Although all the datasets contain a variable number of words only in
uppercase, there are two possible reasons why some languages benefited more than
others. First, it can be the case that the number of uppercase tokens was not
large enough to make BERT learn about the marking. Second, it can be related
to the textual information that was used to train each BERT model.
Nonetheless, BERT is indeed capable of learning the meaning and context of uppercase
tokens if enough data has been used during training, as happened when we
artificially generated uppercase sentences.</p>
      <p>In Figure 5, we present four examples regarding the prediction of named
entities in uppercase sentences; the selected models are the best of each type.
As shown in Figure 5b, the prediction of entities does not become perfect when
marking uppercase words, but it can definitely improve their recognition.
As indicated previously, the evaluation of NER systems over the Croatian
and Slovene datasets is not standard across the state-of-the-art systems. The
main reason is that some named entity types are either not found in the corpus
or are disregarded due to their small frequency. Therefore, we present in Table 6
the recalculation of the macro F-scores. These scores are based on the three
common types of named entities used in the different NER systems from the
state of the art.</p>
      <p>
        With respect to Croatian, we can observe in Table 2 and Table 6 that we
can improve the results with respect to CroSloEngual, which is also based on
BERT. Furthermore, our largest improvement, with respect to Janes-NER [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
is for the prediction of named entities of type Location and Organization. For
Slovene, we are not able to surpass the performance shown in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>Although it is common in the state of the art to present results for the Slovene
and Croatian corpora only in terms of macro F-score, the lack of micro F-scores
or detailed results per class makes it difficult to perform a detailed comparison
of the systems. The Croatian and Slovene corpora are not balanced; in
other words, the number of entities per class is not equal. The
macro F-score considers all types of named entities equally important and
disregards their frequency in the dataset. Thus, it is impossible to know whether
systems such as CroSloEngual, Polyglot or Croatian NERC are focusing
on the most frequent classes or on the less frequent ones. For instance, we know
that our Croatian NER system focuses on the less frequent class Location (117
occurrences) rather than on the most frequent ones (Person and Organization,
with 228 and 365 occurrences, respectively). In contrast, our Slovene NER system
focuses on the most frequent classes, Person and Location (with 257
and 210 occurrences, respectively), rather than on the less frequent ones, Organization (112
occurrences) and Miscellaneous (47 occurrences).</p>
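The effect of class imbalance on macro versus micro F-scores can be illustrated with a small, made-up example (the counts below are hypothetical and are not taken from our datasets):

```python
def f1(tp, fp, fn):
    """Standard F1 from true positive, false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts: one frequent class predicted well, one rare class poorly.
counts = {
    "PER": dict(tp=200, fp=20, fn=28),   # frequent, well predicted
    "LOC": dict(tp=10,  fp=30, fn=107),  # rare, poorly predicted
}

# Macro: average of per-class F1 -> every class weighs the same.
macro = sum(f1(**c) for c in counts.values()) / len(counts)

# Micro: F1 over pooled counts -> frequent classes dominate.
micro = f1(sum(c["tp"] for c in counts.values()),
           sum(c["fp"] for c in counts.values()),
           sum(c["fn"] for c in counts.values()))
```

Here the micro F-score stays high because the frequent class is handled well, while the macro F-score is dragged down by the rare class, which is exactly why reporting only one of the two hides where a system focuses.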
      <p>
        Despite not surpassing the current state of the art in Slovene, it should be
noted that we trained a system with four types of entities rather than the three
used in the work of Ulcar and Robnik-Sikonja [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. This can
introduce noise or increase the difficulty of the task, as the left-out named entity
type, Miscellaneous, is the least frequent one. In this case, if we compare with
Janes-NER [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], this system obtains an F-score of 27.00 for Miscellaneous, while
our masked baseline reaches up to 85.42.
      </p>
      <p>
        Finally, with respect to Finnish, we were able to surpass, on average, the
performance of the state of the art in terms of macro F-score, and in some
iterations also in terms of micro F-score. Based on the difference between micro and macro
F-scores, presented in Table 2, we can determine that several of our systems
focused slightly more on the less frequent classes, in comparison to the work of
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This can be observed in detail in Table 7, where we improved the prediction
of entities of type Event and Product, the least frequent classes (7 and
79 instances in the test set, respectively), at the cost of fewer correct predictions for a
more frequent class, i.e. Organization (208). We can also observe that macro
F-scores can vary more than micro F-scores, depending on the seed
used during training. Nevertheless, we can improve
the prediction of entities without having to add supplementary sentences to
increase the context, as Luoma et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] did.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and future work</title>
      <p>
        Named Entity Recognition (NER) is a task that aims to extract and classify
groups of tokens referring to specific types such as locations, persons and
organizations. In the last couple of years, since the creation of BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], multiple NER
systems have made use of its architecture to provide high-performing tools.
Nonetheless, we observed that this kind of system can face some issues, such as
poor prediction on uppercase sentences, wrong detection of entity boundaries
and low generalization.
      </p>
      <p>Therefore, in this work, we presented three different methods that could
alleviate these issues. Experiments were carried out on five languages, three of them
low-resourced. We improved the state of the art with a micro F-score of
up to 89.54 in Croatian by marking uppercase tokens and generating uppercase
sentences during training. By marking uppercase tokens and predicting boundaries
and tokens, we managed to improve the performance of BERTBASE to an
F-score of up to 92.62 in English, while obtaining the second-best performance in
Spanish with an F-score of up to 89.56. In Finnish, we improved, on average, the
prediction of the less frequent named entity types, with a macro F-score of 82.41
versus 81.00 in the state of the art, while reaching a micro F-score of up to 92.09
versus 91.60. We also provide a NER system for Slovene that predicts four
types of named entities, one of which is infrequent, with results comparable to
those of another state-of-the-art tool that only predicts the three most
frequent types.</p>
      <p>Furthermore, we observed that in Croatian, the prediction of named entity
boundaries is a bottleneck for NER systems, while in Spanish it seems
easy to find the boundaries of named entities but much harder to determine their
type. Finally, we proposed a simple method that can improve the prediction of
named entities in sentences written in uppercase.</p>
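As a sketch of what type-agnostic boundary prediction looks like, auxiliary targets can be derived from BIOES entity tags by dropping the entity type; this is an illustration of the idea, and the exact auxiliary labels used in the experiments may differ:

```python
def boundary_labels(tags):
    """Map BIOES entity tags (e.g. 'B-ORG', 'S-PER') to type-agnostic
    boundary tags ('B', 'I', 'E', 'S'), keeping 'O' unchanged. Predicting
    these alongside the full tags lets the model learn where entities start
    and end independently of their type."""
    out = []
    for t in tags:
        if t == "O":
            out.append("O")
        else:
            out.append(t.split("-")[0])  # keep only the positional prefix
    return out
```

In a multi-task setup, a second classification head would be trained on these boundary labels jointly with the main NER head.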
      <p>In the future, we intend to experiment with additional languages. We would
like to assess whether adding some context to the left of the split sentences
could improve NER performance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the European Union's Horizon 2020 research and
innovation program under grants 770299 (NewsEye) and 825153 (Embeddia).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Al-Rfou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perozzi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skiena</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>POLYGLOT-NER: Massive Multilingual Named Entity Recognition</article-title>
          .
          <source>CoRR abs/1410</source>
          .3791 (
          <year>2014</year>
          ), http://arxiv.org/abs/1410.3791
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thakkar</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tadic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Evaluating Language Tools for Fifteen EU-official Under-resourced Languages</article-title>
          .
          <source>In: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          . pp.
          <year>1866</year>
          –
          <year>1873</year>
          . European Language Resources Association, Marseille, France (May
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Baldini</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>FitzGerald</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwiatkowski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Matching the Blanks: Distributional Similarity for Relation Learning</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <volume>2895</volume>
          –
          <fpage>2905</fpage>
          . Association for Computational Linguistics, Florence,
          <source>Italy (Jul</source>
          <year>2019</year>
          ). https://doi.org/10.18653/v1/
          <fpage>P19</fpage>
          -1279
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bekavac</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tadic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Implementation of Croatian NERC System</article-title>
          .
          <source>In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing</source>
          . pp.
          <volume>11</volume>
          –
          <fpage>18</fpage>
          . Association for Computational Linguistics, Prague, Czech Republic (
          <year>Jun 2007</year>
          ), https://www.aclweb.org/anthology/W07-1702
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          .
          <source>Transactions of the Association of Computational Linguistics</source>
          <volume>5</volume>
          ,
          <issue>135</issue>
          –
          <fpage>146</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>R.C.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Incorporating Boundary and Category Feature for Nested Named Entity Recognition</article-title>
          . In: Nah,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.W.</given-names>
            ,
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.X.</given-names>
            ,
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.S.</given-names>
            ,
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.E</surname>
          </string-name>
          . (eds.)
          <article-title>Database Systems for Advanced Applications</article-title>
          . pp.
          <volume>209</volume>
          –
          <fpage>226</fpage>
          . Springer International Publishing,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Cañete, J.,
          <string-name>
            <surname>Chaperon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuentes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>J.: Spanish</given-names>
          </string-name>
          <string-name>
            <surname>Pre-Trained BERT</surname>
          </string-name>
          Model and
          <article-title>Evaluation Data</article-title>
          .
          <source>In: PML4DC at ICLR</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <volume>4171</volume>
          –
          <fpage>4186</fpage>
          . Association for Computational Linguistics, Minneapolis,
          <source>Minnesota (Jun</source>
          <year>2019</year>
          ). https://doi.org/10.18653/v1/
          <fpage>N19</fpage>
          -1423
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Dozat</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Deep Biaffine Attention for Neural Dependency Parsing</article-title>
          .
          <source>CoRR abs/1611</source>
          .01734 (
          <year>2016</year>
          ), http://arxiv.org/abs/1611.01734
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Fiser</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubesic</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erjavec</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The Janes project: language resources and tools for Slovene user generated content</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>54</volume>
          (
          <issue>1</issue>
          ),
          <volume>223</volume>
          –246 (Mar
          <year>2020</year>
          ). https://doi.org/10.1007/s10579-018-9425-z
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Krek</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobrovoljc</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erjavec</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ledinek</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holz</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zupan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gantar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuzman</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cibej</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Arhar</given-names>
            <surname>Holdt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Kavcic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Skrjanec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Marko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Jezersek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Zajc</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <source>Training corpus ssj500k 2</source>
          .
          <issue>2</issue>
          (
          <issue>2019</issue>
          ), http://hdl.handle.net/11356/1210,
          <article-title>Slovenian language resource repository CLARIN</article-title>
          .SI
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            , A., Han,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A Survey on Deep Learning for Named Entity Recognition</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data</source>
          Engineering pp.
          <volume>1</volume>
          –
          <issue>1</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
          </string-name>
          , Y.:
          <article-title>Neural Named Entity Boundary Detection</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data</source>
          Engineering pp.
          <volume>1</volume>
          –
          <issue>1</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>B.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shiralkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <volume>8503</volume>
          –
          <fpage>8511</fpage>
          . Association for Computational Linguistics,
          <source>Online (Jul</source>
          <year>2020</year>
          ). https://doi.org/10.18653/v1/
          <year>2020</year>
          .acl-main.
          <fpage>752</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            , J., Han,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>N.J.:</given-names>
          </string-name>
          <article-title>A Rigorous Study on Named Entity Recognition: Can Fine-tuning Pretrained Model Lead to the Promised Land?</article-title>
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <volume>7291</volume>
          –
          <fpage>7300</fpage>
          . Association for Computational Linguistics,
          <source>Online (Nov</source>
          <year>2020</year>
          ). https://doi.org/10.18653/v1/
          <year>2020</year>
          .emnlp-main.592
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Roberta: A robustly optimized bert pretraining approach (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ljubesic</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erjavec</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Corpus vs</article-title>
          .
          <source>Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          . pp.
          <volume>1527</volume>
          –
          <fpage>1531</fpage>
          .
          <string-name>
            <surname>European Language Resources Association</surname>
          </string-name>
          (ELRA), Portoroz, Slovenia (May
          <year>2016</year>
          ), https://www.aclweb.org/anthology/L16-1242
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ljubesic</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klubicka</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agic</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jazbec</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          :
          <article-title>New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian</article-title>
          .
          <source>In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          . pp.
          <volume>4264</volume>
          –
          <fpage>4270</fpage>
          .
          <string-name>
            <surname>European Language Resources Association</surname>
          </string-name>
          (ELRA), Portoroz, Slovenia (May
          <year>2016</year>
          ), https://www.aclweb.org/anthology/L16-1676
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Luoma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oinonen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Pyykonen,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Laippala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Pyysalo</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>A Broadcoverage Corpus for Finnish Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          . pp.
          <volume>4615</volume>
          –
          <fpage>4624</fpage>
          .
          <string-name>
            <surname>European Language Resources Association</surname>
          </string-name>
          , Marseille, France (May
          <year>2020</year>
          ), https://www.aclweb.org/anthology/2020.lrec-
          <volume>1</volume>
          .
          <fpage>567</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
          </string-name>
          , E.:
          <article-title>End-to-end Sequence Labeling via Bi-directional LSTMCNNs-CRF</article-title>
          .
          <article-title>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          . pp.
          <volume>1064</volume>
          –
          <fpage>1074</fpage>
          . Association for Computational Linguistics, Berlin, Germany (Aug
          <year>2016</year>
          ). https://doi.org/10.18653/v1/
          <fpage>P16</fpage>
          -1101
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Mosbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andriushchenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klakow</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines</article-title>
          (
          <year>2020</year>
          ), https://arxiv.org/abs/2006.04884
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Powalski</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanislawek</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>UniCase &#8211; Rethinking Casing in Language Models</article-title>
          (
          <year>2020</year>
          ), arXiv cs.CL eprint: 2010.11936
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Straková</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajič</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Neural Architectures for Nested NER through Linearization</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>5326</fpage>
          &#8211;
          <lpage>5331</lpage>
          . Association for Computational Linguistics, Florence, Italy (Jul
          <year>2019</year>
          ). https://doi.org/10.18653/v1/P19-1527
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hashimoto</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Adv-BERT: BERT is not robust on misspellings! Generating natural adversarial samples on BERT</article-title>
          . arXiv preprint arXiv:2003.04985 (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Tjong Kim Sang</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          :
          <article-title>Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition</article-title>
          .
          <source>In: COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)</source>
          (
          <year>2002</year>
          ), https://www.aclweb.org/anthology/W02-2024
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Tjong Kim Sang</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meulder</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003</source>
          . pp.
          <fpage>142</fpage>
          &#8211;
          <lpage>147</lpage>
          (
          <year>2003</year>
          ), https://www.aclweb.org/anthology/W03-0419
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Ulčar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robnik-Šikonja</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>FinEst BERT and CroSloEngual BERT</article-title>
          . In:
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kopeček</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pala</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horák</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.) Text, Speech, and Dialogue. pp.
          <fpage>104</fpage>
          &#8211;
          <lpage>111</lpage>
          . Springer International Publishing, Cham (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Virtanen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanerva</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luoma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luotolahti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakoski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyysalo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Multilingual is not enough: BERT for Finnish</article-title>
          (
          <year>2019</year>
          ), https://arxiv.org/abs/1912.07076
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>CrossWeigh: Training Named Entity Tagger from Imperfect Annotations</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          . pp.
          <fpage>5154</fpage>
          &#8211;
          <lpage>5163</lpage>
          . Association for Computational Linguistics, Hong Kong, China (Nov
          <year>2019</year>
          ). https://doi.org/10.18653/v1/D19-1519
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delangue</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cistac</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rault</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Louf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funtowicz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davison</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shleifer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Platen</surname>
            ,
            <given-names>P.v.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jernite</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scao</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gugger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drame</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lhoest</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rush</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          . ArXiv abs/1910.03771 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Yamada</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shindo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeda</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>6442</fpage>
          &#8211;
          <lpage>6454</lpage>
          . Association for Computational Linguistics, Online (Nov
          <year>2020</year>
          ). https://doi.org/10.18653/v1/2020.emnlp-main.523
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bohnet</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poesio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Named Entity Recognition as Dependency Parsing</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>6470</fpage>
          &#8211;
          <lpage>6476</lpage>
          . Association for Computational Linguistics, Online (Jul
          <year>2020</year>
          ). https://doi.org/10.18653/v1/2020.acl-main.577
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>