Simple ways to improve NER in every language using markup

Luis Adrián Cabrera-Diego1, Jose G. Moreno2, and Antoine Doucet1
1 La Rochelle Université, L3i, La Rochelle, 17031, France
luis.cabrera diego@univ-lr.fr, antoine.doucet@univ-lr.fr
2 Université Paul Sabatier, IRIT, Toulouse, 31062, France
jose.moreno@irit.fr

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. We explore three different methods for improving Named Entity Recognition (NER) systems based on BERT, each responding to one of three potential issues: the processing of uppercase tokens, the detection of entity boundaries and low generalization. Specifically, we first explore the marking of uppercase tokens to provide extra casing information. We then randomly mask tokens, as in a masked language model, and predict them along with the NER task to improve NER generalization. Finally, we predict entity boundaries to improve named entity detection. The experiments were done over five languages, three of them low-resourced: Slovene, Croatian and Finnish, in addition to English and Spanish. Results show that predicting masked tokens can be beneficial for most languages, while marking uppercase tokens can be a simple method for dealing with uppercase sentences in NER. Furthermore, our methods improved the state of the art for Croatian and Finnish.

Keywords: Named Entity Recognition · BERT · multi-task

1 Introduction

Named Entity Recognition (NER) is a fundamental task in text processing that consists of extracting entities that semantically refer to notions such as locations, people and organizations [19,32]. In 2019, Devlin et al. [8] presented the deep neural network model called Bidirectional Encoder Representations from Transformers (BERT) and demonstrated that pre-trained models based on BERT can be fine-tuned to achieve high performance in multiple tasks. As a consequence, multiple BERT-based NER systems have been created in the last couple of years [12]. In this work1, we explore three different aspects that might play a role in the performance of NER systems:

1 github.com/EMBEDDIA/NER BERT Multitask

Uppercase words: Although it is uncommon to have texts that do not follow standard casing rules, some NER datasets, such as CoNLL 2003 [26] and CoNLL 2002 [25], may contain a small percentage of sentences written in uppercase. These sentences might be harder to predict for systems based on language models that use a BPE tokenizer, such as BERT or RoBERTa [16], because the uppercase version of a word is not tokenized in the same way as its title-case or lowercase versions [22]. For instance, the words Italy and ITALY are split by the BERTBASE tokenizer as [Italy] and [IT, ##AL, ##Y], respectively. If we ask BERTBASE to predict the masked word in "I live in Rome, [MASK].", the top prediction is Italy (50.5%), followed by too (8.9%), though (3.6%), Rome (3.0%) and now (2.4%). Nonetheless, when predicting the masked word for a similar phrase written in uppercase, i.e. "I LIVE IN ITALY, [MASK].", BERT proposes the words too (4.0%), please (1.6%), then (1.1%), now (1.1%) and Mom (0.7%). Moreover, if we mask only one subtoken of the word ITALY, BERT produces top predictions such as ##E (20.7%), IS (18.6%) and AND (15.7%).
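A minimal sketch of this behaviour, assuming the cased BERTBASE checkpoint available on the HuggingFace hub as bert-base-cased and the transformers library; the exact sub-tokens and probabilities may vary slightly with the library and model version.

```python
from transformers import AutoTokenizer, pipeline

# The uppercase form of a word is split differently from its title-case
# form, so it is mapped to different sub-word embeddings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Italy"))  # expected: ['Italy']
print(tokenizer.tokenize("ITALY"))  # several sub-tokens, e.g. [IT, ##AL, ##Y] as reported above

# Masked-word prediction with and without standard casing.
fill_mask = pipeline("fill-mask", model="bert-base-cased")
for sentence in ["I live in Rome, [MASK].", "I LIVE IN ITALY, [MASK]."]:
    print(sentence)
    for prediction in fill_mask(sentence, top_k=5):
        print(f"  {prediction['token_str']!r}: {prediction['score']:.3f}")
```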
The reason that uppercase words are harder to process correctly is that different BPE tokens have different dense representations and, as a consequence, the language model might not have enough knowledge about them [24,22]. Therefore, it might be necessary to process uppercase words differently in NER systems.

Entity boundaries: Although the prediction of named entity boundaries is usually associated with nested named entities [6], Li et al. [13] determined that the prediction of boundaries for flat named entities in English (CoNLL 2003 [26]) can reach an F-score as high as 97.40. We therefore ask ourselves whether this performance is reached in all languages and datasets, or whether, in some cases, the correct prediction of boundaries is a bottleneck for improving the detection of named entities.

Low generalization: One of the biggest challenges for NER systems is the prediction of named entities that were never seen during training or that have weak or zero regularity, such as titles of books and movies [15]. In the last year, some interesting methods for increasing the generalization of NER systems have been proposed, such as the manual creation of triggers [14] and the permutation of named entities along with the reduction of context, as in Lin et al. [15]. In this work, we explore another method that might improve generalization while, at the same time, adapting the language model to the domain of the analyzed dataset.

We address these issues with three different approaches that could be used to improve the performance of an NER system. First, we explore whether the marking of uppercase tokens and the addition of supplementary casings can improve the detection of named entities. Second, we determine whether training a named entity boundary detector in a multi-task fashion can improve the performance of a named entity system. Finally, we investigate whether masking and predicting tokens during training can increase the generalization of the NER system.

Therefore, we present our experiments and conclusions on five different datasets: two in high-resourced languages, English (CoNLL 2003 [26]) and Spanish (CoNLL 2002 [25]), and three in low-resourced languages, Croatian (HR500k [18]), Slovene (SSJ500k [11]) and Finnish [19]. The obtained results show an improvement over the state of the art for Croatian and also for Finnish, along with interesting results for the remaining languages. Notably, we observe a benefit both from marking uppercase tokens and from predicting masked tokens during the training of the models.

The rest of the paper is structured as follows. In Section 2, we present the most relevant related work regarding NER systems for the languages explored in this paper. In Section 3, we introduce the methodology explored in this work. The datasets and the experimental setup are described in Section 4 and Section 5, respectively. The results and their discussion are presented in Section 6. Finally, the conclusions and future work are presented in Section 7.

2 Related Work

Recent multilingual NER systems have opted for BERT-based architectures. For instance, Luoma et al. [19] presented a new Finnish dataset based on the Universal Dependency Finnish corpus and evaluated it using different NER systems from the state of the art, including FinBERT [28], a Finnish BERT.
For Croatian and Slovene, the Janes project [10] proposed Janes-NER, an NER system that uses a Conditional Random Fields (CRF) classifier along with lexica and Brown clusters; it is based on the work of Ljubešić and Erjavec [17]. It was trained and tested on HR500k [18] and SSJ500k [11] using 5 possible entity types: Location, Person, Person-derived, Organization and Miscellaneous. Both languages have been evaluated2 using Babushka-Bench3.

2 github.com/clarinsi/janes-ner
3 github.com/clarinsi/babushka-bench

The work of Ulčar and Robnik-Šikonja [27] presented CroSloEngual (CSE), a multilingual BERT for Croatian, Slovene and English. The pre-trained model was evaluated on NER using the HR500k [18] and SSJ500k [11] datasets; only entities of type Location, Person and Organization were predicted. In Alves et al. [2], the authors evaluated two NER systems from the state of the art, Polyglot [1] and the Croatian NERC system (CNERC) [4], over the HR500k corpus [18]. Only entities of type Location, Person and Organization were considered.

Yu et al. [32] used BERT [8], FastText [5] and character embeddings, together with a biaffine model [9], in a new NER system. Their results improved over the state of the art on multiple datasets, including Spanish CoNLL 2002 [25]. In Li et al. [13], the authors created BdryBot, a tool for detecting named entity boundaries. It is based on multiple recursive neural networks, a pointer mechanism and BERT. On English CoNLL 2003 [26], it reached an F-score of 0.974 for the prediction of entity boundaries. Comparing this value with the current state of the art for the detection of named entities, 0.943 [31], suggests that detecting named entity boundaries is easier than predicting their types.

3 Methodology

As indicated in Section 1, we present in this work an NER system that, through three methods, seeks to reduce the effect of three issues: uppercase words, entity boundaries acting as a bottleneck, and low generalization. The architecture of the proposed methodology is shown in Figure 1 and is composed of four key elements: prediction of named entities, prediction of entity boundaries, prediction of masked tokens and processing of uppercase tokens. Each component is described as follows.

The prediction of named entities is done through a linear layer that is connected to the output generated by a BERT model, similarly to the work proposed by Devlin et al. [8]. However, to improve the correct annotation of entities, we also add a CRF layer, as in Ma and Hovy [20].

For the prediction of named entity boundaries, we use the same architecture as for the prediction of named entities. However, the linear and CRF layers focus on a reduced set of labels, which are only related to entity boundaries. The objective of this component is to determine whether the prediction of boundaries can improve the prediction of named entities, as the former is an easier task than the latter.

Regarding the prediction of masked tokens, the architecture follows the one proposed by Devlin et al. [8] for training a masked language model. It consists of feeding the output of a BERT model into a linear layer whose size equals that of the pre-trained vocabulary; this linear layer is expected to predict the masked token. The goal of this component is to force BERT to learn patterns that can detect named entities even when a portion of the information is hidden, while at the same time fine-tuning the BERT embeddings to the domain of the NER dataset.

The prediction components, as shown in Figure 1, are coupled in a multi-task fashion. This means that each of the previously mentioned components is associated with a specific loss function, which produces values related to its task. During training, the losses produced by all the tasks are summed. However, at prediction time, as we are only interested in the prediction of named entities, only the NER part is active.
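A minimal sketch of this multi-task coupling, written with PyTorch and HuggingFace's Transformers. It is an illustration under our own simplifications, not the exact implementation: the CRF layers placed on top of the NER and boundary heads are replaced by plain cross-entropy, and all names (MultiTaskNER, the label conventions) are ours.

```python
import torch.nn as nn
from transformers import AutoModel


class MultiTaskNER(nn.Module):
    """Shared BERT encoder with three token-level heads: NER tags,
    generic entity-boundary tags and masked-token prediction."""

    def __init__(self, model_name, num_ner_tags, num_boundary_tags):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        self.ner_head = nn.Linear(hidden, num_ner_tags)
        self.boundary_head = nn.Linear(hidden, num_boundary_tags)
        self.mlm_head = nn.Linear(hidden, self.encoder.config.vocab_size)
        # Positions labelled -100 (padding, unmasked tokens) are ignored.
        self.loss_fct = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask,
                ner_labels=None, boundary_labels=None, mlm_labels=None):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        states = self.dropout(states)
        ner_logits = self.ner_head(states)
        loss = None
        if ner_labels is not None:
            # Training: the three task losses are simply summed.
            boundary_logits = self.boundary_head(states)
            mlm_logits = self.mlm_head(states)
            loss = (
                self.loss_fct(ner_logits.flatten(0, 1), ner_labels.flatten())
                + self.loss_fct(boundary_logits.flatten(0, 1), boundary_labels.flatten())
                + self.loss_fct(mlm_logits.flatten(0, 1), mlm_labels.flatten())
            )
        # Prediction: only the NER logits are of interest.
        return ner_logits, loss
```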
With respect to the tokenization of uppercase words, we follow a pre-processing approach. We explore whether introducing additional casings of the input words can give BERT extra information about the correct context in which a named entity occurs. To do this, we follow an approach similar to that of Baldini Soares et al. [3], in which entity markers are used to focus BERT on specific information. In the context of uppercase words, we add to BERT's vocabulary two special tokens, [UP] and [up], which mark the occurrence of an uppercase word. Between these special tokens, we place the tokens produced by the original uppercase word, together with the tokens obtained from its title-case and lowercase versions. For instance, in Figure 1, two words are in uppercase, i.e. ROME and ITALY, so we change their respective tokens to a marked and enriched version. In other words, the tokens associated with ROME become "[UP] [ROM, ##E] [Rome] [r, ##ome] [up]".4 It should be noted that the prediction of named entities (or boundaries) is done only over the first token, which corresponds to [UP]. This approach is similar to the one used by Devlin et al. [8] for words split into multiple sub-tokens by BERT's tokenizer, and to that of Baldini Soares et al. [3] regarding the use of entity markers.

4 Neither the square brackets nor the commas surrounding the non-special tokens are part of the actual representation; they are only shown in our examples to make visible the different sub-tokens produced for each word.

[Fig. 1: Proposed architecture, including an example of the expected output. The diagram shows the input "ROME , ITALY . Giuseppe Conte , Prime Minister...", its tokenization with the uppercase words marked and enriched and one sub-token masked, the pre-trained BERT model, and the three outputs: NER tags and entity-boundary tags produced by Linear+CRF layers, and the masked token (Minister) produced by a linear layer.]
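A minimal sketch of this marking step, assuming a HuggingFace tokenizer to which [UP] and [up] have been added (the model's embedding matrix must be resized accordingly); the helper name, the uppercase test and the handling of one-character words are our own simplifications.

```python
from transformers import AutoTokenizer

def mark_uppercase(words, tokenizer):
    """Replace every fully uppercase word by the sequence
    [UP] tokens(WORD) tokens(Word) tokens(word) [up];
    all other words are tokenized as usual."""
    pieces = []
    for word in words:
        if word.isupper() and len(word) > 1:
            pieces.append("[UP]")
            pieces.extend(tokenizer.tokenize(word))          # original casing
            pieces.extend(tokenizer.tokenize(word.title()))  # title case
            pieces.extend(tokenizer.tokenize(word.lower()))  # lowercase
            pieces.append("[up]")
        else:
            pieces.extend(tokenizer.tokenize(word))
    return pieces

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.add_tokens(["[UP]", "[up]"], special_tokens=True)
# model.resize_token_embeddings(len(tokenizer)) would be needed on the model side.
print(mark_uppercase(["ROME", ",", "ITALY", "."], tokenizer))
```

During training, the NER and boundary labels of a marked word are aligned with its leading [UP] token, as described above.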
4 Datasets

For English, we use the collection proposed in CoNLL 2003 [26] and, for Spanish, the dataset created in CoNLL 2002 [25]. Both corpora have been annotated with 4 types of named entities: Location, Person, Organization and Miscellaneous.

For Croatian and Slovene, we use the HR500k [18] and SSJ500k [11] corpora, respectively. According to their respective authors, both corpora have been annotated with 5 types of named entities: Location, Person, Person-derived, Organization and Miscellaneous. However, in the case of HR500k, we did not find entries tagged with the Miscellaneous type, as was also the case in [2,27]. Following previous work [2,27], we removed the Person-derived type, as it is the least frequent type in both corpora. It should be noted that we use the training and testing partitions provided by Ulčar and Robnik-Šikonja [27]. However, the training partitions were split into 90% training and 10% development using a stratified strategy, in order to use an early stopping approach.

Regarding the Finnish language, we use the corpus proposed by Luoma et al. [19]. This corpus has been annotated with 6 different types of named entities: Location, Date, Person, Event, Organization and Product.

5 Experimental Setup

The NER systems explored in this article are based on BERT, using PyTorch, HuggingFace's Transformers [30] and different pre-trained BERT models: for English we use BERTBASE [8]; for Finnish, FinBERT [28]; for Slovene and Croatian, CroSloEngual [27]; and for Spanish, BETO [7].

All the named entity tags are encoded using BIOES (Beginning, Inside, Outside/Other, End, Single). For the detection of boundaries, we convert the named entity tags into a generic BIOES encoding; in other words, we use a generic entity type, e.g. B-X, I-X, E-X.

For each language, we train 12 different models. The first model, i.e. the baseline, is the implementation that consists only of BERT+Linear+CRF. The remaining 11 models are the different combinations of the approaches described in Section 3 added to this baseline. Following the recommendations of Mosbach et al. [21], every model is trained for up to 20 epochs using an early stopping approach and AdamW with bias correction and an epsilon of 1 × 10−8. The early stopping is based on the micro F-score and the loss over the development dataset.

With respect to the masking of tokens, we only affect the sentences in the training partitions that are longer than three tokens5. At each epoch, we randomly select 25% of each sentence's tokens and substitute them with [MASK]. If a token, after being processed by BERT's tokenizer, produces more than one BERT token, we randomly select one of them for masking. For instance, in the case of the last name Conte (see Figure 1), only one of its sub-tokens would be masked.

5 In this case, we refer to the usual definition of tokens, rather than to those obtained by BERT's tokenizer.
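A minimal sketch of this masking step, assuming the sentence is already aligned as one list of BERT sub-tokens per word. The function name, the rounding of the 25% and the use of the literal [MASK] string (instead of tokenizer.mask_token) are our own assumptions.

```python
import random

def mask_sentence(subtokens_per_word, ratio=0.25):
    """Return a copy in which, for 25% of the words (re-sampled every epoch),
    exactly one randomly chosen sub-token is replaced by [MASK].
    Sentences of three words or fewer are left untouched."""
    if len(subtokens_per_word) <= 3:
        return subtokens_per_word
    n_to_mask = max(1, round(len(subtokens_per_word) * ratio))
    positions = random.sample(range(len(subtokens_per_word)), n_to_mask)
    masked = [list(subtokens) for subtokens in subtokens_per_word]
    for pos in positions:
        sub_pos = random.randrange(len(masked[pos]))  # mask a single sub-token
        masked[pos][sub_pos] = "[MASK]"
    return masked

# "Conte" is split into two sub-tokens; at most one of them can be masked.
print(mask_sentence([["Giuseppe"], ["Con", "##te"], [","], ["Prime"], ["Minister"]]))
```

In the full system, the identifier of the original sub-token would also be kept as the label for the masked-token head, with all other positions set to the ignored label.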
In Table 1, we present a summary of the hyperparameters used for training the NER system. As can be seen, all the parameters, except for the number of epochs and the optimizer, follow those used in Devlin et al. [8].

Table 1: Hyperparameters used for training the proposed architecture.
Hyperparameter            Value
Maximum Epochs            20
Early Stop Patience       3
Learning Rate             2 × 10−5
Scheduler                 Linear with warm-up
Warm-up Ratio             0.1
Optimizer                 AdamW with bias correction
AdamW epsilon             1 × 10−8
Random Seed               12, 24, 58, 89
Dropout rate              0.1
Weight decay              0.01
Clipping gradient norm    1.0
BERT's Sequence Size      128
Masking Percentage        25%
Training Mini-Batch       32
Testing Mini-Batch        8

It should be noted that, unlike other works such as [8,28,19], where BERT's input was enriched either with surrounding sentences or with document context, our models take as input only the sentence that needs to be analyzed. Moreover, and in contrast with some BERT implementations, inputs surpassing BERT's token window size are split instead of truncated.6 The splitting consists of generating a new input sentence with the rest of the tokens; during prediction, the tokens are aligned to match the original input.

6 Some implementations disregard the tokens surpassing the token window or consider them to be of type Other.

We evaluate the output of the NER system using Seqeval7. With respect to the assessment of boundaries, we use Nervaluate8. This evaluation tool provides exact, a metric which determines how well the boundaries of the predicted named entities match those found in the gold standard, regardless of the type.

7 github.com/chakki-works/seqeval
8 github.com/ivyleavedtoadflax/nervaluate/
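A minimal sketch of this scoring step, shown with BIO tags for brevity (the models themselves are trained with BIOES). The nervaluate call follows the list loader of its 0.x interface; the exact return signature has changed across versions, so it should be treated as an assumption rather than a fixed API.

```python
from seqeval.metrics import classification_report, f1_score
from nervaluate import Evaluator

gold = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"], ["O", "B-ORG", "O"]]

# Entity-level micro F-score and per-type report (Seqeval).
print(f1_score(gold, pred))
print(classification_report(gold, pred))

# Boundary-only evaluation: the "exact" scheme checks whether predicted
# spans match gold spans while ignoring the entity type.
evaluator = Evaluator(gold, pred, tags=["PER", "LOC", "ORG"], loader="list")
results, results_per_tag = evaluator.evaluate()  # two values in nervaluate 0.x
print(results["exact"])
```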
6 Results and Discussion

We present in Table 2 and Table 3 the average and maximum results over five iterations, in terms of micro and macro F-score, for the prediction of named entities with the different combinations of methods proposed in this work. We also present results from the state of the art (only a selection of it in the case of English, where the list could be very long). It should be noted that the scores from the state of the art presented in Table 2 and Table 3 are not the product of multiple iterations, except for BERTBASE [8]. Likewise, the evaluation of Janes-NER on Babushka-Bench (see Table 2) does not consider errors in boundaries and calculates the macro F-score using the performance of the 5 named entity types plus the score obtained for predicting the Other type. Moreover, for Croatian and Slovene (Table 2), the number of named entity types predicted by each system might not be the same. This last issue is discussed in detail further ahead.

Table 2: Average values of micro and macro F-score for each experiment over five iterations on Croatian, Slovene and Finnish; the maximum value over the iterations is given between brackets. The best performance is in bold. Boundaries (B.), Uppercase (U.), Generated Uppercase (G.U.) and Masked (M.)
Model            Croatian Macro / Micro           Slovene Macro / Micro            Finnish Macro / Micro
FinBERT [19]     - / -                            - / -                            81.00 / 91.60
BiLSTM [19]      - / -                            - / -                            - / 81.50
Janes [10]       67.30 / -                        75.20 / -                        - / -
CSE [27]         88.40 / -                        92.00 / -                        - / -
Polyglot [1,2]   62.20 / -                        - / -                            - / -
CNERC [4,2]      65.40 / -                        - / -                            - / -
Baseline         87.35 (88.79) / 84.78 (86.32)    85.30 (86.95) / 89.93 (90.49)    81.35 (82.03) / 91.20 (91.76)
B.               87.09 (87.92) / 84.60 (86.08)    84.33 (86.33) / 89.45 (90.61)    80.78 (83.05) / 90.38 (90.88)
U.               87.64 (88.35) / 85.21 (86.90)    85.12 (86.70) / 89.39 (89.83)    80.41 (83.25) / 90.32 (91.26)
G.U.             88.08 (89.54) / 86.07 (88.12)    85.86 (86.74) / 89.54 (90.57)    82.41 (85.23) / 91.33 (91.82)
B.+U.            87.85 (88.53) / 85.91 (86.77)    84.23 (88.06) / 88.74 (90.68)    80.20 (82.61) / 90.32 (90.73)
B.+G.U.          87.46 (88.78) / 85.20 (87.55)    85.83 (86.67) / 89.56 (90.09)    82.14 (83.24) / 91.19 (91.82)
M.               87.50 (88.04) / 85.06 (85.54)    85.50 (86.97) / 90.56 (91.42)    80.20 (83.54) / 90.81 (91.50)
M.+B.            87.54 (88.21) / 85.46 (86.41)    84.21 (86.62) / 89.81 (90.68)    79.33 (81.52) / 90.86 (91.45)
M.+U.            88.20 (89.36) / 85.83 (87.30)    86.32 (87.79) / 90.53 (91.10)    79.81 (83.46) / 91.05 (92.09)
M.+G.U.          88.21 (89.36) / 86.03 (88.04)    86.46 (87.21) / 90.53 (90.72)    78.53 (81.93) / 90.89 (91.14)
M.+B.+U.         88.20 (89.05) / 85.90 (86.94)    86.26 (87.89) / 90.35 (91.40)    80.23 (82.09) / 91.27 (92.09)
M.+B.+G.U.       87.70 (89.33) / 85.64 (87.70)    87.18 (89.56) / 91.07 (91.90)    81.85 (85.66) / 90.87 (91.64)

Table 3: Average values of micro and macro F-score for each experiment over five iterations on Spanish and English; the maximum value over the iterations is given between brackets. The best performance is in bold. Boundaries (B.), Uppercase (U.), Generated Uppercase (G.U.) and Masked (M.)
Model              Spanish Macro / Micro            English Macro / Micro
BERT Base [8]      - / -                            - / 92.40
LUKE [31]          - / -                            - / 94.30
Seq2seq+BERT [23]  - / 88.80                        - / 92.90
NER Dep.Par. [32]  - / 90.30                        - / 93.50
Baseline           85.85 (86.55) / 88.06 (88.57)    90.21 (90.70) / 91.48 (91.82)
B.                 86.00 (87.05) / 88.20 (88.70)    90.00 (90.19) / 91.41 (91.51)
U.                 85.69 (87.33) / 87.81 (88.71)    89.97 (90.27) / 91.51 (91.82)
G.U.               86.53 (86.78) / 88.30 (88.82)    90.43 (90.97) / 91.89 (92.20)
B.+U.              85.32 (86.24) / 87.67 (88.14)    89.91 (90.20) / 91.39 (91.62)
B.+G.U.            85.76 (86.47) / 87.94 (88.52)    90.60 (90.77) / 91.93 (92.05)
M.                 87.19 (88.02) / 88.99 (89.40)    90.51 (90.72) / 91.86 (92.05)
M.+B.              87.18 (88.24) / 88.89 (89.51)    90.32 (90.60) / 91.72 (91.93)
M.+U.              87.27 (88.38) / 88.83 (89.56)    90.71 (90.96) / 92.17 (92.31)
M.+G.U.            86.75 (87.93) / 88.47 (89.43)    90.69 (90.89) / 92.11 (92.28)
M.+B.+U.           86.55 (87.64) / 88.32 (89.18)    90.54 (90.81) / 92.02 (92.27)
M.+B.+G.U.         86.62 (87.52) / 88.39 (89.22)    90.85 (91.27) / 92.22 (92.62)

We can observe in Table 2 and Table 3 that the prediction of masked tokens can improve, on average, the performance of the explored methods; the exception is Finnish, where the performance decreases. In fact, we can observe that for Finnish the difference between the maximum macro F-score and the average is larger, meaning that the model becomes less stable when predicting masked tokens. Nonetheless, we can achieve better micro F-score values for Finnish. Furthermore, by masking tokens, whether combined with other features or not, we can improve the average performance on Spanish CoNLL 2002. Nonetheless, our maximum score does not reach the performance presented by Yu et al. [32]. Although for English CoNLL 2003 we are still far from the current state of the art, and slightly worse than BERTBASE, it should be noted that we do not use any kind of document context as Devlin et al. [8] did. This might be a signal that we are forcing BERT to generalize better and to deal with smaller contexts.

It is unclear why the masking of tokens affected the stability of the less frequent named entities in Finnish. It could be that we did not mask enough tokens to force BERT to generalize in certain iterations. Or it might be related to the agglutinative character of Finnish, in which the masking of tokens severely affects key elements of the language needed to predict named entities correctly.
In Table 4, we present the results for the exact metric, in terms of micro and macro F-score, for each language. This metric evaluates how well a system predicts the boundaries of named entities regardless of the associated type. For Croatian, English and Spanish, we can observe in Table 4 that the prediction of entity boundaries is quite stable in general, both in terms of micro and macro F-score. It should be noted that the state-of-the-art micro F-scores for English CoNLL 2003 regarding the detection of boundaries are the following: BdryBot 95.90, BERT 96.90 and BdryBot+BERT 97.40 [13]. Regarding Slovene, we can notice that the prediction of masked tokens improves the exact metric; however, additionally predicting the boundaries seems to have no effect in general. For Finnish, the exact metric shows a stable performance in terms of micro F-score; nonetheless, in terms of macro F-score we can observe a decrease when we predict masked tokens.

Table 4: Average values of micro and macro F-score for the exact metric, which evaluates the correct prediction of entity boundaries regardless of their type, over five iterations. Boundaries (B.), Uppercase (U.), Generated Uppercase (G.U.) and Masked (M.)
Model        Croatian Macro / Micro   Slovene Macro / Micro   Finnish Macro / Micro   Spanish Macro / Micro   English Macro / Micro
Baseline     89.81 / 87.24            90.35 / 93.72           83.05 / 92.54           93.19 / 94.93           94.80 / 95.67
B.           89.88 / 87.41            90.52 / 93.92           83.62 / 92.27           93.24 / 94.96           94.66 / 95.66
U.           90.14 / 87.73            89.77 / 93.30           82.37 / 92.26           93.28 / 94.97           94.53 / 95.66
G.U.         90.50 / 88.56            90.84 / 93.68           84.43 / 92.86           93.65 / 95.09           94.99 / 96.00
B.+U.        90.54 / 88.62            89.84 / 93.50           82.98 / 92.31           92.73 / 94.65           94.70 / 95.76
B.+G.U.      90.04 / 87.88            92.01 / 94.54           84.53 / 92.66           93.29 / 95.02           95.06 / 95.97
M.           90.12 / 87.85            90.71 / 94.22           81.39 / 91.74           93.41 / 94.94           94.80 / 95.71
M.+B.        90.55 / 88.47            90.49 / 93.97           81.42 / 91.92           93.77 / 95.04           94.87 / 95.77
M.+U.        90.59 / 88.41            91.47 / 94.54           81.13 / 92.11           93.63 / 95.00           94.83 / 95.86
M.+G.U.      90.36 / 88.35            91.24 / 94.31           80.20 / 91.96           93.33 / 94.77           95.06 / 95.99
M.+B.+U.     90.58 / 88.44            91.68 / 94.68           81.82 / 92.33           93.32 / 94.86           94.87 / 95.89
M.+B.+G.U.   90.07 / 88.03            92.22 / 94.98           83.06 / 91.94           93.31 / 94.85           95.10 / 96.02

Another element to discuss regarding Table 4 is that, for languages such as English and Spanish, recognizing named entity boundaries can be considered an easy task, whereas for Croatian, and to a lesser degree Slovene and Finnish, it is much more difficult. Moreover, for Finnish the prediction of entity boundaries is harder to achieve for the less frequent types of named entities, as shown by the larger difference between the micro and macro F-scores. Furthermore, the results in Table 4 give us an idea of the maximum possible score that an NER system could achieve if the task consisted only of predicting the types of confirmed named entity boundaries. Along the same line, the results show that, although it is easy to find named entity boundaries in Spanish, predicting their type is much more difficult for that language compared to English or Croatian, if we compare the results shown in Table 2 and Table 3. In Croatian, despite the fact that the detection of entity boundaries is difficult, the prediction of their types lies within a range of around three points, while in Spanish it is around seven points. Therefore, we can deduce that, in order to improve the prediction of named entities in languages such as Croatian, it is necessary to focus primarily on the correct detection of boundaries, whereas for Spanish it is necessary to improve the prediction of types rather than entity boundaries. However, this last issue could also be a sign of discrepancies in the annotation, either of the training or of the testing dataset, something that is already known to occur in the English CoNLL 2003 corpus [29].

With respect to the marking of uppercase tokens, we can notice in Table 2 and Table 3 that we can improve the performance mainly in English and Croatian.
However, by generating random uppercase sentences during training, the marking of uppercase tokens can improve the performance in all the languages. In most cases, this also happens when the marking is combined with other methods, such as the prediction of masked tokens, especially in Slovene and English.

Although all the datasets contain a variable number of words written only in uppercase, there are two possible reasons why some languages benefited more than others. First, it may be the case that the number of uppercase tokens was not large enough to make BERT learn about the marking. Second, it may be related to the textual information that was used to train each BERT model. Nonetheless, BERT is indeed capable of learning the meaning and context of uppercase tokens if enough data is seen during training, as happened when we artificially generated uppercase sentences.

In Table 5, we present four examples of the prediction of named entities in uppercase sentences; the selected models are the best of each type. As shown in Table 5b, the prediction of entities does not become perfect when marking uppercase words, but it can definitely improve their recognition.

Table 5: Examples of predictions for uppercase sentences in different languages. All sentences, except for English, are partial (the models processed the full sentence) and were converted into uppercase before processing. Boundaries (B.), Uppercase (U.), Generated Uppercase (G.U.) and Masked (M.)

(a) English: Soccer - Leeds' Bowyer fined for part in fast-food fracas.
Tokens:      SOCCER - LEEDS ' BOWYER FINED FOR PART IN FAST-FOOD FRACAS .
Baseline:    O O S-PER O O O O O O O O O
M.+U.:       O O S-ORG O S-PER O O O O O O O
M.+B.+G.U.:  O O S-ORG O S-PER O O O O O O O
Gold-Std.:   O O S-ORG O S-PER O O O O O O O

(b) Finnish: John Kago, secretary of the Anglican Church of Kenya, said Thursday
Tokens:      KENIAN ANGLIKAANISEN KIRKON SIHTEERI JOHN KAGO SANOI
Baseline:    O O O O B-PER E-PER O
M.+U.:       O B-ORG I-ORG E-ORG O O O
M.+G.U.:     S-LOC B-ORG E-ORG O B-PER E-PER O
Gold-Std.:   B-ORG E-ORG E-ORG O B-PER E-PER O

(c) Croatian: Professor at USLA Stephen Hubbell
Tokens:      PROFESOR NA USLA STEPHEN HUBBELL
Baseline:    O O O B-PER E-PER
U.:          O O O B-PER E-PER
G.U.:        O O S-ORG B-PER E-PER
Gold-Std.:   O O S-ORG B-PER E-PER

(d) Spanish: The Competition Defense Court determines
Tokens:      EL TRIBUNAL DE DEFENSA DE LA COMPETENCIA DETERMINE
Baseline:    O O O O O O O O
M.+U.:       O S-ORG O O O O O O
G.U.:        O B-ORG I-ORG I-ORG I-ORG I-ORG E-ORG O
Gold-Std.:   O B-ORG I-ORG I-ORG I-ORG I-ORG E-ORG O

As indicated previously, the evaluation of NER systems on the Croatian and Slovene datasets is not standardized across state-of-the-art systems. The main reason is that some named entity types are either not found in the corpus or are disregarded due to their low frequency. Therefore, we present in Table 6 a recalculation of the macro F-scores, based on the three types of named entities that are common to the different NER systems from the state of the art.

Table 6: Best F-score values for the three common named entity types for the Croatian and Slovene systems. Boundaries (B.), Uppercase (U.), Generated Uppercase (G.U.) and Masked (M.)
System               Croatian PER / LOC / ORG / Macro F1   Slovene PER / LOC / ORG / Macro F1
CroSloEngual [27]    NA / NA / NA / 88.40                  NA / NA / NA / 92.00
Janes-NER [10]       89.00 / 85.00 / 72.00 / 82.00         89.00 / 80.00 / 67.00 / 78.60
Polyglot [1,2]       NA / NA / NA / 62.20                  - / - / - / -
Croatian NERC [4,2]  NA / NA / NA / 64.00                  - / - / - / -
Baseline             85.40 / 97.41 / 83.55 / 88.79         96.88 / 90.78 / 79.28 / 88.98
G.U.                 85.71 / 95.80 / 87.19 / 89.54         95.94 / 91.98 / 81.98 / 89.96
B.+G.U.              85.59 / 95.20 / 87.30 / 89.36         95.22 / 91.61 / 79.45 / 88.76
M.+B.+G.U.           87.47 / 94.92 / 85.60 / 89.33         95.34 / 92.79 / 84.68 / 90.93

With respect to Croatian, we can observe in Table 2 and Table 6 that we improve the results with respect to CroSloEngual, which is also based on BERT. Furthermore, our largest improvement with respect to Janes-NER [10] is on the prediction of named entities of type Location and Organization. For Slovene, we are not able to surpass the performance reported in [27].

Although it is common in the state of the art to present results for the Slovene and Croatian corpora only in terms of macro F-score, the lack of micro F-scores or of detailed results per class makes it difficult to perform a detailed comparison of the systems. First, the Croatian and Slovene corpora are not balanced; in other words, the number of entities for each class is not equal.
Therefore, the macro F-scores consider all the types of named entities equally important but disregard their frequency in the dataset. Thus, it is impossible to know whether systems such as CroSloEngual, Polyglot or the Croatian NERC focus on the most frequent classes or on the less frequent ones. For instance, we know that our Croatian NER system focuses on the less frequent class Location (117 occurrences) rather than on the most frequent ones (Person and Organization, with 228 and 365 occurrences respectively). However, our Slovene NER system focuses on the most frequent classes, Person and Location (with 257 and 210 occurrences respectively), rather than on the less frequent ones, Organization (112 occurrences) and Miscellaneous (47 occurrences).

Despite not surpassing the current state of the art in Slovene, it should be noted that we trained a system with four types of entities rather than three, as was done in the work of Ulčar and Robnik-Šikonja [27]. This aspect can introduce noise or increase the difficulty of the task, as the left-out named entity type, Miscellaneous, is the least frequent one. In this case, if we compare with Janes-NER [10], that system obtains an F-score for Miscellaneous of 27.00, while our masked baseline reaches at most 85.42.

Finally, with respect to Finnish, we were able to surpass, on average, the performance of the state of the art in terms of macro F-score, and in some iterations in terms of micro F-score. Based on the difference between micro and macro F-score presented in Table 2, we can determine that several of our systems focused slightly more on the less frequent classes in comparison to the work of [19]. This can be observed in detail in Table 7, where we improved the prediction of entities of type Event and Product, which are the least frequent classes (7 and 79 instances in the test set, respectively), at the cost of reducing the correct prediction of a more frequent class, i.e. Organization (208 instances). We can also observe that macro F-scores can vary more than micro F-score values depending on the seed used during training. Despite this, we can see that we improve the prediction of entities without having to add supplementary sentences to increase the context, as Luoma et al. [19] did.

Table 7: Results, in terms of F-score, for each entity type on the Finnish corpus, together with the associated macro and micro F-scores; between brackets we indicate the seed that produced these values. Boundaries (B.), Uppercase (U.), Generated Uppercase (G.U.) and Masked (M.)
Model            PER    LOC    ORG    DATE   EVT    PROD   Macro F1  Micro F1
FinBERT [19]     95.20  94.70  90.20  96.80  43.50  65.80  81.00     91.60
Baseline (12)    94.29  94.34  89.22  95.69  50.00  68.61  82.03     91.42
Baseline (58)    94.42  95.96  88.63  96.55  47.06  69.12  81.95     91.76
M.+B.+G.U. (24)  95.19  92.99  89.05  96.07  66.67  73.97  85.66     91.64
M.+B.+G.U. (12)  95.27  94.16  90.43  95.28  42.86  74.13  82.02     92.09
M.+U. (89)       95.06  93.70  90.20  96.10  50.00  75.71  83.46     92.09

7 Conclusions and future work

Named Entity Recognition (NER) is a task that aims to extract and classify groups of tokens referring to specific types such as locations, persons and organizations. In the last couple of years, with the creation of BERT [8], multiple NER systems have made use of its architecture to provide high-performing tools. Nonetheless, we observed that this kind of system can face some issues, such as the poor prediction of uppercase sentences, the incorrect detection of entity boundaries and low generalization.
Therefore, in this work, we presented three different methods that could alleviate these issues. Experiments were done over five languages, three of them low-resourced. We improved the state of the art with a macro F-score of up to 89.54 in Croatian by marking uppercase tokens and generating uppercase sentences during training. By marking uppercase tokens and predicting boundaries and masked tokens, we managed to improve over BERTBASE with an F-score of up to 92.62 in English, while obtaining the second-best performance in Spanish with an F-score of up to 89.56. In Finnish, we improved, on average, the prediction of the less frequent named entity types, with a macro F-score of 82.41 versus 81.00 in the state of the art, while reaching a micro F-score of up to 92.09 versus 91.60. We also provide an NER system for Slovene that predicts 4 types of named entities, one of which is infrequent, with results comparable to those of another state-of-the-art tool that only predicts the three most frequent types.

Furthermore, we observed that in Croatian the prediction of named entity boundaries is a bottleneck for NER systems, while in Spanish it seems easy to find the boundaries of named entities but much harder to determine their type. Finally, we proposed a simple method that can improve the prediction of named entities in sentences written entirely in uppercase.

In the future, we intend to experiment with additional languages. We would also like to assess whether adding some context to the left of the split sentences could improve the performance of the NER system.

Acknowledgments

This work was supported by the European Union's Horizon 2020 research and innovation programme under grants 770299 (NewsEye) and 825153 (Embeddia).

References

1. Al-Rfou, R., Kulkarni, V., Perozzi, B., Skiena, S.: POLYGLOT-NER: Massive Multilingual Named Entity Recognition. CoRR abs/1410.3791 (2014), http://arxiv.org/abs/1410.3791
2. Alves, D., Thakkar, G., Tadić, M.: Evaluating Language Tools for Fifteen EU-official Under-resourced Languages. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 1866–1873. European Language Resources Association, Marseille, France (May 2020)
3. Baldini Soares, L., FitzGerald, N., Ling, J., Kwiatkowski, T.: Matching the Blanks: Distributional Similarity for Relation Learning. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 2895–2905. Association for Computational Linguistics, Florence, Italy (Jul 2019). https://doi.org/10.18653/v1/P19-1279
4. Bekavac, B., Tadić, M.: Implementation of Croatian NERC System. In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing. pp. 11–18. Association for Computational Linguistics, Prague, Czech Republic (Jun 2007), https://www.aclweb.org/anthology/W07-1702
5. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
6. Cao, J., Wang, G., Li, C., Ren, H., Cai, Y., Wong, R.C.W., Li, Q.: Incorporating Boundary and Category Feature for Nested Named Entity Recognition. In: Nah, Y., Cui, B., Lee, S.W., Yu, J.X., Moon, Y.S., Whang, S.E. (eds.) Database Systems for Advanced Applications. pp. 209–226. Springer International Publishing, Cham (2020)
7. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: PML4DC at ICLR 2020 (2020)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423
9. Dozat, T., Manning, C.D.: Deep Biaffine Attention for Neural Dependency Parsing. CoRR abs/1611.01734 (2016), http://arxiv.org/abs/1611.01734
10. Fišer, D., Ljubešić, N., Erjavec, T.: The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation 54(1), 223–246 (Mar 2020). https://doi.org/10.1007/s10579-018-9425-z
11. Krek, S., Dobrovoljc, K., Erjavec, T., Može, S., Ledinek, N., Holz, N., Zupan, K., Gantar, P., Kuzman, T., Čibej, J., Arhar Holdt, Š., Kavčič, T., Škrjanec, I., Marko, D., Jezeršek, L., Zajc, A.: Training corpus ssj500k 2.2 (2019), http://hdl.handle.net/11356/1210, Slovenian language resource repository CLARIN.SI
12. Li, J., Sun, A., Han, J., Li, C.: A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions on Knowledge and Data Engineering pp. 1–1 (2020)
13. Li, J., Sun, A., Ma, Y.: Neural Named Entity Boundary Detection. IEEE Transactions on Knowledge and Data Engineering pp. 1–1 (2020)
14. Lin, B.Y., Lee, D.H., Shen, M., Moreno, R., Huang, X., Shiralkar, P., Ren, X.: TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 8503–8511. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.752
15. Lin, H., Lu, Y., Tang, J., Han, X., Sun, L., Wei, Z., Yuan, N.J.: A Rigorous Study on Named Entity Recognition: Can Fine-tuning Pretrained Model Lead to the Promised Land? In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 7291–7300. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.592
16. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019)
17. Ljubešić, N., Erjavec, T.: Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). pp. 1527–1531. European Language Resources Association (ELRA), Portorož, Slovenia (May 2016), https://www.aclweb.org/anthology/L16-1242
18. Ljubešić, N., Klubička, F., Agić, Ž., Jazbec, I.P.: New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). pp. 4264–4270. European Language Resources Association (ELRA), Portorož, Slovenia (May 2016), https://www.aclweb.org/anthology/L16-1676
19. Luoma, J., Oinonen, M., Pyykönen, M., Laippala, V., Pyysalo, S.: A Broad-coverage Corpus for Finnish Named Entity Recognition. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 4615–4624. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.567
20. Ma, X., Hovy, E.: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1064–1074. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1101
21. Mosbach, M., Andriushchenko, M., Klakow, D.: On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines (2020), https://arxiv.org/abs/2006.04884
22. Powalski, R., Stanislawek, T.: UniCase – Rethinking Casing in Language Models (2020), arXiv:2010.11936
23. Straková, J., Straka, M., Hajic, J.: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5326–5331. Association for Computational Linguistics, Florence, Italy (Jul 2019). https://doi.org/10.18653/v1/P19-1527
24. Sun, L., Hashimoto, K., Yin, W., Asai, A., Li, J., Yu, P., Xiong, C.: Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT. arXiv preprint arXiv:2003.04985 (2020)
25. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In: COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002) (2002), https://www.aclweb.org/anthology/W02-2024
26. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. pp. 142–147 (2003), https://www.aclweb.org/anthology/W03-0419
27. Ulčar, M., Robnik-Šikonja, M.: FinEst BERT and CroSloEngual BERT. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds.) Text, Speech, and Dialogue. pp. 104–111. Springer International Publishing, Cham (2020)
28. Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., Ginter, F., Pyysalo, S.: Multilingual is not enough: BERT for Finnish (2019), https://arxiv.org/abs/1912.07076
29. Wang, Z., Shang, J., Liu, L., Lu, L., Liu, J., Han, J.: CrossWeigh: Training Named Entity Tagger from Imperfect Annotations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 5154–5163. Association for Computational Linguistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/D19-1519
30. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., Platen, P.v., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771 (2019)
31. Yamada, I., Asai, A., Shindo, H., Takeda, H., Matsumoto, Y.: LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 6442–6454. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.523
32. Yu, J., Bohnet, B., Poesio, M.: Named Entity Recognition as Dependency Parsing.
In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 6470–6476. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.577