Exploring Transformers for Multilingual Historical Named Entity Recognition⋆

Anja Ryser1,†, Quynh Anh Nguyen1,2,†, Niclas Bodenmann1,† and Shih-Yun Chen1,3,†
1 University of Zurich, Rämistrasse 21, 8006 Zürich, Switzerland
2 University of Milan, Via Festa del Perdono 7, 20122 Milano MI, Italy
3 Zurich University of Applied Sciences, Gertrudstrasse 15, 8401 Winterthur, Switzerland

Abstract
This paper explores the performance of out-of-the-box transformer language models for historical Named Entity Recognition (NER). Within the HIPE2022 (Identifying Historical People, Places, and other Entities) shared task, we participated in the NER-COARSE task of the Multilingual Newspaper Challenge (MNC). We experiment with three main approaches: ensembling techniques over multiple fine-tuned models, multilingual pretrained models, and relabeling the entity tags from the IOB segmentation to a simplified version. By ensembling predictions from different system outputs, we outperformed the baseline model in the majority of cases. Moreover, in post-submission experiments we found that multilingual models did not yield better results than monolingual models. Furthermore, a relabeling experiment on the Newseye French dataset showed that merging entity labels and inferring the IOB segmentation in postprocessing increases precision but lowers recall. Finally, soft-label ensembling experiments on the same dataset improved precision, recall, and thus F1-scores over hard-label ensembling by at least one percentage point.

Keywords
Named Entity Recognition, Historical Newspaper, HIPE2022, Transfer Learning, Transformers, Multilingual Models

1. Introduction
Named Entity Recognition (NER) on historical newspaper text is a task with many pitfalls. Differences in language, its use, and the world it refers to, as well as technical artifacts, make models that perform well on contemporary texts perform significantly worse on historical texts. With our contribution to the HIPE2022 Shared Task, we explore the performance of transformer architectures pretrained on historical and contemporary data and available via HuggingFace [1]. We combine these models with task-specific knowledge in pre- and postprocessing, and in post-submission experiments we further investigate predicting only entity categories (without IOB encoding), soft-label ensembling, and multilingual language models.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 05–08, 2022, Bologna, Italy
⋆ Named Entity Recognition in historical newspapers
† These authors contributed equally.
anja.ryser@uzh.ch (A. Ryser); quynhanh.nguyen@uzh.ch (Q. A. Nguyen); niclaslinus.bodenmann@uzh.ch (N. Bodenmann); shih-yun.chen@uzh.ch (S. Chen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. Related Work
The Transformer [2] has rapidly become the dominant architecture for natural language processing, surpassing alternative neural models such as convolutional and recurrent neural networks on tasks in both natural language understanding and natural language generation [3]. The Transformer architecture is particularly well suited to pretraining on large text corpora, which leads to major gains in accuracy on downstream tasks [3].
As a result, the release of pretrained contextualised word embeddings such as BERT [4] pushed the upper bound of modern NER performance further and established state-of-the-art results for modern NER [5, 6].

In the HIPE2020 shared task, several top solutions were built on pretrained language model embeddings with transformer-based architectures. Ghannay et al. [7] achieved the second-best result for French with an 81% F1-score in the strict scenario by using CamemBERT [8], a multi-layer bidirectional transformer similar to RoBERTa [9], together with a CRF decoder. Todorov et al. [10] implemented an architecture consisting of a modular embedding layer, which combined newly trained and pretrained embeddings, and a task-specific Bi-LSTM-CRF layer to handle NERC on coarse and fine-grained tags. They conclude that character-level embeddings, BERT, and a document-level data split are the most important factors in improving NER results. Their experiments also show that pretrained language models can be beneficial for NERC on low-resource historical corpora. Provatorova et al. [11] fine-tuned two pretrained BERT models [12]: bert-base-cased for English and bert-base-multilingual-cased for French and German. To enhance the robustness of the approach, a majority-voting ensemble of five fine-tuned model instances was used per language. Their models achieved F1-scores of 68%, 52% and 47% for French, German and English, respectively. Section 4.2 describes how we employed multiple fine-tuned models, exploited ensembling techniques, and applied an entity-relabeling method.

3. Task and Datasets
We worked on coarse NER in digitized historical newspapers across different label sets and languages. More detailed information on this task can be found on the HIPE2022 website. NER on historical newspapers poses its own unique challenges: non-standard language with old lexicon and syntax, errors from digitization such as layout and optical character recognition (OCR) errors, and the lack of resources for training [5]. We used part of the data provided by the organizers of this task, namely five datasets of historical newspapers in English, German, French, Swedish and Finnish, spanning the 18th to the 20th century. The data contains newspapers digitized through different European cultural heritage projects. While most of the data was published before HIPE2022, some unpublished parts of the datasets were used as test sets. Each dataset is annotated following different annotation guidelines and contains NER tags and NEL links to Wikidata. All datasets were provided in the HIPE format [13]. Table 1 presents an overview of the historical newspaper datasets of HIPE2022 used in our experiments.

Resources We train our models on Google Colab with GPU enabled.

Table 1
Description of datasets contained in the HIPE2022 data
dataset       languages        comments
HIPE2020      de, en, fr       19-20C
Newseye       de, fi, fr, sv   19-20C
Topres19th    en               19C, only location types
Sonar         de               19-20C
Letemps       fr               19-20C, unpublished

4. Methods
4.1. Data Preprocessing
We use a simple approach to preprocess the data. Lines with erroneous characters, empty lines, and lines containing metadata are removed while reading the tab-separated values (TSV) files. NaN values are filled with empty strings to keep the data structure intact. Tokens are split into sentences using the EndOfSentence tag provided in the data; a minimal sketch of these steps is given below.
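The following Python sketch illustrates this preprocessing. It assumes the standard HIPE column names (TOKEN, NE-COARSE-LIT, MISC), metadata lines beginning with '#', and an EndOfSentence flag in the MISC column; it is an illustration of the described steps rather than the exact submission code.

    import csv
    import pandas as pd

    def read_hipe_tsv(path):
        """Read a HIPE-format TSV file and split tokens into sentences.

        Assumes a TOKEN column, a NE-COARSE-LIT label column, and a MISC column
        whose value contains 'EndOfSentence' at sentence boundaries; lines
        starting with '#' carry metadata and are skipped.
        """
        df = pd.read_csv(path, sep="\t", comment="#", quoting=csv.QUOTE_NONE,
                         skip_blank_lines=True, on_bad_lines="skip")
        df = df.fillna("")                         # keep the data structure intact
        sentences, labels, cur_toks, cur_tags = [], [], [], []
        for _, row in df.iterrows():
            cur_toks.append(str(row["TOKEN"]))
            cur_tags.append(str(row["NE-COARSE-LIT"]) or "O")
            if "EndOfSentence" in str(row["MISC"]):    # sentence boundary flag
                sentences.append(cur_toks)
                labels.append(cur_tags)
                cur_toks, cur_tags = [], []
        if cur_toks:                               # trailing sentence without a flag
            sentences.append(cur_toks)
            labels.append(cur_tags)
        return sentences, labels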
The data is tokenized with the corresponding transformer model's tokenizer, without any additional fine-tuning of the tokenizer on our data.

4.2. Training Models
We employed a variety of models and pretrained weights across the different datasets. We distinguish between models pretrained on historical data (historical language models, HLMs) and models pretrained on contemporary data. The HLMs mainly come from a single source, the Bavarian State Library [14]. We expect the HLMs to have already learned to deal with errors stemming from OCR, as such errors are prevalent in most historical datasets. We mainly use BERT-based models [12, 15] but also experimented with XLNet [16] and ELECTRA [17]. Table 7 gives a full overview of the pretrained models used. In the submitted run 1, all listed models were used for ensembling. In run 2, the results of the single best model were submitted (marked in bold in the table). The models are instantiated with the standard token classification heads from HuggingFace [18, p. 98].

Training Parameters We fine-tune the models with the parameters provided with the pretrained models; where these were not set, the default values of the HuggingFace TrainingArguments were used. All sentences were padded or truncated to a maximum length of 100 tokens. Because of the number of models we set out to deploy, we did not run a hyperparameter search. An initial experiment with label weights did not improve performance, and we returned to the defaults. We trained all models for three epochs.

4.3. Evaluation Metrics
The evaluation metrics used for the NER tasks in HIPE2022 are precision, recall, and F1-score on the macro and micro level. The same metrics are used to assess our systems. F1-macro scores are computed on the document level and F1-micro scores on the entity-type level. More precisely, macro averages the corresponding micro scores across all documents, accounting for variance in document length but not for class imbalance. Additionally, performance is measured in a strict and a fuzzy setting. In the strict scenario, a mention only counts as correct when the exact gold-standard boundaries are met, whereas in the fuzzy evaluation only part of the mention needs to be recognized correctly. The strict measurement therefore punishes wrong boundaries severely: if a mention is recognized but one boundary is set wrongly, the whole entity is counted as false [13].

4.4. Inference
Single Models The single models we employ are initialized with a token classification head provided by HuggingFace. This is a linear mapping from the last encoder state to the output layer, so that for every token there are as many logits as labels in the dataset. Because the gold labels are on whole words while the models operate on subwords, we need a non-trivial mapping regime. In preprocessing, the label of the whole word is propagated down to all of its subwords. During inference, all subword logits belonging to a single word are summed up, and the label with the highest score is chosen for the whole word. Ács et al. [19] evaluate different pooling strategies for subword aggregation. While they tend towards neural solutions such as an additional LSTM over the subword logits, they note that the pooling strategy has a smaller influence on NER than on morphological tasks such as POS tagging. Still, more advanced subword pooling strategies remain to be explored.
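The following sketch illustrates this word-level aggregation. It assumes a fast HuggingFace tokenizer so that word_ids() is available; the commented checkpoint name is only an example, and the snippet is a simplification of the inference procedure described above.

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    def predict_word_labels(words, model, tokenizer, id2label, max_length=100):
        """Sum the subword logits of each word and take the argmax as its label."""
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt",
                        truncation=True, max_length=max_length)
        with torch.no_grad():
            logits = model(**enc).logits[0]         # shape: (num_subwords, num_labels)
        word_ids = enc.word_ids(0)                  # subword -> word index (None for specials)
        pooled = torch.zeros(len(words), logits.size(-1))
        for sub_idx, w_idx in enumerate(word_ids):
            if w_idx is not None:
                pooled[w_idx] += logits[sub_idx]    # sum all subword logits of one word
        # words cut off by truncation keep an all-zero row and thus a default label here
        return [id2label[i] for i in pooled.argmax(dim=-1).tolist()]

    # usage with an example checkpoint and a label map defined elsewhere:
    # tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-french-europeana-cased")
    # model = AutoModelForTokenClassification.from_pretrained(
    #     "dbmdz/bert-base-french-europeana-cased", num_labels=len(id2label))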
Ensembling For each dataset, the predictions of all models were gathered and the final label was chosen through a hard-ensembling method: the final prediction is the label with the most votes. In a tie between different labels, entity labels were favored, and ties between different entity labels were broken at random.

Postprocessing We employed only one postprocessing rule for the shared task submission: if a token receives a label prediction starting with I (inside) but is not preceded by an I or a B (beginning) label, it is changed to a B. Erroneously, we did not take the label class into account; this was remedied in the post-submission experiments.

5. Post-Submission Experiments
This section introduces the three approaches we experimented with after the submission. We focused on Newseye French for the monolingual approaches and on all Newseye languages for the multilingual approach.

Motivation We tried to improve on the submitted results. For better comparability between the post-submission approaches, and due to time constraints, we decided to focus on one dataset and one language for the monolingual experiments. We saw the most potential for improvement in the Newseye French dataset. For the multilingual approach, we therefore used all available languages of the Newseye dataset. Our goal was to beat the baseline provided by the task organizers.

5.1. Soft-Label Ensembling
For the submission, we employed hard-label ensembling ('voting'). In this post-submission experiment, we evaluate the performance of soft-label ensembling on the Newseye French dataset. To infer the final label for a token, we average the individual models' whole-word probabilities (softmaxed logits). We follow Ju et al. [19], who argue for averaging softmaxed logits because the raw logits of different models might differ in magnitude. This is expected in our case, as each model uses its own subword tokenization and might therefore sum over a different number of subwords for the logits of a whole word. The same models as for the submission are used, presented in Table 7. Both ensembling variants are sketched below.
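The sketch below summarizes the two ensembling variants, operating on per-word label predictions and per-word probability vectors; it is a simplified illustration of the voting and averaging rules described above, not the exact implementation used for the runs. The id2label mapping is assumed to be defined by the caller.

    import random
    from collections import Counter

    import numpy as np

    def hard_label_ensemble(predictions):
        """Majority vote over per-word label predictions from several models.

        `predictions` is a list (one entry per model) of equal-length label lists.
        Ties are broken in favour of entity labels; ties between entity labels
        are resolved at random.
        """
        final = []
        for votes in zip(*predictions):
            counts = Counter(votes)
            top = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == top]
            entity_candidates = [lab for lab in candidates if lab != "O"]
            final.append(random.choice(entity_candidates or candidates))
        return final

    def soft_label_ensemble(prob_matrices, id2label):
        """Average softmaxed per-word probabilities over models, then take the argmax.

        `prob_matrices` is a list of arrays of shape (num_words, num_labels),
        each already softmax-normalised per word.
        """
        mean_probs = np.mean(np.stack(prob_matrices), axis=0)
        return [id2label[i] for i in mean_probs.argmax(axis=-1)]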
5.2. Multilingual Models
As shown in previous work, NLP tasks can benefit from cross-lingual transfer learning and multilingual models, which provide more training data for a single model [20]. To test this, we used the same methods as described in Section 4 with different multilingual BERT models (for details, see Table 7) and Newseye data in all four available languages. In addition, we tested the single best model and the hard-label ensemble as described in the paragraph 'Ensembling' in Section 4.4. The first trained model received its input sorted by language, which we assume could lead to catastrophic forgetting of the languages seen first. To avoid this, the sentences were shuffled before being fed in batches to the model during fine-tuning.

5.3. Relabeling
The error analysis in Section 7.2 showed that one of the most frequent errors is Right classification, wrong segmentation, i.e., the model predicts I-LOC while the ground truth is B-LOC. We assume this is because the models are fine-tuned on the whole label set, where B-tags and I-tags are handled as two different labels. We chose the Newseye French dataset to train the model used for this approach. All nine classes of the dataset, ['O', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-PER', 'I-PER', 'B-HumanProd', 'I-HumanProd'], are relabeled into the five entity labels ['O', 'ORG', 'LOC', 'PER', 'HumanProd']. The text and the corresponding new label of each word are then used as input to the training pipeline, which produces predictions over the simplified labels. The IOB tagging is reconstructed in postprocessing, as sketched below. Table 13 shows the detailed results.
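The sketch below illustrates the label simplification and one way of reconstructing the IOB segmentation in postprocessing. The reconstruction rule shown (a new entity starts whenever the predicted class changes or follows O) is a plausible reading of this step rather than the exact rule used; note that it cannot separate two adjacent mentions of the same class, which is one way such a reconstruction can trade recall for precision.

    def strip_iob(tags):
        """Collapse IOB tags to plain entity classes, e.g. 'B-LOC'/'I-LOC' -> 'LOC'."""
        return [t.split("-", 1)[1] if "-" in t else t for t in tags]

    def restore_iob(classes):
        """Re-infer the IOB segmentation from a sequence of plain entity classes.

        A token opens a new entity (B-) when its class differs from the previous
        token's class; consecutive tokens of the same class continue it (I-).
        """
        tags, prev = [], "O"
        for c in classes:
            if c == "O":
                tags.append("O")
            elif c == prev:
                tags.append("I-" + c)
            else:
                tags.append("B-" + c)
            prev = c
        return tags

    print(restore_iob(["O", "LOC", "LOC", "O", "PER"]))
    # -> ['O', 'B-LOC', 'I-LOC', 'O', 'B-PER']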
Table 2
F1-scores of the micro-strict evaluation of the submitted ensembling system (Run 1)
dataset       en      fr      de      sv      fi      avg.
HIPE2020      0.513   0.678   0.725   -       -       0.639
Newseye       -       0.648   0.395   0.643   0.567   0.563
Topres19th    0.787   -       -       -       -       0.787
Sonar         -       -       0.490   -       -       0.490
Letemps       -       0.644   -       -       -       0.644

Table 3
F1-scores of the micro-strict evaluation of the submitted best single model (Run 2)
dataset       en      fr      de      sv      fi      avg.
HIPE2020      x       0.696   0.695   -       -       0.696
Newseye       -       0.656   0.408   0.636   0.556   0.564
Topres19th    0.781   -       -       -       -       0.781
Sonar         -       -       0.477   -       -       0.477
Letemps       -       0.622   -       -       -       0.622

6. Results
To keep the results section concise, we focus on the micro-strict F1-score. Of all the measurements provided by the task organizers, micro-strict is the most punishing and results in the lowest scores. For more detailed results, see Tables 8 to 13 in the Appendix.

6.1. Submission
Tables 2 and 3 show the F1-scores over all labels for both submitted systems. 'avg.' is the average over all languages in each dataset. For each language, and for the average over all available languages, the better of the two runs is marked in bold.

6.2. Post-Submission Experiments
Soft-Label Ensembling All scores benefit from switching from hard-label to soft-label ensembling by at least one percentage point, from an average F1-score of 0.7 to 0.8 (see Table 4). However, our best run on the test set from the submission was not the (hard-label) ensembled model but the single model with the best scores on the validation set. With regard to micro-F1 strict and fuzzy, the soft-label ensembled model is on par with the best individual model.

Table 4
Scores for all labels on Newseye French. 'Best Model (Run 2)' are the predictions of the individual model with the best validation scores; the other two rows are different ensembling strategies over all models. Each cell reports strict / fuzzy.
                      Micro-P         Micro-R         Micro-F1        Macro-P         Macro-R         Macro-F1
Hard-Label (Run 1)    0.673 / 0.801   0.625 / 0.744   0.648 / 0.772   0.659 / 0.814   0.614 / 0.762   0.630 / 0.779
Best Model (Run 2)    0.655 / 0.785   0.657 / 0.787   0.656 / 0.786   0.630 / 0.775   0.623 / 0.777   0.621 / 0.766
Soft-Label            0.685 / 0.818   0.636 / 0.758   0.659 / 0.787   0.677 / 0.829   0.630 / 0.771   0.649 / 0.793

Multilingual Models For this experiment, the test sets of the different languages were labeled and evaluated separately. The results shown in Table 5 are averaged over all languages. 'submission ensembling' and 'submission best model' contain the results we handed in for the submission, trained and ensembled monolingually and averaged over all languages. 'best model multilingual' is the best of the five models used for ensembling. 'ensemble multilingual' is a multilingually trained system whose five sets of predictions were then run through the ensembling process.

Table 5
Micro-strict scores averaged over all Newseye test sets (de, fr, fi, sv) for the experiments with multilingual models
System                    Precision   Recall   F1
submission ensembling     0.70        0.65     0.67
submission best model     0.54        0.52     0.56
best model multilingual   0.62        0.55     0.58
ensemble multilingual     0.63        0.53     0.57

The results of the two runs of our submission show that ensembling improved performance over all languages and performed better than the single models. In the multilingual experiments, the best single model performs slightly better than the ensemble and beats the monolingual best model. Overall, the monolingual ensembling yielded the best results. We assume the multilingual ensembling results could be improved by excluding or replacing the worst-performing model used in the ensemble. Table 12 in the Appendix shows more detailed results.

Relabeling Table 13 compares the models with and without the relabeling method. Relabeling generally improves precision scores by around 1 to 2 percentage points: while precision improves, recall decreases slightly compared to the models without relabeling, so the F1-scores remain similar in both conditions. Relabeling also leads to a marginal improvement of model 4, the pretrained and fine-tuned French Europeana ELECTRA model: what stands out in Table 13 is that, for this model, all considered metrics rise uniformly by around 0.3 to 3 percentage points with relabeling.

Table 6
Comparison of micro-F1 for HIPE2020 German between the BERT-base LM trained on historical data and the BERT-base LM trained on contemporary data. Each cell reports strict / fuzzy.
                ALL             LOC             ORG             PERS            PROD            TIME
Historical      0.702 / 0.805   0.814 / 0.870   0.441 / 0.572   0.690 / 0.841   0.356 / 0.525   0.596 / 0.808
Contemporary    0.657 / 0.778   0.773 / 0.839   0.454 / 0.535   0.574 / 0.778   0.418 / 0.636   0.630 / 0.804

7. Discussion
Tables 2 and 3 show that the systems fine-tuned on the German Newseye and Sonar corpora perform worse than those fine-tuned on the German HIPE2020 dataset. This could be because the datasets differ significantly in size: Sonar and Newseye are much smaller than HIPE2020, so the poor performance could stem from overfitting. We also looked at each dataset's best- and worst-performing label and report them with their F1-scores in Tables 8 to 11.

The evaluation (Tables 8 to 11) reveals that both the best model and the ensembled models recognize the LOC, PER, and TIME labels better across all datasets and measurements. These labels dominate the corresponding datasets, and hard-label ensembling ('voting') reflects this preference when reassigning labels to tokens. In contrast, ORG and the other minor categories are generally handled worse: a system is less likely to predict them, and in many cases voting overrode these labels in favor of a more frequently predicted label. For example, suppose the gold label is ORG and one model predicts this correct but infrequent label while the other three predict the more likely O; due to majority voting, the correct guess is overruled and the incorrect label is chosen.

7.1. Comparison of Contemporary and Historical BERT
Table 6 shows the micro-F1 results for HIPE2020 German for the BERT-based model trained on historical data and the BERT-based model trained on contemporary data. The HIPE2020 data stems from historical newspapers of the 19th and 20th centuries, as does the pretraining data of the BERT-based HLM. Accordingly, the BERT-based model trained on historical data outperforms the model trained on contemporary data on the overall result (column ALL in Table 6). This outcome is expected, since the pretraining data covers the historical domain required by the task.
7.2. Error Analysis
To better understand the errors our models make, we conducted an error analysis of the models trained on the Newseye dataset. We compare the NER results of the hard-label ensembling models on the Newseye French test set (the best-performing setting) and the Newseye German test set (the worst-performing setting) to the gold-standard data. The five major error types are as follows (one possible operationalisation is sketched after the figure captions):
• Right classification, wrong segmentation: e.g. B-PER vs. I-PER.
• Wrong classification, right segmentation: e.g. B-LOC vs. B-PER.
• Wrong classification, wrong segmentation: the model predicted a different NE than the annotated data, e.g. B-LOC vs. I-PER.
• Complete false positive: all tokens of a predicted entity are labeled O in the gold standard.
• Complete false negative: all tokens of an entity in the gold standard are predicted as O.
Figures 1 and 2 visualize the results of the error analysis. Besides the Complete false positive and Complete false negative errors, the two most frequent errors are Right classification, wrong segmentation and Wrong classification, wrong segmentation.
Figure 1: NER performance of the hard-label ensembling model on the Newseye French test set.
Figure 2: NER performance of the hard-label ensembling model on the Newseye German test set.
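The following sketch gives one possible operationalisation of these categories, counting token-level confusions between non-O tags and entity-level complete false positives and negatives; the exact counting behind Figures 1 and 2 may differ in detail.

    from collections import Counter

    def entity_spans(tags):
        """Return lists of token indices, one list per entity, from an IOB sequence."""
        spans, current = [], []
        for i, tag in enumerate(tags):
            if tag == "O":
                if current:
                    spans.append(current)
                    current = []
            elif tag.startswith("B-") and current:
                spans.append(current)      # close the previous entity, open a new one
                current = [i]
            else:                          # I- continuation or entity-initial token
                if current:
                    current.append(i)
                else:
                    current = [i]
        if current:
            spans.append(current)
        return spans

    def categorise_errors(gold, pred):
        """Count the five error types above (one possible reading of them)."""
        errors = Counter()
        for g, p in zip(gold, pred):                   # token-level confusions
            if g == p or g == "O" or p == "O":
                continue
            g_pre, g_cls = g.split("-", 1)
            p_pre, p_cls = p.split("-", 1)
            if g_cls == p_cls:
                errors["right classification, wrong segmentation"] += 1
            elif g_pre == p_pre:
                errors["wrong classification, right segmentation"] += 1
            else:
                errors["wrong classification, wrong segmentation"] += 1
        for span in entity_spans(pred):                # entity-level errors
            if all(gold[i] == "O" for i in span):
                errors["complete false positive"] += 1
        for span in entity_spans(gold):
            if all(pred[i] == "O" for i in span):
                errors["complete false negative"] += 1
        return errors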
The most common errors appear to follow a pattern: they usually arise from an incorrect decision on the beginning token of an entity. In the wrong-segmentation errors, our model occasionally labels only the names themselves and omits their titles, because it fails to recognize the beginning tokens of location or personal titles such as 'café' or 'v' '.'. Subword tokenization also causes segmentation errors, particularly in the French NER task: the model typically only assigns entity labels to tokens that do not contain the hyphenation mark '¬'. Once the beginning token is classified incorrectly, the remaining sequence of tokens follows the incorrect classification. This error occurs in arbitrary sentences: our model tends to start the NE with a non-location or non-personal noun, a punctuation mark, or an article, and the following tokens then adopt the same incorrect classification. The following examples illustrate the two most common errors.

Right classification, wrong segmentation The French gold standard labels 'café' 'Mollard' as 'B-LOC' 'I-LOC', whereas our model labels them 'O' 'I-LOC': without labeling 'café', our model treats 'Mollard' as a beginning token. In the German gold standard, the tokens of the German family name 'v' '.' 'Plener' are labeled 'B-PER' 'I-PER' 'I-PER'. They occur several times in the dataset, but not all occurrences are recognized correctly. Although our model adequately recognizes the first two occurrences of the three tokens, it does not consistently learn the entity pattern, so there are also errors such as just 'Plener' as 'B-PER' or 'v' as 'B-PER'. Tokens split at line breaks, for instance 'Rau¬' 'court', are also challenging: our model predicts 'O' 'B-LOC' instead of 'B-LOC' 'I-LOC' as in the French gold standard. Other examples include 'Ro¬' 'mans' (gold: 'I-PER' 'I-PER'), 'Pierre¬' 'Vaast' (gold: 'I-LOC' 'I-LOC') and 'AI¬' 'bert' (gold: 'I-PER' 'I-PER'), in all of which our model fails to recognize the tokens containing '¬' and assigns 'O' to them.

Wrong classification, wrong segmentation For an inanimate French noun like 'matinée' ('morning'), our model predicts 'B-PER' although the token should not be labelled at all. The incorrect classification leads the following tokens '.' '—' 'Le' 'président' 'du' 'Conseil' 'a' 'reçu' ('. - the president of the council has received') to all be labeled 'I-PER', whereas only 'Conseil' should be recognized, as 'B-ORG'. Our model also treats all possible tokens, including punctuation, as entity candidates. For instance, only 'professeur' 'Vaquez' should carry the labels 'B-PER' 'I-PER' in the token sequence '","' 'nièce' 'du' 'professeur' 'Vaquez' '.' (', niece of professor Vaquez.'); our model, however, starts an entity with 'B-LOC' at '","' and labels the following tokens 'I-LOC'. These observations motivated the relabeling post-submission experiment intended to improve the NERC results.

7.3. Post-Submission Experiments
Soft-Labeling With improvements across all scores, soft-label ensembling is preferable to hard-label ensembling. The additional information carried by the soft labels appears to benefit the system. However, more experiments would be needed to make stronger statements that extend to other datasets.

Multilingual Models The multilingual approaches did not improve the performance of our system, but their performance is comparable to our monolingual approaches. The best multilingually trained model performed better than the average best single monolingual model. The performance of the multilingual ensemble predictions could likely still be improved through a better selection of multilingual models or by leveraging newer models such as XLM-R [20]. In addition, it is striking that all models performed poorly on German; further analyses are needed to investigate the reasons and to improve the system.

Relabeling Relabeling improves precision scores and deteriorates recall scores, which means that our systems return fewer but more precise NE predictions. Relabeling is therefore best suited for scenarios where precision is more important than recall and false positives should be avoided.

8. Future Work
The performance of the existing system could be improved by using early stopping instead of training for a fixed number of epochs and by applying a grid search over hyperparameters. We investigated the influence of frozen and unfrozen embeddings anecdotally, which suggested that frozen word embeddings slightly improve NER results. However, due to time constraints, we could not implement our system with frozen embeddings, which would presumably improve the overall results, especially for the relatively small training sets we used.

Because of the workflow in our experiments, each dataset and language uses different pretrained models. Using the same models for one language across datasets, and multilingual models across languages, would make the system more uniform and easier to improve as a whole. Our ensembling approach could benefit from a more careful selection of the single models; replacing the worst model would probably improve the overall performance. This could particularly help the approach described in the post-submission experiment on multilingual models, where all models performed poorly on German. More analyses should be done to explain this poor performance and to improve it.
Experiments with other frameworks, such as AdapterHub [21], or with different architectures could improve performance, as could newer models such as RoBERTa [15], XLM-R [20], or further models trained on historical newspapers.

9. Conclusion
In this paper, we reported on the performance of different language models for Named Entity Recognition (NER) in historical newspapers. One of the main challenges in this domain is digitization artifacts, a problem we address by fine-tuning models that have already been pretrained on noisy historical data. Furthermore, we experimented with ensembling, multilingual models, and label simplification. In a case study across all languages of the HIPE-CLEF 2022 Newseye dataset, we found that models trained over all languages did not improve the scores compared to monolingual models. In a second case study on the Newseye French dataset, we found that predicting only entity categories and inferring the IOB encoding in postprocessing did not improve F1-measures, but shifted the scores towards higher precision and lower recall. On the same dataset, soft-label ensembling substantially improved all scores compared to hard-label ensembling.

Acknowledgments
Thanks to Simon Clematide and Andrianos Michail for lecturing the courses "Machine Learning for NLP 1 and 2" and for mentoring our group during development.

References
[1] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, HuggingFace's Transformers: State-of-the-art Natural Language Processing, Technical Report arXiv:1910.03771, arXiv, 2020. URL: http://arxiv.org/abs/1910.03771. doi:10.48550/arXiv.1910.03771, arXiv:1910.03771 [cs] type: article.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[3] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018. URL: https://arxiv.org/abs/1810.04805. doi:10.48550/ARXIV.1810.04805.
[5] M. Ehrmann, A. Hamdi, E. Linhares Pontes, M. Romanello, A. Doucet, Named entity recognition and classification on historical documents: A survey, arXiv e-prints (2021) arXiv–2109.
[6] J. Li, A. Sun, J. Han, C. Li, A survey on deep learning for named entity recognition, IEEE Transactions on Knowledge and Data Engineering 34 (2022) 50–70. doi:10.1109/TKDE.2020.2981314.
[7] S. Ghannay, C. Grouin, T. Lavergne, Experiments from LIMSI at the French named entity recognition coarse-grained task, in: Proc. of CLEF 2020 LNCS, Thessaloniki, Greece, 2020.
[8] L. Martin, B. Muller, P. J. Ortiz Suárez, Y. Dupont, L. Romary, É. de la Clergerie, D. Seddah, B.
Sagot, CamemBERT: a tasty French language model, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7203–7219. URL: https://aclanthology.org/2020.acl-main.645. doi:10.18653/v1/2020.acl-main.645. [9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy- anov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692. [10] K. Todorov, G. Colavizza, Transfer learning for named entity recognition in historical corpora, in: CLEF (Working Notes), 2020. URL: http://ceur-ws.org/Vol-2696/paper_168.pdf. [11] V. Provatorova, S. Vakulenko, E. Kanoulas, K. Dercksen, J. M. van Hulst, Named entity recognition and linking on historical newspapers: Uva.ilps & rel at clef hipe 2020, in: L. Cappellato, C. Eickhoff, N. F. 0001, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22- 25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http: //ceur-ws.org/Vol-2696/paper_209.pdf. [12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirec- tional Transformers for Language Understanding, Technical Report arXiv:1810.04805, arXiv, 2019. URL: http://arxiv.org/abs/1810.04805. doi:10.48550/arXiv.1810.04805, arXiv:1810.04805 [cs] type: article. [13] M. Ehrmann, M. Romanello, A. Doucet, S. Clematide, Hipe 2022 shared task participation guidelines v1.0, 2022. URL: https://doi.org/10.5281/zenodo.6045662. [14] B. Staatsbibliothek, dbmdz (Bayerische Staatsbibliothek), 2022. URL: https://huggingface. co/dbmdz. [15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, Technical Report arXiv:1907.11692, arXiv, 2019. URL: http://arxiv.org/abs/1907.11692. doi:10.48550/ arXiv.1907.11692, arXiv:1907.11692 [cs] type: article. [16] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized Au- toregressive Pretraining for Language Understanding, Technical Report arXiv:1906.08237, arXiv, 2020. URL: http://arxiv.org/abs/1906.08237. doi:10.48550/arXiv.1906.08237, arXiv:1906.08237 [cs] type: article. [17] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training Text Encoders as Dis- criminators Rather Than Generators, Technical Report arXiv:2003.10555, arXiv, 2020. URL: http://arxiv.org/abs/2003.10555. doi:10.48550/arXiv.2003.10555, arXiv:2003.10555 [cs] type: article. [18] L. Tunstall, L. von Werra, T. Wolf, Natural Language Processing with Transformers, O’Reilly, 2022. [19] C. Ju, A. Bibaut, M. Laan, The relative performance of ensemble methods with deep convolutional neural networks for image classification, Journal of applied statistics (2018). doi:10.1080/02664763.2018.1441383. [20] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440–8451. [21] J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vulic, S. Ruder, K. Cho, I. Gurevych, Adapterhub: A framework for adapting transformers, CoRR abs/2007.07779 (2020). URL: https://arxiv. org/abs/2007.07779. arXiv:2007.07779. 10. 
Online Resources • Repository for this paper, • HIPE2022 datasets, • HIPE2022 evaluation module, • Google Colab. Table 7 Used models and their corresponding HuggingFace links. Not available dev-set scores are marked with ’-’. The best model for each dataset is marked in bold. The best multilingual model was determined by the test-set. Languages Datasets Model name (hyperlink) F1-macro on dev-set De Newseye bert-base-german-cased 0.35 De Sonar 0.87 Fi Newseye dbmdz/bert-base-finnish-europeana-cased 0.78 Fr HIPE2020 dbmdz/bert-base-french-europeana-cased 0.82 Fr Newseye 0.86 Fr Letemps 0.55 De HIPE2020 dbmdz/bert-base-german-europeana-cased 0.76 De Newseye 0.46 De Sonar 0.93 En Topres19th dbmdz/bert-base-historic-english-cased 0.62 En HIPE2020 - De HIPE2020 dbmdz/bert-base-historic-multilingual-cased 0.74 De Sonar 0.60 En Topres19th 0.72 Fi Newseye 0.80 Fr HIPE2020 0.80 Fr Newseye 0.80 Fr Letemps 0.56 Sv Newseye 0.71 Multilingual Newseye - Multilingual Newseye bert-base-multilingual-cased - Multilingual Newseye bert-base-multilingual-uncased - Sv Newseye dbmdz/bert-base-swedish-europeana-cased 0.73 Multilingual Newseye distilbert-base-multilingual-cased - Fr HIPE2020 dbmdz/electra-base-french-europeana-cased-discriminator 0.34 Fr Newseye 0.38 Fr Letemps 0.32 De HIPE2020 dbmdz/electra-base-german-europeana-cased-discriminator 0.77 Fr HIPE2020 dbmdz/flair-hipe-2022-ajmc-fr-64k - Fi Newseye EMBEDDIA/finest-bert 0.83 En Topres19th google/electra-base-discriminator 0.73 En HIPE2020 - En Topres19th Jean-Baptiste/roberta-large-ner-english 0.71 En HIPE2020 - Sv Newseye jonfd/electra-small-nordic 0.13 Sv Newseye KB/bert-base-swedish-cased 0.73 Fi Newseye setu4993/LaBSE 0.79 Fr Newseye 0.84 Fr Letemps 0.52 Fr HIPE2020 0.83 Sv Newseye 0.66 Multilingual Newseye - Fi Newseye TurkuNLP/bert-base-finnish-cased-v1 0.79 En Topres19th xlnet-base-cased 0.26 En HIPE2020 - Table 8 Macro-fuzzy evaluation of submitted systems System Dataset Eval Precision all Recall all F1 all-labels Best category Worst category Best Model HIPE2020_de macro-fuzzy 0.789 0.828 0.796 TIME (0.9) ORG (0.484) Best Model HIPE2020_fr macro-fuzzy 0.865 0.831 0.836 TIME (0.942) ORG (0.587) Best Model Letemps_fr macro-fuzzy 0.52 0.712 0.752 PERS (0.845) ORG (0.256) Best Model Newseye_de macro-fuzzy 0.392 0.502 0.547 PER (0.591) HUMANPROD (0.0) Best Model Newseye_fi macro-fuzzy 0.765 0.626 0.668 HUMANPROD (0.863) ORG (0.57) Best Model Newseye_fr macro-fuzzy 0.775 0.777 0.766 PER (0.82) ORG (0.606) Best Model Newseye_sv macro-fuzzy 0.738 0.72 0.735 LOC (0.821) ORG (0.428) Best Model Sonar_de macro-fuzzy 0.617 0.667 0.633 LOC (0.711) ORG (0.451) Best Model Topres19th_en macro-fuzzy 0.813 0.86 0.824 LOC (0.873) BUILDING (0.643) AVERAGE 0.62 0.63 0.64 Ensembled HIPE2020_de macro-fuzzy 0.819 0.826 0.808 TIME (0.89) ORG (0.501) Ensembled HIPE2020_en macro-fuzzy 0.724 0.656 0.689 TIME (1.0) PROD (0.0) Ensembled HIPE2020_fr macro-fuzzy 0.858 0.816 0.826 TIME (0.951) ORG (0.609) Ensembled Letemps_fr macro-fuzzy 0.545 0.712 0.763 LOC (0.847) ORG (0.239) Ensembled Newseye_de macro-fuzzy 0.403 0.469 0.538 LOC (0.585) HUMANPROD (0.167) Ensembled Newseye_fi macro-fuzzy 0.796 0.581 0.703 HUMANPROD (0.795) ORG (0.593) Ensembled Newseye_fr macro-fuzzy 0.814 0.762 0.779 PER (0.811) ORG (0.621) Ensembled Newseye_sv macro-fuzzy 0.765 0.722 0.747 HUMANPROD (0.861) ORG (0.417) Ensembled Sonar_de macro-fuzzy 0.663 0.672 0.654 LOC (0.758) ORG (0.432) Ensembled Topres19th_en macro-fuzzy 0.881 0.823 0.841 LOC (0.889) BUILDING (0.642) AVERAGE 0.70 0.68 0.69 Table 9 
Macro-strict evaluation of submitted systems System Dataset Eval Precision all Recall all F1 all-labels Best category Worst category Best Model HIPE2020_de macro-strict 0.671 0.693 0.672 LOC (0.805) ORG (0.384) Best Model HIPE2020_fr macro-strict 0.764 0.735 0.74 LOC (0.772) ORG (0.499) Best Model Letemps_fr macro-strict 0.448 0.625 0.659 LOC (0.754) ORG (0.114) Best Model Newseye_de macro-strict 0.302 0.386 0.421 LOC (0.464) HUMANPROD (0.0) Best Model Newseye_fi macro-strict 0.682 0.561 0.596 PER (0.733) ORG (0.481) Best Model Newseye_fr macro-strict 0.63 0.623 0.621 HUMANPROD (0.699) ORG (0.419) Best Model Newseye_sv macro-strict 0.611 0.583 0.602 HUMANPROD (0.758) ORG (0.338) Best Model Sonar_de macro-strict 0.46 0.5 0.474 LOC (0.649) ORG (0.241) Best Model Topres19th_en macro-strict 0.77 0.812 0.779 LOC (0.824) BUILDING (0.527) AVERAGE 0.62 0.63 0.64 Ensembled HIPE2020_de macro-strict 0.686 0.679 0.67 LOC (0.815) ORG (0.411) Ensembled HIPE2020_en macro-strict 0.553 0.494 0.523 TIME (0.718) PROD (0.0) Ensembled HIPE2020_fr macro-strict 0.741 0.7 0.712 LOC (0.712) ORG (0.403) Ensembled Letemps_fr macro-strict 0.48 0.636 0.681 LOC (0.776) ORG (0.095) Ensembled Newseye_de macro-strict 0.316 0.364 0.419 LOC (0.494) HUMANPROD (0.167) Ensembled Newseye_fi macro-strict 0.666 0.484 0.585 LOC (0.655) ORG (0.477) Ensembled Newseye_fr macro-strict 0.659 0.614 0.63 HUMANPROD (0.766) ORG (0.471) Ensembled Newseye_sv macro-strict 0.654 0.608 0.634 HUMANPROD (0.739) ORG (0.306) Ensembled Sonar_de macro-strict 0.512 0.514 0.503 LOC (0.695) ORG (0.23) Ensembled Topres19th_en macro-strict 0.839 0.785 0.802 LOC (0.86) BUILDING (0.554) AVERAGE 0.69 0.68 0.69 Table 10 Micro-fuzzy evaluation of submitted systems System Dataset Eval Precision all Recall all F1 all-labels Best category Worst category Best Model HIPE2020_de micro-fuzzy 0.783 0.826 0.804 PERS (0.874) ORG (0.545) Best Model hipe2020_fr micro-fuzzy 0.825 0.776 0.8 PERS (0.848) PROD (0.596) Best Model Letemps_fr micro-fuzzy 0.61 0.771 0.681 LOC (0.734) ORG (0.208) Best Model Newseye_de micro-fuzzy 0.48 0.512 0.495 LOC (0.541) HUMANPROD (0.0) Best Model Newseye_fi micro-fuzzy 0.681 0.603 0.64 HUMANPROD (0.732) ORG (0.478) Best Model Newseye_fr micro-fuzzy 0.785 0.787 0.786 PER (0.849) HUMANPROD (0.579) Best Model Newseye_sv micro-fuzzy 0.786 0.704 0.742 LOC (0.799) ORG (0.457) Best Model Sonar_de micro-fuzzy 0.625 0.718 0.668 LOC (0.765) ORG (0.468) Best Model Topres19th_en micro-fuzzy 0.807 0.851 0.829 LOC (0.872) STREET (0.661) AVERAGE 0.61 0.63 0.65 Ensembled HIPE2020_de micro-fuzzy 0.812 0.833 0.822 LOC (0.866) PROD (0.574) Ensembled HIPE2020_en micro-fuzzy 0.726 0.661 0.692 TIME (0.909) PROD (0.0) Ensembled HIPE2020_fr micro-fuzzy 0.824 0.773 0.798 TIME (0.847) ORG (0.555) Ensembled Letemps_fr micro-fuzzy 0.642 0.773 0.701 LOC (0.7) ORG (0.178) Ensembled Newseye_de micro-fuzzy 0.481 0.478 0.479 LOC (0.551) HUMANPROD (0.08) Ensembled Newseye_fi micro-fuzzy 0.73 0.619 0.67 PER (0.706) ORG (0.495) Ensembled Newseye_fr micro-fuzzy 0.801 0.744 0.772 PER (0.839) ORG (0.58) Ensembled Newseye_sv micro-fuzzy 0.797 0.702 0.746 LOC (0.801) ORG (0.442) Ensembled Sonar_de micro-fuzzy 0.641 0.696 0.667 LOC (0.78) ORG (0.443) Ensembled Topres19th_en micro-fuzzy 0.869 0.81 0.838 LOC (0.88) BUILDING (0.659) AVERAGE 0.70 0.67 0.68 Table 11 Micro-strict evaluation of submitted systems System Dataset Eval Precision all Recall all F1 all-labels Best category Worst category Best Model HIPE2020_de micro-strict 0.677 0.714 0.695 LOC (0.794) ORG (0.411) Best Model 
HIPE2020_fr micro-strict 0.718 0.675 0.696 LOC (0.748) PROD (0.519) Best Model Letemps_fr micro-strict 0.557 0.704 0.622 LOC (0.692) ORG (0.12) Best Model Newseye_de micro-strict 0.395 0.421 0.408 LOC (0.479) HUMANPROD (0.0) Best Model newseye_fi micro-strict 0.592 0.524 0.556 HUMANPROD (0.683) ORG (0.407) Best Model Newseye_fr micro-strict 0.655 0.657 0.656 PER (0.709) ORG (0.441) Best Model Newseye_sv micro-strict 0.673 0.603 0.636 HUMANPROD (0.75) ORG (0.343) Best Model Sonar_de micro-strict 0.447 0.513 0.477 LOC (0.685) ORG (0.293) Best Model Topres19th_en micro-strict 0.761 0.802 0.781 LOC (0.833) BUILDING (0.564) AVERAGE 0.64 0.66 0.67 Ensembled HIPE2020_de micro-strict 0.716 0.735 0.725 LOC (0.82) PROD (0.452) Ensembled HIPE2020_en micro-strict 0.538 0.49 0.513 LOC (0.607) PROD (0.0) Ensembled HIPE2020_fr micro-strict 0.7 0.657 0.678 LOC (0.761) PROD (0.421) Ensembled Letemps_fr micro-strict 0.589 0.71 0.644 LOC (0.715) ORG (0.089) Ensembled Newseye_de micro-strict 0.396 0.394 0.395 LOC (0.485) HUMANPROD (0.08) Ensembled Newseye_fi micro-strict 0.618 0.524 0.567 HUMANPROD (0.615) ORG (0.385) Ensembled Newseye_fr micro-strict 0.673 0.625 0.648 PER (0.712) ORG (0.455) Ensembled Newseye_sv micro-strict 0.686 0.604 0.643 LOC (0.716) ORG (0.288) Ensembled Sonar_de micro-strict 0.47 0.511 0.49 LOC (0.709) ORG (0.268) Ensembled Topres19th_en micro-strict 0.816 0.76 0.787 LOC (0.84) BUILDING (0.551) AVERAGE 0.69 0.66 0.67 Table 12 Evaluation multilingual experiments. Model1: dbmdz/bert-base-historic-multilingual-cased, Model2: setu4993/LaBSE, Model3: bert-base-multilingual-cased, Model4: bert-base-multilingual-uncased, Model5: distilbert-base-multilingual-cased System Dataset Eval Precision all Recall all F1 all-labels sub_Best Model Newseye_de micro-strict 0.395 0.421 0.408 sub_Best Model Newseye_fi micro-strict 0.592 0.524 0.556 sub_Best Model Newseye_fr micro-strict 0.655 0.657 0.656 sub_Best Model Newseye_sv micro-strict 0.673 0.603 0.636 AVERAGE 0.54 0.52 0.56 sub_Ensembled Newseye_de micro-strict 0.396 0.394 0.395 sub_Ensembled Newseye_fi micro-strict 0.618 0.524 0.567 sub_Ensembled Newseye_fr micro-strict 0.673 0.625 0.648 sub_Ensembled Newseye_sv micro-strict 0.686 0.604 0.643 AVERAGE 0.70 0.65 0.67 set_random Newseye_de micro-strict 0.006 0.022 0.009 set_random Newseye_fi micro-strict 0.005 0.01 0.007 set_random Newseye_fr micro-strict 0.005 0.013 0.008 set_random Newseye_sv micro-strict 0.01 0.023 0.013 AVERAGE 0.01 0.02 0.01 Model1 Newseye_de micro-strict 0.407 0.399 0.403 Model1 Newseye_fi micro-strict 0.671 0.564 0.613 Model1 Newseye_fr micro-strict 0.653 0.616 0.634 Model1 Newseye_sv micro-strict 0.729 0.627 0.674 AVERAGE 0.62 0.55 0.58 Model2 Newseye_de micro-strict 0.406 0.416 0.411 Model2 Newseye_fi micro-strict 0.624 0.514 0.563 Model2 Newseye_fr micro-strict 0.65 0.607 0.628 Model2 Newseye_sv micro-strict 0.693 0.599 0.643 AVERAGE 0.59 0.53 0.56 Model3 Newseye_de micro-strict 0.407 0.44 0.423 Model3 Newseye_fi micro-strict 0.586 0.462 0.517 Model3 Newseye_fr micro-strict 0.648 0.585 0.615 Model3 Newseye_sv micro-strict 0.643 0.548 0.592 AVERAGE 0.57 0.51 0.54 Model4 Newseye_de micro-strict 0.405 0.428 0.416 Model4 Newseye_fi micro-strict 0.563 0.438 0.493 Model4 Newseye_fr micro-strict 0.626 0.587 0.606 Model4 Newseye_sv micro-strict 0.637 0.53 0.579 AVERAGE 0.56 0.50 0.52 Model5 Newseye_de micro-strict 0.232 0.157 0.188 Model5 Newseye_fi micro-strict 0.266 0.113 0.159 Model5 Newseye_fr micro-strict 0.245 0.239 0.242 Model5 Newseye_sv micro-strict 0.356 0.182 0.241 
AVERAGE 0.27 0.17 0.21
Ensembling Newseye_de micro-strict 0.434 0.408 0.421
Ensembling Newseye_fi micro-strict 0.649 0.502 0.566
Ensembling Newseye_fr micro-strict 0.691 0.608 0.647
Ensembling Newseye_sv micro-strict 0.742 0.596 0.661
AVERAGE 0.629 0.5285 0.57375

Table 13
Relabeling post-submission experiment results. For each model and evaluation setting, the first three columns give Precision, Recall, and F1 without relabeling, the last three with relabeling.
micro-fuzzy
1 Language-agnostic BERT Sentence Encoder (LaBSE)   0.772 0.724 0.747   0.782 0.699 0.738
2 Historic Language Multilingual Model              0.770 0.734 0.752   0.778 0.720 0.748
3 French Europeana BERT                             0.785 0.787 0.786   0.806 0.765 0.785
4 French Europeana ELECTRA                          0.468 0.521 0.493   0.470 0.544 0.504
micro-strict
1 Language-agnostic BERT Sentence Encoder (LaBSE)   0.645 0.605 0.624   0.654 0.585 0.618
2 Historic Language Multilingual Model              0.651 0.621 0.636   0.661 0.611 0.635
3 French Europeana BERT                             0.655 0.657 0.656   0.672 0.638 0.654
4 French Europeana ELECTRA                          0.213 0.238 0.225   0.233 0.269 0.250