=Paper=
{{Paper
|id=Vol-2696/paper_173
|storemode=property
|title=Triple E - Effective Ensembling of Embeddings and Language Models for NER of Historical German
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_173.pdf
|volume=Vol-2696
|authors=Stefan Schweter,Luisa März
|dblpUrl=https://dblp.org/rec/conf/clef/SchweterM20
}}
==Triple E - Effective Ensembling of Embeddings and Language Models for NER of Historical German==
Stefan Schweter (1) and Luisa März (2)

(1) Bayerische Staatsbibliothek München, Digital Library/Munich Digitization Center, stefan.schweter@bsb-muenchen.de
(2) Center for Information and Language Processing (CIS), LMU Munich, maerz@cis.lmu.de

Abstract. Named entity recognition (NER) for historical texts is a challenging task compared to NER for contemporary texts. Historical texts come with several peculiarities that differ greatly from modern texts, and large labeled corpora for training a neural tagger are hardly available. In this work we tackle NER for historical German with an ensembling approach, combining different labeled and unlabeled resources of historical and contemporary texts as part of the CLEF HIPE 2020 evaluation lab. We stack different word/subword embeddings and transformer-based language models to train a powerful NER tagger for historical German. We conduct experiments with different word embeddings, Flair embeddings and pretrained Bert models. The named entities are classified in the literal and in the metonymic sense, for which we developed a separate tagger each. Our experiments show that the usage of Bert is particularly helpful when it is trained on a large amount of historical data. Our best ensemble is a combination of FastText embeddings trained on German Wikipedia, Flair embeddings trained on CLEF HIPE data (historical German) and a Bert language model trained on a large corpus of historical German. We release our code and models at https://github.com/stefan-it/clef-hipe.

Keywords: Named Entity Recognition · Transformer-based language models · Embeddings · Historical texts · Flair · FastText · Byte Pair Encoding

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

In NER, neural networks achieve good accuracy in high-resource domains such as modern news text or Twitter ([2, 4]). On historical text, however, NER taggers often perform poorly. This is due to domain shift and to the fact that historical texts contain systematic errors not found in modern text, since historical datasets usually stem from optical character recognition (OCR). OCR is noisy, and the Gothic typeface (Fraktur) is a low-resource font that is very challenging for OCR. Another problem is that a large amount of data is required for training neural models, while only relatively small corpora (e.g. [20]) exist for historical NER. All of these challenges mean that NER for contemporary texts differs greatly from NER for historical texts and that existing models cannot simply be reused.

From a resource-oriented and ecological point of view it is reasonable to reuse existing models to save both computing power and emissions. Therefore, we reuse existing models on the one hand and make our newly developed language models publicly available on the other hand. The NLP community provides several frameworks and models, one of which is Flair [1]. Flair allows applying state-of-the-art natural language processing (NLP) models, such as NER, part-of-speech (PoS) tagging, word sense disambiguation or classification, to various input texts. In this work we build our systems with that framework.
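As a brief illustration of the Flair interface that our systems build on, the following minimal sketch loads a publicly available pretrained German NER model and tags a sentence. The model identifier "de-ner" is a contemporary German model from the Flair registry and serves only as an assumed stand-in to show the API; it is not the historical tagger developed in this work.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load a pretrained NER model from the Flair model registry.
# "de-ner" is a contemporary German model, used here only to illustrate the API.
tagger = SequenceTagger.load("de-ner")

# Tag a sentence and print the detected entity spans.
sentence = Sentence("Unterhandlungen über das Konkordat mit Hannover schreiten voran.")
tagger.predict(sentence)
for span in sentence.get_spans("ner"):
    print(span)
```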
Transformer-based language models are widely used, and Bert [8] can be considered a powerful standard resource. There are several recent approaches that use Bert for NER in different languages, such as [25] or [16]. The latter conduct experiments for historical German using Bert and unsupervised pretraining on a large corpus of historical German texts, together with supervised pretraining on a contemporary German corpus.

1.1 Task and Objective

In this work we address neural NER tagging on historical German data. With our approach we aim to solve coarse-grained NER in the CLEF HIPE shared task [11] (bundle 4) for historical German as well as possible. The tagset of the provided data contains person, location, organisation, product and time. The organizers arranged two scenarios to be solved: NER for the literal sense of the words and NER for the metonymic sense. The example below shows that the tags for the literal (first sentence) and the metonymic (second sentence) sense can differ: Hannover can be interpreted as an organization as well as a location depending on its context, and the metonymic category addresses this issue.

Example:
Unterhandlungen über das Konkordat mit [B-loc] Hannover schreiten voran.
Unterhandlungen über das Konkordat mit [B-org] Hannover schreiten voran.
(Negotiations on the Concordat with Hanover are progressing.)

This paper is structured as follows: the next section describes the data sets and other resources used in the presented experiments. Section 3 outlines our method, and section 4 explains implementation details and the conducted experiments; the outcome of the experiments is discussed in that section as well. Section 5 then gives an overview of ideas for future work, and we conclude the paper with section 6.

2 Data and Resources

This section describes the data provided by the shared task organizers as well as additional resources and data that we used for our experiments.

2.1 CLEF HIPE Data

The shared task corpus for German is composed of articles sampled from several Swiss and Luxembourgish historical newspapers on a diachronic basis and is provided by the CLEF-HIPE-2020 organizers. The articles chosen for the train, development and test data are journalistic articles only, which had to match certain selection criteria such as length or format. Feuilleton, tabular data, crosswords, weather forecasts, time schedules and obituaries were excluded, as were articles that were fully illegible due to massive OCR noise. The newspaper content stems from the time period from 1798 until 2018, and thus the data contains varying OCR quality and covers a broad spectrum of text composition. The corpora were manually annotated by native speakers according to the HIPE impresso guidelines ([10, 9]).

2.2 Additional Data and Resources

Table 1 gives an overview of all resources and shows the time period and domain of each data set. The sizes of the training data used for the embeddings/models are shown in Table 2. Our approach includes data from different time periods as well as from various domains in order to reuse existing resources optimally.

Embeddings. We use different FastText-based word embeddings [19] trained on Wikipedia (https://fasttext.cc/docs/en/pretrained-vectors.html), on Common Crawl (https://fasttext.cc/docs/en/crawl-vectors.html) and on historic data (provided by the organizers), as well as Byte Pair Encoding-based embeddings (BPE, [24]) trained on Wikipedia. We use the FastText embeddings trained on Wikipedia (FastText Wiki) and Common Crawl (FastText CC) in a "classic" word embedding manner, i.e. we do not use subwords. To include subword information we use German subword embeddings [12] with a dimension of 300 and a vocabulary size of 200k (BPEmb).
Additionally, we experiment with multilingual subword embeddings [13] with a dimension of 300 and a vocabulary size of 1M (MultiBPEmb). We use the Flair embeddings [3, 2] provided by the organizers (CLEF-HIPE) and compare them to other Flair embeddings that were trained on historic data. We use two historic Flair embeddings that were trained by [23]: embeddings trained on the Hamburger Anzeiger newspaper corpus (HHA) and embeddings trained on the Wiener Zeitung newspaper corpus (WZ). Both embeddings are available in the Flair framework. In addition we use the data of the recently published REDEWIEDERGABE corpus [6], which consists of fictional and non-fictional texts. We also experiment with the Flair embeddings provided by [3] (German Flair).

Table 1. Overview of time periods and domains of the training data used for the embeddings and language models.

usage       name             time period   domain
train data  -                1798-2018     news
FastText    FastText Wiki    contemp.      various
FastText    FastText CC      contemp.      various
BPE         BPEmb            1798-2018     news
BPE         MultiBPEmb       contemp.      news
Flair       HHA              1888-1945     news
Flair       WZ               1703-1875     news
Flair       Redewiedergabe   1840-1920     various
Flair       German Flair     contemp.      various
Flair       CLEF-HIPE        1798-2018     news
Bert        Europeana Bert   1618-1990     news
Bert        German Bert      historical    various

Table 2. Overview of the different training data used. The number of tokens is given in millions. * indicates that the data was provided by the organizers.

usage       name             data                 tokens    size
train data  -                CLEF HIPE*           0.071     S
FastText    FastText Wiki    Wikipedia            1400      L
BPE         BPEmb            Wikipedia            ≈ 1400    L
BPE         MultiBPEmb       Wikipedia            < 7000    L
FastText    FastText CC      Common Crawl         65648     XL
Flair       Redewiedergabe   REDEWIEDERGABE       0.489     S
Flair       German Flair     OPUS project         500       M
Flair       HHA              Hamburger Anzeiger   742       M
Flair       WZ               Wiener Zeitung       802       M
Flair       CLEF-HIPE        CLEF-HIPE*           1722      L
Bert        Europeana Bert   Europeana            8000      L
Bert        German Bert      -                    ≈ 24000   XL

Transformer-based language models. For transformer-based language models we conduct experiments with self-trained Bert models: Europeana Bert (https://github.com/stefan-it/europeana-bert) and a large German Bert (German Bert; publication under review). In preliminary experiments we also used publicly available German Bert models (deepset, https://huggingface.co/bert-base-german-cased, and DBMDZ, https://github.com/dbmdz/berts). Since their performance was not convincing, we did not include them in our final setup. The Europeana Bert data comes from the Europeana Newspapers collection (http://www.europeana-newspapers.eu/), which contains historical news articles in 12 languages published between 1618 and 1990. The Europeana Bert model was trained on 51 GB of newspapers extracted from German Europeana and mainly covers newspaper articles from the 18th to the 20th century. German Bert was trained on a huge collection of various historical resources.

3 Methods

To develop an efficient NER tagger for historical texts we experiment with the stacking methods described in the following. We try different kinds of ensembling/stacking approaches on the development set to figure out the optimal combination of embeddings and language models. Our final system, Cisteria, uses an ensemble of word embeddings, transformer-based language models and Flair embeddings. To arrive at the best combination of embeddings for Cisteria we conduct experiments in which we a) select the best word embeddings, Flair embeddings and transformer-based language models independently, and b) combine the best selected word embedding, the best transformer-based language model and the best Flair embeddings and feed those to our network; a sketch of such a stacked configuration is shown below.
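The following sketch illustrates such a stacked configuration in Flair: classic word embeddings, contextual Flair embeddings and feature-based transformer embeddings are concatenated into a single representation per token. The identifiers are assumptions rather than our exact setup: the "de" FastText vectors, the historic Hamburger Anzeiger Flair embeddings from the Flair registry and the public bert-base-german-cased model serve as stand-ins, since the CLEF-HIPE Flair embeddings and our historical Bert models are not part of the public registry. Parameter names follow recent Flair releases.

```python
from flair.embeddings import (
    FlairEmbeddings,
    StackedEmbeddings,
    TransformerWordEmbeddings,
    WordEmbeddings,
)

# Classic word-level FastText embeddings (no subword information).
word_embeddings = WordEmbeddings("de")

# Contextual character-level Flair embeddings (forward and backward model);
# stand-in: the historic Hamburger Anzeiger embeddings from the Flair registry.
flair_forward = FlairEmbeddings("de-historic-ha-forward")
flair_backward = FlairEmbeddings("de-historic-ha-backward")

# Transformer-based embeddings used in a feature-based way: for each token,
# the first subword is taken from every layer and averaged over all layers.
bert_embeddings = TransformerWordEmbeddings(
    "bert-base-german-cased",  # stand-in for our historical German Bert
    layers="all",
    layer_mean=True,
    subtoken_pooling="first",
    fine_tune=False,
)

# Concatenate all representations into one vector per token.
stacked_embeddings = StackedEmbeddings(
    embeddings=[word_embeddings, flair_forward, flair_backward, bert_embeddings]
)
```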
The network for the classification is a bidirectional LSTM with a conditional random field (CRF) as the final output layer, as proposed by [14]. Note that we train separate models for the metonymic and the literal sense span.

4 Implementation and Experiments

The following describes the implementation of our approach, gives an overview of the different experiments and presents the results. Our final system for the CLEF HIPE 2020 evaluation lab is referred to as Cisteria.

To feed the CLEF-HIPE data into our tagger we need several preprocessing steps. Our preprocessing includes sentence splitting (rule-based) and normalizing word hyphenations. The motivation behind normalizing hyphenation is that pretrained language models are normally trained on normalized text, whereas the word hyphenation character in the CLEF-HIPE shared task is a special symbol (¬) that does not occur in the training corpora of pretrained language models. As we use contextualized word embeddings, correct hyphenation is very important for producing high-quality embeddings. To get the data ready for evaluation with the officially provided evaluation script, we perform the reverse process and add word hyphenation and sentence boundaries again.

We use the Flair [1] library to train our NER tagging models and make use of Bert embeddings in a feature-based setting. In order to get a representation for an input token, we first compute the mean of the first subword over all layers of the transformer-based architecture and feed the resulting representation into a bidirectional LSTM with a CRF as the final layer, following [3]. To ensemble different embeddings and language models, their representations are concatenated and the resulting vector is processed by the neural model. Cisteria was trained on the official training and development data and does not use any additional labeled training data.

For the experiments with transformer-based language models, we fine-tune Bert models using the Hugging Face Transformers library [29]. For these fine-tuning experiments we use a batch size of 16 and train for 10 epochs. We perform three runs per transformer-based model and select the best model based on the development F1-score; we do not perform an extensive hyperparameter search. We then use the fine-tuned model in Flair (feature-based approach) for all further experiments. We use a bidirectional LSTM with 256 hidden states and a batch size of 16. The original Bert paper [8] uses the last four layers of the transformer-based model for a feature-based NER model. Additionally, we reduce the learning rate by a factor of 0.5 with a patience of 3: the patience determines the number of epochs without improvement after which the learning rate is reduced, which can be seen as a form of early stopping. A minimal sketch of this training setup is given below.

We found that fine-tuning a Bert model for the metonymic sense span was very unstable, resulting in zero F1-scores. This is a well-known problem for datasets with only a small number of training instances; a solution could be to use a different dropout strategy [17]. For that reason we trained the metonymic model using the CLEF-HIPE Flair embeddings. In the prediction phase we only predict a metonymic entity when an entity is detected for the literal sense span.
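Continuing the embedding sketch above, the stacked representation can be fed into the BiLSTM-CRF tagger and trained with the hyperparameters reported in this section (256 hidden states, mini-batch size 16, learning-rate annealing by a factor of 0.5 with a patience of 3). The two-column format and the file names are simplified assumptions; the actual CLEF-HIPE TSV files contain additional columns, and the API calls again follow recent Flair releases.

```python
from flair.datasets import ColumnCorpus
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Load the preprocessed CLEF-HIPE data in a simplified two-column format
# (token and coarse literal NE label); folder and file names are placeholders.
corpus = ColumnCorpus(
    "data/clef-hipe-de",
    column_format={0: "text", 1: "ner"},
    train_file="train.tsv",
    dev_file="dev.tsv",
)
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

# Bidirectional LSTM with 256 hidden states and a CRF as final output layer.
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=stacked_embeddings,  # from the stacking sketch above
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,
)

# Train with learning-rate annealing: the learning rate is halved after
# 3 epochs without improvement on the development score.
trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "models/cisteria-literal",
    mini_batch_size=16,
    anneal_factor=0.5,
    patience=3,
)
```

In this feature-based setup the transformer weights stay fixed: the separately fine-tuned Bert models are plugged in as fixed feature extractors, and only the BiLSTM-CRF on top of the concatenated embeddings is trained.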
Our final system for the literal sense span uses FastText embeddings trained on Wikipedia (FastText Wiki) and a self-trained large German Bert model. For the metonymic sense span we train a separate model that uses FastText embeddings trained on Wikipedia and the Flair embeddings provided by the organizers.

4.1 Results

For the evaluation of NER there are two regimes: strict and fuzzy. The strict regime corresponds to exact boundary matching, whereas the fuzzy regime also takes overlapping boundaries into account; a detailed description can be found in [11]. In addition, spans are evaluated w.r.t. the literal or metonymic sense (see section 1.1). We evaluate our systems using the official evaluation script (https://github.com/impresso/CLEF-HIPE-2020-scorer). All our reported results on the development set refer to the F1-score for coarse-grained NER in the strict scenario for the literal sense. For the test set we report precision, recall and F1-score for both regimes, in the literal as well as in the metonymic sense (see Table 8). According to the overview paper of the shared task [11], the baseline in the strict evaluation scenario for German coarse-grained NER in the literal sense reaches an F1-score of 47.6% (see Table 7).

Our experiments with different word embeddings show that the FastText Wiki embeddings perform best (see Table 3). With an F1-score of approx. 69% they outperform the baseline by more than 20 percentage points. Interestingly, the FastText Wiki embeddings are not trained on the largest amount of data compared to the other word embeddings (see Table 2).

Table 3. Experiments with different word embeddings on the German development set. The F1-score averaged over 3 runs is reported. Best result in bold.

Model              F1
FastText Wiki      69.28 ±0.65
FastText CC        66.38 ±0.51
BPEmb [12]         67.71 ±0.48
MultiBPEmb [13]    66.22 ±0.14

The different Flair embeddings consistently lead to better results than the word embeddings. The Flair embeddings provided by the organizers (CLEF-HIPE) perform best, with an F1-score of 77.04% (see Table 4). The gap between the different Flair embeddings is comparably large and ranges from three to seven percentage points. Here the embeddings that were trained on the largest amount of data perform best, and the Redewiedergabe embeddings, trained on the smallest amount, perform worst.

Table 4. Experiments with different Flair embeddings on the German development set. The F1-score averaged over 3 runs is reported. Best result in bold.

Model                      F1
Hamburger Anzeiger [23]    74.14 ±0.11
Wiener Zeitung [23]        75.07 ±0.11
Redewiedergabe [6]         70.21 ±0.27
German (Flair) [3]         74.98 ±0.30
CLEF-HIPE                  77.04 ±0.12

The usage of Bert enhances the performance once more. The German Bert model performs best and reaches an F1-score of 82.11% (see Table 5). Again, this is the model that was trained on the largest amount of data. The cased version of Europeana Bert leads to a similar performance, approx. two percentage points lower. Since German is case-sensitive, it is understandable that the cased models perform better than the uncased ones. As with the Flair embeddings, every setup with Bert outperforms the models of our previous experiments.

Table 5. Experiments with different Bert models on the German development set. The F1-score averaged over 3 runs is reported. Best result in bold.

Model                         F1
Europeana Bert (cased)        80.41 ±0.14
Europeana Bert (uncased)      79.66 ±0.32
German Bert (cased, large)    82.11 ±0.50
Finally, the combination of German Bert with the FastText Wiki embeddings outperforms all of our other systems on the development set and reaches 83.69% (see Table 6). This result is plausible when compared to the best F1-scores of [16] on other historical datasets, where the performance for two datasets is around 84%. Adding the best Flair embeddings decreases the results slightly. Combining the best Flair embeddings with the best FastText embeddings performs better than using the Flair embeddings alone, but still worse than the other stacking approaches. The performance of our best system is approx. 40% better than the baseline, which is a large improvement.

Table 6. Stacking experiments on the German development set. The F1-score averaged over 3 runs is reported. Best result in bold.

Model                                             F1
FastText (Wikipedia) + CLEF-HIPE + German Bert    83.57 ±0.36
FastText (Wikipedia) + CLEF-HIPE                  77.97 ±0.47
FastText (Wikipedia) + German Bert                83.69 ±0.08

4.2 Discussion of Results

We want to relate our final results on the test set to those of the other participating teams. Compared to the baseline, our final systems (Cisteria) perform very well. Compared to the median of all participating teams, our system for the literal sense performs approx. 2 percentage points better in the strict scenario and is almost on par with the median in the fuzzy scenario (see Table 7). In both regimes the best system, L3i [5], outperforms ours by slightly more than 10 percentage points. This could be due to the fact that they use powerful transformer-based embeddings for different languages and a hierarchical transformer-based attention model [28] together with a multi-task learning approach. Our experiments with Bert embeddings show that the model benefits a lot from the German Europeana Bert language model and that only a model trained on even more data could outperform it. It is therefore not surprising that a model trained with more of these powerful Bert embeddings performs even better. The benefit of combining models for different languages is evident, and we assume that our model performance could be enhanced further by integrating multilinguality as well.

Table 7. Results for NERC-Coarse literal with micro precision, recall and F1-score on the test set. Bold font indicates the highest, underlining the second highest result.

                  Strict                 Fuzzy
Team              P      R      F1       P      R      F1
Cisteria          0.745  0.578  0.651    0.880  0.683  0.769
Ehrmama [27]      0.697  0.659  0.678    0.814  0.765  0.789
L3i [5]           0.790  0.805  0.797    0.870  0.886  0.878
Sbb [15]          0.499  0.484  0.491    0.730  0.708  0.719
SinNer [21]       0.658  0.658  0.658    0.775  0.819  0.796
UPB [7]           0.677  0.575  0.621    0.788  0.740  0.763
Uva-ilps [22]     0.499  0.556  0.526    0.689  0.768  0.726
Webis [26]        0.695  0.337  0.454    0.833  0.405  0.545
Baseline          0.643  0.378  0.476    0.790  0.464  0.558
Median            0.686  0.576  0.636    0.801  0.752  0.766

The evaluation w.r.t. the metonymic sense shows that our approach of training a separate model was constructive. In both regimes our system performs clearly above the median, and in the fuzzy regime our F1-score is the second best (see Table 8). Again the L3i system reaches the best scores, probably for the same reasons as mentioned above. Our results support our strategy of only making predictions for tokens where the literal sense is classified as an entity. Regarding precision, our system performs very well and reaches the second best performance in all cases, except for the fuzzy evaluation in the literal sense, where it performs best.
Unfortunately, the recall is relatively low, with around 50% for the metonymic sense and 57%/68% for the strict/fuzzy evaluation in the literal sense. Our system classifies tokens correctly once it identifies them as possible entities, but has problems finding the entities in the first place.

Table 8. Results for NERC-Coarse metonymic with micro precision, recall and F1-score. Bold font indicates the highest, underlining the second highest result.

                  Strict                 Fuzzy
Team              P      R      F1       P      R      F1
Cisteria          0.738  0.500  0.596    0.787  0.534  0.636
Ehrmama [27]      0.696  0.542  0.610    0.707  0.551  0.619
L3i [5]           0.571  0.712  0.634    0.626  0.780  0.694
Baseline          0.814  0.297  0.435    0.814  0.297  0.435

5 Future Work

The approach of the winning team suggests including multilingual language models and/or more data. Since many powerful pretrained language models are available, we will integrate some of them into Cisteria. Another strategy is to take the domain of historical language into account even more. Since there is a lot of noise in the data due to OCR, it differs greatly from modern standard language. Nevertheless, many modern corpora are available on which transformer-based language models can be trained. Our goal is to increase the similarity of those modern corpora to historical data; we therefore want to recreate some of the phenomena found in historical corpora in the modern corpora that we use for training the language models. Besides that, manual rule-based sentence segmentation can have drawbacks (e.g. bad segmentation can lead to short sentences). In future experiments we could therefore use the context before and after the actual training sentence, as in [18]. This approach could eliminate potential drawbacks of an automatically sentence-segmented training corpus, because shorter sentences would be enhanced with longer contexts.

6 Conclusion

We proposed a system to solve coarse-grained NER for German in the CLEF HIPE shared task. We conducted experiments with ensembling different word and subword embeddings as well as transformer-based language models on the basis of a bidirectional LSTM with a CRF as the final layer. To make the best use of historical resources, we trained large language models on historical German data, such as the German Europeana collection. Our best system uses FastText embeddings trained on German Wikipedia data in combination with a large German Bert language model. With a performance of 65.1% F1-score, our best system performs slightly better than the median in the strict scenario for the literal sense, and with an F1-score of 76.9% it is on par with the median in the fuzzy scenario. For the metonymic sense our best system performs clearly above the baseline and reaches the second best performance in the fuzzy scenario.

References

1. Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). pp. 54–59. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019), https://www.aclweb.org/anthology/N19-4010
2. Akbik, A., Bergmann, T., Vollgraf, R.: Pooled Contextualized Embeddings for Named Entity Recognition. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 724–728. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019), https://www.aclweb.org/anthology/N19-1078

3. Akbik, A., Blythe, D., Vollgraf, R.: Contextual String Embeddings for Sequence Labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 1638–1649. Association for Computational Linguistics, Santa Fe, New Mexico, USA (Aug 2018), https://www.aclweb.org/anthology/C18-1139

4. Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., Auli, M.: Cloze-driven Pretraining of Self-attention Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China (Nov 2019), https://www.aclweb.org/anthology/D19-1539

5. Boros, E., Linhares Pontes, E., Cabrera-Diego, L.A., Hamdi, A., Moreno, J.G., Sidère, N., Doucet, A.: Robust Named Entity Recognition and Linking on Historical Multilingual Documents. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)

6. Brunner, A., Engelberg, S., Jannidis, F., Tu, N.D.T., Weimer, L.: Corpus REDEWIEDERGABE. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 803–812. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.100

7. Craita, C.C., Cercel, D.C.: Multilingual Named Entity Recognition on Historical Texts Using Transfer and Multi-Task Learning. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)

8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019), https://www.aclweb.org/anthology/N19-1423

9. Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: HIPE - Shared Task Participation Guidelines (v1.1) (2020). https://doi.org/10.5281/zenodo.3677171

10. Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Impresso Named Entity Annotation Guidelines (Jan 2020). https://doi.org/10.5281/zenodo.3604227

11. Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020). Lecture Notes in Computer Science (LNCS), vol. 12260. Springer (2020)
12. Heinzerling, B., Strube, M.: BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In: Calzolari, N. (Conference Chair), Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S., Tokunaga, T. (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan (May 2018)

13. Heinzerling, B., Strube, M.: Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 273–291. Association for Computational Linguistics, Florence, Italy (Jul 2019), https://www.aclweb.org/anthology/P19-1027

14. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991 (2015)

15. Labusch, K., Neudecker, C.: Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)

16. Labusch, K., Neudecker, C., Zellhöfer, D.: BERT for Named Entity Recognition in Contemporary and Historic German. In: Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers. pp. 1–9. German Society for Computational Linguistics & Language Technology, Erlangen, Germany (2019)

17. Lee, C., Cho, K., Kang, W.: Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=HkgaETNtDB

18. Luoma, J., Pyysalo, S.: Exploring Cross-sentence Contexts for Named Entity Recognition with BERT. arXiv e-prints arXiv:2006.01563 (Jun 2020)

19. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in Pre-Training Distributed Word Representations. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)

20. Neudecker, C.: An Open Corpus for Named Entity Recognition in Historic Newspapers. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). pp. 4348–4352. European Language Resources Association (ELRA), Portorož, Slovenia (May 2016), https://www.aclweb.org/anthology/L16-1689

21. Ortiz Suárez, P.J., Dupont, Y., Lejeune, G., Tian, T.: SinNer@Clef-Hipe2020: Sinful adaptation of SotA models for Named Entity Recognition in historical French and German newspapers. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)

22. Provatorova, V., Vakulenko, S., Kanoulas, E., Dercksen, K., van Hulst, J.M.: CLEF HIPE Working Notes: UvA ILPS & REL. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)

23. Schweter, S., Baiter, J.: Towards Robust Named Entity Recognition for Historic German. In: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). pp. 96–103. Association for Computational Linguistics, Florence, Italy (Aug 2019), https://www.aclweb.org/anthology/W19-4312

24. Sennrich, R., Haddow, B., Birch, A.: Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1715–1725. Association for Computational Linguistics, Berlin, Germany (Aug 2016), https://www.aclweb.org/anthology/P16-1162

25. Souza, F., Nogueira, R., Lotufo, R.: Portuguese Named Entity Recognition using BERT-CRF (2019)
26. Tobollik, T., Wiegmann, M., Wolska, M., Stein, B.: Enrichment-based Oversampling for Coarse-grained NER in Historical Text. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)

27. Todorov, K., Colavizza, G.: Transfer Learning for Named Entity Recognition in Historical Corpora. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)

28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. CoRR abs/1706.03762 (2017), http://arxiv.org/abs/1706.03762

29. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv e-prints arXiv:1910.03771 (Oct 2019)