hmBERT: Historical Multilingual Language Models for Named Entity Recognition

Stefan Schweter1, Luisa März2,3, Katharina Schmid1 and Erion Çano2
1 Bayerische Staatsbibliothek München, Digital Library / Munich Digitization Center, Munich, Germany
2 Digital Philology, Research Group Data Mining and Machine Learning, University of Vienna, Austria
3 Natural Language Processing Expert Center, Data:Lab, Volkswagen AG, Munich, Germany

Abstract
Compared to standard Named Entity Recognition (NER), identifying persons, locations, and organizations in historical texts poses a major challenge. To obtain machine-readable corpora, the historical text is usually scanned and Optical Character Recognition (OCR) needs to be performed. As a result, the historical corpora contain errors. Entities like locations or organizations can also change over time, which poses another challenge. Overall, historical texts come with several peculiarities that differ greatly from modern texts, and large labeled corpora for training a neural tagger are hardly available for this domain. In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models. We circumvent the need for large amounts of labeled data by using unlabeled data for pretraining a language model. We propose hmBert, a historical multilingual BERT-based language model, and release the model in several versions of different sizes. Furthermore, we evaluate the capability of hmBert by solving downstream NER as part of this year's HIPE-2022 shared task and provide detailed analysis and insights. For the Multilingual Classical Commentary coarse-grained NER challenge, our tagger HISTeria outperforms the other teams' models for two out of three languages.

Keywords
Named Entity Recognition, historical NER, Transformer-based language models, Historical texts, Flair

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
ORCID: 0000-0002-7190-2090 (S. Schweter); 0000-0003-4542-2437 (L. März); 0000-0001-6057-6640 (K. Schmid); 0000-0002-5496-3860 (E. Çano)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction
Standard Named Entity Recognition (NER) for high-resource domains has already been successfully addressed with performances above 90% F1-score [1, 2]. In contrast, NER taggers often fail to achieve satisfying results in the historical domain. Since historical datasets usually stem from Optical Character Recognition (OCR) and also include domain shifts, they contain characteristic errors not found in modern text. Low-resource fonts like Fraktur pose additional challenges for clean OCR. Another problem is that large amounts of labeled data are required when training neural models, and only little labeled data exists for historical NER [3]. Because of all these challenges, systems designed for contemporary datasets cannot be applied to the historical domain without adaptations or further training. However, in the last few years, a number of works have shown that it is possible to adapt such systems using different approaches [4].

In this work, we develop a new BERT-based language model [5] for the historical context: hmBert. We tackle NER for historical German, English, French, Swedish, and Finnish.
We use self-supervised learning to pretrain our language model on unlabeled data before we fine-tune the NER tagger on labeled data. This reduces the need for large amounts of labeled training data. To mitigate the impact of OCR noise in the pretraining corpora, we use a filtering step that controls for the OCR confidence of the texts.

Another design decision in training language models is the choice of the underlying vocabulary. Beltagy et al. [6] showed that using a domain-specific vocabulary leads to performance improvements compared to using a general-domain vocabulary. Thus, we use a sub-corpus of our pretraining corpus to create an in-domain vocabulary for the hmBert training. Finally, we arrive at a powerful hmBert model that establishes state-of-the-art results for three out of four languages on the NewsEye NER dataset [7]. As large language models require a lot of computational resources for pretraining and during inference time, we also provide smaller models.

Addressing the HIPE-2022 NERC-Coarse task, we also study a single-model vs. one-model approach. Our comparison shows that fine-tuning hmBert models for each language individually (single-model approach) improves performance compared to models that were fine-tuned on data from all languages (one-model approach). At the same time, fine-tuning individual models is much more computationally expensive; the one-model approach achieves similar performance while requiring far less computation. In addition, our final model HISTeria is trained using multi-stage fine-tuning. We first fine-tune the multilingual model and evaluate it over the development data of all available languages. The resulting hyperparameter configuration is used for another fine-tuning step for each monolingual model. Finally, our detailed study of hmBert also includes experiments with a knowledge-based approach, training an ELECTRA-based language model [8], and addressing a tokenization issue. These additional experiments did not enhance performance but represent a suitable starting point for further research.

Our contributions are i) the comprehensive description of the development of hmBert, ii) the release of hmBert models of different sizes, iii) the release of the hmBert pretraining code, and iv) extensive experiments using hmBert, including detailed insights for the community.

This paper is structured as follows: the next section (2) describes hmBert and its development in detail. We include an explanation of the used datasets, as well as processing, hyperparameter settings, and pretraining steps. We close the section with a downstream task evaluation. Section 3 provides insights into the HIPE-2022 Multilingual Classical Commentary Challenge1 [9]. We describe our approach for the shared task submission in detail and provide an analysis of our results. We conclude the paper with Section 4.

1 https://hipe-eval.github.io/HIPE-2022/

2. hmBert: Historical Multilingual BERT Model
In this section, we present hmBert, which supports German, English, French, Finnish, and Swedish. We train two different models with different vocabulary sizes: 32,000 and 64,000. We first describe the corpora used for training hmBert, as well as the preprocessing and filtering steps to create the pretraining corpus.
In addition, we explain the pretraining process and end the section by evaluating the model on a downstream NER task. Figure 1 shows the overall pretraining procedure for hmBert and its application to downstream tasks.

Figure 1: Overall pretraining of our hmBERT32k model and fine-tuning procedure for NER downstream tasks. CLS is a special symbol added in front of every input example, and SEP is a special separator token.

2.1. Corpora
For German, French, Swedish and Finnish we use the Europeana newspapers2 provided by the European Library. For English we use a dataset published by the British Library [10]. The dataset contains OCR-processed text from digitized books and has also been used by Hosseini et al. [11] to train historical language models for English.

2 http://www.europeana-newspapers.eu/

2.1.1. Filtering
The OCR full-text for the Europeana newspapers also includes an OCR confidence value. This measure indicates the average OCR confidence for each word of a newspaper3. For German and French we perform a characters-per-year analysis using different (minimum required) OCR confidence thresholds. For German, we test three different thresholds and report the resulting dataset size (see Table 15 in the appendix). We use an OCR confidence threshold of 0.60 to get a final dataset of approx. 28 GB. For French, we test five different OCR confidence values (see Table 16 in the appendix) and choose 0.70, so that the resulting dataset size of 27 GB is comparable to the size of the German dataset. For Finnish and Swedish, we use an OCR confidence threshold of 0.60. However, training data for Swedish and Finnish is very limited: in total, only 1.2 GB for Finnish and 1.1 GB for Swedish are available, thus these corpora are not filtered any further using other OCR confidence thresholds. For English, we perform language filtering using langdetect4 for each book in the corpus. Additionally, we exclusively use books published between 1800 and 1900. The resulting English corpus has a total size of 24 GB.

3 https://www.clarin.eu/sites/default/files/Nuno_Freire_Europeana_CLARINPLUS.pdf
4 https://github.com/Mimino666/langdetect

To get a deeper insight into the filtered corpora, we analyze the distribution of characters over time for each language. Figure 2 shows the distribution for German. The period from 1865 to 1914 is well-covered in the dataset, while the years from 1683 to 1849 and the 20th century are underrepresented. For French, the 20th century is highly covered, but there is only little data available for the 19th century (see Figure 3), which contrasts with the German corpus. The English corpus contains texts from the 19th century only and shows good coverage starting from 1850. However, there is only little coverage from 1800 to 1849 (see Figure 4). Since both the Finnish and Swedish corpora include newspapers from 1900 to 1910 only, we do not analyze the number of characters per year for these datasets.

Figure 2: Number of characters per year distribution for the filtered German Europeana corpus (1683-1949).
Figure 3: Number of characters per year distribution for the filtered French Europeana corpus (1814-1944).
Figure 4: Number of characters per year distribution for the filtered English corpus from the British Library (1800-1899).

2.1.2. Multilingual Vocabulary Generation
To create a BERT-compatible wordpiece-based vocabulary [12], we use 10 GB of each language and train the vocabulary using the Hugging Face Tokenizers library5. We build a cased vocabulary; no lower casing or accent stripping is performed. For Finnish and Swedish we need to upsample6 the corpus, because both corpora have a size of only 1 GB. We create a 32k and a 64k vocabulary.

5 https://github.com/huggingface/tokenizers
6 For upsampling we simply concatenate the original corpus N times to match the desired 10 GB size per language.
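As an illustration of this step, the following sketch shows how such a cased wordpiece vocabulary could be trained with the Hugging Face Tokenizers library. The file paths and the additional trainer settings are illustrative assumptions, not the exact configuration used for hmBert.

```python
import os
from tokenizers import BertWordPieceTokenizer

# One 10 GB sample per language (paths are placeholders).
files = [
    "corpus/de_sample.txt", "corpus/fr_sample.txt", "corpus/en_sample.txt",
    "corpus/fi_upsampled.txt", "corpus/sv_upsampled.txt",
]

# Cased vocabulary: neither lower casing nor accent stripping is applied.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)

tokenizer.train(
    files=files,
    vocab_size=32_000,  # a second run with vocab_size=64_000 yields the 64k vocabulary
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, which can be used for BERT pretraining and loaded by BertTokenizer.
os.makedirs("hmbert-32k-vocab", exist_ok=True)
tokenizer.save_model("hmbert-32k-vocab")
```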
Inspired by Rust et al. [13], we report the subword fertility rate (SFR) and the portion of unknown (UNK) tokens per language on various historical NER datasets (see Table 1). The SFR is defined as the average number of subwords a tokenizer produces per word [13]. It indicates how aggressively a tokenizer splits, i.e. whether it over-segments or not. As over-segmentation can negatively impact downstream performance, an SFR close to 1 (indicating that the tokenizer vocabulary contains every word in the input text) is optimal. UNK tokens are challenging because such tokens are not seen during pretraining and the model cannot provide useful information for them during the fine-tuning phase [14]. Table 2 and Table 3 show the SFR and portion of UNKs for the 32k and 64k vocabularies. French and English have the lowest SFRs, whereas Finnish has the highest rate in both wordpiece-based vocabularies.

Table 1
NER datasets that are used for calculating the subword fertility rate and portion of UNKs. For English, the development dataset was used due to a missing training split.

Language   NER Corpora
German     CLEF-HIPE-2020 [15], NewsEye [7]
French     CLEF-HIPE-2020 [15], NewsEye [7]
English    CLEF-HIPE-2020 [15]
Finnish    NewsEye [7]
Swedish    NewsEye [7]

Table 2
Subword fertility rate and portion of UNKs calculated on NER datasets using the 32k wordpiece-based vocabulary.

Language   Subword Fertility   UNK Portion
German     1.43                0.0004
French     1.25                0.0001
English    1.25                0.0
Finnish    1.69                0.0007
Swedish    1.43                0.0

Table 3
Subword fertility rate and portion of UNKs calculated on NER datasets using the 64k wordpiece-based vocabulary.

Language   Subword Fertility   UNK Portion
German     1.31                0.0004
French     1.16                0.0001
English    1.17                0.0
Finnish    1.54                0.0007
Swedish    1.32                0.0
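As a rough illustration, both statistics can be computed directly with the trained tokenizer; the vocabulary path and the example word list below are made up, and in the paper the statistics are computed over the NER datasets from Table 1.

```python
from tokenizers import BertWordPieceTokenizer

# Load the cased vocabulary trained above (path is an assumption).
tokenizer = BertWordPieceTokenizer("hmbert-32k-vocab/vocab.txt",
                                   lowercase=False, strip_accents=False)

def fertility_and_unk_portion(words):
    """Subword fertility = subwords per word; UNK portion = share of [UNK] pieces."""
    n_words = n_subwords = n_unks = 0
    for word in words:
        pieces = tokenizer.encode(word, add_special_tokens=False).tokens
        n_words += 1
        n_subwords += len(pieces)
        n_unks += pieces.count("[UNK]")
    return n_subwords / n_words, n_unks / n_subwords

print(fertility_and_unk_portion(["Wien", "Zeitungsartikel", "1899"]))
```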
2.2. Final Pretraining Corpus
For common multilingual models such as multilingual BERT [mBERT; 5], XLM-RoBERTa [16] or mT5 [17], different corpus sampling strategies have been developed to up-/downsample low-/high-resource languages [18]. Since our multilingual language model includes only five languages (mBERT covers 104 languages7), we use a similar size for all languages. After upsampling the Swedish and Finnish corpora to 27 GB each, we arrive at a total dataset size of 130 GB. Table 4 shows an overview of the sizes per language included in our final pretraining corpus. For the hmBert model with a vocabulary size of 32k, we use the official BERT implementation8 to create pretraining data. A detailed description of all parameters used for the creation of the pretraining data can be found in Section A.2 of the appendix.

7 https://github.com/google-research/bert/blob/master/multilingual.md
8 https://github.com/google-research/bert#pre-training-with-bert

Table 4
Size per language of the final pretraining corpus for hmBert.

Language   Dataset Size
German     28GB
French     27GB
English    24GB
Finnish    27GB
Swedish    27GB
Total      130GB

2.3. Models
We pretrain an hmBert model with a vocabulary size of 32k, further denoted as hmBERT32k, and another hmBert model with a vocabulary size of 64k, further denoted as hmBERT64k. Inspired by Hou et al. [19], we also pretrain and release smaller hmBert models, with the number of layers ranging from 2 to 8 and hidden sizes ranging from 128 to 512. Pretraining of the different models is described in detail in Section A.3 of the appendix.

2.4. Downstream Task Evaluation
We evaluate the hmBERT32k models on the NewsEye NER dataset [7], because this dataset includes most of the languages that hmBert covers (except English), and compare them with the current state-of-the-art reported by Hamdi et al. [20]. We use the Flair [21] library and perform a hyperparameter search (see Table 18 in the appendix) using the common fine-tuning paradigm. Fine-tuning adds a single linear layer to a Transformer and fine-tunes the entire architecture on the NER downstream task. To bridge the difference between subword modeling and token-level predictions, subword pooling is applied to create token-level representations which are then passed to the final linear layer. A common subword pooling strategy is to use the first subtoken to represent the entire token, and we also use this strategy in our experiments. To train our architecture, we use the AdamW [22] optimizer, a very small learning rate, and a fixed number of epochs as a hard-stopping criterion. We evaluate the model performance after each training epoch on the development set and use the best model (strict micro F1-score) for final evaluation. We adopt a one-cycle [23] training strategy, in which the learning rate linearly decreases until it reaches 0 by the end of training.
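In Flair, this fine-tuning setup corresponds roughly to the following sketch. The corpus location, column format, and the concrete hyperparameter values are illustrative assumptions; the actual values were determined with the search grid in Table 18.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style NER corpus, e.g. a German NewsEye split (path/columns are placeholders).
corpus = ColumnCorpus("data/newseye_de", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# hmBERT as backbone: the first subtoken represents each token ("first" pooling)
# and the whole transformer is fine-tuned together with a single linear tag head.
embeddings = TransformerWordEmbeddings(
    "dbmdz/bert-base-historic-multilingual-cased",
    subtoken_pooling="first",
    fine_tune=True,
)
tagger = SequenceTagger(
    hidden_size=256, embeddings=embeddings, tag_dictionary=label_dict,
    tag_type="ner", use_crf=False, use_rnn=False, reproject_embeddings=False,
)

# fine_tune() uses AdamW with a small learning rate that decays linearly towards 0.
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune("taggers/newseye-de",
                  learning_rate=5e-5, mini_batch_size=8, max_epochs=10)
```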
Table 5
Performance overview of hmBERT32k models on the German NewsEye NER dataset.

Model Name          Development F1-Score   Test F1-Score
hmBERT32k Tiny      30.16                  24.35
hmBERT32k Mini      35.74                  31.54
hmBERT32k Small     40.27                  39.04
hmBERT32k Medium    43.45                  43.41
hmBERT32k Base      46.17                  46.66
Hamdi et al. [20]   -                      48.3

Table 6
Performance overview of hmBERT32k models on the French NewsEye NER dataset.

Model Name          Development F1-Score   Test F1-Score
hmBERT32k Tiny      60.04                  50.79
hmBERT32k Mini      70.55                  62.28
hmBERT32k Small     75.72                  69.02
hmBERT32k Medium    78.99                  72.51
hmBERT32k Base      81.58                  75.10
Hamdi et al. [20]   -                      72.7

Table 7
Performance overview of hmBERT32k models on the Finnish NewsEye NER dataset.

Model Name          Development F1-Score   Test F1-Score
hmBERT32k Tiny      30.37                  34.76
hmBERT32k Mini      56.60                  62.68
hmBERT32k Small     64.31                  73.20
hmBERT32k Medium    69.95                  76.34
hmBERT32k Base      76.05                  80.11
Hamdi et al. [20]   -                      77.7

Table 8
Performance overview of hmBERT32k models on the Swedish NewsEye NER dataset.

Model Name          Development F1-Score   Test F1-Score
hmBERT32k Tiny      43.65                  38.91
hmBERT32k Mini      64.05                  65.58
hmBERT32k Small     73.47                  76.29
hmBERT32k Medium    78.07                  82.47
hmBERT32k Base      81.13                  83.60
Hamdi et al. [20]   -                      81.5

Tables 5-8 show the performance of our hmBERT32k models compared to the current state-of-the-art. For German, even the hmBERT32k base model could not reach the performance reported by Hamdi et al. [20], which was based on the models developed by Boros et al. [24]. The performance difference is 1.64 percentage points. This could be due to the fact that the German NewsEye dataset is very large and the hyperparameter search would have needed to be extended. Furthermore, Hamdi et al. [20] proposed a new architecture for handling OCR errors by adding two extra transformer layers, whereas we only performed a standard fine-tuning approach. For French, our hmBERT32k medium-sized model is very close to the result reported by Hamdi et al. [20], and the hmBERT32k base model outperforms the current best result by 2.7 percentage points. The same performance gain can be observed for Finnish and Swedish: the hmBERT32k base model outperforms the current state-of-the-art by 2.41 percentage points for Finnish and 2.1 percentage points for the Swedish NewsEye dataset.

Figure 5: Overview of performance of hmBERT32k smaller models on the NewsEye NER datasets. F1-score on the test set is reported here.

Figure 5 shows an overall performance comparison for the pretrained hmBERT32k smaller models on the NewsEye dataset. On average, the performance difference between the 8-layer hmBERT32k medium and the 12-layer hmBERT32k base model is 2.7 percentage points.

3. HIPE-2022: Multilingual Classical Commentary Challenge
We participated in the Multilingual Classical Commentary Challenge (MCC) that was newly introduced in the 2022 edition of HIPE [25], with our tagger being denoted as HISTeria. The challenge requires participants to work with historical classical commentaries in at least two different languages and to develop solutions for Named Entity Recognition, Classification, and/or Linking. HISTeria aims to detect and classify named entities according to coarse-grained types (NERC-Coarse task) and is described in more detail in this section.

3.1. Data
A classical commentary is a scholarly publication that aims to facilitate the reading and understanding of classical works of literature by providing additional information such as translations or bibliographic references. Apart from the challenges that are common to historical texts, commentaries have other characteristics that may complicate Named Entity Recognition and Classification: they frequently cite the original literary text, making them inherently multilingual, and they often use abbreviations to convey information more concisely. For the Multilingual Classical Commentary Challenge, HIPE9 has chosen a single dataset that was created in the context of the Ajax MultiCommentary project10 (ajmc dataset). The dataset contains excerpts from commentaries published in the 19th century in English, French, and German. The French texts date from 1886, the German ones from 1853 and 1894, and the English ones from 1881 and 1896. This emphasis on the second half of the 19th century fits well with the temporal distribution of our pretraining data for English and German. Apart from standard entity types like person or location, the dataset also includes domain-specific annotations like the scope and work entity types for bibliographic references. Additional dataset statistics can be found in Table 9 and in the HIPE-2022 Overview paper [9].

9 https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-ajmc.md
10 https://mromanello.github.io/ajax-multi-commentary/

Table 9
Dataset statistics for the ajmc dataset.

Language   Training Sentences   Development Sentences
German     1,024                192
English    1,154                252
French     894                  202

3.2. Single-Models vs. One-Model Approach
In preliminary experiments, we compare models that are fine-tuned independently for each language (single-model approach) with a model that uses training data from all languages (one-model approach). We perform hyperparameter searches for the two approaches; the relevant hyperparameters for fine-tuning are shown in the appendix (Table 19). We use the Flair library for all experiments. For the one-model approach, a breakdown analysis for each language is performed after determining the best hyperparameter configuration. This is compared to the three independently fine-tuned models for each language. For German, the one-model approach is +0.47 percentage points better than the single-model approach. For English, the one-model approach performs slightly worse (-0.13 percentage points), and for French, the single-model approach outperforms the one-model approach by 0.6 percentage points. However, the single-model approach requires fine-tuning 120 models, whereas the one-model approach only needs 40 models to be fine-tuned for the hyperparameter search. To save resources, we decided to use the one-model approach for further experiments. The performance comparison on the ajmc dataset is shown in Table 10.

Table 10
Performance comparison for NERC-Coarse between the single-model and one-model approach on the ajmc development dataset. Numbers express F1-scores calculated using the strict evaluation regime.

Language   Single-Model   One-Model
German     86.21          86.68
English    84.98          84.85
French     85.69          85.09
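For the one-model approach, the three ajmc language subsets can simply be pooled before fine-tuning. A minimal sketch using Flair's MultiCorpus (data paths and column format are placeholders):

```python
from flair.data import MultiCorpus
from flair.datasets import ColumnCorpus

# Load the three ajmc language subsets.
ajmc_de = ColumnCorpus("data/ajmc_de", {0: "text", 1: "ner"})
ajmc_en = ColumnCorpus("data/ajmc_en", {0: "text", 1: "ner"})
ajmc_fr = ColumnCorpus("data/ajmc_fr", {0: "text", 1: "ner"})

# One-model approach: a single tagger is fine-tuned on the pooled training data
# and evaluated on the combined development sets of all three languages.
one_model_corpus = MultiCorpus([ajmc_de, ajmc_en, ajmc_fr])
label_dict = one_model_corpus.make_label_dictionary(label_type="ner")
```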
3.3. Multi-Stage Fine-Tuning
Wang et al. [26] proposed a knowledge-based system for multilingual NER using a multi-stage fine-tuning approach for the MultiCoNER SemEval 2022 task11. The first stage of multi-stage fine-tuning refers to training a multilingual model on data from different languages. In the second stage, this fine-tuned multilingual model is used as a starting point for training a monolingual model.

11 https://multiconer.github.io/

We adapt this approach for our final system: in the first stage, we fine-tune one multilingual model over the training data of all three languages (German, English, and French) and optimize over all development data (one-model approach) using a hyperparameter search. We select the best hyperparameter configuration as a combination of batch size, number of epochs, and learning rate, which results in five models (because of five different random seeds). The hyperparameter search grid for the different stages is shown in Section B.1 of the appendix. From these five models, we choose the one with the highest F1-score on the development set for the second-stage fine-tuning. In the second stage, we use the best model from the first stage and fine-tune single models for each language with a hyperparameter search on the development set. For each language, we select the best hyperparameter configuration and choose the best-performing model with the highest F1-score on the development set. In preliminary experiments, this multi-stage fine-tuning approach boosts performance by 1.23 percentage points on average compared to the results of the first stage.

For our final submission, hmBERT32k achieves the best results during the first stage of fine-tuning with a batch size of 4, 10 fine-tuning epochs, and a learning rate of 5e-05. This results in an average F1-score of 86.89 on the (combined) development sets for ajmc. The best hyperparameter configuration for hmBERT64k is achieved when using a batch size of 8, 10 epochs of fine-tuning, and a learning rate of 3e-05. This results in an overall F1-score of 86.69. Thus, hmBERT64k is slightly worse than hmBERT32k (-0.2 percentage points). Table 11 shows the performance of our final submissions using hmBERT32k and hmBERT64k for all languages in the ajmc dataset. We report strict and fuzzy F1-scores using the official HIPE scorer12 and exclude document-level scores for better readability.

12 https://github.com/hipe-eval/HIPE-scorer

Table 11
Final results on the ajmc development dataset for all languages using the best models after multi-stage fine-tuning. Results are reported with the official HIPE scorer.

Submission ID              Hyperparameter Configuration   Strict F1-Score   Fuzzy F1-Score
German (hmBERT32k) - 1     bs8-e05-lr3e-05                91.5              94.2
German (hmBERT64k) - 2     bs8-e10-lr3e-05                92.0              93.9
English (hmBERT32k) - 1    bs4-e10-lr3e-05                89.1              92.9
English (hmBERT64k) - 2    bs8-e10-lr3e-05                88.0              93.8
French (hmBERT32k) - 1     bs4-e10-lr3e-05                86.8              93.1
French (hmBERT64k) - 2     bs4-e10-lr5e-05                85.9              93.0
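The second stage can then be sketched by loading the saved first-stage tagger and continuing fine-tuning on a single language. Paths and hyperparameter values are illustrative; the actual values come from the grids in Section B.1.

```python
from flair.datasets import ColumnCorpus
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Second stage, shown here for German only: start from the saved multilingual
# first-stage model and continue fine-tuning on the monolingual ajmc corpus.
ajmc_de = ColumnCorpus("data/ajmc_de", {0: "text", 1: "ner"})  # placeholder path
stage2_tagger = SequenceTagger.load("taggers/ajmc-all/final-model.pt")  # stage-1 checkpoint

ModelTrainer(stage2_tagger, ajmc_de).fine_tune(
    "taggers/ajmc-de",   # output directory for the German second-stage model
    learning_rate=3e-5,  # drawn from the second-stage grid in Table 20
    mini_batch_size=8,
    max_epochs=10,
)
```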
3.4. HISTeria Results
Table 12 shows an overview of HISTeria compared to the runs of other teams in the HIPE-2022 shared task13.

13 https://github.com/hipe-eval/HIPE-2022-eval/

Table 12
Final results on the ajmc test dataset for all languages compared to other participants in the HIPE-2022 shared task. HISTeria denotes our system. Rank is ordered by strict F1-score.

Rank   Language   Submission ID              Strict F1-Score   Fuzzy F1-Score
1      German     L3i (team 2) - 2           93.4              95.2
2      German     HISTeria (hmBERT32k) - 1   91.3              93.7
3      German     HISTeria (hmBERT64k) - 2   91.2              94.5
4      German     L3i (team 2) - 1           90.8              93.4
5      German     Neural baseline            81.8              87.3
1      English    HISTeria (hmBERT64k) - 2   85.4              91.0
2      English    L3i (team 2) - 1           85.0              89.4
3      English    L3i (team 2) - 2           84.1              88.4
4      English    HISTeria (hmBERT32k) - 1   81.9              89.9
5      English    Neural baseline            73.6              82.8
1      French     HISTeria (hmBERT64k) - 2   84.2              88.0
2      French     HISTeria (hmBERT32k) - 1   83.3              88.8
3      French     L3i (team 2) - 2           82.6              87.2
4      French     L3i (team 2) - 1           79.8              86.0
5      French     Neural baseline            74.1              82.5

To gain a better understanding of our models, we use the attribute-aided evaluation proposed by Fu et al. [27]. In order to highlight the strengths and weaknesses of different models, they analyze how model performance varies with regard to certain attributes. In the case of NER, properties that may influence performance are i) how consistently a given surface form of a token or an entity is labelled across a dataset (tCon and eCon), ii) how often a given token or entity appears in the dataset (tFre and eFre), iii) the number of tokens that make up an entity (eLen) or sentence (sLen), as well as iv) the relative number of out-of-vocabulary words and entities per sentence (oDen and eDen). Using the implementation by Fu et al. [27], we distribute the values into buckets and compute the strict F1-score for each bucket. Table 13 shows Spearman's rank correlation coefficient as a measure of how well an attribute correlates with the F1-score, and the standard deviation of the F1-score to indicate how strongly the attribute influences performance. We omit results that are not statistically significant.

Table 13
Spearman's rank correlation coefficient and standard deviation of the models' F1-score depending on different attribute values. We omit results that are not statistically significant.

Model               Attribute   Spearman   Standard Deviation
English hmBERT32k   tCon        1.0        0.09
                    eLen        -1.0       0.06
                    oDen        -1.0       0.09
English hmBERT64k   tCon        1.0        0.08
                    eLen        -1.0       0.01
French hmBERT32k    tCon        1.0        0.10
                    eLen        -1.0       0.10
French hmBERT64k    tCon        1.0        0.10

For the two German models, none of the attributes correlate with performance in a statistically significant way. For the English and French models, performance correlates positively with the consistency of the token labels (tCon). The standard deviation of 10% (French) and 8-9% (English) of the F1-score indicates that this attribute has a marked impact on performance. For French hmBERT32k, entity length (eLen) influences performance to the same degree; in this case, performance gets worse the more tokens an entity has. The impact of entity length on English hmBERT32k and hmBERT64k is less strong but still notable (standard deviation of 6% and 1%, respectively). In addition to entity length, the number of words that did not appear in the training set (oDen) also correlates negatively with the performance of English hmBERT32k.
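The analysis behind Table 13 boils down to bucketing an attribute, measuring the strict F1-score per bucket, and correlating the two. A schematic version with made-up bucket scores:

```python
from statistics import stdev
from scipy.stats import spearmanr

# Hypothetical strict F1-scores per entity-length bucket (1, 2, 3, 4+ tokens).
bucket_values = [1, 2, 3, 4]
bucket_f1 = [0.91, 0.85, 0.78, 0.70]

rho, p_value = spearmanr(bucket_values, bucket_f1)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}, std={stdev(bucket_f1):.2f}")
# A negative rho indicates that performance drops as entities get longer; the
# standard deviation of the bucket scores shows how strong that effect is.
```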
3.5. Challenges
We also experimented with the knowledge-based system for multilingual NER that was proposed by Wang et al. [26]. We used their implementation to enrich the original ajmc datasets with a knowledge base and implemented their context approach in the Flair library. More precisely, we used the FLERT approach [28] and utilized the knowledge-base-enriched context as the left context for each training example. A left context size of 128 performed best in our experiments. However, the final result was slightly worse than using no context at all. This may be due to the fact that a contemporary, general-purpose knowledge base (Wikipedia) was used; a domain-specific knowledge base may yield better results. As the preliminary results were slightly worse than our main baseline, we did not conduct further experiments with this knowledge-based system.

We calculated the portion of UNKs in the German ajmc dataset and found that a rate of 16.3% is unreasonably high. We discovered that the German ajmc dataset contains long-s characters ("ſ"), unlike the Europeana newspaper corpora that were used to train the vocabulary. As a consequence, the hmBert tokenizer is not able to handle tokens that include long-s characters, resulting in UNKs. For our final system, we manually replaced all long-s characters with a normal "s" character to circumvent the UNK problem. In upcoming versions of our hmBert models, we will add this replacement step directly to the tokenizer configuration.

Furthermore, we also trained an ELECTRA model [8] for 1M steps on the same pretraining corpus as the hmBERT32k model. We found that the downstream performance on the NewsEye datasets was 1 to 3 percentage points worse than hmBERT32k, and 0.28 percentage points worse on the ajmc dataset. We have therefore decided not to release this model yet.

3.6. Community Contributions
To foster research on language and NER models for the historical domain, we publicly release our pretrained and fine-tuned models on the Hugging Face Model Hub14 under the dbmdz namespace15. We also publicly release all code that was used for fine-tuning the models16. Table 14 shows an overview of the released models for our HIPE-2022 submission, including the model identifier on the Hugging Face Model Hub. All models are released under a permissive MIT license. Additionally, we added dataset support for all HIPE-2022 NER datasets to the Flair library17.

14 https://huggingface.co/
15 https://huggingface.co/dbmdz
16 https://github.com/dbmdz/clef-hipe
17 Added in Flair version 0.11: https://github.com/flairNLP/flair/releases/tag/v0.11

Table 14
Community contributions for our HIPE-2022 submission: pretrained language models and fine-tuned NER models are publicly available on the Hugging Face Model Hub.

Model Description                         Model Name
hmBERT32k Tiny Model                      dbmdz/bert-tiny-historic-multilingual-cased
hmBERT32k Mini Model                      dbmdz/bert-mini-historic-multilingual-cased
hmBERT32k Small Model                     dbmdz/bert-small-historic-multilingual-cased
hmBERT32k Medium Model                    dbmdz/bert-medium-historic-multilingual-cased
hmBERT32k Base Model                      dbmdz/bert-base-historic-multilingual-cased
hmBERT64k Base Model                      dbmdz/bert-base-historic-multilingual-64k-td-cased
NER First Stage (hmBERT32k)               dbmdz/flair-hipe-2022-ajmc-all
NER First Stage (hmBERT64k)               dbmdz/flair-hipe-2022-ajmc-all-64k
NER Second Stage - German (hmBERT32k)     dbmdz/flair-hipe-2022-ajmc-de
NER Second Stage - English (hmBERT32k)    dbmdz/flair-hipe-2022-ajmc-en
NER Second Stage - French (hmBERT32k)     dbmdz/flair-hipe-2022-ajmc-fr
NER Second Stage - German (hmBERT64k)     dbmdz/flair-hipe-2022-ajmc-de-64k
NER Second Stage - English (hmBERT64k)    dbmdz/flair-hipe-2022-ajmc-en-64k
NER Second Stage - French (hmBERT64k)     dbmdz/flair-hipe-2022-ajmc-fr-64k
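All released models can be loaded directly from the Hub. The following sketch (model identifiers taken from Table 14; the example sentence is made up) loads the base language model with Transformers and one of the fine-tuned NER models with Flair:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from flair.data import Sentence
from flair.models import SequenceTagger

# Pretrained historical language model (see Table 14).
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-historic-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-historic-multilingual-cased")

# Second-stage German NER model fine-tuned on ajmc.
tagger = SequenceTagger.load("dbmdz/flair-hipe-2022-ajmc-de")
sentence = Sentence("Der Dichter Sophokles wurde in Athen verehrt .")
tagger.predict(sentence)
print(sentence.get_spans("ner"))
```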
4. Conclusion
We presented hmBert, a new multilingual BERT-based language model for historical data. hmBert is trained on unsupervised German, French, English, Finnish, and Swedish corpora of historical OCR-processed texts. The corpora have been filtered for OCR confidence as well as sampled so that each language contributes a similar amount of data to the model. The underlying vocabulary is also derived from each of the languages used for hmBert. In our temporal analysis of the pretraining corpora, we found that data from the 18th and 19th century is unevenly distributed across the different languages. For future models, we are looking for additional datasets to balance this representation. We evaluated two hmBert models of different sizes on downstream Named Entity Recognition. For the NewsEye dataset, hmBert established a new state-of-the-art for three out of four languages: French, Finnish, and Swedish. For the 2022 HIPE Multilingual Classical Commentary Challenge, our HISTeria system outperformed the other systems for two out of three languages. Using multi-stage fine-tuning together with the multilingual BERT-based model led the model to its optimal performance. Our detailed analysis showed the benefits of hmBert's design choices, as well as interesting findings for future research. Our contributions include all of the trained hmBert models and our source code, which are made publicly available.

Acknowledgments
We would like to thank Google's TPU Research Cloud (TRC) program for giving us access to TPUs that were used for training our hmBert models. We would also like to thank Hugging Face for providing the ability to host our models and perform inference on the Hugging Face Model Hub.

References
[1] A. Akbik, T. Bergmann, R. Vollgraf, Pooled Contextualized Embeddings for Named Entity Recognition, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 724–728. doi:10.18653/v1/N19-1078.
[2] X. Wang, Y. Jiang, N. Bach, T. Wang, Z. Huang, F. Huang, K. Tu, Automated Concatenation of Embeddings for Structured Prediction, in: the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Association for Computational Linguistics, 2021.
[3] M. Ehrmann, G. Colavizza, Y. Rochat, F. Kaplan, Diachronic Evaluation of NER Systems on Old Newspapers, Bochumer Linguistische Arbeitsberichte, Bochum, Germany, 2016, pp. 97–107. URL: http://infoscience.epfl.ch/record/221391.
[4] M. Ehrmann, A. Hamdi, E. Linhares Pontes, M. Romanello, A. Doucet, A Survey of Named Entity Recognition and Classification in Historical Documents, ACM Computing Surveys (2022 (to appear)). URL: https://arxiv.org/abs/2109.11406.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[6] I. Beltagy, K. Lo, A. Cohan, SciBERT: A Pretrained Language Model for Scientific Text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. doi:10.18653/v1/D19-1371.
[7] A. Hamdi, E. L. Pontes, E. Boros, T. T. H. Nguyen, G. Hackl, J. G. Moreno, A. Doucet, Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers, 2021. doi:10.5281/zenodo.4573313.
[8] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=r1xMH1BtvB.
[9] M. Ehrmann, M. Romanello, S. Najem-Meyer, A. Doucet, S. Clematide, Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science (LNCS), Springer, 2022.
[10] B. L. Labs, Digitised books. c. 1510 - c. 1946. JSON (OCR derived text), 2016. doi:10.21250/DB14.
[11] K. Hosseini, K. Beelen, G. Colavizza, M. Coll Ardanuy, Neural Language Models for Nineteenth-Century English, arXiv e-prints (2021). arXiv:2105.11321.
[12] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, J. Dean, Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, CoRR abs/1609.08144 (2016). URL: http://arxiv.org/abs/1609.08144. arXiv:1609.08144.
[13] P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, I. Gurevych, How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3118–3135. doi:10.18653/v1/2021.acl-long.243.
[14] J. Pfeiffer, I. Vulić, I. Gurevych, S. Ruder, UNKs Everywhere: Adapting Multilingual Language Models to New Scripts, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 10186–10203. doi:10.18653/v1/2021.emnlp-main.800.
[15] M. Ehrmann, M. Romanello, A. Flückiger, S. Clematide, Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers, volume 2696 of CEUR Workshop Proceedings, CEUR-WS, 2020, p. 38. doi:10.5281/zenodo.4117566.
[16] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. doi:10.18653/v1/2020.acl-main.747.
[17] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 483–498. doi:10.18653/v1/2021.naacl-main.41.
[18] A. Conneau, G. Lample, Cross-lingual Language Model Pretraining, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf.
[19] L. Hou, R. Yuanzhe Pang, T. Zhou, Y. Wu, X. Song, X. Song, D. Zhou, Token Dropping for Efficient BERT Pretraining, arXiv e-prints (2022). arXiv:2203.13240.
[20] A. Hamdi, E. Boroş, E. L. Pontes, T. T. H. Nguyen, G. Hackl, J. G. Moreno, A. Doucet, A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021.
[21] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An easy-to-use framework for state-of-the-art NLP, in: NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 54–59.
[22] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.
[23] L. N. Smith, A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay, arXiv e-prints (2018). arXiv:1803.09820.
[24] E. Boros, A. Hamdi, E. Linhares Pontes, L. A. Cabrera-Diego, J. G. Moreno, N. Sidere, A. Doucet, Alleviating Digitization Errors in Named Entity Recognition for Historical Documents, in: Proceedings of the 24th Conference on Computational Natural Language Learning, ACL, 2020, pp. 431–441. URL: https://www.aclweb.org/anthology/2020.conll-1.35.
[25] M. Ehrmann, M. Romanello, S. Clematide, A. Doucet, Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents, in: Proceedings of the 44th European Conference on IR Research (ECIR 2022), Lecture Notes in Computer Science, Springer, Stavanger, Norway, 2022. URL: https://link.springer.com/chapter/10.1007/978-3-030-99739-7_44.
[26] X. Wang, Y. Shen, J. Cai, T. Wang, X. Wang, P. Xie, F. Huang, W. Lu, Y. Zhuang, K. Tu, W. Lu, Y. Jiang, DAMO-NLP at SemEval-2022 Task 11: A Knowledge-based System for Multilingual Named Entity Recognition (2022). URL: https://arxiv.org/abs/2112.06482. arXiv:2203.00545.
[27] J. Fu, P. Liu, G. Neubig, Interpretable Multi-dataset Evaluation for Named Entity Recognition, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6058–6069. doi:10.18653/v1/2020.emnlp-main.489.
[28] S. Schweter, A. Akbik, FLERT: Document-Level Features for Named Entity Recognition, 2020. arXiv:2011.06993.
[29] S. Schweter, BERTurk - BERT models for Turkish, 2020. doi:10.5281/zenodo.3770924.
[30] L. Hou, R. Y. Pang, T. Zhou, Y. Wu, X. Song, X. Song, D. Zhou, Token Dropping for Efficient BERT Pretraining, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3774–3784. URL: https://aclanthology.org/2022.acl-long.262.

A. hmBERT: Historical Multilingual BERT Model

A.1. Corpora Filtering

Table 15
Word-level OCR confidence thresholds for German. The bold OCR confidence (0.60) is used for the final corpus.

OCR Confidence   Dataset Size
0.60             28GB
0.65             18GB
0.70             13GB

Table 16
Word-level OCR confidence thresholds for French. The bold OCR confidence (0.70) is used for the final corpus.

OCR Confidence   Dataset Size
0.60             31GB
0.65             27GB
0.70             27GB
0.75             23GB
0.80             11GB

A.2. Final Pretraining Corpus
For the creation of the pretraining data, we use the same parameters as BERTurk [29]: maximum sequence length = 512, maximum predictions per sequence = 75, masked language probability rate = 0.15, duplication factor = 5. Due to hardware limitations, we split the pretraining corpus into chunks of 1 GB and create pretraining data for each chunk individually. For the hmBert model with a vocabulary size of 64k, we use the official implementation18 with the same parameters as for the 32k model, but we increase the maximum predictions per sequence to 76.

18 https://github.com/tensorflow/models/blob/27fb855b027ead16d2616dcb59c67409a2176b7f/official/legacy/bert/README.md#pre-training

A.3. Models
We use the official BERT implementation19 for pretraining hmBERT32k. hmBERT64k is trained with the recently proposed "token dropping" approach by Hou et al. [30]. With this approach, unimportant tokens are dropped starting from an intermediate layer in the model, so that the model focuses on important tokens more efficiently, which makes model pretraining faster compared to the original BERT implementation. For both pretraining approaches, we use a maximum sequence length of 512 for the full training time. For the pretraining of hmBERT32k, a batch size of 128 is used for 3M training steps. Pretraining was done on a v3-32 TPU pod within 67 hours. The pretraining of hmBERT64k was done on a single v4-8 TPU with a batch size of 512 for 1M steps within 114 hours. Figure 6 shows the pretraining loss for hmBERT32k and hmBERT64k. The final hmBERT32k has 110.62M parameters, whereas hmBERT64k has 135.19M parameters due to the increased vocabulary size.

19 https://github.com/google-research/bert

For better comparability, we measure the total number of subtokens seen during pretraining20 and the total number of subtokens of the pretraining corpus for our two hmBert models. More precisely, hmBERT32k has seen 196B subtokens during pretraining, whereas the pretraining corpus has a total size of 42B subtokens. This results in 4.7 pretraining epochs over the corpus. Our hmBERT64k model has seen 262B subtokens during pretraining. Because of the larger vocabulary size, the number of subtokens for the corpus is 39B, which results in 6.7 pretraining epochs over the corpus.

20 The total number of subtokens seen during pretraining can be calculated as the product of training steps, batch size, and sequence length.
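A quick back-of-the-envelope check of these numbers, using the formula from the footnote:

```python
# Subtokens seen during pretraining = training steps * batch size * sequence length.
seen_32k = 3_000_000 * 128 * 512   # ~196.6B, reported as 196B
seen_64k = 1_000_000 * 512 * 512   # ~262.1B, reported as 262B

# Pretraining epochs over the corpus = subtokens seen / subtokens in the corpus.
print(seen_32k / 42e9)  # ~4.7 epochs for hmBERT32k
print(seen_64k / 39e9)  # ~6.7 epochs for hmBERT64k
```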
Figure 6: Overview of pretraining loss for hmBERT32k and hmBERT64k.

For the smaller models, we use the same pretraining data and hyperparameters as for the base hmBERT32k model and pretrain them on a v3-32 TPU pod. Table 17 shows an overview of the pretrained models, including their model size, number of parameters, and pretraining time. Figure 7 shows an overview of the pretraining loss for all smaller hmBERT32k models.

Table 17
Overview of the smaller hmBERT32k models with their corresponding model size, number of parameters, and pretraining time.

Model Name         Number of Layers   Hidden Size   Parameters   Pretraining Time
hmBERT32k Tiny     2                  128           4.58M        4.3s / 1k steps
hmBERT32k Mini     4                  256           11.55M       10.5s / 1k steps
hmBERT32k Small    4                  512           29.52M       20.7s / 1k steps
hmBERT32k Medium   8                  512           42.13M       35.0s / 1k steps
hmBERT32k Base     12                 768           110.62M      80.0s / 1k steps

Figure 7: Overview of pretraining loss for the smaller hmBERT32k models.

A.4. Downstream Task Evaluation

Table 18
Hyperparameter search for the downstream evaluation on the NewsEye NER dataset.

Parameter       Values
Batch Size      [4, 8]
Epoch           [5, 10]
Learning Rate   [3e-05, 5e-05]
Seed            [1, 2, 4, 5]

B. HIPE-2022: Multilingual Classical Commentary Challenge

B.1. Multi-Stage Fine-Tuning

Table 19
Hyperparameter search during the first stage of NER model fine-tuning.

Parameter       Values
Batch Size      [4, 8, 16]
Epoch           [10]
Learning Rate   [1e-05, 2e-05, 3e-05, 4e-05, 5e-05]
Seed            [1, 2, 4, 5]

Table 20
Hyperparameter search during the second stage of NER model fine-tuning.

Parameter       Values
Batch Size      [4, 8]
Epoch           [5, 10]
Learning Rate   [3e-05, 5e-05]
Seed            [1, 2, 4, 5]

As a batch size of 16 and learning rates of 1e-05 and 2e-05 do not perform well, we exclude them when performing the hyperparameter search with hmBERT64k.