=Paper=
{{Paper
|id=Vol-2696/paper_168
|storemode=property
|title=Transfer Learning for Named Entity Recognition in Historical Corpora
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_168.pdf
|volume=Vol-2696
|authors=Konstantin Todorov,Giovanni Colavizza
|dblpUrl=https://dblp.org/rec/conf/clef/TodorovC20
}}
==Transfer Learning for Named Entity Recognition in Historical Corpora==
Konstantin Todorov [0000-0002-7445-4676] and Giovanni Colavizza [0000-0002-9806-084X]
University of Amsterdam, The Netherlands
kztodorov@outlook.com, g.colavizza@uva.nl

Abstract. We report on our participation in the 2020 CLEF HIPE shared task as team Ehrmama, focusing on bundle 3: Named Entity Recognition and Classification (NERC) on coarse and fine-grained tags. Motivated by an interest in assessing the added value of transfer learning for NERC on historical corpora, we propose an architecture made of two components: (i) a modular embedding layer where we combine newly trained and pre-trained embeddings, and (ii) a task-specific Bi-LSTM-CRF layer. We find that character-level embeddings, BERT, and a document-level data split are the most important factors in improving our results. We also find that using in-domain FastText embeddings and a single-task as opposed to a multi-task approach yields minor gains. Our results confirm that pre-trained language models can be beneficial for NERC on low-resourced historical corpora.

Keywords: NERC · BERT · Bi-LSTM-CRF · Transfer learning.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

The advent of contextual language models such as Bidirectional Encoder Representations from Transformers (BERT) [3] has furthered the adoption of transfer learning in Natural Language Processing (NLP). Transfer learning aims at transferring knowledge from a general-purpose source task to a specialised target task [11,13]. The specialised target task is often linguistically under-resourced (e.g., small data or a lack of linguistic resources) [2]. Transfer learning further allows saving computation resources by training once and applying the same model widely with little or no further adaptation [15].

The increasing abundance of historical text corpora offers a compelling opportunity to apply transfer learning. Historical texts pose a set of challenges to the NLP and Digital Humanities (DH) communities, of which the most general and pressing are [12,4]: a) noisy inputs, for example due to Optical/Handwritten Character Recognition (OCR/HTR) errors; b) linguistic change over time; c) language variety, in the absence of mainstream languages such as English. These challenges are not unique to historical texts, but they come to the forefront when dealing with them.

We participated in CLEF HIPE as team Ehrmama, focusing on bundle 3: NERC coarse and NERC fine-grained [5], conducted over the English, French and German languages. This task bundle focuses on the recognition of named entities in six different tag types, namely coarse- and fine-grained tags and their metonymic senses, as well as components and nested entities of depth one. Our general goal is to assess if and how transfer learning using modern-day language models can help with tasks on OCRed historical corpora. To this end, we propose a model composed of a general-purpose embedding layer, which allows us to combine character-, sub-word- and word-level embeddings in a modular way, equipped with a state-of-the-art NERC-specific layer. We then explore the use of newly trained and pre-trained embeddings in isolation and in combination. Our code is publicly available at https://github.com/ktodorov/eval-historical-texts.

2 Method

Our proposed architecture is composed of two parts: an embedding layer and a task-specific layer, in this case for NERC.
An illustration is given in Figure 1. Several embeddings can be combined into a modular embedding layer which we use to represent input text. We broadly distinguish between (i) pre-trained (transferred) embeddings and (ii) newly trained embeddings. While pre-trained embeddings can either be fine-tuned or frozen during learning, newly trained representations are learned from scratch. Furthermore, embeddings can be applied at different input granularities: (i) character-, (ii) sub-word- and (iii) word-level. These are also modular and can be used in combination. The task-specific layer is a Bidirectional Long Short-Term Memory network with a Conditional Random Field (Bi-LSTM-CRF), as proposed by [9], with the additional removal of the tanh non-linearity after the LSTM. As a sanity test, we applied our model to the modern-day CoNLL-2003 dataset [14], achieving results comparable to the current state of the art.

2.1 Empirical setup

Embedding layer, containing four different embedding modules.

Character embeddings consist of an embedding layer followed by a bidirectional LSTM. The embedding layer size and the LSTM hidden size are both hyper-parameters, with values ranging from 16 to 128 and from 16 to 256 respectively. We use character-level custom vocabularies for each language, built from the training and validation data sets.

Fig. 1: NERC multi-task model architecture. Our single-task architecture is identical and only contains a fully connected layer and CRF for one entity type.

BERT embeddings work at the sub-word level. We use bert-base-multilingual-cased for French, bert-base-german-cased for German, and bert-base-cased for English, relying on the HuggingFace Transformers library [16]. This brings the specific limitation of only working with sequences of at most 512 tokens. As our text sequences are usually longer, we implement a sliding-window splitting of input sequences before passing them through BERT. While splitting, we keep the first and last 5 tokens of each chunk as overlap between sequential chunks. After embedding each chunk, we reconstruct the full input sequences by averaging the embeddings of the overlapping tokens.

Newly trained embeddings work at the sub-word level and their weights are randomly initialised and learned during training. We use the same vocabulary as with BERT. The size of these embeddings is a hyper-parameter and ranges between 64 and 512.

In-domain pre-trained embeddings, provided by the task organisers, are used for feature extraction only (frozen). These embeddings have a size of 300 and work at the sub-word level. This model uses the FastText library [7].

After testing different alternatives, we found that the simplest and fastest way to combine these embeddings is to concatenate them, resulting in concatenated sub-word embeddings with a size equal to the sum of the embedding sizes of all enabled modules.

Task-specific layer, based on a Bi-LSTM-CRF [9]. The Bi-LSTM-CRF uses the concatenated sub-word embeddings as its input, and then merges the output to word level by taking the mean. Finally, the resulting representation is pushed through a fully connected layer which outputs tag probabilities for each token. We tested concatenating embeddings before or after the Bi-LSTM, or not merging at all, and found that our approach performs best, also in accordance with previous findings [13].
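To make this concrete, the following is a minimal PyTorch-style sketch of the task-specific layer as described above: the concatenated sub-word embeddings are fed to a Bi-LSTM, its outputs are merged to word level by taking the mean, and a fully connected layer produces tag scores. The class and argument names (BiLSTMTagger, word_ids) are illustrative assumptions, not the authors' exact implementation, and the CRF decoding step is omitted here.

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Sketch of the task-specific layer: a Bi-LSTM runs over the concatenated
    sub-word embeddings, its outputs are averaged per word, and a fully
    connected layer produces unnormalised tag scores (CRF decoding omitted)."""

    def __init__(self, embedding_size: int, hidden_size: int, num_tags: int):
        super().__init__()
        self.lstm = nn.LSTM(embedding_size, hidden_size,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, subword_embeddings: torch.Tensor, word_ids: torch.Tensor):
        # subword_embeddings: (num_subwords, embedding_size), the concatenation
        # of all enabled modules (character, BERT, FastText, newly trained).
        # word_ids: (num_subwords,) long tensor mapping each sub-word to its word.
        lstm_out, _ = self.lstm(subword_embeddings.unsqueeze(0))
        lstm_out = lstm_out.squeeze(0)  # (num_subwords, 2 * hidden_size)

        # Merge the Bi-LSTM output to word level by taking the mean per word.
        num_words = int(word_ids.max()) + 1
        sums = lstm_out.new_zeros(num_words, lstm_out.size(-1))
        sums.index_add_(0, word_ids, lstm_out)
        counts = lstm_out.new_zeros(num_words)
        counts.index_add_(0, word_ids, torch.ones_like(word_ids, dtype=lstm_out.dtype))
        word_representations = sums / counts.unsqueeze(-1)

        # Fully connected layer producing tag scores for each word.
        return self.classifier(word_representations)  # (num_words, num_tags)
```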
A Conditional Random Field (CRF) [8] is eventually used over the produced tag probabilities to decode the final tag predictions.

A multi-task approach is our primary setup. We introduce additional output heads, one for each of the entity tag types that the task aims to predict. The final two layers of the model, namely the fully connected layer and the CRF, are specific to each entity tag type, while the rest of the architecture is shared. The individual losses of the tasks are summed during backpropagation. We compare the single-task and multi-task approaches in what follows.

Additional resources. We use the Annotated Corpus for Named Entity Recognition built on top of the Groningen Meaning Bank (GMB) [1] (https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus, accessed 2020-07-16). This dataset is annotated specifically for training NER classifiers and contains most of the coarse-grained tag types which occur in the English dataset provided by the organisers. We consolidate some tags with the same meaning but different labels ourselves. The dataset contains in total 1,354,149 tokens, of which 85% are originally labelled as O. We convert the tag types that are not part of this challenge to O as well, resulting in a total of 94.92% of tokens carrying the O literal tag. We used an NVidia GeForce 1080Ti GPU with 11GB of GDDR5X memory for our experiments.

2.2 Model fitting

In this section, we discuss the remaining pre-processing and hyper-parameter choices which we assessed empirically.

Pre-processing. The input data is organised into documents, and each document is split into multiple segments, where usually one segment corresponds to one line in the original historical source. The input can thus be split into segments or into documents. Using segments leads to much faster convergence, while document splitting usually yields better results in our experiments. We further analyse the importance of splitting by introducing a multi-segment option which combines more than one consecutive segment. We perform a hard split and pick the maximum length of one multi-segment sequence to be the maximum length allowed by the HuggingFace Transformers library, in order to avoid any unwanted noise. At the document level we overcome this limitation by splitting documents using a sliding-window approach in which the first and last 5 tokens of each split overlap with the previous and next splits respectively. We perform the cutting before extracting features through BERT, after which we concatenate the representations back, taking the average values for the overlapping tokens. Finally, we replace all numbers with zeros, including those that contain more than one digit. We do not lowercase, nor do we remove any punctuation or other characters.

Fine-tuning vs. freezing. There are two possibilities when using pre-trained models: keeping them frozen or fine-tuning them further. Fine-tuning introduces two additional configuration options. The first concerns when to start fine-tuning: this is most often done from the beginning of the training process and until convergence. We also investigate a second approach, in which the full model first converges with the pre-trained weights frozen and the pre-trained weights are fine-tuned afterwards. We find no difference between the two approaches and therefore fine-tune from the start in the reported experiments with fine-tuning enabled.
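As a concrete illustration of the freezing versus fine-tuning choice, the sketch below (assuming PyTorch and the HuggingFace Transformers API used in the paper; the helper name is hypothetical) simply toggles whether the pre-trained encoder receives gradients.

```python
import torch
from transformers import AutoModel

# Illustrative only: load one of the pre-trained models named in the paper.
bert = AutoModel.from_pretrained("bert-base-german-cased")


def set_bert_trainable(model: torch.nn.Module, trainable: bool) -> None:
    """Freeze (trainable=False) or fine-tune (trainable=True) the encoder."""
    for param in model.parameters():
        param.requires_grad = trainable


# Frozen feature extraction: BERT weights stay fixed, only the task layers learn.
set_bert_trainable(bert, trainable=False)

# Fine-tuning from the start (the setting reported when fine-tuning is enabled):
# all encoder weights receive gradients, with the smaller fine-tune learning
# rate of 1e-4 listed in Table 1 rather than the task-layer learning rate.
set_bert_trainable(bert, trainable=True)
```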
Manually crafted features. Following previous work [6], we assess the importance of manually crafted features. We use AllLower, AllUpper, IsTitle, FirstLetterUpper, FirstLetterNotUpper, IsNumeric and NoAlphaNumeric as extra morphological features (see the illustrative sketch further below). Including these features in the model does not yield significant improvements.

Weighting. As is common in NERC tasks, most of the ground truth is composed of outside (O) tags. In our case, these make up approximately 94.92%, 95.95% and 96.5% of the total tokens for English, French and German respectively. To counteract tag imbalance, we test a weighted loss which we plug into the CRF layer, giving more weight to tokens predicted as outside which are in fact part of entities, and less weight to tokens predicted as part of an entity which are actually outside. This weighted loss does not prove to be beneficial.

Hyper-parameters. We assess the Adam and AdamW [10] optimizers. For the learning rate, we see that higher values benefit the model more. We pick a default value of 1e-8 for weight decay for all optimizers. The final hyper-parameter configurations that we use are summarised in Table 1.

Table 1: Hyper-parameter configurations. Configuration I is used for Base. Configuration II is used for Base + CE + BERT and Base + CE + BERT - newly. Configuration III is used for all remaining setups.

Hyper-parameter                        | Configuration I | Configuration II | Configuration III
RNN hidden size                        | 512             | 256              | 512
RNN directionality                     | bi-directional  | bi-directional   | bi-directional
RNN dropout                            | 0.5/0.8         | 0.5/0.8          | 0.5/0.8
Newly trained embeddings size          | 64              | 64               | 64
Character embeddings size              | -               | 16               | 16
Character embeddings RNN hidden size   | -               | 32               | 32
Replace numbers during pre-processing  | yes             | yes              | yes
Weighted loss usage                    | no              | no               | no
Optimizer                              | AdamW           | AdamW            | AdamW
Learning rate                          | 1e-2            | 1e-2             | 1e-2
Fine-tune learning rate                | 1e-4            | 1e-4             | 1e-4

3 Results

We report results for the three languages that are part of the task, namely French, German and English, using the official test set v1.3 (https://github.com/impresso/CLEF-HIPE-2020). In addition, we report results using the multi-segment and document split types for French and German, and the segment split type for English, since our English training data lacks the document level.

All results are reported for the two scoring approaches used in the challenge, fuzzy and strict. As a reminder, fuzzy scoring works in a relaxed way, allowing fuzzy boundary matching of entities: if an entity is only partially recognised, e.g., if 4 out of a total of 6 tokens are recognised correctly, this still counts as a successful recognition. Conversely, strict matching requires exact boundary matching; in the previous example, all 6 tokens would have to be predicted correctly. For each scoring approach, we provide precision (P), recall (R) and F-score (F), all reported as micro averages and calculated using the original scorer used in the competition (https://github.com/impresso/CLEF-HIPE-2020-scorer). We report the baseline model provided by the organisers for reference, reminding the reader that the baseline model always uses a document-level split. We also report the baseline model results on our English data.

We order the different configurations for all languages following our ablation studies, which primarily focus on assessing the impact of transfer learning. We start with the simplest Base model, which only uses newly trained sub-word embeddings and no pre-trained information of any type.
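As an aside before the ablation results, the following minimal sketch illustrates how the handcrafted morphological features from Section 2.2 could be computed for a single token; the helper name and the exact string tests are assumptions for illustration, not the authors' implementation.

```python
def morphological_features(token: str) -> dict:
    """Illustrative per-token morphological features (cf. Section 2.2);
    the precise definitions used in the paper may differ."""
    return {
        "AllLower": token.islower(),
        "AllUpper": token.isupper(),
        "IsTitle": token.istitle(),
        "FirstLetterUpper": token[:1].isupper(),
        "FirstLetterNotUpper": not token[:1].isupper(),
        "IsNumeric": token.isnumeric(),
        "NoAlphaNumeric": not any(ch.isalnum() for ch in token),
    }


# Example: morphological_features("Paris") yields AllLower=False,
# AllUpper=False, IsTitle=True, FirstLetterUpper=True, IsNumeric=False, ...
```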
On top of the Base model, we then add Character Embeddings (CE), which use a Bi-LSTM (+CE). Due to the significant improvements observed by adding character embeddings, we keep them enabled in all subsequently reported setups. We further report results achieved by adding, firstly, the (frozen) FastText embeddings provided by the organisers (+FT), then (frozen) BERT embeddings (+BERT), and finally both. Whenever BERT is enabled, we also report runs where we disable the newly trained embeddings (-newly). Eventually, we report setups where we unfreeze BERT and fine-tune it on the task at hand. Due to the long sequence lengths when working at the document level, we are unable to fine-tune BERT at that level; we therefore report the results of fine-tuning BERT only using the multi-segment split. All models use the multi-task approach, except for one single-task run, which has all available embeddings enabled (single).

We start by reporting results for French, in Table 2. Firstly, adding character-level embeddings and BERT consistently improves results. Better results overall are obtained with a single-task approach and using all available embeddings, including newly trained ones. A document-level split, following this configuration, performs best across the board. We also see that most of our configurations struggle on tasks with sparser annotations, such as Metonymic and Nested. Furthermore, fine-tuning BERT does not seem to improve results. Results for German, shown in Table 3, are consistent with those for French. It is worth noting that our models struggle even more on the German Metonymic and Nested tasks; for nested tags, we are not able to be predictive at all, in particular at the multi-segment level. For completeness, we report results for English in Table 4, limited to the Literal coarse task. For a better comparison, we provide results from two baseline models: i) the results from the organisers and ii) the results of training the baseline model on the English dataset we use. Our models are mostly not able to perform beyond the provided baseline, which is likely due to the training data we use.

We clarify that most of these results were obtained after the task submission deadline. For the deadline, and as reported in [5], we submitted three different runs for German and French and two for English. For German and French we submitted one run using multi-task learning and document-level splitting, and another run using multi-task learning and multi-segment splitting; for English we had one run using multi-task learning and segment splitting; finally, we submitted one run for all languages where we used the literal tag types from two single-task learning runs. All of our submitted runs had all modules enabled.

Table 2: NERC, French.
The best result per table and column is given in bold; the second best result is underlined.

(a) Multi-segment level, coarse-grained entity types

Configuration | Literal coarse fuzzy (P R F) | Literal coarse strict (P R F) | Metonymic coarse fuzzy (P R F) | Metonymic coarse strict (P R F)
Baseline | .825 .721 .769 | .693 .606 .646 | .541 .179 .268 | .541 .179 .268
Base | .776 .69 .73 | .618 .55 .582 | .5 .424 .459 | .495 .42 .454
Base + CE | .806 .739 .771 | .649 .594 .62 | .552 .379 .45 | .545 .375 .444
Base + CE + FT | .789 .78 .784 | .65 .642 .646 | .481 .339 .398 | .468 .33 .387
Base + CE + BERT | .886 .801 .841 | .782 .707 .743 | .424 .397 .41 | .41 .384 .396
Base + CE + BERT - newly | .859 .818 .838 | .719 .685 .702 | .417 .384 .4 | .417 .384 .4
Base + CE + FT + BERT | .866 .836 .851 | .767 .739 .753 | .664 .362 .468 | .656 .357 .462
Base + CE + FT + BERT - newly | .864 .848 .856 | .765 .751 .758 | .766 .321 .453 | .766 .321 .453
Base + CE + FT + BERT (single) | .872 .835 .853 | .769 .737 .753 | .036 .069 .000 | .036 .069 .000
+ Fine-tuning (unfreezing) BERT
Base + CE + BERT | .876 .824 .849 | .775 .729 .751 | .442 .375 .406 | .432 .366 .396
Base + CE + BERT - newly | .877 .804 .839 | .775 .711 .742 | .754 .384 .509 | .754 .384 .509
Base + CE + FT + BERT | .857 .836 .846 | .759 .741 .75 | .551 .482 .514 | .541 .473 .505
Base + CE + FT + BERT - newly | .845 .838 .842 | .742 .737 .74 | .659 .5 .569 | .659 .5 .569

(b) Multi-segment level, fine-grained entity types

Configuration | Literal fine fuzzy (P R F) | Literal fine strict (P R F) | Metonymic fine fuzzy (P R F) | Metonymic fine strict (P R F) | Component fuzzy (P R F) | Component strict (P R F) | Nested fuzzy (P R F) | Nested strict (P R F)
Baseline | .838 .693 .758 | .644 .533 .583 | .564 .196 .291 | .538 .187 .278 | .799 .531 .638 | .733 .487 .585 | .267 .049 .082 | .267 .049 .082
Base | .8 .67 .729 | .548 .459 .499 | .476 .451 .463 | .472 .446 .459 | .774 .531 .630 | .692 .475 .563 | .383 .140 .205 | .333 .122 .179
Base + CE | .825 .708 .762 | .562 .482 .519 | .594 .366 .453 | .594 .366 .453 | .779 .556 .649 | .720 .514 .600 | .5 .067 .118 | .364 .049 .086
Base + CE + FT | .801 .763 .781 | .568 .541 .554 | .567 .228 .325 | .533 .214 .306 | .762 .598 .67 | .682 .535 .600 | .425 .207 .279 | .375 .183 .246
Base + CE + BERT | .889 .781 .831 | .658 .578 .616 | .532 .366 .434 | .519 .357 .423 | .803 .579 .673 | .715 .515 .599 | .000 .000 .000 | .000 .000 .000
Base + CE + BERT - newly | .865 .748 .802 | .613 .53 .568 | .54 .241 .333 | .54 .241 .333 | .821 .504 .625 | .732 .449 .557 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT | .866 .818 .842 | .672 .634 .653 | .702 .263 .383 | .643 .241 .351 | .804 .563 .662 | .712 .499 .587 | .357 .03 .056 | .143 .012 .022
Base + CE + FT + BERT - newly | .873 .82 .846 | .672 .631 .651 | .771 .241 .367 | .743 .232 .354 | .842 .546 .663 | .774 .503 .61 | .393 .067 .115 | .286 .049 .083
Base + CE + FT + BERT (single) | .868 .818 .842 | .676 .636 .655 | .538 .442 .485 | .533 .438 .48 | .752 .677 .713 | .659 .594 .625 | .000 .000 .000 | .000 .000 .000
+ Fine-tuning (unfreezing) BERT
Base + CE + BERT | .877 .806 .840 | .654 .600 .626 | .434 .379 .405 | .429 .375 .400 | .77 .598 .673 | .673 .523 .588 | .267 .049 .082 | .133 .024 .041
Base + CE + BERT - newly | .885 .782 .83 | .672 .593 .63 | .739 .29 .417 | .705 .277 .397 | .818 .524 .639 | .745 .477 .582 | .107 .018 .031 | .071 .012 .021
Base + CE + FT + BERT | .871 .814 .842 | .687 .642 .664 | .568 .411 .477 | .543 .393 .456 | .741 .672 .705 | .648 .587 .616 | .232 .159 .188 | .179 .122 .145
Base + CE + FT + BERT - newly | .852 .837 .845 | .663 .652 .658 | .681 .420 .519 | .609 .375 .464 | .785 .626 .697 | .701 .559 .622 | .333 .183 .236 | .244 .134 .173

(c) Document level, coarse-grained entity types

Configuration | Literal coarse fuzzy (P R F) | Literal coarse strict (P R F) | Metonymic coarse fuzzy (P R F) | Metonymic coarse strict (P R F)
Baseline | .825 .721 .769 | .693 .606 .646 | .541 .179 .268 | .541 .179 .268
Base | .812 .686 .743 | .671 .566 .614 | .444 .536 .486 | .444 .536 .486
Base + CE | .802 .762 .782 | .658 .625 .641 | .575 .272 .370 | .566 .268 .364
Base + CE + FT | .815 .737 .774 | .673 .608 .639 | .510 .469 .488 | .505 .464 .484
Base + CE + BERT | .871 .831 .851 | .779 .743 .760 | .684 .232 .347 | .684 .232 .347
Base + CE + BERT - newly | .890 .828 .858 | .788 .733 .759 | .564 .277 .371 | .545 .268 .359
Base + CE + FT + BERT | .872 .828 .849 | .772 .733 .752 | .433 .696 .534 | .428 .688 .527
Base + CE + FT + BERT - newly | .869 .872 .871 | .78 .782 .781 | .755 .357 .485 | .755 .357 .485
Base + CE + FT + BERT (single) | .89 .856 .873 | .807 .776 .791 | .699 .424 .528 | .691 .420 .522

(d) Document level, fine-grained entity types

Configuration | Literal fine fuzzy (P R F) | Literal fine strict (P R F) | Metonymic fine fuzzy (P R F) | Metonymic fine strict (P R F) | Component fuzzy (P R F) | Component strict (P R F) | Nested fuzzy (P R F) | Nested strict (P R F)
Baseline | .838 .693 .758 | .644 .533 .583 | .564 .196 .291 | .538 .187 .278 | .799 .531 .638 | .733 .487 .585 | .267 .049 .082 | .267 .049 .082
Base | .822 .672 .739 | .594 .486 .534 | .446 .513 .477 | .419 .482 .448 | .738 .6 .662 | .657 .534 .589 | .512 .250 .336 | .350 .171 .230
Base + CE | .809 .752 .78 | .586 .546 .565 | .521 .223 .313 | .521 .223 .313 | .743 .618 .675 | .65 .541 .59 | .35 .171 .23 | .275 .134 .180
Base + CE + FT | .811 .722 .764 | .599 .534 .565 | .54 .362 .433 | .507 .339 .406 | .759 .603 .672 | .684 .544 .606 | .453 .177 .254 | .406 .159 .228
Base + CE + BERT | .885 .799 .84 | .696 .629 .661 | .654 .304 .415 | .654 .304 .415 | .719 .686 .702 | .625 .596 .610 | .304 .104 .155 | .250 .085 .127
Base + CE + BERT - newly | .896 .790 .840 | .675 .595 .633 | .568 .223 .321 | .568 .223 .321 | .808 .603 .690 | .696 .520 .595 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT | .883 .800 .839 | .717 .649 .682 | .741 .371 .494 | .679 .339 .452 | .794 .631 .703 | .715 .568 .633 | .341 .183 .238 | .318 .171 .222
Base + CE + FT + BERT - newly | .881 .841 .861 | .703 .671 .687 | .705 .384 .497 | .689 .375 .486 | .792 .644 .71 | .704 .572 .631 | .233 .043 .072 | .067 .012 .021
Base + CE + FT + BERT (single) | .882 .853 .867 | .729 .704 .716 | .741 .357 .482 | .741 .357 .482 | .734 .726 .73 | .650 .642 .646 | .438 .299 .355 | .393 .268 .319

Table 3: NERC, German.
The best result per table and column is given in bold; the second best result is underlined.

(a) Multi-segment level, coarse-grained entity types

Configuration | Literal coarse fuzzy (P R F) | Literal coarse strict (P R F) | Metonymic coarse fuzzy (P R F) | Metonymic coarse strict (P R F)
Baseline | .79 .464 .585 | .643 .378 .476 | .814 .297 .435 | .814 .297 .435
Base | .698 .526 .6 | .535 .404 .46 | .559 .602 .58 | .551 .593 .571
Base + CE | .685 .605 .642 | .535 .473 .502 | .588 .568 .578 | .588 .568 .578
Base + CE + FT | .691 .554 .615 | .528 .424 .47 | .534 .602 .566 | .534 .602 .566
Base + CE + BERT | .801 .675 .733 | .596 .502 .545 | .000 .000 .000 | .000 .000 .000
Base + CE + BERT - newly | .759 .706 .732 | .582 .541 .561 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT | .784 .724 .753 | .639 .589 .613 | .598 .542 .569 | .598 .542 .569
Base + CE + FT + BERT - newly | .84 .64 .726 | .696 .53 .602 | .696 .466 .558 | .696 .466 .558
Base + CE + FT + BERT (single) | .827 .731 .776 | .708 .625 .664 | .492 .53 .51 | .472 .508 .49
+ Fine-tuning (unfreezing) BERT
Base + CE + BERT | .756 .718 .737 | .546 .519 .532 | .000 .000 .000 | .000 .000 .000
Base + CE + BERT - newly | .752 .718 .734 | .56 .534 .547 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT | .738 .678 .707 | .575 .528 .551 | .562 .500 .529 | .543 .483 .511
Base + CE + FT + BERT - newly | .802 .689 .741 | .658 .565 .608 | .621 .521 .567 | .616 .517 .562

(b) Multi-segment level, fine-grained entity types

Configuration | Literal fine fuzzy (P R F) | Literal fine strict (P R F) | Metonymic fine fuzzy (P R F) | Metonymic fine strict (P R F) | Component fuzzy (P R F) | Component strict (P R F) | Nested fuzzy (P R F) | Nested strict (P R F)
Baseline | .792 .419 .548 | .641 .339 .444 | .805 .28 .415 | .805 .28 .415 | .783 .34 .474 | .727 .316 .44 | .333 .014 .026 | .333 .014 .026
Base | .723 .517 .602 | .479 .343 .4 | .593 .593 .593 | .585 .585 .585 | .589 .298 .396 | .486 .246 .327 | .000 .000 .000 | .000 .000 .000
Base + CE | .704 .585 .639 | .466 .388 .424 | .667 .559 .608 | .667 .559 .608 | .589 .432 .498 | .506 .371 .428 | .25 .014 .026 | .000 .000 .000
Base + CE + FT | .706 .521 .6 | .478 .353 .406 | .538 .602 .568 | .538 .602 .568 | .654 .266 .378 | .571 .232 .33 | .000 .000 .000 | .000 .000 .000
Base + CE + BERT | .773 .693 .731 | .348 .312 .329 | .000 .000 .000 | .000 .000 .000 | .562 .222 .318 | .382 .151 .216 | .000 .000 .000 | .000 .000 .000
Base + CE + BERT - newly | .800 .647 .716 | .358 .289 .320 | .000 .000 .000 | .000 .000 .000 | .455 .480 .467 | .310 .327 .318 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT | .800 .626 .703 | .515 .403 .452 | .581 .547 .563 | .568 .534 .55 | .670 .471 .553 | .525 .369 .433 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT - newly | .816 .639 .717 | .551 .432 .484 | .627 .542 .582 | .627 .542 .582 | .533 .227 .319 | .397 .169 .237 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT (single) | .776 .569 .656 | .477 .35 .403 | .000 .000 .000 | .000 .000 .000 | .841 .423 .563 | .751 .378 .503 | .000 .000 .000 | .000 .000 .000
+ Fine-tuning (unfreezing) BERT
Base + CE + BERT | .759 .703 .73 | .311 .288 .299 | .000 .000 .000 | .000 .000 .000 | .418 .295 .346 | .276 .195 .229 | .000 .000 .000 | .000 .000 .000
Base + CE + BERT - newly | .758 .696 .726 | .29 .267 .278 | .000 .000 .000 | .000 .000 .000 | .399 .328 .36 | .239 .197 .216 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT | .736 .687 .711 | .433 .405 .418 | .524 .551 .537 | .508 .534 .521 | .474 .508 .49 | .338 .362 .349 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT - newly | .801 .685 .738 | .548 .469 .506 | .691 .475 .563 | .691 .475 .563 | .58 .51 .543 | .472 .415 .442 | .000 .000 .000 | .000 .000 .000

(c) Document level, coarse-grained entity types

Configuration | Literal coarse fuzzy (P R F) | Literal coarse strict (P R F) | Metonymic coarse fuzzy (P R F) | Metonymic coarse strict (P R F)
Baseline | .79 .464 .585 | .643 .378 .476 | .814 .297 .435 | .814 .297 .435
Base | .678 .552 .609 | .519 .422 .465 | .571 .581 .576 | .567 .576 .571
Base + CE | .688 .573 .626 | .548 .456 .498 | .618 .576 .596 | .618 .576 .596
Base + CE + FT | .706 .548 .617 | .549 .426 .48 | .725 .492 .586 | .725 .492 .586
Base + CE + BERT | .763 .752 .758 | .642 .632 .637 | .714 .508 .594 | .714 .508 .594
Base + CE + BERT - newly | .805 .654 .722 | .641 .52 .574 | .433 .517 .471 | .426 .508 .463
Base + CE + FT + BERT | .767 .765 .766 | .647 .645 .646 | .622 .627 .624 | .622 .627 .624
Base + CE + FT + BERT - newly | .799 .726 .761 | .671 .609 .639 | .696 .542 .610 | .696 .542 .610
Base + CE + FT + BERT (single) | .86 .738 .795 | .753 .647 .696 | .709 .517 .598 | .709 .517 .598

(d) Document level, fine-grained entity types

Configuration | Literal fine fuzzy (P R F) | Literal fine strict (P R F) | Metonymic fine fuzzy (P R F) | Metonymic fine strict (P R F) | Component fuzzy (P R F) | Component strict (P R F) | Nested fuzzy (P R F) | Nested strict (P R F)
Baseline | .792 .419 .548 | .641 .339 .444 | .805 .28 .415 | .805 .28 .415 | .783 .34 .474 | .727 .316 .44 | .333 .014 .026 | .333 .014 .026
Base | .69 .53 .599 | .448 .344 .389 | .586 .606 .596 | .582 .602 .592 | .592 .394 .474 | .491 .327 .393 | .312 .068 .112 | .250 .055 .09
Base + CE | .706 .555 .622 | .483 .380 .426 | .67 .534 .594 | .67 .534 .594 | .683 .447 .54 | .589 .385 .466 | .154 .027 .047 | .077 .014 .023
Base + CE + FT | .726 .53 .613 | .527 .384 .445 | .766 .500 .605 | .766 .500 .605 | .722 .332 .455 | .636 .292 .401 | .5 .082 .141 | .5 .082 .141
Base + CE + BERT | .782 .734 .757 | .571 .536 .553 | .75 .508 .606 | .75 .508 .606 | .7 .500 .583 | .623 .445 .52 | .333 .027 .051 | .333 .027 .051
Base + CE + BERT - newly | .806 .594 .684 | .496 .365 .421 | .500 .508 .504 | .500 .508 .504 | .565 .09 .156 | .42 .067 .116 | .000 .000 .000 | .000 .000 .000
Base + CE + FT + BERT | .791 .763 .777 | .594 .574 .584 | .649 .610 .629 | .649 .610 .629 | .703 .582 .637 | .585 .485 .53 | .250 .014 .026 | .250 .014 .026
Base + CE + FT + BERT - newly | .84 .679 .751 | .615 .497 .55 | .744 .517 .610 | .744 .517 .610 | .792 .397 .529 | .699 .35 .467 | .250 .007 .013 | .000 .000 .000
Base + CE + FT + BERT (single) | .839 .743 .788 | .669 .593 .629 | .667 .525 .588 | .645 .508 .569 | .718 .588 .647 | .632 .517 .569 | .000 .000 .000 | .000 .000 .000

Table 4: English, segment split. The best result per table and column is given in bold; the second best result is underlined.

Configuration | Literal coarse fuzzy (P R F) | Literal coarse strict (P R F)
Baseline (organisers) | .736 .454 .562 | .531 .327 .405
Baseline (ours) | .377 .612 .466 | .190 .31 .236
Base | .32 .444 .372 | .143 .198 .166
Base + CE | .315 .576 .407 | .132 .241 .17
Base + CE + FT | .261 .611 .366 | .106 .247 .148
Base + CE + BERT | .442 .442 .442 | .174 .174 .174
Base + CE + BERT - newly | .419 .568 .482 | .195 .265 .225
Base + CE + FT + BERT | .391 .506 .441 | .191 .247 .216
Base + CE + FT + BERT - newly | .457 .455 .456 | .237 .236 .237
Base + CE + FT + BERT (single) | .396 .508 .445 | .22 .283 .248
+ Fine-tuning (unfreezing) BERT
Base + CE + BERT | .374 .566 .450 | .118 .178 .142
Base + CE + BERT - newly | .442 .509 .473 | .143 .165 .153
Base + CE + FT + BERT | .403 .530 .458 | .171 .225 .194
Base + CE + FT + BERT - newly | .399 .518 .451 | .187 .243 .211

4 Conclusion

Our model achieves its best performance on French, followed by German, while we do not consider our English results useful for comparison yet, given the limitations of the dataset we used. A few results clearly emerge:

– Each embedding module contributes to the overall performance on most sub-tasks and evaluation metrics. Character-level and BERT embeddings are particularly important for performance, while in-domain FastText embeddings seem to help in particular for tags other than literal.
– Fine-tuning pre-trained embeddings in general does not improve performance, despite requiring more computation resources.
– A single-task approach performs better than multi-task in general, even if the differences are often minor. It must be noted that, with our setup, six single-task runs require on average 2.5 times more time to converge than one multi-task run using a document split, while six single-task runs using a multi-segment split are as fast as one multi-task run. Compared one-to-one, a single-task run is on average twice as fast.
– A document-level split of the data is in general better than a multi-segment split, highlighting how larger windows of context are helpful with our model.

When compared to the results of the other task participants [5], our model performs well in general, and notably well on the NERC-Fine sub-task, where we achieve the best performance on several evaluation metrics and in particular a high precision.

References

1. Bos, J., Basile, V., Evang, K., Venhuizen, N.J., Bjerva, J.: The Groningen Meaning Bank. In: Handbook of Linguistic Annotation, pp. 463–496. Springer (2017)
2. Chronopoulou, A., Baziotis, C., Potamianos, A.: An embarrassingly simple approach for transfer learning from pretrained language models. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2089–2095 (2019)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423
4. Ehrmann, M., Colavizza, G., Rochat, Y., Kaplan, F.: Diachronic evaluation of NER systems on old newspapers (2016), https://infoscience.epfl.ch/record/221391
5. Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)
6. Ghaddar, A., Langlais, P.: Robust lexical features for improved neural network named-entity recognition. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1896–1907. Association for Computational Linguistics (Aug 2018), https://www.aclweb.org/anthology/C18-1161
7. Joulin, A., Grave, É., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431 (2017)
8. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. ICML '01, Morgan Kaufmann Publishers Inc. (Jun 2001)
9. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics (Jun 2016). https://doi.org/10.18653/v1/N16-1030
10. Loshchilov, I., Hutter, F.: Fixing Weight Decay Regularization in Adam (Feb 2018), https://openreview.net/forum?id=rk6qdGgCZ
11. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359 (Oct 2010). https://doi.org/10.1109/TKDE.2009.191
12. Piotrowski, M.: Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies 5(2), 1–157 (Sep 2012). https://doi.org/10.2200/S00436ED1V01Y201207HLT017
13. Ruder, S.: Neural Transfer Learning for Natural Language Processing. PhD thesis (Feb 2019)
14. Sang, E.T.K., De Meulder, F.: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In: Proceedings of CoNLL-2003, Edmonton, Canada, pp. 142–145. Morgan Kaufmann Publishers (2003)
15. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to Fine-Tune BERT for Text Classification? In: Sun, M., Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) Chinese Computational Linguistics, pp. 194–206. Lecture Notes in Computer Science, Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-32381-3_16
16. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs] (Feb 2020), http://arxiv.org/abs/1910.03771