Named Entity Recognition and Linking on Historical Newspapers: UvA.ILPS & REL at CLEF HIPE 2020

Vera Provatorova¹, Svitlana Vakulenko¹, Evangelos Kanoulas¹, Koen Dercksen², and Johannes M. van Hulst²

¹ University of Amsterdam, Amsterdam, The Netherlands
² Radboud University, Nijmegen, The Netherlands

Abstract. This paper describes our submission to the CLEF HIPE 2020 shared task on identifying named entities in multilingual historical newspapers in French, German and English. The subtasks we addressed in our submission include coarse-grained named entity recognition, entity mention detection and entity linking. For the task of named entity recognition we used an ensemble of fine-tuned BERT models; entity linking was approached by three different methods: (1) a simple method relying on ElasticSearch retrieval scores, (2) an approach based on contextualised text embeddings, and (3) REL, a modular entity linking system based on several state-of-the-art components.

Keywords: Named Entity Linking · Named Entity Recognition

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Named entity identification is an important task in information extraction. Detecting, classifying and linking named entities helps to enable semantic search, which can be used for different domain applications, such as digital humanities [13]. One example is information retrieval from historical corpora. Identifying entities in historical documents poses several important challenges due to the nature of historical texts. These challenges include OCR errors in document scans, historical spelling variations and semantic shifts [12, 5]. This paper describes the submissions prepared by our joint team from the University of Amsterdam and Radboud University for the CLEF HIPE shared task. The main focus of CLEF HIPE is the systematic evaluation of named entity recognition and linking methods on multilingual diachronic historical data [6]. The shared task consists of several subtasks grouped into five bundles. Every team was allowed to submit one bundle per language, with the exception of bundle 5 (named entity linking given canonical mention spans), which was evaluated separately and could be combined with any other bundle.

Our submission targeted three of the subtasks in HIPE: (1) coarse-grained named entity recognition (NERC), (2) end-to-end named entity linking (NEL) using a modified NERC task for entity mention detection, and (3) named entity linking using mention spans provided by the organisers (NEL-only). Entity mention detection in this case was a supplementary task: it was not evaluated directly within the system submissions, but served as a preparation step for NEL in the setting of bundle 2, where entity mention boundaries were not given in the test data. In all the subtasks, we only considered the literal sense of the entities.

For the first phase of the shared task, we designed solutions for English, German and French within bundle 2, which included identifying, classifying and linking coarse-grained entities. For the second phase, bundle 5, we focused on one language only (English) and compared our results to the out-of-the-box tool Radboud Entity Linker (REL) [10] as a competitive baseline.

2 Bundle 2: Named Entity Recognition and Linking

2.1 Experimental setup

Datasets and resources.
The dataset provided by the CLEF HIPE organisers consists of diachronically organised, digitised historical newspaper articles in English, German and French. The data is annotated using the standard inside–outside–beginning (IOB) format and presented as tab-separated values, where each row corresponds to a single token. While validation datasets are provided for all three languages, training data are only available for German and French. To provide the token classification model with a sufficient amount of training data for English, we used CoNLL-03 [14] as an auxiliary dataset.

Approach. We consider both the NERC and entity mention detection tasks as instances of sequence classification. For the NERC task, 5 entity types (org, pers, prod, loc, and time) form 11 classes when annotated in the IOB format: each of the types has its "B-" and "I-" labels corresponding to tokens at the beginning and inside of an entity (e.g., "B-pers" and "I-pers"), while the "O" label marks the remaining tokens that are outside of named entities. For mention detection, 3 classes are considered: "B-entity", "I-entity", and "O".

To perform sequence classification, we fine-tuned two pretrained BERT models [3] provided by the Hugging Face Transformers library [15]: bert-base-cased for English and bert-base-multilingual-cased for French and German. To improve the robustness of the approach, we used a majority-vote ensemble of 5 model instances per language, fine-tuned on the training data with different numbers of epochs as well as different random seed values, where 5 ≤ num_epochs ≤ 9 and random_seed = 42 + num_epochs (a code sketch of this voting scheme is given in Section 2.2).

To perform entity linking, we used ElasticSearch [4] to index all Wikidata entity labels and searched for each of the entity mentions extracted from the input data to retrieve candidate entities. All the retrieved entities were included as candidates, without filtering on type. Candidate entity ranking was performed based on ElasticSearch retrieval scores combined with several heuristics, preferring exact string matches and shorter entity IDs (assuming that entities with shorter IDs were added to Wikidata earlier and are typically more general, and therefore more likely to be the correct link in many cases). We used the latest Wikidata dump from 9 March 2020, which contains more than 55M entities. An important limitation of our approach is that it relied solely on the English-language labels, which is likely to hinder its performance on named entities that vary across languages, such as "Geneva" in English versus "Genf" in German.

2.2 Results and discussion

The submissions were evaluated with the HIPE scorer, which is provided by the shared task organisers and available on GitHub.³ The scores achieved by our submissions on the NERC task are presented in Table 1.

Table 1. NERC-coarse results (literal sense, micro average)

                English                          French                           German
                strict          fuzzy            strict          fuzzy            strict          fuzzy
                F    P    R     F    P    R      F    P    R     F    P    R      F    P    R     F    P    R
best HIPE       .632 .623 .641  .786 .775 .797   .840 .831 .849  .921 .912 .931   .797 .790 .805  .878 .870 .886
UvA.ILPS        .473 .443 .508  .678 .635 .728   .686 .656 .719  .830 .794 .869   .526 .499 .556  .726 .689 .768
baseline HIPE   .405 .531 .327  .562 .736 .454   .646 .693 .606  .769 .825 .721   .476 .643 .378  .585 .790 .464

The baseline provided by the HIPE organisers for the NERC-coarse task uses a traditional CRF sequence classification method.
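To illustrate the voting scheme described in Section 2.1, the snippet below is a minimal sketch of how five fine-tuned checkpoints can be combined at prediction time. The checkpoint paths are hypothetical, we assume the checkpoints were saved with the IOB label names in their configuration, and aligning sub-word predictions to tokens via the first sub-word is one reasonable choice rather than our exact submission code.

from collections import Counter

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical paths to the five fine-tuned checkpoints (epochs 5..9, seed 42 + epochs).
CHECKPOINTS = [f"./nerc-en-bert-epochs-{n}" for n in range(5, 10)]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
models = [AutoModelForTokenClassification.from_pretrained(p).eval() for p in CHECKPOINTS]

def ensemble_predict(tokens):
    """Return one IOB label per input token by majority vote over the ensemble."""
    enc = tokenizer(tokens, is_split_into_words=True, truncation=True, return_tensors="pt")
    word_ids = enc.word_ids()
    votes = [[] for _ in tokens]
    for model in models:
        with torch.no_grad():
            pred_ids = model(**enc).logits[0].argmax(dim=-1).tolist()
        for pos, word_id in enumerate(word_ids):
            # Only the first sub-word of each token casts a vote (assumes id2label
            # holds the IOB labels, e.g. "B-pers", set during fine-tuning).
            if word_id is not None and word_ids[pos - 1] != word_id:
                votes[word_id].append(model.config.id2label[pred_ids[pos]])
    return [Counter(v).most_common(1)[0][0] for v in votes]

print(ensemble_predict(["Troops", "entered", "Geneva", "yesterday", "."]))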
The top solution for all languages was developed by the L3i team, with extra layers added on top of several pre-trained BERT models and trained in a multi-task learning setting to minimise the impact of OCR-generated noise, historical spelling variations and other challenges specific to the data [2]. Our approach outperforms the baseline but achieves significantly lower results than the top solution. This shows that, while transformer-based approaches are a promising direction for named entity recognition, using a majority-vote ensemble of fine-tuned models without any further modifications is not likely to be sufficient in the setting of noisy historical data.

Table 2. End-to-end NEL results (literal sense, micro average)

                English                          French                           German
                strict @1       relaxed @1       strict @1       relaxed @1       strict @1       relaxed @1
                F    P    R     F    P    R      F    P    R     F    P    R      F    P    R     F    P    R
best HIPE       .531 .523 .539  .531 .523 .539   .594 .602 .598  .613 .622 .617   .534 .531 .538  .557 .553 .561
UvA.ILPS        .300 .249 .375  .300 .249 .375   .251 .352 .195  .252 .353 .196   .254 .241 .269  .264 .250 .279
baseline HIPE   .220 .263 .239  .220 .263 .239   .206 .342 .257  .257 .358 .270   .173 .187 .180  .188 .203 .195

³ https://github.com/impresso/CLEF-HIPE-2020-scorer

For the end-to-end NEL task, the HIPE baseline is AIDA-light trained on English Wikipedia. The best solution was submitted by the L3i team using entity embeddings trained on Wikipedia and Wikidata, combined with probabilistic mapping. The results achieved by our submissions are presented in Table 2 and compared with these two approaches.

For English and German, our submission scores above the baseline but far below the top solution, which is not surprising given the simplicity of our approach. For French, the recall values of our submission are below the baseline. We assume that the main reason for this performance drop is that most of the French entities could not be found in the English-only Wikidata index used in our system. We conclude that the bottleneck of our approach is entity retrieval rather than entity mention detection.

3 Bundle 5: Named Entity Linking with Correct Mention Spans

3.1 Experimental setup

Datasets and resources. Our system runs were prepared using the same HIPE corpora as in bundle 2, with no extra training data. The algorithm designed for the first two runs used pre-trained contextualised Flair string embeddings [1] provided by the task organisers.

Methods. For the first two runs, candidate entity retrieval was done the same way as in bundle 2. To perform candidate entity ranking, we calculated the cosine similarity between the contextual embeddings of a sentence containing the target entity mention and a modified sentence, in which the target entity mention is replaced with the candidate entity's description extracted from Wikidata. For example, if the target sentence is "We went to London for a weekend" and a candidate entity is Q84 with the label London and the description "capital and largest city of the United Kingdom", then the modified sentence would be "We went to capital and largest city of the United Kingdom for a weekend".

The idea behind our approach rests on two basic assumptions: (1) Wikidata entity descriptions are semantically similar to the corresponding entity labels, and (2) contextualised string embeddings capture the similarity between entity descriptions and entity labels.
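The following is a minimal sketch of this scoring step. It pools Flair string embeddings into a single sentence vector and compares the original sentence with the substituted one; the news-forward/news-backward model names stand in for the historical Flair embeddings provided by the organisers, and the plain string replacement is a simplification of the span-based substitution used in our runs.

import torch
from flair.data import Sentence
from flair.embeddings import DocumentPoolEmbeddings, FlairEmbeddings

# Mean-pool contextualised string embeddings into one vector per sentence.
# 'news-forward'/'news-backward' are placeholders for the HIPE-provided embeddings.
embedder = DocumentPoolEmbeddings([FlairEmbeddings("news-forward"),
                                   FlairEmbeddings("news-backward")])

def sentence_vector(text: str) -> torch.Tensor:
    sentence = Sentence(text)
    embedder.embed(sentence)
    return sentence.get_embedding()

def candidate_similarity(sentence: str, mention: str, description: str) -> float:
    """Cosine similarity between the original sentence and the sentence in which the
    mention is replaced by the candidate's Wikidata description (simplified substitution)."""
    modified = sentence.replace(mention, description)
    return torch.nn.functional.cosine_similarity(
        sentence_vector(sentence), sentence_vector(modified), dim=0).item()

# Example from the text: scoring candidate Q84 for the mention "London".
score = candidate_similarity("We went to London for a weekend",
                             "London",
                             "capital and largest city of the United Kingdom")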
The cosine similarity score is then multiplied by the Levenshtein similarity ratio between the target and candidate entity labels, to prefer exact matches where possible. In the example above, if one of the candidates is Q23306: Greater London, then its score would be multiplied by sim('London', 'Greater London') = 0.6, while the score for Q84: London would remain the same, as sim('London', 'London') = 1. The similarity ratio was calculated using the FuzzyWuzzy string matching library [7]. After using the resulting score to rank the list of candidate entities, a NIL value is inserted into the list before the first candidate whose score falls below a threshold. We chose the threshold value of 0.7 after tuning this parameter on the development set (a code sketch of this re-ranking and NIL-insertion step is given in Section 3.2).

For submission 2 only, we added historical spelling variations at the candidate retrieval step using the Natas library, which performs historical normalisation via neural machine translation [9].

The third run was prepared using REL [10], a modular system that is based on several state-of-the-art components and is available as a Python library as well as a web API.⁴ Entity linking in REL is divided into three components: (i) mention detection, (ii) candidate selection, and (iii) entity disambiguation. For this submission, mention detection was skipped, since the mention spans were already provided by the organisers as the ground truth. Candidate selection consists of retrieving seven candidates for each mention: the first four candidates are retrieved based on the co-occurrence probability of entities given a specific mention (a so-called p(e|m) index), and the remaining three are selected based on their contextual similarity to the mention in an embedding space. Entity disambiguation decisions are made by combining local compatibility (which includes prior importance and contextual similarity) with coherence with the other entity linking decisions in the document (global context).

3.2 Results and discussion

Run 1: Baseline. While the results @1 are below the HIPE baseline (Table 3), the performance @3 and @5 is better (Table 4). Similar results were achieved on the development set: while the correct entity would often make it into the top 3 or top 5 of the ranked candidate list, it was rarely selected by the algorithm as the most relevant answer, and the difference between candidate scores was usually small; the algorithm was not directly optimised for top-1 candidate selection. Another obstacle for the algorithm was NIL detection: as 30% of the mentions were not linkable [6], simply adding the NIL value to the ranked list of candidates based on a fixed threshold was not sufficient and resulted in an overwhelming number of false positives.

Table 3. Named entity linking results (English, literal sense)

                    strict @1         fuzzy @1
                    F    P    R       F    P    R
best HIPE           .633 .685 .658    .633 .685 .658
Run #3 REL          .593 .607 .580    .593 .607 .580
Run #1 Baseline     .367 .365 .369    .367 .365 .369
Run #2 Historical   .348 .344 .353    .348 .344 .353
baseline HIPE       .506 .506 .506    .506 .506 .506

Table 4. NEL-only results @3 and @5 (English, literal sense)

                    @3                @5
                    F    P    R       F    P    R
Run #1 Baseline     .463 .467 .465    .552 .557 .555
Run #2 Historical   .451 .463 .457    .540 .555 .548

⁴ https://github.com/informagi/REL

Run 2: Historical normalisation. Adding extra candidate entities by means of historical normalisation in the second submission resulted in more false positives and slightly decreased the overall performance in comparison to the first submission.
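For reference, the re-ranking and NIL-insertion step shared by runs 1 and 2 (Section 3.1) can be sketched as follows; the candidate tuples and their embedding scores below are illustrative, while 0.7 is the threshold tuned on the development set.

from fuzzywuzzy import fuzz

NIL_THRESHOLD = 0.7  # tuned on the HIPE development set

def rerank_with_nil(mention, candidates):
    """Re-rank (qid, label, embedding_score) candidates and insert a NIL entry.

    Each embedding score is multiplied by the Levenshtein similarity ratio between
    the mention and the candidate label; NIL is inserted before the first candidate
    whose combined score falls below the threshold."""
    scored = []
    for qid, label, embedding_score in candidates:
        ratio = fuzz.ratio(mention, label) / 100.0  # 'London' vs 'Greater London' -> 0.6
        scored.append((qid, embedding_score * ratio))
    scored.sort(key=lambda pair: pair[1], reverse=True)

    ranked, nil_inserted = [], False
    for qid, score in scored:
        if not nil_inserted and score < NIL_THRESHOLD:
            ranked.append("NIL")
            nil_inserted = True
        ranked.append(qid)
    return ranked

# Illustrative embedding scores: Q84 stays on top, NIL precedes Q23306.
print(rerank_with_nil("London", [("Q84", "London", 0.93),
                                 ("Q23306", "Greater London", 0.91)]))
# ['Q84', 'NIL', 'Q23306']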
A likely explanation for the drop in the second run is that the normalisation algorithm targets infrequent historical spellings [9], most of which are unlikely to be present in the HIPE dataset.

Run 3: REL. REL performs very well and takes second place in the scoring table, which is rather remarkable for an out-of-the-box linking system. We showed that REL provides a strong baseline for the NEL task on historical documents, demonstrating the level of performance that can be reached without accounting for additional properties of the data, such as OCR errors and language change.

4 Conclusion and future work

Our contributions within the CLEF HIPE shared task addressed coarse-grained named entity recognition (NERC) and two settings of entity linking: end-to-end and NEL-only. The results for NERC show that although fine-tuning BERT models for sequence classification is enough to outperform the baselines for all three languages, achieving top performance requires extra modifications in order to deal with the challenges specific to historical data. The NEL results show that, while using an embedding-based approach that takes historical spelling variations into account is better than relying solely on ElasticSearch retrieval scores, this approach is clearly outperformed by REL, as well as by many other solutions, mostly due to its poor performance on NIL prediction and an overwhelming number of false positives at the candidate selection step. REL, in turn, proves very effective in the setting of the shared task, even without specifically addressing the challenges of historical data.

There are several possible directions for future work across the subtasks that we approached in the context of the shared task:

Entity recognition and classification. Ways to achieve improvements over state-of-the-art sequence classification methods within the given task setup include (i) performing a more extensive parameter search for the Transformer models; (ii) fine-tuning more advanced pre-trained models, such as RoBERTa [11]; and (iii) reducing the impact of noise in the training data by using OCR correction algorithms, such as [8].

Entity linking. Since the task of entity linking consists of several steps, including candidate generation and entity disambiguation, we see further opportunities for improvement at each of these steps. Firstly, candidate generation can be improved to increase recall; one way to achieve this is to use OCR correction as a pre-processing step in the algorithm. Secondly, entity disambiguation should be improved in order to increase precision by decreasing the number of false positives; we consider graph-based disambiguation methods a promising research direction. Thirdly, using entity types as features instead of relying only on mention boundaries could also improve entity disambiguation in the end-to-end setting.

Acknowledgements

This research was supported by the NWO Innovational Research Incentives Scheme Vidi (016.Vidi.189.039), the NWO Smart Culture - Big Data / Digital Humanities programme (314-99-301), the H2020-EU.3.4. SOCIETAL CHALLENGES - Smart, Green And Integrated Transport programme (814961), and the Google Faculty Research Awards program. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

References

1. Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: Flair: An easy-to-use framework for state-of-the-art NLP.
In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). pp. 54–59 (2019)
2. Boros, E., Linhares Pontes, E., Cabrera-Diego, L.A., Hamdi, A., Moreno, J.G., Sidère, N., Doucet, A.: Robust Named Entity Recognition and Linking on Historical Multilingual Documents. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
4. Divya, M.S., Goyal, S.K.: Elasticsearch: An advanced and quick search technique to handle voluminous data. Compusoft 2(6), 171 (2013)
5. Ehrmann, M., Colavizza, G., Rochat, Y., Kaplan, F.: Diachronic Evaluation of NER Systems on Old Newspapers. In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016). pp. 97–107. Bochumer Linguistische Arbeitsberichte (2016), https://infoscience.epfl.ch/record/221391?ln=en
6. Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020). Lecture Notes in Computer Science (LNCS), vol. 12260. Springer (2020)
7. Gonzalez, J., Rodrigues, P., Cohen, A.: FuzzyWuzzy: Fuzzy string matching in Python (2017)
8. Hämäläinen, M., Hengchen, S.: From the paft to the fiiture: a fully automatic NMT and word embeddings method for OCR post-correction. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019). pp. 431–436. INCOMA Ltd., Varna, Bulgaria (Sep 2019), https://www.aclweb.org/anthology/R19-1051
9. Hämäläinen, M., Säily, T., Rueter, J., Tiedemann, J., Mäkelä, E.: Revisiting NMT for normalization of early English letters. In: Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. pp. 71–75. Association for Computational Linguistics, Minneapolis, USA (Jun 2019), https://www.aclweb.org/anthology/W19-2509
10. van Hulst, J.M., Hasibi, F., Dercksen, K., Balog, K., de Vries, A.P.: REL: An entity linker standing on the shoulders of giants. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '20, ACM (2020)
11. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
12. Piotrowski, M.: Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies 5(2), 1–157 (2012)
13. Provatorova, V., Kanoulas, E., Carlgren, A., Dupré, S., Hendriksen, M.: Art DATIS: Improving search in multilingual corpora to support art historians. Digital Humanities Benelux '19 (2019)
14. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. pp.
142–147 (2003), https://www.aclweb.org/anthology/W03-0419
15. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)