    Named Entity Disambiguation and Linking on
       Historic Newspaper OCR with BERT

                    Kai Labusch1 and Clemens Neudecker1

                          Staatsbibliothek zu Berlin -
                           Preußischer Kulturbesitz
                            10785 Berlin, Germany
             {kai.labusch,clemens.neudecker}@sbb.spk-berlin.de



      Abstract. In this paper, we propose a named entity disambiguation
      and linking (NED, NEL) system that consists of three components: (i)
      Lookup of possible candidates in an approximative nearest neighbour
      (ANN) index that stores BERT-embeddings. (ii) Evaluation of each can-
      didate by comparison of text passages of Wikipedia performed by a
      purpose-trained BERT model. (iii) Final ranking of candidates on the
      basis of information gathered from previous steps. We participated in
      the CLEF 2020 HIPE NERC-COARSE and NEL-LIT tasks for German,
      French, and English. The CLEF HIPE 2020 results show that our NEL
      approach is competitive in terms of precision but has low recall perfor-
      mance due to insufficient knowledge base coverage of the test data.

      Keywords: Named Entity Recognition · Entity Linking · BERT · OCR


1    Introduction

Our participation in the CLEF HIPE 2020 NER-COARSE and NEL-LIT task1
has been conducted as part of the Qurator2 project within the Berlin State Li-
brary (Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, SBB). One goal of
the SBB in the Qurator project is the development of a system that identifies
persons, locations and organizations within digitized historical text material ob-
tained by Optical Character Recognition (OCR) and then links recognized enti-
ties to their corresponding Wikidata-IDs. Here, we provide a high-level overview
of the functionality of our system; for details, take a deeper look at the informa-
tion provided together with the source code3 .
    The paper is structured as follows: after a brief introduction of the back-
ground and use case, a short summary of the Named Entity Recognition system
  Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-
  mons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 Septem-
  ber 2020, Thessaloniki, Greece.
1
  https://impresso.github.io/CLEF-HIPE-2020/
2
  https://qurator.ai
3
  https://github.com/qurator-spk/sbb_ned
is provided in chapter 2. Chapter 3 outlines the Entity Linking approach de-
veloped in greater detail. Chapter 4 covers the chosen method for evaluation of
candidates for entity linking and chapter 5 continues with a description of their
ranking. Following a discussion of the results obtained in the NER-COARSE
and NEL-LIT tasks in chapter 6, we wrap up with some concluding remarks and
potentials for further improvement in chapter 7.


1.1   Background

The SBB is continuously digitizing its copyright-free holdings and making them
publicly available online in various formats for viewing and browsing4 or auto-
mated5 download. As part of an on-going process, a growing amount of OCR-
derived full-texts of the digitized printed material is provided in ALTO6 format
for internal use cases such as full-text indexing and other information retrieval
tasks.
    With an increasing amount of digitized sources becoming available online, the
need for automated ways of extracting additional information from these sources
increases as well. Disciplines such as the Digital Humanities create use cases for
text and data mining or the semantic enrichment of the full-texts with e.g.
Named Entity Recognition and Linking (e.g. for the re-construction of historical
social networks7 ).
    The surge in popularity of neural networks since the early 2010s has addressed
this need: such models not only cope well with large amounts of data (i.e., big
data), but also require enormous amounts of training data in order to produce
high-quality results. However, due to the historical nature of the
documents being digitized in libraries, standard methods and procedures from
the NLP domain typically require additional adaptation in order to successfully
deal with the historical spelling variation and the remaining noise resulting from
OCR errors.


1.2   Qurator

The Qurator project [9], funded by the German Federal Ministry of Education
and Research (BMBF) for a timeframe of three years (11/2018-10/2021), is
based in the Berlin/Brandenburg metropolitan region. The consortium of ten
project partners from research and industry combines vast expertise in areas
such as Language as well as Knowledge Technologies, Artificial Intelligence and
Machine Learning.
   The project’s main goal is the development of a sustainable technology plat-
form that supports knowledge workers in various industries. The platform will
4
  https://digital.staatsbibliothek-berlin.de
5
  https://oai.sbb.berlin
6
  https://www.loc.gov/standards/alto/
7
  https://sonar.fh-potsdam.de/
simplify the curation of digital content and accelerate it dramatically. AI tech-
niques are integrated into curation technologies and curation workflows in the
form of domain specific solutions covering the entire life cycle of content cura-
tion. The solutions being developed focus on curation services for the domains
of culture, media, health and industry.
    Within the Qurator consortium, the SBB is responsible for the task area “Cu-
ration Technologies for Digitized Cultural Heritage”. The main goals of this task
area lie in the development and adaptation of novel, AI/ML-based approaches
from the document analysis and NLP domains for the improvement of the qual-
ity of OCR full-texts and the semantic enrichment of the derived full-texts with
NER and NEL. The baseline for this development are the digitized collections of
the SBB, with approximately 175,000 digitized documents (as of August 2020) from
the timeframe 1400–1920. While most of the documents are in German, there
is great variation with many other European and also Asian languages being
present in the collection. The collection comprises documents from a wide ar-
ray of publication formats, including books, newspapers, journals, maps, letters,
posters, and many more.

1.3    HIPE
The introduction of the CLEF HIPE 2020 shared task provided a welcome oppor-
tunity to assess the performance of our own NER and NED systems in compar-
ison with others within the frame of a common and realistic benchmark setting.
HIPE proposes two tasks, NER and NEL, for French, German and English,
with OCRed historical newspapers as input. The SBB’s digitization strategy
has traditionally put a strong focus on historic newspapers, with projects like
Europeana Newspapers[8] producing millions of pages of OCR from digitized
newspapers.
    Recent years have also brought about the application of deep learning models
for NER and NEL, where HIPE first puts these developments to the test for
more challenging historical and noisy materials. We therefore expect that many
valuable insights and directions for future work will result from our participation
in the HIPE shared task.


2     Named Entity Recognition
Before entity disambiguation starts, the input text is run through a named en-
tity recognition (NER) system that tags all person (PER), location (LOC) and
organization (ORG) entities. For the CLEF HIPE 2020 task, we used a BERT[3]
based NER-system that has been developed previously at SBB and described in
[5].
     We employed our off-the-shelf system8 and did not use CLEF HIPE 2020
NER training data for fine-tuning. Our off-the-shelf system does not currently
7
    https://impresso.github.io/CLEF-HIPE-2020/
8
  https://github.com/qurator-spk/sbb_ner
support product (PROD) and time (TIME) entities. The German NER system
has been trained simultaneously on recent and historical German NER ground
truth. In case of French and English, we used our multilingual model, i.e., a
single BERT model that was trained for NER on combined German, French,
Dutch and English NER labeled data.
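    As an illustration of this tagging step, the following minimal sketch applies a
fine-tuned BERT token-classification model through the Hugging Face transformers
pipeline. The checkpoint path is hypothetical; the actual SBB system is available
from the repository referenced in footnote 8.

from transformers import pipeline

# Hypothetical path to a BERT checkpoint fine-tuned for PER/LOC/ORG token
# classification; the actual SBB models are provided via the sbb_ner repository.
ner_tagger = pipeline(
    "token-classification",
    model="path/to/historic-german-ner-checkpoint",
    aggregation_strategy="simple",  # merge word pieces into complete entity spans
)

text = "Georg Christoph Lichtenberg lehrte in Göttingen."
for entity in ner_tagger(text):
    # each result contains e.g. 'entity_group' (PER/LOC/ORG), 'word' and 'score'
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))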
    Starting from multilingual BERT-Base Cased, we applied unsupervised pre-
training composed of the “Masked-LM” and “Next Sentence Prediction” tasks
proposed by [3] using 2,333,647 pages of unlabeled historical German text from
the DC-SBB dataset [6]. Furthermore, we performed supervised pre-training on
NER ground truth using the Europeana Newspapers [7], CoNLL-2003 [12] and
GermEval-2014 [1] datasets.
    In the corresponding cross-evaluation, it was found that unsupervised pre-training
on DC-SBB data worsens BERT performance in the case of contemporary train-
ing/test pairs, while the performance improves for most experiments that test on
historical ground truth. The best performance of our model is achieved by com-
bining pre-training on DC-SBB + GermEval + CoNLL, and the results obtained
are comparable to the state of the art (see Table 1). For the discussion
of the performance of our NER system in the particular context of HIPE, please
see chapter 6.


                                                 P          R          F1
 [5]   DC-SBB+GermEval+CoNLL             81.1 ±1.2  87.8 ±1.4  84.3 ±1.1
 [10]  Newspaper (1703-1875)                     -          -      85.31
 [11]  Newspaper (1888-1945)                     -          -      77.51

Table 1. Performance comparison of different historical German NER BERT models.
Results in [5] were obtained by 5-fold cross-validation; results in [10] and [11] were
obtained for an 80/20 training/test split.




3      Entity Linking: Lookup of Candidates

3.1    Construction of knowledge base for PER, LOC and ORG

Our entity linking and disambiguation works by comparison of continuous text
snippets where the entities in question are mentioned. A purpose-trained BERT
model (the evaluation model) performs that text comparison task (see chap-
ter 4). Therefore, a knowledge base that contains only structured information, such
as Wikidata, is not sufficient. Instead, we need additional continuous text where the
entities that are part of the knowledge base are discussed, mentioned and ref-
erenced. Hence, we derive the knowledge base such that each entity in it has a
corresponding Wikipedia page since the Wikipedia articles contain continuous
                        Lang      PER      LOC      ORG   Coverage of test data
                        DE      671398   374048   136044                    71%
                        FR      217383   155856    39305                    68%
                        EN      324607   198570    58730                    47%

Table 2. Size of the knowledge base per category and language. The French and English
knowledge bases are significantly smaller than the German one due to the loss of entities
in the Wikipedia-Wikidata mapping. Coverage of the CLEF HIPE 2020 NEL-LIT test data
Q-IDs is similar for German and French while being significantly worse for English.



text that has been annotated by human authors with references that can serve
as ground truth.
    The knowledge base has been directly derived from Wikipedia through the
identification of persons, locations and organizations within the German Wikipedia
by recursive traversal of its category structure:
 – PER: All pages that are part of the categories “Frau” or “Mann” or of one
   of the reachable sub-categories of “Frau” and “Mann”. One problem with
   this approach is that fictional “persons” are typically not contained in that
   selection.
 – LOC: All pages that are part of the category “Geographisches Objekt” or
   one of its sub-categories. We exclude everything that is part of “Geographis-
   cher Begriff” or one of its sub-categories.
 – ORG: All pages that are part of the category “Organisation” or one of its
   sub-categories.
Note: we plan to use the structured information of Wikidata in order to more
reliably identify PER, LOC and ORG entities within Wikipedia, which should
make this heuristic approach to knowledge base creation obsolete.
     Some pages might end up in multiple entity classes at the same time due
to the category structure of the German Wikipedia. In order to create disjoint
entity classes, we first remove from the entity class ORG everything that is also
included in PER or LOC. In a second step, we remove everything from the entity
class LOC that is also part of PER or ORG. It has been pointed out by one of
our reviewers that this step is conceptually not required by our approach and
will actually become obsolete as soon as we identify PER, LOC and ORG entities
on the basis of Wikidata.
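    The following sketch illustrates this heuristic knowledge-base construction, i.e.,
the recursive traversal of the category graph and the subsequent disjoining of the
entity classes. The two dictionaries describing the category structure are placeholders
that are assumed to have been extracted from a German Wikipedia dump beforehand;
the helper function is only for illustration and not part of the released code.

def collect_pages(seed_categories, sub_categories, category_pages):
    """Collect all page titles reachable from the seed categories by
    recursive traversal of the category graph."""
    pages, stack, seen = set(), list(seed_categories), set()
    while stack:
        category = stack.pop()
        if category in seen:
            continue
        seen.add(category)
        pages.update(category_pages.get(category, ()))
        stack.extend(sub_categories.get(category, ()))
    return pages

sub_categories = {}   # category -> list of sub-categories (placeholder)
category_pages = {}   # category -> list of page titles (placeholder)

per = collect_pages({"Frau", "Mann"}, sub_categories, category_pages)
loc = (collect_pages({"Geographisches Objekt"}, sub_categories, category_pages)
       - collect_pages({"Geographischer Begriff"}, sub_categories, category_pages))
org = collect_pages({"Organisation"}, sub_categories, category_pages)

# make the entity classes disjoint, as described above
org -= per | loc
loc -= per | org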
     To construct knowledge bases for French and English, we first map the iden-
tified German Wikipedia entity pages to their corresponding Wikidata-IDs and
then the Wikidata-IDs back to the corresponding French and English Wikipedia
pages. Table 2 shows the size of the knowledge bases per category and language.
Note that the knowledge bases for French and English are significantly smaller
than the German one due to the loss of many entities in the Wikipedia-Wikidata
mapping.
    What is the cause of that loss? We checked a random sample of entities
of all types (PER, LOC, ORG) that had been lost in the mapping between the
German and the French or English Wikipedia. In all cases, Wikidata did
not contain a reference to an English or French version of the page. Hence,
either no French or English version of that Wikipedia page exists, or the
corresponding link has not been established so far. We expect to end
up with much larger knowledge bases once structured data from Wikidata is
used for the identification of entities.
    After the unmasked CLEF HIPE 2020 test data had been published, we
computed the coverage of our per-language knowledge bases, i.e., the percentage
of Wikidata entity IDs (NEL-LIT) in the test data that can actually be found
in the corresponding knowledge base. That percentage is an upper bound on
the system's performance. As shown in Table 2, the coverage is similar for
German and French (roughly 70%) whereas it is significantly worse for English
(roughly 50%).
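    Computing this coverage amounts to a simple intersection of Wikidata Q-ID sets;
a minimal sketch with placeholder sets:

# Placeholder Q-ID sets; in practice they are read from the knowledge base and
# from the unmasked NEL-LIT test data, respectively.
kb_qids = {"Q61", "Q1055", "Q76694"}
test_qids = {"Q61", "Q1055", "Q90", "Q8447"}

coverage = len(test_qids & kb_qids) / len(test_qids)
print(f"coverage: {coverage:.0%}")  # upper bound on the system's performance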


3.2   Entity Lookup Index

After the knowledge bases have been established, for each of them an entity
lookup index is created by computation of BERT embeddings of the page titles
of the identified PER, LOC and ORG Wikipedia pages. The BERT embeddings
are obtained from a combination of different layers of the evaluation model (see
chapter 4). The embedding vectors of the tokens of the page titles are stored in
an approximative nearest neighbour (ANN) index [2]. We use cosine similarity as
distance measure and the ANN index uses 100 random projection search trees.
There are separate ANN indices per supported language and per supported
entity category.
    Given some NER-tagged surface form that is part of the input text, up to 400
linking candidates below a cut-off distance of 0.1 are selected by looking up the
nearest neighbours of the surface form's embedding in the approximative nearest
neighbour index of the corresponding language and entity category.
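    The following minimal sketch shows this lookup with Annoy [2]. The page titles
and embedding vectors are placeholders; the real vectors come from the BERT
evaluation model, and Annoy's "angular" metric serves here as a stand-in for
cosine distance.

import random
from annoy import AnnoyIndex

DIM, N_TREES, MAX_CANDIDATES, CUTOFF = 768, 100, 400, 0.1

# Placeholder page titles and embeddings of the knowledge-base entities.
titles = ["Georg Christoph Lichtenberg", "Lichtenberg (Berlin)"]
title_embeddings = [[random.random() for _ in range(DIM)] for _ in titles]

index = AnnoyIndex(DIM, "angular")  # "angular" as a stand-in for cosine distance
for item_id, embedding in enumerate(title_embeddings):
    index.add_item(item_id, embedding)
index.build(N_TREES)  # 100 random projection search trees, as in our setup

def lookup(surface_embedding):
    """Return the candidate page titles within the cut-off distance,
    at most MAX_CANDIDATES of them."""
    ids, distances = index.get_nns_by_vector(
        surface_embedding, MAX_CANDIDATES, include_distances=True)
    return [titles[i] for i, d in zip(ids, distances) if d < CUTOFF]

candidates = lookup([random.random() for _ in range(DIM)])  # placeholder query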
    According to our observations, the performance of our system improves with
the number of candidates considered. There is, of course, some upper limit to
that; more importantly, the computational cost grows with the number of
linking candidates. We did not systematically evaluate the effect of the number
of linking candidates but chose a value that proved sufficiently high in practice.
Note that there is also an interaction with the cut-off distance, since in many
cases there are fewer than 400 nearest neighbours within a distance of less
than 0.1.


4     Evaluation of Candidates

For each entity of the knowledge bases (see chapter 3.1) there are text passages
in Wikipedia where some human Wikipedia editor has linked to that particular
entity. How many linked text passages we have for some particular entity differs
 SQL-table “sentences”
 id        : A unique number that identifies each sentence.
 text      : A JSON-array that contains the tokens of the sentence. Example:
            [”Der”, ”Begriff”, ”wurde”, ”von”, ”Georg”, ”Christoph”, ”Lichtenberg”, ”einge-
            bracht”, ”.”]
 entities : A JSON-array of the same length as “text” that contains for each token of
            the sentence the target entity if the token is part of a Wikipedia link
            that has been created by some Wikipedia author. If a token is not part of
            a Wikipedia link, its corresponding entity is empty. Example:
            [””, ””, ””, ””, ”Georg Christoph Lichtenberg”, ”Georg Christoph Lichtenberg”,
            ”Georg Christoph Lichtenberg”, ””, ””]


 SQL-table “links”
 id       : A unique number that identifies each Wikipedia entity reference.
 sentence : The sentence-id of the sentence where the reference occurs (sentences.id).
 target : The target entity of the reference. Example:
            ”Georg Christoph Lichtenberg”


Table 3. The SQLite sentence database consists of two tables. The “sentences” table
contains all the sentences of Wikipedia in which some Wikipedia author referenced a
PER, LOC, or ORG entity. The “links” table enumerates all references to PER, LOC or
ORG entities in Wikipedia. In order to get all the sentences of the German Wikipedia
in which “Georg Christoph Lichtenberg” has been referenced by some Wikipedia author,
the following SQL statement is used:
SELECT links.target, sentences.id, sentences.text, sentences.entities FROM links JOIN sentences
ON links.sentence=sentences.id WHERE links.target==”Georg Christoph Lichtenberg”




widely depending on the entity. Some entities have thousands of links available
whereas other entities have only very few.
    We created a SQLite database that provides quick access to the mentions
of some particular entity. Using the Wikipedia page title of the entity as key,
for instance, ”Georg Christoph Lichtenberg”, the database returns all sentences
where some human editor explicitly linked to ”Georg Christoph Lichtenberg”.
The database can be derived programmatically from the Wikipedia without
any human annotation being involved. Table 3 gives a short description of the
structure of the SQLITE database.
    Using that database, we created a training dataset that consists of random
sentence pairs (A,B) where sentences (A,B) either reference the same entity or
different entities. That training dataset defines a binary classification problem:
Do sentences A and B refer to the same item or not?
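    The following is a minimal sketch of how such pairs can be drawn from the
sentence database of Table 3 (the database file name is hypothetical). Note that
purely random sampling yields mostly non-matching pairs, so in practice matching
pairs have to be sampled explicitly from sentences that link to the same entity.

import json
import sqlite3

con = sqlite3.connect("wikipedia-sentences.sqlite")  # hypothetical file name

def random_linked_sentence():
    """Pick one random Wikipedia link and return (target entity, sentence tokens)."""
    target, text = con.execute(
        "SELECT links.target, sentences.text FROM links "
        "JOIN sentences ON links.sentence = sentences.id "
        "ORDER BY RANDOM() LIMIT 1").fetchone()
    return target, json.loads(text)

def sample_pair():
    """One training example: sentences A and B plus a same-entity label."""
    target_a, tokens_a = random_linked_sentence()
    target_b, tokens_b = random_linked_sentence()
    label = int(target_a == target_b)  # 1: same entity, 0: different entities
    return " ".join(tokens_a), " ".join(tokens_b), label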
    We trained a BERT model with respect to this binary classification problem
per supported language that we call the “evaluation model” in the following.
Given some arbitrary sentence pair (A,B), the evaluation model outputs the
probability of the two sentences referring to the same item.
    During entity disambiguation, we build up to 50 sentence pairs (A,B) for
each candidate that has been found in the lookup step (see chapter 3.2). The
sentence pairs are composed in such a way that sentence A is part of the input
text where the entity that is to be linked is mentioned and sentence B is a
sentence from Wikipedia where that particular candidate has been linked to
by a Wikipedia author. The higher the number of evaluated sentence pairs per
candidate, the more reliably the ranking model (see Section 5) can determine
the overall matching probability. Again, the computational complexity increases
with the number of sentence pairs. Additionally, in most cases there is only a
very limited number of reference sentences from Wikipedia available such that it
is not possible to generate a large number of unique sentence pairs. The choice
of 50 sentence pairs is a trade-off that takes into account these considerations.
    Application of the evaluation model to each sentence pair results in a corre-
sponding matching probability. The sets of sentence pair matching probabilities
of all candidates are then further processed by the ranking model (see Section
5).
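    The following sketch outlines this evaluation step, assuming the evaluation model
is a BERT sequence-pair classifier fine-tuned as described above. The checkpoint
path is hypothetical, and the assumption that class index 1 denotes "same entity"
is ours.

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

MODEL_PATH = "path/to/evaluation-model"  # hypothetical fine-tuned checkpoint
tokenizer = BertTokenizerFast.from_pretrained(MODEL_PATH)
model = BertForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

def match_probabilities(input_sentence, wikipedia_sentences):
    """Probability that the input sentence refers to the same entity as each of
    the (at most 50) Wikipedia sentences of one linking candidate."""
    pairs = wikipedia_sentences[:50]
    batch = tokenizer([input_sentence] * len(pairs), pairs,
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    # class index 1 is assumed to be the "same entity" class
    return torch.softmax(logits, dim=-1)[:, 1].tolist()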


5   Ranking of Candidates

During previous steps, sets of possible entity candidates have been obtained for
all the parts of the input text that have been NER-tagged. For each candidate, a
number of sentence pairs have been examined by the evaluation model, resulting
in a set of sentence pair probabilities per candidate.
    The ranking step finally determines, for each tagged mention, an ordering of its
candidates according to the probability that a candidate is the “correct” entity that
the respective part of the input text actually refers to.
    We compute statistical features of the sets of sentence pair probabilities of
the candidates, among them: mean, median, min, max, standard deviation as
well as various quantiles. Additionally we sort all the sentence pair probabilities
and compute ranking statistics over all the candidates.
    Then, based on the statistical features that describe the set of sentence pair
probabilities of each candidate, a random forest model computes the overall
probability that some particular candidate is actually the “correct” correspond-
ing entity. The random forest model is the only component of our system where
the CLEF HIPE 2020 data was used for training.
    Finally the candidates are sorted according to the overall matching proba-
bilities that have been estimated by the random forest model. The final output
of our NED system is the sorted list of candidates where candidates that have a
matching probability less than 0.2 are cut off.
    Our NED system does not implement the NIL entity; that means it either
returns a non-empty list of Wikidata IDs sorted in descending
order according to their overall matching probabilities, or the result is “-” if there
is no candidate with a matching probability above 0.2.
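    The following sketch outlines the ranking step, omitting the cross-candidate
ranking statistics for brevity. The exact feature set shown here is an approximation,
and the random forest is assumed to have been trained on the CLEF HIPE 2020
training data beforehand.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def features(pair_probs):
    """Statistical description of one candidate's sentence-pair probabilities."""
    p = np.asarray(pair_probs)
    return np.concatenate((
        [p.mean(), np.median(p), p.min(), p.max(), p.std()],
        np.quantile(p, [0.1, 0.25, 0.75, 0.9]),
    ))

def rank(candidates, ranking_model: RandomForestClassifier, threshold=0.2):
    """candidates: dict mapping Wikidata Q-ID -> list of sentence-pair probabilities.
    Returns (Q-ID, probability) pairs sorted by descending matching probability;
    an empty list corresponds to the "-" output of the system."""
    qids = list(candidates)
    X = np.stack([features(candidates[qid]) for qid in qids])
    overall = ranking_model.predict_proba(X)[:, 1]  # class 1: "correct" candidate
    ranked = sorted(((p, qid) for qid, p in zip(qids, overall) if p > threshold),
                    reverse=True)
    return [(qid, p) for p, qid in ranked]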


6   Results

Table 4 lists the NER performance of our off-the-shelf NER system (SBB) on
the CLEF HIPE 2020 test data in the NER-COARSE-LIT task. Additionally, it
also contains the results of the best performing system (L3i). In case of the SBB
system, strict NER performance is significantly worse than fuzzy NER perfor-
mance. That observation also holds for the L3i system; however, for our system
the effect is much more pronounced. Strict NER is a much more demanding task;
nevertheless, we partly attribute the difference in performance to the training
data of our NER system (see [5]), which was created according to multiple
slightly different NER annotation standards, and also to the fact that we did not
fine-tune the NER system on the training data provided by the CLEF HIPE
2020 task organizers.

    According to our observations, the OCR quality of the French data is slightly
better than the German one and both French and German have better OCR
quality than the English text material. By OCR quality, we primarily mean
the overall quality of the entire text but not the mean Levenshtein distances
of the entity text passages with respect to the original text. NER performance
reflects that observation, i.e., French and German are comparable whereas
English is significantly worse. Therefore our current hypothesis is that these
differences are partly caused by the sensitivity of our NER-tagger to OCR noise
within the surrounding text.




         Lang Team Evaluation                      Label    P     R    F1
         DE   L3i NE-COARSE-LIT-micro-fuzzy ALL 0.870 0.886 0.878
         DE   SBB NE-COARSE-LIT-micro-fuzzy ALL 0.730 0.708 0.719
         DE   L3i NE-COARSE-LIT-micro-strict ALL 0.790 0.805 0.797
         DE   SBB NE-COARSE-LIT-micro-strict ALL 0.499 0.484 0.491
         FR   L3i NE-COARSE-LIT-micro-fuzzy ALL 0.912 0.931 0.921
         FR   SBB NE-COARSE-LIT-micro-fuzzy ALL 0.765 0.689 0.725
         FR   L3i NE-COARSE-LIT-micro-strict ALL 0.831 0.849 0.840
         FR   SBB NE-COARSE-LIT-micro-strict ALL 0.530 0.477 0.502
         EN   L3i NE-COARSE-LIT-micro-fuzzy ALL 0.794 0.817 0.806
         EN   SBB NE-COARSE-LIT-micro-fuzzy ALL 0.642 0.572 0.605
         EN   L3i NE-COARSE-LIT-micro-strict ALL 0.623 0.641 0.632
         EN   SBB NE-COARSE-LIT-micro-strict ALL 0.347 0.310 0.327

Table 4. NER-COARSE results of our (SBB) off-the-shelf BERT based NER sys-
tem on the CLEF HIPE 2020 test data in comparison to the best performing system
(L3i). The SBB system has not been trained on the CLEF HIPE 2020 data and does
not support PROD and TIME entities. For German, the system has been trained on
recent and historical German data simultaneously whereas for French and English, we
employed a multilingual system that has been trained on German, Dutch, French and
English data at the same time.
Table 5 shows the NEL performance of our system when our BERT-based NER
tagging is used as input, whereas Table 6 contains the results that were
obtained when the NER ground truth was provided to the NEL system.
The two tables show that NEL performance significantly improves if NER ground
truth is provided.
    Interestingly, the recall of the French and German NEL system is similar
although the French knowledge base is significantly smaller than the German
one. This observation can be explained by the fact that coverage of the test
data of the knowledge bases for German in French is similar (see Table 2). We
attribute the much lower recall for the English test data to much lower coverage
of the test data of the English knowledge base (see Table 2).
    Precision of the German and French SBB systems is comparable; again, preci-
sion of the English system is significantly worse, even if NER ground truth is pro-
vided. As explained in Section 5, our system provides a list of candidates with
matching probability above 0.2, sorted in descending order of matching
probability. Hence, given the poor coverage of the knowledge base, as is
the case for English, non-matching candidates will inevitably move up
in that sorted list, i.e., the drop in precision can also be explained by the poor
coverage of the knowledge base.


        Lang Team Evaluation                          Label      P     R    F1
        DE    SBB   NEL-LIT-micro-fuzzy-@1         ALL        0.540 0.304 0.389
        DE    SBB   NEL-LIT-micro-fuzzy-relaxed-@1 ALL        0.561 0.315 0.403
        DE    SBB   NEL-LIT-micro-fuzzy-relaxed-@3 ALL        0.590 0.332 0.425
        DE    SBB   NEL-LIT-micro-fuzzy-relaxed-@5 ALL        0.601 0.338 0.432
        FR    SBB   NEL-LIT-micro-fuzzy-@1         ALL        0.594 0.310 0.407
        FR    SBB   NEL-LIT-micro-fuzzy-relaxed-@1 ALL        0.616 0.321 0.422
        FR    SBB   NEL-LIT-micro-fuzzy-relaxed-@3 ALL        0.624 0.325 0.428
        FR    SBB   NEL-LIT-micro-fuzzy-relaxed-@5 ALL        0.629 0.328 0.431
        EN    SBB   NEL-LIT-micro-fuzzy-@1         ALL        0.257 0.097 0.141
        EN    SBB   NEL-LIT-micro-fuzzy-relaxed-@1 ALL        0.257 0.097 0.141
        EN    SBB   NEL-LIT-micro-fuzzy-relaxed-@3 ALL        0.299 0.112 0.163
        EN    SBB   NEL-LIT-micro-fuzzy-relaxed-@5 ALL        0.299 0.112 0.163

Table 5. NEL-LIT results with NER-tagging performed by our off-the-shelf system.
French and German performance is similar, English is significantly worse. The stark
performance differences between German and French versus English can mainly be
explained by differences in coverage of the test data of the knowledge bases (see Table
2).




Table 7 reports the best NEL-LIT results per team where the NER task has
been performed by each team's own NER system. Table 8 reports the best
NEL-LIT results per team where the NER ground truth has been provided to
        Lang Team Evaluation                          Label      P     R    F1
        DE    SBB   NEL-LIT-micro-fuzzy-@1         ALL        0.615 0.349 0.445
        DE    SBB   NEL-LIT-micro-fuzzy-relaxed-@1 ALL        0.636 0.361 0.461
        DE    SBB   NEL-LIT-micro-fuzzy-relaxed-@3 ALL        0.673 0.382 0.488
        DE    SBB   NEL-LIT-micro-fuzzy-relaxed-@5 ALL        0.686 0.389 0.497
        FR    SBB   NEL-LIT-micro-fuzzy-@1         ALL        0.677 0.371 0.480
        FR    SBB   NEL-LIT-micro-fuzzy-relaxed-@1 ALL        0.699 0.383 0.495
        FR    SBB   NEL-LIT-micro-fuzzy-relaxed-@3 ALL        0.710 0.390 0.503
        FR    SBB   NEL-LIT-micro-fuzzy-relaxed-@5 ALL        0.716 0.393 0.507
        EN    SBB   NEL-LIT-micro-fuzzy-@1         ALL        0.344 0.119 0.177
        EN    SBB   NEL-LIT-micro-fuzzy-relaxed-@1 ALL        0.344 0.119 0.177
        EN    SBB   NEL-LIT-micro-fuzzy-relaxed-@3 ALL        0.390 0.135 0.200
        EN    SBB   NEL-LIT-micro-fuzzy-relaxed-@5 ALL        0.390 0.135 0.200

Table 6. NEL-LIT results with NER ground truth provided. As expected, availability
of NER ground truth significantly improves NEL results (see Table 5 for comparison).
The stark performance differences between German and French versus English can
mainly be explained by differences in coverage of the test data of the knowledge bases
(see Table 2).




each team. In both tables, i.e., Table 7 and Table 8, the results have been sorted
according to precision. It turns out that our SBB NEL system performed quite
competitively in terms of precision but rather poorly in terms of recall.
We attribute the poor recall performance to several factors:

 – Due to the construction of the knowledge bases, many entities end up without
   representation. Even for German, which has the best coverage, the coverage is
   only 71%.
 – The lookup step of our NEL system has not been extensively optimized so
   far. The embeddings stored in the approximative nearest neighbour indices,
   for instance, have been selected on an initial-guess basis and have not been
   optimized for performance. Which layers of the model to use and how to
   combine them heavily impacts the properties of the lookup step. Additionally,
   the parameters of the approximative nearest neighbour indices, such as the
   type of similarity measure, the number of lookup trees and the cut-off distance,
   have also been chosen on an initial-guess basis and could be further optimized.



7    Conclusion

The results of our participation in the HIPE task highlight where the biggest
potential for improvement of our NER / NEL / NED system is to be expected:
       Lang Team     Evaluation                      Label       P       R   F1
       DE   L3i      NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.627 0.636 0.632
       DE   SBB      NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.601 0.338 0.432
       DE   UvA.ILPS NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.311 0.345 0.327
       FR   L3i      NEL-LIT-micro-fuzzy-relaxed-@5 ALL      0.695 0.705 0.700
       FR   SBB      NEL-LIT-micro-fuzzy-relaxed-@5 ALL      0.629 0.328 0.431
       FR   IRISA    NEL-LIT-micro-fuzzy-relaxed-@5 ALL      0.560 0.490 0.523
       FR   UvA.ILPS NEL-LIT-micro-fuzzy-relaxed-@5 ALL      0.397 0.220 0.283
       FR   ERTIM    NEL-LIT-micro-fuzzy-relaxed-@5 ALL      0.150 0.084 0.108
       EN   L3i      NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.651 0.674 0.662
       EN   UvA.ILPS NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.304 0.458 0.366
       EN   SBB      NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.299 0.112 0.163

Table 7. NEL-LIT results per team with NER-tagging performed by each team's own
NER system. The results have been sorted according to precision.




  Lang Team              Evaluation                      Label       P       R    F1
  DE    L3i                NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.696 0.696 0.696
  DE    SBB                NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.686 0.389 0.497
  DE    aidalight-baseline NEL-LIT-micro-fuzzy-relaxed-@5 ALL 0.440 0.435 0.437
  FR    L3i                NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.746 0.743 0.744
  FR    SBB                NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.716 0.393 0.507
  FR    Inria-DeLFT        NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.604 0.670 0.635
  FR    IRISA              NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.590 0.588 0.589
  FR    aidalight-baseline NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.516 0.508 0.512
  EN    L3i                NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.744 0.744 0.744
  EN    Inria-DeLFT        NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.633 0.685 0.658
  EN    UvA.ILPS           NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.607 0.580 0.593
  EN    aidalight-baseline NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.506 0.506 0.506
  EN    SBB                NEL-LIT-micro-fuzzy-relaxed-@5 ALL    0.390 0.135 0.200

Table 8. NEL-LIT results per team with NER ground truth provided. The results
have been sorted according to precision.
 – OCR performance is crucial since it is the start of the processing chain, and
   OCR noise, as expected, degrades the results of all subsequent processing
   steps.
 – NEL recall performance of the SBB system has the biggest potential for
   improvement. An obvious path to improving recall is a better construction
   of the knowledge bases, which should lead to a better overall representation
   of entities.
 – An extensive evaluation and optimization of the lookup step that includes
   hardening against OCR noise could improve recall.
 – NER results of other teams show that large improvements in NER perfor-
   mance are possible even in the presence of noise [4]. Such improvements
   directly benefit the NED/NEL steps. We will therefore carefully evaluate
   how these improvements have been achieved in order to optimize our
   own NER tagger.
Due to the diverse nature of the CLEF HIPE 2020 task data, and in particular the
differences in OCR quality, the performance evaluation has provided us with
valuable insights into our NER/NED/NEL system. The HIPE task data is, in our
opinion, quite realistic, which means that we expect our system to have to handle
similar data in the real world. Hence, we consider our participation in the HIPE
competition an important and constructive step on the path towards improving
NER/NED processing of real-world text material that has been obtained by OCR
of historical documents.


References
 1. Benikova, D., Biemann, C., Kisselew, M., Padó, S.: GermEval 2014 named entity
    recognition: Companion paper. Proceedings of the KONVENS GermEval Shared
    Task on Named Entity Recognition, Hildesheim, Germany pp. 104–112 (2014)
 2. Bernhardsson, E.: Annoy: Approximate Nearest Neighbors in C++/Python (2018),
    https://github.com/spotify/annoy
 3. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirec-
    tional Transformers for Language Understanding. CoRR abs/1810.04805 (2018),
    http://arxiv.org/abs/1810.04805
 4. Ehrmann, M., Romanello, M., Bircher, S., Clematide, S.: Introducing the CLEF
    2020 HIPE shared task: Named entity recognition and linking on historical news-
    papers. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva,
    M.J., Martins, F. (eds.) Advances in information retrieval. pp. 524–532. Springer
    International Publishing, Cham (2020)
 5. Labusch, K., Neudecker, C., Zellhöfer, D.: BERT for Named Entity Recogni-
    tion in Contemporary and Historic German. In: Proceedings of the 15th Con-
    ference on Natural Language Processing (KONVENS 2019): Long Papers. p. 1–9.
    German Society for Computational Linguistics & Language Technology, Erlan-
    gen, Germany (2019), https://corpora.linguistik.uni-erlangen.de/data/konvens/
    proceedings/papers/KONVENS2019_paper_4.pdf
 6. Labusch, K., Zellhöfer, D.: OCR Fulltexts of the Digital Collections of the
    Berlin State Library (DC-SBB) (June 26th 2019), https://doi.org/10.5281/zenodo.
    3257041
 7. Neudecker, C.: An open corpus for named entity recognition in historic newspapers.
    In: Proceedings of the Tenth International Conference on Language Resources and
    Evaluation (LREC 2016). pp. 4348–4352. European Language Resources Associa-
    tion (ELRA), Portorož, Slovenia (May 2016), https://www.aclweb.org/anthology/
    L16-1689
 8. Neudecker, C., Antonacopoulos, A.: Making Europe’s historical newspapers search-
    able. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS). pp.
    405–410. IEEE, New York, NY, USA (April 2016), https://doi.org/10.1109/DAS.
    2016.83
 9. Rehm, G., Bourgonje, P., Hegele, S., Kintzel, F., Schneider, J.M., Ostendorff,
    M., Zaczynska, K., Berger, A., Grill, S., Räuchle, S., Rauenbusch, J., Ruten-
    burg, L., Schmidt, A., Wild, M., Hoffmann, H., Fink, J., Schulz, S., Seva, J.,
    Quantz, J., Böttger, J., Matthey, J., Fricke, R., Thomsen, J., Paschke, A., Qun-
    dus, J.A., Hoppe, T., Karam, N., Weichhardt, F., Fillies, C., Neudecker, C.,
    Gerber, M., Labusch, K., Rezanezhad, V., Schaefer, R., Zellhöfer, D., Siewert,
    D., Bunk, P., Pintscher, L., Aleynikova, E., Heine, F.: QURATOR: Innovative
    Technologies for Content and Data Curation. CoRR abs/2004.12195 (2020),
    https://arxiv.org/abs/2004.12195
10. Riedl, M., Padó, S.: A named entity recognition shootout for German. In: Pro-
    ceedings of ACL. pp. 120–125. Melbourne, Australia (2018), http://aclweb.org/
    anthology/P18-2020.pdf
11. Schweter, S., Baiter, J.: Towards robust named entity recognition for historic Ger-
    man. arXiv preprint arXiv:1906.07592 (2019), https://arxiv.org/abs/1906.07592
12. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task:
    Language-independent named entity recognition. In: Proceedings of the Seventh
    Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. pp.
    142–147. CONLL ’03, Association for Computational Linguistics, Stroudsburg, PA,
    USA (2003), https://doi.org/10.3115/1119176.1119195