<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named Entity Disambiguation and Linking on Historic Newspaper OCR with BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kai Labusch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Clemens Neudecker</string-name>
<email>clemens.neudecker@sbb.spk-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, 10785 Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we propose a named entity disambiguation and linking (NED, NEL) system that consists of three components: (i) lookup of possible candidates in an approximative nearest neighbour (ANN) index that stores BERT embeddings; (ii) evaluation of each candidate by comparison with text passages from Wikipedia, performed by a purpose-trained BERT model; (iii) final ranking of the candidates on the basis of the information gathered in the previous steps. We participated in the CLEF 2020 HIPE NERC-COARSE and NEL-LIT tasks for German, French, and English. The CLEF HIPE 2020 results show that our NEL approach is competitive in terms of precision but has low recall performance due to insufficient knowledge base coverage of the test data.</p>
      </abstract>
      <kwd-group>
<kwd>Named Entity Recognition</kwd>
        <kwd>Entity Linking</kwd>
        <kwd>BERT</kwd>
        <kwd>OCR</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Our participation in the CLEF HIPE 2020 NER-COARSE and NEL-LIT tasks (https://impresso.github.io/CLEF-HIPE-2020/) has been conducted as part of the Qurator project (https://qurator.ai) within the Berlin State Library (Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, SBB). One goal of the SBB in the Qurator project is the development of a system that identifies persons, locations and organizations within digitized historical text material obtained by Optical Character Recognition (OCR) and then links the recognized entities to their corresponding Wikidata-IDs. Here, we provide a high-level overview of the functionality of our system; for details, see the information provided together with the source code (https://github.com/qurator-spk/sbb_ned).</p>
<p>The paper is structured as follows: after a brief introduction of the background and use case, a short summary of the Named Entity Recognition system is provided in chapter 2. Chapter 3 outlines the Entity Linking approach in greater detail. Chapter 4 covers the chosen method for the evaluation of linking candidates, and chapter 5 continues with a description of their ranking. Following a discussion of the results obtained in the NER-COARSE and NEL-LIT tasks in chapter 6, we wrap up with some concluding remarks and potentials for further improvement in chapter 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
<p>The SBB is continuously digitizing its copyright-free holdings and making them publicly available online in various formats, for viewing and browsing (https://digital.staatsbibliothek-berlin.de) as well as for automated download (https://oai.sbb.berlin). As part of an on-going process, a growing amount of OCR-derived full-texts of the digitized printed material is provided in ALTO format (https://www.loc.gov/standards/alto/) for internal use cases such as full-text indexing and other information retrieval tasks.</p>
<p>With an increasing amount of digitized sources becoming available online, the need for automated ways of extracting additional information from these sources increases as well. Disciplines such as the Digital Humanities create use cases for text and data mining or for the semantic enrichment of the full-texts with, e.g., Named Entity Recognition and Linking (for instance for the reconstruction of historical social networks, https://sonar.fh-potsdam.de/).</p>
<p>Neural networks, whose popularity surged in the early 2010s, have helped address this need: they are not only capable of dealing with large amounts of data (i.e., big data), but also require enormous amounts of training data in order to produce high-quality results. However, due to the historical nature of the documents being digitized in libraries, standard methods and procedures from the NLP domain typically require additional adaptation in order to successfully deal with historical spelling variation and the remaining noise resulting from OCR errors.</p>
    </sec>
    <sec id="sec-3">
      <title>Qurator</title>
<p>The Qurator project [9], funded by the German Federal Ministry of Education and Research (BMBF) for a timeframe of three years (11/2018-10/2021), is based in the metropolitan region Berlin/Brandenburg. The consortium of ten project partners from research and industry combines vast expertise in areas such as Language and Knowledge Technologies, Artificial Intelligence and Machine Learning.</p>
<p>The project's main goal is the development of a sustainable technology platform that supports knowledge workers in various industries. The platform will simplify the curation of digital content and accelerate it dramatically. AI techniques are integrated into curation technologies and curation workflows in the form of domain-specific solutions covering the entire life cycle of content curation. The solutions being developed focus on curation services for the domains of culture, media, health and industry.</p>
<p>Within the Qurator consortium, the SBB is responsible for the task area "Curation Technologies for Digitized Cultural Heritage". The main goals of this task area lie in the development and adaptation of novel AI/ML-based approaches from the document analysis and NLP domains for the improvement of the quality of OCR full-texts and for the semantic enrichment of the derived full-texts with NER and NEL. The baseline for this development are the digitized collections of the SBB, with approximately 175,000 digitized documents (as of August 2020) from the timeframe 1400-1920. While most of the documents are in German, there is great variation, with many other European and also Asian languages being present in the collection. The collection comprises documents from a wide array of publication formats, including books, newspapers, journals, maps, letters, posters, and many more.</p>
    </sec>
    <sec id="sec-4">
      <title>HIPE</title>
      <p>The introduction of the CLEF HIPE 2020 shared task provided a welcome
opportunity to assess the performance of our own NER and NED systems in
comparison with others within the frame of a common and realistic benchmark setting.
HIPE proposes two tasks, NER and NEL, for French, German and English,
with OCRed historical newspapers as input. The SBB's digitization strategy
has traditionally put a strong focus on historic newspapers, with projects like
Europeana Newspapers[8] producing millions of pages of OCR from digitized
newspapers.</p>
<p>Recent years have also brought about the application of deep learning models for NER and NEL, and HIPE first puts these developments to the test on more challenging historical and noisy materials. We therefore expect that many valuable insights and directions for future work will result from participation in the HIPE shared task.</p>
      <sec id="sec-4-1">
        <title>Named Entity Recognition</title>
        <p>
          Before entity disambiguation starts, the input text is run through a named
entity recognition (NER) system that tags all person (PER), location (LOC) and
organization (ORG) entities. For the CLEF HIPE 2020 task, we used a BERT[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]-based NER system that was developed previously at SBB and is described in
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
<p>We employed our off-the-shelf system (https://github.com/qurator-spk/sbb_ner) and did not use the CLEF HIPE 2020 NER training data for fine-tuning. Our off-the-shelf system does not currently support product (PROD) and time (TIME) entities. The German NER system has been trained simultaneously on recent and historical German NER ground truth. In the case of French and English, we used our multilingual model, i.e., a single BERT model that was trained for NER on combined German, French, Dutch and English NER labeled data.</p>
        <p>
Starting from multilingual BERT-Base Cased, we applied unsupervised pre-training composed of the "Masked-LM" and "Next Sentence Prediction" tasks proposed by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], using 2,333,647 pages of unlabeled historical German text from the DC-SBB dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Furthermore, we performed supervised pre-training on NER ground truth using the Europeana Newspapers [7], CoNLL-2003 [12] and GermEval-2014 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] datasets.
        </p>
<p>In the corresponding cross-evaluation, we found that unsupervised pre-training on DC-SBB data worsens BERT performance for contemporary training/test pairs, while performance improves for most experiments that test on historical ground truth. The best performance of our model is achieved by combining pre-training on DC-SBB + GermEval + CoNLL; the results obtained with that combination are comparable to the state-of-the-art (see table 1). For a discussion of the performance of our NER system in the particular context of HIPE, please see chapter 6.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Entity Linking: Lookup of Candidates</title>
      </sec>
    </sec>
    <sec id="sec-5">
<title>Construction of the Knowledge Base for PER, LOC and ORG</title>
<p>Our entity linking and disambiguation works by comparison of continuous text snippets in which the entities in question are mentioned. A purpose-trained BERT model (the evaluation model) performs that text comparison task (see chapter 4). A knowledge base that contains only structured information, like Wikidata, is therefore not sufficient. Instead, we need additional continuous text where the entities that are part of the knowledge base are discussed, mentioned and referenced. Hence, we derive the knowledge base such that each entity in it has a corresponding Wikipedia page, since the Wikipedia articles contain continuous text that has been annotated by human authors with references that can serve as ground truth.</p>
<p>The knowledge base has been directly derived from Wikipedia through the identification of persons, locations and organizations within the German Wikipedia by recursive traversal of its category structure (see the sketch below):
- PER: All pages that are part of the categories "Frau" or "Mann" or of one of the reachable sub-categories of "Frau" and "Mann". One problem with this approach is that fictional "persons" are typically not contained in that selection.
- LOC: All pages that are part of the category "Geographisches Objekt" or one of its sub-categories. We exclude everything that is part of "Geographischer Begriff" or one of its sub-categories.
- ORG: All pages that are part of the category "Organisation" or one of its sub-categories.</p>
<p>Note: we plan to use the structured information of Wikidata in order to more reliably identify PER, LOC and ORG entities within Wikipedia, which should make this heuristic approach of knowledge base creation obsolete.</p>
<p>Some pages might end up in multiple entity classes at the same time due to the category structure of the German Wikipedia. In order to create disjoint entity classes, we first remove from the entity class ORG everything that is also included in PER or LOC. In a second step, we remove everything from the entity class LOC that is also part of PER or ORG. It has been pointed out by one of our reviewers that this step is not conceptually required by our approach, and it will actually become obsolete as soon as we identify PER, LOC and ORG entities on the basis of Wikidata.</p>
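<p>In terms of the page sets from the sketch above, these two steps are just set differences applied in order:</p>
<preformat>
# First remove from ORG everything also contained in PER or LOC,
# then remove from LOC everything also contained in PER or ORG.
ORG -= PER | LOC
LOC -= PER | ORG
</preformat>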
<p>To construct knowledge bases for French and English, we first map the identified German Wikipedia entity pages to their corresponding Wikidata-IDs and then map the Wikidata-IDs back to the corresponding French and English Wikipedia pages. Table 2 shows the size of the knowledge bases per category and language. Note that the knowledge bases for French and English are significantly smaller than the German one due to the loss of many entities in the Wikipedia-Wikidata mapping.</p>
<p>Table 2: Size of the knowledge bases per entity category and language; the last row gives the coverage of the CLEF HIPE 2020 test data by each knowledge base.
          DE      FR      EN
PER       671398  374048  136044
LOC       217383  155856  39305
ORG       324607  198570  58730
Coverage  71%     68%     47%</p>
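<p>A minimal sketch of the two-step mapping, assuming hypothetical lookup tables de_title_to_qid (German page title to Wikidata-ID) and qid_to_sitelinks (Wikidata-ID to per-language page titles), e.g. extracted from a Wikidata dump:</p>
<preformat>
def map_knowledge_base(de_pages, de_title_to_qid, qid_to_sitelinks, lang):
    """Map German Wikipedia entity pages to `lang` ('fr' or 'en') via Wikidata."""
    mapped, lost = {}, []
    for de_title in de_pages:
        qid = de_title_to_qid.get(de_title)               # step 1: page -> Wikidata-ID
        target = qid_to_sitelinks.get(qid, {}).get(lang)  # step 2: ID -> lang page
        if target is None:
            lost.append(de_title)  # no sitelink: entity is lost for this language
        else:
            mapped[de_title] = (qid, target)
    return mapped, lost
</preformat>
<p>The entities collected in the lost list correspond exactly to the missing sitelinks discussed next.</p>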
<p>What is the cause of that loss? We checked a random sample of entities of all types (PER, LOC, ORG) that had been lost in the mapping between the German and the French or English Wikipedia. In all cases, Wikidata actually did not contain a reference to an English or French version of the Wikipedia page. Hence, either there actually is no French or English version of that Wikipedia page available, or the correct linking has not been established so far. We expect to end up with much larger knowledge bases by using structured data from Wikidata for the identification of entities.</p>
<p>After the unmasked CLEF HIPE 2020 test data had been published, we computed the coverage of our per-language knowledge bases, i.e., the percentage of Wikidata entity IDs (NEL-LIT) in the test data that can actually be found in the corresponding knowledge base. That percentage is an upper bound on the system's performance. As can be seen in table 2, the coverage is similar for German and French (roughly 70%) whereas it is significantly worse for English (roughly 50%).</p>
    </sec>
    <sec id="sec-6">
      <title>Entity Lookup Index</title>
      <p>
        After the knowledge bases have been established, an entity lookup index is created for each of them by computing BERT embeddings of the page titles of the identified PER, LOC and ORG Wikipedia pages. The BERT embeddings are obtained from a combination of different layers of the evaluation model (see chapter 4). The embedding vectors of the tokens of the page titles are stored in an approximative nearest neighbour (ANN) index [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We use cosine similarity as distance measure, and the ANN index uses 100 random projection search trees. There are separate ANN indices per supported language and per supported entity category.
      </p>
<p>Given some NER-tagged surface form that is part of the input text, up to 400 linking candidates below a cut-off distance of 0.1 are selected by looking up the nearest neighbours of the surface form's embedding within the approximative nearest neighbour index of the corresponding language and entity category.</p>
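<p>A minimal sketch of index construction and candidate lookup using the Annoy library [2]; the embedding dimensionality (768, BERT-Base) and the variable names are assumptions, and Annoy's "angular" metric is its cosine-based distance:</p>
<preformat>
from annoy import AnnoyIndex

DIM = 768  # assumed: BERT-Base hidden size

index = AnnoyIndex(DIM, 'angular')  # angular = cosine-based distance
for item_id, embedding in enumerate(title_token_embeddings):
    index.add_item(item_id, embedding)
index.build(100)  # 100 random projection search trees, as described above

# Lookup: up to 400 candidates below the cut-off distance of 0.1.
ids, dists = index.get_nns_by_vector(surface_embedding, 400,
                                     include_distances=True)
candidates = [i for i, d in zip(ids, dists) if d < 0.1]
</preformat>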
<p>According to our observations, the performance of our system improves with the number of candidates considered. Of course there is some upper limit to that; more important, however, is the computational complexity, which grows with the number of linking candidates. We did not systematically evaluate the effect of the number of linking candidates, but we used a number of linking candidates that is sufficiently high. Note that there is also an interaction with the cut-off distance, since in many cases there are fewer than 400 nearest neighbours within a distance of less than 0.1.</p>
      <sec id="sec-6-1">
        <title>Evaluation of Candidates</title>
<p>For each entity of the knowledge bases (see chapter 3.1) there are text passages in Wikipedia where some human Wikipedia editor has linked to that particular entity. How many linked text passages we have for a particular entity differs widely depending on the entity: some entities have thousands of links available whereas other entities have only very few.</p>
        <p>We created a SQLITE database that provides quick access to the mentions of a particular entity. Using the Wikipedia page title of the entity as key, for instance "Georg Christoph Lichtenberg", the database returns all sentences where some human editor explicitly linked to "Georg Christoph Lichtenberg". The database can be derived programmatically from Wikipedia without any human annotation being involved. Table 3 gives a short description of the structure of the SQLITE database.</p>
        <p>Table 3: Structure of the SQLITE database.
SQL table "sentences":
id: A unique number that identifies each sentence.
text: A JSON array that contains the tokens of the sentence. Example: ["Der", "Begriff", "wurde", "von", "Georg", "Christoph", "Lichtenberg", "eingebracht", "."]
entities: A JSON array of the same length as "text" that contains for each token the target entity if the token is part of a Wikipedia link created by some Wikipedia author; if a token is not part of a Wikipedia link, its corresponding entity is empty. Example: ["", "", "", "", "Georg Christoph Lichtenberg", "Georg Christoph Lichtenberg", "Georg Christoph Lichtenberg", "", ""]
SQL table "links":
id: A unique number that identifies each Wikipedia entity reference.
target: The target entity of the reference. Example: "Georg Christoph Lichtenberg"
sentence: The sentence-id of the sentence where the reference occurs (sentences.id).</p>
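<p>A minimal sketch of the schema from Table 3 and of the mention lookup, using Python's built-in sqlite3 module; the column types and the helper name are assumptions:</p>
<preformat>
import json
import sqlite3

con = sqlite3.connect("wikipedia.sqlite3")

# Schema as described in Table 3 (column types are assumed).
con.executescript("""
CREATE TABLE IF NOT EXISTS sentences (
    id       INTEGER PRIMARY KEY,  -- unique sentence number
    text     TEXT,                 -- JSON array of tokens
    entities TEXT                  -- JSON array: per-token target entity or ""
);
CREATE TABLE IF NOT EXISTS links (
    id       INTEGER PRIMARY KEY,  -- unique reference number
    target   TEXT,                 -- target entity of the reference
    sentence INTEGER REFERENCES sentences(id)
);
""")

def mentions(entity):
    """All sentences in which a Wikipedia author linked to `entity`."""
    rows = con.execute(
        "SELECT s.text FROM links AS l JOIN sentences AS s ON s.id = l.sentence"
        " WHERE l.target = ?", (entity,))
    return [json.loads(text) for (text,) in rows]

# e.g. mentions("Georg Christoph Lichtenberg")
</preformat>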
<p>Using that database, we created a training dataset that consists of random sentence pairs (A, B), where the sentences either reference the same entity or different entities. That training dataset defines a binary classification problem: do sentences A and B refer to the same item or not?</p>
<p>For each supported language, we trained a BERT model on this binary classification problem; we call it the "evaluation model" in the following. Given some arbitrary sentence pair (A, B), the evaluation model outputs the probability that the two sentences refer to the same item.</p>
<p>During entity disambiguation, we build up to 50 sentence pairs (A, B) for each candidate that has been found in the lookup step (see chapter 3.2). The sentence pairs are composed in such a way that sentence A is a part of the input text where the entity that is to be linked is mentioned, and sentence B is a sentence from Wikipedia where that particular candidate has been linked to by a Wikipedia author. The higher the number of evaluated sentence pairs per candidate, the more reliably the ranking model (see Section 5) can determine the overall matching probability. Again, the computational complexity increases with the number of sentence pairs. Additionally, in most cases there is only a very limited number of reference sentences from Wikipedia available, so that it is not possible to generate a large number of unique sentence pairs. The choice of 50 sentence pairs is a trade-off that takes these considerations into account.</p>
<p>Application of the evaluation model to each sentence pair results in a corresponding matching probability. The sets of sentence pair matching probabilities of all candidates are then further processed by the ranking model (see Section 5).</p>
      </sec>
      <sec id="sec-6-2">
        <title>Ranking of Candidates</title>
        <p>During previous steps, sets of possible entity candidates have been obtained for
all the parts of the input text that have been NER-tagged. For each candidate, a
number of sentence pairs have been examined by the evaluation model, resulting
in a set of sentence pair probabilities per candidate.</p>
<p>The ranking step finally determines an ordering of the candidates per linked entity according to the probability that the candidate is the "correct" entity the part of the input text is actually referring to.</p>
        <p>We compute statistical features of the sets of sentence pair probabilities of
the candidates, among them: mean, median, min, max, standard deviation as
well as various quantiles. Additionally we sort all the sentence pair probabilities
and compute ranking statistics over all the candidates.</p>
<p>Then, based on the statistical features that describe the set of sentence pair probabilities of each candidate, a random forest model computes the overall probability that some particular candidate is actually the "correct" corresponding entity. The random forest model is the only component of our system where the CLEF HIPE 2020 data was used for training.</p>
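<p>A minimal sketch of the feature computation and ranking using scikit-learn; the particular quantiles, the forest hyperparameters and the data layout are assumptions, not our tuned configuration:</p>
<preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

QUANTILES = [0.1, 0.25, 0.75, 0.9]  # assumed choice of "various quantiles"

def candidate_features(pair_probs):
    """Statistical features over one candidate's sentence pair probabilities.
    (The additional per-candidate ranking statistics mentioned above are omitted.)"""
    p = np.asarray(pair_probs)
    return np.concatenate(([p.mean(), np.median(p), p.min(), p.max(), p.std()],
                           np.quantile(p, QUANTILES)))

def train_ranker(prob_sets, labels):
    """Fit the random forest; this is the only component of the system
    that is trained on CLEF HIPE 2020 data."""
    X = np.array([candidate_features(p) for p in prob_sets])
    ranker = RandomForestClassifier(n_estimators=100)  # hyperparameters assumed
    return ranker.fit(X, labels)

def rank_candidates(ranker, candidates, prob_sets, cutoff=0.2):
    """Sort candidates by estimated matching probability, drop those below 0.2."""
    X = np.array([candidate_features(p) for p in prob_sets])
    probs = ranker.predict_proba(X)[:, 1]
    ranked = sorted(zip(candidates, probs), key=lambda t: t[1], reverse=True)
    return [(c, p) for c, p in ranked if p >= cutoff]
</preformat>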
<p>Finally, the candidates are sorted according to the overall matching probabilities that have been estimated by the random forest model. The final output of our NED system is the sorted list of candidates, where candidates that have a matching probability of less than 0.2 are cut off.</p>
<p>Our NED system does not implement the NIL entity. That means it either returns a non-empty list of Wikidata IDs sorted in descending order according to their overall matching probabilities, or the result is "-" if there is no candidate with a matching probability above 0.2.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Results</title>
        <p>
Table 4 shows the NE-COARSE-LIT results of the SBB system; it also contains the results of the best performing system (L3i). In the case of the SBB system, strict NER performance is significantly worse than fuzzy NER performance. That observation holds for the L3i system too; however, for our system the effect is much more pronounced. Strict NER is a much more demanding task; nevertheless, we partly attribute the difference in performance to the training data of our NER system (see [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]), which has been created according to multiple slightly different NER annotation standards, and also to the fact that we did not fine-tune the NER system using the training data provided by the CLEF HIPE 2020 task organizers.
        </p>
<p>According to our observations, the OCR quality of the French data is slightly better than that of the German data, and both French and German have better OCR quality than the English text material. By OCR quality, we primarily mean the overall quality of the entire text, not the mean Levenshtein distances of the entity text passages with respect to the original text. NER performance mirrors that observation, i.e., French and German are comparable whereas English is significantly worse. Our current hypothesis is therefore that these differences are partly caused by the sensitivity of our NER tagger to OCR noise within the surrounding text.</p>
<p>Table 4: NE-COARSE-LIT results (micro, label ALL) of the SBB system and of the best performing system (L3i).
Lang  Team  Evaluation                   P      R      F1
DE    L3i   NE-COARSE-LIT-micro-fuzzy   0.870  0.886  0.878
DE    SBB   NE-COARSE-LIT-micro-fuzzy   0.730  0.708  0.719
DE    L3i   NE-COARSE-LIT-micro-strict  0.790  0.805  0.797
DE    SBB   NE-COARSE-LIT-micro-strict  0.499  0.484  0.491
FR    L3i   NE-COARSE-LIT-micro-fuzzy   0.912  0.931  0.921
FR    SBB   NE-COARSE-LIT-micro-fuzzy   0.765  0.689  0.725
FR    L3i   NE-COARSE-LIT-micro-strict  0.831  0.849  0.840
FR    SBB   NE-COARSE-LIT-micro-strict  0.530  0.477  0.502
EN    L3i   NE-COARSE-LIT-micro-fuzzy   0.794  0.817  0.806
EN    SBB   NE-COARSE-LIT-micro-fuzzy   0.642  0.572  0.605
EN    L3i   NE-COARSE-LIT-micro-strict  0.623  0.641  0.632
EN    SBB   NE-COARSE-LIT-micro-strict  0.347  0.310  0.327</p>
<p>Table 5 shows the NEL performance of our system when our BERT-based NER tagging is used as input, whereas table 6 contains the results obtained when the NER ground truth was provided to the NEL system. The two tables show that NEL performance significantly improves if NER ground truth is provided.</p>
<p>Interestingly, the recall of the French and German NEL systems is similar although the French knowledge base is significantly smaller than the German one. This observation can be explained by the fact that the coverage of the test data by the knowledge bases for German and French is similar (see Table 2). We attribute the much lower recall for the English test data to the much lower coverage of the test data by the English knowledge base (see Table 2).</p>
<p>Precision of the German and French SBB systems is comparable; again, precision of the English system is significantly worse, even if NER ground truth is provided. As explained in Section 5, our system provides a list of candidates with matching probability above 0.2, sorted in descending order according to the matching probability. Hence, given a bad coverage of the knowledge base, as is the case for English, non-matching candidates will inevitably move up in that sorted list, i.e., the drop in precision can also be explained by the bad coverage of the knowledge base.</p>
<p>Table 5: NEL-LIT results (micro, label ALL) of the SBB system with our own NER tagging as input.
Lang  Evaluation                       P      R      F1
DE    NEL-LIT-micro-fuzzy-@1          0.540  0.304  0.389
DE    NEL-LIT-micro-fuzzy-relaxed-@1  0.561  0.315  0.403
DE    NEL-LIT-micro-fuzzy-relaxed-@3  0.590  0.332  0.425
DE    NEL-LIT-micro-fuzzy-relaxed-@5  0.601  0.338  0.432
FR    NEL-LIT-micro-fuzzy-@1          0.594  0.310  0.407
FR    NEL-LIT-micro-fuzzy-relaxed-@1  0.616  0.321  0.422
FR    NEL-LIT-micro-fuzzy-relaxed-@3  0.624  0.325  0.428
FR    NEL-LIT-micro-fuzzy-relaxed-@5  0.629  0.328  0.431
EN    NEL-LIT-micro-fuzzy-@1          0.257  0.097  0.141
EN    NEL-LIT-micro-fuzzy-relaxed-@1  0.257  0.097  0.141
EN    NEL-LIT-micro-fuzzy-relaxed-@3  0.299  0.112  0.163
EN    NEL-LIT-micro-fuzzy-relaxed-@5  0.299  0.112  0.163</p>
<p>Table 6: NEL-LIT results (micro, label ALL) of the SBB system with NER ground truth provided as input.
Lang  Evaluation                       P      R      F1
DE    NEL-LIT-micro-fuzzy-@1          0.615  0.349  0.445
DE    NEL-LIT-micro-fuzzy-relaxed-@1  0.636  0.361  0.461
DE    NEL-LIT-micro-fuzzy-relaxed-@3  0.673  0.382  0.488
DE    NEL-LIT-micro-fuzzy-relaxed-@5  0.686  0.389  0.497
FR    NEL-LIT-micro-fuzzy-@1          0.677  0.371  0.480
FR    NEL-LIT-micro-fuzzy-relaxed-@1  0.699  0.383  0.495
FR    NEL-LIT-micro-fuzzy-relaxed-@3  0.710  0.390  0.503
FR    NEL-LIT-micro-fuzzy-relaxed-@5  0.716  0.393  0.507
EN    NEL-LIT-micro-fuzzy-@1          0.344  0.119  0.177
EN    NEL-LIT-micro-fuzzy-relaxed-@1  0.344  0.119  0.177
EN    NEL-LIT-micro-fuzzy-relaxed-@3  0.390  0.135  0.200
EN    NEL-LIT-micro-fuzzy-relaxed-@5  0.390  0.135  0.200</p>
        <p>Table 7 reports on the best NEL-LIT results per team where the NER task has been performed by each team's own NER system. Table 8 reports on the best NEL-LIT results per team where the NER ground truth has been provided to each team. In both tables, the results have been sorted according to precision. It turns out that our SBB NEL system performed quite competitively in terms of precision but rather abysmally in terms of recall.</p>
        <p>Table 7: Best NEL-LIT-micro-fuzzy-relaxed-@5 results (label ALL) per team, using each team's own NER system.
Lang  Team      P      R      F1
DE    L3i       0.627  0.636  0.632
DE    SBB       0.601  0.338  0.432
DE    UvA.ILPS  0.311  0.345  0.327
FR    L3i       0.695  0.705  0.700
FR    SBB       0.629  0.328  0.431
FR    IRISA     0.560  0.490  0.523
FR    UvA.ILPS  0.397  0.220  0.283
FR    ERTIM     0.150  0.084  0.108
EN    L3i       0.651  0.674  0.662
EN    UvA.ILPS  0.304  0.458  0.366
EN    SBB       0.299  0.112  0.163</p>
        <p>Table 8: Best NEL-LIT-micro-fuzzy-relaxed-@5 results (label ALL) per team, with NER ground truth provided.
Lang  Team                P      R      F1
DE    L3i                 0.696  0.696  0.696
DE    SBB                 0.686  0.389  0.497
DE    aidalight-baseline  0.440  0.435  0.437
FR    L3i                 0.746  0.743  0.744
FR    SBB                 0.716  0.393  0.507
FR    Inria-DeLFT         0.604  0.670  0.635
FR    IRISA               0.590  0.588  0.589
FR    aidalight-baseline  0.516  0.508  0.512
EN    L3i                 0.744  0.744  0.744
EN    Inria-DeLFT         0.633  0.685  0.658
EN    UvA.ILPS            0.607  0.580  0.593
EN    aidalight-baseline  0.506  0.506  0.506
EN    SBB                 0.390  0.135  0.200</p>
        <p>We attribute the bad recall performance to multiple reasons:
- Due to the construction of the knowledge bases, many entities end up without representation. Even for German, which has the best coverage, coverage is only 71%.
- The lookup step of our NEL system has not been extensively optimized up to now. The embeddings that are stored in the approximative nearest neighbour indices, for instance, have been selected on an initial-guess basis and have not been optimized for performance; which layers of the model to use and how to combine them heavily impacts the properties of the lookup step. Additionally, the parameters of the approximative nearest neighbour indices, such as the type of similarity measure, the number of lookup trees and the cut-off distance, have been chosen on an initial-guess basis too and could be further optimized.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Conclusion</title>
        <p>The results of our participation in the HIPE task highlight where the biggest
potential for improvement of our NER / NEL / NED system is to be expected:</p>
        <p>
- OCR performance is crucial since OCR is the start of the processing chain, and OCR noise causes, as expected, bad results in all subsequent processing steps.
- The NEL recall performance of the SBB system has the biggest potential for improvement. An obvious path towards better recall is an improved construction of the knowledge bases, which should lead to a better overall representation of entities.
- An extensive evaluation and optimization of the lookup step, including hardening against OCR noise, could improve recall.
- The NER results of other teams show that huge improvements in terms of NER performance are possible even in the presence of noise [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Such improvements directly benefit the NED/NEL steps. We will therefore carefully evaluate how these improvements have been achieved in order to optimize our own NER tagger.
        </p>
<p>Due to the diverse nature of the CLEF HIPE 2020 task data, and in particular due to the differences in OCR quality, the performance evaluation has given us valuable insights into our NER/NED/NEL system. The HIPE task data is, in our opinion, quite realistic, which means that we expect our system to have to handle similar data in the real world. Hence, we consider our participation in the HIPE competition an important and constructive step on the path towards improving NER/NED processing of real-world text material that has been obtained by OCR of historical documents.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Benikova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biemann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kisselew</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Padó</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Germeval 2014 named entity recognition: Companion paper</article-title>
          .
          <source>Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition</source>
          , Hildesheim, Germany, pp. 104-112 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bernhardsson</surname>
          </string-name>
          , E.: Annoy: Approximate Nearest Neighbors in C++/Python (
          <year>2018</year>
          ), https://github.com/spotify/annoy
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . CoRR abs/1810.04805 (
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ehrmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romanello</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bircher</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clematide</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Introducing the CLEF 2020 HIPE shared task: Named entity recognition and linking on historical newspapers</article-title>
          . In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.)
          <source>Advances in Information Retrieval</source>
          , pp. 524-532. Springer International Publishing, Cham (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Labusch</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neudecker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , Zellhofer, D.:
          <article-title>BERT for Named Entity Recognition in Contemporary and Historic German</article-title>
          .
          <source>In: Proceedings of the 15th Conference on Natural Language Processing (KONVENS</source>
          <year>2019</year>
          )
          <article-title>: Long Papers</article-title>
          . p.
          <volume>1</volume>
          {
          <issue>9</issue>
          . German Society for Computational Linguistics &amp; Language
          <string-name>
            <surname>Technology</surname>
          </string-name>
          , Erlangen, Germany (
          <year>2019</year>
          ), https://corpora.linguistik.uni-erlangen.de/data/konvens/ proceedings/papers/KONVENS2019 paper 4.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Labusch</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , Zellhofer, D.:
          <article-title>OCR Fulltexts of the Digital Collections of the Berlin State Library (DC-SBB) (June 26th</article-title>
          <year>2019</year>
          ), https://doi.org/10.5281/zenodo. 3257041
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Neudecker, C.: An open corpus for named entity recognition in historic newspapers. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 4348-4352. European Language Resources Association (ELRA), Portoroz, Slovenia (May 2016), https://www.aclweb.org/anthology/L16-1689</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Neudecker, C., Antonacopoulos, A.: Making Europe's historical newspapers searchable. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 405-410. IEEE, New York, NY, USA (April 2016), https://doi.org/10.1109/DAS.2016.83</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Rehm, G., Bourgonje, P., Hegele, S., Kintzel, F., Schneider, J.M., Ostendorff, M., Zaczynska, K., Berger, A., Grill, S., Räuchle, S., Rauenbusch, J., Rutenburg, L., Schmidt, A., Wild, M., Hoffmann, H., Fink, J., Schulz, S., Ševa, J., Quantz, J., Böttger, J., Matthey, J., Fricke, R., Thomsen, J., Paschke, A., Qundus, J.A., Hoppe, T., Karam, N., Weichhardt, F., Fillies, C., Neudecker, C., Gerber, M., Labusch, K., Rezanezhad, V., Schaefer, R., Zellhöfer, D., Siewert, D., Bunk, P., Pintscher, L., Aleynikova, E., Heine, F.: QURATOR: Innovative Technologies for Content and Data Curation. CoRR abs/2004.12195 (2020), https://arxiv.org/abs/2004.12195</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Riedl, M., Padó, S.: A named entity recognition shootout for German. In: Proceedings of ACL, pp. 120-125. Melbourne, Australia (2018), http://aclweb.org/anthology/P18-2020.pdf</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Schweter, S., Baiter, J.: Towards robust named entity recognition for historic German. arXiv preprint arXiv:1906.07592 (2019), https://arxiv.org/abs/1906.07592</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pp. 142-147. CONLL '03, Association for Computational Linguistics, Stroudsburg, PA, USA (2003), https://doi.org/10.3115/1119176.1119195</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>