<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named Entity Disambiguation and Linking on Historic Newspaper OCR with BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kai Labusch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Clemens Neudecker</string-name>
<email>clemens.neudecker@sbb.spk-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, 10785 Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we propose a named entity disambiguation and linking (NED, NEL) system that consists of three components: (i) lookup of possible candidates in an approximative nearest neighbour (ANN) index that stores BERT embeddings; (ii) evaluation of each candidate by comparison with text passages from Wikipedia, performed by a purpose-trained BERT model; (iii) final ranking of the candidates on the basis of the information gathered in the previous steps. We participated in the CLEF 2020 HIPE NERC-COARSE and NEL-LIT tasks for German, French, and English. The CLEF HIPE 2020 results show that our NEL approach is competitive in terms of precision but has low recall performance due to insufficient knowledge base coverage of the test data.</p>
      </abstract>
      <kwd-group>
<kwd>Named Entity Recognition</kwd>
        <kwd>Entity Linking</kwd>
        <kwd>BERT</kwd>
        <kwd>OCR</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Our participation in the CLEF HIPE 2020 NER-COARSE and NEL-LIT tasks (https://impresso.github.io/CLEF-HIPE-2020/) has been conducted as part of the Qurator project (https://qurator.ai) within the Berlin State Library (Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, SBB). One goal of the SBB in the Qurator project is the development of a system that identifies persons, locations and organizations within digitized historical text material obtained by Optical Character Recognition (OCR) and then links the recognized entities to their corresponding Wikidata-IDs. Here, we provide a high-level overview of the functionality of our system; for details, see the information provided together with the source code (https://github.com/qurator-spk/sbb_ned).</p>
<p>The paper is structured as follows: after a brief introduction of the background and use case, a short summary of the Named Entity Recognition system is provided in chapter 2. Chapter 3 outlines the Entity Linking approach in greater detail. Chapter 4 covers the chosen method for the evaluation of linking candidates, and chapter 5 continues with a description of their ranking. Following a discussion of the results obtained in the NER-COARSE and NEL-LIT tasks in chapter 6, we wrap up with some concluding remarks and potentials for further improvement in chapter 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
<p>The SBB is continuously digitizing its copyright-free holdings and making them publicly available online in various formats, for viewing and browsing (https://digital.staatsbibliothek-berlin.de) as well as for automated download (https://oai.sbb.berlin). As part of an on-going process, a growing amount of OCR-derived full-texts of the digitized printed material is provided in ALTO format (https://www.loc.gov/standards/alto/) for internal use cases such as full-text indexing and other information retrieval tasks.</p>
<p>With an increasing amount of digitized sources becoming available online, the need for automated ways of extracting additional information from these sources increases as well. Disciplines such as the Digital Humanities create use cases for text and data mining or for the semantic enrichment of the full-texts with, e.g., Named Entity Recognition and Linking (for instance for the reconstruction of historical social networks, https://sonar.fh-potsdam.de/).</p>
<p>Neural networks, whose popularity surged in the early 2010s, have helped address this need: they are not only capable of dealing with large amounts of data (i.e., big data), but also require enormous amounts of training data in order to produce high-quality results. However, due to the historical nature of the documents being digitized in libraries, standard methods and procedures from the NLP domain typically require additional adaptation in order to successfully deal with historical spelling variation and the remaining noise resulting from OCR errors.</p>
    </sec>
    <sec id="sec-3">
      <title>Qurator</title>
<p>The Qurator project [9], funded by the German Federal Ministry of Education and Research (BMBF) for a timeframe of three years (11/2018-10/2021), is based in the metropolitan region Berlin/Brandenburg. The consortium of ten project partners from research and industry combines vast expertise in areas such as Language and Knowledge Technologies, Artificial Intelligence and Machine Learning.</p>
<p>The project's main goal is the development of a sustainable technology platform that supports knowledge workers in various industries. The platform will simplify the curation of digital content and accelerate it dramatically. AI techniques are integrated into curation technologies and curation workflows in the form of domain-specific solutions covering the entire life cycle of content curation. The solutions being developed focus on curation services for the domains of culture, media, health and industry.</p>
<p>Within the Qurator consortium, the SBB is responsible for the task area "Curation Technologies for Digitized Cultural Heritage". The main goals of this task area lie in the development and adaptation of novel AI/ML-based approaches from the document analysis and NLP domains for the improvement of the quality of OCR full-texts and for the semantic enrichment of the derived full-texts with NER and NEL. The baseline for this development are the digitized collections of the SBB, with approximately 175,000 digitized documents (as of August 2020) from the timeframe 1400-1920. While most of the documents are in German, there is great variation, with many other European and also Asian languages being present in the collection. The collection comprises documents from a wide array of publication formats, including books, newspapers, journals, maps, letters, posters, and many more.</p>
    </sec>
    <sec id="sec-4">
      <title>HIPE</title>
      <p>The introduction of the CLEF HIPE 2020 shared task provided a welcome
opportunity to assess the performance of our own NER and NED systems in
comparison with others within the frame of a common and realistic benchmark setting.
HIPE proposes two tasks, NER and NEL, for French, German and English,
with OCRed historical newspapers as input. The SBB's digitization strategy
has traditionally put a strong focus on historic newspapers, with projects like
Europeana Newspapers[8] producing millions of pages of OCR from digitized
newspapers.</p>
<p>Recent years have also brought about the application of deep learning models for NER and NEL, and HIPE first puts these developments to the test on more challenging historical and noisy materials. We therefore expect that many valuable insights and directions for future work will result from participation in the HIPE shared task.</p>
      <sec id="sec-4-1">
        <title>Named Entity Recognition</title>
        <p>
          Before entity disambiguation starts, the input text is run through a named
entity recognition (NER) system that tags all person (PER), location (LOC) and
organization (ORG) entities. For the CLEF HIPE 2020 task, we used a BERT[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]-based NER system that was developed previously at SBB and is described in
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
<p>We employed our off-the-shelf system (https://github.com/qurator-spk/sbb_ner) and did not use the CLEF HIPE 2020 NER training data for fine-tuning. Our off-the-shelf system does not currently support product (PROD) and time (TIME) entities. The German NER system has been trained simultaneously on recent and historical German NER ground truth. In the case of French and English, we used our multilingual model, i.e., a single BERT model that was trained for NER on combined German, French, Dutch and English NER labeled data.</p>
        <p>
Starting from multilingual BERT-Base Cased, we applied unsupervised pre-training composed of the "Masked-LM" and "Next Sentence Prediction" tasks proposed by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], using 2,333,647 pages of unlabeled historical German text from the DC-SBB dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Furthermore, we performed supervised pre-training on NER ground truth using the Europeana Newspapers [7], CoNLL-2003 [12] and GermEval-2014 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] datasets.
        </p>
<p>In the corresponding cross-evaluation, we found that unsupervised pre-training on DC-SBB data worsens BERT performance for contemporary training/test pairs, while performance improves for most experiments that test on historical ground truth. The best performance of our model is achieved by combining pre-training on DC-SBB + GermEval + CoNLL; the results obtained with that combination are comparable to the state-of-the-art (see table 1). For a discussion of the performance of our NER system in the particular context of HIPE, please see chapter 6.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Entity Linking: Lookup of Candidates</title>
      </sec>
    </sec>
    <sec id="sec-5">
<title>Construction of the Knowledge Base for PER, LOC and ORG</title>
<p>Our entity linking and disambiguation works by comparison of continuous text snippets in which the entities in question are mentioned. A purpose-trained BERT model (the evaluation model) performs that text comparison task (see chapter 4). A knowledge base that contains only structured information, like Wikidata, is therefore not sufficient. Instead, we need additional continuous text where the entities that are part of the knowledge base are discussed, mentioned and referenced. Hence, we derive the knowledge base such that each entity in it has a corresponding Wikipedia page, since the Wikipedia articles contain continuous text that has been annotated by human authors with references that can serve as ground truth.</p>
<p>The knowledge base has been directly derived from Wikipedia through the identification of persons, locations and organizations within the German Wikipedia by recursive traversal of its category structure (see the sketch below):
- PER: All pages that are part of the categories "Frau" or "Mann" or of one of the reachable sub-categories of "Frau" and "Mann". One problem with this approach is that fictional "persons" are typically not contained in that selection.
- LOC: All pages that are part of the category "Geographisches Objekt" or one of its sub-categories. We exclude everything that is part of "Geographischer Begriff" or one of its sub-categories.
- ORG: All pages that are part of the category "Organisation" or one of its sub-categories.</p>
<p>Note: we plan to use the structured information of Wikidata in order to more reliably identify PER, LOC and ORG entities within Wikipedia, which should make this heuristic approach of knowledge base creation obsolete.</p>
<p>Some pages might end up in multiple entity classes at the same time due to the category structure of the German Wikipedia. In order to create disjoint entity classes, we first remove from the entity class ORG everything that is also included in PER or LOC. In a second step, we remove everything from the entity class LOC that is also part of PER or ORG. It has been pointed out by one of our reviewers that this step is not conceptually required by our approach, and it will actually become obsolete as soon as we identify PER, LOC and ORG entities on the basis of Wikidata.</p>
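<p>In terms of the page sets from the sketch above, these two steps are just set differences applied in order:</p>
<preformat>
# First remove from ORG everything also contained in PER or LOC,
# then remove from LOC everything also contained in PER or ORG.
ORG -= PER | LOC
LOC -= PER | ORG
</preformat>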
<p>To construct knowledge bases for French and English, we first map the identified German Wikipedia entity pages to their corresponding Wikidata-IDs and then map the Wikidata-IDs back to the corresponding French and English Wikipedia pages. Table 2 shows the size of the knowledge bases per category and language. Note that the knowledge bases for French and English are significantly smaller than the German one due to the loss of many entities in the Wikipedia-Wikidata mapping.</p>
<p>Table 2: Size of the knowledge bases per entity category and language; the last row gives the coverage of the CLEF HIPE 2020 test data by each knowledge base.
          DE      FR      EN
PER       671398  374048  136044
LOC       217383  155856  39305
ORG       324607  198570  58730
Coverage  71%     68%     47%</p>
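<p>A minimal sketch of the two-step mapping, assuming hypothetical lookup tables de_title_to_qid (German page title to Wikidata-ID) and qid_to_sitelinks (Wikidata-ID to per-language page titles), e.g. extracted from a Wikidata dump:</p>
<preformat>
def map_knowledge_base(de_pages, de_title_to_qid, qid_to_sitelinks, lang):
    """Map German Wikipedia entity pages to `lang` ('fr' or 'en') via Wikidata."""
    mapped, lost = {}, []
    for de_title in de_pages:
        qid = de_title_to_qid.get(de_title)               # step 1: page -> Wikidata-ID
        target = qid_to_sitelinks.get(qid, {}).get(lang)  # step 2: ID -> lang page
        if target is None:
            lost.append(de_title)  # no sitelink: entity is lost for this language
        else:
            mapped[de_title] = (qid, target)
    return mapped, lost
</preformat>
<p>The entities collected in the lost list correspond exactly to the missing sitelinks discussed next.</p>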
<p>What is the cause of that loss? We checked a random sample of entities of all types (PER, LOC, ORG) that had been lost in the mapping between the German and the French or English Wikipedia. In all cases, Wikidata actually did not contain a reference to an English or French version of the Wikipedia page. Hence, either there actually is no French or English version of that Wikipedia page available, or the correct linking has not been established so far. We expect to end up with much larger knowledge bases by using structured data from Wikidata for the identification of entities.</p>
<p>After the unmasked CLEF HIPE 2020 test data had been published, we computed the coverage of our per-language knowledge bases, i.e., the percentage of Wikidata entity IDs (NEL-LIT) in the test data that can actually be found in the corresponding knowledge base. That percentage is an upper bound on the system's performance. As can be seen in table 2, the coverage is similar for German and French (roughly 70%) whereas it is significantly worse for English (roughly 50%).</p>
    </sec>
    <sec id="sec-6">
      <title>Entity Lookup Index</title>
      <p>
        After the knowledge bases have been established, an entity lookup index is created for each of them by computing BERT embeddings of the page titles of the identified PER, LOC and ORG Wikipedia pages. The BERT embeddings are obtained from a combination of different layers of the evaluation model (see chapter 4). The embedding vectors of the tokens of the page titles are stored in an approximative nearest neighbour (ANN) index [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We use cosine similarity as distance measure, and the ANN index uses 100 random projection search trees. There are separate ANN indices per supported language and per supported entity category.
      </p>
<p>Given some NER-tagged surface form that is part of the input text, up to 400 linking candidates below a cut-off distance of 0.1 are selected by looking up the nearest neighbours of the surface form's embedding within the approximative nearest neighbour index of the corresponding language and entity category.</p>
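<p>A minimal sketch of index construction and candidate lookup using the Annoy library [2]; the embedding dimensionality (768, BERT-Base) and the variable names are assumptions, and Annoy's "angular" metric is its cosine-based distance:</p>
<preformat>
from annoy import AnnoyIndex

DIM = 768  # assumed: BERT-Base hidden size

index = AnnoyIndex(DIM, 'angular')  # angular = cosine-based distance
for item_id, embedding in enumerate(title_token_embeddings):
    index.add_item(item_id, embedding)
index.build(100)  # 100 random projection search trees, as described above

# Lookup: up to 400 candidates below the cut-off distance of 0.1.
ids, dists = index.get_nns_by_vector(surface_embedding, 400,
                                     include_distances=True)
candidates = [i for i, d in zip(ids, dists) if d < 0.1]
</preformat>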
<p>According to our observations, the performance of our system improves with the number of candidates considered. Of course there is some upper limit to that; more important, however, is the computational complexity, which grows with the number of linking candidates. We did not systematically evaluate the effect of the number of linking candidates, but we used a number of linking candidates that is sufficiently high. Note that there is also an interaction with the cut-off distance, since in many cases there are fewer than 400 nearest neighbours within a distance of less than 0.1.</p>
      <sec id="sec-6-1">
        <title>Evaluation of Candidates</title>
<p>For each entity of the knowledge bases (see chapter 3.1) there are text passages in Wikipedia where some human Wikipedia editor has linked to that particular entity. How many linked text passages we have for a particular entity differs widely depending on the entity: some entities have thousands of links available whereas other entities have only very few.</p>
        <p>We created a SQLITE database that provides quick access to the mentions of a particular entity. Using the Wikipedia page title of the entity as key, for instance "Georg Christoph Lichtenberg", the database returns all sentences where some human editor explicitly linked to "Georg Christoph Lichtenberg". The database can be derived programmatically from Wikipedia without any human annotation being involved. Table 3 gives a short description of the structure of the SQLITE database.</p>
        <p>Table 3: Structure of the SQLITE database.
SQL table "sentences":
id: A unique number that identifies each sentence.
text: A JSON array that contains the tokens of the sentence. Example: ["Der", "Begriff", "wurde", "von", "Georg", "Christoph", "Lichtenberg", "eingebracht", "."]
entities: A JSON array of the same length as "text" that contains for each token the target entity if the token is part of a Wikipedia link created by some Wikipedia author; if a token is not part of a Wikipedia link, its corresponding entity is empty. Example: ["", "", "", "", "Georg Christoph Lichtenberg", "Georg Christoph Lichtenberg", "Georg Christoph Lichtenberg", "", ""]
SQL table "links":
id: A unique number that identifies each Wikipedia entity reference.
target: The target entity of the reference. Example: "Georg Christoph Lichtenberg"
sentence: The sentence-id of the sentence where the reference occurs (sentences.id).</p>
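<p>A minimal sketch of the schema from Table 3 and of the mention lookup, using Python's built-in sqlite3 module; the column types and the helper name are assumptions:</p>
<preformat>
import json
import sqlite3

con = sqlite3.connect("wikipedia.sqlite3")

# Schema as described in Table 3 (column types are assumed).
con.executescript("""
CREATE TABLE IF NOT EXISTS sentences (
    id       INTEGER PRIMARY KEY,  -- unique sentence number
    text     TEXT,                 -- JSON array of tokens
    entities TEXT                  -- JSON array: per-token target entity or ""
);
CREATE TABLE IF NOT EXISTS links (
    id       INTEGER PRIMARY KEY,  -- unique reference number
    target   TEXT,                 -- target entity of the reference
    sentence INTEGER REFERENCES sentences(id)
);
""")

def mentions(entity):
    """All sentences in which a Wikipedia author linked to `entity`."""
    rows = con.execute(
        "SELECT s.text FROM links AS l JOIN sentences AS s ON s.id = l.sentence"
        " WHERE l.target = ?", (entity,))
    return [json.loads(text) for (text,) in rows]

# e.g. mentions("Georg Christoph Lichtenberg")
</preformat>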
<p>Using that database, we created a training dataset that consists of random sentence pairs (A, B), where the sentences either reference the same entity or different entities. That training dataset defines a binary classification problem: do sentences A and B refer to the same item or not?</p>
<p>For each supported language, we trained a BERT model on this binary classification problem; we call it the "evaluation model" in the following. Given some arbitrary sentence pair (A, B), the evaluation model outputs the probability that the two sentences refer to the same item.</p>
<p>During entity disambiguation, we build up to 50 sentence pairs (A, B) for each candidate that has been found in the lookup step (see chapter 3.2). The sentence pairs are composed in such a way that sentence A is a part of the input text where the entity that is to be linked is mentioned, and sentence B is a sentence from Wikipedia where that particular candidate has been linked to by a Wikipedia author. The higher the number of evaluated sentence pairs per candidate, the more reliably the ranking model (see Section 5) can determine the overall matching probability. Again, the computational complexity increases with the number of sentence pairs. Additionally, in most cases there is only a very limited number of reference sentences from Wikipedia available, so that it is not possible to generate a large number of unique sentence pairs. The choice of 50 sentence pairs is a trade-off that takes these considerations into account.</p>
<p>Application of the evaluation model to each sentence pair results in a corresponding matching probability. The sets of sentence pair matching probabilities of all candidates are then further processed by the ranking model (see Section 5).</p>
      </sec>
      <sec id="sec-6-2">
        <title>Ranking of Candidates</title>
        <p>During previous steps, sets of possible entity candidates have been obtained for
all the parts of the input text that have been NER-tagged. For each candidate, a
number of sentence pairs have been examined by the evaluation model, resulting
in a set of sentence pair probabilities per candidate.</p>
<p>The ranking step finally determines an ordering of the candidates per linked entity according to the probability that the candidate is the "correct" entity the part of the input text is actually referring to.</p>
        <p>We compute statistical features of the sets of sentence pair probabilities of
the candidates, among them: mean, median, min, max, standard deviation as
well as various quantiles. Additionally we sort all the sentence pair probabilities
and compute ranking statistics over all the candidates.</p>
<p>Then, based on the statistical features that describe the set of sentence pair probabilities of each candidate, a random forest model computes the overall probability that some particular candidate is actually the "correct" corresponding entity. The random forest model is the only component of our system where the CLEF HIPE 2020 data was used for training.</p>
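<p>A minimal sketch of the feature computation and ranking using scikit-learn; the particular quantiles, the forest hyperparameters and the data layout are assumptions, not our tuned configuration:</p>
<preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

QUANTILES = [0.1, 0.25, 0.75, 0.9]  # assumed choice of "various quantiles"

def candidate_features(pair_probs):
    """Statistical features over one candidate's sentence pair probabilities.
    (The additional per-candidate ranking statistics mentioned above are omitted.)"""
    p = np.asarray(pair_probs)
    return np.concatenate(([p.mean(), np.median(p), p.min(), p.max(), p.std()],
                           np.quantile(p, QUANTILES)))

def train_ranker(prob_sets, labels):
    """Fit the random forest; this is the only component of the system
    that is trained on CLEF HIPE 2020 data."""
    X = np.array([candidate_features(p) for p in prob_sets])
    ranker = RandomForestClassifier(n_estimators=100)  # hyperparameters assumed
    return ranker.fit(X, labels)

def rank_candidates(ranker, candidates, prob_sets, cutoff=0.2):
    """Sort candidates by estimated matching probability, drop those below 0.2."""
    X = np.array([candidate_features(p) for p in prob_sets])
    probs = ranker.predict_proba(X)[:, 1]
    ranked = sorted(zip(candidates, probs), key=lambda t: t[1], reverse=True)
    return [(c, p) for c, p in ranked if p >= cutoff]
</preformat>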
<p>Finally, the candidates are sorted according to the overall matching probabilities that have been estimated by the random forest model. The final output of our NED system is the sorted list of candidates, where candidates that have a matching probability of less than 0.2 are cut off.</p>
<p>Our NED system does not implement the NIL entity. That means it either returns a non-empty list of Wikidata IDs sorted in descending order according to their overall matching probabilities, or the result is "-" if there is no candidate with a matching probability above 0.2.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Results</title>
        <p>
Table 4 shows the NE-COARSE-LIT results of the SBB system; it also contains the results of the best performing system (L3i). In the case of the SBB system, strict NER performance is significantly worse than fuzzy NER performance. That observation holds for the L3i system too; however, for our system the effect is much more pronounced. Strict NER is a much more demanding task; nevertheless, we partly attribute the difference in performance to the training data of our NER system (see [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]), which has been created according to multiple slightly different NER annotation standards, and also to the fact that we did not fine-tune the NER system using the training data provided by the CLEF HIPE 2020 task organizers.
        </p>
<p>According to our observations, the OCR quality of the French data is slightly better than that of the German data, and both French and German have better OCR quality than the English text material. By OCR quality, we primarily mean the overall quality of the entire text, not the mean Levenshtein distances of the entity text passages with respect to the original text. NER performance mirrors that observation, i.e., French and German are comparable whereas English is significantly worse. Our current hypothesis is therefore that these differences are partly caused by the sensitivity of our NER tagger to OCR noise within the surrounding text.</p>
<p>Table 4: NE-COARSE-LIT results (micro, label ALL) of the SBB system and of the best performing system (L3i).
Lang  Team  Evaluation                   P      R      F1
DE    L3i   NE-COARSE-LIT-micro-fuzzy   0.870  0.886  0.878
DE    SBB   NE-COARSE-LIT-micro-fuzzy   0.730  0.708  0.719
DE    L3i   NE-COARSE-LIT-micro-strict  0.790  0.805  0.797
DE    SBB   NE-COARSE-LIT-micro-strict  0.499  0.484  0.491
FR    L3i   NE-COARSE-LIT-micro-fuzzy   0.912  0.931  0.921
FR    SBB   NE-COARSE-LIT-micro-fuzzy   0.765  0.689  0.725
FR    L3i   NE-COARSE-LIT-micro-strict  0.831  0.849  0.840
FR    SBB   NE-COARSE-LIT-micro-strict  0.530  0.477  0.502
EN    L3i   NE-COARSE-LIT-micro-fuzzy   0.794  0.817  0.806
EN    SBB   NE-COARSE-LIT-micro-fuzzy   0.642  0.572  0.605
EN    L3i   NE-COARSE-LIT-micro-strict  0.623  0.641  0.632
EN    SBB   NE-COARSE-LIT-micro-strict  0.347  0.310  0.327</p>
<p>Table 5 shows the NEL performance of our system when our BERT-based NER tagging is used as input, whereas table 6 contains the results obtained when the NER ground truth was provided to the NEL system. The two tables show that NEL performance significantly improves if NER ground truth is provided.</p>
<p>Interestingly, the recall of the French and German NEL systems is similar although the French knowledge base is significantly smaller than the German one. This observation can be explained by the fact that the coverage of the test data by the knowledge bases for German and French is similar (see Table 2). We attribute the much lower recall for the English test data to the much lower coverage of the test data by the English knowledge base (see Table 2).</p>
<p>Precision of the German and French SBB systems is comparable; again, precision of the English system is significantly worse, even if NER ground truth is provided. As explained in Section 5, our system provides a list of candidates with matching probability above 0.2, sorted in descending order according to the matching probability. Hence, given a bad coverage of the knowledge base, as is the case for English, non-matching candidates will inevitably move up in that sorted list, i.e., the drop in precision can also be explained by the bad coverage of the knowledge base.</p>
<p>Table 5: NEL-LIT results (micro, label ALL) of the SBB system with our own NER tagging as input.
Lang  Evaluation                       P      R      F1
DE    NEL-LIT-micro-fuzzy-@1          0.540  0.304  0.389
DE    NEL-LIT-micro-fuzzy-relaxed-@1  0.561  0.315  0.403
DE    NEL-LIT-micro-fuzzy-relaxed-@3  0.590  0.332  0.425
DE    NEL-LIT-micro-fuzzy-relaxed-@5  0.601  0.338  0.432
FR    NEL-LIT-micro-fuzzy-@1          0.594  0.310  0.407
FR    NEL-LIT-micro-fuzzy-relaxed-@1  0.616  0.321  0.422
FR    NEL-LIT-micro-fuzzy-relaxed-@3  0.624  0.325  0.428
FR    NEL-LIT-micro-fuzzy-relaxed-@5  0.629  0.328  0.431
EN    NEL-LIT-micro-fuzzy-@1          0.257  0.097  0.141
EN    NEL-LIT-micro-fuzzy-relaxed-@1  0.257  0.097  0.141
EN    NEL-LIT-micro-fuzzy-relaxed-@3  0.299  0.112  0.163
EN    NEL-LIT-micro-fuzzy-relaxed-@5  0.299  0.112  0.163</p>
<p>Table 6: NEL-LIT results (micro, label ALL) of the SBB system with NER ground truth provided as input.
Lang  Evaluation                       P      R      F1
DE    NEL-LIT-micro-fuzzy-@1          0.615  0.349  0.445
DE    NEL-LIT-micro-fuzzy-relaxed-@1  0.636  0.361  0.461
DE    NEL-LIT-micro-fuzzy-relaxed-@3  0.673  0.382  0.488
DE    NEL-LIT-micro-fuzzy-relaxed-@5  0.686  0.389  0.497
FR    NEL-LIT-micro-fuzzy-@1          0.677  0.371  0.480
FR    NEL-LIT-micro-fuzzy-relaxed-@1  0.699  0.383  0.495
FR    NEL-LIT-micro-fuzzy-relaxed-@3  0.710  0.390  0.503
FR    NEL-LIT-micro-fuzzy-relaxed-@5  0.716  0.393  0.507
EN    NEL-LIT-micro-fuzzy-@1          0.344  0.119  0.177
EN    NEL-LIT-micro-fuzzy-relaxed-@1  0.344  0.119  0.177
EN    NEL-LIT-micro-fuzzy-relaxed-@3  0.390  0.135  0.200
EN    NEL-LIT-micro-fuzzy-relaxed-@5  0.390  0.135  0.200</p>
        <p>Table 7 reports on the best NEL-LIT results per team where the NER task has been performed by each team's own NER system. Table 8 reports on the best NEL-LIT results per team where the NER ground truth has been provided to each team. In both tables, the results have been sorted according to precision. It turns out that our SBB NEL system performed quite competitively in terms of precision but rather abysmally in terms of recall.</p>
        <p>Table 7: Best NEL-LIT-micro-fuzzy-relaxed-@5 results (label ALL) per team, using each team's own NER system.
Lang  Team      P      R      F1
DE    L3i       0.627  0.636  0.632
DE    SBB       0.601  0.338  0.432
DE    UvA.ILPS  0.311  0.345  0.327
FR    L3i       0.695  0.705  0.700
FR    SBB       0.629  0.328  0.431
FR    IRISA     0.560  0.490  0.523
FR    UvA.ILPS  0.397  0.220  0.283
FR    ERTIM     0.150  0.084  0.108
EN    L3i       0.651  0.674  0.662
EN    UvA.ILPS  0.304  0.458  0.366
EN    SBB       0.299  0.112  0.163</p>
        <p>Table 8: Best NEL-LIT-micro-fuzzy-relaxed-@5 results (label ALL) per team, with NER ground truth provided.
Lang  Team                P      R      F1
DE    L3i                 0.696  0.696  0.696
DE    SBB                 0.686  0.389  0.497
DE    aidalight-baseline  0.440  0.435  0.437
FR    L3i                 0.746  0.743  0.744
FR    SBB                 0.716  0.393  0.507
FR    Inria-DeLFT         0.604  0.670  0.635
FR    IRISA               0.590  0.588  0.589
FR    aidalight-baseline  0.516  0.508  0.512
EN    L3i                 0.744  0.744  0.744
EN    Inria-DeLFT         0.633  0.685  0.658
EN    UvA.ILPS            0.607  0.580  0.593
EN    aidalight-baseline  0.506  0.506  0.506
EN    SBB                 0.390  0.135  0.200</p>
        <p>We attribute the bad recall performance to multiple reasons:
- Due to the construction of the knowledge bases, many entities end up without representation. Even for German, which has the best coverage, coverage is only 71%.
- The lookup step of our NEL system has not been extensively optimized up to now. The embeddings that are stored in the approximative nearest neighbour indices, for instance, have been selected on an initial-guess basis and have not been optimized for performance; which layers of the model to use and how to combine them heavily impacts the properties of the lookup step. Additionally, the parameters of the approximative nearest neighbour indices, such as the type of similarity measure, the number of lookup trees and the cut-off distance, have been chosen on an initial-guess basis too and could be further optimized.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Conclusion</title>
        <p>The results of our participation in the HIPE task highlight where the biggest
potential for improvement of our NER / NEL / NED system is to be expected:</p>
        <p>
- OCR performance is crucial since OCR is the start of the processing chain, and OCR noise causes, as expected, bad results in all subsequent processing steps.
- The NEL recall performance of the SBB system has the biggest potential for improvement. An obvious path towards better recall is an improved construction of the knowledge bases, which should lead to a better overall representation of entities.
- An extensive evaluation and optimization of the lookup step, including hardening against OCR noise, could improve recall.
- The NER results of other teams show that huge improvements in terms of NER performance are possible even in the presence of noise [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Such improvements directly benefit the NED/NEL steps. We will therefore carefully evaluate how these improvements have been achieved in order to optimize our own NER tagger.
        </p>
<p>Due to the diverse nature of the CLEF HIPE 2020 task data, and in particular due to the differences in OCR quality, the performance evaluation has given us valuable insights into our NER/NED/NEL system. The HIPE task data is, in our opinion, quite realistic, which means that we expect our system to have to handle similar data in the real world. Hence, we consider our participation in the HIPE competition an important and constructive step on the path towards improving NER/NED processing of real-world text material that has been obtained by OCR of historical documents.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Benikova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biemann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kisselew</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Padó</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Germeval 2014 named entity recognition: Companion paper</article-title>
          .
          <source>Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition</source>
          , Hildesheim, Germany, pp. 104-112 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bernhardsson</surname>
          </string-name>
          , E.: Annoy: Approximate Nearest Neighbors in C++/Python (
          <year>2018</year>
          ), https://github.com/spotify/annoy
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . CoRR abs/1810.04805 (
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ehrmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romanello</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bircher</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clematide</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Introducing the CLEF 2020 HIPE shared task: Named entity recognition and linking on historical newspapers</article-title>
          . In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.)
          <source>Advances in Information Retrieval</source>
          , pp. 524-532. Springer International Publishing, Cham (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Labusch</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neudecker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , Zellhofer, D.:
          <article-title>BERT for Named Entity Recognition in Contemporary and Historic German</article-title>
          .
          <source>In: Proceedings of the 15th Conference on Natural Language Processing (KONVENS</source>
          <year>2019</year>
          )
          <article-title>: Long Papers</article-title>
          . p.
          <volume>1</volume>
          {
          <issue>9</issue>
          . German Society for Computational Linguistics &amp; Language
          <string-name>
            <surname>Technology</surname>
          </string-name>
          , Erlangen, Germany (
          <year>2019</year>
          ), https://corpora.linguistik.uni-erlangen.de/data/konvens/ proceedings/papers/KONVENS2019 paper 4.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Labusch</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , Zellhofer, D.:
          <article-title>OCR Fulltexts of the Digital Collections of the Berlin State Library (DC-SBB) (June 26th</article-title>
          <year>2019</year>
          ), https://doi.org/10.5281/zenodo. 3257041
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Neudecker, C.: An open corpus for named entity recognition in historic newspapers. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 4348-4352. European Language Resources Association (ELRA), Portoroz, Slovenia (May 2016), https://www.aclweb.org/anthology/L16-1689</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Neudecker, C., Antonacopoulos, A.: Making Europe's historical newspapers searchable. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 405-410. IEEE, New York, NY, USA (April 2016), https://doi.org/10.1109/DAS.2016.83</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Rehm, G., Bourgonje, P., Hegele, S., Kintzel, F., Schneider, J.M., Ostendorff, M., Zaczynska, K., Berger, A., Grill, S., Räuchle, S., Rauenbusch, J., Rutenburg, L., Schmidt, A., Wild, M., Hoffmann, H., Fink, J., Schulz, S., Ševa, J., Quantz, J., Böttger, J., Matthey, J., Fricke, R., Thomsen, J., Paschke, A., Qundus, J.A., Hoppe, T., Karam, N., Weichhardt, F., Fillies, C., Neudecker, C., Gerber, M., Labusch, K., Rezanezhad, V., Schaefer, R., Zellhöfer, D., Siewert, D., Bunk, P., Pintscher, L., Aleynikova, E., Heine, F.: QURATOR: Innovative Technologies for Content and Data Curation. CoRR abs/2004.12195 (2020), https://arxiv.org/abs/2004.12195</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Riedl, M., Padó, S.: A named entity recognition shootout for German. In: Proceedings of ACL, pp. 120-125. Melbourne, Australia (2018), http://aclweb.org/anthology/P18-2020.pdf</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Schweter, S., Baiter, J.: Towards robust named entity recognition for historic German. arXiv preprint arXiv:1906.07592 (2019), https://arxiv.org/abs/1906.07592</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pp. 142-147. CONLL '03, Association for Computational Linguistics, Stroudsburg, PA, USA (2003), https://doi.org/10.3115/1119176.1119195</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>