<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Named Entity Recognition for digitised archival documents in German</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nele Garay</string-name>
          <email>nele.garay@fiz-karlsruhe.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mahsa Vafaie</string-name>
          <email>mahsa.vafaie@fiz-karlsruhe.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <email>harald.sack@fiz-karlsruhe.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Amsterdam, Netherlands</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Applied Informatics and Formal Description Methods (AIFB), Karlsruhe Institute of Technology (KIT)</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>FIZ Karlsruhe - Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Named Entity Recognition (NER)</institution>
          ,
          <addr-line>Optical Character Recognition (OCR), Digital Cultural Heritage</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents an experiment that evaluates the effectiveness of two different Named Entity Recognition (NER) tools at extracting entities directly from the output of an Optical Character Recognition (OCR) workflow. The authors initially developed a test dataset comprising both raw and corrected OCR outputs, which were manually annotated with tags for Person, Location, and Organisation. Subsequently, they applied each NER tool to both the raw and corrected OCR outputs, evaluating their performance by comparing the precision, recall, and F1 scores against the manually annotated data.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition (NER)</kwd>
        <kwd>Optical Character Recognition (OCR)</kwd>
        <kwd>Digital Cultural Heritage</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Evaluating the performance of NER tools on OCR-generated text contributes to the goal of making
vast amounts of noisy textual data more accessible. This assessment is only possible through the creation
of gold standard datasets. In this work, we present two openly available datasets from the collection of
Wiedergutmachung documents, manually annotated with the most commonly used NER tags. Moreover,
we compare two open-source NER models, a general model and a domain-specific one, to provide insights
and discussions on the efficacy of these tools for NER with noisy OCR text in a particular domain. In
Section 2, a brief overview of similar attempts to assess NER quality on OCR-generated text is provided.
Section 3 introduces the two datasets and the NER tools with which the experiments have been conducted.
Section 4 presents the results and a comparison between the two models and across different tags, with
noisy and clean transcripts. Section 5 concludes the paper by emphasising the discussion points and
proposing future directions for the combination of OCR and NER techniques.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Named Entity Recognition (NER) tries to identify words and expressions that belong to particular
categories of named entities. Most commonly, these categories include names of persons, locations,
and organisations. Identifying named entities can help with finding specific documents in collections [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Optical Character Recognition (OCR) systems transcribe pictures and scans of text documents. Due
to different factors, such as poor digitisation quality or special fonts that cannot easily be recognised
by OCR engines, OCR-generated texts often contain errors that can potentially have a negative impact
on the output of NER. This challenge has been studied and addressed by many researchers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In [8]
the authors tested different tools for extracting person names from historical OCRed documents. When
comparing with hand-annotated texts, they found that OCR mistakes in word order had a bigger impact
on NER results than character recognition errors. Rodriquez et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] compared NER tools on OCRed text
of historical Holocaust-related documents from the European Holocaust Research Infrastructure (EHRI),
which include, among others, newspaper articles, victim testimonies and diplomatic reports.
They noted that correcting the OCR text does not increase the performance of their NER tools
by a significant amount. Ruokolainen et al. [9] trained two NER models on OCRed Finnish-language
newspaper text. Evaluation results show F1-scores above 0.72 for Location and Person tags. To increase
the score for Organisation tags, a nested entity approach was used, which resulted in an F1-score of 0.44.
Koudoro-Parfait et al. [10] tested the impact of different OCR systems on NER evaluation of French
novels. They found that NER quality decreases with OCR quality, with missing blank spaces, faulty
first characters, and wrong word order being the OCR errors with the greatest impact. Hamdi
et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] studied five types of OCR errors and their impact on the performance of NER. They found that
segmentation errors (wrong word order) and errors in the first character had a strong impact on the
performance of NER.
      </p>
      <p>Recent advances in NER have led to the development of systems that are pre-trained on large amounts
of contemporary data and are ready for use in various languages [11, 12]. These tools leverage
state-of-the-art techniques, including Transformer-based models and deep learning architectures, to
enhance their performance across different linguistic contexts [13]. Two examples of such NER tools
are described in more detail below.</p>
      <p>Flair [14] addresses challenges with contextualised word embeddings by providing a simple, unified
interface for word embeddings. Flair also offers pre-trained models for different languages and use cases,
including an NER tagger for German.</p>
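      <p>As a minimal sketch (not the authors’ exact experimental code), such a pre-trained Flair tagger can be
applied as follows; the model identifier is the one used later in Section 3, while the example sentence is
invented:</p>
      <preformat>
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pre-trained German NER model from the Hugging Face hub.
tagger = SequenceTagger.load("flair/ner-german-large")

# Invented example sentence; real inputs would be OCR transcripts.
sentence = Sentence("Max Mustermann stellte einen Antrag beim Entschädigungsamt Berlin.")
tagger.predict(sentence)

# Print the recognised entity spans and their labels.
for span in sentence.get_spans("ner"):
    print(span.text, span.tag)
      </preformat>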
      <p>The European Holocaust Research Infrastructure (EHRI) has recently developed a single multilingual
NER model from a multilingual dataset of Holocaust-related documents. With this dataset, the multilingual
Transformer-based masked language model XLM-RoBERTa-large has been fine-tuned. The EHRI-NER
model performs well on Holocaust-specific datasets, with an F1-score above 0.80 [15].</p>
      <p>In this study, we use these two pre-trained NER systems for our experiments on historical data from
Germany dating from the 1950s to the 1980s.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets and Experiments</title>
      <p>The dataset used in this work consists of text files acquired with Optical Character Recognition (OCR)
from one of the document collections of the Wiedergutmachung project. This collection, called
“Bundeszentralkartei” [Central Federal Index for Compensation], or BZK in short, is a central card file of
most applications for compensation in the Federal Republic of Germany. These card files are in the form
of pre-printed index cards, filled in either by typewriter or by hand.</p>
      <p>The dataset collected for the evaluation consists of the OCR transcripts from 135 documents from
the BZK collection, here referred to as BZK, and the manually corrected version of the same OCR
transcripts, here referred to as BZK-GT. Both datasets are collected such that they do not fall under
strict data privacy restrictions and are therefore openly available
(https://github.com/ISE-FIZKarlsruhe/Wiedergutmachung/tree/main/NER). A Transformer-based OCR model
from Transkribus (https://www.transkribus.org/de), called TextTitan, has been used for text recognition,
since the documents contain a mix of machine-printed and handwritten text, and Transformer-based OCR
models have been shown to perform well on documents with multiple text types [16, 17].</p>
      <p>Both BZK and BZK-GT datasets have been manually annotated with entity labels by one annotator.
Before the manual annotation, the documents were tokenised as follows: First, all runs of more than
three dots or space characters were discarded, and “/” was replaced with a space character. We also
replaced abbreviations (e.g. geb. for geboren, str. for straße, verst. for verstorben) with their full
form for better readability and recognition by the NER model. After these pre-processing steps, spaCy
(en_core_web_sm, https://huggingface.co/spacy/en_core_web_sm) was used for tokenisation.</p>
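      <p>The following is a minimal sketch of these pre-processing steps; the regular expressions and the
abbreviation list are illustrative assumptions paraphrasing the description above, not the exact rules used:</p>
      <preformat>
import re
import spacy

# Abbreviation expansions taken from the examples above.
ABBREVIATIONS = {"geb.": "geboren", "str.": "straße", "verst.": "verstorben"}

def preprocess(text: str) -> str:
    # Discard runs of dots or spaces occurring more than three times.
    text = re.sub(r"\.{4,}", " ", text)
    text = re.sub(r" {4,}", " ", text)
    # Replace "/" with a space character.
    text = text.replace("/", " ")
    # Expand abbreviations to their full forms.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text

# Tokenise with the spaCy model cited above.
nlp = spacy.load("en_core_web_sm")
tokens = [t.text for t in nlp(preprocess("geb. 1921 ..... Berlin/Wilmersdorf"))]
print(tokens)
      </preformat>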
      <p>The named entity models we used contain the entity classes Person (PER), Location (LOC) and
Organisation (ORG). For the manually created NER ground truth for both datasets, the NER classes are
defined as follows:
• PER includes persons’ first and last names without titles.
• LOC includes country, state, city and street names, as well as names of deportation and
concentration camps.
• ORG includes names of governmental offices (Entschädigungsämter [Offices for Compensation of
National Socialist Injustice]).</p>
      <sec id="sec-3-1">
        <title>All datasets are created in the CoNLL 2003 format [18].</title>
        <p>Since many organisation names in German also contain city names, we used a nested entity approach
[19] when labelling organisations, i.e., a token can be both part of an organisation entity and a location
entity. The phrase “Entschädigungsamt Berlin” is therefore tagged as:</p>
      <preformat>
Entschädigungsamt   O       B-ORG
Berlin              B-LOC   I-ORG
      </preformat>
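      <p>A minimal sketch of a reader for this two-column nested tagging scheme (the exact file layout is an
assumption based on the example above: one token per line, followed by its PER/LOC-layer tag and an
ORG-layer tag, with blank lines separating documents):</p>
      <preformat>
def read_nested_conll(path: str):
    """Read (token, [tags...]) tuples grouped into documents."""
    documents, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line separates documents
                if current:
                    documents.append(current)
                    current = []
                continue
            token, *tags = line.split()
            current.append((token, tags))
    if current:  # flush the last document
        documents.append(current)
    return documents
      </preformat>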
        <p>Tokens in the BZK dataset with OCR errors only get an NER tag if a maximum of two characters
deviate from the original word (e.g., Lonkom instead of London) or if the entity is within a string next to
some wrongly identified characters (e.g., Londonpsc instead of London).</p>
      <p>Because the structure of the text on the cards is not always line by line, the OCR text is sometimes in
the wrong order. This led to challenges during the annotation process, especially for the BIO tagging, since
without information about the document layout it was unclear whether two entity tokens that appeared
right next to each other belonged to the same entity or not. Another problem during annotation concerned
multi-word entities that had been separated by wrong word order. As a solution, while manually tagging
the BZK dataset, the OCR text was used as a stand-alone text and as the sole basis for the BIO tagging,
without information about the image and document layout. This led to a different BIO tagging and
fewer ORG tags (23.65% fewer) in the BZK dataset, compared to the BZK-GT dataset, which better
adheres to the document structure and layout.</p>
      <p>For the NER task two models were used: the German-language model from Flair, called
ner-german-large (https://huggingface.co/flair/ner-german-large), as a general model, and the EHRI-NER
model (https://huggingface.co/ehri-ner/xlm-roberta-large-ehri-ner-all), which is trained on multilingual
(Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) Holocaust-related textual
data, as a domain-specific tool. The results of these experiments and a discussion of the results follow
in the next section.</p>
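      <p>As a sketch, the domain-specific model can be queried through the Hugging Face transformers
pipeline; the aggregation setting and the example input are assumptions for illustration:</p>
      <preformat>
from transformers import pipeline

# Load EHRI-NER (XLM-RoBERTa-large fine-tuned on Holocaust-related text).
ner = pipeline(
    "token-classification",
    model="ehri-ner/xlm-roberta-large-ehri-ner-all",
    aggregation_strategy="simple",  # merge word pieces into entity spans
)

# Invented card-style input; real BZK transcripts are noisier.
for entity in ner("Entschädigungsamt Berlin, geboren in London"):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 2))
      </preformat>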
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The results of the NER evaluation using the two different NER tools are summarised in Table 1 for
both the BZK and BZK-GT datasets. The evaluation indicates that the BZK-GT dataset achieves higher
F1-scores compared to the BZK dataset. Both NER tools exhibit poorer performance on the noisy text
generated by the OCR system, highlighting the need for OCR post-processing in the NER pipeline for
raw OCR text.</p>
      <p>The clearest difference between the two models appears for the ORG tag, where the general Flair
model falls behind EHRI-NER. This discrepancy can be attributed to the fact that the most prominent
organisation types mentioned on BZK cards are now considered historical and most no longer exist.
Consequently, language models trained on non-historical data struggle to recognise these historical
organisations. Therefore, EHRI-NER, which is fine-tuned on historical data from the same period,
outperforms the general pre-trained model in recognising ORG entities in this dataset.</p>
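      <p>Entity-level precision, recall, and F1 of the kind reported here can be computed, for example, with
the seqeval library; this toy comparison is illustrative and not necessarily the exact evaluation code used:</p>
      <preformat>
from seqeval.metrics import classification_report

# Toy example: gold vs. predicted BIO tags for one short sequence.
y_true = [["B-ORG", "I-ORG", "O", "B-LOC", "O"]]
y_pred = [["B-ORG", "O",     "O", "B-LOC", "O"]]

# Prints per-class precision, recall, and F1 at the entity level.
print(classification_report(y_true, y_pred))
      </preformat>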
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this study, we compared the performance of two Named Entity Recognition (NER) tools, the Flair
German model and the EHRI-NER model, on German historical OCRed text, to determine whether the
Holocaust-specific EHRI-NER model outperforms the general model on our dataset, and to assess the
impact of OCR noise on NER quality. A significant contribution of this work is the creation of two
datasets, BZK and BZK-GT, which involved the manual annotation of raw and corrected OCR texts, as
well as the evaluation of NER predictions from both models.</p>
      <p>Our findings confirm that OCR errors degrade the quality of NER predictions for both models.
However, the EHRI-NER model demonstrated strong performance on our datasets, particularly in
recognising historical organisations, in comparison to the Flair NER German model.</p>
      <p>One of the primary challenges encountered during our experiments was the annotation of raw OCR
text with entity labels. To address this challenge, developing comprehensive guidelines for annotation
could significantly streamline this process. Such guidelines would not only speed up the annotation
phase, but also enhance the consistency and comparability of annotated datasets, and provide the
possibility to engage multiple annotators in the process, thereby improving the overall quality of future
research.</p>
      <p>In future research, fine-tuning the models using our specific datasets could potentially enhance the
NER results we have achieved. Additionally, conducting experiments with other Large Language Models
would provide valuable comparative insights. This approach could help identify the most effective
models for our tasks and further improve the robustness and accuracy of NER predictions.</p>
    </sec>
    <sec id="sec-6">
<title>Acknowledgments</title>
      <p>This work is funded by the German Federal Ministry of Finance (Bundesministerium der Finanzen).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nadeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          ,
          <article-title>A survey of named entity recognition and classification</article-title>
          ,
          <source>Lingvisticae Investigationes</source>
          <volume>30</volume>
          (
          <year>2007</year>
          )
          <fpage>3</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Rodriquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bryant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Blanke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Luszczynska</surname>
          </string-name>
          ,
          <article-title>Comparison of named entity recognition tools for raw ocr text</article-title>
, in: Konvens,
          <year>2012</year>
          , pp.
          <fpage>410</fpage>
          -
          <lpage>414</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vafaie</surname>
          </string-name>
          , J. Waitelonis,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          ,
          <article-title>Improvements in handwritten and printed text separation in historical archival documents</article-title>
          ,
          <source>in: Archiving Conference</source>
          , volume
          <volume>20</volume>
          ,
          <source>Society for Imaging Science and Technology</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vafaie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bruns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pilz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dessí</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          ,
<article-title>Modelling archival hierarchies in practice: Key aspects and lessons learned</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vafaie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bruns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pilz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Waitelonis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          , Courtdocs ontology:
          <article-title>Towards a data model for representation of historical court proceedings</article-title>
          ,
          <source>in: Proceedings of the 12th Knowledge Capture Conference</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikheev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <article-title>Named entity recognition without gazetteers</article-title>
          ,
          <source>in: Ninth Conference of the European Chapter of the Association for Computational Linguistics</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Pontes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sidere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          ,
          <article-title>In-depth analysis of the impact of ocr errors on named entity recognition and linking</article-title>
          ,
          <source>Natural Language Engineering</source>
          <volume>29</volume>
          (
          <year>2023</year>
          )
          <fpage>425</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. L. Packer, J. F. Lutes, A. P. Stewart, D. W. Embley, E. K. Ringger, K. D. Seppi, L. S. Jensen, Extracting person names from diverse and noisy OCR text, in: Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data, 2010, pp. 19-26.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] T. Ruokolainen, K. Kettunen, Name the name: Named entity recognition in OCRed 19th and early 20th century Finnish newspaper and journal collection data, in: DHN, 2020, pp. 137-156.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] C. Koudoro-Parfait, G. Lejeune, G. Roe, Spatial named entity recognition in literary texts: What is the influence of OCR noise?, in: Proceedings of the 5th ACM SIGSPATIAL International Workshop on Geospatial Humanities, GeoHumanities '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 13-21. URL: https://doi.org/10.1145/3486187.3490206. doi:10.1145/3486187.3490206.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] V. Yadav, S. Bethard, A survey on recent advances in named entity recognition from deep learning models, arXiv preprint arXiv:1910.11470 (2019).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] K. Pakhale, Comprehensive overview of named entity recognition: Models, domain-specific applications and challenges, arXiv preprint arXiv:2309.14084 (2023).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Monteiro, C. Zanchettin, Optimization strategies for BERT-based named entity recognition, in: Brazilian Conference on Intelligent Systems, Springer, 2023, pp. 80-94.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An easy-to-use framework for state-of-the-art NLP, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 54-59.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Dermentzi, H. Scheithauer, Repurposing Holocaust-related digital scholarly editions to develop multilingual domain-specific named entity recognition tools, in: Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024, 2024, pp. 18-28.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. B. Ströbel, T. Hodel, W. Boente, M. Volk, The adaptability of a Transformer-based OCR model for historical documents, in: International Conference on Document Analysis and Recognition, Springer, 2023, pp. 34-48.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Li, T. Lv, J. Chen, L. Cui, Y. Lu, D. Florencio, C. Zhang, Z. Li, F. Wei, TrOCR: Transformer-based optical character recognition with pre-trained models, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023, pp. 13094-13102.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] E. F. Tjong Kim Sang, F. De Meulder, Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition, arXiv preprint cs/0306050 (2003).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] J. R. Finkel, C. D. Manning, Nested named entity recognition, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 141-150.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>