                                German National Socialist Injustice on the Semantic
                                Web: from Archival Records to a Knowledge Graph
                                Mahsa Vafaie*1,2
1 FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
2 Applied Informatics and Formal Description Methods (AIFB), Karlsruhe Institute of Technology (KIT), Kaiserstraße 89, 76133 Karlsruhe, Germany


                                            Abstract
                                            Archival repositories contain vast amounts of historical data within unstructured textual documents,
                                            posing significant challenges for extracting coherent insights. This paper presents ongoing work towards
                                            an optimised workflow for constructing a knowledge graph from millions of archival records related to
                                            the “Wiedergutmachung” process in Germany. These records, documenting compensation and restitution
                                            efforts following World War II, offer insights into the aftermath of the National Socialist regime. The
                                            proposed workflow involves converting document images to machine-readable formats, ontology design,
                                            information extraction, and entity linking. Leveraging both traditional methods and transformer-based
                                            technologies, the workflow addresses unique challenges inherent in historical documents.

                                            Keywords
                                            Semantic Web, Digital Cultural Heritage, Digital Humanities, Linked Open Data, Optical Character
                                            Recognition, Information Extraction, Wiedergutmachung, Compensation for National Socialist Injustice.



                                1. Introduction
                                Since the 1990s, researchers from various domains have been increasingly engaged with provid-
ing access to information held within archival records through online channels [1]. Archival
                                repositories hold invaluable historical data, often in the form of unstructured textual documents
                                that span decades. Extracting coherent insights from these records presents a formidable chal-
                                lenge due to the lack of standardised formats and the sheer volume of the data. Digitalisation
                                pipelines emerge as a computational solution to this challenge, leveraging a combination of
                                computer vision, natural language processing, information extraction, machine learning, and
                                semantic analysis techniques. These pipelines, through sophisticated algorithms and method-
                                ologies, facilitate the transformation of archival documents into structured data points. The true
power of digitalisation manifests in the construction of knowledge graphs: dynamic frameworks that transform discrete data points into interconnected nodes. Knowledge graphs play a
                                pivotal role in bridging the gap between archival records and the Semantic Web. By interlinking
                                the extracted information from archival records on the Semantic Web, knowledge graphs enable
                                researchers to discern intricate relationships, uncover latent patterns, and traverse historical

                                Proceedings of the Doctoral Consortium at ISWC 2024, co-located with the 23rd International Semantic Web Conference
                                (ISWC 2024)
mahsa.vafaie@fiz-karlsruhe.de (M. Vafaie*)
ORCID: 0000-0002-7706-8340 (M. Vafaie*)
                                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




narratives that go beyond individual documents [2]. Furthermore, the existence of knowledge
graphs as the backbones of information systems allows for inference of previously undiscovered
knowledge from the given statements, and provides means for conducting exploratory search
and knowledge discovery. Moreover, enhancing the accessibility of archival data leads to
increased public participation and a better understanding of archival materials [3].
   This paper proposes an optimised workflow for constructing a knowledge graph from millions
of archival records from the “Wiedergutmachung” 1 process in Germany, for utilisation in
the semantic portal “Themenportal Wiedergutmachung” 2 . The term “Wiedergutmachung
compensation records” in this paper refers to collections of documents, records, and materials
related to the process of compensation and restitution efforts that followed World War II
and the fall of the National Socialist regime in Germany. Wiedergutmachung compensation
records, which comprise an estimated 100 km of archival documents, originate from
the State Offices for Compensation (Ämter für Wiedergutmachung in German) established by the
German government in every German Land after the war. These collections contain a wide
range of documents, including index cards, application forms, documents on legal proceedings,
correspondence, testimonies, and other materials that pertain to individuals, families, and
communities seeking compensation for forced labour, imprisonment, injuries, and other damages
caused by National Socialist Injustice.
   Wiedergutmachung compensation records serve as a historical account of the processes that
took place to acknowledge and address the immense human suffering caused by the Nazi regime.
They play a crucial role in documenting the complex journey of survivors and their families in
seeking justice, recognition, and support for the harm they endured. They also provide insights
into the legal, bureaucratic, and social challenges faced by those seeking compensation in the
aftermath of such a devastating period in history. The Wiedergutmachung knowledge graph
(Wiedergutmachung KG) contributes to our understanding of the impact of the totalitarian
government in Germany on individuals and society and the ongoing efforts to address historical
injustices. Construction of the Wiedergutmachung KG illuminates the historical information
hidden within unstructured Wiedergutmachung records, which were formerly tucked away in
archival repositories, only accessible to archivists and a limited number of individuals entitled
to see them for family or scientific research purposes. For instance, the Wiedergutmachung KG can
display connections between “claimants” and “compensation decisions”, elucidating trends and
disparities within the Wiedergutmachung process.
   The proposed digitalisation workflow starts with conversion of document images (i.e., scanned
documents) to machine-readable formats through Optical Character Recognition (OCR). Subse-
quently, the workflow entails ontology design, information extraction, and linking entities with
external data sources, to construct the Wiedergutmachung KG. Due to the historical nature of
the documents, each of these stages presents unique challenges that can be tackled with traditional methods, while advances in transformer-based technologies offer a more direct way to address them. The speed of technological advancements in
the field of AI calls for a dynamic pipeline with a modular design that allows for substitution or
1
 “Wiedergutmachung” is a German word that translates to “making good again” or “making amends”. In the context
  of National Socialism in Germany and its aftermath, it specifically refers to the efforts made to compensate survivors
  and victims for the losses they suffered during the rule of the Nazi regime.
2
  https://www.archivportal-d.de/themenportale/wiedergutmachung
combination of rule-based methods for accuracy with nascent AI technologies for optimisation.
Keeping this consideration in mind, the same digitalisation workflow can be extended to further
use cases and Digital Humanities research can benefit from the lessons learnt during the design
of such a pipeline.
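To make this modularity concrete, the sketch below shows one way such a swappable pipeline could be wired together in Python; the stage names and their internals are hypothetical placeholders, not the project's actual implementation.

```python
from typing import Callable

# A stage maps one intermediate document representation to the next;
# rule-based and transformer-based variants share this interface.
Stage = Callable[[dict], dict]

def ocr_stage(doc: dict) -> dict:
    doc["text"] = "<transcript>"  # placeholder: OCR/HTR on doc["image_path"]
    return doc

def extraction_stage(doc: dict) -> dict:
    doc["entities"] = []  # placeholder: rule-based or LLM-based extraction
    return doc

def linking_stage(doc: dict) -> dict:
    doc["links"] = {}  # placeholder: linking entities to authority files
    return doc

def run_pipeline(doc: dict, stages: list[Stage]) -> dict:
    for stage in stages:
        doc = stage(doc)
    return doc

if __name__ == "__main__":
    # Swapping extraction_stage for an LLM-based variant needs no other change.
    print(run_pipeline({"image_path": "scan_0001.png"},
                       [ocr_stage, extraction_stage, linking_stage]))
```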
   The remainder of this paper delves into details of the different components of the proposed
digitalisation pipeline for Wiedergutmachung compensation records and discusses the intricacies
of working with historical archival records. Section 2 outlines similar efforts for construction of
domain-specific KGs in the context of Cultural Heritage. In Section 3 the research questions are
introduced and methodologies for addressing them are discussed. Section 4 concludes the paper
and sketches the next steps for future work.

2. Related Work
As archives undergo mass digitisation and the volume of digital records grows, there arises a
rich but underutilised resource for researchers in the Digital Humanities. Integration of data
from historical archival records into the Semantic Web and transformation of archival data
according to Linked Open Data (LOD) principles has been receiving significant attention from
scholars [4].
   The Sampo series of semantic portals is a pioneering effort in applying Semantic Web technolo-
gies to showcase Finland’s national heritage [5]. These portals utilise the modular FinnONTO as a taxonomy of Cultural Heritage Objects [6]. WarSampo, for example, focuses on harmonising and publishing heterogeneous datasets related to World War II in Finland as LOD [7].
   The Jewish Contemporary Documentation Center has created an online LOD database 3
focused on Italian Holocaust victims and persecution events. Additionally, they have developed an associated application for utilising this valuable data [8].
   In the Netherlands, “Oorlog voor de Rechter” (“War in Court”) 4 aims to unlock historical
knowledge by making the Central Archives of Special Jurisdiction (CABR) accessible online,
leveraging advanced technologies and a user-centric design. CABR is the largest war archive
in the Netherlands, containing files of over 400,000 people suspected of collaboration with the
National Socialist regime in Germany.
   The European Holocaust Research Infrastructure (EHRI) Portal 5 serves as a valuable resource
for researchers and historians interested in Holocaust-related archival material. It provides
access to electronic finding aids, inventory information on institutions holding Holocaust-
related records, and vocabularies related to archival descriptions [9]. Researchers can use these
vocabularies to improve searchability and interoperability.
   In Germany, to the best of our knowledge, this work marks the first effort to develop an
LOD-based semantic portal from archival records pertaining to World War II and National
Socialist Injustices.




3
  http://dati.cdec.it/lod/shoah/website/html
4
  https://www.huygens.knaw.nl/en/projecten/war-in-court/
5
  https://portal.ehri-project.eu/
3. An LODification workflow for Wiedergutmachung
   compensation records
Transformation of data hidden within Wiedergutmachung compensation records into LOD
for increased accessibility, interoperability, explorability, and semantic enrichment is the main
goal of this work. Therefore, the overarching research question in this work is: What is
the most efficient pipeline for transformation of historical archival records into a
Knowledge Graph-based information system, for integration into the Semantic Web
and publication as Linked Open Data? To address this overarching research question accurately, it is broken down along the components of such a pipeline, so that each design decision can be made in an informed way. The research questions derived from a modular design for this pipeline are as follows:

RQ1: What advanced techniques and methodologies can be developed to improve the accuracy
     of text recognition for digitised archival records with challenging characteristics such as
     faded ink, handwritten annotations, and non-standard fonts?

RQ2: What are the most effective Information Extraction (IE) techniques for accurately and
     efficiently identifying and retrieving structured data, such as names, dates, and locations,
     from digitised archival records?

RQ3: How can existing ontologies be adapted and extended to develop an ontology for representation of historical archival records that accurately reflects the hierarchical structure of the records, their semantic annotation, and information about the agents involved in their creation and archiving?

RQ4: How can we establish reliable links between historical entities (e.g., people, places, and
     events) extracted from digitised archival records and relevant external databases, authority
     files, or reference materials?
The methodologies and initial experiments for addressing RQ3, RQ1, and RQ2 are explained below, in that order. Solutions for RQ4 are yet to be explored as part of future work.

3.1. RQ3: Ontology Development
The development of the Wiedergutmachung KG hinges on the creation of an ontology capable of
modelling relationships among archival documents, court proceedings, individuals, and organi-
sations involved in the compensation application, decision-making, and document archiving.
Ensuring the validity and reliability of this model requires incorporation of domain experts’
requirements and knowledge. Archivists from the State Archives of Baden-Württemberg 6
collaborated with the author to formulate a list of competency questions 7 , serving as the
foundation for the ontology’s conceptual modelling. These questions, catering to researchers/historians and relatives/dependents of persecuted persons, provide insights into the domain’s
scope, structure, and concepts.
6
    https://www.landesarchiv-bw.de/
7
    The full list of competency questions is published on GitHub, in the Wiedergutmachung repository.
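As an illustration of how a competency question might later be operationalised against the KG, the sketch below renders one plausible question ("Which compensation decisions exist for a given claimant?") as a SPARQL query via rdflib; the wgm: prefix and all class/property names are hypothetical placeholders, not actual CourtDocs Ontology terms.

```python
from rdflib import Graph

# Hypothetical rendering of one competency question as SPARQL; the wgm:
# terms are illustrative placeholders, not actual CourtDocs Ontology terms.
CQ_QUERY = """
PREFIX wgm: <https://example.org/courtdocs/>
SELECT ?decision WHERE {
    ?decision a wgm:CompensationDecision ;
              wgm:decidedFor ?claimant .
    ?claimant wgm:hasName "Example Claimant" .
}
"""

g = Graph()  # in practice, the populated Wiedergutmachung KG would be loaded here
for row in g.query(CQ_QUERY):
    print(row.decision)
```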
   According to the competency questions and based on the best practices in the field of Ontology
Design, the CourtDocs Ontology [10] is built from three main building blocks, reusing
existing ontologies in order to avoid redundancy and thereby to also enable interoperability
with external data sources. Each of these building blocks and the ontologies that have been
reused for their creation are described below.
Archival Hierarchy and Provenance. The Records in Contexts Ontology (RiC-O8 ) [11] is employed to model the hierarchical structure of Wiedergutmachung compensation records, due to its inclusion of named individuals, which makes it adaptable across institutions with different archival systems and practices. Moreover, RiC-O’s incorporation of smaller entities
improves findability and enables a more detailed representation of archival resources, including
constituent parts like stamps [12].
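A minimal rdflib sketch of this building block might look as follows; the RiC-O namespace and term names reflect the author's reading of RiC-O and should be verified against the current release, while the wgm: resources are hypothetical.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

RICO = Namespace("https://www.ica.org/standard/RiC/ontology#")  # verify against current RiC-O release
WGM = Namespace("https://example.org/wgm/")  # hypothetical project namespace

g = Graph()
g.bind("rico", RICO)
g.bind("wgm", WGM)

# A compensation file (record set) containing an application form (record).
g.add((WGM.file_0001, RDF.type, RICO.RecordSet))
g.add((WGM.form_0001, RDF.type, RICO.Record))
g.add((WGM.form_0001, RICO.isOrWasIncludedIn, WGM.file_0001))
g.add((WGM.form_0001, RDFS.label, Literal("Compensation application form", lang="en")))

print(g.serialize(format="turtle"))
```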
Court Procedures. The PROV Ontology (PROV-O9 ) [13] is reused to depict the Wiedergut-
machung process within the court system. It offers a standardised approach for modelling how
entities and activities evolve over time, making it effective for process modelling. Additionally,
PROV-O is widely recognised for representing provenance information, making it suitable for
capturing the relationships between Wiedergutmachung procedures and the records generated
or utilised at each stage of the process.
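A corresponding sketch for this building block, again with hypothetical wgm: resources and an invented date, could model one stage of a compensation procedure as a prov:Activity:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import PROV, RDF, XSD

WGM = Namespace("https://example.org/wgm/")  # hypothetical project namespace

g = Graph()
g.bind("wgm", WGM)

# A decision procedure (activity) that used an application and generated a ruling.
g.add((WGM.decision_proc_0001, RDF.type, PROV.Activity))
g.add((WGM.application_0001, RDF.type, PROV.Entity))
g.add((WGM.ruling_0001, RDF.type, PROV.Entity))
g.add((WGM.decision_proc_0001, PROV.used, WGM.application_0001))
g.add((WGM.ruling_0001, PROV.wasGeneratedBy, WGM.decision_proc_0001))
g.add((WGM.decision_proc_0001, PROV.endedAtTime,
       Literal("1953-06-01T00:00:00", datatype=XSD.dateTime)))  # invented date

print(g.serialize(format="turtle"))
```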
Biographical Information of Persons Involved. CIDOC-CRM is employed as the ontolog-
ical foundation for representing the biographical information of individuals involved in the
compensation process 10 . The use of a harmonising data model facilitates connection with other
materials and external sources. Moreover, the event-centric approach of CIDOC-CRM, in which
an individual’s existence is perceived as a series of interconnected events spanning time
and space [14], enables the representation of crucial life events in prosopographical research on
victims of National Socialism, including events like deportation and imprisonment.
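For this third building block, a sketch of the event-centric pattern could look like this; the CIDOC-CRM class and property identifiers are standard CRM terms, while the wgm: resources are hypothetical.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")  # common CIDOC-CRM namespace
WGM = Namespace("https://example.org/wgm/")  # hypothetical project namespace

g = Graph()
g.bind("crm", CRM)
g.bind("wgm", WGM)

# Event-centric modelling: a person is linked to life events via participation.
g.add((WGM.person_0001, RDF.type, CRM.E21_Person))
g.add((WGM.event_0001, RDF.type, CRM.E5_Event))
g.add((WGM.event_0001, RDFS.label, Literal("Deportation (illustrative)", lang="en")))
g.add((WGM.event_0001, CRM.P11_had_participant, WGM.person_0001))

print(g.serialize(format="turtle"))
```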

3.2. RQ1: OCR Quality Enhancement
Creation of transcripts from scanned documents using OCR systems greatly accelerates
streamlines the retrieval of information. However, OCR system efficacy is contingent upon fac-
tors such as text and font styles. OCR systems are typically specialised for either machine-printed
or handwritten text due to their distinct visual characteristics [15]. Yet, archival documents
often feature mixed text. Traditionally, workflows dealing with a variety of text types on
scanned documents employed distinct text recognition models. In [16] and [17] we propose
a pipeline for separation of machine-printed text and handwritten text on historical archival
documents that contain both text types. This OCR pre-processing step helps improve the quality of the transcripts by breaking down each document image into two layers, one per text type (handwritten or machine-printed), and feeding each layer into the appropriate OCR or Handwritten Text Recognition (HTR) engine.
In our preliminary work, we achieved an increase of 16% compared to the baseline systems for
separation of text types, with models trained on modern documents. In a more recent develop-
ment, the new Transformer-based OCR (TrOCR) models have demonstrated the capability to
8
 https://www.ica.org/standards/RiC/ontology
9
 https://www.w3.org/TR/prov-o/
10
   https://cidoc-crm.org/
adapt to variations in fonts, text types, styles, and languages [18, 19], skipping the prerequisite
steps of dataset synthesis and model training for text type separation. With word accuracy as
an OCR evaluation metric ranging between 75% and 85%, TrOCR engines from Transkribus 11
have the potential to optimise the OCR quality improvement step. A qualitative evaluation of
the text-type separated transcripts is yet to be done for comparison with TrOCR results.
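As a rough sketch of how a transformer-based recognition model can be applied to a line image and scored with word accuracy, the snippet below uses the openly available microsoft/trocr checkpoints from Hugging Face rather than the Transkribus engines named above; the image file name and reference transcript are placeholders.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - word error rate (Levenshtein over word tokens)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - d[len(ref)][len(hyp)] / max(len(ref), 1)

# Openly available TrOCR checkpoint for handwritten text (placeholder choice).
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# TrOCR operates on single text-line images; "line_0001.png" is a placeholder.
image = Image.open("line_0001.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
hypothesis = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(hypothesis, word_accuracy("placeholder reference transcript", hypothesis))
```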

3.3. RQ2: Information Extraction
Wiedergutmachung compensation records consist of different document types, such as application forms and index cards, with layouts that vary over time and across institutions. In a traditional information extraction pipeline, this
necessitates implementation of an automatic document identification system that classifies
the documents based on their type or layout, and feeds them into the respective information
extraction script customised for each document type. Our experiments on rule-based information
extraction using Apache UIMA Ruta [20] on a subset of 75 documents with three different
layouts show an accuracy of 75%, for exact matches only. However, with the advance of Large
Language Models (LLMs), there is an opportunity to streamline the information extraction
process instead of laboriously crafting separate scripts for each document and layout type. In the proposed LLM-based approach, a unified prompt, coupled with the appropriate context, can facilitate the extraction of information from all documents across document types and layouts. The quality of information extraction from archival records with LLMs is still to be
evaluated for a more accurate analysis and comparison with the rule-based methods.
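A minimal sketch of such a unified prompt is given below; the prompt wording, the target fields, and the call_llm helper are hypothetical, the latter standing in for whichever LLM client is eventually chosen.

```python
import json

# Unified prompt applied to every transcript regardless of document type or
# layout. The field list and wording are illustrative, not the actual prompt.
PROMPT_TEMPLATE = """You are extracting structured data from a historical
German compensation record. Return a JSON object with the keys
"names", "dates", and "locations" (lists of strings). Use an empty list
for any field that is absent.

Transcript:
{transcript}
"""

def call_llm(prompt: str) -> str:
    # Hypothetical helper: replace with a real LLM client call.
    raise NotImplementedError

def extract(transcript: str) -> dict:
    response = call_llm(PROMPT_TEMPLATE.format(transcript=transcript))
    return json.loads(response)  # expects the model to emit valid JSON

# Example usage: extract(ocr_transcript)
# -> {"names": [...], "dates": [...], "locations": [...]}
```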

4. Conclusion and Future Work
The presented research makes a significant contribution to Digital Humanities research, particularly on topics related to World War II and National Socialist Injustice. Furthermore, the findings from implementing different techniques on archival records can be transferred to similar efforts to transform archival records into LOD.
   In the next stages of the research, the focus will shift towards implementation of transformer-
based technologies in the LODification pipeline and comparing the performance of these
methods against the more traditional methods for each of the constituent pipeline components.
Moreover, RQ4 from Section 3 will be addressed to interconnect the KG with external sources
and authority files. This is a crucial step to facilitate content-based and federated semantic
search and to enrich the KG. In the case of the Wiedergutmachung KG, apart from Wikidata 12 , there are several other knowledge bases and databases representing data on German figures and the victims of National Socialism that can be interlinked with the KG. It is also necessary to map all
the extracted information to specific unique entities (e.g., persons) by means of disambiguation
and entity resolution techniques.
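As a first, deliberately simple step in that direction, candidate Wikidata matches for an extracted label can be retrieved via the public wbsearchentities endpoint, leaving disambiguation and entity resolution to a subsequent stage; a sketch follows.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_candidates(label: str, language: str = "de", limit: int = 5) -> list:
    """Return (QID, label, description) candidates for an extracted entity label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "format": "json",
        "limit": limit,
    }
    r = requests.get(WIKIDATA_API, params=params, timeout=10)
    r.raise_for_status()
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in r.json().get("search", [])]

# String search alone is not disambiguation: the returned candidates still
# need to be resolved using context (dates, places) from the record.
print(wikidata_candidates("Stuttgart"))
```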

Acknowledgments
This work is funded by the German Federal Ministry of Finance (Bundesministerium der Finanzen)
11
     https://www.transkribus.org/de
12
     https://www.wikidata.org/
and supervised by Prof. Dr. Harald Sack.

References
 [1] W. Duff, Archival mediation, Currents of archival thinking (2010) 115–136.
 [2] J. Waitelonis, H. Sack, Towards exploratory video search using linked data, Multimedia
     Tools and Applications 59 (2012) 645–672.
 [3] J. Oomen, M. van Erp, L. Baltussen, Sharing cultural heritage the linked open data way:
     why you should sign up, in: Museums and the Web 2012, 2012.
 [4] A. Hawkins, Archives, linked data and the digital humanities: increasing access to digitised
     and born-digital archives via the semantic web, Archival Science 22 (2022) 319–344.
 [5] E. Hyvönen, Digital humanities on the semantic web: Sampo model and portal series,
     Semantic Web 14 (2023) 729–744.
 [6] E. Hyvönen, K. Viljanen, J. Tuominen, K. Seppälä, Building a national semantic web
     ontology and ontology service infrastructure–the finnonto approach, in: The Semantic
     Web: Research and Applications: 5th European Semantic Web Conference, ESWC 2008,
     Tenerife, Canary Islands, Spain, June 1-5, 2008 Proceedings 5, Springer, 2008, pp. 95–109.
 [7] M. Koho, E. Ikkala, P. Leskinen, M. Tamper, J. Tuominen, E. Hyvönen, Warsampo knowl-
     edge graph: Finland in the second world war as linked open data, Semantic Web – Interoper-
     ability, Usability, Applicability 12 (2021) 265–278. URL: https://doi.org/10.3233/SW-200392.
     doi:10.3233/SW-200392.
 [8] R. Sprugnoli, G. Moretti, S. Tonelli, et al., Lod navigator: tracing movements of italian
     shoah victims, Umanistica Digitale (2019) N–A.
 [9] T. Blanke, M. Bryant, M. Frankl, C. Kristel, R. Speck, V. V. Daelen, R. V. Horik, The european
     holocaust research infrastructure portal, Journal on Computing and Cultural Heritage
     (JOCCH) 10 (2017) 1–18.
[10] M. Vafaie, O. Bruns, N. Pilz, J. Waitelonis, H. Sack, CourtDocs Ontology: Towards a Data
     Model for Representation of Historical Court Proceedings, in: Proc. of the 12th Knowledge
     Capture Conference 2023, 2023, pp. 175–179.
[11] F. Clavaud, T. Wildi, ICA records in contexts-ontology (RiC-O): a semantic framework for
     describing archival resources, in: Proc. of Linked Archives Int. Workshop 2021, 2021, pp.
     79–92.
[12] M. Vafaie, O. Bruns, N. Pilz, D. Dessí, H. Sack, Modelling Archival Hierarchies in Practice:
     Key Aspects and Lessons Learned, in: 6th Intl. Workshop on Computational History
     (HistoInformatics 2021), Online event, September 30-October 1, 2021, volume 2981, Aachen,
     Germany: RWTH Aachen, 2021, p. 6.
[13] T. Lebo, S. Sahoo, et al., PROV-O: The PROV ontology, W3C recommendation 30 (2013).
[14] J. A. Tuominen, E. A. Hyvönen, P. Leskinen, Bio CRM: A data model for representing
     biographical data for prosopographical research, in: Proc. of the 2nd Conf. on Biographical
     Data in a Digital World 2017 (BD2017), CEUR Workshop Proceedings, 2018.
[15] N. Islam, Z. Islam, N. Noor, A survey on optical character recognition system, arXiv
     preprint arXiv:1710.05703 (2017).
[16] M. Vafaie, O. Bruns, N. Pilz, J. Waitelonis, H. Sack, Handwritten and printed text identifi-
     cation in historical archival documents, in: Archiving Conference, volume 19, Society for
     Imaging Science and Technology, 2022, pp. 15–20.
[17] M. Vafaie, J. Waitelonis, H. Sack, Improvements in Handwritten and Printed Text Separation
     in Historical Archival Documents, in: Archiving Conference, volume 20, Society for
     Imaging Science and Technology, 2023, pp. 36–41.
[18] M. Li, T. Lv, et al., Trocr: Transformer-based optical character recognition with pre-
     trained models, in: Proc. of the AAAI Conf. on Artificial Intelligence, volume 37, 2023, pp.
     13094–13102.
[19] P. B. Ströbel, T. Hodel, W. Boente, M. Volk, The Adaptability of a Transformer-Based OCR
     Model for Historical Documents, in: Intl. Conf. on Document Analysis and Recognition,
     Springer, 2023, pp. 34–48.
[20] P. Kluegl, M. Toepfer, P.-D. Beck, G. Fette, F. Puppe, Uima ruta: Rapid development of
     rule-based information extraction applications, Natural Language Engineering 22 (2016)
     1–40.